1. Language Processing and Python
Mohamad's interests are in programming (mobile, web, database and machine learning). He is studying at the Center for Artificial Intelligence Technology (CAIT), Universiti Kebangsaan Malaysia (UKM).
1 Computing with Language: Texts and Words
1.1 Getting Started with Python
# print hello world
print('hello world')
output:

1.2 Getting Started with NLTK
# import nltk
import nltk
nltk
output:

# import gutenberg plaintext corpus reader
from nltk.corpus import gutenberg
gutenberg
output:

# download gutenberg files
nltk.download('gutenberg')
output:

# get gutenberg file ids
gutenberg.fileids()
output:

# test opening the raw file with plain Python
# (the path assumes the default download location on Colab)
with open("/root/nltk_data/corpora/gutenberg/melville-moby_dick.txt", "r") as f:
    print(f.read()[0:300])
output:

# create a corpus for moby dick
corpus1 = gutenberg.words('melville-moby_dick.txt')
text1 = nltk.Text(corpus1)
text1
output:

# create a corpus for sense and sensibility
corpus2 = gutenberg.words('austen-sense.txt')
text2 = nltk.Text(corpus2)
text2
output:

1.3 Searching Text
# get concordance for a sample word monstrous
text1.concordance("monstrous")
output:

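To make clearer what a concordance view shows, here is a minimal pure-Python sketch. It is not NLTK's implementation: the helper name `simple_concordance`, its `width` parameter and the sample tokens are assumptions made up for illustration.

```python
def simple_concordance(tokens, word, width=2):
    """Return every occurrence of `word` with `width` tokens of context per side.

    Hypothetical helper for illustration only, not part of NLTK.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [tok] + right))
    return lines


# made-up sample tokens, not taken from the corpus
sample_tokens = "the most monstrous size of a monstrous whale".split()
for line in simple_concordance(sample_tokens, "monstrous"):
    print(line)
# the most monstrous size of
# of a monstrous whale
```

Each hit is printed together with its neighbours, which is the idea behind `text1.concordance("monstrous")`.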
# compare which words appear in similar contexts in the two corpora
text1.similar("monstrous")
text2.similar("monstrous")
output:

# examine just the contexts
# that are shared by two or more words,
# such as monstrous and very
text2.common_contexts(["monstrous", "very"])
output:

# plot where each word occurs across the text (lexical dispersion)
text1.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
output:

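A dispersion plot is drawn from the token positions at which each word occurs. The sketch below computes those positions in plain Python. The helper name `word_offsets` and the sample sentence are assumptions for illustration, and this is not how NLTK builds the plot internally.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs.

    Hypothetical helper for illustration only, not part of NLTK.
    """
    offsets = {t: [] for t in targets}
    for position, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(position)
    return offsets


# made-up sample tokens, not taken from the corpus
sample_tokens = "freedom and duties of citizens who value freedom".split()
print(word_offsets(sample_tokens, ["freedom", "citizens", "America"]))
# {'freedom': [0, 7], 'citizens': [4], 'America': []}
```

A word that never occurs, such as "America" here, simply gets an empty list, which corresponds to an empty row in the plot.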
# just for fun, generate random text from the corpus
# (generate() relies on the punkt sentence tokenizer data, so download it first)
nltk.download('punkt')
text2.generate()

1.4 Counting Vocabulary
# get length of corpus text
len(text2)
output:

# create unique set of corpus text and sort in ascending order
sorted(set(text2))
output:

# measure lexical richness of text
len(set(text2)) / len(text2)
output:

Note: the number of distinct words is under 5% of the total number of words, i.e. each distinct word is used more than 20 times on average.
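As a small worked example of the same ratio, the sketch below uses a made-up six-token list (an assumption for illustration, not corpus data) so the arithmetic can be checked by hand.

```python
# made-up sample tokens, not taken from the corpus
tokens = ['to', 'be', 'or', 'not', 'to', 'be']

distinct = set(tokens)               # {'to', 'be', 'or', 'not'}
ratio = len(distinct) / len(tokens)  # 4 distinct words / 6 tokens

print(len(tokens), len(distinct), round(ratio, 2))
# 6 4 0.67
```

The lower the ratio, the more often the text reuses the same words.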
# count how often a word occurs in a text, and
# compute what percentage of the text is taken up by a specific word
print(text1.count("whale"))
print(100 * text1.count('whale') / len(text1))
output:

Using functions for repetitive tasks.
Define functions:
# define a function, lexical_diversity
def lexical_diversity(text):
    return len(set(text)) / len(text)
# define a function, percentage
def percentage(count, total):
    return 100 * count / total
Call functions:
# call lexical_diversity
print(lexical_diversity(text2))
# call percentage
print(percentage(text1.count('whale'), len(text1)))
2 A Closer Look at Python: Texts as Lists of Words
2.1 Lists
# declare a list of words
sent1 = ['Call', 'me', 'Ishmael', '.']
# print the list content
print(sent1)
# ['Call', 'me', 'Ishmael', '.']
# compute the length of sent1
print(len(sent1))
# get the lexical diversity of sent1
print(lexical_diversity(sent1))
output:

# append new word to an existing list
sent1.append("Some")
print( sent1 )
output:

2.2 Indexing Lists
# get the word at index 1 of the text1 list
print(text1[1])
# Moby
# get the index of the first occurrence of the word 'Moby'
print(text1.index('Moby'))
output:

# print words in the index range that starts at index 0 and end before index 6
print( text1[0:6] )
# print words in the index range that starts at index 4 and end before index 6
print( text1[4:6] )
# print words in the index range that starts at second last index and ends at the last index
print( text1[-2:] )
output:

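The slicing rules are easier to verify on a short list. The sketch below uses a made-up seven-token list (an assumption for illustration); the same rules apply to `text1`.

```python
# made-up sample list for illustration
sent = ['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago']

print(sent[0:3])   # start at index 0, stop before index 3
# ['Call', 'me', 'Ishmael']
print(sent[4:6])   # start at index 4, stop before index 6
# ['Some', 'years']
print(sent[-2:])   # the last two items
# ['years', 'ago']
print(sent[:2] == sent[0:2])   # an omitted start index defaults to 0
# True
```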
3 Computing with Language: Simple Statistics
3.1 Frequency Distributions
from nltk import FreqDist
# calculate frequency distribution
fdist1 = FreqDist([w.lower() for w in text1])
print(fdist1)
# get top 50 common words
fdist1.most_common(50)

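`FreqDist` is built on Python's `collections.Counter`, so the counting idea can be tried without any corpus download. The sketch below uses a plain `Counter` as a stand-in on a made-up sentence (an assumption for illustration).

```python
from collections import Counter

# made-up sample tokens, not taken from the corpus
tokens = "The whale and the sea and the ship".split()

# lowercase every token before counting, as in the cell above
counts = Counter(w.lower() for w in tokens)

print(counts.most_common(2))   # the two most frequent tokens
# [('the', 3), ('and', 2)]
print(counts['whale'])         # count for a single token
# 1
print(sum(counts.values()))    # total number of tokens counted
# 8
```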
3.2 Fine-grained Selection of Words
# select the distinct words that are longer than 15 characters
V = set(text1)
long_words = [w for w in V if len(w) > 15]
sorted(long_words)
output:

# select words longer than 7 characters that occur more than 7 times
fdist2 = FreqDist(text2)
sorted(w for w in set(text2) if len(w) > 7 and fdist2[w] > 7)

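The same length-and-frequency filter can be checked by hand on a tiny input. In the sketch below, the sample tokens and the threshold names `min_len` and `min_count` are assumptions for illustration, and a `Counter` stands in for `FreqDist`.

```python
from collections import Counter

# made-up sample tokens, not taken from the corpus
tokens = ("extraordinary circumstances and extraordinary acquaintance "
          "and circumstances").split()
counts = Counter(tokens)

min_len = 7     # hypothetical threshold: keep words longer than this
min_count = 1   # hypothetical threshold: keep words occurring more often than this

selected = sorted(w for w in set(tokens)
                  if len(w) > min_len and counts[w] > min_count)
print(selected)
# ['circumstances', 'extraordinary']
```

"and" is frequent but too short, and "acquaintance" is long but occurs only once, so both are filtered out.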
3.3 Collocations and Bigrams
from nltk import bigrams
list(bigrams(['more', 'is', 'said', 'than', 'done']))

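A bigram list is just each word paired with its successor. The sketch below reproduces that with `zip` and then counts repeated pairs, which is the raw material for finding collocations. The helper name `simple_bigrams` and the second sample sentence are assumptions for illustration, and the counting is a simplification, not NLTK's collocation scoring.

```python
from collections import Counter


def simple_bigrams(tokens):
    """Pair every token with the one that follows it (illustrative helper)."""
    return list(zip(tokens, tokens[1:]))


words = ['more', 'is', 'said', 'than', 'done']
print(simple_bigrams(words))
# [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

# made-up sample tokens: count which pair recurs most often
sample = "red wine and red wine or white wine".split()
pair_counts = Counter(simple_bigrams(sample))
print(pair_counts.most_common(1))
# [(('red', 'wine'), 2)]
```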
3.4 Frequency Distribution Functions
Functions Defined for NLTK's Frequency Distributions
| Example | Description |
| fdist = FreqDist(samples) | create a frequency distribution containing the given samples |
| fdist[sample] += 1 | increment the count for this sample |
| fdist['monstrous'] | count of the number of times a given sample occurred |
| fdist.freq('monstrous') | frequency of a given sample |
| fdist.N() | total number of samples |
| fdist.most_common(n) | the n most common samples and their frequencies |
| for sample in fdist: | iterate over the samples |
| fdist.max() | sample with the greatest count |
| fdist.tabulate() | tabulate the frequency distribution |
| fdist.plot() | graphical plot of the frequency distribution |
| fdist.plot(cumulative=True) | cumulative plot of the frequency distribution |
| fdist1 \|= fdist2 | update fdist1 with counts from fdist2 |
| fdist1 < fdist2 | test if samples in fdist1 occur less frequently than in fdist2 |
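# build a frequency distribution over the tokens of text1
fdist = FreqDist(text1)
print(type(fdist))            # the FreqDist class
print(len(fdist))             # number of distinct tokens
print(list(fdist))            # all distinct tokens
print(fdist.N())              # total number of tokens counted
print(fdist.most_common(5))   # the five most frequent tokens with their counts
fdist['and']                  # count for the token 'and'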
output:

# plot word frequency distribution
fdist.plot()
output:

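Several rows of the table can be tried on `collections.Counter`, the class that `FreqDist` extends. The sketch below is a stand-in under that assumption: the sample lists are made up, and the `Counter` expressions only approximate the corresponding `FreqDist` methods named in the comments.

```python
from collections import Counter

# made-up samples for illustration
c1 = Counter(['a', 'b', 'a'])
c2 = Counter(['b', 'b', 'c'])

total = sum(c1.values())        # comparable to fdist.N()
rel_freq = c1['a'] / total      # comparable to fdist.freq('a')
top = max(c1, key=c1.get)       # comparable to fdist.max()
print(total, round(rel_freq, 2), top)
# 3 0.67 a

c1 |= c2                        # for a Counter, keeps the larger count per sample
print(sorted(c1.items()))
# [('a', 2), ('b', 2), ('c', 1)]
```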
4 Back to Python: Making Decisions and Taking Control
4.1 Conditionals
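A minimal sketch of conditionals applied to words, using a made-up token list (an assumption for illustration) rather than the corpus: a condition inside a list comprehension keeps only the items that pass the test.

```python
# made-up sample list for illustration
sent = ['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago']

# keep tokens shorter than 4 characters
short_words = [w for w in sent if len(w) < 4]
print(short_words)
# ['me', '.', 'ago']

# keep tokens that start with a capital letter followed by lowercase letters
capitalized = [w for w in sent if w.istitle()]
print(capitalized)
# ['Call', 'Ishmael', 'Some']
```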
4.2 Operating on Every Element
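A minimal sketch of applying one operation to every element of a list, again on a made-up token list (an assumption for illustration).

```python
# made-up sample list for illustration
sent = ['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago']

# compute the length of every token
lengths = [len(w) for w in sent]
print(lengths)
# [4, 2, 7, 1, 4, 5, 3]

# convert every token to upper case
upper = [w.upper() for w in sent]
print(upper)
# ['CALL', 'ME', 'ISHMAEL', '.', 'SOME', 'YEARS', 'AGO']
```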
4.3 Nested Code Blocks
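A minimal sketch of nested code blocks: an `if` inside an `if` inside a `for` loop, with indentation marking which block each line belongs to. The function name `describe`, its labels and the sample list are assumptions for illustration.

```python
def describe(tokens):
    """Label each token; hypothetical helper showing nested blocks."""
    labels = []
    for w in tokens:              # outer block: visit each token
        if w.isalpha():           # nested block: alphabetic tokens only
            if w.istitle():       # doubly nested: capitalized words
                labels.append((w, 'titlecase word'))
            else:
                labels.append((w, 'other word'))
        else:
            labels.append((w, 'not a word'))
    return labels


# made-up sample list for illustration
sent = ['Call', 'me', 'Ishmael', '.']
for token, label in describe(sent):
    print(token, 'is a', label)
```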
4.4 Looping with Conditions
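A minimal sketch of combining a loop with a condition: the loop visits every token and the `if` decides whether to act on it. The sample list and the variable name `matches` are assumptions for illustration.

```python
# made-up sample list for illustration
sent = ['Call', 'me', 'Ishmael', '.']

matches = []
for w in sent:
    # only act on tokens that end with the letter 'l'
    if w.endswith('l'):
        matches.append(w)
        print(w)
# Call
# Ishmael
```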
Colab Notebook:
https://colab.research.google.com/drive/1K9RVe42Kp79RyOlHC749wIdXVQR0eysI?usp=sharing

