1. Language Processing and Python


Mohamad's interest is in Programming (Mobile, Web, Database and Machine Learning). He is studying at the Center For Artificial Intelligence Technology (CAIT), Universiti Kebangsaan Malaysia (UKM).

1 Computing with Language: Texts and Words

1.1 Getting Started with Python

# print hello world
print('hello world')

output:

1.2 Getting Started with NLTK

# import nltk
import nltk
nltk

output:

# import gutenberg plaintext corpus reader
from nltk.corpus import gutenberg
gutenberg

output:

# download gutenberg files
nltk.download('gutenberg')

output:

# get gutenberg file ids
gutenberg.fileids()

output:

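# test opening the file with plain Python
# (the path below is where nltk_data was downloaded in this environment; adjust it to your own nltk_data location)
with open("/root/nltk_data/corpora/gutenberg/melville-moby_dick.txt", "r") as f:
    print(f.read()[0:300])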

output:

# create a corpus for moby dick
corpus1 = gutenberg.words('melville-moby_dick.txt')
text1 = nltk.Text(corpus1)
text1

output:

# create a corpus for sense and sensibility
corpus2 = gutenberg.words('austen-sense.txt')
text2 = nltk.Text(corpus2)
text2

output:

1.3 Searching Text

# get concordance for a sample word monstrous
text1.concordance("monstrous")

output:
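
To make clear what a concordance reports, here is a minimal plain-Python sketch. It is not NLTK's implementation: the helper name `concordance_sketch`, the `width` parameter, and the toy token list are assumptions made up for illustration. It matches the word case-insensitively and returns each hit with a few tokens of context on either side.

```python
def concordance_sketch(tokens, word, width=3):
    """Return (left context, hit, right context) for each occurrence of word."""
    target = word.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - width):i]      # up to `width` tokens before the hit
            right = tokens[i + 1:i + 1 + width]     # up to `width` tokens after the hit
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits

# toy token list, invented for this example (not taken from the corpus)
toy_tokens = "the monstrous whale was a most monstrous size".split()

for left, hit, right in concordance_sketch(toy_tokens, "monstrous"):
    print(f"{left:>15} [{hit}] {right}")
```

Each printed line centres the matched word, which is the same idea `text1.concordance("monstrous")` applies to the whole book.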

# compare which words appear in contexts similar to "monstrous" in each of the two texts
text1.similar("monstrous")
text2.similar("monstrous")

output:

# examine just the contexts 
# that are shared by two or more words, 
# such as monstrous and very
text2.common_contexts(["monstrous", "very"])

output:

# plot where each word occurs across the text (lexical dispersion)
text1.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

output:
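
A dispersion plot is essentially a picture of word positions. As a rough sketch of the data behind it, the following plain-Python helper records the token offset of every occurrence of each target word. The function name `word_offsets` and the toy token list are assumptions for illustration only; this does not draw anything and is not how NLTK builds its plot internally.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

# toy token list, invented for this example
toy_tokens = "freedom and duties , duties and freedom".split()

positions = word_offsets(toy_tokens, ["freedom", "duties", "America"])
for word, where in positions.items():
    print(word, where)
```

A word that never occurs, such as "America" in the toy list, simply gets an empty list, which corresponds to an empty row in the plot.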

# just for fun, generate random text in the style of the corpus
# (the 'punkt' tokenizer data is downloaded first)
nltk.download('punkt')
text2.generate()

1.4 Counting Vocabulary

# get length of corpus text
len(text2)

output:

# create unique set of corpus text and sort in ascending order
sorted(set(text2))

output:

# measure lexical richness of text
len(set(text2)) / len(text2)

output:

Note: the number of distinct words is only about 4% of the total number of words, i.e. most of the tokens in the corpus are repetitions of words that have already appeared.
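
The arithmetic behind this ratio is easier to see on a tiny example. The sentence below is invented for illustration and has nothing to do with the corpus; it is only a worked instance of distinct words divided by total words.

```python
# toy sentence, invented for this example
toy_tokens = "the cat sat on the mat and the dog sat too".split()

n_tokens = len(toy_tokens)        # every word occurrence counts
n_types = len(set(toy_tokens))    # each distinct word counts once
diversity = n_types / n_tokens

print(n_tokens, n_types, round(diversity, 3))  # 11 8 0.727
```

Here "the" and "sat" repeat, so 11 tokens collapse to 8 distinct words. A long novel repeats its vocabulary far more often, which is why its ratio is so much lower.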

# count how often a word occurs in a text, and 
# compute what percentage of the text is taken up by a specific word
print(text1.count("whale"))
print(100 * text1.count('whale') / len(text1))

output:

Using functions for repetitive tasks.

Define functions:


# define a function, lexical_diversity
def lexical_diversity(text):
  return len(set(text)) / len(text)

# define a function, percentage
def percentage(count, total):
  return 100 * count / total

Call functions:

# call lexical_diversity
print(lexical_diversity(text2))

# call percentage
print(percentage(text1.count('whale'), len(text1)))
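
One thing to keep in mind is that both functions divide by a length, so an empty text would raise `ZeroDivisionError`. A possible guarded variant is sketched below; the `_safe` names and the choice to return `0.0` for empty input are assumptions of this sketch, not part of the original notes.

```python
def lexical_diversity_safe(text):
    """Distinct tokens divided by total tokens; 0.0 for an empty text."""
    tokens = list(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

def percentage_safe(count, total):
    """Percentage of total taken up by count; 0.0 when total is zero."""
    return 100 * count / total if total else 0.0

print(lexical_diversity_safe(['a', 'b', 'a', 'a']))  # 0.5
print(percentage_safe(1, 4))                         # 25.0
print(lexical_diversity_safe([]))                    # 0.0
```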

2 A Closer Look at Python: Texts as Lists of Words

2.1 Lists

# declare a list of words
sent1 = ['Call', 'me', 'Ishmael', '.']
# print the list content
print(sent1)
# prints: ['Call', 'me', 'Ishmael', '.']
# compute the length of sent1
print(len(sent1))
# get the lexical diversity of sent1
print(lexical_diversity(sent1))

output:

# append new word to an existing list
sent1.append("Some")
print( sent1 )

output:

2.2 Indexing Lists

# get the word at index 1 of the text1 list
print(text1[1])
# prints: Moby
# get the index of the first occurrence of the word 'Moby'
print(text1.index('Moby'))

output:

# print words in the index range that starts at index 0 and end before index 6
print( text1[0:6] )
# print words in the index range that starts at index 4 and end before index 6
print( text1[4:6] )
# print words in the index range that starts at second last index and ends at the last index
print( text1[-2:] )

output:
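
Since indexing and slicing work the same way on any Python list, they can be tried on a short list where the results are easy to check by eye. The list below is a toy example made up for this purpose.

```python
# toy sentence as a list of tokens, invented for this example
sent = ['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago']

first_three = sent[0:3]            # indexes 0, 1, 2 (the end index is excluded)
middle = sent[4:6]                 # indexes 4 and 5
last_two = sent[-2:]               # the final two items
where = sent.index('Ishmael')      # position of the first match

print(first_three)  # ['Call', 'me', 'Ishmael']
print(middle)       # ['Some', 'years']
print(last_two)     # ['years', 'ago']
print(where)        # 2
```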

3 Computing with Language: Simple Statistics

3.1 Frequency Distributions

from nltk import FreqDist
# calculate frequency distribution
fdist1 = FreqDist([w.lower() for w in text1])
print(fdist1)
# get top 50 common words
fdist1.most_common(50)
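
In NLTK 3, `FreqDist` is a subclass of Python's `collections.Counter`, so the counting step can be tried without NLTK. The sketch below uses a toy token list (an assumption for illustration) and also shows that punctuation is counted like any other token unless it is filtered out.

```python
from collections import Counter

# toy token list, invented for this example
toy_tokens = ['The', 'whale', ',', 'the', 'WHALE', 'and', 'the', 'sea', '.']

# lowercase first so 'The', 'the' and 'WHALE', 'whale' are counted together
all_counts = Counter(w.lower() for w in toy_tokens)

# optional: keep alphabetic tokens only, dropping punctuation
word_counts = Counter(w.lower() for w in toy_tokens if w.isalpha())

print(all_counts.most_common(2))  # [('the', 3), ('whale', 2)]
print(len(all_counts), len(word_counts))  # 6 4
```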

3.2 Fine-grained Selection of Words

# collect the distinct words of text1 that are longer than 15 characters
V = set(text1)
long_words = [w for w in V if len(w) > 15]
sorted(long_words)

output:

# select words longer than 7 characters that occur more than 7 times
fdist2 = FreqDist(text2)
sorted(w for w in set(text2) if len(w) > 7 and fdist2[w] > 7)
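
The two conditions in that selection, a minimum length and a minimum count, can be checked on a small list using `collections.Counter` in place of `FreqDist`. The token list and the lower count threshold of 2 are assumptions chosen so the toy example has something to show.

```python
from collections import Counter

# toy token list, invented for this example
toy_tokens = ["extraordinary", "news", "extraordinary",
              "circumstances", "news", "extraordinary"]

counts = Counter(toy_tokens)

# keep words that are both long (more than 7 characters) and frequent (more than 2 occurrences)
selected = sorted(w for w in set(toy_tokens) if len(w) > 7 and counts[w] > 2)

print(selected)  # ['extraordinary']
```

"circumstances" is long but occurs once, and "news" is frequent enough to be common but too short, so only "extraordinary" passes both tests.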

3.3 Collocations and Bigrams

from nltk import bigrams

# list the pairs of adjacent words (bigrams) in a sentence
list(bigrams(['more', 'is', 'said', 'than', 'done']))
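
A bigram is just a pair of adjacent items, so the same pairs can be produced in plain Python by zipping a list with itself shifted by one. The helper name `bigrams_sketch` is an assumption for illustration; it is a stand-in for, not a copy of, `nltk.bigrams`.

```python
def bigrams_sketch(tokens):
    """Return the list of adjacent (left, right) pairs in tokens."""
    return list(zip(tokens, tokens[1:]))

pairs = bigrams_sketch(['more', 'is', 'said', 'than', 'done'])
print(pairs)
# [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
```

A list of n tokens yields n - 1 pairs, and a list with fewer than two tokens yields none.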

3.4 Frequency Distribution Functions

Functions Defined for NLTK's Frequency Distributions

Example                      Description
fdist = FreqDist(samples)    create a frequency distribution containing the given samples
fdist[sample] += 1           increment the count for this sample
fdist['monstrous']           count of the number of times a given sample occurred
fdist.freq('monstrous')      frequency of a given sample
fdist.N()                    total number of samples
fdist.most_common(n)         the n most common samples and their frequencies
for sample in fdist:         iterate over the samples
fdist.max()                  sample with the greatest count
fdist.tabulate()             tabulate the frequency distribution
fdist.plot()                 graphical plot of the frequency distribution
fdist.plot(cumulative=True)  cumulative plot of the frequency distribution
fdist1 |= fdist2             update fdist1 with counts from fdist2
fdist1 < fdist2              test if samples in fdist1 occur less frequently than in fdist2

# build a frequency distribution over the tokens of text1
fdist = FreqDist(text1)

print(type(fdist))

print(len(fdist))

print(list(fdist))

print(fdist.N())

print(fdist.most_common(5))

fdist['and']

output:
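
To see what the total, relative frequency, and maximum mean in terms of raw counts, here is a rough equivalent using `collections.Counter` on a toy list. The variable names and data are assumptions for illustration; the comments name the `FreqDist` call each line corresponds to.

```python
from collections import Counter

# toy samples, invented for this example
samples = ['a', 'b', 'a', 'c', 'a', 'b']
counts = Counter(samples)

total = sum(counts.values())             # like fdist.N(): total number of samples
distinct = len(counts)                   # like len(fdist): number of distinct samples
freq_a = counts['a'] / total             # like fdist.freq('a'): share of the total
most_frequent = counts.most_common(1)[0][0]  # like fdist.max(): sample with the greatest count

print(total, distinct, freq_a, most_frequent)  # 6 3 0.5 a
```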

# plot word frequency distribution
fdist.plot()

output:

4 Back to Python: Making Decisions and Taking Control

4.1 Conditionals

4.2 Operating on Every Element

4.3 Nested Code Blocks

4.4 Looping with Conditions
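
As a small illustration of combining a loop with conditions, the sketch below walks over a list of tokens and labels each one using `if`, `elif`, and `else`. The helper name `classify_tokens` and the label strings are assumptions made up for this example.

```python
def classify_tokens(tokens):
    """Label each token as lowercase, titlecase, or other."""
    labels = []
    for token in tokens:
        if token.islower():
            labels.append((token, 'lowercase word'))
        elif token.istitle():
            labels.append((token, 'titlecase word'))
        else:
            labels.append((token, 'punctuation or other'))
    return labels

for token, label in classify_tokens(['Call', 'me', 'Ishmael', '.']):
    print(token, 'is a', label)
```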

Colab Notebook:

https://colab.research.google.com/drive/1K9RVe42Kp79RyOlHC749wIdXVQR0eysI?usp=sharing

Source:

https://www.nltk.org/book/ch01.html
