2. Accessing Text Corpora and Lexical Resources
1 Accessing Text Corpora
A text corpus is a large body of text. Many corpora are designed to contain a careful balance of material in one or more genres.
1.1 Gutenberg Corpus
import nltk
nltk.download('gutenberg')
nltk.corpus.gutenberg.fileids()
output:
opening a text
# opening a text
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
len(emma)
A shorter, alternative way:
# alternative, shorter way
from nltk.corpus import gutenberg
emma = gutenberg.words('austen-emma.txt')
len(emma)
We can use a loop to display other information about each text: average word length, average sentence length, and the number of times each vocabulary item appears in the text on average (the lexical diversity score).
# using loop control structure to display other information about each text
import nltk
nltk.download('punkt')
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)
output:
Using the sents() function to divide the text into its sentences, where each sentence is a list of words:
# Using sents() function to divide the text up into its sentences,
# where each sentence is a list of words
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
print(macbeth_sentences)
print(macbeth_sentences[1116])
longest_len = max(len(s) for s in macbeth_sentences)
print( longest_len )
output:
1.2 Web and Chat Text
Web text
# download webtext
import nltk
nltk.download('webtext')
# open web corpus
from nltk.corpus import webtext
for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:65], '...')
Chat text
# download nps_chat corpus
import nltk
nltk.download('nps_chat')
# open nps_chat
from nltk.corpus import nps_chat
chatroom = nps_chat.posts('10-19-20s_706posts.xml')
print(chatroom[123])
output:
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',',
'I', 'can', 'look', 'in', 'a', 'mirror', '.']
1.3 Brown Corpus
# download brown corpus
import nltk
nltk.download('brown')
# open brown corpus
from nltk.corpus import brown
print( brown.categories() )
print( brown.words(categories='news') )
print( brown.words(fileids=['cg22']) )
print( brown.sents(categories=['news', 'editorial', 'reviews']) )
1.4 Reuters Corpus
# download reuters corpus
import nltk
nltk.download('reuters')
# open reuters corpus
from nltk.corpus import reuters
print( reuters.fileids() )
print( reuters.categories() )
# view detailed information
print( reuters.categories('training/9865') )
print( reuters.categories(['training/9865', 'training/9880']) )
print( reuters.fileids('barley') )
print( reuters.fileids(['barley', 'corn']) )
1.5 Inaugural Address Corpus
# download inaugural corpus
import nltk
nltk.download('inaugural')
# open corpus
from nltk.corpus import inaugural
print( inaugural.fileids() )
print( [fileid[:4] for fileid in inaugural.fileids()] )
# get conditional frequency distribution
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd.plot()
output:
1.6 Annotated Text Corpora
https://www.nltk.org/howto/corpus.html
1.7 Corpora in Other Languages
# download udhr corpus
import nltk
nltk.download('udhr')
# open udhr corpus
from nltk.corpus import udhr
languages = ['Chickasaw', 'English', 'German_Deutsch',
             'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd17 = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))
cfd17.plot(cumulative=True)
1.8 Text Corpus Structure
The simplest kind lacks any structure: it is just a collection of texts.
Often, texts are grouped into categories that might correspond to genre, source, author, language, etc.
Sometimes these categories overlap, notably in the case of topical categories as a text can be relevant to more than one topic.
Occasionally, text collections have temporal structure, news collections being the most common example.
NLTK Corpus and Corpus Reader:
Each corpus module defines one or more “corpus reader functions”, which can be used to read documents from that corpus.
corpus reader: https://www.nltk.org/api/nltk.corpus.reader.html
NLTK Corpus-related functions and their descriptions:
fileids() - the files of the corpus
fileids([categories]) - the files of the corpus corresponding to these categories
categories() - the categories of the corpus
categories([fileids]) - the categories of the corpus corresponding to these files
raw() - the raw content of the corpus
raw(fileids=[f1,f2,f3]) - the raw content of the specified files
raw(categories=[c1,c2]) - the raw content of the specified categories
words() - the words of the whole corpus
words(fileids=[f1,f2,f3]) - the words of the specified fileids
words(categories=[c1,c2]) - the words of the specified categories
sents() - the sentences of the whole corpus
sents(fileids=[f1,f2,f3]) - the sentences of the specified fileids
sents(categories=[c1,c2]) - the sentences of the specified categories
abspath(fileid) - the location of the given file on disk
encoding(fileid) - the encoding of the file (if known)
open(fileid) - open a stream for reading the given corpus file
root - the path to the root of the locally installed corpus
readme() - the contents of the README file of the corpus
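For example, a few of these generic functions applied to the Gutenberg corpus (a short sketch; it assumes the gutenberg data has already been downloaded as above):
# applying some generic corpus-reader functions to the Gutenberg corpus
from nltk.corpus import gutenberg
print( gutenberg.fileids() )                       # the files of the corpus
print( gutenberg.raw('austen-emma.txt')[:75] )     # raw content of one file
print( gutenberg.abspath('austen-emma.txt') )      # location of the file on disk
print( gutenberg.encoding('austen-emma.txt') )     # encoding of the file
print( gutenberg.readme()[:200] )                  # the corpus README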
Some of the Corpora and Corpus Samples Distributed with NLTK:
(refer https://www.nltk.org/api/nltk.corpus.reader.html)
Corpus | Compiler | Contents |
Brown Corpus | Francis, Kucera | 15 genres, 1.15M words, tagged, categorized |
CESS Treebanks | CLiC-UB | 1M words, tagged and parsed (Catalan, Spanish) |
Chat-80 Data Files | Pereira & Warren | World Geographic Database |
CMU Pronouncing Dictionary | CMU | 127k entries |
CoNLL 2000 Chunking Data | CoNLL | 270k words, tagged and chunked |
CoNLL 2002 Named Entity | CoNLL | 700k words, pos- and named-entity-tagged (Dutch, Spanish) |
CoNLL 2007 Dependency Treebanks (sel) | CoNLL | 150k words, dependency parsed (Basque, Catalan) |
Dependency Treebank | Narad | Dependency parsed version of Penn Treebank sample |
FrameNet | Fillmore, Baker et al | 10k word senses, 170k manually annotated sentences |
Floresta Treebank | Diana Santos et al | 9k sentences, tagged and parsed (Portuguese) |
Gazetteer Lists | Various | Lists of cities and countries |
Genesis Corpus | Misc web sources | 6 texts, 200k words, 6 languages |
Gutenberg (selections) | Hart, Newby, et al | 18 texts, 2M words |
Inaugural Address Corpus | CSpan | US Presidential Inaugural Addresses (1789-present) |
Indian POS-Tagged Corpus | Kumaran et al | 60k words, tagged (Bangla, Hindi, Marathi, Telugu) |
MacMorpho Corpus | NILC, USP, Brazil | 1M words, tagged (Brazilian Portuguese) |
Movie Reviews | Pang, Lee | 2k movie reviews with sentiment polarity classification |
Names Corpus | Kantrowitz, Ross | 8k male and female names |
NIST 1999 Info Extr (selections) | Garofolo | 63k words, newswire and named-entity SGML markup |
Nombank | Meyers | 115k propositions, 1400 noun frames |
NPS Chat Corpus | Forsyth, Martell | 10k IM chat posts, POS-tagged and dialogue-act tagged |
Open Multilingual WordNet | Bond et al | 15 languages, aligned to English WordNet |
PP Attachment Corpus | Ratnaparkhi | 28k prepositional phrases, tagged as noun or verb modifiers |
Proposition Bank | Palmer | 113k propositions, 3300 verb frames |
Question Classification | Li, Roth | 6k questions, categorized |
Reuters Corpus | Reuters | 1.3M words, 10k news documents, categorized |
Roget's Thesaurus | Project Gutenberg | 200k words, formatted text |
RTE Textual Entailment | Dagan et al | 8k sentence pairs, categorized |
SEMCOR | Rus, Mihalcea | 880k words, part-of-speech and sense tagged |
Senseval 2 Corpus | Pedersen | 600k words, part-of-speech and sense tagged |
SentiWordNet | Esuli, Sebastiani | sentiment scores for 145k WordNet synonym sets |
Shakespeare texts (selections) | Bosak | 8 books in XML format |
State of the Union Corpus | CSPAN | 485k words, formatted text |
Stopwords Corpus | Porter et al | 2,400 stopwords for 11 languages |
Swadesh Corpus | Wiktionary | comparative wordlists in 24 languages |
Switchboard Corpus (selections) | LDC | 36 phonecalls, transcribed, parsed |
Univ Decl of Human Rights | United Nations | 480k words, 300+ languages |
Penn Treebank (selections) | LDC | 40k words, tagged and parsed |
TIMIT Corpus (selections) | NIST/LDC | audio files and transcripts for 16 speakers |
VerbNet 2.1 | Palmer et al | 5k verbs, hierarchically organized, linked to WordNet |
Wordlist Corpus | OpenOffice.org et al | 960k words and 20k affixes for 8 languages |
WordNet 3.0 (English) | Miller, Fellbaum | 145k synonym sets |
Sample usage for WordNet
https://www.nltk.org/howto/wordnet.html
Importing the NLTK WordNet corpus reader:
# WordNet is just another NLTK corpus reader, and can be imported like this
from nltk.corpus import wordnet
# For more compact code, we recommend:
from nltk.corpus import wordnet as wn
Before using the NLTK WordNet corpus reader, download the corpus data:
import nltk
nltk.download('wordnet')
Look up a word using synsets():
wn.synsets('dog')
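Each synset returned by synsets() can then be inspected further (a small sketch using the standard WordNet reader methods):
# inspect the first synset of 'dog'
print( wn.synsets('dog') )                       # all synsets containing 'dog'
print( wn.synset('dog.n.01').lemma_names() )     # synonyms in that synset
print( wn.synset('dog.n.01').definition() )      # its gloss
print( wn.synset('dog.n.01').examples() )        # example sentences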
Sample usage for SentiWordNet
https://www.nltk.org/howto/sentiwordnet.html
Importing the NLTK SentiWordNet corpus reader:
# SentiWordNet can be imported like this:
from nltk.corpus import sentiwordnet as swn
Before using the NLTK SentiWordNet corpus reader, download the corpus data:
import nltk
nltk.download('sentiwordnet')
Example uses of SentiWordNet:
# Example uses of SentiSynsets
breakdown = swn.senti_synset('breakdown.n.03')
print(breakdown)
print('*****')
print ('pos score:', breakdown.pos_score())
print ()
print ('neg score:', breakdown.neg_score())
print ()
print ('obj score:', breakdown.obj_score())
Example uses of SentiWordNet lookups:
# Example uses of SentiSynset Lookups
print ('list swn senti synsets: ' , list(swn.senti_synsets('slow')))
print ()
print ('swn senti synsets for happy: ' , list(swn.senti_synsets('happy', 'a')) )
print ()
print ('swn senti synsets for angry: ' , list(swn.senti_synsets('angry', 'a')) )
print ()
(Additional) Using pandas to tabulate the scores:
# tabulate SentiSynset scores with pandas
# note: list_senti_synsets is assumed here to be one of the lookup results above, e.g. for 'slow'
import pandas as pd
list_senti_synsets = list(swn.senti_synsets('slow'))
df_senti_synsets = pd.DataFrame(
    [[a.synset.name(), a.pos_score(), a.neg_score(), a.obj_score()] for a in list_senti_synsets])
df_senti_synsets.rename(columns={0: 'synset', 1: 'pos', 2: 'neg', 3: 'obj'}, inplace=True)
df_senti_synsets.loc[df_senti_synsets.obj < 0.5]
1.9 Loading your own Corpus
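NLTK's PlaintextCorpusReader can load a directory of your own text files as a corpus. A minimal sketch, where the directory path is a placeholder for your own collection of .txt files:
# load a local directory of .txt files as a corpus
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/path/to/my/texts'   # placeholder: directory containing your .txt files
wordlists = PlaintextCorpusReader(corpus_root, r'.*\.txt')
print( wordlists.fileids() )
print( wordlists.words(wordlists.fileids()[0]) )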
2 Conditional Frequency Distributions
2.1 Conditions and Events
text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said',]
pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'),]
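A conditional frequency distribution is built from such a list of pairs, where each pair is (condition, event). A small sketch reusing the pairs list above:
# build a ConditionalFreqDist from (condition, event) pairs
import nltk
cfd = nltk.ConditionalFreqDist(pairs)
print( cfd.conditions() )   # the conditions seen, here just ['news']
print( cfd['news'] )        # a FreqDist of the events observed under 'news'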
2.2 Counting Words by Genre
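One way to count words by genre is to pair each genre of the Brown Corpus with each of its words (a sketch; it assumes the brown corpus has been downloaded as above):
# count word occurrences for every genre of the Brown Corpus
import nltk
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
print( cfd.conditions() )    # the genres
print( len(cfd['news']) )    # number of distinct word types in the 'news' genre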
2.3 Plotting and Tabulating Distributions
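A conditional frequency distribution can be plotted with plot() or tabulated with tabulate(), optionally restricted to chosen conditions and samples. A sketch counting modal verbs in a few Brown genres (assumes the brown corpus is downloaded):
# tabulate modal verb counts for selected genres
import nltk
from nltk.corpus import brown
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in genres
    for word in brown.words(categories=genre))
cfd.tabulate(conditions=genres, samples=modals)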
2.4 Generating Random Text with Bigrams
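One simple approach is to build a conditional frequency distribution over bigrams and then repeatedly emit the most likely successor of the current word. A sketch using the Genesis corpus (the starting word 'living' is arbitrary):
# generate text by always choosing the most likely next word
import nltk
nltk.download('genesis')

def generate_model(cfdist, word, num=15):
    # print the current word, then move to its most likely successor
    for i in range(num):
        print(word, end=' ')
        word = cfdist[word].max()

text = nltk.corpus.genesis.words('english-kjv.txt')
cfd = nltk.ConditionalFreqDist(nltk.bigrams(text))
generate_model(cfd, 'living')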
3 More Python: Reusing Code
3.1 Creating Programs with a Text Editor
3.2 Functions
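For example, the lexical diversity calculation used earlier can be wrapped in a function and reused on any list of words (a small sketch):
# define a reusable function for lexical diversity
def lexical_diversity(my_text_data):
    word_count = len(my_text_data)
    vocab_size = len(set(my_text_data))
    return vocab_size / word_count

print( lexical_diversity(['a', 'rose', 'is', 'a', 'rose']) )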
3.3 Modules
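A function like the one above can be saved in its own file and imported wherever it is needed; the file name text_proc.py below is a placeholder:
# save lexical_diversity() in a file called text_proc.py, then reuse it:
import nltk
from text_proc import lexical_diversity
print( lexical_diversity(nltk.corpus.brown.words(categories='news')) )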
4 Lexical Resources
A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information such as part of speech and sense definitions.
Lexical resources are secondary to texts, and are usually created and enriched with the help of texts.
For example,
if we have defined a text my_text,
then vocab = sorted(set(my_text)) builds the vocabulary of my_text,
while word_freq = FreqDist(my_text) counts the frequency of each word in the text.
Both vocab and word_freq are simple lexical resources.
Similarly, a concordance gives us information about word usage that might help in the preparation of a dictionary.
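A minimal sketch of the vocab and word_freq resources described above, using the Emma text loaded earlier:
# two simple lexical resources derived from a text
import nltk
from nltk.corpus import gutenberg
my_text = gutenberg.words('austen-emma.txt')
vocab = sorted(set(my_text))          # the vocabulary of my_text
word_freq = nltk.FreqDist(my_text)    # frequency of each word in my_text
print( word_freq.most_common(10) )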
Standard terminology for lexicons:
A lexical entry consists of a headword (also known as a lemma) along with additional information such as the part of speech and the sense definition.
Two distinct words having the same spelling are called homonyms.
4.1 Wordlist Corpora
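For example, the Words Corpus and the Stopwords Corpus are plain wordlists (a short sketch; the downloads are only needed once):
# wordlist corpora: a standard English wordlist and a stopword list
import nltk
nltk.download('words')
nltk.download('stopwords')
from nltk.corpus import words, stopwords
print( len(words.words()) )                # size of the English wordlist
print( stopwords.words('english')[:10] )   # common words usually filtered out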
4.2 A Pronouncing Dictionary
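The CMU Pronouncing Dictionary maps each word to a list of phoneme codes (a short sketch):
# CMU Pronouncing Dictionary: (word, pronunciation) pairs
import nltk
nltk.download('cmudict')
entries = nltk.corpus.cmudict.entries()
print( len(entries) )
for word, pron in entries[:5]:
    print(word, pron)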
4.3 Comparative Wordlists
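The Swadesh wordlists contain the same core vocabulary in many languages, so aligned entries can serve as a crude translation table (a short sketch):
# comparative wordlists: build a simple French-English lookup
import nltk
nltk.download('swadesh')
from nltk.corpus import swadesh
print( swadesh.fileids() )               # one wordlist per language
fr2en = swadesh.entries(['fr', 'en'])    # aligned French-English pairs
translate = dict(fr2en)
print( translate['chien'] )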
4.4 Shoebox and Toolbox Lexicons
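Toolbox (formerly Shoebox) files are lexicons used in field linguistics; each entry is a list of (field, value) pairs (a short sketch using the Rotokas sample):
# Toolbox lexicon: entries are lists of (field, value) pairs
import nltk
nltk.download('toolbox')
from nltk.corpus import toolbox
print( toolbox.entries('rotokas.dic')[:1] )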
5 WordNet
5.1 Senses and Synonyms
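A small sketch of looking up senses and synonyms with the WordNet reader (wn as imported in the sample usage above):
# senses and synonyms of 'motorcar'
from nltk.corpus import wordnet as wn
print( wn.synsets('motorcar') )                 # the senses containing 'motorcar'
print( wn.synset('car.n.01').lemma_names() )    # synonyms in that synset
print( wn.synset('car.n.01').definition() )
print( wn.synset('car.n.01').examples() )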
5.2 The WordNet Hierarchy
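Synsets are linked into a hierarchy by hypernym (more general) and hyponym (more specific) relations; continuing the sketch above:
# navigate the WordNet hierarchy around 'car.n.01'
motorcar = wn.synset('car.n.01')
print( len(motorcar.hyponyms()) )        # more specific kinds of motorcar
print( motorcar.hypernyms() )            # more general concepts
print( motorcar.root_hypernyms() )       # the top of the hierarchy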
5.3 More Lexical Relations
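Beyond hypernymy, WordNet records part-whole and antonymy relations; a short sketch:
# meronyms, holonyms and antonyms
print( wn.synset('tree.n.01').part_meronyms() )        # parts of a tree
print( wn.synset('tree.n.01').substance_meronyms() )   # what a tree is made of
print( wn.synset('tree.n.01').member_holonyms() )      # what trees are members of
print( wn.lemma('supply.n.02.supply').antonyms() )     # antonymy holds between lemmas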
5.4 Semantic Similarity
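Two synsets can be compared by how close they sit in the hypernym hierarchy, e.g. with path_similarity(); a short sketch:
# semantic similarity via the hypernym hierarchy
right = wn.synset('right_whale.n.01')
minke = wn.synset('minke_whale.n.01')
tortoise = wn.synset('tortoise.n.01')
print( right.lowest_common_hypernyms(minke) )   # most specific shared ancestor
print( right.path_similarity(minke) )           # higher score = closer in the hierarchy
print( right.path_similarity(tortoise) )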
6 Summary
A text corpus is a large, structured collection of texts. NLTK comes with many corpora, e.g., the Brown Corpus, nltk.corpus.brown.
Some text corpora are categorized, e.g., by genre or topic; sometimes the categories of a corpus overlap each other.
A conditional frequency distribution is a collection of frequency distributions, each one for a different condition. They can be used for counting word frequencies, given a context or a genre.
Python programs more than a few lines long should be entered using a text editor, saved to a file with a .py extension, and accessed using an import statement.
Python functions permit you to associate a name with a particular block of code, and re-use that code as often as necessary.
Some functions, known as "methods", are associated with an object and we give the object name followed by a period followed by the function, like this: x.funct(y), e.g., word.isalpha().
To find out about some variable v, type help(v) in the Python interactive interpreter to read the help entry for this kind of object.
WordNet is a semantically-oriented dictionary of English, consisting of synonym sets — or synsets — and organized into a network.
Some functions are not available by default, but must be accessed using Python's import statement.
Colab Notebook:
https://colab.research.google.com/drive/12OWlfeZW9fIWNp7Ob8S8OPPqHgc8gUd5?usp=sharing