2. Accessing Text Corpora and Lexical Resources

1 Accessing Text Corpora

A text corpus is a large body of text. Many corpora are designed to contain a careful balance of material in one or more genres.

1.1 Gutenberg Corpus

import nltk
nltk.download('gutenberg')
nltk.corpus.gutenberg.fileids()

Opening a text:

# opening a text
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
len(emma)

A shorter alternative:

# alternative, shorter way
from nltk.corpus import gutenberg
emma = gutenberg.words('austen-emma.txt')
len(emma)

Display other information about each text: average word length, average sentence length, and the number of times each vocabulary item appears in the text on average (a lexical diversity score).

# using loop control structure to display other information about each text
import nltk
nltk.download('punkt')
for fileid in gutenberg.fileids():
  num_chars = len(gutenberg.raw(fileid))
  num_words = len(gutenberg.words(fileid))
  num_sents = len(gutenberg.sents(fileid))
  num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
  print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)

Use the sents() function to divide the text up into its sentences, where each sentence is a list of words:

# Using sents() function to divide the text up into its sentences, 
# where each sentence is a list of words
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
print(macbeth_sentences)
print(macbeth_sentences[1116])
longest_len = max(len(s) for s in macbeth_sentences)
print( longest_len )

1.2 Web and Chat Text

Web text

# download webtext
import nltk
nltk.download('webtext')
# open web corpus
from nltk.corpus import webtext
for fileid in webtext.fileids():
  print(fileid, webtext.raw(fileid)[:65], '...')

Chat text

# download nps_chat corpus
import nltk
nltk.download('nps_chat')
# open nps_chat     
from nltk.corpus import nps_chat
chatroom = nps_chat.posts('10-19-20s_706posts.xml')
print(chatroom[123])

output:

['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',',
'I', 'can', 'look', 'in', 'a', 'mirror', '.']

1.3 Brown Corpus

# download brown corpus
import nltk
nltk.download('brown')
# open brown corpus    
from nltk.corpus import brown
print( brown.categories() )
print( brown.words(categories='news') )
print( brown.words(fileids=['cg22']) )
print( brown.sents(categories=['news', 'editorial', 'reviews']) )

1.4 Reuters Corpus

# download reuters corpus
import nltk
nltk.download('reuters')
# open reuters corpus
from nltk.corpus import reuters
print( reuters.fileids() )
print( reuters.categories() )
# view detailed information
print( reuters.categories('training/9865') )
print( reuters.categories(['training/9865', 'training/9880']) )
print( reuters.fileids('barley') )
print( reuters.fileids(['barley', 'corn']) )

1.5 Inaugural Address Corpus

# download inaugural corpus
import nltk
nltk.download('inaugural')
# open corpus
from nltk.corpus import inaugural
print( inaugural.fileids() )
print( [fileid[:4] for fileid in inaugural.fileids()] )
# get conditional frequency distribution
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd.plot()

output: (a plot of the counts of words beginning with "america" and "citizen", by year of address)

1.6 Annotated Text Corpora

https://www.nltk.org/howto/corpus.html
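
The page above documents NLTK's annotated corpora and their readers. As one small illustration, the Penn Treebank sample distributed with NLTK carries both POS tags and parse trees:

# download the Penn Treebank sample
import nltk
nltk.download('treebank')
from nltk.corpus import treebank
print( treebank.tagged_words('wsj_0001.mrg')[:8] )   # (word, tag) pairs
print( treebank.parsed_sents('wsj_0001.mrg')[0] )    # a parse tree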

1.7 Corpora in Other Languages

# download udhr corpus (Universal Declaration of Human Rights)
import nltk
nltk.download('udhr')
from nltk.corpus import udhr
languages = ['Chickasaw', 'English', 'German_Deutsch',
     'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd17 = nltk.ConditionalFreqDist(
          (lang, len(word))
          for lang in languages
          for word in udhr.words(lang + '-Latin1'))
cfd17.plot(cumulative=True)

[Figure: cumulative word length distributions for the six languages]

1.8 Text Corpus Structure

  • The simplest kind lacks any structure: it is just a collection of texts.

  • Often, texts are grouped into categories that might correspond to genre, source, author, language, etc.

  • Sometimes these categories overlap, notably in the case of topical categories as a text can be relevant to more than one topic.

  • Occasionally, text collections have temporal structure, news collections being the most common example.

[Figure: common structures for text corpora: isolated texts, categorized, overlapping categories, and temporal]

NLTK Corpus and Corpus Reader:

Each corpus module defines one or more “corpus reader functions”, which can be used to read documents from that corpus.

  • fileids() - the files of the corpus

  • fileids([categories]) - the files of the corpus corresponding to these categories

  • categories() - the categories of the corpus

  • categories([fileids]) - the categories of the corpus corresponding to these files

  • raw() - the raw content of the corpus

  • raw(fileids=[f1,f2,f3]) - the raw content of the specified files

  • raw(categories=[c1,c2]) - the raw content of the specified categories

  • words() - the words of the whole corpus

  • words(fileids=[f1,f2,f3]) - the words of the specified fileids

  • words(categories=[c1,c2]) - the words of the specified categories

  • sents() - the sentences of the whole corpus

  • sents(fileids=[f1,f2,f3]) - the sentences of the specified fileids

  • sents(categories=[c1,c2]) - the sentences of the specified categories

  • abspath(fileid) - the location of the given file on disk

  • encoding(fileid) - the encoding of the file (if known)

  • open(fileid) - open a stream for reading the given corpus file

  • root - the path to the root of the locally installed corpus

  • readme() - the contents of the README file of the corpus
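
A brief tour of these methods, using the Gutenberg corpus downloaded earlier (paths and encodings will vary by installation):

from nltk.corpus import gutenberg
print( gutenberg.fileids()[:3] )                   # files of the corpus
print( gutenberg.raw('austen-emma.txt')[:50] )     # raw content
print( gutenberg.words('austen-emma.txt')[:8] )    # words
print( gutenberg.sents('austen-emma.txt')[0] )     # first sentence
print( gutenberg.abspath('austen-emma.txt') )      # location on disk
print( gutenberg.encoding('austen-emma.txt') )     # encoding, if known
print( gutenberg.root )                            # root of the local corpus
print( gutenberg.readme()[:60] )                   # start of the README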

Some of the Corpora and Corpus Samples Distributed with NLTK:

(refer https://www.nltk.org/api/nltk.corpus.reader.html)

Corpus | Compiler | Contents
Brown Corpus | Francis, Kucera | 15 genres, 1.15M words, tagged, categorized
CESS Treebanks | CLiC-UB | 1M words, tagged and parsed (Catalan, Spanish)
Chat-80 Data Files | Pereira & Warren | World Geographic Database
CMU Pronouncing Dictionary | CMU | 127k entries
CoNLL 2000 Chunking Data | CoNLL | 270k words, tagged and chunked
CoNLL 2002 Named Entity | CoNLL | 700k words, pos- and named-entity-tagged (Dutch, Spanish)
CoNLL 2007 Dependency Treebanks (sel) | CoNLL | 150k words, dependency parsed (Basque, Catalan)
Dependency Treebank | Narad | Dependency parsed version of Penn Treebank sample
FrameNet | Fillmore, Baker et al | 10k word senses, 170k manually annotated sentences
Floresta Treebank | Diana Santos et al | 9k sentences, tagged and parsed (Portuguese)
Gazetteer Lists | Various | Lists of cities and countries
Genesis Corpus | Misc web sources | 6 texts, 200k words, 6 languages
Gutenberg (selections) | Hart, Newby, et al | 18 texts, 2M words
Inaugural Address Corpus | CSpan | US Presidential Inaugural Addresses (1789-present)
Indian POS-Tagged Corpus | Kumaran et al | 60k words, tagged (Bangla, Hindi, Marathi, Telugu)
MacMorpho Corpus | NILC, USP, Brazil | 1M words, tagged (Brazilian Portuguese)
Movie Reviews | Pang, Lee | 2k movie reviews with sentiment polarity classification
Names Corpus | Kantrowitz, Ross | 8k male and female names
NIST 1999 Info Extr (selections) | Garofolo | 63k words, newswire and named-entity SGML markup
Nombank | Meyers | 115k propositions, 1400 noun frames
NPS Chat Corpus | Forsyth, Martell | 10k IM chat posts, POS-tagged and dialogue-act tagged
Open Multilingual WordNet | Bond et al | 15 languages, aligned to English WordNet
PP Attachment Corpus | Ratnaparkhi | 28k prepositional phrases, tagged as noun or verb modifiers
Proposition Bank | Palmer | 113k propositions, 3300 verb frames
Question Classification | Li, Roth | 6k questions, categorized
Reuters Corpus | Reuters | 1.3M words, 10k news documents, categorized
Roget's Thesaurus | Project Gutenberg | 200k words, formatted text
RTE Textual Entailment | Dagan et al | 8k sentence pairs, categorized
SEMCOR | Rus, Mihalcea | 880k words, part-of-speech and sense tagged
Senseval 2 Corpus | Pedersen | 600k words, part-of-speech and sense tagged
SentiWordNet | Esuli, Sebastiani | sentiment scores for 145k WordNet synonym sets
Shakespeare texts (selections) | Bosak | 8 books in XML format
State of the Union Corpus | CSPAN | 485k words, formatted text
Stopwords Corpus | Porter et al | 2,400 stopwords for 11 languages
Swadesh Corpus | Wiktionary | comparative wordlists in 24 languages
Switchboard Corpus (selections) | LDC | 36 phonecalls, transcribed, parsed
Univ Decl of Human Rights | United Nations | 480k words, 300+ languages
Penn Treebank (selections) | LDC | 40k words, tagged and parsed
TIMIT Corpus (selections) | NIST/LDC | audio files and transcripts for 16 speakers
VerbNet 2.1 | Palmer et al | 5k verbs, hierarchically organized, linked to WordNet
Wordlist Corpus | OpenOffice.org et al | 960k words and 20k affixes for 8 languages
WordNet 3.0 (English) | Miller, Fellbaum | 145k synonym sets

Sample usage for WordNet

https://www.nltk.org/howto/wordnet.html

Importing the NLTK WordNet corpus reader:

# WordNet is just another NLTK corpus reader, and can be imported like this
from nltk.corpus import wordnet

# For more compact code, we recommend:
from nltk.corpus import wordnet as wn

Before using the WordNet corpus reader, download the corpus data:

import nltk
nltk.download('wordnet')

Look up a word using synsets():

wn.synsets('dog')

Sample usage for SentiWordNet

https://www.nltk.org/howto/sentiwordnet.html

Importing the NLTK SentiWordNet corpus reader:

# SentiWordNet can be imported like this:
from nltk.corpus import sentiwordnet as swn

Before using the SentiWordNet corpus reader, download the corpus data:

import nltk
nltk.download('sentiwordnet')

Example uses of SentiWordNet:

# Example uses of SentiSynsets

breakdown = swn.senti_synset('breakdown.n.03')
print(breakdown)
print('*****')
print('pos score:', breakdown.pos_score())
print('neg score:', breakdown.neg_score())
print('obj score:', breakdown.obj_score())

Example uses of SentiWordNet lookups:

# Example uses of SentiSynset Lookups

print('senti synsets for slow:', list(swn.senti_synsets('slow')))
print('senti synsets for happy:', list(swn.senti_synsets('happy', 'a')))
print('senti synsets for angry:', list(swn.senti_synsets('angry', 'a')))

(Additional) Using pandas to tabulate senti-synsets:

import pandas as pd

# build the list of senti-synsets first (this step was implicit above)
list_senti_synset = list(swn.senti_synsets('slow'))

df_senti_synsets = pd.DataFrame(
    [[a.synset.name(), a.pos_score(), a.neg_score(), a.obj_score()]
     for a in list_senti_synset],
    columns=['synset', 'pos', 'neg', 'obj'])

# keep only the rows that are not mostly objective
df_senti_synsets.loc[df_senti_synsets.obj < 0.5]

1.9 Loading your own Corpus
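
Following the NLTK book, a local directory of plain-text files can be wrapped as a corpus with PlaintextCorpusReader; the directory path below is a placeholder for your own collection:

from nltk.corpus import PlaintextCorpusReader
corpus_root = '/path/to/my/texts'                  # hypothetical location
wordlists = PlaintextCorpusReader(corpus_root, r'.*\.txt')
print( wordlists.fileids() )
print( wordlists.words(wordlists.fileids()[0]) )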

2 Conditional Frequency Distributions

2.1 Conditions and Events

Each pair has the form (condition, event): if we process a sequence of genre-labelled words, the condition is the genre and the event is the word.

text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said',]

pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'),]
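
Building a conditional frequency distribution from such pairs, as a minimal sketch:

import nltk
cfd = nltk.ConditionalFreqDist(pairs)
print( cfd.conditions() )     # ['news']
print( cfd['news'] )          # frequency distribution of the 'news' events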

2.2 Counting Words by Genre
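
From the NLTK book: count every word of every genre in the Brown Corpus, giving one frequency distribution per genre:

import nltk
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
print( cfd['news'].most_common(5) )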

2.3 Plotting and Tabulating Distributions
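
Reusing cfd from 2.2, tabulate() restricts the display to chosen conditions and samples (plot() accepts the same arguments):

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)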

2.4 Generating Random Text with Bigrams
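
The NLTK book's generate_model() repeatedly emits the most likely successor of the current word, using a conditional frequency distribution built from bigrams:

import nltk
nltk.download('genesis')

def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word, end=' ')
        word = cfdist[word].max()   # most likely next word

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
generate_model(cfd, 'living')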

3 More Python: Reusing Code

3.1 Creating Programs with a Text Editor

3.2 Functions
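
A small example function from the NLTK book: the ratio of tokens to distinct word types in a text.

def lexical_diversity(text):
    return len(text) / len(set(text))

from nltk.corpus import gutenberg
print( lexical_diversity(gutenberg.words('austen-emma.txt')) )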

3.3 Modules
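
A minimal sketch of the module workflow (the filename text_utils.py is hypothetical): save the function from 3.2 in its own file, then import it from any other program.

# Step 1: save this in a file named text_utils.py (hypothetical name):
#     def lexical_diversity(text):
#         return len(text) / len(set(text))
# Step 2: import and reuse it from any other program:
from text_utils import lexical_diversity
print( lexical_diversity(['a', 'b', 'b', 'a']) )   # 2.0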

4 Lexical Resources

  • A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information such as part of speech and sense definitions.

  • Lexical resources are secondary to texts, and are usually created and enriched with the help of texts.

  • For example,

    • if we have defined a text my_text,

    • then vocab = sorted(set(my_text)) builds the vocabulary of my_text,

    • while word_freq = FreqDist(my_text) counts the frequency of each word in the text.

  • Both vocab and word_freq are simple lexical resources (see the sketch after this list).

  • Similarly, a concordance gives us information about word usage that might help in the preparation of a dictionary.

  • Standard terminology for lexicons :

    • A lexical entry consists of a headword (also known as a lemma) along with additional information such as the part of speech and the sense definition.

    • Two distinct words having the same spelling are called homonyms.

[Figure: a lexical entry: headword (lemma), part of speech, and sense definition (gloss)]
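
The two simple lexical resources mentioned above, built from a Gutenberg text as a sketch:

import nltk
from nltk.corpus import gutenberg
my_text = gutenberg.words('austen-emma.txt')
vocab = sorted(set(my_text))          # the vocabulary of the text
word_freq = nltk.FreqDist(my_text)    # frequency of each word
print( vocab[:10] )
print( word_freq.most_common(5) )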

4.1 Wordlist Corpora
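
Two wordlist corpora from the NLTK book: a plain English wordlist and the stopwords list.

import nltk
nltk.download('words')
nltk.download('stopwords')
from nltk.corpus import words, stopwords
print( len(words.words()) )
print( stopwords.words('english')[:10] )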

4.2 A Pronouncing Dictionary
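
The CMU Pronouncing Dictionary maps each word to a list of phoneme codes; the slice below follows the NLTK book:

import nltk
nltk.download('cmudict')
entries = nltk.corpus.cmudict.entries()
print( len(entries) )
for word, pron in entries[42371:42379]:
    print(word, pron)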

4.3 Comparative Wordlists
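
The Swadesh corpus holds comparative wordlists; pairing two languages gives a simple translation dictionary (from the NLTK book):

import nltk
nltk.download('swadesh')
from nltk.corpus import swadesh
fr2en = swadesh.entries(['fr', 'en'])   # French-English pairs
translate = dict(fr2en)
print( translate['chien'] )             # 'dog'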

4.4 Shoebox and Toolbox Lexicons
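
A Toolbox lexicon entry is a sequence of attribute-value pairs; the Rotokas dictionary sample is included with NLTK:

import nltk
nltk.download('toolbox')
from nltk.corpus import toolbox
print( toolbox.entries('rotokas.dic')[0] )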

5 WordNet

5.1 Senses and Synonyms
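
Exploring the senses of 'car' with WordNet, following the NLTK book:

from nltk.corpus import wordnet as wn
print( wn.synsets('car') )                     # all senses of 'car'
print( wn.synset('car.n.01').lemma_names() )   # synonyms in the first sense
print( wn.synset('car.n.01').definition() )
print( wn.synset('car.n.01').examples() )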

5.2 The WordNet Hierarchy
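
Moving down (hyponyms) and up (hypernyms) the WordNet concept hierarchy:

from nltk.corpus import wordnet as wn
motorcar = wn.synset('car.n.01')
print( len(motorcar.hyponyms()) )      # more specific concepts
print( motorcar.hypernyms() )          # more general concepts
print( motorcar.root_hypernyms() )     # the top of the hierarchy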

5.3 More Lexical Relations
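
Beyond hypernymy, WordNet records part-whole relations and verb entailments:

from nltk.corpus import wordnet as wn
print( wn.synset('tree.n.01').part_meronyms() )        # parts of a tree
print( wn.synset('tree.n.01').substance_meronyms() )   # what a tree is made of
print( wn.synset('tree.n.01').member_holonyms() )      # what trees form
print( wn.synset('walk.v.01').entailments() )          # walking entails stepping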

5.4 Semantic Similarity
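
path_similarity() scores two senses in (0, 1] based on the shortest path between them in the hypernym hierarchy (example from the NLTK book):

from nltk.corpus import wordnet as wn
right = wn.synset('right_whale.n.01')
minke = wn.synset('minke_whale.n.01')
tortoise = wn.synset('tortoise.n.01')
print( right.lowest_common_hypernyms(minke) )
print( right.path_similarity(minke) )      # close senses score higher
print( right.path_similarity(tortoise) )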

6 Summary

  • A text corpus is a large, structured collection of texts. NLTK comes with many corpora, e.g., the Brown Corpus, nltk.corpus.brown.

  • Some text corpora are categorized, e.g., by genre or topic; sometimes the categories of a corpus overlap each other.

  • A conditional frequency distribution is a collection of frequency distributions, each one for a different condition. They can be used for counting word frequencies, given a context or a genre.

  • Python programs more than a few lines long should be entered using a text editor, saved to a file with a .py extension, and accessed using an import statement.

  • Python functions permit you to associate a name with a particular block of code, and re-use that code as often as necessary.

  • Some functions, known as "methods", are associated with an object and we give the object name followed by a period followed by the function, like this: x.funct(y), e.g., word.isalpha().

  • To find out about some variable v, type help(v) in the Python interactive interpreter to read the help entry for this kind of object.

  • WordNet is a semantically-oriented dictionary of English, consisting of synonym sets — or synsets — and organized into a network.

  • Some functions are not available by default, but must be accessed using Python's import statement.

Colab Notebook:

https://colab.research.google.com/drive/12OWlfeZW9fIWNp7Ob8S8OPPqHgc8gUd5?usp=sharing

Reference:

https://www.nltk.org/book/ch02.html