NLTK stands for Natural Language Toolkit, a platform for building Python programs that work with human language data.
By "natural language" we mean a language that is used for everyday communication by humans; languages like English, Hindi or Portuguese.
In contrast to artificial languages such as programming languages and mathematical notations, natural languages have evolved as they pass from generation to generation, and are hard to pin down with explicit rules.
Natural Language Processing, or NLP for short, covers in a wide sense any kind of computer manipulation of natural language. At one extreme, it could be as simple as counting word frequencies to compare different writing styles. At the other extreme, NLP involves "understanding" complete human utterances, at least to the extent of being able to give useful responses to them.
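As a concrete illustration of the simple end of that spectrum, the sketch below counts word frequencies with NLTK's FreqDist. It is a minimal example rather than anything prescribed by the toolkit: the sample text is invented, and the commented-out nltk.download call assumes the tokenizer models are not yet installed (the resource name varies slightly across NLTK versions).

```python
import nltk

# One-time setup: the tokenizer models must be downloaded before first use.
# (The resource is named 'punkt' in older NLTK releases, 'punkt_tab' in newer ones.)
# nltk.download('punkt')

text = ("The quick brown fox jumps over the lazy dog. "
        "The dog was not amused by the quick fox.")

tokens = nltk.word_tokenize(text.lower())  # split the text into word tokens
fdist = nltk.FreqDist(tokens)              # count how often each token occurs

# The most frequent tokens give a crude fingerprint of the writing style.
print(fdist.most_common(5))
```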
Technologies based on NLP are becoming increasingly widespread. For example, phones and handheld computers support predictive text and handwriting recognition; web search engines give access to information locked up in unstructured text; machine translation allows us to retrieve texts written in Chinese and read them in Spanish; text analysis enables us to detect sentiment in tweets and blogs. By providing more natural human-machine interfaces, and more sophisticated access to stored information, language processing has come to play a central role in the multilingual information society.
Natural Language Toolkit (NLTK)
NLTK was created in 2001 as part of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania. Since then it has been developed and expanded with the help of dozens of contributors. It has been adopted in courses at dozens of universities and serves as the basis of many research projects.
Language processing tasks and corresponding NLTK modules with examples of functionality
| Language processing task | NLTK modules | Functionality |
| --- | --- | --- |
| Accessing corpora | corpus | standardized interfaces to corpora and lexicons |
| String processing | tokenize, stem | tokenizers, sentence tokenizers, stemmers |
| Collocation discovery | collocations | t-test, chi-squared, point-wise mutual information |
| Part-of-speech tagging | tag | n-gram, backoff, Brill, HMM, TnT |
| Machine learning | classify, cluster, tbl | decision tree, maximum entropy, naive Bayes, EM, k-means |
| Chunking | chunk | regular expression, n-gram, named-entity |
| Parsing | parse, ccg | chart, feature-based, unification, probabilistic, dependency |
| Semantic interpretation | sem, inference | lambda calculus, first-order logic, model checking |
| Evaluation metrics | metrics | precision, recall, agreement coefficients |
| Probability and estimation | probability | frequency distributions, smoothed probability distributions |
| Applications | app, chat | graphical concordancer, parsers, WordNet browser, chatbots |
| Linguistic fieldwork | toolbox | manipulate data in SIL Toolbox format |
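To give a feel for how a few of these modules fit together, here is a small sketch that tokenizes a sentence (the tokenize module), stems the tokens (the stem module, via the Porter stemmer), and tags parts of speech (the tag module). The sample sentence is invented, and the commented-out download calls assume the required models are not yet installed; resource names can differ across NLTK versions.

```python
import nltk
from nltk.stem import PorterStemmer

# One-time setup: download the tokenizer and tagger models before first use.
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

sentence = "NLTK gives Python programmers easy access to many language processing tools."

stemmer = PorterStemmer()
tokens = nltk.word_tokenize(sentence)            # tokenize module
stems = [stemmer.stem(t) for t in tokens]        # stem module (Porter stemmer)
tagged = nltk.pos_tag(tokens)                    # tag module (part-of-speech tagging)

print(tokens)   # e.g. ['NLTK', 'gives', 'Python', ...]
print(stems)    # e.g. ['nltk', 'give', 'python', ...]
print(tagged)   # e.g. [('NLTK', 'NNP'), ('gives', 'VBZ'), ...]
```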
NLTK was designed with four primary goals in mind:
- Simplicity: To provide an intuitive framework along with substantial building blocks, giving users a practical knowledge of NLP without getting bogged down in the tedious house-keeping usually associated with processing annotated language data.
- Consistency: To provide a uniform framework with consistent interfaces and data structures, and easily guessable method names.
- Extensibility: To provide a structure into which new software modules can be easily accommodated, including alternative implementations and competing approaches to the same task.
- Modularity: To provide components that can be used independently without needing to understand the rest of the toolkit.