NLTK
From charlesreid1
Basics
Going through NLTK Book Chapter 3 step by step: http://www.nltk.org/book/ch03.html
Applying it to Allen Ginsberg's poem "America". Mind is blown.
NLTK Basics
You will always need to include the following imports in your NLTK scripts (the __future__ import is only needed on Python 2, where it makes / do float division):
from __future__ import division  # Python 2 only
import nltk, re, pprint
Reading Text
Opening File from URL
from urllib.request import urlopen   # Python 2: from urllib import urlopen
url = "http://www.gutenberg.org/files/2554/2554.txt"
raw = urlopen(url).read().decode('utf-8')   # on Python 3, decode bytes to str
type(raw)
Opening File from Disk
f = open('poems/america.txt')
raw = f.read()
tokens = nltk.word_tokenize(raw)    # split raw text into word/punctuation tokens
text = nltk.Text(tokens)            # wrap tokens for NLTK's text methods
words = [w.lower() for w in tokens] # normalize case
vocab = sorted(set(words))          # sorted unique words
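The pipeline above leans on NLTK; as a minimal pure-stdlib sketch of the same steps, a simple regex tokenizer can stand in for nltk.word_tokenize (note the real tokenizer also splits clitics like 've and 'm into their own tokens, as the session output below shows):

```python
import re

# Stand-in for nltk.word_tokenize: keep words (allowing internal
# apostrophes/hyphens) and punctuation marks as separate tokens.
def simple_tokenize(raw):
    return re.findall(r"[A-Za-z]+(?:['\-][A-Za-z]+)*|[^\w\s]", raw)

raw = "America I've given you all and now I'm nothing."
tokens = simple_tokenize(raw)
words = [w.lower() for w in tokens]   # normalize case
vocab = sorted(set(words))            # sorted unique words

print(tokens)   # the final '.' becomes its own token
print(vocab)
```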
First-Pass Analysis: NLTK Book Chapter 3
All The Words
Analyze all the words in this poem:
In [4]: words = [w.lower() for w in tokens]

In [5]: type(words)
Out[5]: list

In [6]: len(words)
Out[6]: 898

In [7]: words
Out[7]: ['america', 'america', 'i', "'ve", 'given', 'you', 'all', 'and', 'now', 'i', "'m", 'nothing.', 'america', 'two', 'dollars', 'and', 'twenty-seven', 'cents', ',', 'january', '17', ',', '1956.', 'i', 'ca', "n't", 'stand', 'my', 'own', 'mind.', 'america', 'when', 'will', 'we', 'end', 'the', 'human', 'war', ... 'america', 'i', "'m", 'putting', 'my', 'queer', 'shoulder', 'to', 'the', 'wheel.', 'berkeley', ',', 'january', '17', ',', '1956']
Vocabulary
We can compile a vocabulary by sorting our unique words:
In [8]: vocab = sorted(set(words))

In [9]: type(vocab)
Out[9]: list

In [10]: len(vocab)
Out[10]: 436

In [11]: vocab
Out[11]: ['$', '&', "'", "''", "'d", "'ll", "'m", "'re", "'s", "'ve", ',', '-', '1400', '17', '1835', '1956', '1956.', '2500', '500', '?', '``', 'a', 'abolished', 'about', 'addressing', 'after', 'again.', 'against', 'alive.', 'all', 'always', 'am', 'ambition', 'america', 'america.', 'amter', 'an', 'and', 'angelic', 'anyway.', 'apiece', 'are', 'argument.', 'army', 'as', 'asia', 'at', 'atom', 'auto', 'automobiles', 'back', 'bad', 'basement', 'be', 'been', 'being', 'berkeley', 'better', 'big', 'black', 'bloor', 'blossoms', 'bomb.', 'bother', 'boys.', 'bureaucracy', 'burroughs', 'businessmen', 'but', 'buy', 'by', 'ca', 'came', 'can', 'candystore.', 'cars', 'catholic.', 'cell', 'cents', 'cere', 'chance', 'chance.', 'chicago.', 'chinaman', 'chinamen.', 'chinatown', 'closet.', 'clothes', 'com-', 'come', 'communist', 'consider', 'consist', 'continue', 'corner', 'correct', 'cosmic', 'costs', 'cover', 'cry', 'day', 'day.', 'days', 'demands.', 'despite', 'did', 'die', 'different', 'digest.', 'do', 'doing.', 'dollars', 'down', 'drunk', 'eat', 'eggs', 'emotional', 'end', 'every', 'everybody', 'fact', 'factories', 'falling.', 'feel', 'fillingstations.', 'five', 'flowerpots', 'for', 'ford', 'form', 'france', 'free', 'from', 'fuck', 'full', 'garages.', 'garbanzos', 'genitals', 'get', 'get.', 'give', 'given', 'go', 'go.', 'goes', 'going', 'good', 'good.', 'got', 'grab', 'grand', 'grave', 'hah.', 'handful', 'have', 'he', 'help.', 'henry', 'her', 'him', 'his', 'holy', 'hours', 'house', 'how', 'human', 'hundred', 'i', 'idea', 'impression', 'in', 'india', 'indians', 'individual', 'insane', 'institutions.', 'into', 'is', 'israel', 'it', 'its', 'january', 'job.', 'join', 'joints', 'joke', 'kid', 'know', 'laid.', 'lathes', 'learn', 'let', 'libraries', 'library.', 'life', 'light', 'like', 'litany', 'literature', 'live', 'look', 'looking', 'looks', 'lord', 'loyalists', 
'machinery', 'mad.', 'made', 'magazine', 'magazine.', 'make', 'man', 'marijuana', 'marx.', 'max', 'me', 'me.', 'meetings', 'mensch', 'mental', 'million', 'millions', 'mind', 'mind.', 'momma', 'months', 'mood', 'mooney', 'more', 'mother', 'movie', 'mph', 'much', 'munist', 'murder.', 'must', 'my', 'myself', 'mystical', "n't", 'national', 'nearing', 'nearsighted', 'need', 'needs', 'never', 'newspapers', 'next', 'nickel', 'niggers.', 'no', 'nor', 'not', 'nothing', 'nothing.', 'now', 'obsessed', 'obsession.', 'occurs', 'of', 'off', 'old', 'on', 'once', 'or', 'other', 'our', 'out', 'over', 'own', 'parts', 'party', 'past', 'per', 'perfect', 'perfectly', 'plain.', 'plants', 'plum', 'poem', 'point.', 'power', 'practical', 'prayer.', 'precision', 'president', 'prisons', 'private', 'producers', 'psychoanalyst', 'psychopathic', 'public', 'pushing', 'putting', 'queer', 'quite', 'read', 'read.', 'readers', 'reading', 'real', 'really', 'red', 'refuse', 'resources', 'resources.', 'responsibility', 'right', 'right.', 'rising', 'roses', 'run', 'running', 'russia', 'russia.', 'russians', 'russians.', 'sacco', 'saint.', 'save', 'saw', 'say', 'scott', 'scottsboro', 'seen', 'sell', 'send', 'sentimental', 'serious', 'serious.', 'set.', 'settle', 'seven', 'sexes.', 'she', 'should', 'shoulder', 'siberia.', 'sick', 'silly', 'sin-', 'sinister', 'sinister.', 'sit', 'sixteen', 'slink', 'smoke', 'so', 'sold', 'some', 'somebody', 'sorry.', 'spanish', 'speeches', 'spy.', 'stand', 'stare', 'stares', 'still', 'stop', 'strophe', 'strophes', 'suns.', 'supermarket', 'take', 'talking', 'tangiers', 'tears', 'television', 'telling', 'that', 'the', 'them', 'there', 'they', 'thing', 'think', 'thinks', 'this', 'through', 'ticket', 'till', 'time', 'to', 'told', 'tom', 'too', 'took', 'trial', 'trotskyites', 'trouble.', 'true', 'trying', 'turn', 'twenty-five-thousand', 'twenty-seven', 'two', 'ugh.', 'uncle', 'under', 'underprivileged', 'unpublishable', 'up', 'us', 'used', 'vanzetti', 'vibrations.', 'visions', 
'want', 'wants', 'war', 'war.', 'was', 'way', 'we', 'week.', 'were', 'what', 'wheel.', 'when', 'who', 'whorehouses', 'why', 'will', 'with', 'wo', 'wobblies.', 'work', 'workers', 'world.', 'worthy', 'write', 'you', 'you.', 'your', 'yourself']
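With the words in hand, frequency counts are one step away. NLTK's FreqDist is essentially a Counter over tokens (in NLTK 3 it subclasses collections.Counter), so a plain stdlib Counter works as a stand-in for a quick sketch (the short word list here is just the opening tokens from the session above):

```python
from collections import Counter

# collections.Counter as a stdlib stand-in for nltk.FreqDist:
# both map tokens to counts and expose most_common().
words = ['america', 'america', 'i', "'ve", 'given', 'you', 'all',
         'and', 'now', 'i', "'m", 'nothing.', 'america']
freq = Counter(words)

print(freq['america'])       # → 3
print(freq.most_common(2))   # → [('america', 3), ('i', 2)]
```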
Vowel/Consonant Analysis
In [12]: cvs = [cv for w in words for cv in re.findall(r'[ptksvr][aeiou]', w)]

In [13]: cfd = nltk.ConditionalFreqDist(cvs)

In [14]: cfd.tabulate()
    a  e  i  o  u
k   0 14  3  0  0
p   6  6  2  5  4
r  13 42 47 13 11
s   8 27 29 11  2
t  14 19 23 36  3
v   2 30  6  0  0

In [16]: cv_word_pairs = [(cv, w) for w in words
    ...:                  for cv in re.findall(r'[ptksvr][aeiou]', w)]

In [17]: cv_index = nltk.Index(cv_word_pairs)
Now we can print out some words associated with various vowel-consonant combinations:
In [18]: cv_index['to']
Out[18]: ['atom', 'to', 'into', 'too', 'to', 'to', 'to', 'to', 'to', 'stop', 'to', 'to', 'chinatown', 'to', 'told', 'to', 'to', 'candystore.', 'to', 'to', 'to', 'to', 'automobiles', 'tom', 'took', 'to', 'to', 'to', 'to', 'to', 'to', 'auto', 'to', 'to', 'factories', 'to']

In [19]: cv_index['so']
Out[19]: ['some', 'some', 'blossoms', 'somebody', 'sorry.', 'resources.', 'resources', 'prisons', 'so', 'sold', 'so']

In [20]: cv_index['po']
Out[20]: ['poem', 'point.', 'responsibility', 'flowerpots', 'power']

In [23]: cv_index['ri']
Out[23]: ['america', 'america', 'america', 'america', 'write', 'right', 'america', 'america', 'libraries', 'america', 'america', 'america', 'america', 'trial', 'america', 'america', 'marijuana', 'right.', 'america', 'serious.', 'serious.', 'serious', 'america.', 'rising', 'marijuana', 'private', 'prisons', 'underprivileged', 'america', 'write', 'america', 'america', 'america', 'america', 'america', 'america', 'nearing', 'america', 'america', 'siberia.', 'america', 'serious.', 'america', 'america', 'right', 'factories', 'america']

In [25]: cv_index['re']
Out[25]: ['are', 'are', 'there', 'are', 'refuse', 'are', 'read', 'stare', 'there', 'reading', 'addressing', 'are', 'read', 'stares', 'candystore.', 'read', 'responsibility', 'are', 'are', 'resources.', 'resources', 'literature', 'hundred', 'whorehouses', 'president', 'are', 'more', "'re", 'different', 'free', 'were', 'free', 'cere', 'real', 'really', 'red', 'readers', 'bureaucracy', 'read.', 'impression', 'correct', 'precision']
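The same consonant-vowel analysis can be sketched with just the standard library: re.findall pulls out the [ptksvr][aeiou] digraphs, and a defaultdict(list) plays the role of nltk.Index, mapping each digraph to the words it occurs in (the word list here is a small hypothetical subset of the poem's vocabulary):

```python
import re
from collections import defaultdict

# A handful of words standing in for the poem's full token list.
words = ['america', 'atom', 'to', 'into', 'stop', 'told', 'serious', 'write']

# Each (digraph, word) pair, one per digraph occurrence in the word.
cv_word_pairs = [(cv, w) for w in words
                 for cv in re.findall(r'[ptksvr][aeiou]', w)]

# defaultdict(list) as a stdlib stand-in for nltk.Index.
cv_index = defaultdict(list)
for cv, w in cv_word_pairs:
    cv_index[cv].append(w)

print(cv_index['to'])   # → ['atom', 'to', 'into', 'stop', 'told']
print(cv_index['ri'])   # → ['america', 'serious', 'write']
```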
References
NLTK book, baby steps: http://www.nltk.org/book/ch01.html
The Language Archives (resources): https://tla.mpi.nl/resources/data-archive/ · https://tla.mpi.nl/tools2/tooltype/analysis/page/3/
NLTK corpora readers list: https://en.wikipedia.org/wiki/User:Alvations/NLTK_cheatsheet/CorporaReaders
text generator class: https://github.com/kopertop/nltkfun
multi-part blog series: http://textminingonline.com/dive-into-nltk-part-ii-sentence-tokenize-and-word-tokenize
lousy but ok: http://www.gilesthomas.com/2010/05/generating-political-news-using-nltk/
Related Software
spaCy: NLP library written in Cython + Python (an alternative to NLTK): https://spacy.io/
Prodigy, for annotating data for machine learning models (like Tinder for data): https://prodi.gy/