NLTK
From charlesreid1
Basics
Working through Chapter 3 of the NLTK book, step by step:
http://www.nltk.org/book/ch03.html
The examples below apply it to Allen Ginsberg's poem "America". Mind is blown.
NLTK Basics
Start every NLTK script with the following imports (the __future__ line only matters on Python 2, where it makes / perform true division):
from __future__ import division
import nltk, re, pprint
Reading Text
Opening File from URL
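# Python 3: urlopen lives in urllib.request (in Python 2 it was urllib.urlopen)
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554.txt"
raw = request.urlopen(url).read().decode('utf8')  # read() returns bytes; decode to str
type(raw)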
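In Python 3, urlopen lives in urllib.request and read() returns bytes, so the text has to be decoded before it can be tokenized. The following is a minimal sketch, assuming the text is UTF-8. The fetch_text helper and the data: URL are hypothetical stand-ins so the example runs offline; swap in a real address such as the Gutenberg URL above.

```python
from urllib import request

def fetch_text(url, encoding="utf8"):
    """Download a URL and decode the response bytes into a str."""
    with request.urlopen(url) as response:
        return response.read().decode(encoding)

# A data: URL stands in for a real web address (assumption, for offline use)
sample_url = "data:text/plain;charset=utf-8,America%20when%20will%20we%20end%20the%20human%20war"
raw = fetch_text(sample_url)
print(type(raw))
print(raw)
```

The with block closes the connection as soon as the body has been read.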
Opening File from Disk
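# Use a context manager so the file is closed after reading
with open('poems/america.txt') as f:
    raw = f.read()
tokens = nltk.word_tokenize(raw)       # split raw text into word and punctuation tokens
text = nltk.Text(tokens)               # wrap the tokens for NLTK's exploration methods
words = [w.lower() for w in tokens]    # normalize case
vocab = sorted(set(words))             # unique tokens, sorted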
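The steps above (read, tokenize, lowercase, deduplicate) can be tried on a single line without the poem file. This is only a sketch: it swaps nltk.word_tokenize for a simple regular expression so that it runs without NLTK's tokenizer data, and its tokens will not match NLTK's exactly (for example, it keeps "I've" whole). The sample line and the simple_tokenize helper are assumptions made for illustration.

```python
import re

# Sample text: the opening line of the poem (assumed here in place of the file)
raw = "America I've given you all and now I'm nothing."

def simple_tokenize(text):
    """Rough stand-in for nltk.word_tokenize: words (allowing internal
    apostrophes or hyphens) or single punctuation characters."""
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*|[^\sA-Za-z0-9]", text)

tokens = simple_tokenize(raw)
words = [w.lower() for w in tokens]   # normalize case
vocab = sorted(set(words))            # unique tokens, sorted

print(len(words), len(vocab))
print(vocab)
```

Punctuation sorts ahead of letters, which is why the vocabulary listing on this page starts with symbols.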
First-Pass Analysis: NLTK Book Chapter 3
All The Words
Lowercase every token in the poem and inspect the resulting list:
In [4]: words = [w.lower() for w in tokens]
In [5]: type(words)
Out[5]: list
In [6]: len(words)
Out[6]: 898
In [7]: words
Out[7]:
['america',
'america',
'i',
"'ve",
'given',
'you',
'all',
'and',
'now',
'i',
"'m",
'nothing.',
'america',
'two',
'dollars',
'and',
'twenty-seven',
'cents',
',',
'january',
'17',
',',
'1956.',
'i',
'ca',
"n't",
'stand',
'my',
'own',
'mind.',
'america',
'when',
'will',
'we',
'end',
'the',
'human',
'war',
...
'america',
'i',
"'m",
'putting',
'my',
'queer',
'shoulder',
'to',
'the',
'wheel.',
'berkeley',
',',
'january',
'17',
',',
'1956']
Vocabulary
We can compile a vocabulary by sorting the unique words. Because some tokens keep their trailing period, pairs such as 'mind' and 'mind.' count as separate entries:
In [8]: vocab=sorted(set(words))
In [9]: type(vocab)
Out[9]: list
In [10]: len(vocab)
Out[10]: 436
In [11]: vocab
Out[11]:
['$',
'&',
"'",
"''",
"'d",
"'ll",
"'m",
"'re",
"'s",
"'ve",
',',
'-',
'1400',
'17',
'1835',
'1956',
'1956.',
'2500',
'500',
'?',
'``',
'a',
'abolished',
'about',
'addressing',
'after',
'again.',
'against',
'alive.',
'all',
'always',
'am',
'ambition',
'america',
'america.',
'amter',
'an',
'and',
'angelic',
'anyway.',
'apiece',
'are',
'argument.',
'army',
'as',
'asia',
'at',
'atom',
'auto',
'automobiles',
'back',
'bad',
'basement',
'be',
'been',
'being',
'berkeley',
'better',
'big',
'black',
'bloor',
'blossoms',
'bomb.',
'bother',
'boys.',
'bureaucracy',
'burroughs',
'businessmen',
'but',
'buy',
'by',
'ca',
'came',
'can',
'candystore.',
'cars',
'catholic.',
'cell',
'cents',
'cere',
'chance',
'chance.',
'chicago.',
'chinaman',
'chinamen.',
'chinatown',
'closet.',
'clothes',
'com-',
'come',
'communist',
'consider',
'consist',
'continue',
'corner',
'correct',
'cosmic',
'costs',
'cover',
'cry',
'day',
'day.',
'days',
'demands.',
'despite',
'did',
'die',
'different',
'digest.',
'do',
'doing.',
'dollars',
'down',
'drunk',
'eat',
'eggs',
'emotional',
'end',
'every',
'everybody',
'fact',
'factories',
'falling.',
'feel',
'fillingstations.',
'five',
'flowerpots',
'for',
'ford',
'form',
'france',
'free',
'from',
'fuck',
'full',
'garages.',
'garbanzos',
'genitals',
'get',
'get.',
'give',
'given',
'go',
'go.',
'goes',
'going',
'good',
'good.',
'got',
'grab',
'grand',
'grave',
'hah.',
'handful',
'have',
'he',
'help.',
'henry',
'her',
'him',
'his',
'holy',
'hours',
'house',
'how',
'human',
'hundred',
'i',
'idea',
'impression',
'in',
'india',
'indians',
'individual',
'insane',
'institutions.',
'into',
'is',
'israel',
'it',
'its',
'january',
'job.',
'join',
'joints',
'joke',
'kid',
'know',
'laid.',
'lathes',
'learn',
'let',
'libraries',
'library.',
'life',
'light',
'like',
'litany',
'literature',
'live',
'look',
'looking',
'looks',
'lord',
'loyalists',
'machinery',
'mad.',
'made',
'magazine',
'magazine.',
'make',
'man',
'marijuana',
'marx.',
'max',
'me',
'me.',
'meetings',
'mensch',
'mental',
'million',
'millions',
'mind',
'mind.',
'momma',
'months',
'mood',
'mooney',
'more',
'mother',
'movie',
'mph',
'much',
'munist',
'murder.',
'must',
'my',
'myself',
'mystical',
"n't",
'national',
'nearing',
'nearsighted',
'need',
'needs',
'never',
'newspapers',
'next',
'nickel',
'niggers.',
'no',
'nor',
'not',
'nothing',
'nothing.',
'now',
'obsessed',
'obsession.',
'occurs',
'of',
'off',
'old',
'on',
'once',
'or',
'other',
'our',
'out',
'over',
'own',
'parts',
'party',
'past',
'per',
'perfect',
'perfectly',
'plain.',
'plants',
'plum',
'poem',
'point.',
'power',
'practical',
'prayer.',
'precision',
'president',
'prisons',
'private',
'producers',
'psychoanalyst',
'psychopathic',
'public',
'pushing',
'putting',
'queer',
'quite',
'read',
'read.',
'readers',
'reading',
'real',
'really',
'red',
'refuse',
'resources',
'resources.',
'responsibility',
'right',
'right.',
'rising',
'roses',
'run',
'running',
'russia',
'russia.',
'russians',
'russians.',
'sacco',
'saint.',
'save',
'saw',
'say',
'scott',
'scottsboro',
'seen',
'sell',
'send',
'sentimental',
'serious',
'serious.',
'set.',
'settle',
'seven',
'sexes.',
'she',
'should',
'shoulder',
'siberia.',
'sick',
'silly',
'sin-',
'sinister',
'sinister.',
'sit',
'sixteen',
'slink',
'smoke',
'so',
'sold',
'some',
'somebody',
'sorry.',
'spanish',
'speeches',
'spy.',
'stand',
'stare',
'stares',
'still',
'stop',
'strophe',
'strophes',
'suns.',
'supermarket',
'take',
'talking',
'tangiers',
'tears',
'television',
'telling',
'that',
'the',
'them',
'there',
'they',
'thing',
'think',
'thinks',
'this',
'through',
'ticket',
'till',
'time',
'to',
'told',
'tom',
'too',
'took',
'trial',
'trotskyites',
'trouble.',
'true',
'trying',
'turn',
'twenty-five-thousand',
'twenty-seven',
'two',
'ugh.',
'uncle',
'under',
'underprivileged',
'unpublishable',
'up',
'us',
'used',
'vanzetti',
'vibrations.',
'visions',
'want',
'wants',
'war',
'war.',
'was',
'way',
'we',
'week.',
'were',
'what',
'wheel.',
'when',
'who',
'whorehouses',
'why',
'will',
'with',
'wo',
'wobblies.',
'work',
'workers',
'world.',
'worthy',
'write',
'you',
'you.',
'your',
'yourself']
Vowel/Consonant Analysis
In [12]: cvs = [cv for w in words for cv in re.findall(r'[ptksvr][aeiou]', w)]
In [13]: cfd = nltk.ConditionalFreqDist(cvs)
In [14]: cfd.tabulate()
    a   e   i   o   u
k   0  14   3   0   0
p   6   6   2   5   4
r  13  42  47  13  11
s   8  27  29  11   2
t  14  19  23  36   3
v   2  30   6   0   0
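Why does passing plain strings to nltk.ConditionalFreqDist work? It expects (condition, sample) pairs, and each two-character match such as 'ri' unpacks as the pair ('r', 'i'). The sketch below builds the same kind of table with the standard library only. The short word list is an illustrative sample rather than the full poem, and collections.Counter stands in for NLTK's class.

```python
import re
from collections import Counter, defaultdict

words = ["america", "serious", "atom", "to", "poem"]  # small illustrative sample

# Each match is a two-character string: one consonant from [ptksvr], then one vowel
cvs = [cv for w in words for cv in re.findall(r"[ptksvr][aeiou]", w)]

# Count vowels conditioned on the preceding consonant
table = defaultdict(Counter)
for consonant, vowel in cvs:   # a 2-character string unpacks into two characters
    table[consonant][vowel] += 1

# Print one row per consonant, one column per vowel
print("  " + " ".join("aeiou"))
for consonant in sorted(table):
    print(consonant, " ".join(str(table[consonant][v]) for v in "aeiou"))
```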
In [16]: cv_word_pairs = [(cv, w) for w in words
... for cv in re.findall(r'[ptksvr][aeiou]', w)]
In [17]: cv_index = nltk.Index(cv_word_pairs)
Now we can look up the words that contain a given consonant-vowel pair:
In [18]: cv_index['to']
Out[18]:
['atom',
'to',
'into',
'too',
'to',
'to',
'to',
'to',
'to',
'stop',
'to',
'to',
'chinatown',
'to',
'told',
'to',
'to',
'candystore.',
'to',
'to',
'to',
'to',
'automobiles',
'tom',
'took',
'to',
'to',
'to',
'to',
'to',
'to',
'auto',
'to',
'to',
'factories',
'to']
In [19]: cv_index['so']
Out[19]:
['some',
'some',
'blossoms',
'somebody',
'sorry.',
'resources.',
'resources',
'prisons',
'so',
'sold',
'so']
In [20]: cv_index['po']
Out[20]: ['poem', 'point.', 'responsibility', 'flowerpots', 'power']
In [23]: cv_index['ri']
Out[23]:
['america',
'america',
'america',
'america',
'write',
'right',
'america',
'america',
'libraries',
'america',
'america',
'america',
'america',
'trial',
'america',
'america',
'marijuana',
'right.',
'america',
'serious.',
'serious.',
'serious',
'america.',
'rising',
'marijuana',
'private',
'prisons',
'underprivileged',
'america',
'write',
'america',
'america',
'america',
'america',
'america',
'america',
'nearing',
'america',
'america',
'siberia.',
'america',
'serious.',
'america',
'america',
'right',
'factories',
'america']
In [24]: cfd.tabulate()
    a   e   i   o   u
k   0  14   3   0   0
p   6   6   2   5   4
r  13  42  47  13  11
s   8  27  29  11   2
t  14  19  23  36   3
v   2  30   6   0   0
In [25]: cv_index['re']
Out[25]:
['are',
'are',
'there',
'are',
'refuse',
'are',
'read',
'stare',
'there',
'reading',
'addressing',
'are',
'read',
'stares',
'candystore.',
'read',
'responsibility',
'are',
'are',
'resources.',
'resources',
'literature',
'hundred',
'whorehouses',
'president',
'are',
'more',
"'re",
'different',
'free',
'were',
'free',
'cere',
'real',
'really',
'red',
'readers',
'bureaucracy',
'read.',
'impression',
'correct',
'precision']
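The lookups above list a word once per occurrence, which is why 'america' dominates cv_index['ri'], and they treat 'serious' and 'serious.' as different words. A rough sketch of the same index using collections.defaultdict in place of nltk.Index, with a deduplicated view that also strips trailing periods. The word list is a small sample chosen for illustration.

```python
import re
from collections import defaultdict

words = ["america", "write", "america", "right", "serious."]  # illustrative sample

# Map each consonant-vowel pair to every word occurrence that contains it
cv_index = defaultdict(list)
for w in words:
    for cv in re.findall(r"[ptksvr][aeiou]", w):
        cv_index[cv].append(w)

print(cv_index["ri"])

# Unique words only, with trailing periods removed
unique_ri = sorted({w.rstrip(".") for w in cv_index["ri"]})
print(unique_ri)
```

Whether to strip punctuation before indexing depends on the analysis; for counting sounds it probably should not matter, so normalizing first is likely the cleaner choice.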
References
NLTK book, baby steps: http://www.nltk.org/book/ch01.html
The Language Archives (resources): https://tla.mpi.nl/resources/data-archive/ · https://tla.mpi.nl/tools2/tooltype/analysis/page/3/
NLTK corpora readers list: https://en.wikipedia.org/wiki/User:Alvations/NLTK_cheatsheet/CorporaReaders
Text generator class: https://github.com/kopertop/nltkfun
Multi-part blog series: http://textminingonline.com/dive-into-nltk-part-ii-sentence-tokenize-and-word-tokenize
Lousy but OK: http://www.gilesthomas.com/2010/05/generating-political-news-using-nltk/
Related Software
spaCy: an NLP library written in Python and Cython, an alternative to NLTK: https://spacy.io/
Prodigy: an annotation tool for building training data for machine learning models (like Tinder for data): https://prodi.gy/