From charlesreid1

Basics

Going through this step by step:

http://www.nltk.org/book/ch03.html

Applying it to Allen Ginsberg's poem "America". Mind is blown.

NLTK Basics

The NLTK book (written for Python 2) recommends starting every NLTK script with:

from __future__ import division
import nltk, re, pprint

(Under Python 3 the __future__ import is unnecessary, since true division is already the default.)

Reading Text

Opening File from URL

# Python 3; the NLTK book's original example uses Python 2's urllib
from urllib.request import urlopen
url = "http://www.gutenberg.org/files/2554/2554.txt"
raw = urlopen(url).read().decode('utf-8')
type(raw)

Opening File from Disk

f = open('poems/america.txt')
raw = f.read()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
words = [w.lower() for w in tokens]
vocab = sorted(set(words))
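
As a quick sanity check, the same lowercase-and-dedupe pipeline can be sketched in plain Python, with a naive regex tokenizer standing in for nltk.word_tokenize (a rough approximation: real NLTK also splits clitics like "I've" into "I" + "'ve"):

```python
import re

raw = "America America I've given you all and now I'm nothing."

# Naive regex tokenizer standing in for nltk.word_tokenize: grabs words
# (keeping internal apostrophes) and punctuation marks as separate tokens
tokens = re.findall(r"\w+'?\w*|[^\w\s]", raw)

words = [w.lower() for w in tokens]   # normalize case
vocab = sorted(set(words))            # unique words, sorted

print(len(words), len(vocab))
```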

First-Pass Analysis: NLTK Book Chapter 3

All The Words

Analyze all the words in this poem:


In [4]: words = [w.lower() for w in tokens]

In [5]: type(words)
Out[5]: list

In [6]: len(words)
Out[6]: 898

In [7]: words
Out[7]:
['america',
 'america',
 'i',
 "'ve",
 'given',
 'you',
 'all',
 'and',
 'now',
 'i',
 "'m",
 'nothing.',
 'america',
 'two',
 'dollars',
 'and',
 'twenty-seven',
 'cents',
 ',',
 'january',
 '17',
 ',',
 '1956.',
 'i',
 'ca',
 "n't",
 'stand',
 'my',
 'own',
 'mind.',
 'america',
 'when',
 'will',
 'we',
 'end',
 'the',
 'human',
 'war',

...

 'america',
 'i',
 "'m",
 'putting',
 'my',
 'queer',
 'shoulder',
 'to',
 'the',
 'wheel.',
 'berkeley',
 ',',
 'january',
 '17',
 ',',
 '1956']

Vocabulary

We can compile a vocabulary by sorting our unique words:

In [8]: vocab=sorted(set(words))

In [9]: type(vocab)
Out[9]: list

In [10]: len(vocab)
Out[10]: 436

In [11]: vocab
Out[11]:
['$',
 '&',
 "'",
 "''",
 "'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 ',',
 '-',
 '1400',
 '17',
 '1835',
 '1956',
 '1956.',
 '2500',
 '500',
 '?',
 '``',
 'a',
 'abolished',
 'about',
 'addressing',
 'after',
 'again.',
 'against',
 'alive.',
 'all',
 'always',
 'am',
 'ambition',
 'america',
 'america.',
 'amter',
 'an',
 'and',
 'angelic',
 'anyway.',
 'apiece',
 'are',
 'argument.',
 'army',
 'as',
 'asia',
 'at',
 'atom',
 'auto',
 'automobiles',
 'back',
 'bad',
 'basement',
 'be',
 'been',
 'being',
 'berkeley',
 'better',
 'big',
 'black',
 'bloor',
 'blossoms',
 'bomb.',
 'bother',
 'boys.',
 'bureaucracy',
 'burroughs',
 'businessmen',
 'but',
 'buy',
 'by',
 'ca',
 'came',
 'can',
 'candystore.',
 'cars',
 'catholic.',
 'cell',
 'cents',
 'cere',
 'chance',
 'chance.',
 'chicago.',
 'chinaman',
 'chinamen.',
 'chinatown',
 'closet.',
 'clothes',
 'com-',
 'come',
 'communist',
 'consider',
 'consist',
 'continue',
 'corner',
 'correct',
 'cosmic',
 'costs',
 'cover',
 'cry',
 'day',
 'day.',
 'days',
 'demands.',
 'despite',
 'did',
 'die',
 'different',
 'digest.',
 'do',
 'doing.',
 'dollars',
 'down',
 'drunk',
 'eat',
 'eggs',
 'emotional',
 'end',
 'every',
 'everybody',
 'fact',
 'factories',
 'falling.',
 'feel',
 'fillingstations.',
 'five',
 'flowerpots',
 'for',
 'ford',
 'form',
 'france',
 'free',
 'from',
 'fuck',
 'full',
 'garages.',
 'garbanzos',
 'genitals',
 'get',
 'get.',
 'give',
 'given',
 'go',
 'go.',
 'goes',
 'going',
 'good',
 'good.',
 'got',
 'grab',
 'grand',
 'grave',
 'hah.',
 'handful',
 'have',
 'he',
 'help.',
 'henry',
 'her',
 'him',
 'his',
 'holy',
 'hours',
 'house',
 'how',
 'human',
 'hundred',
 'i',
 'idea',
 'impression',
 'in',
 'india',
 'indians',
 'individual',
 'insane',
 'institutions.',
 'into',
 'is',
 'israel',
 'it',
 'its',
 'january',
 'job.',
 'join',
 'joints',
 'joke',
 'kid',
 'know',
 'laid.',
 'lathes',
 'learn',
 'let',
 'libraries',
 'library.',
 'life',
 'light',
 'like',
 'litany',
 'literature',
 'live',
 'look',
 'looking',
 'looks',
 'lord',
 'loyalists',
 'machinery',
 'mad.',
 'made',
 'magazine',
 'magazine.',
 'make',
 'man',
 'marijuana',
 'marx.',
 'max',
 'me',
 'me.',
 'meetings',
 'mensch',
 'mental',
 'million',
 'millions',
 'mind',
 'mind.',
 'momma',
 'months',
 'mood',
 'mooney',
 'more',
 'mother',
 'movie',
 'mph',
 'much',
 'munist',
 'murder.',
 'must',
 'my',
 'myself',
 'mystical',
 "n't",
 'national',
 'nearing',
 'nearsighted',
 'need',
 'needs',
 'never',
 'newspapers',
 'next',
 'nickel',
 'niggers.',
 'no',
 'nor',
 'not',
 'nothing',
 'nothing.',
 'now',
 'obsessed',
 'obsession.',
 'occurs',
 'of',
 'off',
 'old',
 'on',
 'once',
 'or',
 'other',
 'our',
 'out',
 'over',
 'own',
 'parts',
 'party',
 'past',
 'per',
 'perfect',
 'perfectly',
 'plain.',
 'plants',
 'plum',
 'poem',
 'point.',
 'power',
 'practical',
 'prayer.',
 'precision',
 'president',
 'prisons',
 'private',
 'producers',
 'psychoanalyst',
 'psychopathic',
 'public',
 'pushing',
 'putting',
 'queer',
 'quite',
 'read',
 'read.',
 'readers',
 'reading',
 'real',
 'really',
 'red',
 'refuse',
 'resources',
 'resources.',
 'responsibility',
 'right',
 'right.',
 'rising',
 'roses',
 'run',
 'running',
 'russia',
 'russia.',
 'russians',
 'russians.',
 'sacco',
 'saint.',
 'save',
 'saw',
 'say',
 'scott',
 'scottsboro',
 'seen',
 'sell',
 'send',
 'sentimental',
 'serious',
 'serious.',
 'set.',
 'settle',
 'seven',
 'sexes.',
 'she',
 'should',
 'shoulder',
 'siberia.',
 'sick',
 'silly',
 'sin-',
 'sinister',
 'sinister.',
 'sit',
 'sixteen',
 'slink',
 'smoke',
 'so',
 'sold',
 'some',
 'somebody',
 'sorry.',
 'spanish',
 'speeches',
 'spy.',
 'stand',
 'stare',
 'stares',
 'still',
 'stop',
 'strophe',
 'strophes',
 'suns.',
 'supermarket',
 'take',
 'talking',
 'tangiers',
 'tears',
 'television',
 'telling',
 'that',
 'the',
 'them',
 'there',
 'they',
 'thing',
 'think',
 'thinks',
 'this',
 'through',
 'ticket',
 'till',
 'time',
 'to',
 'told',
 'tom',
 'too',
 'took',
 'trial',
 'trotskyites',
 'trouble.',
 'true',
 'trying',
 'turn',
 'twenty-five-thousand',
 'twenty-seven',
 'two',
 'ugh.',
 'uncle',
 'under',
 'underprivileged',
 'unpublishable',
 'up',
 'us',
 'used',
 'vanzetti',
 'vibrations.',
 'visions',
 'want',
 'wants',
 'war',
 'war.',
 'was',
 'way',
 'we',
 'week.',
 'were',
 'what',
 'wheel.',
 'when',
 'who',
 'whorehouses',
 'why',
 'will',
 'with',
 'wo',
 'wobblies.',
 'work',
 'workers',
 'world.',
 'worthy',
 'write',
 'you',
 'you.',
 'your',
 'yourself']

Vowel/Consonant Analysis

The regular expression [ptksvr][aeiou] pulls every consonant-vowel pair (for six chosen consonants) out of each word; a conditional frequency distribution then tabulates how often each consonant is followed by each vowel:

In [12]: cvs = [cv for w in words for cv in re.findall(r'[ptksvr][aeiou]', w)]

In [13]: cfd = nltk.ConditionalFreqDist(cvs)

In [14]: cfd.tabulate()
     a    e    i    o    u
k    0   14    3    0    0
p    6    6    2    5    4
r   13   42   47   13   11
s    8   27   29   11    2
t   14   19   23   36    3
v    2   30    6    0    0

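The ConditionalFreqDist above is just a tally of (consonant, vowel) pairs. Here is a stdlib-only sketch of the same count, using collections.defaultdict in place of NLTK and a toy word list:

```python
import re
from collections import defaultdict

# Toy word list (a few words from the poem's vocabulary)
words = ['america', 'strophe', 'russia', 'visions', 'take']

# Every consonant-vowel pair, same regex as the NLTK example
cvs = [cv for w in words for cv in re.findall(r'[ptksvr][aeiou]', w)]

# Tally counts by (consonant, vowel), mimicking nltk.ConditionalFreqDist
cfd = defaultdict(lambda: defaultdict(int))
for consonant, vowel in cvs:
    cfd[consonant][vowel] += 1
```

With this toy input, cfd['s']['i'] is 2 ('russia' and 'visions' each contribute one 'si').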
In [16]: cv_word_pairs = [(cv, w) for w in words
...                          for cv in re.findall(r'[ptksvr][aeiou]', w)]

In [17]: cv_index = nltk.Index(cv_word_pairs)
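
nltk.Index groups (key, value) pairs into lists keyed by the first element, much like a defaultdict(list). A small stdlib sketch, with a few hand-picked (cv, word) pairs for illustration:

```python
from collections import defaultdict

# A few hand-picked (consonant-vowel, word) pairs for illustration
cv_word_pairs = [('ri', 'america'), ('ta', 'take'), ('ri', 'write')]

# nltk.Index is essentially a defaultdict(list) built from pairs
cv_index = defaultdict(list)
for cv, w in cv_word_pairs:
    cv_index[cv].append(w)
```

Querying cv_index['ri'] then returns ['america', 'write'], which is how the real index is used in the transcript below.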

Now we can print out the words associated with various consonant-vowel combinations:


In [18]: cv_index['to']
Out[18]:
['atom',
 'to',
 'into',
 'too',
 'to',
 'to',
 'to',
 'to',
 'to',
 'stop',
 'to',
 'to',
 'chinatown',
 'to',
 'told',
 'to',
 'to',
 'candystore.',
 'to',
 'to',
 'to',
 'to',
 'automobiles',
 'tom',
 'took',
 'to',
 'to',
 'to',
 'to',
 'to',
 'to',
 'auto',
 'to',
 'to',
 'factories',
 'to']

In [19]: cv_index['so']
Out[19]:
['some',
 'some',
 'blossoms',
 'somebody',
 'sorry.',
 'resources.',
 'resources',
 'prisons',
 'so',
 'sold',
 'so']

In [20]: cv_index['po']
Out[20]: ['poem', 'point.', 'responsibility', 'flowerpots', 'power']

In [23]: cv_index['ri']
Out[23]:
['america',
 'america',
 'america',
 'america',
 'write',
 'right',
 'america',
 'america',
 'libraries',
 'america',
 'america',
 'america',
 'america',
 'trial',
 'america',
 'america',
 'marijuana',
 'right.',
 'america',
 'serious.',
 'serious.',
 'serious',
 'america.',
 'rising',
 'marijuana',
 'private',
 'prisons',
 'underprivileged',
 'america',
 'write',
 'america',
 'america',
 'america',
 'america',
 'america',
 'america',
 'nearing',
 'america',
 'america',
 'siberia.',
 'america',
 'serious.',
 'america',
 'america',
 'right',
 'factories',
 'america']

In [25]: cv_index['re']
Out[25]:
['are',
 'are',
 'there',
 'are',
 'refuse',
 'are',
 'read',
 'stare',
 'there',
 'reading',
 'addressing',
 'are',
 'read',
 'stares',
 'candystore.',
 'read',
 'responsibility',
 'are',
 'are',
 'resources.',
 'resources',
 'literature',
 'hundred',
 'whorehouses',
 'president',
 'are',
 'more',
 "'re",
 'different',
 'free',
 'were',
 'free',
 'cere',
 'real',
 'really',
 'red',
 'readers',
 'bureaucracy',
 'read.',
 'impression',
 'correct',
 'precision']

References

NLTK book, baby steps: http://www.nltk.org/book/ch01.html

The Language Archives (resources): https://tla.mpi.nl/resources/data-archive/  · https://tla.mpi.nl/tools2/tooltype/analysis/page/3/

NLTK corpora readers list: https://en.wikipedia.org/wiki/User:Alvations/NLTK_cheatsheet/CorporaReaders

text generator class: https://github.com/kopertop/nltkfun

multi-part blog series: http://textminingonline.com/dive-into-nltk-part-ii-sentence-tokenize-and-word-tokenize

lousy but ok: http://www.gilesthomas.com/2010/05/generating-political-news-using-nltk/

Related Software

spaCy: Cython + Python NLP library, an alternative to NLTK: https://spacy.io/

Prodigy, for annotating training data for machine learning models (like Tinder for data): https://prodi.gy/