Cewl/Cleaning Wordlists
From charlesreid1
Cleaning Up Sportsball Wordlist
I did a generic sportsball wordlist using a sportsball news site: http://mynorthwest.com/category/seahawks/
This was guaranteed to get lots of goodies, word-wise, that sportsball fans all over Seattle would use in their passwords - many that I would never have guessed, myself. Thanks Cewl!
Extracting Goodies from Wordlists
Cewl put together nearly 6,000 words. Here's how to extract some of the goodies buried in the long wordlists.
$ wc -l seahawks_mynw.txt
5894 seahawks_mynw.txt
Acronyms
Look for occurrences of 4 or more capital letters in a row:
$ cat seahawks_mynw.txt | grep "[A-Z]\{4,\}"
ESPN
NCAA
KIRO
ROLL
DAILY
THIS
BELOW
LINE
KTTH
ALTER
ANYTHING
WNBA
LIVE
LISTEN
EDIT
973KIROFM
KIROArticles
TBTL
710ESPNSeattle
EMAIL
WSUBlog
SITE
ALERTS
TWITTER
FACEBOOK
SEATTLE
THAT
INCIDENTAL
SPECIAL
CONSEQUENTIAL
OTHERWISE
INABILITY
LIMITATION
APPLIES
ALLEGED
BASED
CONTRACT
ALWAYS
REFUSE
PROVIDE
DECREASED
FUNCTIONALITY
LIMIT
ABILITY
RECEIVE
ABOUT
INTEREST
NOTICE
TAKE
DOWN
PROCEDURE
MAKING
CLAIMS
COPYRIGHT
Good stuff!
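The pattern above matches any line that contains a run of four capitals, which is why mixed entries such as KIROArticles and 973KIROFM show up. If only pure all-caps entries are wanted, matching the whole line is one option. A minimal sketch, assuming a small hypothetical sample file (sample_words.txt) in place of the real wordlist:

```shell
#!/bin/sh
# Hypothetical sample standing in for seahawks_mynw.txt
printf '%s\n' ESPN KIROArticles 973KIROFM Seahawks NFL SEATTLE > sample_words.txt

# -x only accepts lines where the whole line matches the pattern,
# so mixed-case and digit-prefixed entries are dropped.
# LC_ALL=C keeps the [A-Z] range limited to ASCII capitals.
acronyms=$(LC_ALL=C grep -x '[A-Z]\{4,\}' sample_words.txt)
echo "$acronyms"
```

On the sample this prints ESPN and SEATTLE only; NFL is too short for the 4-letter minimum.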
Long Words
Let's look for words with 6 or more word characters (letters, digits, or underscores) in a row:
$ cat seahawks_mynw.txt | grep "\w\{6,\}"
Remove
disable
Forward
alleged
infringer
promptly
sluggish
Devils
misidentification
judicial
Inform
[...]
Lookin' good! We've filtered out roughly 2,000 junk words, leaving us with about 4,000 better words.
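The next step reads from short_seahawks_mynw.txt, so the filtered output presumably got saved to a file at this point. The command above does not show that, so the redirect in this sketch is an assumption, and the sample file names are hypothetical:

```shell
#!/bin/sh
# Hypothetical sample standing in for seahawks_mynw.txt
printf '%s\n' the Remove disable Hawks sluggish 12th judicial > sample_words.txt

# Keep lines containing a run of 6 or more word characters and save them.
# [[:alnum:]_] is the portable spelling of GNU grep's \w.
grep '[[:alnum:]_]\{6,\}' sample_words.txt > short_sample_words.txt

# Count what survived
wc -l < short_sample_words.txt
```

Of the seven sample words, four are long enough to survive the filter (Remove, disable, sluggish, judicial).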
HTML and Number Junk
There are a couple of Google ad placeholders, and all the number entries were garbage:
$ cat short_seahawks_mynw.txt | grep -v google | grep -v "[0-9]"
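The two exclusions can also be folded into a single grep call by passing several -e patterns. A minimal sketch on hypothetical sample data (the ad-placeholder entries here are made up for illustration):

```shell
#!/bin/sh
# Hypothetical sample standing in for short_seahawks_mynw.txt
printf '%s\n' Seahawks googletag 973KIROFM Wilson div-gpt-ad-1 Legion > sample_short.txt

# -v inverts the match: a line is kept only if it matches
# neither "google" nor any digit
cleaned=$(grep -v -e google -e '[0-9]' sample_short.txt)
echo "$cleaned"
```

Note that the digit filter is blunt: it would also throw out a fan favorite like 12thMan, so anything like that has to be added back by hand.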
The Slim-and-Trim Wordlist
We've reduced the wordlist to a more reasonable list of about 3,800 words. Remember that every bit of cruft we can remove will save us loads of time, as John the Ripper and other password tools that use wordlists create many, many variations of each word on the wordlist.
If you want to be maniacal, you can also throw in words you specifically want included...
$ wc -l short_seahawks_mynw.txt
3795 short_seahawks_mynw.txt
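For throwing in hand-picked words, one simple approach is to keep them in a separate file and merge the two lists with sort -u, which also removes duplicates. A sketch with hypothetical file names and example words:

```shell
#!/bin/sh
# Hypothetical sample standing in for the cleaned wordlist
printf '%s\n' Seahawks Wilson Legion > sample_short.txt

# Hypothetical hand-picked additions (one overlaps with the list above)
printf '%s\n' 12thMan Seahawks BeastMode > extra_words.txt

# Merge both files; -u keeps a single copy of each word
sort -u sample_short.txt extra_words.txt > combined_words.txt

wc -l < combined_words.txt
```

The merged file has five unique words, since the duplicate Seahawks entry is collapsed.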
Cleaning Up Wikipedia Wordlists
Some shortcuts for cleaning up Cewl results from Wikipedia pages.
Making Wikipedia Wordlist
Make your wordlist from Wikipedia with Cewl:
#!/bin/sh
echo "Creating Seahawks wordlist..."
cewl -v en.wikipedia.org/wiki/Seattle_Seahawks -d 1 -w seahawks_wikipedia.txt
echo "Done."
The Problem
The problem with the resulting wordlist is that there's a lot of garbage. For example, all the language links on Wikipedia add non-sequiturs to your wordlist:
Hindi replace Indonesian Italian Hebrew Hungarian Marathi Dutch Japanese Norwegian Polish Portuguese Russian Serbian Serbo Finnish Swedish Turkish Ukrainian Vietnamese Chinese
There are also lots of 3-letter words like "the" (???), which I thought Cewl would ignore.
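The language names listed above can be stripped with an exact-match blocklist. This is a sketch of one possible approach, not part of the original workflow; languages.txt and the sample file are hypothetical:

```shell
#!/bin/sh
# Hypothetical sample standing in for seahawks_wikipedia.txt
printf '%s\n' Seahawks Hindi Seattle Dutch Polish Russell > sample_wiki.txt

# Hypothetical blocklist, one language name per line
printf '%s\n' Hindi Dutch Polish > languages.txt

# -f reads patterns from a file, -F treats them as fixed strings,
# -x requires a whole-line match, -v drops the matching lines
grep -v -x -F -f languages.txt sample_wiki.txt > filtered_wiki.txt

cat filtered_wiki.txt
```

Because of -x, only exact entries are removed, so a word like Polished would survive even though Polish is on the blocklist.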
Cleaning Up Cewl Results
Here's a script to drop words shorter than 7 letters, cut the junk at the end of the list, and filter out some Wikipedia-specific junk:
#!/bin/sh
# only keep lines with 7 or more word characters in a row,
# drop entries starting with "wg" (such as MediaWiki's wg* JavaScript variables),
# and throw out the last 50 lines (negative counts need GNU head)
grep "\w\{7,\}" seahawks_wikipedia.txt | grep -v "^wg" | head -n -50 > short_seahawks_wikipedia.txt
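The negative line count in head -n -50 is a GNU extension, so it may fail on systems with a different head (macOS, for example). A portable sketch of the same trimming step, assuming hypothetical sample file names, computes the number of lines to keep first:

```shell
#!/bin/sh
# Hypothetical 60-line sample standing in for the filtered wordlist
awk 'BEGIN { for (i = 1; i <= 60; i++) print "word" i }' > sample_wiki.txt

# Work out how many lines remain after dropping the last 50
total=$(wc -l < sample_wiki.txt)
keep=$((total - 50))

if [ "$keep" -gt 0 ]; then
    head -n "$keep" sample_wiki.txt > trimmed_wiki.txt
else
    # Fewer than 50 lines in total: nothing is left to keep
    : > trimmed_wiki.txt
fi

wc -l < trimmed_wiki.txt
```

With 60 input lines, the first 10 are kept and the final 50 are discarded.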