Cewl/Cleaning Wordlists
From charlesreid1
Cleaning Up Sportsball Wordlist
I built a generic sportsball wordlist by pointing Cewl at a sportsball news site: http://mynorthwest.com/category/seahawks/
This was guaranteed to get lots of goodies, word-wise, that sportsball fans all over Seattle would use in their passwords - many that I would never have guessed, myself. Thanks Cewl!
Extracting Goodies from Wordlists
Cewl put together nearly 6,000 words. Here's how to extract some of the goodies buried in the long wordlists.
$ wc -l seahawks_mynw.txt
5894 seahawks_mynw.txt
Acronyms
Look for occurrences of 4 or more capital letters in a row:
$ cat seahawks_mynw.txt | grep "[A-Z]\{4,\}"
ESPN NCAA KIRO ROLL DAILY THIS BELOW LINE KTTH ALTER ANYTHING WNBA LIVE LISTEN EDIT 973KIROFM KIROArticles TBTL 710ESPNSeattle EMAIL WSUBlog SITE ALERTS TWITTER FACEBOOK SEATTLE THAT INCIDENTAL SPECIAL CONSEQUENTIAL OTHERWISE INABILITY LIMITATION APPLIES ALLEGED BASED CONTRACT ALWAYS REFUSE PROVIDE DECREASED FUNCTIONALITY LIMIT ABILITY RECEIVE ABOUT INTEREST NOTICE TAKE DOWN PROCEDURE MAKING CLAIMS COPYRIGHT
Good stuff!
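The pattern above matches any line that contains four capitals in a row anywhere, which is why mixed entries like KIROArticles and 973KIROFM show up. Here is a minimal sketch for separating the two cases, assuming one word per line; the function names are made up for illustration and are not part of Cewl:

```shell
#!/bin/sh
# Hypothetical helper names. LC_ALL=C makes [A-Z] mean ASCII capitals only,
# regardless of the locale's collation rules.

# Lines containing 4 or more capitals in a row anywhere (same test as above).
caps_anywhere() {
    LC_ALL=C grep '[A-Z]\{4,\}' "$1"
}

# Lines consisting only of 4 or more capitals (-x anchors to the whole line).
caps_only() {
    LC_ALL=C grep -x '[A-Z]\{4,\}' "$1"
}

# Usage (file name from the text above):
#   caps_only seahawks_mynw.txt
```

The anchored version keeps ESPN and KTTH but drops the mixed tokens, which may or may not be what you want for password guessing.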
Long Words
Let's look for words of 6 letters or more:
$ cat seahawks_mynw.txt | grep "\w\{6,\}"
Remove disable Forward alleged infringer promptly sluggish Devils misidentification judicial Inform [...]
Looking good! We've filtered out roughly 2,000 junk words, leaving us with about 4,000 better words.
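The later commands read from short_seahawks_mynw.txt, so presumably the filtered list was saved under that name; that is an assumption, since the redirect is not shown. A rough sketch of the filter as a reusable function (the name long_words is hypothetical), using the POSIX spelling of \w so it does not depend on a GNU extension:

```shell
#!/bin/sh
# Hypothetical helper: keep lines that contain at least N "word" characters
# (letters, digits, underscore) in a row. [[:alnum:]_] is the POSIX
# equivalent of the \w used above.
long_words() {
    grep "[[:alnum:]_]\{$1,\}" "$2"
}

# Usage (output file name is an assumption based on the later commands):
#   long_words 6 seahawks_mynw.txt > short_seahawks_mynw.txt
#   wc -l seahawks_mynw.txt short_seahawks_mynw.txt
```

Note that this counts digits and underscores as "letters" too, just like the original grep, so entries like 973KIROFM survive this step.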
HTML and Number Junk
There are a couple of Google ad placeholders, and all the number entries were garbage:
$ cat short_seahawks_mynw.txt | grep -v google | grep -v "[0-9]"
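The two grep -v calls can be folded into one, since grep accepts several -e patterns and -v drops a line if any of them match. A small sketch, with a hypothetical function name:

```shell
#!/bin/sh
# Hypothetical helper: drop every line that contains "google" or any digit.
# With -v and multiple -e patterns, a line is removed if ANY pattern matches.
strip_junk() {
    grep -v -e 'google' -e '[0-9]' "$1"
}

# Usage (file name from the text above):
#   strip_junk short_seahawks_mynw.txt
```

Like the original pipeline, the match on "google" is case-sensitive.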
The Slim-and-Trim Wordlist
We've reduced the wordlist to a more reasonable list of just under 3,800 words. Remember that every bit of cruft we can remove will save us loads of time, as John the Ripper and other password tools that use wordlists create many, many variations of each word on the wordlist.
If you want to be maniacal, you can also throw in words you specifically want included...
$ wc -l short_seahawks_mynw.txt
3795 short_seahawks_mynw.txt
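For throwing in your own hand-picked words, one possible approach is to keep them in a separate file and merge the two, dropping duplicates. This is only a sketch; the function name and the extra-words file name are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the wordlist followed by the extra words,
# keeping only the first occurrence of each line. Unlike sort -u, the
# awk idiom preserves the original order of the list.
add_words() {
    awk '!seen[$0]++' "$1" "$2"
}

# Usage (my_extra_words.txt is a made-up file name):
#   add_words short_seahawks_mynw.txt my_extra_words.txt > final_seahawks.txt
```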
Cleaning Up Wikipedia Wordlists
Some shortcuts for cleaning up Cewl results from Wikipedia pages.
Making Wikipedia Wordlist
Make your wordlist from Wikipedia with Cewl:
#!/bin/sh
echo "Creating Seahawks wordlist..."
cewl -v en.wikipedia.org/wiki/Seattle_Seahawks -d 1 -w seahawks_wikipedia.txt
echo "Done."
The Problem
The problem with the resulting wordlist is that there's a lot of garbage. For example, all the language links on Wikipedia add non-sequiturs to your wordlist:
Hindi replace Indonesian Italian Hebrew Hungarian Marathi Dutch Japanese Norwegian Polish Portuguese Russian Serbian Serbo Finnish Swedish Turkish Ukrainian Vietnamese Chinese
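There are also lots of 3-letter words like "the", which I thought Cewl would ignore. (Cewl's default minimum word length is 3, so these get through unless you raise it with the -m option.)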
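One way to deal with known junk like the language names is a stoplist: a file of exact words to throw away. This is a sketch of that idea, not what the script below does; the function and file names are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: remove every line of the wordlist that exactly
# matches a line in the stoplist.
#   -f FILE  read patterns from FILE, one per line
#   -F       treat patterns as fixed strings, not regexes
#   -x       a pattern must match the whole line
#   -v       print the lines that do NOT match
drop_stopwords() {
    grep -v -x -F -f "$2" "$1"
}

# Usage (languages.txt is a made-up file with one language name per line):
#   drop_stopwords seahawks_wikipedia.txt languages.txt
```

Because of -x, a stoplist entry like Dutch removes only the line "Dutch" and leaves longer words containing it alone.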
Cleaning Up Cewl Results
Here's a script to drop words shorter than 7 letters, throw out the junk at the end of the list (the last 50 lines), and filter out some Wikipedia-specific junk:
#!/bin/sh
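# only keep words of 7 letters or longer,
# drop lines starting with "wg" (e.g. MediaWiki variables such as wgPageName),
# and throw out the last 50 lines (head -n -50 needs GNU head)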
cat seahawks_wikipedia.txt | grep "\w\{7,\}" | grep -v "^wg" | head -n -50 > short_seahawks_wikipedia.txt
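The negative count in head -n -50 is a GNU extension, so the script above may fail on BSD or macOS. Here is a rough portable sketch of the same pipeline; the function names are hypothetical, and awk stands in for head by buffering the input and printing all but the last N lines:

```shell
#!/bin/sh
# Hypothetical helper: print stdin without its last N lines.
# The whole input is buffered in memory, which is fine for wordlists
# of a few thousand lines.
drop_last() {
    awk -v n="$1" '{ buf[NR] = $0 }
        END { for (i = 1; i <= NR - n; i++) print buf[i] }'
}

# Hypothetical helper: same three steps as the script above.
#   $1 = wordlist file, $2 = number of trailing lines to drop (default 50)
clean_wiki() {
    grep '[[:alnum:]_]\{7,\}' "$1" | grep -v '^wg' | drop_last "${2:-50}"
}

# Usage (file names from the text above):
#   clean_wiki seahawks_wikipedia.txt > short_seahawks_wikipedia.txt
```

Note that the tail is trimmed after the other filters have run, as in the original script, so the 50 lines are counted against the already-filtered list.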