From charlesreid1

Cleaning Up Sportsball Wordlist

I built a generic sportsball wordlist by pointing Cewl at a sportsball news site: http://mynorthwest.com/category/seahawks/

This was guaranteed to get lots of goodies, word-wise, that sportsball fans all over Seattle would use in their passwords - many that I would never have guessed, myself. Thanks Cewl!

Extracting Goodies from Wordlists

Cewl put together nearly 6,000 words. Here's how to extract some of the goodies buried in the long wordlists.

$ wc -l seahawks_mynw.txt
5894 seahawks_mynw.txt

Acronyms

Look for occurrences of 4 or more capital letters in a row:

$ cat seahawks_mynw.txt | grep "[A-Z]\{4,\}"
ESPN
NCAA
KIRO
ROLL
DAILY
THIS
BELOW
LINE
KTTH
ALTER
ANYTHING
WNBA
LIVE
LISTEN
EDIT
973KIROFM
KIROArticles
TBTL
710ESPNSeattle
EMAIL
WSUBlog
SITE
ALERTS
TWITTER
FACEBOOK
SEATTLE
THAT
INCIDENTAL
SPECIAL
CONSEQUENTIAL
OTHERWISE
INABILITY
LIMITATION
APPLIES
ALLEGED
BASED
CONTRACT
ALWAYS
REFUSE
PROVIDE
DECREASED
FUNCTIONALITY
LIMIT
ABILITY
RECEIVE
ABOUT
INTEREST
NOTICE
TAKE
DOWN
PROCEDURE
MAKING
CLAIMS
COPYRIGHT

Good stuff!
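
The pattern above matches any line that contains four capitals in a row, which is why mixed tokens such as KIROArticles and 973KIROFM show up. A minimal sketch of a stricter variant, assuming you only want lines made up entirely of capitals. The sample words and the temporary file are made up for illustration:

```shell
# Hypothetical sample standing in for seahawks_mynw.txt
sample=$(mktemp)
printf '%s\n' ESPN Seahawks KIROArticles 973KIROFM NFL WNBA > "$sample"

# Original filter: any line containing 4+ capitals in a row.
# LC_ALL=C keeps the [A-Z] range limited to ASCII capitals.
loose=$(LC_ALL=C grep '[A-Z]\{4,\}' "$sample")

# Stricter variant: -x only accepts lines that match in full
strict=$(LC_ALL=C grep -x '[A-Z]\{4,\}' "$sample")

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
rm -f "$sample"
```

The strict form drops the mixed tokens but still keeps all-caps legalese like COPYRIGHT, so the result wants a manual skim either way.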

Long Words

Let's look for words of 6 letters or more:

$ cat seahawks_mynw.txt | grep "\w\{6,\}"
Remove
disable
Forward
alleged
infringer
promptly
sluggish
Devils
misidentification
judicial
Inform
[...]

Lookin' good! We've filtered out about 2,000 junk words, leaving us with about 4,000 better ones.
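
The grep above prints to the terminal, while the later commands read short_seahawks_mynw.txt, so presumably the output was redirected to that file at some point. A sketch of that step with before and after counts. The sample words and file names here are placeholders, not the real list:

```shell
# Hypothetical sample standing in for seahawks_mynw.txt
workdir=$(mktemp -d)
printf '%s\n' Remove the sluggish 710ESPNSeattle Hawks judicial 12345678 > "$workdir/words.txt"

# Keep lines with a run of 6+ word characters and save the result.
# Note that \w also matches digits and underscores, so 12345678 survives.
grep '\w\{6,\}' "$workdir/words.txt" > "$workdir/short_words.txt"

# grep -c '' counts lines and prints a bare number
before=$(grep -c '' "$workdir/words.txt")
after=$(grep -c '' "$workdir/short_words.txt")
echo "kept $after of $before words"
rm -rf "$workdir"
```

Since \w is not letters-only, numeric junk passes this filter and has to be removed in a separate step.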

HTML and Number Junk

There were a couple of Google ad placeholders, and every entry containing a digit was garbage, so drop both:

$ cat short_seahawks_mynw.txt | grep -v google | grep -v "[0-9]"
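
The two grep -v passes can be folded into one call with repeated -e patterns; a line is dropped if it matches either one. A small sketch, using made-up sample words and assuming the cleaned list is written to a new file instead of the terminal:

```shell
workdir=$(mktemp -d)
printf '%s\n' Seahawks googletag 710ESPNSeattle Wilson google_ad 2016season Legion > "$workdir/short.txt"

# -v inverts the match; each -e adds a pattern, and a line
# matching any of them is excluded
grep -v -e google -e '[0-9]' "$workdir/short.txt" > "$workdir/clean.txt"

cleaned=$(cat "$workdir/clean.txt")
echo "$cleaned"
rm -rf "$workdir"
```

Writing to a new file (clean.txt here) avoids truncating the input, which is what happens if you redirect grep back onto the file it is reading.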

The Slim-and-Trim Wordlist

We've reduced the wordlist to a more reasonable list of about 3,800 words. Remember that every bit of cruft we remove saves loads of time, because John the Ripper and other password tools that use wordlists create many, many variations of each word on the list.

If you want to be maniacal, you can also throw in words you specifically want included...

$ wc -l short_seahawks_mynw.txt
3795 short_seahawks_mynw.txt
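
Adding hand-picked words can be as simple as appending them and removing duplicates. A sketch with sort -u, where the list contents and the extra words are invented for illustration:

```shell
# Hypothetical stand-in for the cleaned wordlist
list=$(mktemp)
printf '%s\n' Seahawks Wilson Legion > "$list"

# Hypothetical hand-picked additions (Wilson is already present)
printf '%s\n' Sherman Wilson TwelfthMan >> "$list"

# sort -u sorts and drops exact duplicates; LC_ALL=C gives plain byte order
merged=$(LC_ALL=C sort -u "$list")

echo "$merged"
rm -f "$list"
```

Deduplicating matters for the same reason trimming does: each repeated word would otherwise be mangled and tried more than once.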


Cleaning Up Wikipedia Wordlists

Some shortcuts for cleaning up Cewl results from Wikipedia pages.

Making Wikipedia Wordlist

Make your wordlist from Wikipedia with Cewl:

#!/bin/sh

echo "Creating Seahawks wordlist..."
cewl -v en.wikipedia.org/wiki/Seattle_Seahawks -d 1 -w seahawks_wikipedia.txt
echo "Done."

The Problem

The problem with the resulting wordlist is that there's a lot of garbage. For example, all the language links on Wikipedia add non-sequiturs to your wordlist:

Hindi
replace
Indonesian
Italian
Hebrew
Hungarian
Marathi
Dutch
Japanese
Norwegian
Polish
Portuguese
Russian
Serbian
Serbo
Finnish
Swedish
Turkish
Ukrainian
Vietnamese
Chinese
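
One way to drop these, sketched below, is a stoplist file fed to grep. The stoplist contents and file names are hypothetical; you would build the real one by skimming your own Cewl output:

```shell
workdir=$(mktemp -d)
printf '%s\n' Seahawks Hindi Seattle Italian Dutchman Polish > "$workdir/wiki.txt"

# Hypothetical stoplist: one unwanted word per line
printf '%s\n' Hindi Italian Dutch Polish > "$workdir/stoplist.txt"

# -F: treat stoplist entries as fixed strings, not regexes
# -x: only drop lines that equal an entry exactly (Dutchman survives)
# -f: read the patterns from a file
filtered=$(grep -v -x -F -f "$workdir/stoplist.txt" "$workdir/wiki.txt")

echo "$filtered"
rm -rf "$workdir"
```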

There are also lots of 3-letter words like "the" (???), which I thought Cewl would ignore.
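
For what it's worth, Cewl's -m (--min_word_length) option sets the minimum word length, and its default is 3, which would explain the 3-letter words. If the list is already generated, the same cut can be made afterwards. A sketch with awk on made-up sample words; the threshold of 4 is an arbitrary choice for the example:

```shell
sample=$(mktemp)
printf '%s\n' the Seahawks and Wilson NFL Seattle > "$sample"

# Keep only words of at least min_len characters
min_len=4
longer=$(awk -v n="$min_len" 'length($0) >= n' "$sample")

echo "$longer"
rm -f "$sample"

# When regenerating the list instead, the equivalent would be something like:
#   cewl -m 4 -d 1 -w seahawks_wikipedia.txt en.wikipedia.org/wiki/Seattle_Seahawks
```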

Cleaning Up Cewl Results

Here's a script to drop words shorter than 7 letters, cut the junk at the end of the list, and filter out some Wikipedia-specific junk:

#!/bin/sh
# only keep lines with a run of 7+ word characters,
# drop lines starting with "wg" (e.g. MediaWiki's wgPageName),
# and throw out the last 50 lines

grep "\w\{7,\}" seahawks_wikipedia.txt | grep -v "^wg" | head -n -50 > short_seahawks_wikipedia.txt
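
One caveat: a negative count to head -n (meaning "all but the last N lines") is a GNU extension and may fail with BSD or macOS head. A portable sketch of the same trim, shown on a small made-up list and dropping the last 3 lines instead of 50:

```shell
sample=$(mktemp)
printf '%s\n' alpha bravo charlie delta echo foxtrot golf hotel > "$sample"

drop=3                                # lines to cut from the end
total=$(grep -c '' "$sample")         # number of lines in the file
keep=$((total - drop))

# Guard against lists shorter than the cut
if [ "$keep" -gt 0 ]; then
    trimmed=$(head -n "$keep" "$sample")
else
    trimmed=""
fi

echo "$trimmed"
rm -f "$sample"
```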