Cewl/Cleaning Wordlists
From charlesreid1
Cleaning Up Wikipedia Wordlists
This page covers some shortcuts for cleaning up Cewl results scraped from Wikipedia pages.
Making Wikipedia Wordlist
Make your wordlist from Wikipedia with Cewl:
#!/bin/sh
echo "Creating Seahawks wordlist..."
cewl -v en.wikipedia.org/wiki/Seattle_Seahawks -d 1 -w seahawks_wikipedia.txt
echo "Done."
The Problem
The problem with the resulting wordlist is that there's a lot of garbage. For example, all the language links on Wikipedia add non-sequiturs to your wordlist:
Hindi replace Indonesian Italian Hebrew Hungarian Marathi Dutch Japanese Norwegian Polish Portuguese Russian Serbian Serbo Finnish Swedish Turkish Ukrainian Vietnamese Chinese
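There are also lots of 3-letter words like "the", which I thought Cewl would ignore. It doesn't: Cewl's default minimum word length is 3, and it only goes higher if you set it with the -m option.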
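One way to get rid of the language names is to keep them in a stoplist file and let grep drop any line that matches one exactly. The sketch below is only an illustration: the file names and the tiny sample wordlist are made up so that the script runs on its own, and the stoplist holds just a few of the names shown above.

```shell
#!/bin/sh
# Hypothetical stoplist: a few of the language names from the Wikipedia sidebar.
cat > languages_stoplist.txt <<'EOF'
Hindi
Indonesian
Italian
Russian
Serbo
Vietnamese
EOF

# Small made-up sample standing in for real Cewl output (one word per line).
printf '%s\n' Seahawks Hindi Seattle Russian football > sample_wordlist.txt

# -v  print lines that do NOT match
# -x  a pattern must match the whole line
# -i  ignore case
# -F  treat patterns as fixed strings, not regexes
# -f  read the patterns from a file
grep -v -x -i -F -f languages_stoplist.txt sample_wordlist.txt > sample_no_languages.txt

cat sample_no_languages.txt
```

Because of -x, a word that merely contains a language name is kept; only lines that are exactly a stoplist entry are removed.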
Cleaning Up Cewl Results
Here's a script that keeps only words of 7 or more letters, drops Wikipedia-specific junk (names starting with "wg"), and throws away the last 50 lines of the list:
#!/bin/sh
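# keep only lines containing a run of 7 or more word characters,
# drop wikipedia-specific names that start with "wg",
# and throw out the last 50 lines (head -n -50 needs GNU head)
grep "\w\{7,\}" seahawks_wikipedia.txt | grep -v "^wg" | head -n -50 > short_seahawks_wikipedia.txt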
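The negative count in head -n -50 is a GNU extension, so the script above may fail on systems with a different head. A possible workaround is to do all three steps in awk. This is a rough sketch, not a drop-in replacement: the function name clean_wordlist, its arguments, and the sample file are all invented for illustration, and it tests the length of the whole line rather than looking for a run of word characters.

```shell
#!/bin/sh
# clean_wordlist FILE MINLEN DROP   (hypothetical helper, not part of Cewl)
# Keep lines of at least MINLEN characters, skip names starting with "wg",
# then leave off the last DROP lines that survived the filter.
clean_wordlist() {
    awk -v min="$2" -v drop="$3" '
        length($0) >= min && $0 !~ /^wg/ { kept[++n] = $0 }
        END { for (i = 1; i <= n - drop; i++) print kept[i] }
    ' "$1"
}

# Small made-up sample standing in for real Cewl output (one word per line).
printf '%s\n' Seahawks the wgPageName Seattle football championship stadium Wikipedia > sample_cewl.txt

# Keep words of 7+ characters and drop the last 2 surviving lines.
clean_wordlist sample_cewl.txt 7 2 > sample_clean.txt

cat sample_clean.txt
```

For the real wordlist the call would look like clean_wordlist seahawks_wikipedia.txt 7 50. The whole filtered list is held in memory until the end, which is fine for wordlists of this size.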