Cewl/Cleaning Wordlists
From charlesreid1
Revision as of 19:44, 16 August 2015
Making Wordlist with Cewl
Well, we made our wordlist with Cewl:
#!/bin/sh
echo "Creating Seahawks wordlist..."
cewl -v en.wikipedia.org/wiki/Seattle_Seahawks -d 1 -w seahawks_wikipedia.txt
echo "Done."
The only problem is, it's got a lot of garbage. For one, all of the language links from Wikipedia end up contaminating the end of the list:
Hindi replace Indonesian Italian Hebrew Hungarian Marathi Dutch Japanese Norwegian Polish Portuguese Russian Serbian Serbo Finnish Swedish Turkish Ukrainian Vietnamese Chinese
There are also lots of 3-letter words like "the", which I thought Cewl would ignore. (Cewl's default minimum word length is 3; the -m option raises it.)
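To see how many short words are in the list, it can help to count words by length. This is only a rough sketch: sample_wordlist.txt and length_histogram.txt are made-up names, and the sample words stand in for the real seahawks_wikipedia.txt (one word per line).

```shell
#!/bin/sh
# Hypothetical sample standing in for seahawks_wikipedia.txt (one word per line).
printf '%s\n' the Seahawks Seattle and football NFL > sample_wordlist.txt

# For each word length, count how many words have that length,
# then sort the result numerically by length.
awk '{ counts[length($0)]++ } END { for (n in counts) print n, counts[n] }' \
    sample_wordlist.txt | sort -n > length_histogram.txt

cat length_histogram.txt
```

Each output line is a word length followed by the number of words of that length, so a big count next to 3 means plenty of words like "the" made it in.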
Cleaning Up Cewl Results
Here's a script that drops words shorter than 7 letters, cuts off the language names at the end of the list, and filters out Wikipedia-specific junk (entries starting with "wg"):
#!/bin/sh
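# only keep words of 7 letters or more,
# drop entries starting with "wg" (Wikipedia-specific junk),
# and throw out the last 50 lines (the language names)
# note: \w and a negative head count need GNU grep and GNU head
grep "\w\{7,\}" seahawks_wikipedia.txt | grep -v "^wg" | head -n -50 > short_seahawks_wikipedia.txt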
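The \w escape is a GNU grep extension and a negative count for head -n is a GNU coreutils extension, so the pipeline above may not work on other systems. Here's a sketch of the same three steps using a POSIX character class and awk instead. The clean_wordlist helper and the sample file names are made up for illustration, and the sample trims 2 trailing lines instead of 50 so the effect is easy to see.

```shell
#!/bin/sh
# clean_wordlist FILE N (hypothetical helper, not part of Cewl):
#   1. keep lines containing a run of 7 or more word characters
#   2. drop lines starting with "wg"
#   3. drop the last N of the remaining lines
clean_wordlist() {
    grep -E '[[:alnum:]_]{7,}' "$1" | grep -v '^wg' | awk -v n="$2" '
        { buf[NR] = $0 }                                   # buffer every line
        END { for (i = 1; i <= NR - n; i++) print buf[i] } # print all but the last n
    '
}

# Hypothetical sample: real words, one "wg" entry, two language names at the end.
printf '%s\n' the Seahawks Seattle wgPageName football stadium Hungarian Japanese \
    > sample_wordlist.txt

clean_wordlist sample_wordlist.txt 2 > short_sample_wordlist.txt
cat short_sample_wordlist.txt
```

As in the original script, the trailing lines are counted after the other filters have run, so N is the number of surviving junk lines at the end, not the number in the raw Cewl output.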