Published Article in 2600 Magazine: We Will Rock You

We Will Rock You

Hello peeps!
It’s me again, your friendly neighbourhood gerbil.
You may remember me from articles such as “Take Your Work Home After Work”, which appeared in the Winter 2014 issue of 2600 Magazine, and “My Voice Is My Key”, which appeared in the Autumn 2015 issue of the awesome 2600 Magazine. If you haven’t read them, buy the back copies and read them NOW! :)

I haven’t written in a long, long time because I have been so, so busy, so I thought I’d say hi by submitting a little snippet of something very useful.

Let’s talk about wordlists. What is a wordlist?

Well, a wordlist, as it says on the tin, is a file which is made up of a shit-load of words.

The Kali operating system has a few wordlists which can be found at /usr/share/wordlists.
Now, in there is a massive file called rockyou.txt. It’s HUGE!!!
This is a bit of a default file for people to use as it contains absolutely millions of words! Let’s have a look:

root@kali:/usr/share/wordlists# wc -l rockyou.txt
14344392 rockyou.txt

Here we can see that there are 14344392 lines in the rockyou file. But does this value reflect words? Well, a word is a word. But is each line in “rockyou” a single word? Let’s run a quick command to see if any of these lines contain a space, i.e. whether they are “phrases” or “sentences”:

root@kali:/usr/share/wordlists# grep ' ' rockyou.txt | head
rock you
i love you
te amo
fuck you
te iubesc
love you
i love u
chris brown
rock on
john cena

John Cena?!?! Ha! We see that none of the first 10 matches is a single word! So how many of these lines are phrases? Let’s run another command:

root@kali:/usr/share/wordlists# grep -c ' ' rockyou.txt
70619

Wow! Now if I wanted to run a wordlist testing for single words, these would be a waste of time as they are not single words. OK, the password cracking tool may strip these out, but that too would be unnecessary extra work. You may argue that “they are phrases, keep them in.” Nah! With only 70619 phrases in the list, the chances of one of them exactly matching the target’s passphrase are next to nothing. And anyway, we are interested in a word list rather than a phrase list.
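
As a rough sketch of that split, the snippet below separates a wordlist into single words and phrases with two grep passes. The four-line sample file and the function name split_phrases are made up for illustration; on Kali you would point it at rockyou.txt instead.

```shell
#!/bin/bash
# Hypothetical helper: split a wordlist into single words and phrases.
# $1 = input wordlist, $2 = output for single words, $3 = output for phrases
split_phrases() {
    grep -v ' ' "$1" > "$2"   # lines without a space: single words
    grep ' ' "$1" > "$3"      # lines with at least one space: phrases
}

# Tiny made-up sample standing in for rockyou.txt
sample=$(mktemp)
printf '%s\n' 'password' 'i love you' 'dragon' 'john cena' > "$sample"

split_phrases "$sample" "$sample.words" "$sample.phrases"
cat "$sample.words"      # the single words
cat "$sample.phrases"    # the phrases
```

Keeping the phrases in their own file rather than deleting them costs nothing and leaves the door open for anyone who does want a phrase list later.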

Before I go further, the rockyou.txt file contains LOADS of crap:

root@kali:/usr/share/wordlists# awk 'BEGIN{len=0;}{if(length($0)>len){len=length($0);printf("%i : %s\n",len,$0);}}' rockyou.txt
6 : 123456
9 : 123456789
10 : 1234567890
11 : christopher
13 : tequieromucho
16 : manchesterunited
17 : mychemicalromance
18 : 123456789123456789
39 : Lets you update your FunNotes and more!
40 : 1111111111111111111111111111111111111111
42 : RockYou account is required for Voicemail.
49 : /* { css code start--} */
awk: cmd. line:1: (FILENAME=rockyou.txt FNR=602044) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
59 :
77 : vabfdvfdlvhjibfedblsfndilvbgilebvgdlsbgvhbesghklhyubvuwklfbrebgfyurerebgyureb
165 : lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
222 : <table style="border-collapse:collapse;"><tr><td colspan="2"><embed src="" quality="high" scale="noscale" salign="lt" width="325" height="260" wmode="transparent" flashvars="imgpath=http%
255 : <object width="206" height="224"><param name="movie" value=""></param><param name="wmode" value="transparent"></param><embed src="" type="application/x-shockwave-flash" wmod
257 : <style type=\\'text/css\\'>body{ background: url( white center no-repeat fixed; } table, .heading_profile, .heading_profile_left, table td, #p_container, #p_nav_primary, #top_header, #p_n
262 : <style type=\\'text/css\\'>.bg_content{background-image:url(;}.bg_content{background-repeat:repeat;}</STYLE><a href=\\'\\' target=\\'_top\\'><img src=\\'http://hi5.enchula
266 : <div id=\\'24813\\'><a href=\\'\\'><img src=\\'\\' border=0 alt=\\'Hazte famoso en\\'></a></div><div id=\\'72891\\'><a href=\\'http://w
285 : <div align=\\\\\\'center\\\\\\' style=\\\\\\'font:bold 11px Verdana; width:310px\\\\\\'><a style=\\\\\\'background-color:#eeeeee;display:block;width:310px;border:solid 2px black; padding:5px\\\\\\' href=\\\\\\'\\\\\\' target=\\\\\\'_blank\\\\\\'>Playing/Tangga

What I have done here is print each line that is longer than the longest line seen so far. Just by looking at this output we see that lines with a character count greater than 18 are in fact crap. They’re not even phrases! They are bits of websites – HTML! Definitely not useful in searching for passwords!
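
A length cap is one way to act on that observation. This is only a sketch and is not something the script at the end of the article does: the function name drop_long_lines and the sample lines are hypothetical, and the cut-off of 18 is simply read off the output above. Setting LC_ALL=C makes awk count bytes rather than decode multibyte characters, which should also keep the "Invalid multibyte data" warning quiet.

```shell
#!/bin/bash
# Hypothetical helper: print only the lines of wordlist $1 that are
# at most $2 bytes long.
# LC_ALL=C makes awk count bytes instead of decoding multibyte characters.
drop_long_lines() {
    LC_ALL=C awk -v max="$2" 'length($0) <= max' "$1"
}

# Tiny made-up sample standing in for rockyou.txt
sample=$(mktemp)
printf '%s\n' '123456' 'mychemicalromance' \
    '<div align="center">not a password</div>' > "$sample"

# 18 is the cut-off suggested by the awk output above (an assumption)
drop_long_lines "$sample" 18
```

A blunt cap like this would also throw away any genuine long passwords, so treat the number as something to tune rather than a rule.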

So we can strip these out. Anything with a space – get rid of it.
And while we’re at it, let’s remove emails and websites. Think about it: you are cracking a password hash on BumbleBee Security’s webapp. Is some random person’s email address or a website address going to be the password? Unless you are REALLY lucky, no, no it isn’t! Not whatsoever!

Out of interest, how many lines contain emails and websites?

root@kali:/usr/share/wordlists# egrep -c '[a-zA-Z0-9_\-\.]+@[a-zA-Z0-9_\-\.]+\.[a-zA-Z]{2,5}' rockyou.txt
root@kali:/usr/share/wordlists# grep -c 'http[s]*://' rockyou.txt

Wow! Quite a lot! Let’s remove them too.
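
To see what those two patterns actually catch, here is a small sketch run against a made-up sample instead of rockyou.txt. The addresses and the function names count_emails and count_websites are invented for illustration. The patterns are lightly tidied versions of the ones above: both are quoted so the shell leaves them alone, egrep is spelled grep -E, and the hyphen sits at the end of the bracket expression so no backslashes are needed.

```shell
#!/bin/bash
# Hypothetical helpers: count the lines in wordlist $1 that contain
# an email address or a website address.
count_emails() {
    grep -E -c '[a-zA-Z0-9_.-]+@[a-zA-Z0-9_.-]+\.[a-zA-Z]{2,5}' "$1"
}
count_websites() {
    grep -E -c 'https?://' "$1"
}

# Tiny made-up sample standing in for rockyou.txt
sample=$(mktemp)
printf '%s\n' 'letmein' 'someone@example.com' 'http://example.com/' \
    'https://example.org' 'qwerty' > "$sample"

count_emails "$sample"
count_websites "$sample"
```

Note that grep -c counts matching lines, not matches, so a line holding two addresses still counts once.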

In conclusion, the rockyou.txt wordlist contains a load of crap that can be removed, and other wordlists may contain crap too, such as blocks of “header text”. Because of this I wrote a simple script, which can be found at the end of this article. Feel free to use it and send me kudos.

Many thanks for reading.

Gerbil. [twitter: @gerbilByte]

#!/bin/bash
# wordlistcleanser. gerbil 2018 [twitter: @gerbilByte]
# This file is used to clean rockyou.txt from all the crap to leave just single words.
# It will also cleanse other wordlists too.
# Usage:
# infile [outfile]
# WARNING: If an output file isn't specified, then the input will be overwritten (permissions allowing).
# Example:
# ./ /usr/share/wordlists/rockyou.txt ./wewillrockyou.txt


# NOTE: these assignments were missing from this copy of the script; the version number is an assumption.
version="1.0"
author="gerbil"

if [ $# -lt 1 ]
then
    printf "\nwordlistcleanser v%s - %s 2018\n\nThis is a simple script that will remove 'phrases', emails and websites from wordlist files.\nEmails and websites will be stored as files under the current directory.\n\n" "${version}" "${author}"
    printf "Usage:\n\t%s infile.txt [outfile.txt]\n\nWARNING: If an output file isn't specified, then the input will be overwritten (permissions allowing).\n\nExample:\n\t./ ./rockyou.txt ./wewillrockyou.txt\n\nHave fun! :)\n-%s\n" "$0" "${author}"
    exit 1
fi

infile="$1"
outfile="$2"

baseinfile=$(basename "${infile}")
printf "Cleaning %s...\n" "${infile}"

#Check input file exists...
if ! [ -e "${infile}" ]
then #input file doesn't exist.
    printf " %s doesn't exist!\n" "${infile}"
    exit 1
fi

#Check if input file is to be overwritten or not...
if [ -z "${outfile}" ]
then #no output file specified, therefore destruct mode! ;P
    outfile="${infile}"
    printf " No output file specified, therefore output will be stored at %s\n" "${outfile}"
    # rm -f ${infile} # just to save space
else
    printf " Output file : %s\n" "${outfile}"
fi

#Removing phrases...
printf "Removing phrases...\n"
grep -v ' ' "${infile}" > /tmp/ry1.txt

#Extracting then removing websites...
printf "Extracting then removing websites...\n"
grep 'http[s]*://' /tmp/ry1.txt > "./${baseinfile}_websites.txt"
grep -v 'http[s]*://' /tmp/ry1.txt > /tmp/ry2.txt
rm -f /tmp/ry1.txt # just to save space

#Extracting then removing emails...
printf "Extracting then removing emails...\n"
grep -E '[a-zA-Z0-9_\-\.]+@[a-zA-Z0-9_\-\.]+\.[a-zA-Z]{2,5}' /tmp/ry2.txt > "./${baseinfile}_emails.txt"
grep -E -v '[a-zA-Z0-9_\-\.]+@[a-zA-Z0-9_\-\.]+\.[a-zA-Z]{2,5}' /tmp/ry2.txt > "${outfile}"
rm -f /tmp/ry2.txt # just to save space

#Get stats on leftover file (length of each word and count of each, I know there are no words longer than 1000 characters)...
printf "Getting stats on %s, extracted emails and extracted websites...\n" "${outfile}"
# The stats file sits next to the output file (no extra ./ prefix, so absolute paths work too).
statsfile="${outfile%.*}_stats.txt"
printf "Emails extracted: %s\n" "$(wc -l "./${baseinfile}_emails.txt")" > "${statsfile}"
printf "Websites extracted: %s\n" "$(wc -l "./${baseinfile}_websites.txt")" >> "${statsfile}"
printf "\nStats on %s : \n\n" "${outfile}" >> "${statsfile}"
awk 'BEGIN{charcounts[1000]=0;len=0;printf("word length : count\n------------:------\n");}{charcounts[length($0)]++;}END{for(i=0;i<=1000;i++){printf("%11i : %i\n",i,charcounts[i]);}}' "${outfile}" | grep -v ': 0$' >> "${statsfile}"

printf "Cleansing completed.\n\n"
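
One thing worth noting about the script above: its scratch files live at fixed paths under /tmp, so two runs at the same time would trample each other. A minimal sketch of the same three filters, chained into one pipeline and using mktemp for the scratch file, follows. The function name cleanse and the sample lines are hypothetical, the patterns are tidied versions of the script's, and unlike the script it skips the stats and throws the extracted emails and websites away rather than saving them.

```shell
#!/bin/bash
# Hypothetical variant of the cleansing step: drop phrases, websites and
# emails in one pipeline, using a private scratch file from mktemp.
# $1 = input wordlist, $2 = output wordlist (may be the same file as $1)
cleanse() {
    local tmp
    tmp=$(mktemp) || return 1
    grep -v ' ' "$1" \
        | grep -E -v 'https?://' \
        | grep -E -v '[a-zA-Z0-9_.-]+@[a-zA-Z0-9_.-]+\.[a-zA-Z]{2,5}' > "$tmp"
    # Copy rather than move so the output file keeps its own permissions.
    cat "$tmp" > "$2"
    rm -f "$tmp"
}

# Tiny made-up sample standing in for rockyou.txt
sample=$(mktemp)
printf '%s\n' 'letmein' 'i love you' 'someone@example.com' \
    'http://example.com/' 'qwerty' > "$sample"

cleanse "$sample" "$sample.clean"
cat "$sample.clean"
```

Because everything is written to the scratch file before the output is touched, passing the same path for input and output overwrites the wordlist in place without truncating it mid-read.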

File running:

root@kali:~# ./ /usr/share/wordlists/rockyou.txt ./wewillrockyou.txt
Cleaning /usr/share/wordlists/rockyou.txt...
Output file : ./wewillrockyou.txt
Removing phrases...
Extracting then removing websites...
Extracting then removing emails...
Getting stats on ./wewillrockyou.txt, extracted emails and extracted websites...
Cleansing completed.
root@kali:~# wc -l /usr/share/wordlists/rockyou.txt ./wewillrockyou.txt
14344392 /usr/share/wordlists/rockyou.txt
14245981 ./wewillrockyou.txt
28590373 total
root@kali:~# expr 14344392 - 14245981
98411
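
The same bookkeeping can be done without copying numbers around by hand. A small sketch, where the function name removed_lines is hypothetical and the two files are tiny stand-ins for the original and the cleansed wordlist:

```shell
#!/bin/bash
# Hypothetical helper: print how many lines were removed between
# the original wordlist ($1) and the cleansed one ($2).
removed_lines() {
    local before after
    before=$(wc -l < "$1")
    after=$(wc -l < "$2")
    echo $(( before - after ))
}

# Tiny made-up samples standing in for rockyou.txt and wewillrockyou.txt
original=$(mktemp)
cleansed=$(mktemp)
printf '%s\n' 'letmein' 'i love you' 'qwerty' > "$original"
printf '%s\n' 'letmein' 'qwerty' > "$cleansed"

removed_lines "$original" "$cleansed"
```

Reading the files through a redirect (wc -l < file) makes wc print only the number, which is what the shell arithmetic needs.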


