Experiment1
This page walks through the steps for Experiment 1.
Overview
The Data
Start with the wifi data. This data will be collected by Raspberry Pis placed in various locations and operating over a given time interval.
Raspberry Pi Data
On board the Pis, airodump-ng will be used to create a CSV file containing information gathered about the wifi network over a time interval of 1-5 minutes. Each CSV file created by airodump-ng contains an aggregated view of the wifi network over that time interval, so you end up with a large number of CSV files - one observation for every 1-5 minutes.
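For reference, each CSV file that airodump-ng writes is laid out as two sections separated by blank lines - an access point section, then a client (station) section - roughly like this (the exact AP columns are shown later on this page):

(blank line)
BSSID, First time seen, Last time seen, channel, ..., ESSID, Key
<one row per access point observed>
(blank line)
Station MAC, First time seen, Last time seen, Power, # packets, BSSID, Probed ESSIDs
<one row per client observed>
(blank line)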
Data Processing
These CSV files are then parsed and processed using Python, turning the observational data about wifi networks into records in an SQL database.
SQL Data Warehouse
The data will be stored in a data warehouse, in the form of a SQLite database. This database will provide a place for us to store two kinds of data:
1. raw data - the basic rearrangement of information to get it from the CSV files into an SQL database.
2. processed data (derived quantities) - quantities computed from the raw data, involving calculations and mathematical representations.
Mathematical Representation
An example of processed data that we might store in a database is a mathematical representation, such as a graph.
Graphs contain a lot of information and are extremely useful for analysis. However, constructing graphs, and computing information about them, can be computationally expensive - and our data will yield hundreds of graphs, so anything that takes a while for one graph will take far longer for all of them.
For this reason, we want to store processed data in the database as well.
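As a rough sketch of why caching helps - here using networkx, which is an assumption of this sketch, not something the experiment has committed to - we can compute a graph quantity once and store it alongside the raw data:

import sqlite3
import networkx as nx   # assumption: networkx for the graph representation

def graph_density(edges):
    # Build a client <-> access point graph and compute one derived quantity
    g = nx.Graph()
    g.add_edges_from(edges)
    return nx.density(g)

# Hypothetical table for derived graph quantities
conn = sqlite3.connect('wifidata.db')
c = conn.cursor()
c.execute("CREATE TABLE IF NOT EXISTS graph_stats (csv_file text, density real)")

# (client MAC, AP BSSID) pairs - a hypothetical observation
edges = [('AA:BB:CC:00:11:22', '99:33:C8:D0:33:11')]
c.execute("INSERT INTO graph_stats VALUES (?,?)", ('awesome-01.csv', graph_density(edges)))
conn.commit()
conn.close()

The derived quantity then only needs to be computed once per CSV file, instead of every time the analysis runs.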
Raspberry Pi Data
Collecting Data
Goal: use the Pis, via Python, to collect one wifi CSV every N minutes.
SpawnKillDump
Behold the script: spawnkilldump.py
import subprocess
import time

# Repeat the capture a few times:
# spawn airodump-ng, let it run for N minutes, then kill it.
for i in range(3):
    Nmin = 1
    Nsec = Nmin*60
    #Nsec = 10
    the_cmd = ['airodump-ng','wlan1','-w','awesome','--output-format','csv']
    p = subprocess.Popen(the_cmd)
    time.sleep(Nsec)
    p.kill()

print "Success!"
Data Processing
The first round of data processing is to parse the CSV files and populate the SQL database with information from them.
To do this, we'll use Python, since Python makes it easy to deal with CSV files.
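The scripts below tokenize lines by splitting on commas by hand, which is fine for airodump-ng output; as a side note, the standard library's csv module could equally handle the tokenizing (and any quoting). A minimal sketch:

import csv

# Iterate over rows, skipping the blank separator lines
# (csv.reader yields an empty list for a blank line)
with open('awesome-01.csv') as f:
    for row in csv.reader(f):
        if row:
            print row[0]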
Process CSV
Below is a prototype Python script that shows how to load the CSV file and parse the data.
We begin with one thing we know about the CSV file: we know that it consists of two tables, each with different numbers of columns, and that those tables are separated by a blank line.
If we load every line of the file into a list, then we can pick out items in the list that are blank lines (they're also the only lines in the file that don't have a comma).
Then we can split the list containing every line of the file into two lists: a list of lines containing access point data, and a list of lines containing client/device data.
Once we have those two lists, we loop over each line and tokenize the data by splitting at the commas. That gives us the data as Python primitive types, which can then be dealt with however we wish. (Note: this is where you'd populate your SQL database.)
with open('awesome-01.csv') as f:
    lines = f.readlines()

# Find where breaks in CSV file are located
# (Split between APs and Clients)
breaks = []
for i in range(len(lines)):
    tokens = [t.strip() for t in lines[i].split(",")]
    if len(tokens)==1:
        breaks.append(i)

# Use that to extract ap and client data
# (each slice runs up to, but not including, a blank separator line)
ap_header = lines[breaks[0]+1]
ap_data = lines[breaks[0]+2:breaks[1]]
client_header = lines[breaks[1]+1]
client_data = lines[breaks[1]+2:breaks[2]]

# Tokenize and extract fields
print "AP MACs:"
for ap in ap_data:
    tokens = ap.split(",")
    print tokens[0]
Data Warehouse
To deal with all of this data, we will need to turn the CSV files generated on the Raspberry Pi into data in an SQL database.
Each CSV file will contain a (time-aggregated) list of access points and clients that were seen over the given time period. Each row in these CSV files corresponds to one experimental "observation" and is recorded as an individual record in the SQL database.
The SQL database will consist of (at least) two tables for raw data: one table for access points, and one table for clients.
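As a sketch of that organization (table and column names here are placeholders, not final):

import sqlite3

conn = sqlite3.connect('wifidata.db')
c = conn.cursor()

# One raw-data table per section of the airodump-ng CSV files
c.execute("""CREATE TABLE IF NOT EXISTS aps
             (csv_file text, bssid text, channel integer, power integer, essid text)""")
c.execute("""CREATE TABLE IF NOT EXISTS clients
             (csv_file text, station_mac text, power integer, bssid text)""")

conn.commit()
conn.close()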
Example
Initial Observation
Suppose we have a CSV file with the following information about an access point:
BSSID, First time seen, Last time seen, channel, Speed, Privacy, Cipher, Authentication, Power, # beacons, # IV, LAN IP, ID-length, ESSID, Key
99:33:C8:D0:33:11, 2016-02-05 19:55:57, 2016-02-05 19:55:57, 11, 54, WPA2 WPA, CCMP TKIP, PSK, -80, 1, 0, 0. 0. 0. 0, 9, HOME-7310,
This can be translated into an SQL record, with the key-value pairs of the record corresponding to the columns and values for the row. The field "BSSID" will have a value of "99:33:C8:D0:33:11", the first time seen/last time seen fields will have the corresponding date and time, the field "Channel" will have a value of 11, and so on. The SQL record would also contain an identifying field indicating which CSV file it came from.
Here's a Python script that pulls data out of a CSV file and prints the key-value pairs that would make up an SQL record for each line:
def csv_to_apclientdata(csv_file):
    """Split a CSV file into AP header/data and client header/data."""
    with open(csv_file) as f:
        lines = f.readlines()

    # Find where breaks in CSV file are located
    # (Split between APs and Clients)
    breaks = []
    for i in range(len(lines)):
        tokens = [t.strip() for t in lines[i].split(",")]
        if len(tokens)==1:
            breaks.append(i)

    # Use that to extract ap and client data
    # (each slice runs up to, but not including, a blank separator line)
    ap_header = lines[breaks[0]+1]
    ap_data = lines[breaks[0]+2:breaks[1]]
    client_header = lines[breaks[1]+1]
    client_data = lines[breaks[1]+2:breaks[2]]

    return ap_header, ap_data, client_header, client_data

def print_ap_info(ap_header,ap_data):
    # Print the key-value pairs that would make up
    # an SQL record for each AP
    head_tokens = [a.strip() for a in ap_header.split(",")]
    for ap in ap_data:
        ap_tokens = [a.strip() for a in ap.split(",")]
        print ""
        print "----------------------"
        for key,val in zip(head_tokens,ap_tokens):
            print key,":",val
        print "----------------------"

if __name__=="__main__":
    ap_header, ap_data, client_header, client_data = csv_to_apclientdata('../csv/awesome-01.csv')
    print_ap_info(ap_header,ap_data)
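Run against the sample row above, this prints something like:

----------------------
BSSID : 99:33:C8:D0:33:11
First time seen : 2016-02-05 19:55:57
Last time seen : 2016-02-05 19:55:57
channel : 11
...
----------------------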
Later Observation
Now suppose that in a later CSV file, the same device shows up again:
BSSID, First time seen, Last time seen, channel, Speed, Privacy, Cipher, Authentication, Power, # beacons, # IV, LAN IP, ID-length, ESSID, Key
99:33:C8:D0:33:11, 2016-02-05 22:33:02, 2016-02-05 22:34:19, 11, 54, WPA2 WPA, CCMP TKIP, PSK, -80, 1, 0, 0. 0. 0. 0, 9, HOME-7310,
Same device, same MAC, but different timestamps. How do we deal with it?
We consider it a separate and new observation, and we make a separate and new SQL record. This SQL record will have key-value pairs, so the "BSSID" field will be "99:33:C8:D0:33:11", the channel field will have a value of 11, the date/time fields will have the corresponding values, etc.
SQL Table Organization
From walking through the above procedure, we can see that no single field of a record is guaranteed to be unique. For example, multiple records may have the same value for the "BSSID" field; likewise for the date and time fields, the bit rates, the channel, and the ESSID.
Each record in the SQL table therefore needs to be tagged with a unique observation ID/key, so that we can refer to individual observations when we begin our analysis.
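SQLite can supply this key for us: an INTEGER PRIMARY KEY column is automatically assigned an integer on insert, and a filename column ties each observation back to its CSV file. A minimal sketch (table name hypothetical):

import sqlite3

conn = sqlite3.connect('wifidata.db')
c = conn.cursor()

# obs_id is assigned automatically by SQLite on each insert
c.execute("""CREATE TABLE IF NOT EXISTS ap_observations
             (obs_id INTEGER PRIMARY KEY, csv_file text, bssid text)""")
c.execute("INSERT INTO ap_observations (csv_file, bssid) VALUES (?,?)",
          ('awesome-01.csv', '99:33:C8:D0:33:11'))

conn.commit()
conn.close()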
Access Points Table Structure
The access points table in the data warehouse will basically have a field for each column in the CSV file.
To populate the SQL table, you go through a two-step process:
- Step 1 - create the table with all the fields
- Step 2 - create records and insert them into the table
Here's the script:
import sqlite3

def csv_to_apclientdata(csv_file):
    """
    Turn a CSV file into 4 pieces:
    One string with the AP header line
    One list of AP data lines
    One string with the client header line
    One list of client data lines
    """
    with open(csv_file) as f:
        lines = f.readlines()

    # Find where breaks in CSV file are located
    # (Split between APs and Clients)
    breaks = []
    for i in range(len(lines)):
        tokens = [t.strip() for t in lines[i].split(",")]
        if len(tokens)==1:
            breaks.append(i)

    # Use that to extract ap and client data
    # (each slice runs up to, but not including, a blank separator line)
    ap_header = lines[breaks[0]+1]
    ap_data = lines[breaks[0]+2:breaks[1]]
    client_header = lines[breaks[1]+1]
    client_data = lines[breaks[1]+2:breaks[2]]

    return ap_header, ap_data, client_header, client_data

def print_ap(ap_header,ap_data):
    # Build a key-value record for a single AP (the first one)
    ap = ap_data[0]
    head_tokens = [a.strip() for a in ap_header.split(",")]
    ap_tokens = [a.strip() for a in ap.split(",")]
    record = {}
    for key,val in zip(head_tokens,ap_tokens):
        record[key] = val
    print "-"*30
    print record
    print "-"*30

def print_ap_head(ap_header,ap_data):
    # Print the column names, quoted and comma-joined,
    # as they will appear in the CREATE TABLE statement
    head_tokens = [a.strip() for a in ap_header.split(",")]
    print "\'" + "\',\'".join(head_tokens) + "\'"

def make_ap_sql_record(ap_header,ap_data):
    """
    Tokenize the AP header and data lines,
    create the table, and insert the data.
    """
    # connect to sqlite database
    conn = sqlite3.connect('wifidata.db')

    # get a cursor into the database
    c = conn.cursor()

    # Step 1: extract column headers and create the table
    head_tokens = [a.strip() for a in ap_header.split(",")]
    ap_head_fields = "\'" + "\',\'".join(head_tokens) + "\'"
    sql_create_table = "CREATE TABLE wifidata (%s)"%(ap_head_fields)
    #print sql_create_table
    try:
        c.execute(sql_create_table)
        print " [+] Success creating table."
    except sqlite3.OperationalError:
        # table already exists
        pass
    conn.commit()

    # Step 2: create a record from each AP line and insert it
    for ap in ap_data:
        ap_tokens = [a.strip() for a in ap.split(",")]
        ap_values = "\'" + "\',\'".join(ap_tokens)+"\'"
        sql_insert = "INSERT INTO wifidata VALUES (%s)"%(ap_values)
        #print sql_insert
        try:
            c.execute(sql_insert)
            print " [+] Success inserting row."
        except sqlite3.OperationalError:
            # skip lines that fail to insert (e.g., stray quote characters)
            pass
        conn.commit()

    conn.close()

if __name__=="__main__":
    ap_header, ap_data, client_header, client_data = csv_to_apclientdata('../csv/awesome-01.csv')
    make_ap_sql_record(ap_header,ap_data)
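One design note on the script above: building SQL strings by hand breaks as soon as a field contains a quote character (one reason for the try/except around the INSERT). sqlite3's ? placeholders sidestep the quoting problem entirely; here is a sketch of the populate loop in make_ap_sql_record rewritten to use them:

# Step 2, using parameter substitution instead of string formatting
for ap in ap_data:
    ap_tokens = [a.strip() for a in ap.split(",")]
    placeholders = ",".join("?"*len(ap_tokens))
    c.execute("INSERT INTO wifidata VALUES (%s)"%(placeholders), ap_tokens)
conn.commit()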