Google Cloud/Weather Data in BigQery
From charlesreid1
Main link: https://google.qwiklabs.com/focuses/5689?locale=en
Qwiklab
Summary
This lab combines public data sets from NOAA and 311 in NYC. The end goal is to create a model for correlations between number of calls and weather.
Aspects of Google Cloud highlighted for scientists: serverless, easy to use, scalable, sharable
Steps
Overview of the basic steps:
- Set up a Google Cloud account (follow qwiklabs instructions, or use own)
- Explore weather data and 311 data in BigQuery
- Find correlation between weather and complaints
Setting Up Google Cloud Account
Won't cover this here. Use a free trial or use the Qwiklab-provided account.
Exploring Data
To enable the NOAA weather data set, navigate to the BigQuery Console and pick your project.
Now on the left, where the project name is listed, there is a triangle drop-down button next to it.
Click that and pick "Switch to project" and "Display project..." to display the weather data set.
Before running SQL queries, check to make sure you are running SQL queries in legacy mode.
Weather Data
Here is a sample SQL query:
SELECT -- Create a timestamp from the date components. stn, TIMESTAMP(CONCAT(year,"-",mo,"-",da)) AS timestamp, -- Replace numerical null values with actual null AVG(IF (temp=9999.9, null, temp)) AS temperature, AVG(IF (wdsp="999.9", null, CAST(wdsp AS Float64))) AS wind_speed, AVG(IF (prcp=99.99, 0, prcp)) AS precipitation FROM `bigquery-public-data.noaa_gsod.gsod20*` WHERE CAST(YEAR AS INT64) > 2010 AND CAST(MO AS INT64) = 6 AND CAST(DA AS INT64) = 12 AND (stn="725030" OR -- La Guardia stn="744860") -- JFK GROUP BY stn, timestamp ORDER BY timestamp DESC, stn ASC
Let's break it down to understand what it is doing...
- First, we are selecting a timestamp, plus averages of 3 fields, for each station that is returned by our query. That's the bulk of the SELECT statement.
- The FROM part of the query indicates which table we're taking data from - the range of our historical data.
- We want to limit the data to a particular date, so we limit to each June 12 after 2010.
- Finally, we limit the results to the LaGuardia and JFK airport stations.
So, what does this query do? It plots weather conditions at NYC's two airports for June 12, year over year, starting in 2011
311 Data
To query the NYC 311 data, which is what the original qwiklab was analyzing, we can use the following query:
SELECT EXTRACT(YEAR FROM created_date) AS year, complaint_type, COUNT(1) AS num_complaints FROM `bigquery-public-data.new_york.311_service_requests` GROUP BY year, complaint_type ORDER BY num_complaints DESC
What does this query do? It is asking for information about 311 complaints - specifically, it is counting the number of complaints per year, grouped by complaint type.
- The SELECT statement extracts the year from the timestamp when the complaint was created
- The FROM statement specifies that we want to use NYC 311 data
- GROUP BY means we group the results according to these two dimensions - year and complaint_type
- We order the results by number of complaints, largest to smallest.
Searching for Correlations
For the second part of the lab, create a new table to store results.
Click the triangular drop down arrow next to project name, and pick "Create new dataset".
Click "Compose New Query". Can call it nyc_weather or something.
Step 1: Construct Weather Data Table
Step 1 will be constructing a lookup table of dates and weather characteristics. We will construct a query of the NOAA weather data, and the results of this query will populate the table nyc_weather (through the magic of BigQuery).
SELECT -- Create a timestamp from the date components. timestamp(concat(year,"-",mo,"-",da)) as timestamp, -- Replace numerical null values with actual nulls AVG(IF (temp=9999.9, null, temp)) AS temperature, AVG(IF (visib=999.9, null, visib)) AS visibility, AVG(IF (wdsp="999.9", null, CAST(wdsp AS Float64))) AS wind_speed, AVG(IF (gust=999.9, null, gust)) AS wind_gust, AVG(IF (prcp=99.99, null, prcp)) AS precipitation, AVG(IF (sndp=999.9, null, sndp)) AS snow_depth FROM `bigquery-public-data.noaa_gsod.gsod20*` WHERE CAST(YEAR AS INT64) > 2008 AND (stn="725030" OR -- La Guardia stn="744860") -- JFK GROUP BY timestamp
Again, we're looking at the two airport weather stations.
- The SELECT statement extracts the year, month, and day as our timestamp. It also asks for the average of several wind, precipitation, and temperature variables over observations for that day.
- The FROM statement just indicates we are looking at weather data
- The WHERE forces the query to look at data more recent than 2008 and from JFK or LaGuardia airport weather stations only.
Now the nyc_weather table has a more convenient intermediate query result, which is a list of timestamps and the corresponding weather that day.
Now we can cross-reference this from 311 results, and do so more easily than before.
Step 2: Cross-Reference Weather and 311 Results
The final step is the hardest - we have a long query to run. Let's review what the query should do first.
We already have a list of timestamps and the corresponding weather, so we want to extract the timestamp from 311 calls (in the same format), group them by category, count them by unique days, and look up weather data for each of those groups.
This will give us a bucket of two-tuples: (weather data, number of complaints).
These two-tuples are then plotted and fitted to give a model of number of complaints vs. weather feature.
On to the full query:
SELECT descriptor, sum(complaint_count) as total_complaint_count, count(temperature) as data_count, ROUND(corr(temperature, avg_count),3) AS corr_count, ROUND(corr(temperature, avg_pct_count),3) AS corr_pct From ( SELECT avg(pct_count) as avg_pct_count, avg(day_count) as avg_count, sum(day_count) as complaint_count, descriptor, temperature FROM ( SELECT DATE(timestamp) AS date, temperature FROM demos.nyc_weather) a JOIN ( SELECT x.date, descriptor, day_count, day_count / all_calls_count as pct_count FROM (SELECT DATE(created_date) AS date, concat(complaint_type, ": ", descriptor) as descriptor, COUNT(*) AS day_count FROM `bigquery-public-data.new_york.311_service_requests` GROUP BY date, descriptor)x JOIN ( SELECT DATE(created_date) AS date, COUNT(*) AS all_calls_count FROM `bigquery-public-data.new_york.311_service_requests` GROUP BY date )y ON x.date=y.date )b ON a.date = b.date GROUP BY descriptor, temperature ) GROUP BY descriptor HAVING total_complaint_count > 5000 AND ABS(corr_pct) > 0.5 AND data_count > 5 ORDER BY ABS(corr_pct) DESC
Keep in mind this is a complicated SQL query, so we'll break it down in parts.
Breakdown
Let's start with the innermost query, which is a join of two queries:
SELECT x.date, descriptor, day_count, day_count / all_calls_count as pct_count FROM (SELECT DATE(created_date) AS date, concat(complaint_type, ": ", descriptor) as descriptor, COUNT(*) AS day_count FROM `bigquery-public-data.new_york.311_service_requests` GROUP BY date, descriptor)x JOIN ( SELECT DATE(created_date) AS date, COUNT(*) AS all_calls_count FROM `bigquery-public-data.new_york.311_service_requests` GROUP BY date )y ON x.date=y.date
The first SELECT query is getting the date of each complaint matching a certain type of complaint, grouping them by day, and counting complaints of that type on a given date.
The second SELECT query is getting the date of each complaint (for ALL complaints), and counting all complaints by date. This gives a total number of complaints on each day, which gives a relative measure of how many 311 calls were a particular type of complaint.
Lastly, these quantities are joined on the date field, which we expect to be exactly the same between the two.
Now, all of this is wrapped in a JOIN statement, which means we're taking the result of the query we just described above (which returns percentages for each complaint category) and joining it to another query result. Here's the larger query context:
FROM ( SELECT DATE(timestamp) AS date, temperature FROM demos.nyc_weather)a JOIN ( <this is where the prior query goes> )y ON x.date=y.date )b ON a.date = b.date
So we are getting a list of dates, types of complaints, total complaint counts, percent complaint counts, and the temperature, all in this portion of the query.
These are then joined on the date, again, like the two 311 queries (essentially, we are joining the results of the two joined 311 queries with the results of the NOAA weather query). That's the "a" and "b" join.
But wait! We're not done yet! (In fact, it may have been wiser to split this query up into more sub-steps...)
From the results of this, we do one more step, and extract the following (this is the very beginning of the query):
SELECT descriptor, sum(complaint_count) as total_complaint_count, count(temperature) as data_count, ROUND(corr(temperature, avg_count),3) AS corr_count, ROUND(corr(temperature, avg_pct_count),3) AS corr_pct From ( <this is where the prior query goes> GROUP BY descriptor HAVING total_complaint_count > 5000 AND ABS(corr_pct) > 0.5 AND data_count > 5 ORDER BY ABS(corr_pct) DESC
Step 3: Correlating More Quantities
Here is an SQL function that will do the same as above, but with wind speed instead of temperature:
SELECT descriptor, sum(complaint_count) as total_complaint_count, count(wind_speed) as data_count, ROUND(corr(wind_speed, avg_count),3) AS corr_count, ROUND(corr(wind_speed, avg_pct_count),3) AS corr_pct From ( SELECT avg(pct_count) as avg_pct_count, avg(day_count) as avg_count, sum(day_count) as complaint_count, descriptor, wind_speed FROM ( SELECT DATE(timestamp) AS date, wind_speed FROM demos.nyc_weather)a JOIN ( SELECT x.date, descriptor, day_count, day_count / all_calls_count as pct_count FROM (SELECT DATE(created_date) AS date, concat(complaint_type, ": ", descriptor) as descriptor, COUNT(*) AS day_count FROM `bigquery-public-data.new_york.311_service_requests` GROUP BY date, descriptor)x JOIN ( SELECT DATE(created_date) AS date, COUNT(*) AS all_calls_count FROM `bigquery-public-data.new_york.311_service_requests` GROUP BY date )y ON x.date=y.date )b ON a.date = b.date GROUP BY descriptor, wind_speed ) GROUP BY descriptor HAVING total_complaint_count > 5000 AND ABS(corr_pct) > 0.5 AND data_count > 10 ORDER BY ABS(corr_pct) DESC