Google Cloud/Weather Data in BigQery
From charlesreid1
Main link: https://google.qwiklabs.com/focuses/5689?locale=en
Qwiklab
Summary
This lab combines public data sets from NOAA and 311 in NYC. The end goal is to create a model for correlations between number of calls and weather.
Aspects of Google Cloud highlighted for scientists: serverless, easy to use, scalable, sharable
Steps
Overview of the basic steps:
- Set up a Google Cloud account (follow qwiklabs instructions, or use own)
- Explore weather data and 311 data in BigQuery
- Find correlation between weather and complaints
Setting Up Google Cloud Account
Won't cover this here. Use a free trial or use the Qwiklab-provided account.
Exploring Data
To enable the NOAA weather data set, navigate to the BigQuery Console and pick your project.
Now on the left, where the project name is listed, there is a triangle drop-down button next to it.
Click that and pick "Switch to project" and "Display project..." to display the weather data set.
Before running SQL queries, check to make sure you are running SQL queries in legacy mode.
Weather Data
Here is a sample SQL query:
SELECT
-- Create a timestamp from the date components.
stn,
TIMESTAMP(CONCAT(year,"-",mo,"-",da)) AS timestamp,
-- Replace numerical null values with actual null
AVG(IF (temp=9999.9,
null,
temp)) AS temperature,
AVG(IF (wdsp="999.9",
null,
CAST(wdsp AS Float64))) AS wind_speed,
AVG(IF (prcp=99.99,
0,
prcp)) AS precipitation
FROM
`bigquery-public-data.noaa_gsod.gsod20*`
WHERE
CAST(YEAR AS INT64) > 2010
AND CAST(MO AS INT64) = 6
AND CAST(DA AS INT64) = 12
AND (stn="725030" OR -- La Guardia
stn="744860") -- JFK
GROUP BY
stn,
timestamp
ORDER BY
timestamp DESC,
stn ASC
Let's break it down to understand what it is doing...
- First, we are selecting a timestamp, plus averages of 3 fields, for each station that is returned by our query. That's the bulk of the SELECT statement.
- The FROM part of the query indicates which table we're taking data from - the range of our historical data.
- We want to limit the data to a particular date, so we limit to each June 12 after 2010.
- Finally, we limit the results to the LaGuardia and JFK airport stations.
So, what does this query do? It plots weather conditions at NYC's two airports for June 12, year over year, starting in 2011
311 Data
To query the NYC 311 data, which is what the original qwiklab was analyzing, we can use the following query:
SELECT
EXTRACT(YEAR
FROM
created_date) AS year,
complaint_type,
COUNT(1) AS num_complaints
FROM
`bigquery-public-data.new_york.311_service_requests`
GROUP BY
year,
complaint_type
ORDER BY
num_complaints DESC
What does this query do? It is asking for information about 311 complaints - specifically, it is counting the number of complaints per year, grouped by complaint type.
- The SELECT statement extracts the year from the timestamp when the complaint was created
- The FROM statement specifies that we want to use NYC 311 data
- GROUP BY means we group the results according to these two dimensions - year and complaint_type
- We order the results by number of complaints, largest to smallest.
Searching for Correlations
For the second part of the lab, create a new table to store results.
Click the triangular drop down arrow next to project name, and pick "Create new dataset".
Click "Compose New Query". Can call it nyc_weather or something.
Step 1: Construct Weather Data Table
Step 1 will be constructing a lookup table of dates and weather characteristics. We will construct a query of the NOAA weather data, and the results of this query will populate the table nyc_weather (through the magic of BigQuery).
SELECT
-- Create a timestamp from the date components.
timestamp(concat(year,"-",mo,"-",da)) as timestamp,
-- Replace numerical null values with actual nulls
AVG(IF (temp=9999.9, null, temp)) AS temperature,
AVG(IF (visib=999.9, null, visib)) AS visibility,
AVG(IF (wdsp="999.9", null, CAST(wdsp AS Float64))) AS wind_speed,
AVG(IF (gust=999.9, null, gust)) AS wind_gust,
AVG(IF (prcp=99.99, null, prcp)) AS precipitation,
AVG(IF (sndp=999.9, null, sndp)) AS snow_depth
FROM
`bigquery-public-data.noaa_gsod.gsod20*`
WHERE
CAST(YEAR AS INT64) > 2008
AND (stn="725030" OR -- La Guardia
stn="744860") -- JFK
GROUP BY timestamp
Again, we're looking at the two airport weather stations.
- The SELECT statement extracts the year, month, and day as our timestamp. It also asks for the average of several wind, precipitation, and temperature variables over observations for that day.
- The FROM statement just indicates we are looking at weather data
- The WHERE forces the query to look at data more recent than 2008 and from JFK or LaGuardia airport weather stations only.
Now the nyc_weather table has a more convenient intermediate query result, which is a list of timestamps and the corresponding weather that day.
Now we can cross-reference this from 311 results, and do so more easily than before.
Step 2: Cross-Reference Weather and 311 Results
The final step is the hardest - we have a long query to run. Let's review what the query should do first.
We already have a list of timestamps and the corresponding weather, so we want to extract the timestamp from 311 calls (in the same format), group them by category, count them by unique days, and look up weather data for each of those groups.
This will give us a bucket of two-tuples: (weather data, number of complaints).
These two-tuples are then plotted and fitted to give a model of number of complaints vs. weather feature.
On to the full query:
SELECT
descriptor,
sum(complaint_count) as total_complaint_count,
count(temperature) as data_count,
ROUND(corr(temperature, avg_count),3) AS corr_count,
ROUND(corr(temperature, avg_pct_count),3) AS corr_pct
From (
SELECT
avg(pct_count) as avg_pct_count,
avg(day_count) as avg_count,
sum(day_count) as complaint_count,
descriptor,
temperature
FROM (
SELECT
DATE(timestamp) AS date,
temperature
FROM
demos.nyc_weather) a
JOIN (
SELECT x.date, descriptor, day_count, day_count / all_calls_count as pct_count
FROM
(SELECT
DATE(created_date) AS date,
concat(complaint_type, ": ", descriptor) as descriptor,
COUNT(*) AS day_count
FROM
`bigquery-public-data.new_york.311_service_requests`
GROUP BY
date,
descriptor)x
JOIN (
SELECT
DATE(created_date) AS date,
COUNT(*) AS all_calls_count
FROM `bigquery-public-data.new_york.311_service_requests`
GROUP BY date
)y
ON x.date=y.date
)b
ON
a.date = b.date
GROUP BY
descriptor,
temperature
)
GROUP BY descriptor
HAVING
total_complaint_count > 5000 AND
ABS(corr_pct) > 0.5 AND
data_count > 5
ORDER BY
ABS(corr_pct) DESC
Keep in mind this is a complicated SQL query, so we'll break it down in parts.
Breakdown
Let's start with the innermost query, which is a join of two queries:
SELECT x.date, descriptor, day_count, day_count / all_calls_count as pct_count
FROM
(SELECT
DATE(created_date) AS date,
concat(complaint_type, ": ", descriptor) as descriptor,
COUNT(*) AS day_count
FROM
`bigquery-public-data.new_york.311_service_requests`
GROUP BY
date,
descriptor)x
JOIN (
SELECT
DATE(created_date) AS date,
COUNT(*) AS all_calls_count
FROM `bigquery-public-data.new_york.311_service_requests`
GROUP BY date
)y
ON x.date=y.date
The first SELECT query is getting the date of each complaint matching a certain type of complaint, grouping them by day, and counting complaints of that type on a given date.
The second SELECT query is getting the date of each complaint (for ALL complaints), and counting all complaints by date. This gives a total number of complaints on each day, which gives a relative measure of how many 311 calls were a particular type of complaint.
Lastly, these quantities are joined on the date field, which we expect to be exactly the same between the two.
Now, all of this is wrapped in a JOIN statement, which means we're taking the result of the query we just described above (which returns percentages for each complaint category) and joining it to another query result. Here's the larger query context:
FROM (
SELECT
DATE(timestamp) AS date,
temperature
FROM
demos.nyc_weather)a
JOIN (
<this is where the prior query goes>
)y
ON x.date=y.date
)b
ON
a.date = b.date
So we are getting a list of dates, types of complaints, total complaint counts, percent complaint counts, and the temperature, all in this portion of the query.
These are then joined on the date, again, like the two 311 queries (essentially, we are joining the results of the two joined 311 queries with the results of the NOAA weather query). That's the "a" and "b" join.
But wait! We're not done yet! (In fact, it may have been wiser to split this query up into more sub-steps...)
From the results of this, we do one more step, and extract the following (this is the very beginning of the query):
SELECT
descriptor,
sum(complaint_count) as total_complaint_count,
count(temperature) as data_count,
ROUND(corr(temperature, avg_count),3) AS corr_count,
ROUND(corr(temperature, avg_pct_count),3) AS corr_pct
From (
<this is where the prior query goes>
GROUP BY descriptor
HAVING
total_complaint_count > 5000 AND
ABS(corr_pct) > 0.5 AND
data_count > 5
ORDER BY
ABS(corr_pct) DESC
Step 3: Correlating More Quantities
Here is an SQL function that will do the same as above, but with wind speed instead of temperature:
SELECT
descriptor,
sum(complaint_count) as total_complaint_count,
count(wind_speed) as data_count,
ROUND(corr(wind_speed, avg_count),3) AS corr_count,
ROUND(corr(wind_speed, avg_pct_count),3) AS corr_pct
From (
SELECT
avg(pct_count) as avg_pct_count,
avg(day_count) as avg_count,
sum(day_count) as complaint_count,
descriptor,
wind_speed
FROM (
SELECT
DATE(timestamp) AS date,
wind_speed
FROM
demos.nyc_weather)a
JOIN (
SELECT x.date, descriptor, day_count, day_count / all_calls_count as pct_count
FROM
(SELECT
DATE(created_date) AS date,
concat(complaint_type, ": ", descriptor) as descriptor,
COUNT(*) AS day_count
FROM
`bigquery-public-data.new_york.311_service_requests`
GROUP BY
date,
descriptor)x
JOIN (
SELECT
DATE(created_date) AS date,
COUNT(*) AS all_calls_count
FROM `bigquery-public-data.new_york.311_service_requests`
GROUP BY date
)y
ON x.date=y.date
)b
ON
a.date = b.date
GROUP BY
descriptor,
wind_speed
)
GROUP BY descriptor
HAVING
total_complaint_count > 5000 AND
ABS(corr_pct) > 0.5 AND
data_count > 10
ORDER BY
ABS(corr_pct) DESC