From charlesreid1

Main link: https://google.qwiklabs.com/focuses/5689?locale=en

Qwiklab

Summary

This lab combines public data sets from NOAA and 311 in NYC. The end goal is to create a model for correlations between number of calls and weather.

Aspects of Google Cloud highlighted for scientists: serverless, easy to use, scalable, sharable

Steps

Overview of the basic steps:

  • Set up a Google Cloud account (follow qwiklabs instructions, or use own)
  • Explore weather data and 311 data in BigQuery
  • Find correlation between weather and complaints

Setting Up Google Cloud Account

Won't cover this here. Use a free trial or use the Qwiklab-provided account.

Exploring Data

To enable the NOAA weather data set, navigate to the BigQuery Console and pick your project.

Now on the left, where the project name is listed, there is a triangle drop-down button next to it.

Click that and pick "Switch to project" and "Display project..." to display the weather data set.

Before running SQL queries, check to make sure you are running SQL queries in legacy mode.

Weather Data

Here is a sample SQL query:

SELECT
  -- Create a timestamp from the date components.
  stn,
  TIMESTAMP(CONCAT(year,"-",mo,"-",da)) AS timestamp,
  -- Replace numerical null values with actual null
  AVG(IF (temp=9999.9,
      null,
      temp)) AS temperature,
  AVG(IF (wdsp="999.9",
      null,
      CAST(wdsp AS Float64))) AS wind_speed,
  AVG(IF (prcp=99.99,
      0,
      prcp)) AS precipitation
FROM
  `bigquery-public-data.noaa_gsod.gsod20*`
WHERE
  CAST(YEAR AS INT64) > 2010
  AND CAST(MO AS INT64) = 6
  AND CAST(DA AS INT64) = 12
  AND (stn="725030" OR  -- La Guardia
    stn="744860")    -- JFK
GROUP BY
  stn,
  timestamp
ORDER BY
  timestamp DESC,
  stn ASC

Let's break it down to understand what it is doing...

  • First, we are selecting a timestamp, plus averages of 3 fields, for each station that is returned by our query. That's the bulk of the SELECT statement.
  • The FROM part of the query indicates which table we're taking data from - the range of our historical data.
  • We want to limit the data to a particular date, so we limit to each June 12 after 2010.
  • Finally, we limit the results to the LaGuardia and JFK airport stations.

So, what does this query do? It plots weather conditions at NYC's two airports for June 12, year over year, starting in 2011


311 Data

To query the NYC 311 data, which is what the original qwiklab was analyzing, we can use the following query:

SELECT
  EXTRACT(YEAR
  FROM
    created_date) AS year,
  complaint_type,
  COUNT(1) AS num_complaints
FROM
  `bigquery-public-data.new_york.311_service_requests`
GROUP BY
  year,
  complaint_type
ORDER BY
  num_complaints DESC

What does this query do? It is asking for information about 311 complaints - specifically, it is counting the number of complaints per year, grouped by complaint type.

  • The SELECT statement extracts the year from the timestamp when the complaint was created
  • The FROM statement specifies that we want to use NYC 311 data
  • GROUP BY means we group the results according to these two dimensions - year and complaint_type
  • We order the results by number of complaints, largest to smallest.


Searching for Correlations

For the second part of the lab, create a new table to store results.

Click the triangular drop down arrow next to project name, and pick "Create new dataset".

Click "Compose New Query". Can call it nyc_weather or something.

Step 1: Construct Weather Data Table

Step 1 will be constructing a lookup table of dates and weather characteristics. We will construct a query of the NOAA weather data, and the results of this query will populate the table nyc_weather (through the magic of BigQuery).

SELECT
  -- Create a timestamp from the date components.
  timestamp(concat(year,"-",mo,"-",da)) as timestamp,
  -- Replace numerical null values with actual nulls
  AVG(IF (temp=9999.9, null, temp)) AS temperature,
  AVG(IF (visib=999.9, null, visib)) AS visibility,
  AVG(IF (wdsp="999.9", null, CAST(wdsp AS Float64))) AS wind_speed,
  AVG(IF (gust=999.9, null, gust)) AS wind_gust,
  AVG(IF (prcp=99.99, null, prcp)) AS precipitation,
  AVG(IF (sndp=999.9, null, sndp)) AS snow_depth

FROM
  `bigquery-public-data.noaa_gsod.gsod20*`

WHERE
  CAST(YEAR AS INT64) > 2008
  AND (stn="725030" OR  -- La Guardia
       stn="744860")    -- JFK
GROUP BY timestamp

Again, we're looking at the two airport weather stations.

  • The SELECT statement extracts the year, month, and day as our timestamp. It also asks for the average of several wind, precipitation, and temperature variables over observations for that day.
  • The FROM statement just indicates we are looking at weather data
  • The WHERE forces the query to look at data more recent than 2008 and from JFK or LaGuardia airport weather stations only.

Now the nyc_weather table has a more convenient intermediate query result, which is a list of timestamps and the corresponding weather that day.

Now we can cross-reference this from 311 results, and do so more easily than before.

Step 2: Cross-Reference Weather and 311 Results

The final step is the hardest - we have a long query to run. Let's review what the query should do first.

We already have a list of timestamps and the corresponding weather, so we want to extract the timestamp from 311 calls (in the same format), group them by category, count them by unique days, and look up weather data for each of those groups.

This will give us a bucket of two-tuples: (weather data, number of complaints).

These two-tuples are then plotted and fitted to give a model of number of complaints vs. weather feature.

On to the full query:

SELECT
  descriptor,
  sum(complaint_count) as total_complaint_count,
  count(temperature) as data_count,
  ROUND(corr(temperature, avg_count),3) AS corr_count,
  ROUND(corr(temperature, avg_pct_count),3) AS corr_pct
From (
SELECT
  avg(pct_count) as avg_pct_count,
  avg(day_count) as avg_count,
  sum(day_count) as complaint_count,
  descriptor,
  temperature
FROM (
  SELECT
    DATE(timestamp) AS date,
    temperature
  FROM
    demos.nyc_weather) a
  JOIN (
  SELECT x.date, descriptor, day_count, day_count / all_calls_count as pct_count
  FROM
    (SELECT
      DATE(created_date) AS date,
      concat(complaint_type, ": ", descriptor) as descriptor,
      COUNT(*) AS day_count
    FROM
      `bigquery-public-data.new_york.311_service_requests` 
    GROUP BY
      date,
      descriptor)x 
    JOIN (
      SELECT
        DATE(created_date) AS date,
        COUNT(*) AS all_calls_count
      FROM `bigquery-public-data.new_york.311_service_requests` 
      GROUP BY date
    )y
  ON x.date=y.date
)b
ON
  a.date = b.date
GROUP BY
  descriptor,
  temperature
)
GROUP BY descriptor
HAVING 
  total_complaint_count > 5000 AND 
  ABS(corr_pct) > 0.5 AND
  data_count > 5
ORDER BY
  ABS(corr_pct) DESC

Keep in mind this is a complicated SQL query, so we'll break it down in parts.

Breakdown

Let's start with the innermost query, which is a join of two queries:

  SELECT x.date, descriptor, day_count, day_count / all_calls_count as pct_count
  FROM
    (SELECT
      DATE(created_date) AS date,
      concat(complaint_type, ": ", descriptor) as descriptor,
      COUNT(*) AS day_count
    FROM
      `bigquery-public-data.new_york.311_service_requests` 
    GROUP BY
      date,
      descriptor)x 
    JOIN (
      SELECT
        DATE(created_date) AS date,
        COUNT(*) AS all_calls_count
      FROM `bigquery-public-data.new_york.311_service_requests` 
      GROUP BY date
    )y
  ON x.date=y.date

The first SELECT query is getting the date of each complaint matching a certain type of complaint, grouping them by day, and counting complaints of that type on a given date.

The second SELECT query is getting the date of each complaint (for ALL complaints), and counting all complaints by date. This gives a total number of complaints on each day, which gives a relative measure of how many 311 calls were a particular type of complaint.

Lastly, these quantities are joined on the date field, which we expect to be exactly the same between the two.

Now, all of this is wrapped in a JOIN statement, which means we're taking the result of the query we just described above (which returns percentages for each complaint category) and joining it to another query result. Here's the larger query context:

FROM (
  SELECT
    DATE(timestamp) AS date,
    temperature
  FROM
    demos.nyc_weather)a
  JOIN (
    
    <this is where the prior query goes>

    )y
  ON x.date=y.date
)b
ON
  a.date = b.date

So we are getting a list of dates, types of complaints, total complaint counts, percent complaint counts, and the temperature, all in this portion of the query.

These are then joined on the date, again, like the two 311 queries (essentially, we are joining the results of the two joined 311 queries with the results of the NOAA weather query). That's the "a" and "b" join.

But wait! We're not done yet! (In fact, it may have been wiser to split this query up into more sub-steps...)

From the results of this, we do one more step, and extract the following (this is the very beginning of the query):

SELECT
  descriptor,
  sum(complaint_count) as total_complaint_count,
  count(temperature) as data_count,
  ROUND(corr(temperature, avg_count),3) AS corr_count,
  ROUND(corr(temperature, avg_pct_count),3) AS corr_pct
From (

    <this is where the prior query goes>

GROUP BY descriptor
HAVING 
  total_complaint_count > 5000 AND 
  ABS(corr_pct) > 0.5 AND
  data_count > 5
ORDER BY
  ABS(corr_pct) DESC

Step 3: Correlating More Quantities

Here is an SQL function that will do the same as above, but with wind speed instead of temperature:

SELECT
  descriptor,
  sum(complaint_count) as total_complaint_count,
  count(wind_speed) as data_count,
  ROUND(corr(wind_speed, avg_count),3) AS corr_count,
  ROUND(corr(wind_speed, avg_pct_count),3) AS corr_pct
From (
SELECT
  avg(pct_count) as avg_pct_count,
  avg(day_count) as avg_count,
  sum(day_count) as complaint_count,
  descriptor,
  wind_speed
FROM (
  SELECT
    DATE(timestamp) AS date,
    wind_speed
  FROM
    demos.nyc_weather)a
  JOIN (
  SELECT x.date, descriptor, day_count, day_count / all_calls_count as pct_count
  FROM
    (SELECT
      DATE(created_date) AS date,
      concat(complaint_type, ": ", descriptor) as descriptor,
      COUNT(*) AS day_count
    FROM
      `bigquery-public-data.new_york.311_service_requests` 
    GROUP BY
      date,
      descriptor)x 
    JOIN (
      SELECT
        DATE(created_date) AS date,
        COUNT(*) AS all_calls_count
      FROM `bigquery-public-data.new_york.311_service_requests` 
      GROUP BY date
    )y
  ON x.date=y.date
)b
ON
  a.date = b.date
GROUP BY
  descriptor,
  wind_speed
)
GROUP BY descriptor
HAVING 
  total_complaint_count > 5000 AND 
  ABS(corr_pct) > 0.5 AND
  data_count > 10
ORDER BY
  ABS(corr_pct) DESC

Flags