From charlesreid1

What is it

BigQuery is a serverless data warehouse from Google Cloud. It provides petabyte-scale, column-based storage that can be queried using SQL, with query latency on the order of seconds. This makes it a flexible warehouse that can serve as a source or a sink for all manner of data pipelines.

History

Google's BigQuery technology began when an engineer was trying to run queries on log data that had thousands of columns. They were only searching for or processing a few columns, but because the SQL database they were using stored entire records together, each query had to process the entire row for each record. They were inspired to write a program that stored data in a column-major format rather than a row-major format, so that a query only had to read the columns it actually used. That tool later evolved into BigQuery.
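To make the row-major versus column-major difference concrete, here is a toy sketch in plain Python. It is not how BigQuery stores data; the record fields (`timestamp`, `status`, `latency_ms`, `path`) and the "values touched" counter are made up for illustration, and the counter only models a store that has to read a whole record to reach one field.

```python
# Toy log records: each record has several fields, but the query needs one.
records = [
    {"timestamp": 1, "status": 200, "latency_ms": 12, "path": "/a"},
    {"timestamp": 2, "status": 500, "latency_ms": 48, "path": "/b"},
    {"timestamp": 3, "status": 200, "latency_ms": 20, "path": "/a"},
]
field_names = ["timestamp", "status", "latency_ms", "path"]

# Row-major layout: one tuple per record, all fields stored together.
row_store = [tuple(r[f] for f in field_names) for r in records]

# Column-major layout: one list per field, all values of a field together.
column_store = {f: [r[f] for r in records] for f in field_names}


def mean_latency_rows(rows):
    """Average latency from the row layout, counting every value in each row
    as 'touched' to model a store that reads whole records."""
    idx = field_names.index("latency_ms")
    total = 0
    values_touched = 0
    for row in rows:
        values_touched += len(row)
        total += row[idx]
    return total / len(rows), values_touched


def mean_latency_columns(columns):
    """Average latency from the column layout; only one column is read."""
    col = columns["latency_ms"]
    return sum(col) / len(col), len(col)


row_mean, row_touched = mean_latency_rows(row_store)
col_mean, col_touched = mean_latency_columns(column_store)
print("row layout:   ", row_mean, "values touched:", row_touched)
print("column layout:", col_mean, "values touched:", col_touched)
```

Both layouts give the same answer, but the row layout touches every field of every record while the column layout touches only the one column the query needs. With thousands of columns, that gap is what motivated the column-major design.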

Stack

Dremel is the engine that BigQuery runs on top of. Given a SQL query, Dremel decides how to parallelize the query.
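As a rough illustration of what "parallelizing a query" means, the sketch below splits a column into shards, aggregates each shard independently, and combines the partial results. This is a toy scatter-gather in plain Python and makes no claim about how Dremel is actually implemented; the function names and the shard count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_shards(values, shard_count):
    """Split a column into contiguous shards of roughly equal size."""
    if not values:
        raise ValueError("values must not be empty")
    size = -(-len(values) // shard_count)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_aggregate(shard):
    """Work done independently on each shard: a partial sum and a count."""
    return sum(shard), len(shard)


def parallel_mean(values, shard_count=4):
    """Compute a mean by combining per-shard partial aggregates."""
    shards = split_into_shards(values, shard_count)
    with ThreadPoolExecutor(max_workers=shard_count) as pool:
        partials = list(pool.map(partial_aggregate, shards))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


print(parallel_mean(list(range(1, 101))))
```

The point is that an aggregate such as an average can be computed from small per-shard summaries, so no single worker ever needs to see the full column.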

Open Source Equivalents

A few open-source technologies that are similar, if not exactly equivalent, to BigQuery are:

  • Apache Parquet - an Apache project that provides a column-based storage format
  • Hypertable - an open-source project modeled on Google's Bigtable (a wide-column store) rather than on BigQuery itself, so it is only loosely comparable

Installing

Gcloud

BigQuery Client Library

There is a long list of client libraries for Google Cloud provided here: https://cloud.google.com/apis/docs/cloud-client-libraries

The Python API bundles each component separately, and not everything comes with the client library by default. For example, if you want to use BigQuery, you have to install the BigQuery API components. If you want to use PubSub, you have to install the PubSub API components. Installing one does not necessarily install the other.
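One way to see this separation is to check which component modules are importable in the current environment. The sketch below uses only the standard library; the module names `google.cloud.bigquery` and `google.cloud.pubsub_v1` are the import names used by the `google-cloud-bigquery` and `google-cloud-pubsub` packages, and the helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if module_name can be found, without importing it fully."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (for example "google") is missing.
        return False


# Each component ships as its own package, so these can differ.
for module_name in ("google.cloud.bigquery", "google.cloud.pubsub_v1"):
    status = "installed" if is_installed(module_name) else "missing"
    print(f"{module_name}: {status}")
```

If only the BigQuery package has been installed, the first line should report "installed" and the second "missing".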

Python API

To use BigQuery from Python, install the BigQuery client library with pip. The umbrella google-cloud package is deprecated, so install the BigQuery package directly:

$ pip3 install --upgrade google-cloud-bigquery

Link/reference: https://cloud.google.com/bigquery/docs/reference/libraries

Also see: https://github.com/GoogleCloudPlatform/google-cloud-python

Specifically: https://github.com/GoogleCloudPlatform/google-cloud-python/tree/master/bigquery

Using

Using from gcloud

Using from Python
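A minimal sketch of running a query with the `google-cloud-bigquery` client library follows. It assumes the library is installed, that Application Default Credentials and a default project are configured, and that the public table `bigquery-public-data.usa_names.usa_1910_2013` (used in Google's own quickstart examples) is still available. The helper names `build_top_names_sql` and `top_names` are made up for this example.

```python
PUBLIC_TABLE = "bigquery-public-data.usa_names.usa_1910_2013"


def build_top_names_sql(limit=10):
    """Build a standard SQL query; the state is passed as a query parameter."""
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT name, SUM(number) AS total\n"
        f"FROM `{PUBLIC_TABLE}`\n"
        "WHERE state = @state\n"
        "GROUP BY name\n"
        "ORDER BY total DESC\n"
        f"LIMIT {limit}"
    )


def top_names(state="TX", limit=10):
    """Run the query and return a list of (name, total) tuples."""
    # Imported here so the SQL helper above works without the library.
    from google.cloud import bigquery

    client = bigquery.Client()  # picks up default credentials and project
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("state", "STRING", state),
        ]
    )
    query_job = client.query(build_top_names_sql(limit), job_config=job_config)
    return [(row["name"], row["total"]) for row in query_job.result()]


print(build_top_names_sql(5))
```

Calling `top_names("TX", 5)` sends the query to BigQuery and blocks until the results are available. Passing the state as a query parameter, rather than formatting it into the SQL string, avoids quoting problems. Queries are billed by the amount of data they scan, so keeping the selected columns to a minimum matters.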

Resources

Google Cloud Podcast episode about BigQuery: https://www.gcppodcast.com/post/episode-94-big-query-under-the-hood-with-tino-tereshko-and-jordan-tigani/

BigQuery launch checklist: https://cloud.google.com/bigquery/launch-checklist

BigQuery pricing: https://cloud.google.com/bigquery/pricing

BigQuery pricing calculator: https://cloud.google.com/products/calculator/
