Google Cloud/BigQuery
From charlesreid1
What is it
BigQuery is a serverless data warehouse from Google Cloud. It provides petabyte-scale, column-based storage that can be queried using SQL, with query latency on the order of seconds. This makes it a flexible warehouse that can serve as a source or a sink for all manner of data pipelines.
History
Google's BigQuery technology began when an engineer was trying to run queries on log data that had thousands of columns. Each query searched or processed only a few columns, but because the SQL database in use stored entire records together, every query had to process the entire row for each record. This inspired a program that stored data in a column-major format rather than a row-major format, so that a query reads only the columns it needs. That tool later evolved into BigQuery.
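To make the row-major versus column-major distinction concrete, here is a minimal sketch in plain Python. The toy log table, the function names, and the "cells read" counter are all made up for illustration; the counter is only a rough model of I/O and says nothing about how BigQuery or Dremel actually lay out data.

```python
# Toy log table: 3 records, 4 columns (real log tables may have thousands).
rows = [
    {"ts": 1, "host": "a", "status": 200, "bytes": 512},
    {"ts": 2, "host": "b", "status": 500, "bytes": 128},
    {"ts": 3, "host": "a", "status": 200, "bytes": 256},
]


def to_columns(records):
    """Pivot row-major records into a column-major dict of lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_major(records, name):
    """Sum one field; each whole record is loaded, so every cell counts as read."""
    cells_read = sum(len(record) for record in records)
    total = sum(record[name] for record in records)
    return total, cells_read


def sum_column_major(columns, name):
    """Sum one field; only that column's values are read."""
    values = columns[name]
    return sum(values), len(values)


columns = to_columns(rows)
row_total, row_cells = sum_row_major(rows, "bytes")
col_total, col_cells = sum_column_major(columns, "bytes")
print(row_total, row_cells)  # 896 12
print(col_total, col_cells)  # 896 3
```

Both layouts give the same answer, but the column-major version touches 3 cells instead of 12. With thousands of columns and only a few referenced per query, that gap is what motivated the column-based design.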
Stack
Dremel is the query engine that BigQuery runs on top of. Given a SQL query, Dremel decides how to parallelize it.
Open Source Equivalents
A few open-source technologies that are similar, if not exactly equivalent, to BigQuery are:
- Apache Parquet - an Apache project that provides a column-based storage format
- Hypertable - an open-source project modeled on Google's Bigtable design, so it is closer to a distributed storage system than to a BigQuery-like warehouse
Installing
Gcloud
BigQuery Client Library
There is a long list of client libraries for Google Cloud provided here: https://cloud.google.com/apis/docs/cloud-client-libraries
The Python API bundles each component separately, and not everything comes with the client library by default. For example, if you want to use BigQuery, you have to install the BigQuery API components. If you want to use PubSub, you have to install the PubSub API components. Installing one does not necessarily install the other.
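Because each component is installed separately, it can be handy to check which ones are present in the current environment. The following is a small sketch using only the standard library; the helper name `is_installed` is made up for illustration, and the module names `google.cloud.bigquery` and `google.cloud.pubsub_v1` are the import names I would expect for the BigQuery and PubSub packages, so adjust them if your versions differ.

```python
import importlib.util


def is_installed(module_name):
    """Return True if module_name can be located in this environment."""
    try:
        # For dotted names, find_spec imports the parent packages first.
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. "google") is itself missing.
        return False


# Assumed import names for the separately installed components.
for module_name in ("google.cloud.bigquery", "google.cloud.pubsub_v1"):
    status = "installed" if is_installed(module_name) else "missing"
    print(module_name, status)
```

If one line reports "missing" while the other reports "installed", that is the behavior described above: installing one component does not pull in the other.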
Python API
To use BigQuery from Python, you need to install the Google Cloud Python API, plus BigQuery bindings. Use pip:
$ pip3 install --upgrade google-cloud
$ pip3 install --upgrade google-cloud-bigquery
Link/reference: https://cloud.google.com/bigquery/docs/reference/libraries
Also see: https://github.com/GoogleCloudPlatform/google-cloud-python
Specifically: https://github.com/GoogleCloudPlatform/google-cloud-python/tree/master/bigquery
Using
Using from gcloud
Using from Python
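As a starting point, here is a minimal sketch of running a SQL query with the `google-cloud-bigquery` client library. It assumes the library is installed, that application default credentials are configured, and that the public table `bigquery-public-data.usa_names.usa_1910_2013` is still available. The function names `make_client` and `top_names` are made up for illustration.

```python
def make_client():
    """Create a BigQuery client using application default credentials."""
    # Imported lazily so this file still loads if the package is missing.
    from google.cloud import bigquery

    return bigquery.Client()


def top_names(client, limit=5):
    """Run an aggregate query and return a list of (name, total) tuples."""
    # Assumption: this public dataset exists and has `name` and `number` columns.
    sql = f"""
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        GROUP BY name
        ORDER BY total DESC
        LIMIT {int(limit)}
    """
    query_job = client.query(sql)  # submits the query as a job
    # result() blocks until the job finishes, then yields rows.
    return [(row["name"], row["total"]) for row in query_job.result()]
```

With credentials in place, `top_names(make_client())` should return the most common names in the table. Passing the client in as an argument keeps the query logic easy to test with a stand-in object. Note that queries are billed by the amount of data scanned, so check the pricing link under Resources before running anything large.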
Resources
Google Cloud Podcast episode about BigQuery: https://www.gcppodcast.com/post/episode-94-big-query-under-the-hood-with-tino-tereshko-and-jordan-tigani/
BigQuery launch checklist: https://cloud.google.com/bigquery/launch-checklist
BigQuery pricing: https://cloud.google.com/bigquery/pricing
BigQuery pricing calculator: https://cloud.google.com/products/calculator/