Google Cloud
From charlesreid1
Notes for the Google Cloud Data Engineer (GCDE) certification. See GCDE.
Links:
- Certification info: https://cloud.google.com/certification/data-engineer
- Sample case study: https://cloud.google.com/certification/guides/data-engineer/casestudy-flowlogistic
- Tutorials/Guides/Resources for all of Google Cloud: https://cloud.google.com/solutions/
Case Study
The GCDE certification page gives an example case study that shows how different parts of the Google Cloud platform come together in the kind of scenario a real company might face. The case study focuses on a logistics company that delivers packages and tracks the deliveries with servers, software, and other infrastructure already in place. The company's goals are to improve its computational infrastructure by moving parts of it to the cloud and to add the ability to predict late shipments.
Google Cloud Services
Notes on the parts of the Google Cloud platform and the services available on it.
Introduction
Google Cloud for Big Data
- MapReduce - can use Dataflow
- Spark - can use Dataproc
- BigQuery
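As a reminder of what the MapReduce model looks like, here is a minimal single-process sketch in plain Python. It does not use Dataflow or any Google Cloud API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for illustration, and a real service would distribute each phase across many workers.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases into a word count."""
    return reduce_phase(shuffle_phase(map_phase(lines)))
```

Calling `word_count(["a b a", "b c"])` counts each word across both lines.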
Usage scenarios
Foundations
Compute and Storage
Data ingestion
Data storage
Federated analysis
Compute Engine
Cloud Storage
Data Analytics
Cloud SQL - relational database
Dataproc for machine learning
Bigtop ecosystem:
- Pig
- Spark
- Hive
- Hadoop
Data Storage
Choosing a storage option: https://cloud.google.com/storage-options/
Data warehousing:
- Bigtable - low-latency, updatable data warehouse solution (wide-column NoSQL); suits data that is not highly structured and does not need ACID transactions
- BigQuery - petabyte-scale, structured, columnar, SQL-queryable data warehouse solution
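To see why a column-oriented layout suits analytical queries, the sketch below pivots a few row-oriented records into per-column lists in plain Python. The shipment records and field names are invented for illustration, and this says nothing about how BigQuery stores data internally; it only shows the general idea that an aggregate over one field can read one list instead of every record.

```python
# Hypothetical shipment records in row-oriented form (one dict per record).
rows = [
    {"shipment_id": 1, "region": "west", "delay_minutes": 12},
    {"shipment_id": 2, "region": "east", "delay_minutes": 0},
    {"shipment_id": 3, "region": "west", "delay_minutes": 45},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field reads a single list
# instead of visiting every field of every record.
delays = columns["delay_minutes"]
average_delay = sum(delays) / len(delays)
```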
Data storage:
- Cloud Storage - unstructured data (documents, sound files, PDFs, etc.)
- Cloud Datastore - non-relational (NoSQL), highly scalable storage solution; SQL-like query language; queries are more restrictive because they are optimized for speed; supports ACID transactions
- Cloud SQL - relational database with full SQL support for online transaction processing (OLTP)
- Cloud Spanner - (horizontally sharded SQL) fully managed, mission-critical relational OLTP database that can scale horizontally to hundreds or thousands of servers to handle high transaction workloads; supports ACID transactions
Writeup of Spanner: https://quizlet.com/blog/quizlet-cloud-spanner
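The criteria in the lists above can be condensed into a small decision helper. This is only a rough encoding of these notes, not Google's official decision tree (see the storage-options link for that); the function name `suggest_storage` and its boolean parameters are assumptions made for this sketch.

```python
def suggest_storage(structured, relational=False, analytics=False,
                    low_latency_updates=False, horizontal_scale=False):
    """Return a storage product name based on the criteria in these notes.

    All parameters are hypothetical flags describing the workload.
    """
    if not structured:
        # Documents, sound files, PDFs, and similar blobs.
        return "Cloud Storage"
    if analytics:
        # Warehousing: Bigtable for low-latency updatable data,
        # BigQuery for SQL queries over very large structured data.
        return "Bigtable" if low_latency_updates else "BigQuery"
    if relational:
        # OLTP: Spanner when the workload must scale horizontally.
        return "Cloud Spanner" if horizontal_scale else "Cloud SQL"
    # Structured, non-relational, transactional data.
    return "Cloud Datastore"
```

For example, `suggest_storage(structured=True, relational=True, horizontal_scale=True)` returns `"Cloud Spanner"`. Real choices also depend on cost, latency targets, and data volume, which this sketch ignores.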
Scaling Data Analysis
(Transformational use cases)
Datalab
Datastore
Bigtable (fast random access, tradeoffs between consistency and availability)
BigQuery (query petabytes in seconds)
TensorFlow (distributed in the cloud over very large data sets)
Demand forecasting with machine learning
Data Processing Architectures
Pub/Sub (messaging architecture)
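The publish/subscribe pattern itself can be illustrated with a toy in-memory broker. This sketch is not the Google Cloud Pub/Sub client API: the `Broker` class and its methods are hypothetical, and delivery here is synchronous and in-process, whereas a real messaging service delivers asynchronously and persists messages. The point is only that publishers and subscribers share a topic name and never reference each other directly.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory broker: publishers and subscribers only share topic names."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a callback to receive every message published to the topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver the message to each subscriber; return how many received it."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


broker = Broker()
received = []
broker.subscribe("shipments", received.append)
broker.subscribe("shipments", lambda m: received.append(m.upper()))
delivered = broker.publish("shipments", "package 42 scanned")
```

Both subscribers receive the same message, and publishing to a topic with no subscribers simply delivers to nobody.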
Dataflow (a way to execute code that processes streaming and batch data in similar ways)
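The idea of treating batch and streaming data alike can be sketched in plain Python by writing one transform that accepts any iterable. Dataflow pipelines are written with Apache Beam, which this sketch does not use; the `flag_late` function, the 30-minute threshold, and the event fields are all invented for illustration.

```python
def flag_late(events, threshold_minutes=30):
    """Yield the IDs of shipments whose delay exceeds the threshold.

    Works on any iterable, whether it is a finished list or a live feed.
    """
    for event in events:
        if event["delay_minutes"] > threshold_minutes:
            yield event["shipment_id"]


# Batch: a bounded collection that already exists in full.
batch = [
    {"shipment_id": "A", "delay_minutes": 5},
    {"shipment_id": "B", "delay_minutes": 50},
]
batch_result = list(flag_late(batch))


# Streaming: events arrive one at a time (simulated here with a generator).
def event_stream():
    yield {"shipment_id": "C", "delay_minutes": 90}
    yield {"shipment_id": "D", "delay_minutes": 10}


stream_result = list(flag_late(event_stream()))
```

The same `flag_late` logic runs unchanged over both inputs; only the source of the events differs.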