From charlesreid1

Revision as of 20:39, 13 October 2017 by Admin (talk | contribs) (→‎Dataproc)

Data Engineering Scenarios

In line with the data-engineering-scenarios GitHub organization that I created (https://github.com/data-engineering-scenarios), this page contains notes on different scenarios, both finished and planned.

These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:

Dataproc

This is the "classic" big data technology - distributed computing on clusters.

Google Cloud product:

  • Dataproc - allocate clusters, run jobs

Amazon product:

  • Amazon EMR - allocate clusters, run jobs (the managed counterpart to Dataproc; the cluster nodes themselves are EC2 instances)

Hadoop ecosystem:

  • Hadoop - the big data technology that started it all; stores data in HDFS and processes it in parallel across nodes using the MapReduce framework
  • Pig - works with Hadoop; a higher-level scripting language (Pig Latin) that compiles to MapReduce jobs, so jobs take far less code to write
  • Hive - data warehouse that sits on Hadoop; gives an SQL-like interface (HiveQL) to query data. (Queries are classically compiled into MapReduce jobs)
  • HBase - Java-based non-relational database, modeled on Google's Bigtable; runs on top of Hadoop (HDFS), can serve as a source/sink for MapReduce jobs, and is a column-oriented key-value store; no native SQL - data is accessed through the client API (get/put/scan) or MapReduce
  • Phoenix - an SQL layer on top of HBase; lets you query the non-relational, non-SQL store with SQL
  • Parquet - columnar file format for storing tables in the Hadoop ecosystem
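
To make the MapReduce model behind Hadoop, Pig, and Hive more concrete, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration, and a real cluster would run the map and reduce steps on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (done by the framework on a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling word_count on a small list of strings returns a dictionary of word totals. Higher-level tools such as Pig and Hive generate this kind of map/shuffle/reduce pipeline from a script or query instead of having you write each phase by hand.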

Spark technologies:

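  • Spark - similar to Hadoop MapReduce, but keeps intermediate data in memory where it can, which makes iterative and interactive computation much faster
  • PySpark - Python bindings for Spark (which is itself written in Scala and runs on the JVM)
  • Spark SQL - allows SQL queries in Spark programs, e.g., running an SQL query on a Hive table and passing the results to further Spark computations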
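
As a rough illustration of mixing SQL with ordinary Spark operations, the sketch below uses PySpark in local mode. It assumes PySpark is installed; the function name top_words_via_spark_sql, the application name, and the word_counts view are hypothetical, and a small in-memory table stands in for a Hive table.

```python
def top_words_via_spark_sql(rows, limit=3):
    """Aggregate (word, n) rows with a SQL query, then keep working on the result in Spark."""
    # Imported here so this module still loads on machines without PySpark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")            # local mode; a cluster supplies its own master
        .appName("spark-sql-sketch")   # hypothetical application name
        .getOrCreate()
    )
    try:
        # Hypothetical in-memory table standing in for a Hive table.
        df = spark.createDataFrame(rows, schema=["word", "n"])
        df.createOrReplaceTempView("word_counts")

        # SQL step: an ordinary SQL aggregation over the registered view.
        totals = spark.sql(
            "SELECT word, SUM(n) AS total FROM word_counts GROUP BY word"
        )

        # Spark step: the query result is a DataFrame, so computation continues from it.
        top = totals.orderBy(totals["total"].desc()).limit(limit)
        return [(row["word"], row["total"]) for row in top.collect()]
    finally:
        spark.stop()
```

On a machine with PySpark available, calling the function with a list such as [("spark", 2), ("hive", 1), ("spark", 3)] returns the words ranked by their summed counts. On a Dataproc cluster the session would normally be configured by the cluster rather than with a hard-coded local master.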

GCDEC

Working through the Google Cloud Data Engineer certification course... See GCDEC for pages related to that.

