Revision as of 20:39, 13 October 2017

Data Engineering Scenarios

In line with the data-engineering-scenarios GitHub organization that I created (https://github.com/data-engineering-scenarios), this page contains notes on different scenarios, both finished and planned.

These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:

Dataproc

This is the "classic" big data technology - distributed computing on clusters.

Google Cloud product:

Dataproc - allocate clusters, run jobs

Amazon product:

Amazon EC2 - allocate instances to form clusters, run jobs (Amazon EMR is the managed service closest to Dataproc)

Hadoop ecosystem:

Hadoop - the big data technology that started it all; processes data in parallel across nodes using the MapReduce framework
Pig - works with Hadoop; higher-level scripting language (Pig Latin) that shortens Hadoop jobs
Hive - data warehouse that sits on Hadoop; gives a SQL-like interface (HiveQL) to query data. (Queries are compiled into MapReduce jobs.)
HBase - non-relational database written in Java, analogous to Google's Bigtable; runs on Hadoop, can serve as source/sink for MapReduce jobs, is a column-oriented key-value store; no native SQL queries - access is through the HBase API or MapReduce
Phoenix - adds a SQL layer on top of HBase (a non-relational, non-SQL database)
Parquet - columnar file format for table storage in the Hadoop ecosystem
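
To make the MapReduce idea concrete, here is a minimal single-machine sketch of the three phases (map, shuffle, reduce) applied to a word count. It uses plain Python only and is not Hadoop code; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration. On a real cluster the framework would run the mappers and reducers on different nodes and do the shuffle over the network.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key (done by the framework on a cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Chain the three phases together on a list of input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data on clusters", "data on Hadoop"]
    print(word_count(sample))
```

Pig and Hive exist largely so that jobs like this do not have to be written by hand as separate mapper and reducer functions.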

Spark technologies:

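Spark - similar to Hadoop, but more focused on efficient computation; keeps intermediate results in memory instead of writing them to disk between steps
PySpark - Python bindings for Spark (which is written in Scala and runs on the JVM)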
SparkSQL - allows SQL queries in Spark programs, e.g., running an SQL query on Hive, and passing the results to Spark computations
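
The SparkSQL pattern described above (run an SQL query, then pass the results to further computation) can be sketched without a cluster. This is a stand-in only: the standard-library sqlite3 module plays the role of Hive, ordinary Python plays the role of the Spark computation, and the table name, column names, and the query_then_compute function are all hypothetical. It is not the Spark API.

```python
import sqlite3
from statistics import mean


def query_then_compute(rows):
    """Run an SQL aggregation, then hand the result rows to ordinary Python code."""
    conn = sqlite3.connect(":memory:")  # in-memory database standing in for Hive
    try:
        conn.execute("CREATE TABLE events (user_name TEXT, num_bytes INTEGER)")
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
        # Step 1: the SQL part (the query SparkSQL would run against Hive).
        totals = conn.execute(
            "SELECT user_name, SUM(num_bytes) FROM events "
            "GROUP BY user_name ORDER BY user_name"
        ).fetchall()
    finally:
        conn.close()
    # Step 2: the programmatic part (what a Spark computation would do next):
    # flag every user whose total is above the average total.
    average = mean(total for _, total in totals)
    return [(name, total, total > average) for name, total in totals]


if __name__ == "__main__":
    sample = [("ann", 100), ("bob", 300), ("ann", 50), ("cy", 30)]
    for row in query_then_compute(sample):
        print(row)
```

In actual Spark code both steps would operate on distributed data, so the query results would never need to be collected on a single machine.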

GCDEC

Working through the Google Cloud Data Engineer certification course... See GCDEC for pages related to that.

Flags

@@ Line 9: / Line 9: @@
 This is the "classic" big data technology - distributed computing on clusters.
-* Dataproc - Google Cloud version, allocate a cluster and run jobs through it
+Google Cloud product:
+* Dataproc - allocate clusters, run jobs
+Amazon product:
+* Amazon EC2 - allocate clusters, run jobs
+Hadoop ecosystem:
 * Hadoop - the big data technology that started it all; processing data in parallel on nodes using MapReduce framework
-* Spark - similar to Hadoop, but more focused on efficient computation
-* PySpark - Python bindings for Spark (Java)
-* SparkSQL - allows SQL queries in Spark programs, e.g., running an SQL query on Hive, and passing the results to Spark computations
 * Pig - works with Hadoop; higher-level scripting language that shortens Hadoop jobs
 * Hive - data warehouse that sits on Hadoop (or Pig); gives SQL-like interface to query data. (SQL queries are implemented in MapReduce)
 * HBase - Java software for non-relational databases, analogous to Google's BigTable; runs on Hadoop, can serve as source/sink for MapReduce queries, is a column-based key store; no SQL queries - MapReduce only
@@ Line 24: / Line 23: @@
 * Parquet - column-based table storage that sits on Hadoop
+Spark technologies:
+* Spark - similar to Hadoop, but more focused on efficient computation
+* PySpark - Python bindings for Spark (Java)
+* SparkSQL - allows SQL queries in Spark programs, e.g., running an SQL query on Hive, and passing the results to Spark computations
 =GCDEC=

Data Engineering: Difference between revisions

From charlesreid1
