Revision as of 20:50, 13 October 2017

Data Engineering Scenarios

In line with the data-engineering-scenarios GitHub organization that I created (https://github.com/data-engineering-scenarios), this page contains notes on different scenarios, both finished and planned.

These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:

Dataproc

This is the "classic" big data technology - distributed computing on clusters.

Google Cloud product:

Dataproc - allocate clusters, run jobs

Amazon product:

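Amazon EC2 - allocate clusters, run jobs (Amazon EMR, the managed Hadoop/Spark service that runs on EC2 instances, is the closer counterpart to Dataproc)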

Hadoop ecosystem:

Hadoop - the big data technology that started it all; processes data in parallel on cluster nodes using the MapReduce framework, with data stored in HDFS
Pig - works with Hadoop; higher-level scripting language (Pig Latin) that shortens Hadoop jobs
Hive - data warehouse that sits on Hadoop; gives an SQL-like interface (HiveQL) to query data. (Queries are compiled into MapReduce jobs)
HBase - non-relational database written in Java, modeled on Google's Bigtable; runs on top of Hadoop (HDFS), can serve as source/sink for MapReduce jobs, is a column-oriented key-value store; no native SQL queries - access is through its client API (get/put/scan) or MapReduce
Phoenix - adds an SQL layer on top of HBase (a non-relational, non-SQL database), so it can be queried like an SQL data store
Parquet - columnar file format for storing tables in the Hadoop ecosystem
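
To make the MapReduce idea in the Hadoop entry concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps for a word count. It is plain Python, not Hadoop code; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration, and on a real cluster the map and reduce calls would run on different nodes.

```python
from collections import defaultdict

# Hypothetical helper names; this only imitates the shape of a MapReduce job.

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # On a cluster, each document (or file split) would be mapped on a separate node.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data on clusters", "data on Hadoop"])
print(counts)
```

Pig and Hive can be thought of as ways to avoid writing these three steps by hand.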

Spark technologies:

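Spark - similar to Hadoop MapReduce, but keeps intermediate data in memory where possible, which makes iterative and interactive computation more efficient
PySpark - Python bindings for Spark (Spark itself is written in Scala and runs on the JVM)
SparkSQL - allows SQL queries in Spark programs, e.g., running an SQL query on Hive, and passing the results to Spark computations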
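
The SparkSQL pattern (run an SQL query, then hand the resulting rows to ordinary computation) can be sketched without a cluster. The example below is a stand-in that uses Python's built-in sqlite3 module instead of Spark or Hive; the events table and its columns are hypothetical, chosen only to show the two steps.

```python
import sqlite3

# Stand-in for a Hive table: an in-memory SQLite table with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (username TEXT, nbytes INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("alice", 120), ("bob", 300), ("alice", 80)],
)

# Step 1: the SQL query does the filtering/aggregation.
rows = conn.execute(
    "SELECT username, SUM(nbytes) FROM events GROUP BY username ORDER BY username"
).fetchall()

# Step 2: the query results feed a regular (non-SQL) computation.
total = sum(nbytes for _, nbytes in rows)
shares = {username: nbytes / total for username, nbytes in rows}

conn.close()
print(rows, shares)
```

In Spark the second step would be distributed across the cluster rather than running in a single Python process.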

Dataflow

Google Cloud product:

Dataflow - building data processing pipelines for transforming data (batch or streaming), with sources/sinks
PubSub - (unordered) streaming events and messaging
Difference - PubSub is a messaging service that provides JUST ONE OF MANY sources/sinks for Dataflow
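
A rough sketch of the source/transform/sink idea, and of why a message queue is just one possible source: the same pipeline runs unchanged against an in-memory list or a queue-like source. This is plain Python, not the Dataflow or PubSub API; list_source, queue_source, and run_pipeline are hypothetical names.

```python
# Hypothetical names throughout; this imitates the pipeline shape only.

def list_source(items):
    """A bounded source: yields items from an in-memory list, in order."""
    yield from items

def queue_source(queue):
    """A message-queue style source: pops from the end of the list,
    so the publish order is not preserved (mimicking unordered delivery)."""
    while queue:
        yield queue.pop()

def run_pipeline(source, transforms, sink):
    """Apply each transform to every element from the source, then write to the sink."""
    for element in source:
        for transform in transforms:
            element = transform(element)
        sink(element)

transforms = [str.strip, str.upper]

# Same transforms, two different sources, two different sinks.
from_list = []
run_pipeline(list_source([" a ", "b "]), transforms, from_list.append)

from_queue = []
run_pipeline(queue_source([" x", "y "]), transforms, from_queue.append)

print(from_list, from_queue)
```

Swapping the source or sink does not touch the transform steps, which is the point of the "just one of many" note above.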

Amazon product:

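Kinesis - managed service for ingesting and processing streaming data (event streams); roughly the Amazon counterpart of PubSub as a streaming source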

Apache projects:

Kafka - publishing and subscribing to message streams, stream-processing, and storage of messages in fault-tolerant clusters
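
The publish/subscribe-with-storage idea can be sketched as an append-only log where each consumer tracks its own read position. This is a toy in-memory model, not the Kafka client API; the Topic class and its methods are hypothetical, and replication/fault tolerance is not modelled at all.

```python
class Topic:
    """Toy append-only message log; messages stay in the log after being read."""

    def __init__(self):
        self.log = []       # stored messages, in publish order
        self.offsets = {}   # consumer name -> index of the next message to read

    def publish(self, message):
        self.log.append(message)

    def poll(self, consumer):
        """Return the messages this consumer has not seen yet and advance its offset."""
        start = self.offsets.get(consumer, 0)
        messages = self.log[start:]
        self.offsets[consumer] = len(self.log)
        return messages

topic = Topic()
topic.publish("click")
topic.publish("view")

first = topic.poll("analytics")    # sees both messages published so far
topic.publish("purchase")
second = topic.poll("analytics")   # sees only the new message
late = topic.poll("audit")         # a new subscriber can replay the whole log

print(first, second, late)
```

Because messages are stored rather than deleted on delivery, a subscriber that joins late can still read everything from the beginning.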

GCDEC

Working through the Google Cloud Data Engineer certification course... See GCDEC for pages related to that.



Data Engineering: Difference between revisions

From charlesreid1