Data Engineering: Difference between revisions
From charlesreid1
Revision as of 20:50, 13 October 2017
Data Engineering Scenarios
In line with the data-engineering-scenarios GitHub organization that I created (https://github.com/data-engineering-scenarios), this page contains notes on different scenarios, both finished and planned.
These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:
Dataproc
This is the "classic" big data technology - distributed computing on clusters.
Google Cloud product:
- Dataproc - allocate clusters, run jobs
Amazon product:
- Amazon EMR (Elastic MapReduce, which runs on EC2 instances) - allocate clusters, run jobs
Hadoop ecosystem:
- Hadoop - the big data technology that started it all; processes data in parallel across nodes using the MapReduce framework
- Pig - works with Hadoop; higher-level scripting language (Pig Latin) whose scripts are translated into MapReduce jobs, so Hadoop jobs take far less code to write
- Hive - data warehouse that sits on Hadoop; gives an SQL-like interface (HiveQL) for querying data. (Queries are compiled into MapReduce jobs)
- HBase - non-relational database written in Java, modeled on Google's Bigtable; runs on Hadoop, can serve as source/sink for MapReduce jobs, is a column-oriented key-value store; no native SQL queries - access is through its own API or MapReduce
- Phoenix - turns HBase (non-relational, non-SQL database) into an SQL-like data store
- Parquet - column-oriented file format for storing tables; widely used across the Hadoop ecosystem
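To make the MapReduce idea behind Hadoop, Pig, and Hive concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps for a word count. It uses only the Python standard library and is not Hadoop code; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster each block of input would be mapped on a different
    # node; here the "nodes" are just iterations of a generator.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data pipelines"]))
```

The point of the split is that the map and reduce functions never see more than one line or one key at a time, which is what lets a framework spread them across many nodes.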
Spark technologies:
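- Spark - similar to Hadoop MapReduce, but more focused on efficient computation: it keeps intermediate data in memory where possible instead of writing it to disk between steps
- PySpark - Python bindings for Spark (Spark itself is written in Scala and runs on the JVM)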
- SparkSQL - allows SQL queries in Spark programs, e.g., running an SQL query on Hive, and passing the results to Spark computations
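The SparkSQL pattern (run an SQL query, then hand the results to ordinary program code) can be sketched without a cluster. The example below is only an analogy: it uses sqlite3 from the Python standard library as a stand-in for Hive and Spark, and the orders table and top_spenders function are hypothetical names invented for illustration.

```python
import sqlite3


def top_spenders(rows, minimum):
    """Aggregate with SQL, then post-process the results in plain Python."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

    # Step 1: the declarative part, expressed as an SQL query.
    totals = conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()
    conn.close()

    # Step 2: ordinary program logic applied to the query results.
    # In Spark, this is where further distributed computations would go.
    return [customer for customer, total in totals if total >= minimum]


if __name__ == "__main__":
    sample = [("ann", 10.0), ("bob", 5.0), ("ann", 7.5), ("cy", 20.0)]
    print(top_spenders(sample, 15))
```

In Spark the two steps operate on distributed data rather than an in-memory database, but the division of labor is the same: SQL for the query, program code for everything after it.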
Dataflow
Google Cloud product:
- Dataflow - building data processing pipelines for transforming streams, with sources/sinks
- Pub/Sub - (unordered) streaming events and messaging
- Difference - Pub/Sub is a messaging service that provides just one of many possible sources/sinks for Dataflow
Amazon product:
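- Kinesis - managed service for ingesting and processing real-time streaming data (events/messages); roughly fills the roles that Pub/Sub and Dataflow fill on Google Cloud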
Apache projects:
- Kafka - publishing and subscribing to message streams, stream-processing, and storage of messages in fault-tolerant clusters
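The relationship between a messaging service and a processing pipeline can be sketched in a few lines of plain Python. This is a toy in-memory model, not the API of Pub/Sub, Dataflow, Kinesis, or Kafka; the Topic and Pipeline classes and the run_demo function are hypothetical names used only for illustration.

```python
class Topic:
    """In-memory stand-in for a messaging topic (not a real client library)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        # Every current subscriber receives every published message.
        for callback in self._subscribers:
            callback(message)


class Pipeline:
    """A chain of transforms that ends in a sink."""

    def __init__(self, transforms, sink):
        self.transforms = transforms
        self.sink = sink

    def process(self, message):
        for transform in self.transforms:
            message = transform(message)
            if message is None:  # a transform may drop a message
                return
        self.sink(message)


def run_demo():
    results = []  # the sink here is just a list
    pipeline = Pipeline(
        transforms=[str.strip, lambda m: m.upper() if m else None],
        sink=results.append,
    )
    topic = Topic()
    topic.subscribe(pipeline.process)  # the topic acts as the pipeline's source
    for event in ["  click ", "", " view"]:
        topic.publish(event)
    return results


if __name__ == "__main__":
    print(run_demo())
```

The topic knows nothing about the transforms, and the pipeline does not care where its messages come from, which is why a messaging service is only one of several possible sources or sinks for a pipeline.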
GCDEC
Working through the Google Cloud Data Engineer certification course... See GCDEC for pages related to that.