Data Engineering: Difference between revisions
From charlesreid1
Revision as of 20:39, 13 October 2017
Data Engineering Scenarios
In line with the data-engineering-scenarios GitHub organization that I created (https://github.com/data-engineering-scenarios), this page will contain notes on different scenarios - both finished and planned.
These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:
Dataproc
This is the "classic" big data technology - distributed computing on clusters.
Google Cloud product:
- Dataproc - allocate clusters, run jobs
Amazon product:
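- Amazon EC2 - allocate virtual machines that can be assembled into clusters to run jobs (Amazon EMR is the managed Hadoop/Spark service closest to Dataproc)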
Hadoop ecosystem:
- Hadoop - the big data technology that started it all; processing data in parallel on nodes using MapReduce framework
- Pig - works with Hadoop; higher-level scripting language that shortens Hadoop jobs
- Hive - data warehouse that sits on Hadoop; gives SQL-like interface (HiveQL) to query data. (Queries are compiled into MapReduce jobs; newer versions can also run them on Tez or Spark)
- HBase - Java software for non-relational databases, analogous to Google's BigTable; runs on Hadoop (HDFS), can serve as source/sink for MapReduce jobs, is a column-oriented key-value store; no native SQL queries - data is read and written through the HBase client API (get, put, scan) or MapReduce jobs
- Phoenix - turns HBase (non-relational, non-SQL database) into an SQL-like data store
- Parquet - column-oriented file format for storing tables on Hadoop; readable by Hive, Pig, Spark, and other tools
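To make the MapReduce idea behind the Hadoop entries concrete, here is a minimal single-machine sketch of a word count in plain Python. It is not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen for illustration, and everything runs in one process, whereas a real cluster would spread the map and reduce calls across nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a cluster, each document (or file block) would be mapped on a different node.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    # Each key could likewise be reduced on a different node.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data on clusters", "big clusters run big jobs"])
```

Tools such as Pig and Hive generate jobs of roughly this map/shuffle/reduce shape from higher-level scripts or SQL-like queries, so the user does not write the phases by hand.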
Spark technologies:
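- Spark - cluster computing engine similar to Hadoop MapReduce, but it can keep intermediate data in memory between steps, which makes iterative and interactive computations more efficient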
- PySpark - Python API for Spark (Spark itself is written mostly in Scala and runs on the JVM)
- SparkSQL - allows SQL queries in Spark programs, e.g., running an SQL query on Hive, and passing the results to Spark computations
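The SparkSQL pattern of running an SQL query and handing the result to further computation can be sketched on a single machine. The example below is only an analogy, not Spark code: it assumes Python's built-in sqlite3 module as a stand-in for the SQL source (Hive in the description above) and plain Python as a stand-in for the Spark computation. The table name, column names, and function name are all hypothetical.

```python
import sqlite3


def query_then_compute(rows):
    """Aggregate with SQL first, then post-process the result in ordinary code."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_name TEXT, n_bytes INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

    # Step 1: the SQL query does the grouping and summing.
    totals = conn.execute(
        "SELECT user_name, SUM(n_bytes) FROM events "
        "GROUP BY user_name ORDER BY user_name"
    ).fetchall()
    conn.close()

    # Step 2: the query result feeds a non-SQL computation,
    # here each user's share of the overall traffic.
    grand_total = sum(total for _, total in totals)
    return {name: total / grand_total for name, total in totals}


shares = query_then_compute([("ana", 300), ("bo", 100), ("ana", 100)])
```

In Spark the result of the query would be a distributed DataFrame rather than a local list of tuples, so the second step would also run in parallel across the cluster.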
GCDEC
Working through the Google Cloud Data Engineer certification course... See GCDEC for pages related to that.