Data Engineering
Overview
Data engineering - software engineering with an emphasis on dealing with large amounts of data
What is Data Engineering
Enable others to answer questions using datasets, within latency constraints
Components:
- Distributed systems
- Parallel processing
- Databases
- Queuing
Purpose?
- Human fault tolerance
- Metrics
- Monitoring
- Multi-tenancy
Example of where you start:
- Searches by keyword/user only
- Basic statistics only
- Using someone else's search engine
Example stack:
- Custom crawlers ingesting data (Gearman)
- Passing data off to custom workers
- Dumping data to MySQL/Sphinx/etc
Problems:
- Inflexibility
- Corruption is highly probable
- High burden on operations
- No scalability
- No fault tolerance
Alternative stack:
- Many collectors dumping to Amazon S3
- Analysis with Hadoop
- ElephantDB
- Low-latency queries (but data carries a lead time of several hours)
- More advanced statistics (influencer of, influenced by)
Data pipeline example:
- Tweets go to S3
- URLS are normalized
- Each hour, new compute bucket
- Sum by hour and by url
- Emit ElephantDB indices
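This isn't the actual Twitter pipeline code, just a minimal PySpark sketch of the "sum by hour and by url" step, assuming the tweets sit in S3 as JSON lines with hypothetical created_at and url fields; the ElephantDB index emission is stood in for by a plain Parquet write.

    # Minimal sketch of the hourly URL-count step (hypothetical bucket and field names).
    from urllib.parse import urlparse, urlunparse
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    def normalize_url(url):
        """Crude URL normalization: lowercase scheme/host, drop query string and fragment."""
        p = urlparse(url)
        return urlunparse((p.scheme.lower(), p.netloc.lower(), p.path, "", "", ""))

    spark = SparkSession.builder.appName("hourly-url-counts").getOrCreate()
    normalize = F.udf(normalize_url)

    tweets = spark.read.json("s3a://my-bucket/tweets/*.json")
    counts = (tweets
              .withColumn("hour", F.date_trunc("hour", F.to_timestamp("created_at")))
              .withColumn("norm_url", normalize(F.col("url")))
              .groupBy("hour", "norm_url")
              .count())
    counts.write.parquet("s3a://my-bucket/url-counts-by-hour/")   # stand-in for ElephantDB indices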
Another data pipeline example:
- Tweets go to Kafka
- URLs are normalized
- Each hour, new compute bucket
- Update hour/url bucket
- Send data to Cassandra
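Again a hedged sketch rather than the real pipeline: a consumer that reads tweets from a hypothetical "tweets" Kafka topic and bumps per-hour URL counters in Cassandra, using the kafka-python and cassandra-driver packages.

    # Streaming variant (sketch): consume from Kafka, update hour/url counters in Cassandra.
    import json
    from datetime import datetime
    from kafka import KafkaConsumer                # pip install kafka-python
    from cassandra.cluster import Cluster          # pip install cassandra-driver

    consumer = KafkaConsumer("tweets", bootstrap_servers="localhost:9092",
                             value_deserializer=lambda b: json.loads(b.decode("utf-8")))
    session = Cluster(["127.0.0.1"]).connect("analytics")   # hypothetical keyspace

    # Assumes: CREATE TABLE url_counts_by_hour (hour timestamp, url text,
    #                                           count counter, PRIMARY KEY (hour, url));
    update = session.prepare(
        "UPDATE url_counts_by_hour SET count = count + 1 WHERE hour = ? AND url = ?")

    for msg in consumer:
        tweet = msg.value                                    # assumes ISO-format created_at
        hour = datetime.fromisoformat(tweet["created_at"]).replace(
            minute=0, second=0, microsecond=0)
        session.execute(update, (hour, tweet["url"]))        # URL normalization omitted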
Clojure example:
- tweet reactor/tweet reaction/tweet reshare/now-secs/interaction/interaction-scores
- serialization of data using thrift
Infrastructure components:
- HDFS - distributed big data filesystem (disk-based, with data replicated across nodes)
- MapReduce - operations on HDFS data
- Kafka - messaging queue (and later, distributed processing on messages)
- Storm - distributed processing
- Spark - distributed, parallelized computation on HDFS data
- Cassandra - scalable database
- HBase - database operating on top of HDFS
- Zookeeper - highly reliable distributed coordination (maintains configuration information, naming, distributed synchronization, and group services)
- ElephantDB - a key/value store for Hadoop - exports and serves key/value data computed by Hadoop jobs
Multi-tenancy:
- Independent applications on a single cluster
- Topologies should not affect each other
- Topologies should have adequate resources (Apache Mesos)
- When submitting a job, specify resources needed
Data engineering vs data science:
- Data engineers have well-defined problems
- Data scientists need specialized statistical skills
- Data engineers deal with a larger scope - not just analytics
Open source:
- Important for recruitment
- Strong developers want to work where they can be involved in open source
- Popular open-source projects give access to better engineers
- identify good recruits, learn best practices, get help - not "free work"
Ideal data engineers:
- Strong software engineering skills
- Abstraction
- Testing
- Version control
- Refactoring
- Strong algorithm skills
- Digging into open source code
- Stress testing
Finding strong data engineers:
- Standard "code on a whiteboard" interviews are useless
- Take-home projects to gauge general abilities
- Best: see projects requiring data engineering
Data Engineering Example: Twitter
Data Engineering/Twitter Example
Data Engineering Scenarios
In line with the data-engineering-scenarios Github organization that I created (https://github.com/data-engineering-scenarios), this page will contain notes on different scenarios - both finished and planned.
These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:
Compute Engine
An approach to cloud infrastructure that provides a greater degree of freedom, but requires more configuration work. Compute Engine gives you virtual machines that start out essentially empty, so you have to build/install any software you need. This can be a pain, but it also gives you greater control.
Also see Container Engine section below.
Container Engine
Google Container Engine runs Docker containers on managed Kubernetes clusters in the cloud: container images are pushed to a registry (effectively version control for images) and pulled onto nodes as needed. This allows you to scale a single container image out to many deployed instances.
Also see Docker.
Dataproc
Dataproc Technologies
This is the "classic" big data technology - distributed computing on clusters.
Google Cloud product:
- Dataproc - allocate clusters, run jobs
Amazon product:
- Amazon EMR (Elastic MapReduce) - allocate clusters (of EC2 instances), run jobs
Hadoop ecosystem:
- Hadoop - the big data technology that started it all; processes data in parallel across nodes using the MapReduce framework (a word-count sketch follows this list)
- Pig - works with Hadoop; higher-level scripting language that shortens Hadoop jobs
- Hive - data warehouse that sits on top of Hadoop; gives an SQL-like interface for querying data (the SQL queries are compiled down to MapReduce jobs)
- HBase - Java software for non-relational databases, analogous to Google's BigTable; runs on top of Hadoop/HDFS, can serve as a source/sink for MapReduce jobs, and is a column-oriented key/value store; no SQL queries - access is via the HBase API or MapReduce
- Phoenix - turns HBase (non-relational, non-SQL database) into an SQL-like data store
- Parquet - column-oriented storage format for table data that sits on Hadoop
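To make the MapReduce model above concrete, here is the canonical word-count example written as two Hadoop Streaming scripts (a mapper and a reducer reading stdin and writing stdout); it is generic and not tied to any scenario on this page.

    # mapper.py - Hadoop Streaming mapper: emit "word<TAB>1" for every word on stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py - Hadoop Streaming reducer: input arrives sorted by key, so sum up
    # runs of identical words and emit "word<TAB>count".
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(current_word + "\t" + str(current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(current_word + "\t" + str(current_count))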
Spark technologies:
- Spark - similar to Hadoop, but more focused on efficient computation
- PySpark - Python bindings for Spark (Spark itself is written in Scala and runs on the JVM)
- SparkSQL - allows SQL queries in Spark programs, e.g., running an SQL query on Hive, and passing the results to Spark computations
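A small PySpark sketch of the SparkSQL idea: run SQL against a Hive table and keep computing on the result as a DataFrame. The table and column names here are made up.

    # SparkSQL sketch: query a (hypothetical) Hive table, keep working with the result in Spark.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("sparksql-hive-example")
             .enableHiveSupport()                  # lets spark.sql() see Hive tables
             .getOrCreate())

    # The SQL runs against Hive; the result comes back as a Spark DataFrame.
    visits = spark.sql("SELECT url, visit_date FROM web_logs WHERE visit_date >= '2017-01-01'")

    # Continue the computation in Spark.
    top_urls = (visits.groupBy("url")
                      .count()
                      .orderBy(F.desc("count"))
                      .limit(10))
    top_urls.show()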
Dataproc Scenario
The scenario here is dataproc-spark-kmeans-images-bigquery
Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery
This scenario allocates a Dataproc cluster and runs a Spark job on it that downloads images, extracts the k dominant colors of each image via k-means clustering, and pushes the results to BigQuery.
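The repository above has the real code; this is just a minimal sketch of the central step - clustering the pixels of one image into k dominant colors with Spark MLlib - with the image path, k, and output handling all made up.

    # Sketch of the core k-means step: k dominant colors of one (small) image.
    from PIL import Image                          # pip install pillow
    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("image-kmeans").getOrCreate()

    img = Image.open("example.jpg").convert("RGB")
    pixels = [(Vectors.dense(rgb),) for rgb in img.getdata()]      # one row per pixel
    df = spark.createDataFrame(pixels, ["features"])

    model = KMeans(k=5, seed=42).fit(df)
    for center in model.clusterCenters():                          # k dominant (R, G, B) colors
        print([int(c) for c in center])
    # The scenario then writes rows like (image_name, r, g, b) to a BigQuery table.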
Dataflow
Dataflow Technologies
Google Cloud product:
- Dataflow - building data processing pipelines for transforming streams, with sources/sinks
- PubSub - (unordered) streaming events and messaging
- Difference - PubSub is a messaging service; it is just one of many possible sources/sinks for Dataflow
Amazon product:
- Kinesis - Amazon's streaming data service for ingesting and processing event/message streams (roughly the analogue of PubSub/Dataflow)
Apache projects:
- Kafka - publishing and subscribing to message streams, stream-processing, and storage of messages in fault-tolerant clusters
- Avro - a data serialization system; turns rich data structures into compact binary streams that can be easily passed around; uses dynamic typing (no code generation required - everything is driven by the schema); serialized data is small because per-record type information does not travel with each record - instead the schema is stored once alongside the data (see the sketch after this list)
- Thrift - provides cross-talk language for programs in different languages to pass data between them (data and service interfaces)
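A tiny sketch of the Avro point, using the fastavro package with a made-up schema: the schema is written once into the file header, and each record is compact binary with no per-record field names or type tags.

    # Avro sketch (fastavro, hypothetical "Tweet" schema).
    from fastavro import writer, reader, parse_schema     # pip install fastavro

    schema = parse_schema({
        "type": "record",
        "name": "Tweet",
        "fields": [
            {"name": "user", "type": "string"},
            {"name": "url",  "type": "string"},
            {"name": "ts",   "type": "long"},
        ],
    })
    records = [{"user": "alice", "url": "https://example.com", "ts": 1500000000}]

    with open("tweets.avro", "wb") as out:
        writer(out, schema, records)               # schema goes into the file header once

    with open("tweets.avro", "rb") as inp:
        for rec in reader(inp):                    # reader recovers the schema from the file
            print(rec)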
Dataflow Scenarios
Scenario:
- Docker pod - generating messages and publishing them to a pipeline
- Docker container running a collector (unstructured/nosql)
- Docker container running a dashboard to visualize the collector database
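A minimal sketch of the message-generating piece of this scenario, assuming Google Cloud PubSub as the pipeline (project, topic, and payload are made up); the collector container would run a corresponding subscriber.

    # PubSub publisher sketch - the kind of loop the message-generating pod would run.
    import json
    import time
    from google.cloud import pubsub_v1            # pip install google-cloud-pubsub

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "sensor-messages")   # hypothetical names

    while True:
        msg = {"ts": time.time(), "value": 42}                           # stand-in payload
        publisher.publish(topic_path, json.dumps(msg).encode("utf-8"))
        time.sleep(1.0)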
Query
Query Technologies
Google Cloud products:
- BigQuery - petabyte-scale datasets (see the query sketch after this list)
- BigTable - large, non-relational databases
- CloudSQL - elastic, scalable SQL databases in the cloud
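A short sketch of hitting BigQuery from Python with the google-cloud-bigquery client; the query below runs against one of the BigQuery public datasets and is purely illustrative.

    # BigQuery query sketch against a public dataset.
    from google.cloud import bigquery              # pip install google-cloud-bigquery

    client = bigquery.Client()                     # uses your default GCP credentials/project

    query = """
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        GROUP BY name
        ORDER BY total DESC
        LIMIT 10
    """
    for row in client.query(query).result():       # run the query job, wait for results
        print(row.name, row.total)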
Query Scenarios
Scenario 1: BigQuery examples (working out assembling SQL queries) for open data sets on BigQuery
Link: https://github.com/charlesreid1/sabermetrics-bigquery
Scenario 2: Docker-containerized SQL database, jupyter notebook, for neural network training
Link: https://github.com/data-engineering-scenarios/kaggle-sql-jupyter-keras
Scenario 3: BigQuery as source/sink for images in dataproc-spark-kmeans-images-bigquery
Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery
Machine Learning
Machine Learning Technologies
Scikit:
- scikit-learn
- sklearn-pandas
Supporting py-data libraries:
- Pandas - join, merge, groupby, shift, time series analysis, SQL to dataframe (see the sketch after this list)
- SQLAlchemy - SQL data into Python
- Seaborn - linear regression, basic models, essential plot types
- OpenCV - object and face detection
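A quick pandas sketch of the operations called out above (merge/join, groupby, shift) on a made-up daily-traffic table.

    # pandas sketch: merge, groupby, shift on a made-up daily-traffic table.
    import pandas as pd

    traffic = pd.DataFrame({
        "date": pd.date_range("2017-01-01", periods=6, freq="D"),
        "site": ["a", "a", "a", "b", "b", "b"],
        "hits": [100, 120, 90, 40, 55, 60],
    })
    sites = pd.DataFrame({"site": ["a", "b"], "owner": ["alice", "bob"]})

    df = traffic.merge(sites, on="site")                       # join in site metadata
    df["prev_hits"] = df.groupby("site")["hits"].shift(1)      # previous day's hits per site
    df["change"] = df["hits"] - df["prev_hits"]                # day-over-day change
    print(df.groupby("owner")["hits"].sum())                   # aggregate hits per owner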
Classic Machine Learning Scenarios
Scenario ideas:
- Time series for messaging services - logs and traffic, outlier detection, publishing messages when anomalies detected (see the sketch after this list)
- Web frontend for OpenCV - bounding boxes where objects found
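For the time-series/outlier-detection idea above, a minimal sketch of the detection step using a rolling z-score in pandas (data, window, and threshold are all made up); each flagged point would then trigger a published message.

    # Rolling z-score outlier detection sketch on a synthetic traffic series.
    import numpy as np
    import pandas as pd

    np.random.seed(0)
    series = pd.Series(np.random.normal(100, 5, 500))          # synthetic "requests per minute"
    series.iloc[250] = 200                                      # inject an anomaly

    mean = series.rolling(window=50, min_periods=10).mean()
    std = series.rolling(window=50, min_periods=10).std()
    zscore = (series - mean) / std

    anomalies = series[zscore.abs() > 3]                        # points far from the rolling mean
    print(anomalies)                       # each of these would become a published alert message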
Neural Network Machine Learning
Neural Network Machine Learning Technologies
Google Cloud:
- Cloud ML APIs - using packaged/bundled API calls for machine learning
- Cloud ML Engine - training TensorFlow models in the cloud with elastic cluster sizes
- Compute Engine - scaling workflows to large data sets "by hand"
- (Integration of larger data stores, e.g., BigQuery/Cloud Storage, with ML training)
Software:
- Keras
- TensorFlow
- Sonnet
- Theano
- MXNet
- etc etc etc
Goals?
- Predictive analytics
- Creating business value from unstructured/very large/unanalyzed data sets
Neural Network Machine Learning Scenarios
Scenario 1: SQL data in a Docker container, training a Keras neural network model
Link: https://github.com/data-engineering-scenarios/kaggle-sql-jupyter-keras
Scenario notes:
- Don't reinvent the wheel, use pre-trained models and APIs
- Cover different challenges (out-of-memory errors and large training sets), Fuel/Kerosene and other helper libraries, HDF5 compression/storage, sparse events or large feature sets
- Scenario template: JS frontend, Flask glue, Keras/other Python backend
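A bare-bones sketch of the "JS frontend, Flask glue, Python backend" template above; the predict function here is just a placeholder for whatever Keras model a given scenario uses.

    # Flask glue sketch: the JS frontend POSTs JSON to /predict, Flask hands it to a
    # backend function, and the prediction goes back as JSON.
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    def predict(features):
        """Placeholder for a real Keras model's predict() call."""
        return sum(features)

    @app.route("/predict", methods=["POST"])
    def predict_endpoint():
        features = request.get_json()["features"]
        return jsonify({"prediction": predict(features)})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)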
Scenario ideas:
- Pre-trained image recognition model, prediction of type of object, wrap front-end with graphs to show data, objects detected, etc. (see the sketch after this list)
- Trained face differences, upload two faces, give prediction.
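For the pre-trained image recognition idea (first item above), a small Keras sketch that classifies one image with the bundled ResNet50 ImageNet weights; the image path is hypothetical and the frontend/graphing pieces are not shown.

    # Pre-trained image recognition sketch with Keras' bundled ResNet50 weights.
    import numpy as np
    from keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
    from keras.preprocessing import image

    model = ResNet50(weights="imagenet")                       # downloads ImageNet weights once

    img = image.load_img("uploaded.jpg", target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

    preds = model.predict(x)
    for _, label, prob in decode_predictions(preds, top=3)[0]:
        print(label, round(float(prob), 3))                    # top-3 predicted object types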
GCDEC
Working through the Google Cloud Data Engineer certification course... See GCDEC for pages related to that.
Apache
Apache software projects related to Data Engineering:
- Avro
- DataFusion
- Iceberg