Latest revision as of 17:54, 26 May 2025

Overview

Data engineering - software engineering with an emphasis on dealing with large amounts of data

What is Data Engineering

Enable others to answer questions using datasets, within latency constraints

Components:

Distributed systems
Parallel processing
Databases
Queuing

Purpose?

Human fault tolerance
Metrics
Monitoring
Multi-tenancy

Example of where you start:

Searches by keyword/user only
Basic statistics only
Using someone else's search engine

Example stack:

Custom crawlers ingesting data (Gearman)
Passing data off to custom workers
Dumping data to MySQL/Sphinx/etc

Problems:

Inflexibility
Corruption is highly probable
High burden on operations
No scalability
No fault tolerance

Alternative stack:

Many collectors dumping to Amazon S3
Analysis with Hadoop
ElephantDB
Low latency (but lead time of several hours)
More advanced statistics (influencer of, influenced by)

Data pipeline example:

Tweets go to S3
URLS are normalized
Each hour, new compute bucket
Sum by hour and by url
Emit ElephantDB indices

Another data pipeline example:

Tweets go to Kafka
URLs are normalized
Each hour, new compute bucket
Update hour/url bucket
Send data to Cassandra

Clojure example:

tweet reactor/tweet reaction/tweet reshare/now-secs/interaction/interaction-scores
serialization of data using thrift

Infrastructure components:

HDFS - distributed in-memory big data filesystem
MapReduce - operations on HDFS data
Kafka - messaging queue (and later, distributed processing on messages)
Storm - distributed processing
Spark - distributed, parallelized computation on HDFS data
Cassandra - scalable database
HBase - database operating on top of HDFS
Zookeeper - highly reliable distributed coordination (maintain config info, naming, synchronization, and multiple services)
ElephantDB - like a NoSQL Hadoop store - key/value data in Hadoop

Multi-tenancy:

Independent applications on a single cluster
Topologies should not affect each other
Topologies should have adequate resources (Apache Mesos)
When submitting a job, specify resources needed

Data engineering vs data science:

Data engineers have well-defined problems
Data scientists need specialized statistical skills
Data engineers deal with a larger scope - not just analytics

Open source:

Important for recruitment
Strong developers want to work where they can be involved in open source
Popular open-source projects give access to better engineers
identify good recruits, learn best practices, get help - not "free work"

Ideal data engineers:

Strong software engineering skills
Abstraction
Testing
Version control
Refactoring
Strong software engineering skills
Strong algorithm skills
Digging into open source code
Stress testing

Finding strong data engineers:

Standard "code on a whiteboard" interviews are useless
Take-home projects to gauge general abilities
Best: see projects requiring data engineering

Data Engineering Example: Twitter

Data Engineering/Twitter Example

Data Engineering Scenarios

In line with the data-engineering-scenarios Github organization that I created (https://github.com/data-engineering-scenarios), this page will contain notes on different scenarios - both finished and planned.

These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:

Compute Engine

An approach to cloud infrastructure that provides a greater degree of freedom, but requires more complicated configuration. Compute Engine gives you virtual machines that start as bare metal, so you have to build/install any software you need. This can be a pain but also gives you greater control.

Also see Container Engine section below.

Container Engine

The Google Cloud container engine basically provides a version control system on Docker images, which can then be pushed and pulled onto nodes in the cloud. This allows you to scale a single container image to deploy many instances, as needed.

Also see Docker.

Dataproc

Dataproc Technologies

This is the "classic" big data technology - distributed computing on clusters.

Google Cloud product:

Dataproc - allocate clusters, run jobs

Amazon product:

Amazon EC2 - allocate clusters, run jobs

Hadoop ecosystem:

Hadoop - the big data technology that started it all; processing data in parallel on nodes using MapReduce framework
Pig - works with Hadoop; higher-level scripting language that shortens Hadoop jobs
Hive - data warehouse that sits on Hadoop (or Pig); gives SQL-like interface to query data. (SQL queries are implemented in MapReduce)
HBase - Java software for non-relational databases, analogous to Google's BigTable; runs on Hadoop, can serve as source/sink for MapReduce queries, is a column-based key store; no SQL queries - MapReduce only
Phoenix - turns HBase (non-relational, non-SQL database) into an SQL-like data store
Parquet - column-based table storage that sits on Hadoop

Spark technologies:

Spark - similar to Hadoop, but more focused on efficient computation
PySpark - Python bindings for Spark (Java)
SparkSQL - allows SQL queries in Spark programs, e.g., running an SQL query on Hive, and passing the results to Spark computations

Dataproc Scenario

The scenario here is dataproc-spark-kmeans-images-bigquery

Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery

This gets a Dataproc cluster, and runs a Spark job on the cluster that downloads images, extracts k mean color clusters from the image, and pushes the results to BigQuery.

Dataflow

Dataflow Technologies

Google Cloud product:

Dataflow - building data processing pipelines for transforming streams, with sources/sinks
PubSub - (unordered) streaming events and messaging
Difference - PubSub is a messaging service that provides JUST ONE OF MANY sources/sinks for Dataflow

Amazon product:

Kinesis - streaming events? messaging?

Apache projects:

Kafka - publishing and subscribing to message streams, stream-processing, and storage of messages in fault-tolerant clusters
Avro - a data serialization service; turns rich data structures into streams of binary data that can be easily passed around; uses dynamic typing (no code generated - based on schema); smaller serialization size (info about scheme doesn't travel with the data - but data is stored alongside its schema.)
Thrift - provides cross-talk language for programs in different languages to pass data between them (data and service interfaces)

Dataflow Scenarios

Scenario:

Docker pod - generating messages and publishing them to a pipeline
Docker container running a collector (unstructured/nosql)
Docker container running a dashboard to visualize the collector database

Query

Query Technologies

Google Cloud products:

BigQuery - petabyte-scale datasets
BigTable - large, non-relational databases
CloudSQL - elastic, scalable SQL databases in the cloud

Query Scenarios

Scenario 1: BigQuery examples (working out assembling SQL queries) for open data sets on BigQuery

Link: https://github.com/charlesreid1/sabermetrics-bigquery

Scenario 2: Docker-containerized SQL database, jupyter notebook, for neural network training

Link: https://github.com/data-engineering-scenarios/kaggle-sql-jupyter-keras

Scenario 3: BigQuery as source/sink for images in dataproc-spark-kmeans-images-bigquery

Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery

Machine Learning

Machine Learning Technologies

Scikit:

scikit-learn
sklearn-pandas

Supporting py-data libraries:

Pandas - join, merge, groupby, shift, time series analysis, SQL to dataframe
SQLAlchemy - SQL data into Python
Seaborn - linear regression, basic models, essential plot types
OpenCV - object and face detection

Classic Machine Learning Scenarios

Scenario ideas:

Time series for messaging services - logs and traffic, outlier detection, publishing messages when anomalies detected
Web frontend for OpenCV - bounding boxes where objects found

Neural Network Machine Learning

Neural Network Machine Learning Technologies

Google Cloud:

Cloud ML APIs - using packaged/bundled API calls for achine learning.
Cloud ML Engine - training TensorFlow models in the cloud with elastic cluster sizes
Compute Engine - scaling workflows to large data sets "by hand"
(Integration of larger data stores, e.g., BigQuery/Cloud Storage, with ML training)

Software:

Keras
TensorFlow
Sonnet
Theano
MXNet
etc etc etc

Goals?

Predictive analytics
Creating business value from unstructured/very large/unanalyzed data sets

Neural Network Machine Learning Scenarios

Scenario 1: SQL data in a Docker container, training a Keras neural network model