
Overview

Data engineering - software engineering with an emphasis on dealing with large amounts of data

What is Data Engineering

Enable others to answer questions using datasets, within latency constraints

Components:

  • Distributed systems
  • Parallel processing
  • Databases
  • Queuing

Purpose?

  • Human fault tolerance
  • Metrics
  • Monitoring
  • Multi-tenancy

Example of where you start:

  • Searches by keyword/user only
  • Basic statistics only
  • Using someone else's search engine

Example stack:

  • Custom crawlers ingesting data (Gearman)
  • Passing data off to custom workers
  • Dumping data to MySQL/Sphinx/etc

Problems:

  • Inflexibility
  • Corruption is highly probable
  • High burden on operations
  • No scalability
  • No fault tolerance

Alternative stack:

  • Many collectors dumping to Amazon S3
  • Analysis with Hadoop
  • ElephantDB
  • Low latency (but lead time of several hours)
  • More advanced statistics (influencer of, influenced by)

Data pipeline example (a sketch follows the list):

  • Tweets go to S3
  • URLs are normalized
  • Each hour, new compute bucket
  • Sum by hour and by url
  • Emit ElephantDB indices
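
A minimal sketch of the hourly batch step above in plain Python (the normalization rules, record format, and field names are assumptions; the real pipeline ran as Hadoop jobs and emitted ElephantDB indices):

    import json
    from collections import Counter
    from datetime import datetime, timezone
    from urllib.parse import urlsplit, urlunsplit

    def normalize_url(url):
        """Lowercase scheme/host, drop query string and fragment (assumed rules)."""
        parts = urlsplit(url)
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), parts.path, "", ""))

    def hour_bucket(epoch_secs):
        """Truncate a tweet timestamp to the start of its hour (UTC)."""
        dt = datetime.fromtimestamp(epoch_secs, tz=timezone.utc)
        return dt.replace(minute=0, second=0, microsecond=0).isoformat()

    def count_by_hour_and_url(lines):
        """lines: iterable of JSON records like {"ts": 1500000000, "url": "http://..."}."""
        counts = Counter()
        for line in lines:
            tweet = json.loads(line)
            counts[(hour_bucket(tweet["ts"]), normalize_url(tweet["url"]))] += 1
        return counts   # the real pipeline writes these sums out as ElephantDB indices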

Another data pipeline example (sketched below):

  • Tweets go to Kafka
  • URLs are normalized
  • Each hour, new compute bucket
  • Update hour/url bucket
  • Send data to Cassandra
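
A rough sketch of the streaming variant, assuming the kafka-python and cassandra-driver client libraries, a "tweets" Kafka topic, and a hypothetical url_counts counter table keyed by (hour, url):

    import json
    from datetime import datetime, timezone
    from urllib.parse import urlsplit, urlunsplit

    from kafka import KafkaConsumer             # kafka-python client
    from cassandra.cluster import Cluster       # DataStax cassandra-driver

    consumer = KafkaConsumer("tweets", bootstrap_servers=["localhost:9092"])
    session = Cluster(["localhost"]).connect("analytics")    # keyspace name is an assumption

    # Assumed table: CREATE TABLE url_counts (hour text, url text, n counter, PRIMARY KEY (hour, url));
    update = session.prepare("UPDATE url_counts SET n = n + 1 WHERE hour = ? AND url = ?")

    for msg in consumer:
        tweet = json.loads(msg.value)
        hour = datetime.fromtimestamp(tweet["ts"], tz=timezone.utc).strftime("%Y-%m-%dT%H:00")
        parts = urlsplit(tweet["url"])           # same normalization as the batch sketch above
        url = urlunsplit((parts.scheme.lower(), parts.netloc.lower(), parts.path, "", ""))
        session.execute(update, (hour, url))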

Clojure example:

  • tweet reactor/tweet reaction/tweet reshare/now-secs/interaction/interaction-scores
  • serialization of data using thrift

Infrastructure components:

  • HDFS - distributed big data filesystem (data stored on disk, replicated across cluster nodes)
  • MapReduce - operations on HDFS data
  • Kafka - messaging queue (and later, distributed processing on messages)
  • Storm - distributed processing
  • Spark - distributed, parallelized computation on HDFS data
  • Cassandra - scalable database
  • HBase - database operating on top of HDFS
  • Zookeeper - highly reliable distributed coordination (maintains configuration info, naming, distributed synchronization, and group services)
  • ElephantDB - key/value store for data computed in Hadoop (a read-only NoSQL store fed by batch jobs)

Multi-tenancy:

  • Independent applications on a single cluster
  • Topologies should not affect each other
  • Topologies should have adequate resources (Apache Mesos)
  • When submitting a job, specify resources needed

Data engineering vs data science:

  • Data engineers have well-defined problems
  • Data scientists need specialized statistical skills
  • Data engineers deal with a larger scope - not just analytics

Open source:

  • Important for recruitment
  • Strong developers want to work where they can be involved in open source
  • Popular open-source projects give access to better engineers
  • Identify good recruits, learn best practices, get help - not "free work"

Ideal data engineers:

  • Strong software engineering skills
  • Abstraction
  • Testing
  • Version control
  • Refactoring
  • Strong algorithm skills
  • Digging into open source code
  • Stress testing

Finding strong data engineers:

  • Standard "code on a whiteboard" interviews are useless
  • Take-home projects to gauge general abilities
  • Best: see projects requiring data engineering

Data Engineering Example: Twitter

Data Engineering/Twitter Example

Data Engineering Scenarios

In line with the data-engineering-scenarios GitHub organization that I created (https://github.com/data-engineering-scenarios), this page will contain notes on different scenarios - both finished and planned.

These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:

Compute Engine

An approach to cloud infrastructure that provides a greater degree of freedom, but requires more complicated configuration. Compute Engine gives you virtual machines that start out as bare OS images, so you have to build/install any software you need. This can be a pain, but it also gives you greater control.

Also see Container Engine section below.

Container Engine

Google Container Engine runs Kubernetes to schedule Docker containers onto nodes in the cloud; combined with a registry of versioned Docker images that can be pushed and pulled, it lets you scale a single container image out to as many running instances as needed.

Also see Docker.

Dataproc

Dataproc Technologies

This is the "classic" big data technology - distributed computing on clusters.

Google Cloud product:

  • Dataproc - allocate clusters, run jobs

Amazon product:

  • Amazon EMR (Elastic MapReduce) - allocate clusters of EC2 instances, run jobs

Hadoop ecosystem:

  • Hadoop - the big data technology that started it all; processes data in parallel across many nodes using the MapReduce framework (a word-count sketch follows this list)
  • Pig - works with Hadoop; higher-level scripting language that shortens Hadoop jobs
  • Hive - data warehouse layer that sits on top of Hadoop; gives an SQL-like interface (HiveQL) to query data (queries are compiled down to MapReduce jobs)
  • HBase - Java non-relational database modeled on Google's BigTable; runs on top of Hadoop/HDFS, can serve as a source/sink for MapReduce jobs, and stores data as a column-oriented key/value store; no SQL queries - access is via its API or MapReduce
  • Phoenix - turns HBase (non-relational, non-SQL database) into an SQL-like data store
  • Parquet - column-oriented storage file format for tables in the Hadoop ecosystem
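
As a concrete example of the MapReduce model, here is a minimal word-count mapper and reducer written in Python for Hadoop Streaming (file names are placeholders; they would be passed to the hadoop-streaming jar as the mapper and reducer):

    # mapper.py - read lines from stdin, emit one (word, 1) pair per word
    import sys
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

    # reducer.py - input arrives sorted by key, so sum each run of identical words
    import sys
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))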

Spark technologies:

  • Spark - distributed computation framework similar to Hadoop MapReduce, but keeps intermediate data in memory for faster (especially iterative) computation
  • PySpark - Python bindings for Spark (Spark itself runs on the JVM and is written in Scala)
  • SparkSQL - allows SQL queries inside Spark programs, e.g., running an SQL query against Hive and passing the results on to Spark computations (sketched below)
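
A small PySpark sketch of that Hive-to-Spark pattern (the url_counts table and its columns are made up; assumes a Hive metastore is available):

    from pyspark.sql import SparkSession

    # SparkSession with Hive support, so spark.sql() can see Hive tables
    spark = (SparkSession.builder
             .appName("hive-to-spark")
             .enableHiveSupport()
             .getOrCreate())

    # Run an SQL query against a Hive table, then keep going with Spark operations
    df = spark.sql("SELECT url, hour, n FROM url_counts WHERE n > 100")
    top = df.groupBy("url").sum("n").orderBy("sum(n)", ascending=False)
    top.show(10)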

Dataproc Scenario

The scenario here is dataproc-spark-kmeans-images-bigquery

Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery

This spins up a Dataproc cluster and runs a Spark job on the cluster that downloads images, extracts the dominant colors from each image with k-means clustering, and pushes the results to BigQuery. A rough sketch of the k-means step follows.
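
The sketch below uses the RDD-based MLlib API; the image path, k, and the surrounding cluster/BigQuery plumbing are placeholders, and the actual repo may differ:

    import numpy as np
    from PIL import Image
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="kmeans-colors")

    # Pixels of one image as an RDD of RGB triples (local file path is a placeholder)
    pixels = np.asarray(Image.open("photo.jpg").convert("RGB")).reshape(-1, 3)
    rdd = sc.parallelize(pixels.astype(float).tolist())

    # Cluster the pixel colors; the cluster centers are the dominant colors
    model = KMeans.train(rdd, k=5, maxIterations=10)
    dominant_colors = [center.tolist() for center in model.clusterCenters]
    print(dominant_colors)   # in the scenario, these rows get written to BigQuery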

Dataflow

Dataflow Technologies

Google Cloud product:

  • Dataflow - building data processing pipelines for transforming streams, with sources/sinks
  • PubSub - (unordered) streaming events and messaging
  • Difference - PubSub is a messaging service that provides JUST ONE OF MANY sources/sinks for Dataflow

Amazon product:

  • Kinesis - Amazon's streaming data service for ingesting and processing event/message streams (roughly the counterpart to PubSub/Dataflow)

Apache projects:

  • Kafka - publishing and subscribing to message streams, stream-processing, and storage of messages in fault-tolerant clusters
  • Avro - a data serialization system; turns rich data structures into compact binary streams that can be easily passed around; uses dynamic typing (no code generation required - everything is driven by the schema); smaller serialized size, since type information is not repeated with every record - instead the data is stored alongside its schema (see the sketch after this list)
  • Thrift - a cross-language framework (interface definition language plus generated code) that lets programs written in different languages pass data and call services between them
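
A small example of the Avro point, using the fastavro library (the record schema and file name are made up); note the schema is written once into the file header rather than repeated with every record:

    from fastavro import writer, reader, parse_schema

    schema = parse_schema({
        "type": "record",
        "name": "Interaction",
        "fields": [
            {"name": "user", "type": "string"},
            {"name": "url",  "type": "string"},
            {"name": "ts",   "type": "long"},
        ],
    })

    records = [{"user": "alice", "url": "http://example.com", "ts": 1500000000}]

    with open("interactions.avro", "wb") as f:
        writer(f, schema, records)        # schema goes into the file header once

    with open("interactions.avro", "rb") as f:
        for rec in reader(f):             # reader recovers the schema from the header
            print(rec)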

Dataflow Scenarios

Scenario:

  • Docker pod - generating messages and publishing them to a pipeline (a publishing sketch follows this list)
  • Docker container running a collector (unstructured/nosql)
  • Docker container running a dashboard to visualize the collector database
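
A minimal sketch of the message-generating container, publishing to Cloud PubSub with the google-cloud-pubsub client (project ID, topic name, and payload fields are all placeholders):

    import json
    import time
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "generated-events")   # placeholders

    while True:
        event = {"ts": time.time(), "value": 42}                          # stand-in payload
        publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
        time.sleep(1)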

Query

Query Technologies

Google Cloud products:

  • BigQuery - serverless data warehouse; SQL analytics over petabyte-scale datasets
  • BigTable - large-scale, non-relational (wide column) database
  • CloudSQL - managed relational (MySQL/PostgreSQL) databases in the cloud

Query Scenarios

Scenario 1: BigQuery examples (working out how to assemble SQL queries) for open data sets on BigQuery

Link: https://github.com/charlesreid1/sabermetrics-bigquery
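
For reference, the shape of a query call in these scenarios, using the google-cloud-bigquery client against one of the public sample tables (the queries in the repo itself differ):

    from google.cloud import bigquery

    client = bigquery.Client()   # picks up default GCP credentials

    query = """
        SELECT word, SUM(word_count) AS n
        FROM `bigquery-public-data.samples.shakespeare`
        GROUP BY word
        ORDER BY n DESC
        LIMIT 10
    """
    for row in client.query(query).result():
        print(row.word, row.n)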

Scenario 2: Docker-containerized SQL database plus a Jupyter notebook, for neural network training

Link: https://github.com/data-engineering-scenarios/kaggle-sql-jupyter-keras

Scenario 3: BigQuery as source/sink for images in dataproc-spark-kmeans-images-bigquery

Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery

Machine Learning

Machine Learning Technologies

Scikit:

  • scikit-learn
  • sklearn-pandas

Supporting py-data libraries:

  • Pandas - join, merge, groupby, shift, time series analysis, SQL to dataframe (see the sketch after this list)
  • SQLAlchemy - pulls SQL data into Python
  • Seaborn - statistical plotting: linear regression fits, basic model visualization, the essential plot types
  • OpenCV - object and face detection
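
A short sketch of the Pandas/SQLAlchemy combination (the SQLite file, table, and columns are made up): pull SQL data into a dataframe, then do groupby/resample style time series work.

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("sqlite:///events.db")     # placeholder database
    df = pd.read_sql("SELECT ts, url, n FROM url_counts", engine, parse_dates=["ts"])

    # groupby + resample: hourly totals per url
    hourly = df.set_index("ts").groupby("url")["n"].resample("1H").sum()
    print(hourly.head())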

Classic Machine Learning Scenarios

Scenario ideas:

  • Time series for messaging services - logs and traffic, outlier detection, publishing messages when anomalies detected
  • Web frontend for OpenCV - bounding boxes where objects are found (detection core sketched below)
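
A sketch of the detection core for the OpenCV idea, using the frontal-face Haar cascade bundled with opencv-python as a stand-in detector (the web frontend and the image path are assumptions):

    import cv2

    # Haar cascade shipped with OpenCV for frontal faces
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    img = cv2.imread("upload.jpg")                     # placeholder uploaded image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    for (x, y, w, h) in boxes:                         # draw the bounding boxes to send back
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imwrite("annotated.jpg", img)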

Neural Network Machine Learning

Neural Network Machine Learning Technologies

Google Cloud:

  • Cloud ML APIs - using packaged/bundled API calls for machine learning
  • Cloud ML Engine - training TensorFlow models in the cloud with elastic cluster sizes
  • Compute Engine - scaling workflows to large data sets "by hand"
  • (Integration of larger data stores, e.g., BigQuery/Cloud Storage, with ML training)

Software:

  • Keras
  • TensorFlow
  • Sonnet
  • Theano
  • MXNet
  • etc etc etc

Goals?

  • Predictive analytics
  • Creating business value from unstructured/very large/unanalyzed data sets

Neural Network Machine Learning Scenarios

Scenario 1: SQL data in a Docker container, training a Keras neural network model

Link: https://github.com/data-engineering-scenarios/kaggle-sql-jupyter-keras
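
A minimal sketch of the Keras training step in that scenario (the feature/label arrays and network size are placeholders for data pulled out of the SQL container):

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    # Placeholder data standing in for features queried out of the SQL database
    X = np.random.rand(1000, 20)
    y = (X.sum(axis=1) > 10).astype(int)

    model = Sequential([
        Dense(64, activation="relu", input_shape=(20,)),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)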

Scenario notes:

  • Don't reinvent the wheel, use pre-trained models and APIs
  • Cover different challenges: out-of-memory errors with large training sets, helper libraries like Fuel and Kerosene, HDF5 compression/storage, sparse events or large feature sets
  • Scenario template: JS frontend, Flask glue, Keras/other Python backend

Scenario ideas:

  • Pre-trained image recognition model, prediction of type of object, wrap front-end with graphs to show data, objects detected, etc.
  • Trained face differences, upload two faces, give prediction.

GCDEC

Working through the Google Cloud Data Engineer certification course... See GCDEC for pages related to that.

Apache

Apache software projects related to Data Engineering:
