From charlesreid1

 
(3 intermediate revisions by 2 users not shown)
Line 107: Line 107:
==Data Engineering Example: Twitter==
==Data Engineering Example: Twitter==


Real time data:
[[Data Engineering/Twitter Example]]
* online queries for a web request
* offline computations with very low latency
* latency and throughput equally important
* Hadoop is too high-latency!


Four data problems at twitter:
=Data Engineering Scenarios=
* Tweets
* Timelines
* Social graphs
* Search indices


===Tweets===
In line with the data-engineering-scenarios Github organization that I created (https://github.com/data-engineering-scenarios), this page will contain notes on different scenarios - both finished and planned.


Definitions of data:
These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:
* Tweet is primary key id, user id, text, timestamp (and replies)
* Row storage
* Initially: single table vertically scaled
* Initially: master-worker replication (writes to master, replication to workers)
* Initially: Memmcached for reads (rails reads the real database, populates memcached instances)
 
Issues:
* Scaling isk space: disk arrays > 800 GB problematic
* At 3 trillion tweets, disk space 90% utilized
 
Solutions:
* Partition: partition by primary key (one cluster holds one set of keys, another holds different)
* Partition: tweet IDs partitioned by user
* Partition: tweets by time
* Try each partition in order, until enough data accumulated
 
Locality of databases:
* Memcached: primary key lookup is 1 ms
* MySQL: primary key lookup is < 10 ms
* Exploit locality speed by organizing tweets by timestamp (usually only 1 partition checked)
 
Problems:
* write throughput
* Deadlocks in MySQL (if tweet volume gets crazy)
* Temporal shard creation is manual process
 
More solutions:
* Cassandra (NoSQL, non-relational)
* primary key partition (tweet ID)
* but also, secondary key on user ID
 
===Timelines===
 
Definitions:
* Timeline is series of tweet ids
* Query pattern: organized by user ID
* Operations: append, merge, truncate
* high velocity bounded vector
* Space-based


Primitive approach: SQL query:
==Compute Engine==
* Use the following SQL query (below)
* Bingo, you have your subquery of followers who are tweeting at anyone, passed into the search for tweets
* BUT - what happens if you have lots of friends, or if the number of source tweet IDs cant fit in RAM?


<pre>
An approach to cloud infrastructure that provides a greater degree of freedom, but requires more complicated configuration. Compute Engine gives you virtual machines that start as bare metal, so you have to build/install any software you need. This can be a pain but also gives you greater control.
SELECT * FROM tweets WHERE user_id IN (SELECT source_id FROM followers WHERE destination_id = ?) ORDER BY created_at DESC LIMIT 20
</pre>


offline vs online computation:
Also see Container Engine section below.
* Sequences can be stored in memcached (individual timelines)
* You pass a status to Fanout
* Fanout is offline, but has a low-latency SLlA
* Truncate at random intervals, ensuring bounded length
* What to do on a cache miss?
* Merge the user timelines


Stats:
==Container Engine==
* 2008: 30 TPS, 120 TPS peak
* 2010: 700 TPS, 2,000 TPS peak
* 1.2M deliveries per second


Memory hierarchy:
The Google Cloud container engine basically provides a version control system on Docker images, which can then be pushed and pulled onto nodes in the cloud. This allows you to scale a single container image to deploy many instances, as needed.
* Possibilities:
* Fanout to disk (Lots of IOPS required, even iwth buffering; cost of rebuilding data from other data stores is reasonable; fanout to memory)


Principles:
Also see [[Docker]].
* Offline vs online computation
* Some problems can be pre-computed (if amt of work bounded, and queyr pattern limtied)
* Must keep memory hierarchy in mind
* Efficiency of system includes cost of generating data from another source times probability of needing to
 
===Social Graphs===
 
Who follows you? Who do you follow? Who have you blocked, etc.
 
Operations:
* Enumerate by time
* Set operations: intersection, difference, union
* Inclusion, cardinality
 
Spam problems:
* Need mass delete ability
 
Temporal enumeration:
* Who you followed, listed by when you followed them
* Inclusion: do they follow you too?
* Cardinality: How many followers do they have? How many people are following them?
 
If person A tweets at Person B:
* Want to deliver tweet to people who follow both person A and person B
* Original implementation: single table, source_id and destination_id, and each contains ID of source/destination
* Single table, master-worker replication
 
What problems did this cause?
* Write throughput problem
* Inputs could not be kept in RAM
 
Solution:
* Partition by User ID
* Edges stored in BOTH forward AND backward direction (same tweet stored twice)
* Indexed by time
* Indexed by element (for doing set algebra)
* Partitioned by user: source_id of one and destination_id of other are identical
 
Challenges:
* Consistency in the presence of failures
* Write operations: idempotent (retry until success)
* Last-write wins for edges
* Commutative strategies for mass writes
* Low latency - ms time scale
 
Principles:
* Impossible to precompute set algebra queries
* Simple distributed coordination techniques
* Partition, replicate, and index
* Many efficient scalability problems are solved this way: Partition, Replicate, Index
 
===Search Indices===
 
Results of searching for, e.g., "mountain dew cheetos"
 
Search index:
* Find all tweets with these words in it
* Posting list
* Boolean
* Queries
* Complex queries
* Ad hoc queries
* Relevancy is recency (this ignores the non-real-time component to search...)
 
Searching for, e.g., mountain dew cheetos is the intersection of three posting lists
* Original implementation: single table, vertically scaled
* One column: term ID
* Another column: document ID
* Master-worker replication for read throughput
 
Problems: index cannot be kept in memory
 
Current implementation:
* Partitioned by TIME first
* Then partitioned by term ID...
* Use delayed key-write
 
What problems does the solution create:
* Write throughput issues
* Queries for rare terms have to search MANY partitions
* Space efficiency and recall
* MySQL requires loooots of memory
 
=Data Engineering Scenarios=
 
In line with the data-engineering-scenarios Github organization that I created (https://github.com/data-engineering-scenarios), this page will contain notes on different scenarios - both finished and planned.
 
These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:


==Dataproc==
==Dataproc==
Line 419: Line 268:
Working through the Google Cloud Data Engineer certification course... See [[GCDEC]] for pages related to that.
Working through the Google Cloud Data Engineer certification course... See [[GCDEC]] for pages related to that.


=Apache=
Apache software projects related to Data Engineering:
* Avro
* [[DataFusion]]
* Iceberg


=Flags=
=Flags=


[[Category:Data Engineering]]
[[Category:Data Engineering]]

Latest revision as of 17:54, 26 May 2025

Overview

Data engineering - software engineering with an emphasis on dealing with large amounts of data

What is Data Engineering

Enable others to answer questions using datasets, within latency constraints

Components:

  • Distributed systems
  • Parallel processing
  • Databases
  • Queuing

Purpose?

  • Human fault tolerance
  • Metrics
  • Monitoring
  • Multi-tenancy

Example of where you start:

  • Searches by keyword/user only
  • Basic statistics only
  • Using someone else's search engine

Example stack:

  • Custom crawlers ingesting data (Gearman)
  • Passing data off to custom workers
  • Dumping data to MySQL/Sphinx/etc

Problems:

  • Inflexibility
  • Corruption is highly probable
  • High burden on operations
  • No scalability
  • No fault tolerance

Alternative stack:

  • Many collectors dumping to Amazon S3
  • Analysis with Hadoop
  • ElephantDB
  • Low latency (but lead time of several hours)
  • More advanced statistics (influencer of, influenced by)

Data pipeline example:

  • Tweets go to S3
  • URLS are normalized
  • Each hour, new compute bucket
  • Sum by hour and by url
  • Emit ElephantDB indices

Another data pipeline example:

  • Tweets go to Kafka
  • URLs are normalized
  • Each hour, new compute bucket
  • Update hour/url bucket
  • Send data to Cassandra

Clojure example:

  • tweet reactor/tweet reaction/tweet reshare/now-secs/interaction/interaction-scores
  • serialization of data using thrift

Infrastructure components:

  • HDFS - distributed in-memory big data filesystem
  • MapReduce - operations on HDFS data
  • Kafka - messaging queue (and later, distributed processing on messages)
  • Storm - distributed processing
  • Spark - distributed, parallelized computation on HDFS data
  • Cassandra - scalable database
  • HBase - database operating on top of HDFS
  • Zookeeper - highly reliable distributed coordination (maintain config info, naming, synchronization, and multiple services)
  • ElephantDB - like a NoSQL Hadoop store - key/value data in Hadoop

Multi-tenancy:

  • Independent applications on a single cluster
  • Topologies should not affect each other
  • Topologies should have adequate resources (Apache Mesos)
  • When submitting a job, specify resources needed

Data engineering vs data science:

  • Data engineers have well-defined problems
  • Data scientists need specialized statistical skills
  • Data engineers deal with a larger scope - not just analytics

Open source:

  • Important for recruitment
  • Strong developers want to work where they can be involved in open source
  • Popular open-source projects give access to better engineers
  • identify good recruits, learn best practices, get help - not "free work"

Ideal data engineers:

  • Strong software engineering skills
  • Abstraction
  • Testing
  • Version control
  • Refactoring
  • Strong software engineering skills
  • Strong algorithm skills
  • Digging into open source code
  • Stress testing

Finding strong data engineers:

  • Standard "code on a whiteboard" interviews are useless
  • Take-home projects to gauge general abilities
  • Best: see projects requiring data engineering

Data Engineering Example: Twitter

Data Engineering/Twitter Example

Data Engineering Scenarios

In line with the data-engineering-scenarios Github organization that I created (https://github.com/data-engineering-scenarios), this page will contain notes on different scenarios - both finished and planned.

These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:

Compute Engine

An approach to cloud infrastructure that provides a greater degree of freedom, but requires more complicated configuration. Compute Engine gives you virtual machines that start as bare metal, so you have to build/install any software you need. This can be a pain but also gives you greater control.

Also see Container Engine section below.

Container Engine

The Google Cloud container engine basically provides a version control system on Docker images, which can then be pushed and pulled onto nodes in the cloud. This allows you to scale a single container image to deploy many instances, as needed.

Also see Docker.

Dataproc

Dataproc Technologies

This is the "classic" big data technology - distributed computing on clusters.

Google Cloud product:

  • Dataproc - allocate clusters, run jobs

Amazon product:

  • Amazon EC2 - allocate clusters, run jobs

Hadoop ecosystem:

  • Hadoop - the big data technology that started it all; processing data in parallel on nodes using MapReduce framework
  • Pig - works with Hadoop; higher-level scripting language that shortens Hadoop jobs
  • Hive - data warehouse that sits on Hadoop (or Pig); gives SQL-like interface to query data. (SQL queries are implemented in MapReduce)
  • HBase - Java software for non-relational databases, analogous to Google's BigTable; runs on Hadoop, can serve as source/sink for MapReduce queries, is a column-based key store; no SQL queries - MapReduce only
  • Phoenix - turns HBase (non-relational, non-SQL database) into an SQL-like data store
  • Parquet - column-based table storage that sits on Hadoop

Spark technologies:

  • Spark - similar to Hadoop, but more focused on efficient computation
  • PySpark - Python bindings for Spark (Java)
  • SparkSQL - allows SQL queries in Spark programs, e.g., running an SQL query on Hive, and passing the results to Spark computations

Dataproc Scenario

The scenario here is dataproc-spark-kmeans-images-bigquery

Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery

This gets a Dataproc cluster, and runs a Spark job on the cluster that downloads images, extracts k mean color clusters from the image, and pushes the results to BigQuery.

Dataflow

Dataflow Technologies

Google Cloud product:

  • Dataflow - building data processing pipelines for transforming streams, with sources/sinks
  • PubSub - (unordered) streaming events and messaging
  • Difference - PubSub is a messaging service that provides JUST ONE OF MANY sources/sinks for Dataflow

Amazon product:

  • Kinesis - streaming events? messaging?

Apache projects:

  • Kafka - publishing and subscribing to message streams, stream-processing, and storage of messages in fault-tolerant clusters
  • Avro - a data serialization service; turns rich data structures into streams of binary data that can be easily passed around; uses dynamic typing (no code generated - based on schema); smaller serialization size (info about scheme doesn't travel with the data - but data is stored alongside its schema.)
  • Thrift - provides cross-talk language for programs in different languages to pass data between them (data and service interfaces)

Dataflow Scenarios

Scenario:

  • Docker pod - generating messages and publishing them to a pipeline
  • Docker container running a collector (unstructured/nosql)
  • Docker container running a dashboard to visualize the collector database

Query

Query Technologies

Google Cloud products:

  • BigQuery - petabyte-scale datasets
  • BigTable - large, non-relational databases
  • CloudSQL - elastic, scalable SQL databases in the cloud

Query Scenarios

Scenario 1: BigQuery examples (working out assembling SQL queries) for open data sets on BigQuery

Link: https://github.com/charlesreid1/sabermetrics-bigquery

Scenario 2: Docker-containerized SQL database, jupyter notebook, for neural network training

Link: https://github.com/data-engineering-scenarios/kaggle-sql-jupyter-keras

Scenario 3: BigQuery as source/sink for images in dataproc-spark-kmeans-images-bigquery

Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery

Machine Learning

Machine Learning Technologies

Scikit:

  • scikit-learn
  • sklearn-pandas

Supporting py-data libraries:

  • Pandas - join, merge, groupby, shift, time series analysis, SQL to dataframe
  • SQLAlchemy - SQL data into Python
  • Seaborn - linear regression, basic models, essential plot types
  • OpenCV - object and face detection

Classic Machine Learning Scenarios

Scenario ideas:

  • Time series for messaging services - logs and traffic, outlier detection, publishing messages when anomalies detected
  • Web frontend for OpenCV - bounding boxes where objects found

Neural Network Machine Learning

Neural Network Machine Learning Technologies

Google Cloud:

  • Cloud ML APIs - using packaged/bundled API calls for achine learning.
  • Cloud ML Engine - training TensorFlow models in the cloud with elastic cluster sizes
  • Compute Engine - scaling workflows to large data sets "by hand"
  • (Integration of larger data stores, e.g., BigQuery/Cloud Storage, with ML training)

Software:

  • Keras
  • TensorFlow
  • Sonnet
  • Theano
  • MXNet
  • etc etc etc

Goals?

  • Predictive analytics
  • Creating business value from unstructured/very large/unanalyzed data sets

Neural Network Machine Learning Scenarios

Scenario 1: SQL data in a Docker container, training a Keras neural network model

Link: https://github.com/data-engineering-scenarios/kaggle-sql-jupyter-keras

Scenario notes:

  • Don't reinvent the wheel, use pre-trained models and APIs
  • Cover different challenges (OOM and large training sets), fuel/kerosene and helper libraries, HDF5 compression/storage, sparse events or large feature sets
  • Scenario template: JS frontend, Flask glue, Keras/other Python backend

Scenario ideas:

  • Pre-trained image recognition model, prediction of type of object, wrap front-end with graphs to show data, objects detected, etc.
  • Trained face differences, upload two faces, give prediction.

GCDEC

Working through the Google Cloud Data Engineer certification course... See GCDEC for pages related to that.

Apache

Apache software projects related to Data Engineering:

Flags