Data Engineering/Twitter Example
Data Engineering Examples: Twitter
Real time data:
- online queries for a web request
- offline computations with very low latency
- latency and throughput equally important
- Hadoop is too high-latency!
Four data problems at Twitter:
- Tweets
- Timelines
- Social graphs
- Search indices
Tweets
Definitions of data:
- Tweet is primary key ID, user ID, text, timestamp (and replies)
- Row storage
- Initially: single table vertically scaled
- Initially: master-worker replication (writes to master, replication to workers)
- Initially: Memcached for reads (Rails reads the real database and populates the memcached instances)
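A minimal sketch of that read-through path (Python; the dicts stand in for memcached and the MySQL table, and all names are illustrative):

# Cache-aside read: check memcached first, fall back to MySQL, then populate the cache.
# Everything here is an illustrative stand-in, not Twitter's actual schema or clients.
CACHE = {}          # stands in for a memcached instance
DATABASE = {        # stands in for the vertically scaled MySQL table
    1: {"id": 1, "user_id": 42, "text": "hello", "created_at": "2010-01-01"},
}

def get_tweet(tweet_id):
    tweet = CACHE.get(tweet_id)          # ~1 ms in the real system
    if tweet is None:
        tweet = DATABASE.get(tweet_id)   # < 10 ms primary-key lookup in MySQL
        if tweet is not None:
            CACHE[tweet_id] = tweet      # populate the cache after the DB read
    return tweet

print(get_tweet(1))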
Issues:
- Scaling disk space: disk arrays > 800 GB problematic
- At 3 trillion tweets, disk space 90% utilized
Solutions:
- Partition: partition by primary key (one cluster holds one set of keys, another holds a different set)
- Partition: tweet IDs partitioned by user
- Partition: tweets by time
- Try each partition in order, until enough data accumulated
Locality of databases:
- Memcached: primary key lookup is 1 ms
- MySQL: primary key lookup is < 10 ms
- Exploit locality speed by organizing tweets by timestamp (usually only 1 partition checked)
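A toy version of the newest-first partition probe (partition layout and tweet IDs are made up):

# Tweets partitioned by time bucket; a read probes the newest partition first
# and only touches older partitions if it still needs more rows.
PARTITIONS = [                     # newest first
    {"bucket": "2010-04", "tweets": [105, 104, 103]},
    {"bucket": "2010-03", "tweets": [102, 101]},
    {"bucket": "2010-02", "tweets": [100]},
]

def recent_tweets(limit):
    results = []
    for partition in PARTITIONS:   # usually only the first partition is checked
        results.extend(partition["tweets"])
        if len(results) >= limit:
            break
    return results[:limit]

print(recent_tweets(4))            # -> [105, 104, 103, 102]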
Problems:
- write throughput
- Deadlocks in MySQL (when tweet volume spikes)
- Temporal shard creation is a manual process
More solutions:
- Cassandra (NoSQL, non-relational)
- primary key partition (tweet ID)
- but also, secondary key on user ID
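A toy model of that layout, with one map standing in for the tweet-ID partition and another for the user-ID secondary index (not actual Cassandra usage):

# Primary store keyed by tweet ID, plus a secondary index keyed by user ID.
from collections import defaultdict

tweets_by_id = {}                    # primary: tweet_id -> tweet
tweets_by_user = defaultdict(list)   # secondary index: user_id -> [tweet_id, ...]

def store_tweet(tweet_id, user_id, text):
    tweets_by_id[tweet_id] = {"user_id": user_id, "text": text}
    tweets_by_user[user_id].append(tweet_id)

store_tweet(1, 42, "first")
store_tweet(2, 42, "second")
print(tweets_by_user[42])            # -> [1, 2]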
Timelines
Definitions:
- Timeline is a series of tweet IDs
- Query pattern: organized by user ID
- Operations: append, merge, truncate
- high velocity bounded vector
- Space-based
Primitive approach: SQL query:
- Use the following SQL query (below)
- Bingo: the subquery pulls the relevant user IDs from the followers table, and they are passed into the search for tweets
- BUT: what happens if the user has lots of friends, or the list of source IDs can't fit in RAM?
SELECT * FROM tweets
WHERE user_id IN
  (SELECT source_id FROM followers WHERE destination_id = ?)
ORDER BY created_at DESC
LIMIT 20
offline vs online computation:
- Sequences can be stored in memcached (individual timelines)
- You pass a status to Fanout
- Fanout is offline, but has a low-latency SLA
- Truncate at random intervals, ensuring bounded length
- What to do on a cache miss?
- Merge the user timelines
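A rough sketch of fanout plus the cache-miss merge (bounded in-memory timelines; the names and the 800-entry cap are illustrative):

# Fanout: when a user tweets, push the tweet ID onto every follower's cached
# timeline and truncate to keep it bounded. On a cache miss, rebuild the
# timeline by merging the followed users' own timelines.
import heapq
from collections import defaultdict

MAX_TIMELINE = 800                    # bounded length (illustrative number)
user_tweets = defaultdict(list)       # author_id -> [tweet_id, ...], oldest first
timeline_cache = {}                   # user_id -> [tweet_id, ...], newest first (memcached stand-in)
followers = defaultdict(set)          # author_id -> set of follower IDs

def fanout(author_id, tweet_id):
    user_tweets[author_id].append(tweet_id)
    for follower_id in followers[author_id]:
        timeline = timeline_cache.setdefault(follower_id, [])
        timeline.insert(0, tweet_id)
        del timeline[MAX_TIMELINE:]   # truncate to keep the vector bounded

def read_timeline(user_id, following):
    cached = timeline_cache.get(user_id)
    if cached is not None:
        return cached
    # cache miss: merge the followed users' tweets, newest first
    merged = heapq.merge(*(reversed(user_tweets[f]) for f in following), reverse=True)
    rebuilt = list(merged)[:MAX_TIMELINE]
    timeline_cache[user_id] = rebuilt
    return rebuilt

followers[1] = {2}                        # user 2 follows user 1
fanout(1, 101)
fanout(1, 102)
print(read_timeline(2, following={1}))    # -> [102, 101]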
Stats:
- 2008: 30 TPS, 120 TPS peak
- 2010: 700 TPS, 2,000 TPS peak
- 1.2M deliveries per second
Memory hierarchy:
- Possibilities:
- Fanout to disk: lots of IOPS required, even with buffering
- Fanout to memory: viable because the cost of rebuilding the data from other data stores is reasonable
Principles:
- Offline vs online computation
- Some problems can be pre-computed (if the amount of work is bounded and the query pattern is limited)
- Must keep memory hierarchy in mind
- Efficiency of system includes cost of generating data from another source times probability of needing to
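That last principle as a one-line calculation (all numbers made up):

# Effective cost of keeping a timeline only in memory:
#   cost = hit_cost + P(miss) * rebuild_cost
hit_cost_ms = 1.0        # memcached read
rebuild_cost_ms = 50.0   # merging source timelines from other stores (made-up figure)
miss_probability = 0.02  # made-up figure

expected_cost_ms = hit_cost_ms + miss_probability * rebuild_cost_ms
print(expected_cost_ms)  # -> 2.0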
Social Graphs
Who follows you? Who do you follow? Who have you blocked, etc.
Operations:
- Enumerate by time
- Set operations: intersection, difference, union
- Inclusion, cardinality
Spam problems:
- Need mass delete ability
Temporal enumeration:
- Who you followed, listed by when you followed them
- Inclusion: do they follow you too?
- Cardinality: how many people do they follow? How many followers do they have?
If person A tweets at Person B:
- Want to deliver tweet to people who follow both person A and person B
- Original implementation: single table with source_id and destination_id columns, holding the IDs of the source and destination users
- Single table, master-worker replication
What problems did this cause?
- Write throughput problem
- Inputs could not be kept in RAM
Solution:
- Partition by User ID
- Edges stored in BOTH forward AND backward direction (the same edge is stored twice)
- Indexed by time
- Indexed by element (for doing set algebra)
- Partitioned by user: the forward copy is partitioned by source_id, the backward copy by destination_id
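A toy version of the double-written edges and the set algebra they enable (illustrative, not the production schema):

# Each follow edge is written twice: forward (who this user follows) and
# backward (who follows this user), so both directions are local to one
# user's partition. Delivery of an A-to-B reply is then a set intersection.
from collections import defaultdict

forward = defaultdict(set)    # source_id -> destinations they follow (partitioned by source_id)
backward = defaultdict(set)   # destination_id -> their followers (partitioned by destination_id)

def follow(source_id, destination_id):
    forward[source_id].add(destination_id)
    backward[destination_id].add(source_id)

def reply_audience(a, b):
    # deliver A's reply to B only to people who follow BOTH A and B
    return backward[a] & backward[b]

follow(10, 1); follow(10, 2)    # user 10 follows A (1) and B (2)
follow(11, 1)                   # user 11 follows only A
print(reply_audience(1, 2))     # -> {10}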
Challenges:
- Consistency in the presence of failures
- Write operations: idempotent (retry until success)
- Last-write wins for edges
- Commutative strategies for mass writes
- Low latency - ms time scale
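A sketch of idempotent, last-write-wins edge writes (states and timestamps are made up):

# Idempotent edge writes: replaying the same write (a retry) leaves the edge
# unchanged, and when two writes race, the newer timestamp wins.
edges = {}   # (source_id, destination_id) -> {"state": ..., "updated_at": ...}

def write_edge(source_id, destination_id, state, updated_at):
    key = (source_id, destination_id)
    current = edges.get(key)
    if current is None or updated_at >= current["updated_at"]:
        edges[key] = {"state": state, "updated_at": updated_at}   # last write wins

write_edge(10, 1, "follows", updated_at=100)
write_edge(10, 1, "follows", updated_at=100)    # retry after a failure: no harm done
write_edge(10, 1, "removed", updated_at=90)     # stale write arrives late: ignored
write_edge(10, 1, "blocked", updated_at=120)    # newest write wins
print(edges[(10, 1)]["state"])                  # -> "blocked"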
Principles:
- Impossible to precompute set algebra queries
- Simple distributed coordination techniques
- Partition, replicate, and index
- Many scalability problems are solved efficiently this way: partition, replicate, index
Search Indices
Results of searching for, e.g., "mountain dew cheetos"
Search index:
- Find all tweets with these words in it
- Posting list
- Boolean
- Queries
- Complex queries
- Ad hoc queries
- Relevancy is recency (this ignores the non-real-time component to search...)
Searching for, e.g., mountain dew cheetos is the intersection of three posting lists
- Original implementation: single table, vertically scaled
- One column: term ID
- Another column: document ID
- Master-worker replication for read throughput
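A toy posting-list version of that term ID / document ID table and the intersection query (values made up):

# One entry per (term, doc_id); a multi-term query intersects the posting lists.
from collections import defaultdict

postings = defaultdict(set)    # term -> set of doc IDs (tweet IDs)

def index_tweet(doc_id, terms):
    for term in terms:
        postings[term].add(doc_id)

def search(terms):
    # "mountain dew cheetos" = intersection of three posting lists
    lists = [postings[t] for t in terms]
    return set.intersection(*lists) if lists else set()

index_tweet(1, ["mountain", "dew", "cheetos"])
index_tweet(2, ["mountain", "dew"])
print(search(["mountain", "dew", "cheetos"]))   # -> {1}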
Problems: index cannot be kept in memory
Current implementation:
- Partitioned by TIME first
- Then partitioned by term ID...
- Use delayed key-write
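A sketch of the time-first lookup (bucket layout and contents are made up):

# Index partitioned by time bucket first (newest first), with term postings
# inside each bucket; scan newer buckets first and stop once enough hits are
# found.
TIME_BUCKETS = [
    {"bucket": "2010-04", "postings": {"dew": [105, 104]}},
    {"bucket": "2010-03", "postings": {"dew": [101], "cheetos": [101]}},
]

def search(term, limit):
    hits = []
    for bucket in TIME_BUCKETS:                   # rare terms may touch many buckets
        hits.extend(bucket["postings"].get(term, []))
        if len(hits) >= limit:                    # relevancy is recency: stop early
            break
    return hits[:limit]

print(search("dew", 3))       # -> [105, 104, 101]
print(search("cheetos", 3))   # rare term: had to scan past the newest bucket -> [101]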
What problems does the solution create:
- Write throughput issues
- Queries for rare terms have to search MANY partitions
- Space efficiency and recall
- MySQL requires lots of memory