Data Engineering/Twitter Example
Data Engineering Examples: Twitter
Real time data:
- online queries for a web request
- offline computations with very low latency
- latency and throughput equally important
- Hadoop is too high-latency!
Four data problems at Twitter:
- Tweets
- Timelines
- Social graphs
- Search indices
Tweets
Definitions of data:
- Tweet is primary key ID, user ID, text, timestamp (and replies)
- Row storage
- Initially: single table vertically scaled
- Initially: master-worker replication (writes to master, replication to workers)
- Initially: Memcached for reads (Rails reads the real database and populates the memcached instances)
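A minimal sketch of that read-through path (Python; the dicts stand in for memcached and the MySQL table, and all names are illustrative):

# Cache-aside read: check memcached first, fall back to MySQL, then populate the cache.
# Everything here is an illustrative stand-in, not Twitter's actual schema or clients.
CACHE = {}          # stands in for a memcached instance
DATABASE = {        # stands in for the vertically scaled MySQL table
    1: {"id": 1, "user_id": 42, "text": "hello", "created_at": "2010-01-01"},
}

def get_tweet(tweet_id):
    tweet = CACHE.get(tweet_id)          # ~1 ms in the real system
    if tweet is None:
        tweet = DATABASE.get(tweet_id)   # < 10 ms primary-key lookup in MySQL
        if tweet is not None:
            CACHE[tweet_id] = tweet      # populate the cache after the DB read
    return tweet

print(get_tweet(1))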
Issues:
- Scaling disk space: disk arrays > 800 GB problematic
- At 3 trillion tweets, disk space 90% utilized
Solutions:
- Partition: partition by primary key (one cluster holds one set of keys, another holds a different set)
- Partition: tweet IDs partitioned by user
- Partition: tweets by time
- Try each partition in order, until enough data accumulated
Locality of databases:
- Memcached: primary key lookup is 1 ms
- MySQL: primary key lookup is < 10 ms
- Exploit locality speed by organizing tweets by timestamp (usually only 1 partition checked)
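A toy version of the newest-first partition probe (partition layout and tweet IDs are made up):

# Tweets partitioned by time bucket; a read probes the newest partition first
# and only touches older partitions if it still needs more rows.
PARTITIONS = [                     # newest first
    {"bucket": "2010-04", "tweets": [105, 104, 103]},
    {"bucket": "2010-03", "tweets": [102, 101]},
    {"bucket": "2010-02", "tweets": [100]},
]

def recent_tweets(limit):
    results = []
    for partition in PARTITIONS:   # usually only the first partition is checked
        results.extend(partition["tweets"])
        if len(results) >= limit:
            break
    return results[:limit]

print(recent_tweets(4))            # -> [105, 104, 103, 102]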
Problems:
- write throughput
- Deadlocks in MySQL (when tweet volume spikes)
- Temporal shard creation is a manual process
More solutions:
- Cassandra (NoSQL, non-relational)
- primary key partition (tweet ID)
- but also, secondary key on user ID
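A toy model of that layout, with one map standing in for the tweet-ID partition and another for the user-ID secondary index (not actual Cassandra usage):

# Primary store keyed by tweet ID, plus a secondary index keyed by user ID.
from collections import defaultdict

tweets_by_id = {}                    # primary: tweet_id -> tweet
tweets_by_user = defaultdict(list)   # secondary index: user_id -> [tweet_id, ...]

def store_tweet(tweet_id, user_id, text):
    tweets_by_id[tweet_id] = {"user_id": user_id, "text": text}
    tweets_by_user[user_id].append(tweet_id)

store_tweet(1, 42, "first")
store_tweet(2, 42, "second")
print(tweets_by_user[42])            # -> [1, 2]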
Timelines
Definitions:
- Timeline is a series of tweet IDs
- Query pattern: organized by user ID
- Operations: append, merge, truncate
- high velocity bounded vector
- Space-based
Primitive approach: SQL query:
- Use the following SQL query (below)
- Bingo: the subquery pulls the relevant user IDs from the followers table, and they are passed into the search for tweets
- BUT: what happens if the user has lots of friends, or the list of source IDs can't fit in RAM?
SELECT * FROM tweets
WHERE user_id IN
  (SELECT source_id FROM followers WHERE destination_id = ?)
ORDER BY created_at DESC
LIMIT 20
offline vs online computation:
- Sequences can be stored in memcached (individual timelines)
- You pass a status to Fanout
- Fanout is offline, but has a low-latency SLA
- Truncate at random intervals, ensuring bounded length
- What to do on a cache miss?
- Merge the user timelines
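A rough sketch of fanout plus the cache-miss merge (bounded in-memory timelines; the names and the 800-entry cap are illustrative):

# Fanout: when a user tweets, push the tweet ID onto every follower's cached
# timeline and truncate to keep it bounded. On a cache miss, rebuild the
# timeline by merging the followed users' own timelines.
import heapq
from collections import defaultdict

MAX_TIMELINE = 800                    # bounded length (illustrative number)
user_tweets = defaultdict(list)       # author_id -> [tweet_id, ...], oldest first
timeline_cache = {}                   # user_id -> [tweet_id, ...], newest first (memcached stand-in)
followers = defaultdict(set)          # author_id -> set of follower IDs

def fanout(author_id, tweet_id):
    user_tweets[author_id].append(tweet_id)
    for follower_id in followers[author_id]:
        timeline = timeline_cache.setdefault(follower_id, [])
        timeline.insert(0, tweet_id)
        del timeline[MAX_TIMELINE:]   # truncate to keep the vector bounded

def read_timeline(user_id, following):
    cached = timeline_cache.get(user_id)
    if cached is not None:
        return cached
    # cache miss: merge the followed users' tweets, newest first
    merged = heapq.merge(*(reversed(user_tweets[f]) for f in following), reverse=True)
    rebuilt = list(merged)[:MAX_TIMELINE]
    timeline_cache[user_id] = rebuilt
    return rebuilt

followers[1] = {2}                        # user 2 follows user 1
fanout(1, 101)
fanout(1, 102)
print(read_timeline(2, following={1}))    # -> [102, 101]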
Stats:
- 2008: 30 TPS, 120 TPS peak
- 2010: 700 TPS, 2,000 TPS peak
- 1.2M deliveries per second
Memory hierarchy:
- Possibilities:
- Fanout to disk: lots of IOPS required, even with buffering
- Fanout to memory: viable because the cost of rebuilding the data from other data stores is reasonable
Principles:
- Offline vs online computation
- Some problems can be pre-computed (if the amount of work is bounded and the query pattern is limited)
- Must keep memory hierarchy in mind
- Efficiency of system includes cost of generating data from another source times probability of needing to
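That last principle as a one-line calculation (all numbers made up):

# Effective cost of keeping a timeline only in memory:
#   cost = hit_cost + P(miss) * rebuild_cost
hit_cost_ms = 1.0        # memcached read
rebuild_cost_ms = 50.0   # merging source timelines from other stores (made-up figure)
miss_probability = 0.02  # made-up figure

expected_cost_ms = hit_cost_ms + miss_probability * rebuild_cost_ms
print(expected_cost_ms)  # -> 2.0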
Social Graphs
Who follows you? Who do you follow? Who have you blocked, etc.
Operations:
- Enumerate by time
- Set operations: intersection, difference, union
- Inclusion, cardinality
Spam problems:
- Need mass delete ability
Temporal enumeration:
- Who you followed, listed by when you followed them
- Inclusion: do they follow you too?
- Cardinality: how many people do they follow? How many followers do they have?
If person A tweets at Person B:
- Want to deliver tweet to people who follow both person A and person B
- Original implementation: single table with source_id and destination_id columns, holding the IDs of the source and destination users
- Single table, master-worker replication
What problems did this cause?
- Write throughput problem
- Inputs could not be kept in RAM
Solution:
- Partition by User ID
- Edges stored in BOTH forward AND backward direction (the same edge is stored twice)
- Indexed by time
- Indexed by element (for doing set algebra)
- Partitioned by user: the forward copy is partitioned by source_id, the backward copy by destination_id
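A toy version of the double-written edges and the set algebra they enable (illustrative, not the production schema):

# Each follow edge is written twice: forward (who this user follows) and
# backward (who follows this user), so both directions are local to one
# user's partition. Delivery of an A-to-B reply is then a set intersection.
from collections import defaultdict

forward = defaultdict(set)    # source_id -> destinations they follow (partitioned by source_id)
backward = defaultdict(set)   # destination_id -> their followers (partitioned by destination_id)

def follow(source_id, destination_id):
    forward[source_id].add(destination_id)
    backward[destination_id].add(source_id)

def reply_audience(a, b):
    # deliver A's reply to B only to people who follow BOTH A and B
    return backward[a] & backward[b]

follow(10, 1); follow(10, 2)    # user 10 follows A (1) and B (2)
follow(11, 1)                   # user 11 follows only A
print(reply_audience(1, 2))     # -> {10}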
Challenges:
- Consistency in the presence of failures
- Write operations: idempotent (retry until success)
- Last-write wins for edges
- Commutative strategies for mass writes
- Low latency - ms time scale
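A sketch of idempotent, last-write-wins edge writes (states and timestamps are made up):

# Idempotent edge writes: replaying the same write (a retry) leaves the edge
# unchanged, and when two writes race, the newer timestamp wins.
edges = {}   # (source_id, destination_id) -> {"state": ..., "updated_at": ...}

def write_edge(source_id, destination_id, state, updated_at):
    key = (source_id, destination_id)
    current = edges.get(key)
    if current is None or updated_at >= current["updated_at"]:
        edges[key] = {"state": state, "updated_at": updated_at}   # last write wins

write_edge(10, 1, "follows", updated_at=100)
write_edge(10, 1, "follows", updated_at=100)    # retry after a failure: no harm done
write_edge(10, 1, "removed", updated_at=90)     # stale write arrives late: ignored
write_edge(10, 1, "blocked", updated_at=120)    # newest write wins
print(edges[(10, 1)]["state"])                  # -> "blocked"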
Principles:
- Impossible to precompute set algebra queries
- Simple distributed coordination techniques
- Partition, replicate, and index
- Many scalability problems are solved efficiently this way: partition, replicate, index
Search Indices
Results of searching for, e.g., "mountain dew cheetos"
Search index:
- Find all tweets with these words in it
- Posting list
- Boolean
- Queries
- Complex queries
- Ad hoc queries
- Relevancy is recency (this ignores the non-real-time component to search...)
Searching for, e.g., mountain dew cheetos is the intersection of three posting lists
- Original implementation: single table, vertically scaled
- One column: term ID
- Another column: document ID
- Master-worker replication for read throughput
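A toy posting-list version of that term ID / document ID table and the intersection query (values made up):

# One entry per (term, doc_id); a multi-term query intersects the posting lists.
from collections import defaultdict

postings = defaultdict(set)    # term -> set of doc IDs (tweet IDs)

def index_tweet(doc_id, terms):
    for term in terms:
        postings[term].add(doc_id)

def search(terms):
    # "mountain dew cheetos" = intersection of three posting lists
    lists = [postings[t] for t in terms]
    return set.intersection(*lists) if lists else set()

index_tweet(1, ["mountain", "dew", "cheetos"])
index_tweet(2, ["mountain", "dew"])
print(search(["mountain", "dew", "cheetos"]))   # -> {1}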
Problems: index cannot be kept in memory
Current implementation:
- Partitioned by TIME first
- Then partitioned by term ID...
- Use delayed key-write
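A sketch of the time-first lookup (bucket layout and contents are made up):

# Index partitioned by time bucket first (newest first), with term postings
# inside each bucket; scan newer buckets first and stop once enough hits are
# found.
TIME_BUCKETS = [
    {"bucket": "2010-04", "postings": {"dew": [105, 104]}},
    {"bucket": "2010-03", "postings": {"dew": [101], "cheetos": [101]}},
]

def search(term, limit):
    hits = []
    for bucket in TIME_BUCKETS:                   # rare terms may touch many buckets
        hits.extend(bucket["postings"].get(term, []))
        if len(hits) >= limit:                    # relevancy is recency: stop early
            break
    return hits[:limit]

print(search("dew", 3))       # -> [105, 104, 101]
print(search("cheetos", 3))   # rare term: had to scan past the newest bucket -> [101]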
What problems does the solution create:
- Write throughput issues
- Queries for rare terms have to search MANY partitions
- Space efficiency and recall
- MySQL requires lots of memory