Data Engineering Examples: Twitter

Real-time data:

  • online queries for a web request
  • offline computations with very low latency
  • latency and throughput equally important
  • Hadoop is too high-latency!

Four data problems at Twitter:

  • Tweets
  • Timelines
  • Social graphs
  • Search indices

Tweets

Definitions of data:

  • A tweet is a primary key (tweet ID), a user ID, text, and a timestamp (plus replies)
  • Row storage
  • Initially: single table vertically scaled
  • Initially: master-worker replication (writes to master, replication to workers)
  • Initially: Memcached for reads (Rails reads the real database, then populates Memcached instances)

Issues:

  • Scaling disk space: disk arrays > 800 GB are problematic
  • At 3 trillion tweets, disk space was 90% utilized

Solutions:

  • Partition: partition by primary key (one cluster holds one set of keys, another holds a different set)
  • Partition: tweet IDs partitioned by user
  • Partition: tweets by time
  • Try each partition in order, until enough data is accumulated (see the sketch below)
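A minimal sketch of that time-partition lookup, with plain Python dicts standing in for the partitioned tweet stores and hypothetical names throughout; illustrative only, not Twitter's actual code:

  # Walk temporal partitions newest-first and stop once enough tweets are found.
  def recent_tweets_for_user(user_id, partitions, limit=20):
      """partitions: list of dicts (newest first) mapping user_id -> tweet IDs."""
      results = []
      for partition in partitions:
          results.extend(partition.get(user_id, []))
          if len(results) >= limit:        # usually only one partition is checked
              break
      return results[:limit]

  # Example: two temporal partitions, newest first
  partitions = [
      {42: ["tweet_1003", "tweet_1002"]},  # this week
      {42: ["tweet_0950"]},                # last week
  ]
  print(recent_tweets_for_user(42, partitions, limit=3))  # -> the 3 most recent tweets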

Locality of databases:

  • Memcached: primary key lookup is 1 ms
  • MySQL: primary key lookup is < 10 ms
  • Exploit this locality by organizing tweets by timestamp (usually only one partition needs to be checked)

Problems:

  • Write throughput
  • Deadlocks in MySQL (under very high tweet volume)
  • Temporal shard creation is a manual process

More solutions:

  • Cassandra (NoSQL, non-relational)
  • primary key partition (tweet ID)
  • but also, secondary key on user ID

Timelines

Definitions:

  • A timeline is a sequence of tweet IDs
  • Query pattern: organized by user ID
  • Operations: append, merge, truncate
  • High-velocity bounded vector
  • Space-based

Primitive approach: SQL query:

  • Use the SQL query shown below
  • The subquery pulls the relevant user IDs from the followers table, and the outer query fetches the most recent tweets from those users
  • BUT: what happens if you have lots of friends, or if the set of source user IDs can't fit in RAM?
  SELECT * FROM tweets
  WHERE user_id IN (SELECT source_id FROM followers WHERE destination_id = ?)
  ORDER BY created_at DESC
  LIMIT 20

Offline vs. online computation:

  • Sequences can be stored in memcached (individual timelines)
  • You pass a status to Fanout
  • Fanout is offline, but has a low-latency SLA
  • Truncate at random intervals, ensuring bounded length
  • What to do on a cache miss?
  • Merge the user timelines (see the sketch below)
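A minimal sketch of that fanout pattern, with a plain Python dict standing in for memcached and hypothetical names and bounds; illustrative only, not Twitter's actual code:

  MAX_TIMELINE_LENGTH = 800     # assumed bound on each cached timeline

  timeline_cache = {}           # follower_id -> list of tweet IDs, newest first

  def fanout(tweet_id, follower_ids):
      """Offline delivery: prepend the new tweet ID to each follower's cached timeline."""
      for follower_id in follower_ids:
          timeline = timeline_cache.setdefault(follower_id, [])
          timeline.insert(0, tweet_id)
          if len(timeline) > MAX_TIMELINE_LENGTH:   # truncate to keep the vector bounded
              del timeline[MAX_TIMELINE_LENGTH:]

  def read_timeline(user_id, followed_timelines):
      """Online read: serve from cache; on a miss, rebuild by merging timelines."""
      if user_id not in timeline_cache:
          merged = sorted((t for tl in followed_timelines for t in tl), reverse=True)
          timeline_cache[user_id] = merged[:MAX_TIMELINE_LENGTH]
      return timeline_cache[user_id]

  # Example: tweet 1001 is delivered to followers 1 and 2
  fanout(1001, follower_ids=[1, 2])
  print(read_timeline(1, []))   # -> [1001]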

Stats:

  • 2008: 30 TPS, 120 TPS peak
  • 2010: 700 TPS, 2,000 TPS peak
  • 1.2M deliveries per second

Memory hierarchy:

  • Possibilities:
  • Fanout to disk: lots of IOPS required, even with buffering
  • Fanout to memory: cost of rebuilding data from other data stores is reasonable

Principles:

  • Offline vs online computation
  • Some problems can be pre-computed (if the amount of work is bounded and the query pattern is limited)
  • Must keep memory hierarchy in mind
  • Efficiency of the system includes the cost of regenerating data from another source times the probability of needing to (see the sketch below)
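A tiny sketch of that expected-cost reasoning, using made-up illustrative numbers (not Twitter's figures):

  rebuild_cost_ms = 50.0      # assumed: time to regenerate a timeline from other stores
  miss_probability = 0.01     # assumed: fraction of reads that miss the cache

  # Expected per-read cost of not persisting the data anywhere else:
  expected_rebuild_cost_ms = rebuild_cost_ms * miss_probability
  print(expected_rebuild_cost_ms)   # -> 0.5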

Social Graphs

Who follows you? Who do you follow? Who have you blocked, etc.

Operations:

  • Enumerate by time
  • Set operations: intersection, difference, union
  • Inclusion, cardinality

Spam problems:

  • Need mass delete ability

Temporal enumeration:

  • Who you followed, listed by when you followed them
  • Inclusion: do they follow you too?
  • Cardinality: How many people do they follow? How many people are following them?

If person A tweets at Person B:

  • Want to deliver tweet to people who follow both person A and person B
  • Original implementation: a single table with source_id and destination_id columns, each holding the user ID of the edge's source/destination
  • Single table, master-worker replication

What problems did this cause?

  • Write throughput problem
  • Inputs could not be kept in RAM

Solution:

  • Partition by User ID
  • Edges stored in BOTH forward AND backward direction (same edge stored twice)
  • Indexed by time
  • Indexed by element (for doing set algebra)
  • Partitioned by user: within a user's partition, the source_id of the forward copy and the destination_id of the backward copy are identical, namely that user's ID (see the sketch below)
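A minimal sketch of that forward/backward edge store, with hypothetical names and plain Python dicts in place of real partitions; it shows temporal enumeration, inclusion, and the set intersection used for @-reply delivery, and is illustrative only:

  from collections import defaultdict

  # Hypothetical in-memory stand-in for the partitioned edge store:
  # forward[user] = who that user follows; backward[user] = who follows that user.
  forward = defaultdict(dict)    # user_id -> {followed_id: follow_time}
  backward = defaultdict(dict)   # user_id -> {follower_id: follow_time}

  def follow(source_id, destination_id, when):
      """Store the edge in BOTH directions (same edge stored twice)."""
      forward[source_id][destination_id] = when
      backward[destination_id][source_id] = when

  def followers_by_time(user_id):
      """Temporal enumeration: who follows user_id, newest first."""
      return sorted(backward[user_id], key=backward[user_id].get, reverse=True)

  def follows(source_id, destination_id):
      """Inclusion test; cardinality is just len(forward[source_id])."""
      return destination_id in forward[source_id]

  def reply_audience(a, b):
      """If A tweets at B, deliver to users who follow BOTH A and B (intersection)."""
      return set(backward[a]) & set(backward[b])

  # Example: users 1, 2, 3 follow user 200; users 1, 2 also follow user 100
  follow(1, 100, when=1); follow(2, 100, when=2)
  follow(1, 200, when=3); follow(2, 200, when=4); follow(3, 200, when=5)
  print(reply_audience(100, 200))   # -> {1, 2}
  print(followers_by_time(200))     # -> [3, 2, 1]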

Challenges:

  • Consistency in the presence of failures
  • Write operations: idempotent (retry until success)
  • Last-write wins for edges (see the sketch below)
  • Commutative strategies for mass writes
  • Low latency - ms time scale
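A short sketch of idempotent, last-write-wins edge writes (hypothetical structure, not Twitter's actual implementation): each edge keeps the timestamp of the write that produced its current state, so a retried or out-of-order write can be applied any number of times safely.

  edges = {}   # (source_id, destination_id) -> (state, write_time)

  def apply_edge_write(source_id, destination_id, state, write_time):
      """state is 'follow' or 'unfollow'; later timestamps win, retries are no-ops."""
      key = (source_id, destination_id)
      current = edges.get(key)
      if current is None or write_time >= current[1]:
          edges[key] = (state, write_time)

  # A delayed 'follow' from t=1 cannot clobber the newer 'unfollow' from t=2,
  # and replaying a stale write leaves the store unchanged.
  apply_edge_write(1, 2, "follow", 1)
  apply_edge_write(1, 2, "unfollow", 2)
  apply_edge_write(1, 2, "follow", 1)     # retried stale write: ignored
  print(edges[(1, 2)])                    # -> ('unfollow', 2)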

Principles:

  • Impossible to precompute set algebra queries
  • Simple distributed coordination techniques
  • Partition, replicate, and index
  • Many scalability problems are solved efficiently this way: partition, replicate, index

Search Indices

Results of searching for, e.g., "mountain dew cheetos"

Search index:

  • Find all tweets with these words in it
  • Posting list
  • Boolean queries
  • Complex queries
  • Ad hoc queries
  • Relevancy is recency (this ignores the non-real-time component to search...)

Searching for, e.g., "mountain dew cheetos" is the intersection of three posting lists (see the sketch after the list below)

  • Original implementation: single table, vertically scaled
  • One column: term ID
  • Another column: document ID
  • Master-worker replication for read throughput
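A minimal sketch of a posting-list index and the intersection query above, using hypothetical data (plain Python sets of tweet IDs per term) rather than the real term_id/document_id table:

  posting_lists = {
      "mountain": {101, 102, 105, 110},
      "dew":      {102, 105, 110, 240},
      "cheetos":  {105, 110, 300},
  }

  def search(terms):
      """Intersect the posting lists for every term; relevance is recency,
      so return matching tweet IDs newest (highest ID) first."""
      lists = [posting_lists.get(term, set()) for term in terms]
      if not lists:
          return []
      matches = set.intersection(*lists)
      return sorted(matches, reverse=True)

  print(search(["mountain", "dew", "cheetos"]))   # -> [110, 105]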

Problems: index cannot be kept in memory

Current implementation:

  • Partitioned by TIME first
  • Then partitioned by term ID...
  • Use delayed key-write

What problems does this solution create?

  • Write throughput issues
  • Queries for rare terms have to search MANY partitions
  • Space efficiency and recall
  • MySQL requires lots of memory


