Dwight's blog
just for fun: a single server twitter design

Implementing a twitter-style system is an interesting problem.  One thing that makes it hard is some people have a lot of followers - say, 1 million.

One approach is outlined below, which feels to me a good mix in terms of computation time and space.  It’s space efficient - we don’t store tweets (or even tweet ids) more than once.  We assume the entire system is running on one machine and in RAM.  Of course, that won’t work in the real world, but one step at a time here — the first question I was asking myself for fun is “what is a good way to solve this problem even on a single machine?”

So here’s the approach:

First, we store each user, with their tweets, who they follow, and their followers.  Simple enough:

( (name ‘jane)
  (follows (‘alvin ‘sue ‘bob))
  (tweets …)
)

( (name ‘bob)
  (tweets
    ((ts “01Mar10 1313”) (txt “grabbing some dinner”)))
  (followers (‘joe ‘jane …))
)

On a new post by some user x, (1) add new tweet to x’s tweets list (at the head), (2) for each follower y of x, move x to the beginning of y’s follows list.  For example on a new tweet from bob:

( (name ‘jane)
  (follows (‘bob ‘alvin ‘sue))
  (tweets …)
)

( (name ‘bob)
  (tweets  
    ((ts “01Mar10 131959”) (txt “sleepy”))
    ((ts “01Mar10 131803”) (txt “grabbing some dinner”)))
  (followers (‘joe ‘jane …))
)

Now, to show a current news feed for jane, we do the following.  We start at the beginning of jane’s follows list.  We dequeue a tweet from the first person listed: bob.  We know bob has the freshest tweet for jane as the follows list is ordered.  We go get that tweet, grab another for bob, and also the most recent tweet for alvin — who is the next freshest tweeter.  We work down the follows lists, and each person we follow’s tweets, in a merge operation, pulling off the most frequent tweets.
Working in this manner, we don’t have to touch the objects of most of the people jane follows, and we don’t have to touch many tweets either: to get a news feed of N items, we will touch less than 2N tweets.  Posting a tweet is more expensive, where we have to actually touch the follows list on every one of our followers.  However, at least it doesn’t get any bigger, we simply reorder it.


(Note I’ve left out some details here, for example one would probably want a fancier data structure to make move_to_head_of_follows_list() fast.  And assuming a userid->object map exists.)

  1. dmerr posted this
blog comments powered by Disqus