Monday, November 05, 2012

Twitter API: Y U No respond fast & why

There is a very interesting post about an application developer who ran into issues with the Twitter API's 5-second timeout. I would encourage everyone to read it if you haven't already.

Based on what I read (and some assumptions), I am going to try to comment on what the problem could be and suggest potential solutions. I would be very interested in criticisms of this analysis, different ways of looking at the problem, and other possible solutions.

[A] Assumption: Twitter maintains an Inbox for every user, and every time someone I am following tweets, that tweet is copied into my Inbox. This means that if 1M people follow me, my tweet is copied into the Inboxes of all 1M of them. Let's call this number the fanout of the tweet.
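
To make the assumption concrete, here is a minimal sketch of fan-out-on-write using in-memory dictionaries. All the names here (followers_of, inbox, post_tweet) are made up for illustration; this is obviously not Twitter's actual implementation.

from collections import defaultdict

followers_of = defaultdict(set)   # user_id -> set of follower user_ids
inbox = defaultdict(list)         # user_id -> list of tweet_ids (the "Inbox")

def post_tweet(author_id, tweet_id):
    """Copy a new tweet into the Inbox of every follower.

    The cost of this write is proportional to the fanout: an author
    with 1M followers means 1M copies of the tweet ID.
    """
    for follower_id in followers_of[author_id]:
        inbox[follower_id].append(tweet_id)

# Example: user 1 is followed by users 2 and 3.
followers_of[1] = {2, 3}
post_tweet(1, tweet_id=42)
assert inbox[2] == [42] and inbox[3] == [42]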

Every time Twitter needs to copy a tweet I make into the Inboxes of my followers, it has to fetch the list of people following me. It does this by querying the database for all users who have me as one of the people they follow. MySQL (and other databases) do not let you perform efficient offset-based queries; every time you make one, you need to either:
  1. Use cursors (which tie up server resources for the duration of the complete request), or
  2. Perform a complete scan of all records up to the requested offset.

Whichever way you look at it, this is fairly wasteful of resources, when in fact better solutions exist, as mentioned in the post above.
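
To make the difference concrete, here is a sketch of the two query shapes. The schema (a followers table with followee_id and follower_id columns) is made up for illustration; the point is that an OFFSET query has to walk past every skipped row, while keyset ("seek") pagination uses the index to jump straight to the next page.

# Hypothetical schema for illustration: followers(followee_id, follower_id).

# Offset-based paging: the database must scan past the first 1,000,000
# matching rows on every request just to throw them away.
OFFSET_QUERY = """
SELECT follower_id
FROM followers
WHERE followee_id = ?
ORDER BY follower_id
LIMIT 5000 OFFSET 1000000;
"""

# Keyset ("seek") paging: remember the last follower_id returned on the
# previous page and seek directly to it via the index -- roughly O(log n)
# per page instead of O(offset).
KEYSET_QUERY = """
SELECT follower_id
FROM followers
WHERE followee_id = ? AND follower_id > ?  -- second ? = last id seen
ORDER BY follower_id
LIMIT 5000;
"""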

[B] Assumption: When you make an API request to Twitter to fetch the list of followers for a particular user, I am guessing that Twitter first looks up a cache to locate the list, and if it can't find it there, it falls back to querying the DB for the list of followers.

The reason the API returns only a list of follower IDs (instead of complete user information) is probably to avoid point queries, which are notoriously inefficient. Of course, network bandwidth could be another reason.
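
If assumption [B] holds, the lookup path is a classic cache-aside pattern. The sketch below is purely illustrative; the function and key names are made up, and query_db_for_follower_ids stands in for the expensive query described above.

cache = {}

def query_db_for_follower_ids(user_id):
    # Placeholder for the expensive DB query described in [A]/[B].
    return []

def get_follower_ids(user_id):
    key = ("followers", user_id)
    ids = cache.get(key)
    if ids is None:
        # Cache miss: this is the slow path that can push the API
        # request past the 5-second timeout.
        ids = query_db_for_follower_ids(user_id)
        cache[key] = ids
    return ids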

[C] Now that we have established the parameters and behaviour of the system we are dealing with, we can try to think of a solution.

  1. Ensure that the list of followers we want to query is already in the cache, so that we don't hit the DB for it. This can be done by making a web request to the API without your credentials and then making the same request again with your credentials, so that you aren't charged for the request that loads the data into the cache. This will not always work, and Twitter might ban your IP.

  2. Binary search over the size of the result set that Twitter can reliably return for a user (sketched in code after this list). Start with 1 follower and double on every request: 2, 4, 8, 16, and so on, until Twitter times out. Then perform a binary search between the size that succeeded (say 128) and the size that failed (say 256). In the doubling phase you will have fetched 1 + 2 + ... + 128 = 255 follower IDs in 8 successful requests, i.e. about 32 follower IDs/request, before you hit the request that timed out. Suppose the upper limit really is 128: the subsequent ~8 requests (the failed doubling request plus the binary-search probes) return nothing, but you still average about 16 follower IDs/request overall. Store this limit, and use it as the starting point the next time you perform requests for this user.

  3. Randomized solution (also sketched below): if you get 2 successive failures [timeouts] for a user, give up and move on to another randomly chosen user. With high probability, you will be able to query the followers of O(log n) users if you try O(c log n) distinct random users, because the probability of hitting consecutive failures across many distinct users shrinks rapidly (failures for distinct users are roughly independent).
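
Here is a rough sketch of solutions 2 and 3, assuming a fetch_follower_ids(user, count) call that either returns up to count follower IDs or raises a Timeout. That function is a stand-in for a real API client, not actual Twitter code; the fake implementation below simply pretends that batches larger than 128 time out.

import random

class Timeout(Exception):
    pass

def fetch_follower_ids(user, count):
    """Stand-in for 'ask Twitter for up to `count` follower IDs of `user`'.

    This fake version pretends that any batch larger than 128 times out."""
    if count > 128:
        raise Timeout()
    return list(range(count))  # fake IDs

def probe_batch_size(user, start=1, cap=5000):
    """Solution 2: double the batch size until a timeout, then binary-search
    between the last size that succeeded and the first size that failed."""
    size, last_good = start, 0
    while size <= cap:
        try:
            fetch_follower_ids(user, size)
            last_good = size
            size *= 2
        except Timeout:
            break
    else:
        return last_good              # never timed out below the cap
    lo, hi = last_good, size          # lo succeeded, hi timed out
    while hi - lo > 1:
        mid = (lo + hi) // 2
        try:
            fetch_follower_ids(user, mid)
            lo = mid
        except Timeout:
            hi = mid
    return lo                         # largest size that reliably succeeded

def scan_users(users, batch=100, max_failures=2):
    """Solution 3: give up on a user after two consecutive timeouts and
    move on to another randomly chosen user."""
    results = {}
    for user in random.sample(users, k=len(users)):  # random order
        failures = 0
        while failures < max_failures:
            try:
                results[user] = fetch_follower_ids(user, batch)
                break
            except Timeout:
                failures += 1
    return results

print(probe_batch_size("some_user"))      # -> 128 with the fake fetcher above
print(len(scan_users(["a", "b", "c"])))   # -> 3

The value returned by probe_batch_size is the per-user number suggested in point 2: store it, and start from it on the next run for that user.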

Do let me know if you spot something wrong or have other solutions.

6 comments:

Anonymous said...

Y U No write title in coherent English?

Dhruv Matani said...

:)

Roman said...

Hey, thanks for the in-depth post. I am currently using solution 3; it helps a bit to jump around randomly among the IDs I'm requesting, although it thins out my local database somewhat. But as I said, the main problems are their faulty database, the timeout settings, and their API rate limits, plus being charged for a call when I get no result.

Dhruv Matani said...

@Roman Thanks for sharing your problem with everyone so clearly. It's quite an interesting one, some form of which is faced by almost everyone who uses APIs extensively.

For now, there's nothing you can do about the latency on their end. The best you can do is ensure that you get the most bang for your buck. That said, it might be possible to use a combination of solutions 3 & 1 (use the public API call with the same time limits, and you should probably be good). This will ensure that if you time out on the public API call (w/o your API key), you can at least back off and not make the call that would count toward your quota.

Yazhini said...

About solution 1, I think unauthenticated calls will not be allowed from March 2013.

"Additionally, all 1.1 endpoints require authentication, so no longer will there be a concept of unauthenticated calls and rate limits." - https://dev.twitter.com/docs/rate-limiting/1.1

Dhruv Matani said...

@Yazhini Fair point about un-auth calls. However, how will Twitter itself display pages to users who aren't logged in? I mean, assume I am not logged in and I want to see the list of your followers.

Twitter seems to be getting dev-unfriendly by the day :-(

[a] They don't "yet" have a push API
[b] They limit result sets to 5k, which imho is puny
[c] No un-auth calls

All in all a pretty poor developer experience if you ask me.