Denormalization, Processes

If you read the news, you’ll know that tuneups are happening behind the scenes of BeerRiot. If you came to this blog after reading that story, you’re wondering what, exactly, they are.

If I’m not feeling particularly communication-challenged, I’ll be able to explain them to you. ;)

The first tuneup is one every webmaster has heard of: denormalization. I had been using a view to select data from three tables with one call. The performance drag of that query was serious enough, though, that I’ve decided to complicate things a bit and copy the extra bits of data I need from the other tables into the main one for the query.

The speed gain is great, and, somewhat strangely, the denormalization actually cleaned up a bunch of my code. ErlyDB lacks a “one-to-one” relation, so it was impossible for me to say “each record in this view is really just a record in this other table with some extra data.” That made for a bit of hackery swinging from one type to another. Without that extra table, I think the code reads more clearly.

(Disclaimer: I’m far from being a relational database master, so it’s likely that there is a much better way to express everything I’m doing. But, I’m happy to be making what seems to be forward progress.)

The other main change is more Erlang-centric. Until now, I had been tracking sessions using a customization of the Yaws recommended session server. This is basically a central process that stores opaque data associated with an ID string. Whenever your app gets a request, it pulls the cookie value out and checks with this central process to find out whether there is any opaque data associated with that key. It works (quite well, in fact), but it seems like a bit of a bottleneck.

So, I’ve decided that there’s a more Erlangy way to do things. What BeerRiot is doing now is starting up a new process for each session and saving that process ID (PID) in a client cookie. Then, whenever a request comes in with a PID cookie, we can try to contact that session’s handling process directly. No central service required.
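A minimal sketch of the idea in Erlang (the module, function, and message names here are my own invention for illustration, not BeerRiot’s actual code):

```erlang
-module(session_sketch).
-export([start/0, fetch/1, loop/1]).

%% Spawn a session process and return a cookie-safe string form of its PID.
start() ->
    Pid = spawn(?MODULE, loop, [[]]),
    pid_to_list(Pid).                 %% e.g. "<0.84.0>" goes into the cookie

%% On each request, turn the cookie string back into a PID and ask it directly.
fetch(CookieValue) ->
    Pid = list_to_pid(CookieValue),
    Pid ! {get, self()},
    receive
        {session_data, Data} -> {ok, Data}
    after 1000 ->
        {error, no_session}
    end.

%% The session process itself: a loop holding this visitor's state.
loop(Data) ->
    receive
        {get, From} ->
            From ! {session_data, Data},
            loop(Data);
        {put, Key, Value} ->
            loop([{Key, Value} | Data])
    end.
```

Note that `list_to_pid/1` only resolves a PID within the same node, and that handing raw PIDs to clients has security implications, which the comments on this post dig into.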

It turns out that there are loads of benefits to having this session process hanging around, beyond relieving the central-service bottleneck. It can cache data smartly (e.g., listening for updates). It’s a natural place to run background tasks (like propagating live changes to durable storage). I see other potential uses, but since I haven’t tested them yet, I’ll hold my tongue to avoid getting anyone’s hopes up too high. ;)

For Facebook developers: This process-session system wasn’t possible until just a few weeks ago, when Facebook started supporting cookies on the canvas page. Unfortunately, they only support them for canvas requests, and not for their “mock ajax.” For mock ajax, I’ve decided to just encode the cookie values in post parameters. It works (and it’s no more inconsistent than the rest of the Facebook Developer experience).

Update 2.Jan 18:52 EDT: If you spent any part of today poking at BeerRiot to see how the speed-ups turned out, you were probably rather dissatisfied. I just figured out that I didn’t fully roll out the update. :P It’s there now, and I think you’ll be much more impressed.


6 comments so far

  1. vincent on

    Having a process per session seems like a good idea; however, wouldn’t storing a PID in a cookie be a security risk?

  2. Bryan on

    It does seem dangerous to let out PIDs, doesn’t it? Especially since it looks like PID generation is pretty predictable.

    For now, I’m also generating a unique, several-byte key for each new PID, so an attacker will have to know both the PID and the key to be able to talk to that process.

    My other option is likely to encrypt the cookie before sending it to the client, and decrypt it when it comes back. Having a human-readable cookie was invaluable during development, but now that the groundwork is in place, I could probably switch any time.
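The PID-plus-key check Bryan describes can be sketched by letting the session process pattern-match on its own key, so messages carrying the wrong key never reach the reply clause (names and key size here are illustrative, not the real code):

```erlang
-module(keyed_session).
-export([start/0, fetch/2]).

%% Spawn a session whose loop only answers callers presenting the right key.
start() ->
    Key = rand:uniform(16#FFFFFFFFFFFF),     %% several-byte random key (sketch)
    Pid = spawn(fun() -> loop(Key, []) end),
    {pid_to_list(Pid), Key}.                 %% both parts go into the cookie

fetch(PidString, Key) ->
    Pid = list_to_pid(PidString),
    Pid ! {get, Key, self()},
    receive
        {session_data, Data} -> {ok, Data}
    after 1000 ->
        {error, denied}
    end.

loop(Key, Data) ->
    receive
        {get, Key, From} ->                  %% matches only when Key is correct
            From ! {session_data, Data},
            loop(Key, Data);
        {get, _BadKey, _From} ->
            loop(Key, Data)                  %% wrong key: ignore the request
    end.
```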

  3. thomas lackner on

    What I usually do is set the cookie session = key:md5(key + secret). The secret is a constant stored somewhere in my code. When I read the cookie string back, if the key doesn’t match the md5 value of the key plus the secret, I know something went wrong. Pretty easy to debug, too.
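Thomas’s scheme translates directly to Erlang’s built-in `erlang:md5/1`. A sketch (the secret here is a placeholder; note also that plain MD5 is dated for this job, and an HMAC would be the stronger modern choice):

```erlang
-module(signed_cookie).
-export([sign/1, verify/1]).

%% Placeholder secret; in the scheme described above it lives in the code.
-define(SECRET, <<"some-constant-secret">>).

%% Build "Key:Hex(md5(Key ++ Secret))" for the cookie value.
sign(Key) ->
    Key ++ ":" ++ hex(erlang:md5([Key, ?SECRET])).

%% Split the cookie on ":" and recompute the digest; a mismatch means tampering.
verify(Cookie) ->
    case string:tokens(Cookie, ":") of
        [Key, Mac] ->
            case hex(erlang:md5([Key, ?SECRET])) of
                Mac -> {ok, Key};
                _   -> error
            end;
        _ ->
            error
    end.

%% Render a binary digest as a lowercase hex string.
hex(Bin) ->
    lists:flatten([io_lib:format("~2.16.0b", [B]) || B <- binary_to_list(Bin)]).
```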

  4. Yariv on

    I use Mnesia to store session data. Mnesia’s main benefits are support for multiple indexes (allowing you to quickly find session data by, e.g., username and/or session key), easy distribution/replication (not that I’m using it now), and the ability to use disc-based storage (I’m not using that either :) ).

    I like the idea of using a process per user because the process could just garbage-collect itself if it doesn’t receive a message in X seconds, which in a simple app helps you avoid periodically scanning your session table for expired sessions.

    The right solution depends on the application I guess.
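The self-expiring behavior Yariv mentions falls out of Erlang’s `receive ... after` construct; when the timeout fires, the loop simply returns and the process dies, so no sweeper job ever scans for stale sessions. A sketch (timeout value is illustrative):

```erlang
-module(expiring_session).
-export([start/1, loop/2]).

%% Spawn a session that exits after TimeoutMs milliseconds of silence.
start(TimeoutMs) ->
    spawn(?MODULE, loop, [TimeoutMs, []]).

loop(TimeoutMs, Data) ->
    receive
        {get, From} ->
            From ! {session_data, Data},
            loop(TimeoutMs, Data);
        {put, Key, Value} ->
            loop(TimeoutMs, [{Key, Value} | Data])
    after TimeoutMs ->
        ok                  %% loop ends; the process (and the session) is gone
    end.
```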

  5. Yariv on

    Also, I try to denormalize the Vimagi schema as much as possible. Denormalization simplifies some parts of the code (querying) and gives nice performance improvements, but it sometimes makes updates more complex because you have to update multiple tables with the same data. I think in general it’s a good tradeoff to simplify the reads at the expense of writes because most queries are SELECT queries.

  6. Bryan on

    Yeah, I thought about going the Mnesia route for sessions. Something about the automation available in a dedicated process seemed really intriguing, though. We’ll see whether or not I made the right decision as development progresses.

