---
title: Yuletide Logs and MongoDB Capped Collections
teaser:
tags: web
author: Nick Quaranto
published_on: 2010-12-21
---

This Christmas, maybe you're thinking of the long trip back home through snowy
highways and bustling airports, spending time with the family next to a warm
fireplace, or perhaps being serenaded by [The Big
Maestro](http://www.universalhub.com/2010/we-look-forward-shaq-throwing-relief-sox-april)
himself. Sadly, I didn't get to see Shaq conduct the Pops. Instead, I'm thinking
of how much [Capped
Collections](http://www.mongodb.org/display/DOCS/Capped+Collections) in MongoDB
have made my job a bit easier.

### Warmup

Here's the problem: Lots of data coming in extremely frequently, on the order of
several thousand discrete chunks of data every 5 seconds. Since it's coming in
too fast for anyone to comprehend, we really just want to show the last 15
chunks to the user. This lets the user know the service is accepting data, and
it's kind of neat to see data updating that fast in real time. My first
implementation? ActiveRecord, of course.

    create_table "raws" do |t|
      t.text     "line"
      t.datetime "ran_at"
    end

Seems simple enough. Since the data had to be parsed I tossed it into a
background job, so it looked something like:

    class Parser
      attr_accessor :log
      def perform
        lines = ActiveSupport::Gzip.decompress(log).split("\n")
        lines.each do |raw_line|
          create_raw(DateTime.now, raw_line.strip)
        end
      end

      def create_raw(ran_at, line)
        Raw.create :ran_at => ran_at, :line => line
      end
    end

Yes, I was pretty stupid and my first implementation just gzipped the data and
shot it to the server. We're now using a proper <abbr title="JavaScript Object
Notation">JSON</abbr> <abbr title="Application Programming Interface">API</abbr>
hooked up with [yajl-ruby](https://github.com/brianmario/yajl-ruby), but the
real problem here had two steps:

1. Parse input
1. Create row for each chunk of data

Moving to JSON/YAJL made parsing faster, but what about writing the data? I
thought this seemed great until I left it running overnight, and big surprise:
there were over a million rows. Ok, that's fine, we don't need to keep it all
anyway...

    def perform
      Raw.delete_all(["ran_at < ?", 1.minute.ago])
      # parsing/inserting code here
    end

That still wasn't enough once I increased traffic to that endpoint. Before I
refactored how this endpoint accepted data, the table held around 50,000 rows
at any given time (and that's just the last minute of data), and we had handed
out roughly 140 million primary keys. This was simply storing too much input,
and given that we only ever needed to show the last few pieces of it, there had
to be a better way to model the data.

### Flurries

I had to sit back and consider my options here. I couldn't figure out a way to
implement this kind of write-heavy behavior in <abbr title="Structured Query
Language">SQL</abbr>, even after poring over the PostgreSQL docs and mailing
lists.

I still consider most NoSQL solutions to be awesome [utility
belts](http://nosql.mypopescu.com/post/836086276/presentation-redis-persistence-power-or-redis-use)
to be used alongside relational data stores. I turned to Redis first. This
could have used a simple Redis list like so:

    # in config/initializers/redis.rb
    $redis = Redis.new

    # when writing data
    $redis.lpush("latest-data", JSON.dump(line))
    $redis.ltrim("latest-data", 0, 14) # stop index is inclusive: keeps 15 items

    # when reading data
    raws = $redis.lrange("latest-data", 0, -1)
    raws.map { |raw| JSON.parse(raw) }

This definitely would have been faster, since Redis is all in memory, and the
[LTRIM](http://code.google.com/p/redis/wiki/LtrimCommand) command cuts the list
down to a fixed range of indexes, so it never grows past the length we choose.
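
LTRIM's stop index is inclusive, which is easy to get wrong by one. Here's a
rough sketch of the push-then-trim pattern, modeled on a plain Ruby array so it
runs without a Redis server. `BoundedList` is a made-up helper for illustration
and is not part of the redis gem.

```ruby
# Pure-Ruby model of a Redis list that is trimmed after every push.
# BoundedList is a hypothetical helper for illustration only.
class BoundedList
  def initialize(limit)
    @limit = limit
    @items = []
  end

  # Like LPUSH: add to the head of the list, then trim.
  def lpush(value)
    @items.unshift(value)
    ltrim(0, @limit - 1) # stop index is inclusive, hence limit - 1
  end

  # Like LTRIM: keep only the elements from start to stop, inclusive.
  def ltrim(start, stop)
    @items = @items[start..stop] || []
  end

  # Like LRANGE 0 -1: return everything, newest first.
  def to_a
    @items.dup
  end
end

list = BoundedList.new(15)
(1..20).each { |n| list.lpush("line #{n}") }

puts list.to_a.size   # 15
puts list.to_a.first  # line 20
```

The oldest five lines fall off the tail, which is the behavior we'd have to
remember to enforce on every write with the real thing.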

The reads are kind of awkward though, since we're storing more than one tidbit
of data (when the data was parsed, and the data itself) and we plan on adding
more later. Redis list elements are plain strings, so every record has to be
serialized on the way in and parsed on the way out. YAJL would make short work
of that, but when we also have to maintain the size of the list ourselves, it
seems like there could be a more natural, built-in way of modeling this problem.
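
To make that awkwardness concrete, here's a small sketch of the round trip for
a record with a timestamp, using only Ruby's standard `json` and `time`
libraries. The field names and values are invented for the example.

```ruby
require 'json'
require 'time'

# Hypothetical record with the two fields described above.
record = { 'at' => Time.utc(2010, 12, 21, 12, 0, 0), 'line' => 'GET /status' }

# Writing: the Time has to be flattened to a string before it goes in a list.
flat    = { 'at' => record['at'].iso8601, 'line' => record['line'] }
payload = JSON.dump(flat)

# Reading: parse the string, then rebuild the Time by hand.
restored = JSON.parse(payload)
restored['at'] = Time.iso8601(restored['at'])

puts restored == record
```

Every new field with a non-string type would need the same treatment on both
sides.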

### Blizzard

I next looked at MongoDB. I had a great introduction during
[MongoBoston](http://www.10gen.com/conferences/mongoboston2010), and I had heard of
[Capped Collections](http://www.mongodb.org/display/DOCS/Capped+Collections), but
they didn't click until...

> Capped collections are fixed sized collections that have a very high
> performance auto-FIFO age-out feature (age out is based on insertion order).
> \[...\] In addition, **capped collections automatically, with high
> performance, maintain insertion order for the objects in the collection;**
> this is very powerful for certain use cases such as logging.

Booyah! I immediately hooked up a free [MongoHQ](https://mongohq.com/home) plan
and got started. The [Mongo Ruby driver](http://api.mongodb.org/ruby/) is pretty
easy to use, and after reading some tutorials I ended up with a decent solution.

First off, how to connect to MongoHQ on Heroku was a bit irksome to discover, so
here's a sample in case you can't find it documented either. It includes hooking
up the `Rails.logger` and falling back to a per-environment database on your
local machine.

```ruby
# in config/initializers/mongo.rb
if ENV['MONGOHQ_URL']
  uri    = URI.parse(ENV['MONGOHQ_URL'])
  conn   = Mongo::Connection.from_uri(ENV['MONGOHQ_URL'], :logger => Rails.logger)
  $mongo = conn.db(uri.path.gsub(/^\//, ''))
else
  conn   = Mongo::Connection.new(nil, nil, :logger => Rails.logger)
  $mongo = conn.db("app_#{Rails.env.downcase}")
end
```
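
The database name comes from the path of the connection URL. Here's a quick
sketch of that step on its own, using Ruby's standard `uri` library and a
made-up URL (the host, credentials, and database name below are placeholders,
not real MongoHQ values):

```ruby
require 'uri'

# Hypothetical connection string; the real one comes from ENV['MONGOHQ_URL'].
url = 'mongodb://user:secret@mongo.example.com:27017/app_production'
uri = URI.parse(url)

# The path is "/app_production"; strip the leading slash to get the name.
db_name = uri.path.sub(%r{\A/}, '')

puts uri.host  # mongo.example.com
puts uri.port  # 27017
puts db_name   # app_production
```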

And our capped collection write implementation:

```ruby
class Tail
  def self.insert(line)
    collection.insert(
      :at => Time.at(line['at']),
      :command => line['command'])
  end

  def self.collection
    @collection ||= $mongo.create_collection("tail",
      :capped => true, :max => 15)
  end
end
```

Creating a capped collection is pretty simple: you can cap it by the maximum
number of documents or by total size in bytes. `Mongo::DB#create_collection`
won't blow away an existing collection, so it's ok to keep calling it in this
class.
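
If the age-out behavior is hard to picture, here's a tiny in-memory model of a
collection capped by document count. `CappedModel` is a hypothetical class
written just for this post. It only mimics the first-in, first-out behavior and
says nothing about how MongoDB actually stores the data.

```ruby
# In-memory model of a capped collection limited by document count.
# CappedModel is hypothetical and only mirrors the FIFO age-out behavior.
class CappedModel
  def initialize(max)
    @max  = max
    @docs = []
  end

  # Append the document, then drop the oldest ones beyond the cap.
  def insert(doc)
    @docs.push(doc)
    @docs.shift while @docs.size > @max
    doc
  end

  # Documents come back in insertion order, oldest first.
  def find
    @docs.dup
  end
end

tail = CappedModel.new(15)
20.times { |i| tail.insert('command' => "cmd #{i}") }

puts tail.find.size              # 15
puts tail.find.first['command']  # cmd 5
puts tail.find.last['command']   # cmd 19
```

The writer never has to trim anything itself, which is exactly what I was
missing with the Redis list.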

Reading from the capped collection is simple, except that Mongo always includes
an `_id` (a `BSON::ObjectId`) on every document. That's not necessary to show
the user, so I end up filtering it out in Ruby:

```ruby
class Tail
  def self.last
    collection.find.to_a.tap do |tails|
      tails.each do |tail|
        tail.delete('_id')
      end
    end
  end
end
```

In practice, each `Tail` collection is segmented out by users, and MongoHQ even
provides a nice interface to browse what's in your database:

![''](https://img.skitch.com/20101221-n4m6paywf2t1dk4gffict62pty.png)

### Clear Skies

Overall, I think a few MongoDB capped collections served this use case far
better than Postgres or Redis would have. Use the best tool for the job, and
it's even better if someone else runs the tool for you. If you have similar use
cases for bringing in NoSQL DBs alongside relational DBs, let us know in the
comments!
