---
title: Yuletide Logs and MongoDB Capped Collections
teaser:
tags: web
author: Nick Quaranto
published_on: 2010-12-21
---

This Christmas, maybe you're thinking of the long trip back home through snowy highways and bustling airports, spending time with the family next to a warm fireplace, or perhaps being serenaded by [The Big Maestro](http://www.universalhub.com/2010/we-look-forward-shaq-throwing-relief-sox-april) himself. Sadly, I didn't get to see Shaq conduct the Pops; instead I'm thinking of how much [Capped Collections](http://www.mongodb.org/display/DOCS/Capped+Collections) in MongoDB have made my job a bit easier.

### Warmup

Here's the problem: lots of data coming in extremely frequently, on the order of several thousand discrete chunks of data every 5 seconds. Since it's coming in too fast to comprehend, we really just want to show the last 15 chunks to the user. This lets the user know the service is accepting data, and it's kind of neat to see data updating that fast in real time.

My first implementation? ActiveRecord, of course.

    create_table "raws" do |t|
      t.text     "line"
      t.datetime "ran_at"
    end

Seems simple enough. Since the data had to be parsed, I tossed it into a background job, so it looked something like:

    class Parser
      attr_accessor :log

      # Decompress the gzipped payload and store one row per line
      def perform
        lines = ActiveSupport::Gzip.decompress(log).split("\n")
        lines.each do |raw_line|
          create_raw(DateTime.now, raw_line.strip)
        end
      end

      def create_raw(ran_at, line)
        Raw.create :ran_at => ran_at, :line => line
      end
    end

Yes, I was pretty stupid and my first implementation just gzipped the data and shot it to the server. We're now using a proper JSON API hooked up with [yajl-ruby](https://github.com/brianmario/yajl-ruby), but the real problem here had two steps:

1. Parse input
1. Create row for each chunk of data

Moving to JSON/YAJL made parsing faster, but what about writing the data? I thought this seemed great until I left it running overnight, and big surprise: there were over a million rows.
Ok, that's fine, we don't need to keep it all anyway...

    def perform
      # Throw away anything older than a minute before inserting more
      Raw.delete_all(["ran_at < ?", 1.minute.ago])
      # parsing/inserting code here
    end

That still wasn't enough once I increased traffic to that endpoint. Before I refactored how this endpoint accepted data, the table had 50,000 rows in it at a time (and that's just the last minute of data), and we were around the 140 millionth primary key dished out. This was simply storing too much input, and given that we only ever needed to show the last few pieces of it, there had to be a better way to model the data.

### Flurries

I had to sit back and consider my options here. Even after poring over the PostgreSQL docs and mailing lists, I couldn't figure out a way to implement this kind of write-heavy behavior in SQL. I still consider most NoSQL solutions to be awesome [utility belts](http://nosql.mypopescu.com/post/836086276/presentation-redis-persistence-power-or-redis-use) to be used alongside relational data stores. I turned to Redis first. This could have used a simple Redis list like so:

    # in config/initializers/redis.rb
    $redis = Redis.new

    # when writing data
    $redis.lpush("latest-data", JSON.dump(line))
    # keep the 15 newest entries (LTRIM's end index is inclusive)
    $redis.ltrim("latest-data", 0, 14)

    # when reading data
    raws = $redis.lrange("latest-data", 0, -1)
    raws.map { |raw| JSON.parse(raw) }

This definitely would have been faster, since Redis is all in memory, and the [LTRIM](http://code.google.com/p/redis/wiki/LtrimCommand) command keeps the list trimmed to a fixed length. The reads are kind of awkward though, since we're storing more than one tidbit of data (when the data was parsed, and the data itself) and we plan on adding more later. Since Redis list elements are plain strings, every entry has to be serialized on the way in and parsed on the way out. YAJL would make short work of that, but when we also have to maintain the size of the list ourselves, it seems like there could be a more natural and built-in way of modeling this problem.

### Blizzard

I next looked at MongoDB.
I had a great introduction during [MongoBoston](http://www.10gen.com/conferences/mongoboston2010), and I'd heard of [Capped Collections](http://www.mongodb.org/display/DOCS/Capped+Collections), but it didn't click until...

> Capped collections are fixed sized collections that have a very high
> performance auto-FIFO age-out feature (age out is based on insertion order).
> \[...\] In addition, **capped collections automatically, with high
> performance, maintain insertion order for the objects in the collection;**
> this is very powerful for certain use cases such as logging.

Booyah! I immediately hooked up a free [MongoHQ](https://mongohq.com/home) plan and got started. The [Mongo ruby driver](http://api.mongodb.org/ruby/) is pretty easy to use, and after reading some tutorials I ended up with a decent solution.

First off, figuring out how to connect to MongoHQ on Heroku was a bit irksome, so here's a sample in case you can't find it documented properly either. It includes hooking up the `Rails.logger` and using environments properly on your local machine.

```ruby
# in config/initializers/mongo.rb
if ENV['MONGOHQ_URL']
  # On Heroku: the database name is the path portion of the MongoHQ URL
  uri = URI.parse(ENV['MONGOHQ_URL'])
  conn = Mongo::Connection.from_uri(ENV['MONGOHQ_URL'], :logger => Rails.logger)
  $mongo = conn.db(uri.path.gsub(/^\//, ''))
else
  # Locally: default host and port, one database per Rails environment
  $mongo = Mongo::Connection.new(nil, nil, :logger => Rails.logger).db("app_#{Rails.env.downcase}")
end
```

And our capped collection write implementation:

```ruby
class Tail
  def self.insert(line)
    collection.insert(
      :at      => Time.at(line['at']),
      :command => line['command'])
  end

  def self.collection
    # Capped at 15 documents; older ones age out automatically
    @collection ||= $mongo.create_collection("tail", :capped => true, :max => 15)
  end
end
```

Creating a capped collection is pretty simple: you can cap it by the maximum number of documents it can hold (`:max`) or by its total size in bytes (`:size`). `Mongo::DB#create_collection` won't blow away an existing collection, so it's ok to keep calling it in this class.
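
If the age-out behavior is hard to picture, here's a rough sketch of it in plain Ruby. `CappedList` is a made-up stand-in for illustration, not anything from the Mongo driver, and it only models the document-count cap (`:max`), not the byte-size cap. MongoDB does all of this for you on the server, so treat this as a mental model rather than how it's actually implemented.

```ruby
# Hypothetical in-memory model of a capped collection (not MongoDB code).
class CappedList
  def initialize(max)
    @max  = max
    @docs = []
  end

  # Append a document; once the cap is exceeded, evict the oldest ones.
  def insert(doc)
    @docs << doc
    @docs.shift while @docs.size > @max
    doc
  end

  # Documents in insertion order, oldest first.
  def to_a
    @docs.dup
  end
end

tail = CappedList.new(15)
20.times { |i| tail.insert('command' => "line #{i}") }

puts tail.to_a.size              # 15
puts tail.to_a.first['command']  # line 5
puts tail.to_a.last['command']   # line 19
```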

Reading from the capped collection is simple, except that Mongo insists on returning a `BSON::ObjectId` with every document. The user doesn't need to see that, so I end up filtering it out in Ruby:

```ruby
class Tail
  def self.last
    # Fetch every document in the capped collection and strip the Mongo-generated _id
    collection.find.to_a.tap do |tails|
      tails.each do |tail|
        tail.delete('_id')
      end
    end
  end
end
```

In practice, each `Tail` collection is segmented out by user, and MongoHQ even provides a nice interface to browse what's in your database:

![''](https://img.skitch.com/20101221-n4m6paywf2t1dk4gffict62pty.png)

### Clear Skies

Overall, I think a few MongoDB capped collections served this use case far better than Postgres or Redis would have. Use the best tool for the job, and it's even better if someone else runs the tool for you. If you have similar use cases for bringing in NoSQL DBs alongside relational DBs, let us know in the comments!