Yuletide Logs and MongoDB Capped Collections

Nick Quaranto

This Christmas, maybe you’re thinking of the long trip back home through snowy highways and bustling airports, spending time with the family next to a warm fireplace, or perhaps being serenaded by The Big Maestro himself. Sadly, I didn’t get to see Shaq conduct the Pops; instead, I’m thinking about how much capped collections in MongoDB have made my job easier.


Here’s the problem: lots of data coming in extremely frequently, on the order of several thousand discrete chunks of data every 5 seconds. Since it’s coming in too fast for anyone to comprehend, we really just want to show the last 15 chunks to the user. This lets the user know the service is accepting data, and it’s kind of neat to see data updating that fast in real time. My first implementation? ActiveRecord, of course.

create_table "raws" do |t|
  t.text     "line"
  t.datetime "ran_at"
end

Seems simple enough. Since the data had to be parsed, I tossed it into a background job, which looked something like:

class Parser
  attr_accessor :log

  def perform
    lines = ActiveSupport::Gzip.decompress(log).split("\n")
    lines.each do |raw_line|
      create_raw(DateTime.now, raw_line.strip)
    end
  end

  def create_raw(ran_at, line)
    Raw.create :ran_at => ran_at, :line => line
  end
end

Yes, I was pretty stupid and my first implementation just gzipped the data and shot it to the server. We’re now using a proper JSON API hooked up with yajl-ruby, but the real work here broke down into two steps:

  1. Parse input
  2. Create row for each chunk of data

Moving to JSON/YAJL made parsing faster, but what about writing the data? It seemed great until I left it running overnight and, big surprise, there were over a million rows. Ok, that’s fine, we don’t need to keep it all anyway…
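As a rough sketch of the parsing side after the switch, here’s what accepting a JSON payload of lines might look like. The payload shape and field names below are illustrative assumptions, not the actual API; the example uses Ruby’s stdlib JSON, where yajl-ruby’s Yajl::Parser.parse is a faster drop-in for the same job:

```ruby
require 'json'

# Hypothetical payload shape: a JSON array of line objects instead of a
# gzipped blob of raw text. Field names here are made up for illustration.
payload = '[{"at": 1293840000, "command": "rake db:migrate"}]'

# Parse once, then each chunk still becomes one row -- step 1 got faster,
# but step 2 (one write per chunk) is unchanged.
lines = JSON.parse(payload)
lines.each do |line|
  # Raw.create :ran_at => Time.at(line['at']), :line => line['command']
end
```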

def perform
  Raw.delete_all(["created_at < ?", 1.minute.ago])
  # parsing/inserting code here
end

That still wasn’t enough as I increased traffic to that endpoint. Before I refactored how this endpoint accepted data, the table routinely held 50,000 rows (and that’s just the last minute of data), and we had dished out around 140 million primary keys. This was simply storing too much input, and given that we only ever needed to show the last few pieces of it, there had to be a better way to model the data.


I had to sit back and consider my options here. After poring over the PostgreSQL docs and mailing lists, I couldn’t figure out a way to implement this kind of write-heavy behavior in SQL.

I still consider most NoSQL solutions to be awesome utility belts to use alongside relational data stores. I turned to Redis first; a simple Redis list could have handled this:

# in config/initializers/redis.rb
$redis = Redis.new

# when writing data
$redis.lpush("latest-data", JSON.dump(line))
$redis.ltrim("latest-data", 0, 14) # keep only the newest 15 entries

# when reading data
raws = $redis.lrange("latest-data", 0, -1)
raws.map { |raw| JSON.parse(raw) }

This definitely would have been faster, since Redis keeps everything in memory and the LTRIM command trims the list down to a fixed length on every write.

The reads are kind of awkward, though: we’re storing more than one tidbit per entry (when the data was parsed, plus the data itself), and we plan on adding more later. Since Redis only understands strings, every entry has to be serialized in and deserialized out. YAJL would make short work of that, but between the manual serialization and maintaining the size of the list ourselves, it felt like there should be a more natural, built-in way to model this problem.
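To make the awkwardness concrete, here’s a sketch of the serialization bookkeeping the Redis approach forces on every entry. The field names match the ones used later in this post, but the snippet uses stdlib JSON rather than YAJL and is purely illustrative:

```ruby
require 'json'

# Each entry carries the parse time plus the data itself, so it has to be
# flattened into a string before it can live in a Redis list.
entry = { 'at' => Time.now.to_i, 'command' => 'rake db:migrate' }
payload = JSON.dump(entry)

# Reading it back means parsing every element and re-inflating any
# non-string fields by hand, like the timestamp here.
decoded = JSON.parse(payload)
decoded['at'] = Time.at(decoded['at'])
```

Add a field later and both sides of this round trip need updating, on top of the manual LTRIM calls to keep the list capped.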


I next looked at MongoDB. I had a great introduction during MongoBoston, and I had heard of capped collections, but it didn’t click until I hit this passage in the docs:

Capped collections are fixed sized collections that have a very high performance auto-FIFO age-out feature (age out is based on insertion order). […] In addition, capped collections automatically, with high performance, maintain insertion order for the objects in the collection; this is very powerful for certain use cases such as logging.
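The FIFO age-out behavior described above can be sketched in plain Ruby. This is a toy model of the semantics, not the Mongo driver: inserts beyond the cap silently evict the oldest documents, and insertion order is preserved.

```ruby
# Toy model of a capped collection's auto-FIFO age-out.
class CappedList
  def initialize(max)
    @max  = max
    @docs = []
  end

  def insert(doc)
    @docs << doc
    # Once the cap is hit, the oldest documents age out automatically.
    @docs.shift while @docs.size > @max
  end

  def to_a
    @docs.dup
  end
end

tail = CappedList.new(3)
5.times { |i| tail.insert(:n => i) }
tail.to_a # => [{:n => 2}, {:n => 3}, {:n => 4}]
```

Mongo gives you this for free, with the eviction happening server-side at insert time.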

Booyah! I immediately hooked up a free MongoHQ plan and got started. The Mongo Ruby driver is pretty easy to use; after reading some tutorials, I ended up with a decent solution.

First off, connecting to MongoHQ on Heroku was a bit irksome to figure out, so here’s a sample in case you can’t find it documented properly elsewhere. It includes hooking up the Rails.logger and using the right environment on your local machine.

# in config/initializers/mongo.rb
if ENV['MONGOHQ_URL']
  uri    = URI.parse(ENV['MONGOHQ_URL'])
  conn   = Mongo::Connection.from_uri(ENV['MONGOHQ_URL'], :logger => Rails.logger)
  $mongo = conn.db(uri.path.gsub(/^\//, ''))
else
  $mongo = Mongo::Connection.new(nil, nil, :logger => Rails.logger).db("app_#{Rails.env.downcase}")
end

And our capped collection write implementation:

class Tail
  def self.insert(line)
    collection.insert(
      :at => Time.at(line['at']),
      :command => line['command'])
  end

  def self.collection
    @collection ||= $mongo.create_collection("tail",
      :capped => true, :max => 15)
  end
end

Creating a capped collection is pretty simple: you can cap it by a maximum number of documents (:max) or by total size in bytes (:size). Mongo::DB#create_collection won’t blow away an existing collection, so it’s ok to keep calling it in this class.

Reading from the capped collection is simple, except that Mongo always includes a BSON::ObjectId with every document. That’s not something the user needs to see, so I end up filtering it out in Ruby:

class Tail
  def self.last
    collection.find.to_a.tap do |tails|
      tails.each do |tail|
        tail.delete('_id')
      end
    end
  end
end

In practice, each Tail collection is segmented out by users, and MongoHQ even provides a nice interface to browse what’s in your database:


Clear Skies

Overall, I think a few MongoDB capped collections served this use case extremely well over going with Postgres or Redis. Use the best tool for the job, and it’s even better if someone else runs the tool for you. If you have similar use cases of bringing in NoSQL databases alongside relational ones, let us know in the comments!