---
title: Hopping in the Cloud
teaser: We made Hoptoad into a distributed system to better scale its traffic.
tags: airbrake,performance
author: Jason Morrison
published_on: 2010-03-31
---

Hoptoad has been running on the Engine Yard cloud for more than a week now, with
excellent performance.  We've been looking forward to this for quite some time,
and want to share a little about our [motivations](#why) and
[experiences](#learned).

<a name="why"></a>
## Why'd we do it

### Architectural flexibility

Currently, Hoptoad is a monolithic application, with the notification API, data
API, and user-facing website all running in the same Rails application.  This
fits well into a server environment with several identical application servers
and a shared database server.

[![''](http://images.thoughtbot.com/ui/hoptoad-arch-current.png)][arch-large]

[arch-large]: http://images.thoughtbot.com/ui/hoptoad-arch-current-large.png

(Click for a larger image.)

However, this coupled application design has a few implications:

1. Traffic in one component, like the exception notifier API, can adversely
   affect the performance of another component, like the web interface.
1. You can't tune the resource allocation of one component, like the notifier
   <abbr title="Application Programming Interface">API</abbr> endpoint,
   independently of other components, like the web interface.

Now, the performance characteristics of the web interface and the error notifier
endpoint are drastically different.  The notifier endpoint is a very
performance-sensitive, high-write component that processes a few thousand
requests per minute.  It is responsible for validating the incoming error <abbr
title="Extensible Markup Language">XML</abbr>, applying any rate-limiting rules,
and determining the correct error group to bucket the error into.

The web interface, on the other hand, is almost entirely database reads.  It
also has much lower traffic overall, compared to the notifier endpoint.
However, it is probably the place where you'll notice performance dips the most.

Together, these properties make Hoptoad a ripe target for modularization.
Our goal is to move toward a system where the notification endpoint is separated
from the user-facing web UI, so as to isolate them for scaling and performance.

[![''](http://images.thoughtbot.com/ui/hoptoad-arch-decoupled.png)][decoupled-large]

[decoupled-large]: http://images.thoughtbot.com/ui/hoptoad-arch-decoupled-large.png

(Click for a larger image.)

Having a flexible architecture will also allow us to move some processing steps
into a separate queue, which opens up an interesting possibility for batch
processing.

### Batch processing with utility slices

Hoptoad's focus is on identifying and reducing duplicate information into unique
records.  Currently, Hoptoad will insert your exception into the database, and
then determine the group of similar exceptions that it would be assigned to.
When your application issues a high volume of exceptions, this can result in a
large number of INSERT and UPDATE statements.  We've had to rate limit these
cases, simply because the high INSERT/UPDATE rates would otherwise make the site
unusable for other users.  This is still less than ideal though, as high-volume
bursts will lose some exception instances due to rate limiting.

[![''](http://images.thoughtbot.com/ui/hoptoad-nobatch-small.png)][nobatch-large]

[nobatch-large]: http://images.thoughtbot.com/ui/hoptoad-nobatch-large.png

(Click for a larger image.)
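
To make the trade-off concrete, here is a minimal sketch of per-project rate limiting with a fixed time window.  It is an illustration only, not Hoptoad's actual rules: the `FixedWindowLimiter` class, the limit of 3, and the 60-second window are all assumptions made up for this example.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Hypothetical limiter: at most `limit` notices per project per window."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # (project_id, window index) -> notices accepted in that window.
        # Old windows are never pruned in this sketch.
        self.counts = defaultdict(int)

    def allow(self, project_id: int, now: float) -> bool:
        window = int(now // self.window_seconds)
        key = (project_id, window)
        if self.counts[key] >= self.limit:
            return False  # over the limit: this notice is dropped
        self.counts[key] += 1
        return True


limiter = FixedWindowLimiter(limit=3, window_seconds=60)
# Four notices in the first minute, then one in the next minute.
results = [limiter.allow(project_id=42, now=t) for t in (0, 1, 2, 3, 61)]
print(results)  # [True, True, True, False, True]
```

The fourth notice is simply thrown away, which is exactly the kind of loss that makes rate limiting less than ideal.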

But!  We can use the duplication itself to our advantage: we are working on a
queue processing system that works on batches of exceptions at once, identifying
duplicates in the Redis queue and folding them down before inserting into the
database.  This should dramatically ease the INSERT/UPDATE rate for high-volume
exception situations, making Hoptoad much better equipped to handle bursts of
duplicate exceptions.

For example, if your application's database goes down, your application may send
thousands of exceptions to Hoptoad per minute.  Currently, that would result in a
similar rate of database INSERTs and UPDATEs to record and group each exception
individually, which is very disk intensive.  If these duplicates queue up over
the course of a few seconds, the batch processor can precompute duplicate counts
and fold hundreds of duplicates down into a handful of INSERT and UPDATE
statements.

[![''](http://images.thoughtbot.com/ui/hoptoad-batch-small.png)][batch-large]

[batch-large]: http://images.thoughtbot.com/ui/hoptoad-batch-large.png

(Click for a larger image.)
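
The folding step can be sketched in a few lines of Python.  This is a minimal illustration under stated assumptions, not the real batch processor: the `Notice` fields, the `group_key` fingerprint, and `fold_batch` are hypothetical names, and a plain list stands in for the Redis queue.

```python
from __future__ import annotations

from collections import Counter
from typing import NamedTuple


class Notice(NamedTuple):
    """A hypothetical, heavily simplified exception notice."""
    project_id: int
    error_class: str
    message: str


def group_key(notice: Notice) -> tuple:
    # Assumption: same project, class, and message means "duplicate".
    return (notice.project_id, notice.error_class, notice.message)


def fold_batch(batch: list[Notice]) -> dict[tuple, int]:
    """Collapse a batch of queued notices into one count per group."""
    return dict(Counter(group_key(notice) for notice in batch))


# A burst of 500 identical errors plus one unrelated error.
batch = [Notice(1, "Mysql::Error", "Can't connect to server")] * 500
batch.append(Notice(1, "NoMethodError", "undefined method `name' for nil"))

folded = fold_batch(batch)
for key, count in folded.items():
    print(count, key)
```

Here 501 queued notices fold down to two groups, so the database would see two writes carrying counts rather than 501 individual ones.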

The ability to add utility slices for longer-running reporting and processing
tasks also opens up the door for some interesting features that could not
currently be computed during the notification request/response lifecycle.

### Realistic benchmarking with environment cloning

With the _clone environment_ operation available in the Engine Yard cloud
dashboard ([video](http://vimeo.com/7056116)), it's appealingly straightforward
to performance test a new feature by duplicating your production environment,
and forking live traffic to the clone in realtime using a tool like
[em-proxy](http://www.igvita.com/2009/04/20/ruby-proxies-for-scale-and-monitoring)
to see how it performs.

We've hit a few stumbling blocks with this approach, mostly due to having a
large-ish database (several hundred GB) to clone.  We're currently in the
process of benchmarking the addition of a Redis-backed worker queue to implement
the "Batch processor workers" component of the above diagram.

<a name="learned"></a>
## What did we learn

The first time we planned to move over to the cloud, we ran into unplanned
performance issues, and had to roll back to our prior hosting.  We learned a few
things from this, and our second cutover was smooth as silk.

### Realistic load testing is invaluable

![loads](http://images.thoughtbot.com/ui/loadtesting.png)

We load tested the production configuration with synthetic traffic that
approximated our live traffic.  The synthetic load testing indicated a large
amount of performance headroom.

Later, we load tested against live traffic in realtime, using
[em-proxy](http://www.igvita.com/2009/04/20/ruby-proxies-for-scale-and-monitoring)
to fork live traffic over to the cloud environment in parallel to the previous
environment, discarding the cloud responses.  This tactic revealed a different
performance picture, and allowed us to benchmark various hardware configurations
and choose one with a great deal more confidence.
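
The core idea of forking traffic can be sketched without any networking.  This is not em-proxy's API, just a minimal Python illustration of the technique: `fork_request` and the stub backends are hypothetical names, and a real proxy would duplicate requests concurrently rather than one after another.

```python
from typing import Callable, Iterable

Handler = Callable[[bytes], bytes]


def fork_request(request: bytes, primary: Handler,
                 shadows: Iterable[Handler]) -> bytes:
    """Send the request to every backend, but answer only from the primary."""
    response = primary(request)
    for shadow in shadows:
        try:
            shadow(request)  # the shadow's response is deliberately discarded
        except Exception:
            # A failing shadow must never affect the live response.
            pass
    return response


seen_by_cloud = []


def production(request: bytes) -> bytes:
    return b"200 OK from production"


def cloud(request: bytes) -> bytes:
    seen_by_cloud.append(request)  # the clone sees real traffic...
    return b"500 from cloud"       # ...but its answer never reaches the client


def broken(request: bytes) -> bytes:
    raise ConnectionError("clone is down")


reply = fork_request(b"POST /notices", production, [cloud, broken])
print(reply)
```

The client only ever sees production's response, while the cloud environment still handles every request and can be measured under real load.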

However, when we ran these live parallel load tests, we intentionally disabled
email delivery so as not to deliver duplicate exception notifications to our
customers.  This could have left us blind to the performance of a critical
component, since it's easy to assume that delivering an email will be a low
latency operation.  By default, we were using ssmtp in our new configuration,
which only runs in interactive mode.  In that setup, our application would block
on every SMTP delivery request for hundreds of milliseconds, which is much too
long.  We switched from ssmtp to [exim](http://www.exim.org) as a queueing MTA in
front of [SendGrid](http://sendgrid.com), to minimize the time to deliver
transactional emails.
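
The difference a queue makes can be shown with a small Python sketch.  It is only an analogy for what a queueing MTA does for the application: the in-process `outbox`, the `slow_smtp_send` stand-in, and the 0.2 second delay are assumptions for the demo, not measurements.

```python
import queue
import threading
import time

outbox = queue.Queue()
delivered = []


def slow_smtp_send(message: str) -> None:
    # Stand-in for a blocking SMTP conversation (assumed to take ~0.2 s).
    time.sleep(0.2)
    delivered.append(message)


def worker() -> None:
    # Drains the outbox in the background, like a local mail queue runner.
    while True:
        message = outbox.get()
        try:
            slow_smtp_send(message)
        finally:
            outbox.task_done()


threading.Thread(target=worker, daemon=True).start()

start = time.perf_counter()
outbox.put("Exception in production: NoMethodError")  # returns immediately
enqueue_seconds = time.perf_counter() - start

outbox.join()  # demo only: wait for the background delivery to finish
print(f"enqueue took {enqueue_seconds:.4f}s, delivered {len(delivered)} message")
```

The request path pays only for the enqueue, and the slow delivery happens out of band.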

### Buttressing the DNS cascade: swinging with iptables

![swings](http://images.thoughtbot.com/ui/swing.jpg)

When the scheduled cutover approached, we made sure to drop our DNS TTL, so that
when it came time to repoint our DNS for hoptoadapp.com to the new server, the
DNS change would cascade quickly.  Engine Yard went one step further, though,
and used iptables to redirect traffic from our old IP to the new IP, so that
users would have a seamless experience after we brought up the cloud
environment, regardless of whether the new DNS entry had reached them or not.

### Database caches are key

![caches](http://images.thoughtbot.com/ui/my_secret_cache.jpg)

The first time we switched over, we took roughly the following steps:

1. Use MySQL replication to keep the new cloud database in sync with the old
   production database.
1. Put up a maintenance page on the old web server.
1. Allow replication to catch up.
1. Redirect traffic (DNS and iptables).
1. Receive new traffic on the cloud environment.

Once we opened the floodgates of traffic, the database thrashed as its query
caches were completely cold.  The caches began to fill up as we handled traffic,
but the user experience was very poor, and we watched the response time grow to
hundreds of seconds in NewRelic.

The second time we switched, we still kept the cloud database up to date with
MySQL replication.  However, we also removed the application's write privileges
to the cloud database, and ran em-proxy to fork live traffic to the cloud
application servers.  This ensured that the read caches were full, all the way
up to the point of cutover.  When we completed the replication during downtime,
we did not have to restart the database server, leaving the caches fat and
happy, ready to serve normal traffic levels.
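
A toy simulation shows why the warm-up mattered.  This is a sketch under loud assumptions: the `LRUCache` class below is a stand-in for the database's caches rather than a model of MySQL, and the traffic is an invented 1,000 reads over a hot set of 50 rows.

```python
from collections import OrderedDict


class LRUCache:
    """A toy least-recently-used cache that only counts hits and misses."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key: int) -> None:
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return
        self.misses += 1  # a miss stands in for a slow read from disk
        self.entries[key] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used


def serve(cache: LRUCache, traffic: list) -> tuple:
    """Replay traffic against the cache; return (hits, misses) for this run."""
    cache.hits = cache.misses = 0
    for key in traffic:
        cache.read(key)
    return cache.hits, cache.misses


# Assumption: 1,000 reads spread evenly over a hot set of 50 rows.
traffic = [i % 50 for i in range(1000)]

cold = serve(LRUCache(capacity=100), traffic)

warmed_cache = LRUCache(capacity=100)
serve(warmed_cache, traffic)         # mirrored traffic before the cutover
warm = serve(warmed_cache, traffic)  # real traffic after the cutover

print("cold (hits, misses):", cold)
print("warm (hits, misses):", warm)
```

In the cold run every first touch of a row is a miss, and those misses all land at the moment traffic arrives.  In the warm run the mirrored traffic has already paid for them.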

### Doing lots with lots of email

Hoptoad sends a reasonably large amount of email: tens of thousands of messages
per day.

The switch to Engine Yard cloud also afforded us a convenient time to reconsider
our email delivery.  We were previously using an internal Engine Yard SMTP
server that is not available to cloud customers.  We wanted the switch to be as
low-impact as possible, so after evaluating a variety of hosted transactional
SMTP providers, we went with [SendGrid](http://sendgrid.com).  We also checked
out [Postmark](http://postmarkapp.com), which looks very promising.  However,
Postmark provides an HTTP <abbr title="Application Programming Interface">API</abbr>
for mail delivery, and we decided to stick with an SMTP interface to minimize
the impact on our codebase.

## And you

So that's where we're at.  We're looking forward to improving the architecture
as we handle more traffic, being able to add interesting features that take
advantage of our new flexibility, and continuing to refine Hoptoad as a useful
service.

What have your experiences been with hosting "in the cloud"?  What have you
learned?  What benefits have you gained, or would you like to gain, with
flexible hosting?

FYI: _Hoptoad/Airbrake was sold to RackSpace and is now called [Airbrake Bug Tracker](http://airbrake.io)_.