This is the API-focused sequel to the Web-focused post. As before, the main take-away from this post is to always check return values. A corollary is to design the user experience to include error messages in a helpful manner.
The second example is the more complex flow: sync and async with a service object behind an API. We need to report as many errors as we can – if we’re performing a synchronous task, we need to dispatch on its result – but also save space for errors further down the road. The UI dictates various pieces of communication, and we need to fulfill that communication.
As mentioned before, this is a two-way street: the user experience team that builds the UI needs to work with the developers to understand what can go wrong. Sometimes the developers won’t find out until implementation is underway, and so the designers need to stick around and be prepared for an overhaul.
With an API there is a level of indirection: the UI team, mobile and other client teams, and backend team must work together to understand all that can go wrong and how the user should know about it.
Given what we know, we need to save the current checkout details, kick off a background job, and report what we can. We’ll use a service object to wrap some of these complexities: Checkout#commit saves the state and kicks off a background job.
This represents the “just before the view” point in an API. Let’s see what it takes to fill in the rest.
Let’s start the service object by scoping out a naive implementation of our Checkout#commit method. This has a bug!
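A minimal sketch of what such a naive version might look like, runnable without Rails. `StubOrder`, `StubPaymentJob`, and `NaiveCheckout` are hypothetical stand-ins invented for illustration; the only thing borrowed from Rails is the contract that `#update` returns false, rather than raising, when the save fails.

```ruby
# Hypothetical stand-in for an ActiveRecord model: like ActiveRecord's #update,
# it returns false and fills #errors instead of raising.
class StubOrder
  attr_reader :state, :errors

  def initialize(stale:)
    @stale = stale
    @state = :cart
    @errors = []
  end

  def update(state:)
    if @stale
      @errors << "This cart has changed since the page was loaded"
      return false
    end
    @state = state
    true
  end
end

# Hypothetical stand-in for the job class; it only records what was enqueued.
class StubPaymentJob
  def self.enqueued
    @enqueued ||= []
  end

  def self.perform_later(*args)
    enqueued << args
    new
  end
end

class NaiveCheckout
  def initialize(order, payment_token)
    @order = order
    @payment_token = payment_token
  end

  def commit
    @order.update(state: :processing) # return value ignored: this is the bug
    StubPaymentJob.perform_later(@payment_token, @order)
  end
end

order = StubOrder.new(stale: true)
NaiveCheckout.new(order, "tok_example").commit
puts "order state: #{order.state}, jobs enqueued: #{StubPaymentJob.enqueued.size}"
```

The order never leaves its original state, yet a payment job is sitting in the queue.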
The issue is that we’ve violated our contract: we need #errors to produce any errors that we can generate synchronously. Our call to #update could generate useful errors – for example, if the user tries to press “Checkout” in a stale tab on an old cart – but we drop them into the bit bucket instead. Moreover, we march forward on an invalid update, enqueuing the job to charge the customer.
By now we’re deep down the code path, far from the ERb view rendering this. Whatever decisions we made up the call stack are going to be hard to change now. This is often where people reach for exceptions. Note that a stale cart is not an exceptional situation; we’re reaching for an exception instead because we coded ourselves into a corner.
```ruby
def commit
  @order.update!(state: :processing) # changed from a non-raising #update
  PaymentProcessorJob.perform_later(@payment_token, @order)
end
```
And honestly, I’d rather see #update! than an unchecked #update. At least it’s not hiding a bug.
But the better solution – and, luckily, the actual design we had set ourselves up for – is to leave the exceptions for the exceptional cases but to treat a failed #update as something we can tell the user about.
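As a rough sketch of that idea, the service object can branch on the return value of `#update` and copy the order's errors into its own `#errors`. Everything here (`SketchOrder`, `RecordingJob`, `CheckedCheckout`, the injected `job_class:`) is a hypothetical stand-in so the example runs on its own; it is one possible shape, not necessarily the one the real app uses.

```ruby
# Hypothetical stand-ins so the sketch runs without Rails.
class SketchOrder
  attr_reader :state, :errors

  def initialize(valid:)
    @valid = valid
    @state = :cart
    @errors = []
  end

  # Mirrors ActiveRecord's #update contract: true on success,
  # false plus populated #errors otherwise.
  def update(state:)
    unless @valid
      @errors << "This cart has changed since the page was loaded"
      return false
    end
    @state = state
    true
  end
end

class RecordingJob
  def self.enqueued
    @enqueued ||= []
  end

  def self.perform_later(*args)
    enqueued << args
    new
  end
end

class CheckedCheckout
  attr_reader :errors

  def initialize(order, payment_token, job_class: RecordingJob)
    @order = order
    @payment_token = payment_token
    @job_class = job_class
    @errors = []
  end

  # Returns the enqueued job, or false with #errors populated.
  def commit
    unless @order.update(state: :processing)
      @errors.concat(@order.errors)
      return false
    end
    @job_class.perform_later(@payment_token, @order)
  end
end

checkout = CheckedCheckout.new(SketchOrder.new(valid: false), "tok_example")
puts checkout.commit.inspect
puts checkout.errors.inspect
```

A stale cart now stops the flow before anything is enqueued, and the caller has something to show the user.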
Let’s go into detail about our method calls.
As was discussed earlier, #update can raise an exception on database failures. These are programmer bugs, which we are letting bubble up. It’s good that we’re considering these exceptions and recognizing how we want to handle them.
Next up is the #perform_later method, part of ActiveJob. This is documented to return an instance of the job class; it does this by calling the #enqueue method, which returns the job instance, or false on failure.
This is exactly what we want out of our #commit method, so we can just return its result.
Perfect? What if the order is marked processing, but the payment processor fails to enqueue? What if the underlying queue adapter raises (such as during a lost Redis connection)?
Let’s try to orchestrate these two error-producing things:
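One way this orchestration might look, as a sketch under stated assumptions: `QueueUnavailable` is a made-up exception standing in for whatever the queue adapter raises when its backend is unreachable, and `OrderStub`, `RefusingJob`, and `RaisingJob` are stubs for the two failure shapes. The compensating `#update` is itself checked, since it can fail too.

```ruby
# Hypothetical error standing in for whatever the queue adapter raises
# when its backend (for example Redis) is unreachable.
class QueueUnavailable < StandardError; end

# Hypothetical order stub following ActiveRecord's #update contract.
class OrderStub
  attr_reader :state, :errors

  def initialize
    @state = :cart
    @errors = []
  end

  def update(state:)
    @state = state
    true
  end
end

class OrchestratedCheckout
  attr_reader :errors

  def initialize(order, payment_token, job_class)
    @order = order
    @payment_token = payment_token
    @job_class = job_class
    @errors = []
  end

  def commit
    previous_state = @order.state
    unless @order.update(state: :processing)
      @errors.concat(@order.errors)
      return false
    end

    job = enqueue
    return job if job

    # The order is marked processing but nothing will charge it: undo and report.
    unless @order.update(state: previous_state)
      raise "order left in processing with no payment job"
    end
    @errors << "We could not start processing your payment. Please try again."
    false
  end

  private

  # Normalizes "returned false" and "adapter raised" into one falsy result.
  def enqueue
    @job_class.perform_later(@payment_token, @order)
  rescue QueueUnavailable
    false
  end
end

# Job stubs for the two failure shapes.
class RefusingJob
  def self.perform_later(*_args)
    false
  end
end

class RaisingJob
  def self.perform_later(*_args)
    raise QueueUnavailable, "connection lost"
  end
end

[RefusingJob, RaisingJob].each do |job_class|
  order = OrderStub.new
  checkout = OrchestratedCheckout.new(order, "tok_example", job_class)
  puts "#{job_class}: commit=#{checkout.commit.inspect} state=#{order.state}"
end
```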
Ick. How about we enqueue it first, with a delay, then cancel it ASAP if the update fails?
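Here is a sketch of that enqueue-first ordering. How a scheduled job gets cancelled depends on the queue adapter, so `DelayedQueue`, `enqueue_in`, and `cancel` are all invented for this example, as are `ToggleOrder` and `DelayFirstCheckout`; treat it as the shape of the idea rather than a real API.

```ruby
# Hypothetical in-memory queue. `enqueue_in` and `cancel` are invented for this
# sketch; a real queue adapter has its own way to schedule and cancel jobs.
class DelayedQueue
  Handle = Struct.new(:args, :delay, :cancelled) do
    def cancel
      self.cancelled = true
    end
  end

  attr_reader :handles

  def initialize
    @handles = []
  end

  def enqueue_in(delay_seconds, *args)
    handle = Handle.new(args, delay_seconds, false)
    @handles << handle
    handle
  end

  def pending
    @handles.reject(&:cancelled)
  end
end

# Hypothetical order stub following ActiveRecord's #update contract.
class ToggleOrder
  attr_reader :state, :errors

  def initialize(valid:)
    @valid = valid
    @state = :cart
    @errors = []
  end

  def update(state:)
    unless @valid
      @errors << "This cart has changed since the page was loaded"
      return false
    end
    @state = state
    true
  end
end

class DelayFirstCheckout
  DELAY_SECONDS = 5

  attr_reader :errors

  def initialize(order, payment_token, queue)
    @order = order
    @payment_token = payment_token
    @queue = queue
    @errors = []
  end

  def commit
    handle = @queue.enqueue_in(DELAY_SECONDS, @payment_token, @order)
    unless handle
      # Nothing has been saved yet, so there is nothing to undo.
      @errors << "We could not start processing your payment. Please try again."
      return false
    end
    return handle if @order.update(state: :processing)

    handle.cancel # the update failed, so the job must not run
    @errors.concat(@order.errors)
    false
  end
end

queue = DelayedQueue.new
checkout = DelayFirstCheckout.new(ToggleOrder.new(valid: false), "tok_example", queue)
puts "commit=#{checkout.commit.inspect} pending=#{queue.pending.size}"
```

The trade-off is a race: the cancellation has to land before the delay runs out.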
No one said this was easy.
This covers the synchronous errors, but now we have a background job that is about to make a network request. As the comic says, “oh no!”
We need to communicate these errors back via the API. Assuming GraphQL and a patient mobile development team, we’re talking about a subscription. We’ll listen for subscription requests by the job ID, and on job updates we’ll send back a job status.
We need more infrastructure. Here’s the first pass of the background job:
Our hypothetical payment provider offers a #capture! method to finalize the charge, but it raises an exception on failure. So here we have to decide: do we let this exception bubble, or do we transform it into something else?
Our goal is to notify our API subscribers. We programmed ourselves into a corner again: an exception would float into the ether, logged into our exception tracker, disappearing from the user’s view. We don’t want that. So we handle the exception that we expect: rescue it and notify our API subscribers.
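A small sketch of rescuing the expected failure and turning it into a notification. `CaptureFailed`, `StubProvider`, `StubSubscribers`, and `PaymentCaptureSketch` are hypothetical names for illustration; the order state update is left out here to keep the focus on the rescue.

```ruby
# Hypothetical exception for a declined or failed capture.
class CaptureFailed < StandardError; end

# Hypothetical payment provider whose #capture! raises on failure.
class StubProvider
  def initialize(decline:)
    @decline = decline
  end

  def capture!(_payment_token)
    raise CaptureFailed, "The card was declined" if @decline
    true
  end
end

# Hypothetical stand-in for pushing a status update to API subscribers.
class StubSubscribers
  attr_reader :updates

  def initialize
    @updates = []
  end

  def notify(job_id, status:, message: nil)
    @updates << { job_id: job_id, status: status, message: message }
  end
end

class PaymentCaptureSketch
  def initialize(job_id:, provider:, subscribers:)
    @job_id = job_id
    @provider = provider
    @subscribers = subscribers
  end

  def perform(payment_token)
    @provider.capture!(payment_token)
    @subscribers.notify(@job_id, status: :completed)
  rescue CaptureFailed => e
    # Expected failure: tell the user instead of only logging it.
    @subscribers.notify(@job_id, status: :failed, message: e.message)
  end
end

subscribers = StubSubscribers.new
PaymentCaptureSketch.new(
  job_id: "job-1",
  provider: StubProvider.new(decline: true),
  subscribers: subscribers
).perform("tok_example")
puts subscribers.updates.inspect
```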
Astute readers will notice a mistake we’ve mentioned already, but there are two other errors that stand out to me.
We are using ActiveJob’s GlobalID mechanism to pass full objects through to the job. If the Order object is deleted from the database before this job runs (remember that five-second delay at the end of the prior section?), this will cause ActiveJob to raise an ActiveJob::DeserializationError. Order objects are permanent; if one is deleted, this is a programmer error, and should bubble up to our exception tracker.
However, we would violate our contract – to notify our API subscribers – if we simply bubbled up; we need to notify along the way. Let’s use rescue_from to handle all exceptions with a notification and then re-raise.
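The notify-then-re-raise shape, sketched in plain Ruby so it runs without Rails. In a real job this logic would presumably live in a `rescue_from(StandardError)` block, which should also see errors raised while the job's arguments are being deserialized. `RecordGone`, `FailureLog`, and `with_failure_notification` are hypothetical names.

```ruby
# Hypothetical error standing in for a programmer error such as a missing record.
class RecordGone < StandardError; end

# Hypothetical stand-in for pushing a status update to API subscribers.
class FailureLog
  attr_reader :updates

  def initialize
    @updates = []
  end

  def notify(job_id, status:, message: nil)
    @updates << { job_id: job_id, status: status, message: message }
  end
end

# Runs the block; on any StandardError, notifies subscribers and then re-raises
# so the exception tracker still sees the original exception.
def with_failure_notification(subscribers, job_id)
  yield
rescue StandardError
  subscribers.notify(job_id, status: :failed, message: "Something went wrong on our end")
  raise
end

log = FailureLog.new
begin
  with_failure_notification(log, "job-1") { raise RecordGone, "Order is missing" }
rescue RecordGone => e
  puts "tracker would receive: #{e.class}"
end
puts log.updates.inspect
```

The user hears a generic message while the precise exception still reaches the tracker.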
The next issue that stands out to me is the fact that our payment provider is making a network call. This can lead to timeouts, DNS errors, TLS errors, TCP errors, and HTTP protocol errors. Those are most definitely going to come in as exceptions. Some of these exceptions mean we should try again in a minute, and some don’t.
The retry is a notable aspect often left out of UI-related async discussions. The user will have some expectation that things are going well unless they hear of an error. After a discussion with the product team, everyone agrees to let the API know when a job is taking longer than expected, to allow the mobile team to experiment with informing the user.
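To make the retry-and-inform idea concrete, here is a sketch that treats timeouts as transient, reports a `:delayed` status before each retry, and lets everything else bubble. ActiveJob offers `retry_on` for the scheduling side of this; the plain-Ruby loop below only illustrates the decision. `CardRejected`, `capture_with_retries`, and the status payload are assumptions, and a real job would wait between attempts.

```ruby
require "timeout"

# Hypothetical non-retryable failure.
class CardRejected < StandardError; end

MAX_ATTEMPTS = 3

# Calls the block up to max_attempts times. Timeouts are treated as transient:
# subscribers hear that the job is delayed, then we try again. Any other error,
# and a timeout on the final attempt, propagates to the caller.
def capture_with_retries(notify, max_attempts: MAX_ATTEMPTS)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue Timeout::Error
    raise if attempts >= max_attempts
    notify.call({ status: :delayed, attempt: attempts })
    retry
  end
end

updates = []
result = capture_with_retries(->(update) { updates << update }) do |attempt|
  raise Timeout::Error, "provider timed out" if attempt < 3
  :captured
end
puts result.inspect
puts updates.inspect
```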
With that in place, this example still fails to apply lessons from an earlier section: we call #update without checking the return value. But what does it mean for the #update to fail? Again, a corner case that did not come up during the design.
For #update to fail would mean that the customer is charged but the order is in a processing state. It will appear in their list of in-progress purchases – at this point, forever – but their credit card bill will reflect the reality that they have paid for the product. And without moving to the completed state, the fulfillment center will never see it.
The stakes are high, but luckily this specific failure is rare. The solution here can be as simple as notifying the customer support team but otherwise treating this as a success.
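Sketched out, that decision might look like the following: check the return value of `#update`, alert support if it fails, and still report success to the user because the charge went through. `FlakyOrder`, `finish_order`, and the `support` and `subscribers` callables are hypothetical stand-ins.

```ruby
# Hypothetical order stub following ActiveRecord's #update contract.
class FlakyOrder
  attr_reader :id, :state, :errors

  def initialize(id:, saves:)
    @id = id
    @saves = saves
    @state = :processing
    @errors = []
  end

  def update(state:)
    unless @saves
      @errors << "state could not be saved"
      return false
    end
    @state = state
    true
  end
end

# Called after the charge has been captured successfully.
def finish_order(order, subscribers:, support:)
  unless order.update(state: :completed)
    # The customer has paid, so a human needs to push this order along.
    support.call("Order #{order.id} was charged but is stuck in #{order.state}: #{order.errors.join(', ')}")
  end
  # From the user's point of view the payment succeeded either way.
  subscribers.call({ status: :completed })
end

support_inbox = []
user_updates = []
finish_order(
  FlakyOrder.new(id: 42, saves: false),
  subscribers: ->(update) { user_updates << update },
  support: ->(message) { support_inbox << message }
)
puts support_inbox.inspect
puts user_updates.inspect
```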
Let’s leave the mail failure for the exception tracker. This allows us to complete the checkout flow just in time to end this article.
We followed two common examples – authentication and checkout – through all their error handling woes. Along the way we uncovered corner cases and found places where the design did or did not lend itself to helping the user through failure. The design dictated the implementation; thinking about errors up-front helped us build out a system for handling most of what can be thrown at us.
For each method we used, we thought through its failure modes. This included its return value, its documented inputs, and its exceptions both known and undocumented. We bucketed failures into user-fixable, user-visible, and programmer error, and we handled each bucket differently.
Following this practice exposed a stable series of patterns that will reduce the growing noise of exceptions accumulating in the exception tracker and also reduce the number of bugs hiding for users to discover after the next refactoring.
With a little forethought and discipline, our users can have a more informative and stable app experience.
(Thank you to Eric Bailey for his wonderful UI mocks used throughout this post.)