---
title: 'Handling external API errors: A resumable approach'
teaser: 'Clarifying a few possible ways to implement resumable workflows when working
  with external APIs.

  '
tags: error handling,api,distributed systems,postgresql
author: Thiago Araújo Silva
published_on: 2023-01-24
---

When an application integrates with external APIs, making its
workflows fault-tolerant and resumable is often a subtle requirement
that's easy to overlook. Why design a workflow with such qualities in
mind, and what are the pitfalls?

## Reasons to make a workflow fault-tolerant and resumable

"Fault-tolerant and resumable" is a set of error handling strategies
to maintain integrity and reach _eventual data consistency_ in the
face of expected and unexpected errors, so it's a very important
quality in certain workflows.

When integrating with external systems, we'll generally want to keep
track of their data in a database that we fully control; that may
include external IDs, monetary amounts, taxes, etc., and any other
valuable data returned to our system.

But how do we ensure those data references are consistent and
correctly in sync with the external system?

## What not to do

Let's use the following example workflow:

1. Given a contract with fundings, issue an API request to create an
   external account;
2. Iterate over existing contract fundings, and for each:
   - Issue an API request to fund the external account;
   - Mark the funding as released.

Here's a naive way to design such a workflow:

```ruby
ApplicationRecord.transaction do
  external_account_id = create_external_account(contract)
  contract.update!(external_account_id: external_account_id)

  contract.fundings.each do |funding|
    fund_external_account(contract, funding)
    funding.update!(released: true)
  end
end
```

Let's assume the first funding fails to be updated, and an exception
is raised. What happens? The database transaction would be rolled
back, and all of the local state created within the transaction would
evaporate. External state, however, _wouldn't_ be rolled back; we'd
have an external account, potentially with a few fundings, not tracked
anywhere in our local system.

Worse yet, a new attempt to run the same workflow would yield the same
side effects, which means the workflow is not _idempotent_. That may
be far from acceptable depending on our requirements and how the
external system works.
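
To make the failure mode concrete, here's a minimal sketch that runs without Rails. `FakeExternalApi` and `FakeDatabase` are hypothetical in-memory stand-ins (they are not part of any real library), and the "transaction" simply restores a snapshot when the block raises:

```ruby
# Hypothetical in-memory stand-ins; none of these names come from a real library.
class FakeExternalApi
  attr_reader :accounts

  def initialize
    @accounts = []
  end

  # Pretends to create a remote account and returns its external ID.
  def create_account
    id = "acct_#{@accounts.size + 1}"
    @accounts << id
    id
  end
end

class FakeDatabase
  attr_reader :rows

  def initialize
    @rows = {}
  end

  # Restores the previous rows if the block raises, like a SQL rollback.
  def transaction
    snapshot = @rows.dup
    yield
  rescue StandardError
    @rows = snapshot
    raise
  end
end

def naive_workflow(api, db)
  db.transaction do
    db.rows[:external_account_id] = api.create_account
    raise "funding update failed" # simulate the local failure
  end
end

api = FakeExternalApi.new
db = FakeDatabase.new

2.times do
  begin
    naive_workflow(api, db)
  rescue RuntimeError
    # The workflow failed; a retry is expected to happen later.
  end
end

puts db.rows.inspect      # local state was rolled back on every attempt
puts api.accounts.inspect # external state was not: one account per attempt
```

Every attempt leaves one more untracked external account behind, which is exactly the non-idempotent behavior described above.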

## What to do

### Divide into control points

When designing a workflow like that, we'll never want to roll back
local transactions. A much better version of our workflow would be:

```ruby
if contract.external_account_id.nil?
  external_account_id = create_external_account(contract)
  contract.update!(external_account_id: external_account_id)
end

contract.fundings.each do |funding|
  if funding.unreleased?
    fund_external_account(contract, funding)
    funding.update!(released: true)
  end
end
```

The first noticeable improvement is that there is no transaction.
Second, we're guarding the critical steps of our workflow under
conditionals; if the local state for a step is already recorded, we
skip that step. In other words, we're defining _units of work_ that
serve as control points to resume the computation where it left off
with the help of persisted state. Creating an account is the first unit
of work, and funding the account is the second.

Now, if a funding fails to be updated, at least the creation of the
external account would be persisted. Rerunning the workflow would skip
creation of the external account and hopefully release the remaining
fundings, and we'd be good to go!

That improves the resumability of our workflow but is still not ideal.
Let's find out why.
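
Before moving on, here's a self-contained sketch of the control points resuming a run. The `Funding` and `Contract` structs and the `RecordingApi` class are hypothetical stand-ins for the real models and API client, and the failure is simulated at the API call itself (the benign case, where no external side effect happened):

```ruby
# Hypothetical in-memory models standing in for ActiveRecord objects.
Funding = Struct.new(:amount, :released)
Contract = Struct.new(:external_account_id, :fundings)

# Hypothetical API client that records every request it accepts.
class RecordingApi
  attr_reader :requests

  def initialize(fail_on_request: nil)
    @requests = []
    @fail_on_request = fail_on_request
  end

  def create_account
    record(:create_account)
    "acct_1"
  end

  def fund_account(account_id, amount)
    record([:fund, account_id, amount])
  end

  private

  # Raises once, on the n-th request, to simulate a transient outage.
  def record(request)
    if @fail_on_request == @requests.size + 1
      @fail_on_request = nil
      raise IOError, "simulated timeout"
    end
    @requests << request
  end
end

def run_workflow(contract, api)
  # First control point: skip if the account is already recorded.
  contract.external_account_id ||= api.create_account

  contract.fundings.each do |funding|
    next if funding.released # second control point, one per funding

    api.fund_account(contract.external_account_id, funding.amount)
    funding.released = true
  end
end

contract = Contract.new(nil, [Funding.new(100, false), Funding.new(200, false)])
api = RecordingApi.new(fail_on_request: 3)

begin
  run_workflow(contract, api)
rescue IOError
  run_workflow(contract, api) # resumes at the second funding
end

puts api.requests.inspect
```

The second run issues only the request that previously failed; the account and the first funding are skipped because their local state is already recorded.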

### Roll back external state

The problem is that when a funding fails to be marked as `released`,
its external counterpart is not rolled back:

```ruby
fund_external_account(contract, funding)

# If failing here, the previous command wouldn't roll back
funding.update!(released: true)
```

If we rerun our workflow successfully, we'd be double-funding the
external account, which seems a very dangerous and undesirable side
effect. The solution is to roll back the external funding, but how?

A general answer to that question lies in the [Sagas pattern]. Since
external systems are unaware of local transactions, the solution is to
emit compensating actions to roll back the external state established
within the current unit of work. Now, _how to do that_ may vary
depending on what the external system provides.
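
As a rough illustration of a compensating action, here's a minimal sketch. The `with_compensation` helper and the `:fund`/`:reverse` events are hypothetical names made up for this example; a real compensation would be whatever request the external system offers to undo or neutralize the original one:

```ruby
# Hypothetical helper: runs an external action, then yields its result to a
# local step. If the local step raises, it runs the compensating action with
# the same result and re-raises the original error.
def with_compensation(action:, compensate:)
  result = action.call

  begin
    yield result
  rescue StandardError
    compensate.call(result)
    raise
  end

  result
end

events = [] # stand-in for the requests sent to the external system

begin
  with_compensation(
    action: -> { events << [:fund, 100]; 100 },
    compensate: ->(amount) { events << [:reverse, amount] }
  ) do |_amount|
    raise "local update failed" # simulate funding.update! blowing up
  end
rescue RuntimeError => e
  puts "workflow failed: #{e.message}"
end

puts events.inspect # the funding is followed by its compensation
```

Note that the compensation only runs when the external action succeeded and the local step failed, which matches the boundaries of a single unit of work.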

In our example, a "funding" creates _ledger entries_. One option would
be a new API request to _delete_ the extra ledger entries, but let's
assume the external system doesn't allow that. Instead, ledger entries
can be created with a `pending` status and then `archived` if
something goes wrong. If all goes well, we can move their status to
`posted`. Let's introduce that change into our snippet:

```ruby
if contract.external_account_id.nil?
  external_account_id = create_external_account(contract)
  contract.update!(external_account_id: external_account_id)
end

contract.fundings.each do |funding|
  if funding.unreleased?
    begin
      ledger_transaction_id = fund_external_account(contract, funding)
      funding.update!(released: true)

      PostLedgerEntries.call(ledger_transaction_id)
    rescue Exception => e
      if ledger_transaction_id
        ArchiveLedgerEntries.call(ledger_transaction_id)
      end

      # Reraising is important to signal a failed workflow
      raise e
    end
  end
end
```

Archiving ledger entries is how we roll back fundings. That's great!
So, is our mission accomplished? Not yet.

### Run select parts in the background

What if the "archive" step throws an exception?

```ruby
rescue Exception => e
  if ledger_transaction_id
    # What if this line throws an exception?
    ArchiveLedgerEntries.call(ledger_transaction_id)
  end

  # ...
```

When rerunning the workflow, the failed funding would be left orphaned
and unreleased, and another funding could be created with a `posted`
status. That would be disastrous since we're dealing with money. And
the same would happen if the entries failed to be posted!

We can solve that problem by running the "archive" and "post" steps in
another process or thread. Background processing tools such as Sidekiq
have built-in retry capabilities, making our workflow more likely to
reach eventual consistency. Moving parts into the background should be
analyzed case-by-case, but it would make sense in our example. Let's
do it:

```ruby
if contract.external_account_id.nil?
  external_account_id = create_external_account(contract)
  contract.update!(external_account_id: external_account_id)
end

contract.fundings.each do |funding|
  if funding.unreleased?
    begin
      ledger_transaction_id = fund_external_account(contract, funding)
      funding.update!(released: true, ledger_transaction_id: ledger_transaction_id)
    rescue Exception => e
      if ledger_transaction_id
        ArchiveLedgerEntries.perform_async(ledger_transaction_id)
      end

      # Reraising is important to signal a failed workflow
      raise e
    else
      PostLedgerEntries.perform_async(ledger_transaction_id)
    end
  end
end
```

<aside class="info">
  Did you know that <a href="https://docs.ruby-lang.org/en/master/syntax/exceptions_rdoc.html">rescue supports an else clause</a>?
</aside>

We've turned `ArchiveLedgerEntries` and `PostLedgerEntries` into jobs
with a `perform_async` method, which can be independently dispatched
to the background. Unfortunately, no solution is perfect, so there's
still a chance that retries would be exhausted and neither step would ever
finish. In that case, we could learn something new about the domain
and improve our code, or even devise tools to detect and repair
inconsistencies, but we won't touch that here in this article.

### Use locks to handle concurrency

Our workflow should be perfect now that it's resumable and
fault-tolerant; API requests shouldn't be able to make it through more
than once, right? Wrong.

What if two or more instances run _exactly_ at the same time and read
back the same state? One or more `if` conditions could resolve to
truthy, which would result in two single-funded accounts (one of them
being orphaned) or, even worse, one double-funded account! When
it comes to critical workflows, we can't assume this will never
happen. It _can_ happen, and it's more frequent than one would
imagine. For example:

- A user could click on a submit button multiple times and trigger
  multiple web requests (if the workflow is synchronous);

- Two or more jobs could accidentally be scheduled for the same
  contract and be executed concurrently (if the workflow is
  asynchronous);

- Two or more people could trigger our workflow on hundreds of
  contracts, and due to stale UIs on either side, trigger it
  concurrently for the same contract.

_Many_ things could happen, and resolving the issue on the front end
is never enough unless the workflow is not critical. The back end
**should** guarantee data consistency.

Since our workflow is backed by a `contract` database row, a simple
solution would be to use a [row-level lock] when working with
PostgreSQL or another compliant database. It would be used solely as
a global [mutex] across processes and machines. However, row-level
locks run under a transaction, and that's where we need to be careful.
We want to ensure our transaction is always committed and never rolled
back, so the following pattern would be advisable:

```ruby
captured_exception = nil

# Warning, implicit transaction here!
contract.with_lock do
  contract.reload

  # Execute work
rescue Exception => e
  # Let the transaction commit no matter what

  captured_exception = e
end

raise captured_exception if captured_exception
```

<aside class="info">
  I'd advise you to play with locks in <code>psql</code> to learn how they work
</aside>

Let's apply that pattern to our code snippet:

```ruby
captured_exception = nil

contract.with_lock do
  contract.reload

  if contract.external_account_id.nil?
    external_account_id = create_external_account(contract)
    contract.update!(external_account_id: external_account_id)
  end

  contract.fundings.each do |funding|
    if funding.unreleased?
      begin
        ledger_transaction_id = fund_external_account(contract, funding)
        funding.update!(released: true, ledger_transaction_id: ledger_transaction_id)
      rescue Exception => e
        if ledger_transaction_id
          ArchiveLedgerEntries.perform_async(ledger_transaction_id)
        end

        raise e
      else
        PostLedgerEntries.perform_async(ledger_transaction_id)
      end
    end
  end
rescue Exception => e
  captured_exception = e
end

raise captured_exception if captured_exception
```

Now, only one workflow would acquire the lock at a time, while
concurrent workflows would be paused until they are able to acquire
the lock.
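
The effect of the lock can be reproduced in a single process. The sketch below is only an analogy: a `Mutex` plays the role of the row-level lock, a `Hash` plays the role of the contract row, and `AccountCreator` is a hypothetical class invented for the demonstration:

```ruby
# In-process analogy only: threads stand in for concurrent workflow runs.
class AccountCreator
  attr_reader :api_calls

  def initialize
    @contract = { external_account_id: nil }
    @lock = Mutex.new         # plays the role of the row-level lock
    @counter_lock = Mutex.new # only protects the call counter
    @api_calls = 0
  end

  def run_without_lock
    create_account_if_missing
  end

  def run_with_lock
    @lock.synchronize { create_account_if_missing }
  end

  private

  def create_account_if_missing
    return unless @contract[:external_account_id].nil?

    sleep 0.05 # simulated API latency widens the race window
    @counter_lock.synchronize { @api_calls += 1 }
    @contract[:external_account_id] = "acct_1"
  end
end

unlocked = AccountCreator.new
4.times.map { Thread.new { unlocked.run_without_lock } }.each(&:join)

locked = AccountCreator.new
4.times.map { Thread.new { locked.run_with_lock } }.each(&:join)

puts "API calls without lock: #{unlocked.api_calls}" # usually more than 1
puts "API calls with lock: #{locked.api_calls}"      # exactly 1
```

Without the lock, every thread can pass the `nil` check before the first one records the account. With the lock, the check and the write happen as one step, so later threads see the recorded account and skip the request.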

Note that we'd usually resort to [advisory locks] (if we're using
Postgres) when dealing with application-level locks, but they are
harder to test under [transactional fixtures]. If you have a backing
object like `contract`, row-level locks should be pretty much
equivalent in functionality.

#### Fixing possible transient failures

Can you spot a seemingly innocuous bug now that we've introduced locks
into our code? 👀

**Answer**: we're scheduling background jobs _before committing the
transaction_, so there's a chance the lookup by
`ledger_transaction_id` would fail within the job because it wouldn't
see uncommitted data. It would, however, likely succeed on the next
try, because the database transaction would have been committed by
then. This is a reminder for us to be careful when mixing background
jobs with database transactions!

Fixing our transient issue is a bit tricky. It requires us to keep
track of successful and failed ledger transaction ids and to move the
scheduling of background jobs further up the stack, outside of any
database transaction:

```ruby
captured_exception = nil
ok_ledger_transaction_ids = []
failed_ledger_transaction_id = nil

contract.with_lock do
  contract.reload

  if contract.external_account_id.nil?
    external_account_id = create_external_account(contract)
    contract.update!(external_account_id: external_account_id)
  end

  contract.fundings.each do |funding|
    if funding.unreleased?
      begin
        ledger_transaction_id = fund_external_account(contract, funding)
        funding.update!(released: true, ledger_transaction_id: ledger_transaction_id)

        ok_ledger_transaction_ids << ledger_transaction_id
      rescue Exception => e
        failed_ledger_transaction_id = ledger_transaction_id

        raise e
      end
    end
  end
rescue Exception => e
  captured_exception = e
end

ok_ledger_transaction_ids.each { |id| PostLedgerEntries.perform_async(id) }

if failed_ledger_transaction_id
  ArchiveLedgerEntries.perform_async(failed_ledger_transaction_id)
end

raise captured_exception if captured_exception
```

> Thank you [Elias Rodrigues] for calling out this bug!

### Use idempotency keys

Some APIs (such as [Stripe]) provide a feature called "Idempotency
keys" to safely retry requests without performing the same side effect
twice. To make it work, we provide a unique key along with the
request. Let's see how we could model our first step, creating the
account, with that in mind:

```ruby
external_account_id = create_external_account(
  contract,
  idempotency_key: "create_account_#{contract.id}"
)
contract.update!(external_account_id: external_account_id)
```

Our idempotency key uses `contract.id` to ensure uniqueness, which
means only a single account would be created if we run the same
workflow multiple times. Does that render our guarding `if` condition
unnecessary? Should we really do that? I don't believe so.

Unfortunately, idempotency keys usually expire, often after about 24
hours (check your API's documentation). Considering that tools like
Sidekiq may attempt retries for around [three weeks after the first
attempt], it's not safe to rely _only_ on an idempotency key. I would
advise having _both the guard conditions and the idempotency keys_ if
possible, for extra safety. That would make the requests safer to use outside of
the workflow.
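
Here's a small sketch of why the key alone isn't enough. `FakeIdempotentApi` is a hypothetical fake that forgets keys after 24 hours (an assumption that mirrors the figure above, not the behavior of any specific API), and time is passed in as plain seconds so the example doesn't depend on the clock:

```ruby
DAY = 24 * 60 * 60

# Hypothetical fake of an API that honors idempotency keys for a limited time.
class FakeIdempotentApi
  KEY_TTL_SECONDS = DAY

  attr_reader :accounts_created

  def initialize
    @responses = {} # idempotency key => [stored_at, response]
    @accounts_created = 0
  end

  # Replays the stored response while the key is fresh; otherwise creates
  # a new account and remembers the response under that key.
  def create_account(idempotency_key:, now:)
    stored_at, response = @responses[idempotency_key]
    return response if stored_at && (now - stored_at) < KEY_TTL_SECONDS

    @accounts_created += 1
    response = "acct_#{@accounts_created}"
    @responses[idempotency_key] = [now, response]
    response
  end
end

# Guard condition plus idempotency key, as advised above.
def ensure_account(contract, api, now:)
  contract[:external_account_id] ||= api.create_account(
    idempotency_key: "create_account_#{contract[:id]}",
    now: now
  )
end

key_only = FakeIdempotentApi.new
key_only.create_account(idempotency_key: "create_account_42", now: 0)
key_only.create_account(idempotency_key: "create_account_42", now: 60)      # replayed
key_only.create_account(idempotency_key: "create_account_42", now: 3 * DAY) # key expired
puts "key only: #{key_only.accounts_created} account(s)"

guarded = FakeIdempotentApi.new
contract = { id: 42, external_account_id: nil }
ensure_account(contract, guarded, now: 0)
ensure_account(contract, guarded, now: 3 * DAY) # skipped by the guard
puts "guard + key: #{guarded.accounts_created} account(s)"
```

The late retry slips past the expired key, while the guard condition catches it because the account is already recorded locally.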

When using idempotency keys, we should make sure to change the key's
value if we ever want to retry our workflow due to a validation error
or another _expected_ error, so be careful! That would require
tracking the key's version with the help of database state, whether it
be Postgres, Redis, etc.:

```ruby
begin
  create_external_account(
    contract,
    idempotency_key: "create_account_#{contract.id}_v#{idempotency_key.version}"
  )
rescue *VALIDATION_ERRORS => e
  idempotency_key.increment_version
  raise e
end
```
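
The `idempotency_key` object above is left undefined, so here's one possible shape for it. This is a sketch under assumptions: `VersionedIdempotencyKey`, `ValidationError`, and `create_account_with_key` are hypothetical names, and the version lives in memory here, whereas a real implementation would persist it so it survives restarts:

```ruby
# Hypothetical in-memory version tracker for an idempotency key.
class VersionedIdempotencyKey
  attr_reader :version

  def initialize(prefix)
    @prefix = prefix
    @version = 1
  end

  def increment_version
    @version += 1
  end

  def to_s
    "#{@prefix}_v#{@version}"
  end
end

# Stand-in for the API client's validation error class.
class ValidationError < StandardError; end

# Bumps the key only for expected errors, so unexpected failures
# (timeouts, outages) are retried with the same key.
def create_account_with_key(key)
  yield key.to_s
rescue ValidationError
  key.increment_version
  raise
end

key = VersionedIdempotencyKey.new("create_account_42")

begin
  create_account_with_key(key) { |_k| raise ValidationError, "invalid name" }
rescue ValidationError
  # Expected error: the next attempt will use a fresh key.
end
puts key # the version was bumped

begin
  create_account_with_key(key) { |_k| raise IOError, "timeout" }
rescue IOError
  # Unexpected error: the next attempt reuses the same key.
end
puts key # the version is unchanged
```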

## How much error handling do you need?

If you're paying attention, you may have noticed a possible point of
failure. In the following section, there's no error handler:

```ruby
ok_ledger_transaction_ids.each { |id| PostLedgerEntries.perform_async(id) }

if failed_ledger_transaction_id
  ArchiveLedgerEntries.perform_async(failed_ledger_transaction_id)
end
```

What if the connection with Redis fails (considering Sidekiq) and
throws an exception? We'd be left with a pending ledger transaction!
Among a few possible solutions, we could record the funding status:

```ruby
funding.update!(
  released: true,
  status: "pending",
  ledger_transaction_id: ledger_transaction_id
)
```

If an error is raised when trying to schedule the jobs, assuming that
a retry would be attempted, we could initialize
`ok_ledger_transaction_ids` with the inconsistent ids:

```ruby
# Using example Rails code
ok_ledger_transaction_ids = contract
  .fundings
  .where(released: true, status: "pending")
  .pluck(:ledger_transaction_id)
```

When the workflow reaches the bottom, it retries scheduling the jobs
that previously failed to be scheduled. Here's the final code:

```ruby
captured_exception = nil

ok_ledger_transaction_ids = contract
  .fundings
  .where(released: true, status: "pending")
  .pluck(:ledger_transaction_id)

failed_ledger_transaction_id = nil

contract.with_lock do
  contract.reload

  if contract.external_account_id.nil?
    external_account_id = create_external_account(contract)
    contract.update!(external_account_id: external_account_id)
  end

  contract.fundings.each do |funding|
    if funding.unreleased?
      begin
        ledger_transaction_id = fund_external_account(contract, funding)
        funding.update!(
          status: "pending",
          released: true,
          ledger_transaction_id: ledger_transaction_id
        )

        ok_ledger_transaction_ids << ledger_transaction_id
      rescue Exception => e
        failed_ledger_transaction_id = ledger_transaction_id

        raise e
      end
    end
  end
rescue Exception => e
  captured_exception = e
end

ok_ledger_transaction_ids.each { |id| PostLedgerEntries.perform_async(id) }

if failed_ledger_transaction_id
  ArchiveLedgerEntries.perform_async(failed_ledger_transaction_id)
end

raise captured_exception if captured_exception
```

> We're assuming that `PostLedgerEntries` will update the funding status
> to `posted` and `ArchiveLedgerEntries` to `archived`.

There could be room for improvement. For instance, would it make sense
to only archive a ledger transaction if `PostLedgerEntries` retries
are exhausted? It depends on our requirements and on how long we can
keep a `pending` funding around.
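
As a rough sketch of that idea in plain Ruby: the `with_retries` helper below is hypothetical and is not Sidekiq's API (Sidekiq exposes a `sidekiq_retries_exhausted` hook for this kind of logic). It retries a block a fixed number of times and only runs the compensation once every attempt has failed:

```ruby
# Hypothetical helper, not Sidekiq's API: retries the block up to `attempts`
# times and calls `on_exhausted` once if every attempt fails.
def with_retries(attempts:, on_exhausted:)
  tries = 0

  begin
    tries += 1
    yield tries
  rescue StandardError => e
    retry if tries < attempts

    on_exhausted.call(e)
    raise
  end
end

# Stand-in for the status of a ledger transaction in the external system.
statuses = { "ltx_1" => "pending" }

begin
  with_retries(
    attempts: 3,
    on_exhausted: ->(_error) { statuses["ltx_1"] = "archived" }
  ) do |attempt|
    raise IOError, "posting failed on attempt #{attempt}"
  end
rescue IOError => e
  puts "#{e.message}; status is now #{statuses.fetch("ltx_1")}"
end
```

With this shape, a funding stays `pending` for as long as posting is still being retried and is archived only as a last resort.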

<aside class="info">
  Be careful with code inside <code>rescue/else/ensure</code> because it's
  not under the error handler!
</aside>

## When to choose a resumable strategy?

Choose a resumable error handling strategy when:

- The workflow is composed of discrete steps that run independently;
- There is loose coupling between the steps;
- The workflow has a reasonable number of steps and API requests to
  deal with;
- You can bear temporary inconsistency;
- There is a retry mechanism in place, such as a background
  processing tool or an external API that retries webhooks on errors.

For workflows running in the background, it is possible to use
websockets to broadcast flash messages and other data if delivering
feedback to the user is essential.

## Conclusion

There is no single answer on how to handle errors, so each workflow
should be analyzed case by case. The goal of this post is to present
you with a few techniques to hopefully inspire your own error handling
strategies, which are especially important in critical workflows.

Make sure that you handle both _expected_ and _unexpected_ errors.
Never assume unexpected errors can't happen. They can and they will
happen, and there are a multitude of possible reasons: API timeouts,
outages, cloud instances restarting, unstable internet connection,
unexpected database constraints kicking in, and so on. When possible,
always [design with failure in mind].

[Sagas pattern]: https://blog.bernd-ruecker.com/saga-how-to-implement-complex-business-transactions-without-two-phase-commit-e00aa41a1b1b
[three weeks after the first attempt]: https://github.com/mperham/sidekiq/wiki/Error-Handling
[Stripe]: https://stripe.com/docs/api/idempotent_requests
[advisory locks]: https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS
[transactional fixtures]: https://relishapp.com/rspec/rspec-rails/docs/transactions
[row-level lock]: https://www.postgresql.org/docs/current/explicit-locking.html#LOCKING-ROWS
[mutex]: https://en.wikipedia.org/wiki/Lock_(computer_science)
[Elias Rodrigues]: https://thoughtbot.com/blog/authors/elias-rodrigues
[design with failure in mind]: https://thoughtbot.com/blog/check-return-values-web
