Handling external API errors: A transactional approach

Error handling and fault tolerance are often neglected aspects of development. How much does it cost to fix errors due to a poorly implemented error handling strategy or a complete lack thereof? How many API integrations are poorly put together, disregarding what can go wrong? How much data do we have to fix due to catastrophic events of cascading errors that could have been prevented with well-thought-out code?

Let’s get down to the basics. This post is about building integrations with any system, third-party or not, over the network. In a previous post, we discussed the resumable error handling strategy and in what situations it can be helpful. Now, let’s discuss the transactional strategy.

When to choose a transactional strategy?

Let’s start with some recommendations to make the distinction between strategies clear. Choose a transactional error handling strategy when:

  • The workflow is composed of steps that need to be committed together - it’s all or nothing;
  • There is tight coupling between the steps;
  • You can’t bear temporary inconsistency;
  • The external service allows undoing or rolling back side effects;
  • The workflow has just a few steps and API requests (this is common but not a hard requirement).

The example

In our example, we’re logging a list of order line items to the Modern Treasury API, where we have a “ledger account” for a “buyer”. Logging a line item creates a ledger transaction object from the buyer’s ledger account to the vendor’s ledger account.

Let’s imagine the following code:

def log_order
  total_amount = order_line_items.sum(&:amount)

  if has_enough_funds?(total_amount) # Issues a synchronous HTTP request
    order_line_items.each do |line_item|
      result = log_line_item(line_item) # Issues a synchronous HTTP request
      line_item.update!( # Local database update
        external_transaction_id: result.transaction_id
      )
    end

    :ok
  else
    :not_enough_funds
  end
end

This code is a rough draft of what we need to do, with no regard for an error-handling strategy. It’s not unusual to have multiple types of API requests in a transactional workflow, but for simplicity’s sake we’re using a single API request inside a loop.

Transactional or resumable?

When designing client API code, the first question to ask is: should it be transactional or resumable? The answer comes from the concept being modeled. Can we partially log an order, with only some of its line items? No. We can’t afford temporary inconsistency: at any point in time, a customer looking at their order must see either all of its line items or none of them. It’s all or nothing. In resumable workflows, by contrast, temporary inconsistency is bearable, and eventual consistency is reached through retries, however many the worst case takes.

ACID database transaction

For our code to be transactional, we must wrap our database commands in an ACID transaction, since the code sends UPDATE statements to the underlying database connection. We want to guarantee that all database commands are rolled back if something goes wrong.

def log_order
  # If the Ruby code raises an exception, the database
  # issues a `ROLLBACK` statement
  ApplicationRecord.transaction do
    # Code goes here 
  end
end

Database transactions are not a solution to all data consistency problems, but they are the proper solution to our “all or nothing” use case.

Are we done yet? No! ACID transactions are only concerned with local database commands. We must still roll back the external state created by our HTTP requests.

External transactions

The Saga pattern describes how to roll back external transactions. Most Google results for “saga pattern” will mention event-based microservices communicating through message brokers, which is not the case here. We’re referring to any code that interacts with external APIs.

The core concept, however, still applies: if the transaction orchestrator (our code) detects an error condition, compensating HTTP requests must be emitted to undo the changes made by the preceding HTTP requests. Let’s apply this improvement to our code:

def log_order
  ApplicationRecord.transaction do
    total_amount = order_line_items.sum(&:amount)

    if has_enough_funds?(total_amount)
      begin
        order_line_items.each do |line_item|
          result = log_line_item(line_item) # Issues a synchronous HTTP request
          line_item.update!(external_transaction_id: result.transaction_id)
        end
      rescue => e
        order_line_items.each do |line_item|
          if line_item.external_transaction_id.present?
            rollback_line_item(line_item.external_transaction_id)
          end
        end

        raise e
      end

      :ok
    else
      :not_enough_funds
    end
  end
end

We’ve introduced a method call, rollback_line_item, to roll back the line items logged so far when we encounter an error condition. For simplicity’s sake, the loop that logs the line items assumes that every failure raises an exception, so there’s no specific error condition to check other than rescuing exceptions. That’s a significant first step, but we must be mindful of API semantics, which leads us to our next topic.
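The compensation idea does not depend on Rails or on any particular API. Here is a minimal sketch in plain Ruby, assuming a hypothetical `run_saga` helper (not part of the code above or of any library) that receives each step as a pair of lambdas: an action and its compensation. An in-memory array stands in for the HTTP requests.

```ruby
# Hypothetical helper: runs each step's action in order. If one raises,
# it runs the compensations of the steps that already succeeded, in
# reverse order, and then re-raises the original error.
def run_saga(steps)
  completed = []

  steps.each do |step|
    step.fetch(:action).call
    completed << step
  end

  :ok
rescue StandardError
  completed.reverse_each { |step| step.fetch(:compensate).call }
  raise
end

# Usage: the third step fails, so the first two are compensated.
log = []
steps = [
  { action: -> { log << "log item 1" }, compensate: -> { log << "undo item 1" } },
  { action: -> { log << "log item 2" }, compensate: -> { log << "undo item 2" } },
  { action: -> { raise "API timeout" }, compensate: -> { log << "undo item 3" } }
]

begin
  run_saga(steps)
rescue RuntimeError => e
  log << "failed: #{e.message}"
end

puts log
```

The step that failed is not compensated, only the ones that completed before it. This sketch also assumes the compensations themselves succeed, which is not guaranteed over a network.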

Designing external rollbacks

What should the implementation of rollback_line_item look like? That depends on our API features, which should be carefully assessed. In Modern Treasury, we can’t delete a ledger transaction, but we can archive it, which seems like an excellent way to revert our operation and make its implementation more robust.

To roll back a ledger transaction, we must ensure it’s created in a pending state because posted ones are immutable and can’t be rolled back. Also, we need to add a commit step that will move pending ledger transactions to posted. Let’s change our orchestrator code, renaming log_line_item to log_pending_line_item and adding a commit step:

def log_order
  ApplicationRecord.transaction do
    unless all_logged?(order_line_items)
      total_amount = order_line_items.sum(&:amount)

      if has_enough_funds?(total_amount)
        begin
          order_line_items.each do |line_item|
            result = log_pending_line_item(line_item) # Issues a synchronous HTTP request
            line_item.update!(external_transaction_id: result.transaction_id)
          end
        rescue => e
          # ...
        end
      else
        return :not_enough_funds
      end
    end
  end

  # Commit step here
  order_line_items.each do |line_item|
    commit_line_item(line_item) # Issues a synchronous HTTP request
  end

  :ok
end

The commit step should run after all ledger transactions are logged, outside the ACID transaction. We can’t commit ledger transactions as they are logged because the earlier ones could no longer be rolled back if the current one results in an error.
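To see why the ordering matters, here is a small sketch with an in-memory stand-in for the ledger. `FakeLedger` and its methods are hypothetical and only model the behavior described above (pending transactions can be posted or archived, posted ones are immutable); they are not the Modern Treasury client.

```ruby
# In-memory stand-in for a ledger API (hypothetical, for illustration only).
class FakeLedger
  ImmutableError = Class.new(StandardError)

  attr_reader :statuses

  def initialize
    @statuses = {}
    @next_id = 0
  end

  def create_pending
    id = (@next_id += 1)
    @statuses[id] = :pending
    id
  end

  def post(id)
    transition(id, :posted)
  end

  def archive(id)
    transition(id, :archived)
  end

  private

  # Only pending transactions are allowed to change state.
  def transition(id, to)
    unless @statuses[id] == :pending
      raise ImmutableError, "transaction #{id} is #{@statuses[id]}"
    end

    @statuses[id] = to
  end
end

ledger = FakeLedger.new

# Committing as we go: once posted, the transaction can't be rolled back.
first = ledger.create_pending
ledger.post(first)
begin
  ledger.archive(first)
rescue FakeLedger::ImmutableError => e
  puts "cannot roll back: #{e.message}"
end

# Deferring the commit: everything stays pending, so everything can be archived.
pending_ids = [ledger.create_pending, ledger.create_pending]
pending_ids.each { |id| ledger.archive(id) }
p ledger.statuses
```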

Also, we added an unless all_logged?(order_line_items) check for idempotency’s sake to avoid double logging.
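The check itself isn’t shown here. One plausible sketch, assuming an order counts as logged once every line item carries the ID of its external ledger transaction (the `LineItem` struct below is a plain-Ruby stand-in for the real model):

```ruby
# Plain-Ruby stand-in for the line item model (hypothetical).
LineItem = Struct.new(:amount, :external_transaction_id)

# Hypothetical implementation: the order is considered logged only when
# every line item has an external ledger transaction ID.
def all_logged?(line_items)
  line_items.all? { |item| !item.external_transaction_id.nil? }
end

fresh  = [LineItem.new(500, nil), LineItem.new(250, nil)]
logged = [LineItem.new(500, "lt_1"), LineItem.new(250, "lt_2")]

puts all_logged?(fresh)
puts all_logged?(logged)
```

Because the local updates run inside the ACID transaction, a failed run should leave no IDs behind, so under this assumption the check sees either none of the line items logged or all of them.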

Requirements will vary from API to API, so the main takeaway is to look up your API docs and carefully plan your implementation with error handling and fault tolerance in mind.

Handling concurrency

There’s a critical section in our code that is subject to race conditions. Note the following if condition:

  if has_enough_funds?(total_amount) # Issues a synchronous HTTP request
    # ...
  end

Let’s assume our buyer has a $10 balance. What if a web request that spends the full $10 is issued twice, and both make it to the if condition simultaneously? Yes, both would resolve to truthy and run the same code. The user would be spending what they don’t have – $20 instead of $10 – which would result in a negative balance of -$10.
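The interleaving can be replayed deterministically in a few lines of plain Ruby. This is only an illustration of the check-then-act problem; `replay_double_spend` is a hypothetical helper and no real requests are involved.

```ruby
# Replays the problematic interleaving: both requests run the balance
# check before either of them spends.
def replay_double_spend(starting_balance, amount)
  balance = starting_balance

  request_a_ok = balance >= amount # request A checks the balance
  request_b_ok = balance >= amount # request B checks it too, before A spends

  balance -= amount if request_a_ok # A spends
  balance -= amount if request_b_ok # B spends as well

  balance
end

puts replay_double_spend(10, 10) # prints -10
```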

The first question to ask is: does my API have concurrency handling features? In the case of Modern Treasury, the answer is yes. When logging a ledger transaction, we can submit balance check parameters that state what the balance should be after the operation. The API then simulates the operation and returns an error code if the resulting balance differs from the one provided; otherwise, it goes ahead and performs the operation. We can send the following parameters along with our JSON payload:

{
  "available_balance": {
    "eq": WHAT_THE_BALANCE_SHOULD_BE_AFTER_THE_OPERATION
  }
}

This feature makes our has_enough_funds? check redundant because the balance is now checked implicitly when each ledger transaction is logged. If we raise an exception when the balance check fails, our code already knows how to roll back. Therefore, our code can be simplified, and we can also rescue the specific exception to return the appropriate error condition:

rescue InsufficientFundsError
  return :not_enough_funds
end
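What makes this safe is that the check and the write happen as one atomic operation on the API side. The sketch below imitates that with an in-memory account guarded by a Mutex. `FakeLedgerAccount` and its `debit` method are hypothetical stand-ins, not the Modern Treasury client.

```ruby
class InsufficientFundsError < StandardError; end

# In-memory stand-in for an API that verifies the resulting balance and
# applies the debit atomically (hypothetical, for illustration only).
class FakeLedgerAccount
  attr_reader :balance

  def initialize(balance)
    @balance = balance
    @mutex = Mutex.new
  end

  # Debits `amount` only if the resulting balance equals `expected_after`.
  def debit(amount, expected_after:)
    @mutex.synchronize do
      after = @balance - amount
      unless after == expected_after
        raise InsufficientFundsError, "expected #{expected_after}, got #{after}"
      end

      @balance = after
    end
  end
end

account = FakeLedgerAccount.new(10)

# Two concurrent requests each try to spend the full $10.
results = 2.times.map do
  Thread.new do
    account.debit(10, expected_after: 0)
    :ok
  rescue InsufficientFundsError
    :not_enough_funds
  end
end.map(&:value)

p results.sort
p account.balance
```

Whichever thread gets there first succeeds, and the other one is rejected, so the balance never goes negative.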

If the API doesn’t provide concurrency features, a possible solution is to use row-level locking to throttle concurrency:

order_line_items.sort.first.with_lock do
  # All code goes here
end

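order_line_items.sort.first.with_lock would replace ApplicationRecord.transaction, as it also opens a database transaction but additionally acquires a row-level lock on the record. Sorting the line items makes concurrent requests for the same order contend for the same row.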

Making commit and rollback fault-tolerant

The API portion of our code now has commit and rollback steps, but they are unreliable. Be mindful that any code can fail, especially when making network calls. If either commit or rollback fails, our data would be inconsistent, and rerunning the code wouldn’t correct it. Designing commits and rollbacks as units that can be independently retried solves our problem, so let’s offload both steps to background jobs.

def log_order
  failed_ledger_transaction_ids = []
  result = :ok
  error = nil # Declared here so it is visible outside the transaction block

  ApplicationRecord.transaction do
    unless all_logged?(order_line_items)
      # ...

      begin
        order_line_items.each do |line_item|
          # Issues a synchronous HTTP request
          response = log_pending_line_item(line_item)
          line_item.update!(external_transaction_id: response.transaction_id)
        end
      rescue => e
        # Collect the IDs of the ledger transactions created so far
        order_line_items.each do |line_item|
          if line_item.external_transaction_id.present?
            failed_ledger_transaction_ids << line_item.external_transaction_id
          end
        end

        if e.is_a?(InsufficientFundsError)
          result = :not_enough_funds
        else
          result = :error
          error = e
        end

        # Roll back the local updates without propagating an exception
        raise ActiveRecord::Rollback
      end
    end
  end

  if result == :ok
    # Commit step here
    order_line_items.each do |line_item|
      # Issues an asynchronous HTTP request
      commit_line_item_async(line_item)
    end
  else
    # Rollback step here
    failed_ledger_transaction_ids.each do |external_ledger_transaction_id|
      # Issues an asynchronous HTTP request
      rollback_line_item_async(external_ledger_transaction_id)
    end
  end

  raise error if error

  result
end

That makes our code more reliable, since most failures in either step are likely to be transient and would succeed within a few retries. We’re assuming, of course, that the background job system has retries built in.
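For readers without such a system at hand, the retry behavior being assumed looks roughly like the sketch below. `with_retries` and `TransientError` are hypothetical names; a real background job framework provides its own retry mechanism, usually with backoff.

```ruby
class TransientError < StandardError; end

# Hypothetical helper imitating a job framework's retry mechanism: re-runs
# the block on transient errors, up to `max_attempts`, with exponential
# backoff between attempts.
def with_retries(max_attempts:, base_delay: 0.0)
  attempts = 0

  begin
    attempts += 1
    yield attempts
  rescue TransientError
    raise if attempts >= max_attempts

    sleep(base_delay * (2**(attempts - 1)))
    retry
  end
end

# Simulated rollback request that times out twice before succeeding.
calls = 0
outcome = with_retries(max_attempts: 5) do |attempt|
  calls += 1
  raise TransientError, "timeout" if attempt < 3

  :archived
end

puts outcome
puts calls
```

Retrying is only safe if the commit and rollback requests are idempotent. For instance, a rollback job should treat an already archived ledger transaction as a success rather than an error.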

Takeaways

Working with external APIs takes a lot of work. The more critical our workflow is, the more important it is to have a solid error handling/fault tolerance strategy.

  • Pay close attention to the concept being modeled to decide whether to go with a transactional or resumable error handling strategy;
  • The transactional strategy requires the main API calls to be synchronous because we need their outcome to decide whether to commit or roll back everything;
  • How we design and implement external rollbacks will always depend on particular API features and semantics;
  • Always have at least a rollback step for external API interactions. The commit step may sometimes be implicit and will depend on API semantics;
  • External API commit and rollback steps can generally be offloaded to background jobs for increased fault tolerance;
  • Idempotency is important for critical workflows;
  • Properly handling and limiting concurrency is also important;
  • See if your API provides features to handle concurrency; otherwise, try local solutions such as row-level locks or advisory locks.