Error handling and fault tolerance are often neglected aspects of development. How much does it cost to fix errors due to a poorly implemented error handling strategy or a complete lack thereof? How many API integrations are poorly put together, disregarding what can go wrong? How much data do we have to fix due to catastrophic events of cascading errors that could have been prevented with well-thought-out code?
Let’s get down to the basics. This post is about building integrations with any system, third-party or not, over the network. In a previous post, we discussed the resumable error handling strategy and in what situations it can be helpful. Now, let’s discuss the transactional strategy.
When to choose a transactional strategy?
Let’s start with some recommendations to make the distinction between strategies clear. Choose a transactional error handling strategy when:
- The workflow is composed of steps that need to be committed together - it’s all or nothing;
- There is tight coupling between the steps;
- You can’t bear temporary inconsistency;
- The external service allows undoing or rolling back side effects;
- The workflow has just a few steps and API requests (this is common but not a hard requirement).
The example
In our example, we’re logging a list of order line items to the Modern Treasury API, where we have a “ledger account” for a “buyer”. Logging a line item creates a ledger transaction object from the buyer’s ledger account to the vendor’s ledger account.
Let’s imagine the following code:
def log_order
  total_amount = order_line_items.sum(&:amount)
  if has_enough_funds?(total_amount) # Issues a synchronous HTTP request
    order_line_items.each do |line_item|
      result = log_line_item(line_item) # Issues a synchronous HTTP request
      line_item.update!( # Local database update
        external_transaction_id: result.transaction_id
      )
    end
    :ok
  else
    :not_enough_funds
  end
end
This code is a rough draft of what we need to do, with no regard for an error-handling strategy. It’s not unusual to have multiple types of API requests in a transactional workflow, but for simplicity’s sake we’re using a single API request inside a loop.
Transactional or resumable?
When designing client API code, the first question to ask is “should it be transactional or resumable?” The concept being modeled usually answers it. Can we partially log an order, with only some of its line items? The answer is no. We can’t afford temporary inconsistency because the customer should never look at their order and see only part of its line items, at any point in time. It’s all or nothing. In resumable workflows, however, temporary inconsistency is bearable, and eventual consistency is reached through multiple retries in the worst-case scenario.
ACID database transaction
For our code to be transactional, we must wrap our database commands in an ACID transaction, since the code sends UPDATE statements to the underlying database connection. We want to guarantee all database commands are rolled back if something goes wrong.
def log_order
  # If the Ruby code raises an exception, the database
  # issues a `ROLLBACK` statement
  ApplicationRecord.transaction do
    # Code goes here
  end
end
Database transactions are not a solution to all data consistency problems, but they are the proper solution to our “all or nothing” use case.
Are we done yet? No! ACID transactions are only concerned with local database commands. We must still roll back the external state created by HTTP requests.
External transactions
The Sagas pattern describes how to roll back external transactions. Most Google results for “sagas pattern” will mention event-based microservices communicating through message brokers, which is not the case here. We’re referring to any code that interacts with external APIs.
The core concept, however, still applies: if the transaction orchestrator (our code) detects an error condition, compensating HTTP requests must be emitted to undo the changes made by the preceding HTTP requests. Let’s apply this improvement to our code:
def log_order
  ApplicationRecord.transaction do
    total_amount = order_line_items.sum(&:amount)
    if has_enough_funds?(total_amount)
      begin
        order_line_items.each do |line_item|
          result = log_line_item(line_item) # Issues a synchronous HTTP request
          line_item.update!(external_transaction_id: result.transaction_id)
        end
      rescue => e
        order_line_items.each do |line_item|
          if line_item.external_transaction_id.present?
            rollback_line_item(line_item.external_transaction_id)
          end
        end
        raise e
      end
      :ok
    else
      :not_enough_funds
    end
  end
end
We’ve introduced a method call, rollback_line_item, to roll back the line items logged so far when encountering an error condition. For simplicity’s sake, we assume the calls in the logging loop raise on failure, so there’s no specific error condition to check other than rescuing exceptions. That’s a significant first step, but we must be mindful of API semantics, leading us to our next topic.
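To make the compensation idea concrete outside of Rails, here is a minimal, self-contained sketch. FakeLedgerApi and log_all_or_nothing are hypothetical names invented for this illustration (an in-memory stand-in, not the Modern Treasury client), and the failure is simulated.

```ruby
# In-memory stand-in for the remote API (hypothetical, for illustration only).
class FakeLedgerApi
  attr_reader :transactions

  def initialize(fail_on: nil)
    @fail_on = fail_on # amount that triggers a simulated failure
    @transactions = {} # transaction id => amount
    @next_id = 0
  end

  def log_line_item(amount)
    raise "simulated network failure" if amount == @fail_on

    id = "txn_#{@next_id += 1}"
    @transactions[id] = amount
    id
  end

  def rollback_line_item(id)
    @transactions.delete(id)
  end
end

# Logs every amount. On any error, compensates the requests that already
# succeeded (newest first) and re-raises so the caller can roll back locally.
def log_all_or_nothing(api, amounts)
  logged_ids = []
  amounts.each { |amount| logged_ids << api.log_line_item(amount) }
  logged_ids
rescue StandardError
  logged_ids.reverse_each { |id| api.rollback_line_item(id) }
  raise
end

api = FakeLedgerApi.new(fail_on: 30)
begin
  log_all_or_nothing(api, [10, 20, 30])
rescue RuntimeError => e
  puts "rolled back after: #{e.message}"
end
puts api.transactions.size # no ledger transactions are left behind
```

Re-raising after compensating matters: it is what lets an enclosing database transaction roll back the local changes too.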
Designing external rollbacks
What should the implementation of rollback_line_item look like? That depends on our API features, which should be carefully assessed. In Modern Treasury, we can’t delete a ledger transaction, but we can archive it, which seems like an excellent way to revert our operation and make its implementation more robust.
To roll back a ledger transaction, we must ensure it’s created in a pending state because posted ones are immutable and can’t be rolled back. Also, we need to add a commit step that will move pending ledger transactions to posted. Let’s change our orchestrator code, renaming log_line_item to log_pending_line_item and adding a commit step:
def log_order
  ApplicationRecord.transaction do
    unless all_logged?(order_line_items)
      total_amount = order_line_items.sum(&:amount)
      if has_enough_funds?(total_amount)
        begin
          order_line_items.each do |line_item|
            result = log_pending_line_item(line_item) # Issues a synchronous HTTP request
            line_item.update!(external_transaction_id: result.transaction_id)
          end
        rescue => e
          # ...
        end
      else
        return :not_enough_funds
      end
    end
  end
  # Commit step here
  order_line_items.each do |line_item|
    commit_line_item(line_item) # Issues a synchronous HTTP request
  end
  :ok
end
The commit step should run after all ledger transactions are logged, outside the ACID transaction. We can’t commit ledger transactions as they are logged because earlier ones could no longer be rolled back if the current one results in an error.
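The reason for the two phases is easier to see with a small model of the lifecycle. The sketch below is a hypothetical in-memory state machine, not the real API: FakeLedgerTransaction, post!, and archive! are names made up here, and treating archived as final is an assumption of the model.

```ruby
class ImmutableTransactionError < StandardError; end

# Hypothetical model: pending transactions can be posted or archived,
# and both of those states are treated as final.
class FakeLedgerTransaction
  attr_reader :status

  def initialize
    @status = :pending
  end

  def post!
    transition_to(:posted)
  end

  def archive!
    transition_to(:archived)
  end

  private

  def transition_to(new_status)
    return if @status == new_status # repeating the same call is a no-op
    raise ImmutableTransactionError, "#{@status} is final" unless @status == :pending

    @status = new_status
  end
end

# Committing as we go: the transaction can no longer be undone.
eager = FakeLedgerTransaction.new
eager.post!
begin
  eager.archive!
rescue ImmutableTransactionError => e
  puts "cannot roll back: #{e.message}"
end

# Two phases: everything stays pending until the whole batch is logged.
batch = Array.new(3) { FakeLedgerTransaction.new }
batch.each(&:post!)
puts batch.map(&:status).inspect
```

Making a repeated post! or archive! a no-op is a design choice in this sketch. It keeps both steps safe to run more than once.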
Also, we added an unless all_logged?(order_line_items) check for idempotency’s sake to avoid double logging.
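The post doesn’t show all_logged?, so the following is only one plausible shape for it, assuming a line item counts as logged once it has an external transaction id. LineItem is a plain Struct standing in for the ActiveRecord model.

```ruby
# Stand-in for the ActiveRecord model (hypothetical, for illustration only).
LineItem = Struct.new(:amount, :external_transaction_id)

# True only when every line item already points at a ledger transaction.
# Note that an empty list also counts as fully logged.
def all_logged?(line_items)
  line_items.all? { |item| !item.external_transaction_id.to_s.empty? }
end

fresh  = [LineItem.new(10, nil), LineItem.new(20, nil)]
logged = [LineItem.new(10, "txn_1"), LineItem.new(20, "txn_2")]

puts all_logged?(fresh)
puts all_logged?(logged)
```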
Requirements will vary from API to API, so the main takeaway is to look up your API docs and carefully plan your implementation with error handling and fault tolerance in mind.
Handling concurrency
There’s a critical path in our code subject to race conditions. Note the following if condition:
if has_enough_funds?(total_amount) # Issues a synchronous HTTP request
  # ...
end
Let’s assume our buyer has a $10 balance. What if a web request that spends the full $10 is issued twice, and both make it to the if condition simultaneously? Both checks would evaluate to true and run the same code. The user would spend $20 instead of $10, money they don’t have, leaving a balance of -$10.
The first question to ask is “does my API have concurrency handling features?” In the case of Modern Treasury, the answer is yes. When logging a ledger transaction, we can submit balance check parameters to lock on what the balance should be after the operation. With that, the API simulates the operation and returns an error code if the resulting balance differs from the one provided; otherwise, it goes ahead and performs the operation. We can send the following parameters along with our JSON payload:
{
  "available_balance": {
    "eq": WHAT_THE_BALANCE_SHOULD_BE_AFTER_THE_OPERATION
  }
}
This feature renders our has_enough_funds? check redundant because the balance is now checked implicitly when logging each ledger transaction. If we raise an exception when the balance check fails, our code already knows how to roll back. Therefore, our code can be simplified, and we can also rescue the specific exception to return the appropriate error condition:
rescue InsufficientFundsError
  return :not_enough_funds
end
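To see how a balance lock defuses the race, here is a small simulation. FakeLedgerAccount and its log_transaction method are hypothetical, in-memory stand-ins that only mimic the eq check described above. The sketch also assumes each request reads the balance first in order to compute the value it sends.

```ruby
class InsufficientFundsError < StandardError; end

# Hypothetical in-memory account: a write only succeeds if the resulting
# balance equals the value the caller locked on.
class FakeLedgerAccount
  attr_reader :available_balance

  def initialize(balance)
    @available_balance = balance
  end

  def log_transaction(amount:, available_balance_eq:)
    after_balance = @available_balance - amount
    unless after_balance == available_balance_eq
      raise InsufficientFundsError, "balance lock failed"
    end

    @available_balance = after_balance
  end
end

account = FakeLedgerAccount.new(10)

# Both requests read the balance before either of them writes.
balance_seen_by_a = account.available_balance
balance_seen_by_b = account.available_balance

# Request A spends the full $10 and expects a balance of 0 afterwards.
account.log_transaction(amount: 10, available_balance_eq: balance_seen_by_a - 10)

# Request B makes the same claim, but the balance would now end up at -10.
outcome_b =
  begin
    account.log_transaction(amount: 10, available_balance_eq: balance_seen_by_b - 10)
    :ok
  rescue InsufficientFundsError
    :not_enough_funds
  end

puts outcome_b
puts account.available_balance
```

The second request is rejected by the check itself, so no separate has_enough_funds? call is needed to keep the balance from going negative.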
If the API doesn’t provide concurrency features, a possible solution is to use row-level locking to throttle concurrency:
order_line_items.sort.first.with_lock do
  # All code goes here
end
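order_line_items.sort.first.with_lock would replace ApplicationRecord.transaction, as it also opens a database transaction but acquires a row-level lock on top. Sorting first means every request for the order locks the same row. Note that locking a line item only serializes requests for the same order; to protect a balance shared across orders, the locked row has to be one that all those requests share.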
Making commit and rollback fault-tolerant
The API portion of our code now has commit and rollback steps, but they are unreliable. Be mindful that any code can fail, especially when making network calls. If either commit or rollback fails, our data would be inconsistent, and rerunning the code wouldn’t correct it. Designing commits and rollbacks as units that can be independently retried solves our problem, so let’s offload both steps to background jobs.
def log_order
  failed_ledger_transaction_ids = []
  result = :ok
  error = nil
  ApplicationRecord.transaction do
    unless all_logged?(order_line_items)
      # ...
      begin
        order_line_items.each do |line_item|
          # Issues a synchronous HTTP request
          api_result = log_pending_line_item(line_item)
          line_item.update!(external_transaction_id: api_result.transaction_id)
        end
      rescue => e
        order_line_items.each do |line_item|
          if line_item.external_transaction_id.present?
            failed_ledger_transaction_ids << line_item.external_transaction_id
          end
        end
        error = e
        result = e.is_a?(InsufficientFundsError) ? :not_enough_funds : :error
        # Undo the local updates; this exception is swallowed by the
        # transaction block and does not propagate further
        raise ActiveRecord::Rollback
      end
    end
  end
  if result == :ok
    # Commit step here
    order_line_items.each do |line_item|
      # Issues an asynchronous HTTP request
      commit_line_item_async(line_item)
    end
  else
    # Rollback step here
    failed_ledger_transaction_ids.each do |external_ledger_transaction_id|
      # Issues an asynchronous HTTP request
      rollback_line_item_async(external_ledger_transaction_id)
    end
  end
  raise error if result == :error
  result
end
That makes our code more reliable, given that most failures in both steps are likely transient errors that would succeed within a few retries. This assumes, of course, that the background job system has retries baked in.
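For readers whose job system doesn’t retry on its own, the behavior being assumed looks roughly like the sketch below. with_retries and TransientError are hypothetical names for this illustration, and the flaky block simulates a commit request that times out twice.

```ruby
class TransientError < StandardError; end

# Retries the block on TransientError, doubling the wait each time.
# Re-raises once max_attempts is exhausted.
def with_retries(max_attempts: 5, base_delay: 0.01)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue TransientError
    raise if attempts >= max_attempts

    sleep(base_delay * (2**(attempts - 1)))
    retry
  end
end

# Stand-in for a commit request that fails twice before succeeding.
calls = 0
status = with_retries do |attempt|
  calls += 1
  raise TransientError, "timeout" if attempt < 3

  :posted
end

puts "#{status} after #{calls} calls"
```

Retrying is only safe when the retried step is idempotent: committing or archiving the same ledger transaction twice must leave it in the same state as doing it once.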
Takeaways
Working with external APIs takes a lot of work. The more critical our workflow is, the more important it is to have a solid error handling/fault tolerance strategy.
- Pay close attention to the concept being modeled to decide whether to go with a transactional or resumable error handling strategy;
- The transactional strategy requires API calls to be synchronous because we need to decide whether to commit or roll back everything;
- How we design and implement external rollbacks will always depend on particular API features and semantics;
- Always have at least a rollback step for external API interactions. The commit step may sometimes be implicit and will depend on API semantics;
- External API commit and rollback steps can generally be offloaded to background jobs for increased fault tolerance;
- Idempotency is important for critical workflows;
- Properly handling and limiting concurrency is also important;
- See if your API provides features to handle concurrency; otherwise, try local solutions such as row-level locks or advisory locks.