---
title: 'Handling external API errors: A transactional approach'
teaser: 'Clarifying a few possible ways to implement transactional workflows when working with external APIs.'
tags: error handling,api,distributed systems,fault tolerance,postgresql
author: Thiago Araújo Silva
published_on: 2024-01-23
---

Error handling and fault tolerance are often neglected aspects of development. How much does it cost to fix errors due to a poorly implemented error handling strategy or a complete lack thereof? How many API integrations are poorly put together, disregarding what can go wrong? How much data do we have to fix due to catastrophic events of cascading errors that could have been prevented with well-thought-out code?

Let's get down to the basics. This post is about building integrations with any system, third-party or not, over the network.

In a previous post, we discussed the [resumable error handling strategy] and in [what situations it can be helpful]. Now, let's discuss the transactional strategy.

## When to choose a transactional strategy?

Let's start with some recommendations to make the distinction between strategies clear. Choose a transactional error handling strategy when:

- The workflow is composed of steps that need to be committed together - it's all or nothing;
- There is tight coupling between the steps;
- You can't bear temporary inconsistency;
- The external service allows undoing or rolling back side effects;
- The workflow has just a few steps and API requests (this is common but not a hard requirement).

## The example

In our example, we're logging a list of order line items to the [Modern Treasury API], where we have a "ledger account" for a "buyer". Logging a line item creates a ledger transaction object from the buyer's ledger account to the vendor's ledger account.

Let's imagine the following code:

```rb
def log_order
  total_amount = order_line_items.sum(&:amount)

  if has_enough_funds?(total_amount) # Issues a synchronous HTTP request
    order_line_items.each do |line_item|
      result = log_line_item(line_item) # Issues a synchronous HTTP request

      line_item.update!( # Local database update
        external_transaction_id: result.transaction_id
      )
    end

    :ok
  else
    :not_enough_funds
  end
end
```

This code is a rough draft of what we need to do, with no regard for an error-handling strategy. It's not unusual to have multiple types of API requests in a transactional workflow, but for simplicity's sake we're using a single API request inside a loop.

## Transactional or resumable?

When designing client API code, the first question to ask is "should it be transactional or resumable?" That can be determined by looking at the concept being modeled.

Can we _partially_ log an order with one or more line items? The answer is _no_. We can't afford temporary inconsistency: at any point in time, a customer who looks at their order must see all of its line items. It's all or nothing.

In resumable workflows, however, temporary inconsistency is bearable, and eventual consistency is reached through multiple retries in the worst-case scenario.

## ACID database transaction

For our code to be transactional, we must wrap our database commands in an [ACID transaction], because the code sends `UPDATE` statements to the underlying database connection. We want to guarantee all database commands are rolled back if something goes wrong.

```rb
def log_order
  # If the Ruby code raises an exception, the database
  # issues a `ROLLBACK` statement
  ApplicationRecord.transaction do
    # Code goes here
  end
end
```

Database transactions are not a solution to all data consistency problems, but they are the proper solution to our "all or nothing" use case.

Are we done yet? No! ACID transactions are only concerned with local database commands.
We must still roll back the _external_ state from HTTP requests.

## External transactions

The [Sagas pattern] describes how to roll back external transactions. Most Google results for "sagas pattern" will mention event-based microservices communicating through message brokers, which is not the case here. We're referring to any code that interacts with external APIs.

The core concept, however, still applies: if the transaction orchestrator (our code) detects an error condition, compensating HTTP requests must be emitted to undo the changes made by the preceding HTTP requests. Let's apply this improvement to our code:

```rb
def log_order
  ApplicationRecord.transaction do
    total_amount = order_line_items.sum(&:amount)

    if has_enough_funds?(total_amount)
      begin
        order_line_items.each do |line_item|
          result = log_line_item(line_item) # Issues a synchronous HTTP request
          line_item.update!(external_transaction_id: result.transaction_id)
        end
      rescue => e
        # Compensate: undo every line item that was logged before the error
        order_line_items.each do |line_item|
          if line_item.external_transaction_id.present?
            rollback_line_item(line_item.external_transaction_id)
          end
        end

        # Re-raise so the database transaction is rolled back too
        raise e
      end

      :ok
    else
      :not_enough_funds
    end
  end
end
```

We've introduced a method call, `rollback_line_item`, to roll back the line items logged so far when encountering an error condition. For simplicity's sake, the loop that logs the line items assumes each call either succeeds or raises, so there's no specific error condition to check other than rescuing exceptions.

That's a significant first step, but we must be mindful of API semantics, leading us to our next topic.

### Designing external rollbacks

What should the implementation of `rollback_line_item` look like? That depends on our API features, which should be carefully assessed.

In Modern Treasury, we can't delete a ledger transaction, but we can _archive_ it, which seems like an excellent way to revert our operation and make its implementation more robust.
To roll back a ledger transaction, we must ensure it's created in a `pending` state because `posted` ones are immutable and can't be rolled back. Also, we need to add a commit step that will move `pending` ledger transactions to `posted`.

Let's change our orchestrator code, renaming `log_line_item` to `log_pending_line_item` and adding a commit step:

```rb
def log_order
  ApplicationRecord.transaction do
    unless all_logged?(order_line_items)
      total_amount = order_line_items.sum(&:amount)

      if has_enough_funds?(total_amount)
        begin
          order_line_items.each do |line_item|
            result = log_pending_line_item(line_item) # Issues a synchronous HTTP request
            line_item.update!(external_transaction_id: result.transaction_id)
          end
        rescue => e
          # ...
        end
      else
        return :not_enough_funds
      end
    end
  end

  # Commit step here
  order_line_items.each do |line_item|
    commit_line_item(line_item) # Issues a synchronous HTTP request
  end

  :ok
end
```

The commit step should run after all ledger transactions are logged, outside the ACID transaction. We can't commit ledger transactions as they are logged because earlier ones could no longer be rolled back if the current one results in an error.

Also, we added an `unless all_logged?(order_line_items)` check for idempotency's sake to avoid double logging.

Requirements will vary from API to API, so the main takeaway is to look up your API docs and carefully plan your implementation with error handling and fault tolerance in mind.

## Handling concurrency

There's a critical path in our code subject to race conditions. Note the following `if` condition:

```rb
if has_enough_funds?(total_amount) # Issues a synchronous HTTP request
  # ...
end
```

Let's assume our buyer has a $10 balance. What if a web request that spends the full $10 is issued twice, and both make it to the `if` condition simultaneously? Both checks would pass, and both requests would run the same code. The user would be spending what they don't have -- $20 instead of $10 -- which would result in a negative balance of -$10.

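
The interleaving described above can be reproduced deterministically with an in-memory stand-in. This is only an illustrative sketch: `FakeLedger` and its methods are hypothetical names invented for the example and don't correspond to any real API client.

```ruby
# Hypothetical in-memory stand-in for the remote ledger (not a real API client)
class FakeLedger
  attr_reader :balance

  def initialize(balance)
    @balance = balance
  end

  # Mirrors the check-only request: reads the balance without reserving it
  def enough_funds?(amount)
    @balance >= amount
  end

  def debit(amount)
    @balance -= amount
  end
end

ledger = FakeLedger.new(10)

# Both requests run the check before either one debits
request_a_ok = ledger.enough_funds?(10)
request_b_ok = ledger.enough_funds?(10)

# Each request then acts on its own, now stale, check result
ledger.debit(10) if request_a_ok
ledger.debit(10) if request_b_ok

final_balance = ledger.balance
puts final_balance
```

Because the check and the debit are separate steps, nothing stops the second request from acting on a balance that the first request is about to spend.
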
The first question to ask is "does my API have concurrency handling features?" In the case of Modern Treasury, the answer is yes. When logging a ledger transaction, we can submit balance check parameters to lock on what the balance should be after the operation. With that, the API simulates the operation and returns an error code if the resulting balance differs from the one provided; otherwise, it goes ahead and performs the operation. We can send the following parameters along with our JSON payload:

```json
{
  "available_balance": {
    "eq": WHAT_THE_BALANCE_SHOULD_BE_AFTER_THE_OPERATION
  }
}
```

This feature makes our `has_enough_funds?` check redundant because the balance is now checked implicitly when logging each ledger transaction. If we raise an exception when the balance check fails, our code already knows how to roll back. Therefore, our code can be simplified, and we can also detect the specific exception to return the appropriate error condition:

```rb
rescue InsufficientFundsError
  return :not_enough_funds
end
```

If the API doesn't provide concurrency features, a possible solution is to use [row-level locking] to limit concurrency:

```rb
# Always lock the same row first so that concurrent calls wait on each other
order_line_items.sort.first.with_lock do
  # All code goes here
end
```

`order_line_items.sort.first.with_lock` would replace `ApplicationRecord.transaction`, as it has the same functionality but with [row-level locks] on top.

## Making commit and rollback fault-tolerant

The API portion of our code now has commit and rollback steps, but they are unreliable. Be mindful that any code can fail, especially when making network calls. If either commit or rollback fails, our data would be inconsistent, and rerunning the code wouldn't correct it.

Designing commits and rollbacks as units that can be independently retried solves our problem, so let's offload both steps to background jobs.

```rb
def log_order
  failed_ledger_transaction_ids = []
  error = nil

  ApplicationRecord.transaction do
    unless all_logged?(order_line_items)
      # ...
      begin
        order_line_items.each do |line_item|
          api_result = log_pending_line_item(line_item) # Issues a synchronous HTTP request
          line_item.update!(external_transaction_id: api_result.transaction_id)
        end
      rescue => e
        error = e

        order_line_items.each do |line_item|
          if line_item.external_transaction_id.present?
            failed_ledger_transaction_ids << line_item.external_transaction_id
          end
        end

        # Undo the local updates without propagating an exception
        raise ActiveRecord::Rollback
      end
    end
  end

  if error
    # Rollback step here
    failed_ledger_transaction_ids.each do |external_ledger_transaction_id|
      # Issues an asynchronous HTTP request
      rollback_line_item_async(external_ledger_transaction_id)
    end

    return :not_enough_funds if error.is_a?(InsufficientFundsError)

    raise error
  end

  # Commit step here
  order_line_items.each do |line_item|
    # Issues an asynchronous HTTP request
    commit_line_item_async(line_item)
  end

  :ok
end
```

The error is captured in a variable declared outside the transaction block so that it's still available after the block exits, and `ActiveRecord::Rollback` makes sure the local updates are undone even though the original exception is only re-raised later. The commit step runs only when no error occurred.

That makes our code more reliable, given that most failures in both steps are likely to be transient errors that go away after a few retries. We're, of course, assuming the background job solution has retries baked in.

## Takeaways

Working with external APIs takes a lot of work. The more critical our workflow is, the more important it is to have a solid error handling/fault tolerance strategy.

- Pay close attention to the concept being modeled to decide whether to go with a transactional or [resumable error handling strategy];
- The transactional strategy requires API calls to be synchronous because we need to decide whether to commit or roll back everything;
- How we design and implement external rollbacks will always depend on particular API features and semantics;
- Always have at least a rollback step for external API interactions.
The commit step may sometimes be implicit and will depend on API semantics;
- External API commit and rollback steps can generally be offloaded to background jobs for increased fault tolerance;
- Idempotency is important for critical workflows;
- Properly handling and limiting concurrency is also important;
- See if your API provides features to handle concurrency; otherwise, try local solutions such as [row-level locks] or [advisory locks].

[resumable error handling strategy]: https://thoughtbot.com/blog/handling-errors-when-working-with-external-apis
[what situations it can be helpful]: https://thoughtbot.com/blog/handling-errors-when-working-with-external-apis#when-to-choose-a-resumable-strategy
[Sagas pattern]: https://blog.bernd-ruecker.com/saga-how-to-implement-complex-business-transactions-without-two-phase-commit-e00aa41a1b1b
[Modern Treasury API]: https://docs.moderntreasury.com/platform/reference/ledger-transaction-object
[ACID transaction]: https://en.wikipedia.org/wiki/ACID
[row-level locks]: https://www.postgresql.org/docs/current/explicit-locking.html#LOCKING-ROWS
[row-level locking]: https://www.postgresql.org/docs/current/explicit-locking.html#LOCKING-ROWS
[advisory locks]: https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS
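
As a rough illustration of the retry behavior that background jobs are assumed to provide, here is a minimal plain-Ruby sketch. `TransientApiError`, `with_retries`, and the `lt_123` id are hypothetical names invented for this example; a real job framework would typically handle retries and backoff for you. The rollback body is written to be idempotent, so running it again after a failure is harmless.

```ruby
# Hypothetical error class standing in for timeouts and other transient failures
class TransientApiError < StandardError; end

# Re-runs the block until it succeeds or max_attempts is reached
def with_retries(max_attempts:)
  attempts = 0

  begin
    attempts += 1
    yield
  rescue TransientApiError
    retry if attempts < max_attempts
    raise
  end
end

archived_ids = [] # In-memory stand-in for the remote ledger state
calls = 0

with_retries(max_attempts: 5) do
  calls += 1

  # Simulate two transient failures before the request goes through
  raise TransientApiError, "timeout" if calls < 3

  # Idempotent: archiving an already archived id is a no-op
  archived_ids << "lt_123" unless archived_ids.include?("lt_123")
end
```
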