Make Phoenix Even Faster with a GenServer-backed Key Value Store

Josh Clayton

As developers, many of us build breakable toys in various programming languages when learning something new; the goal is to learn the language and not the intricacies of a new product, after all! Some of us reach for todo apps, others a messaging clone. If you’re like me, though, it’s URL shorteners.

I built my most recent URL shortener in Elixir, and decided to see how much raw speed I could get out of the application by using GenServer (short for "generic server") to build a key-value store that bypasses the database when possible.

State in Elixir

A GenServer is one mechanism for managing state in Elixir, which makes it a perfect fit for a transient, in-memory cache. In this example, I'll be caching the struct straight from Ecto and using the slug from the URL as the cache key.
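
If you haven't reached for GenServer to hold state before, a minimal, standalone sketch (module and function names here are purely illustrative, not part of the URL shortener) looks something like this:

# A tiny GenServer whose state is just a map; every read and write is a
# synchronous call handled by the process that owns the state.
defmodule TinyStore do
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, %{}, opts)
  end

  def put(pid, key, value), do: GenServer.call(pid, {:put, key, value})
  def get(pid, key), do: GenServer.call(pid, {:get, key})

  # Callbacks: the last element of each reply tuple is the (possibly updated) state.
  def init(state), do: {:ok, state}

  def handle_call({:put, key, value}, _from, state) do
    {:reply, :ok, Map.put(state, key, value)}
  end

  def handle_call({:get, key}, _from, state) do
    {:reply, Map.get(state, key), state}
  end
end

Our cache follows the same shape, except the values will live in an ETS table (more on that below) rather than in the state map itself.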

Cache Interface

Now that we understand the underlying mechanism we’ll be using, the cache behavior is the next step:

# lib/link_cache/cache.ex

defmodule LinkCache.Cache do
  def fetch(slug, default_value_function) do
    case get(slug) do
      {:not_found} -> set(slug, default_value_function.())
      {:found, result} -> result
    end
  end
end

We’ll use LinkCache.Cache.fetch/2 to provide both a slug for lookup and a function to execute to populate the cache if no value is found. While LinkCache.Cache.get/1 and LinkCache.Cache.set/2 don’t exist yet, we can infer their behavior: LinkCache.Cache.get/1 will return either {:not_found} (at which point we call LinkCache.Cache.set/2 to store the value and return it) or {:found, result}, at which point we return the result directly.

To leverage this, let’s change RedirectTo.RedirectController:

--- a/web/controllers/redirect_controller.ex
+++ b/web/controllers/redirect_controller.ex
@@ -3,7 +3,9 @@ defmodule RedirectTo.RedirectController do
   alias RedirectTo.Queries

   def show(conn, %{"slug" => slug}) do
-    link = Queries.Link.by_slug(slug) |> Repo.one
+    link = LinkCache.Cache.fetch(slug, fn ->
+      Queries.Link.by_slug(slug) |> Repo.one
+    end)

     redirect conn, external: link.long_url
   end

Digging into our cache

With our usage out of the way, let’s build out the rest of LinkCache.Cache.

# lib/link_cache/cache.ex

defmodule LinkCache.Cache do
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, [
      {:ets_table_name, :link_cache_table},
      {:log_limit, 1_000_000}
    ], opts)
  end

  def fetch(slug, default_value_function) do
    case get(slug) do
      {:not_found} -> set(slug, default_value_function.())
      {:found, result} -> result
    end
  end

  defp get(slug) do
    case GenServer.call(__MODULE__, {:get, slug}) do
      [] -> {:not_found}
      [{_slug, result}] -> {:found, result}
    end
  end

  defp set(slug, value) do
    GenServer.call(__MODULE__, {:set, slug, value})
  end

  # GenServer callbacks

  def handle_call({:get, slug}, _from, state) do
    %{ets_table_name: ets_table_name} = state
    result = :ets.lookup(ets_table_name, slug)
    {:reply, result, state}
  end

  def handle_call({:set, slug, value}, _from, state) do
    %{ets_table_name: ets_table_name} = state
    true = :ets.insert(ets_table_name, {slug, value})
    {:reply, value, state}
  end

  def init(args) do
    [{:ets_table_name, ets_table_name}, {:log_limit, log_limit}] = args

    :ets.new(ets_table_name, [:named_table, :set, :private])

    {:ok, %{log_limit: log_limit, ets_table_name: ets_table_name}}
  end
end

We’ve already seen LinkCache.Cache.fetch/2, so let’s move on to LinkCache.Cache.get/1 and LinkCache.Cache.set/2.

defp get(slug) do
  case GenServer.call(__MODULE__, {:get, slug}) do
    [] -> {:not_found}
    [{_slug, result}] -> {:found, result}
  end
end

GenServer.call/3 is a synchronous function we use to send a request to the server (__MODULE__ expands to LinkCache.Cache, the name the process is registered under when the supervisor starts it; more on that below). The 2-tuple {:get, slug} (which we’ll pattern-match on) tells the server to retrieve the value stored under the key slug, and the lookup returns a list of zero or one results because of the underlying storage mechanism we’re using.

defp set(slug, value) do
  GenServer.call(__MODULE__, {:set, slug, value})
end

Here, we’re performing another synchronous action to cache the value for the specific slug. The callback inserts the 2-tuple {slug, value} into storage and replies with the value itself, which is exactly what fetch/2 needs to return on a cache miss.
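
Incidentally, the /3 in GenServer.call/3 comes from an optional third argument: a timeout (defaulting to 5,000ms) after which the caller gives up and exits. We rely on the default in both helpers, but a slower operation could pass one explicitly; a sketch:

# Same request as get/1, but willing to wait up to 10 seconds for a reply.
GenServer.call(__MODULE__, {:get, slug}, 10_000)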

Both LinkCache.Cache.get/1 and LinkCache.Cache.set/2 send requests to __MODULE__, as discussed above; we need to handle those requests in the server, pattern-matching on each one to trigger the appropriate work.

Above, we saw GenServer.call/3 used with the appropriate tuples; let’s look at the callbacks we’ve written to handle these requests:

def handle_call({:get, slug}, _from, state) do
  %{ets_table_name: ets_table_name} = state
  result = :ets.lookup(ets_table_name, slug)
  {:reply, result, state}
end

def handle_call({:set, slug, value}, _from, state) do
  %{ets_table_name: ets_table_name} = state
  true = :ets.insert(ets_table_name, {slug, value})
  {:reply, value, state}
end

The handle_call/3 callback receives the request (either {:get, slug} or {:set, slug, value}), a 2-tuple identifying the sender (its PID and a term that uniquely identifies the call), which we ignore here (_from), and the server’s internal state (state). Both clauses reply with a 3-tuple of {:reply, result, state}; back in fetch/2, with the link struct in hand, we can retrieve the URL and send the user on their way.
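
We reply to {:set, slug, value} synchronously because fetch/2 needs the freshly cached value back. If a caller didn’t care about the return value, a fire-and-forget write could use GenServer.cast/2 with a handle_cast/2 callback instead; a hypothetical variant, not part of the module above:

# Asynchronous write: the caller continues immediately, without waiting for the insert.
def set_async(slug, value) do
  GenServer.cast(__MODULE__, {:set, slug, value})
end

def handle_cast({:set, slug, value}, state) do
  true = :ets.insert(state.ets_table_name, {slug, value})
  {:noreply, state}
end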

Erlang Term Storage

The handle_call/3 clauses backing LinkCache.Cache.get/1 and LinkCache.Cache.set/2 each reference :ets, the Erlang module exposing ETS (Erlang Term Storage), an incredibly performant in-memory store with O(1) lookups for tables of type :set, which is what we’re using.

In both definitions, we pattern-match on state to retrieve the table name for our link cache and perform both insert (for LinkCache.Cache.set/2) and lookup (for LinkCache.Cache.get/1) operations.
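
If ETS is new to you, it’s worth poking at it directly in IEx with a throwaway table (nothing here assumes anything about the application) to see the shape of the data handle_call/3 is working with:

table = :ets.new(:scratch, [:set])
:ets.lookup(table, "abc")
# => []
:ets.insert(table, {"abc", %{long_url: "http://www.example.com"}})
# => true
:ets.lookup(table, "abc")
# => [{"abc", %{long_url: "http://www.example.com"}}]

That empty-list-or-single-tuple result is exactly what LinkCache.Cache.get/1 pattern-matches on.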

We configure :ets in LinkCache.Cache.start_link/1 and LinkCache.Cache.init/1:

def start_link(opts \\ []) do
  GenServer.start_link(__MODULE__, [
    {:ets_table_name, :link_cache_table},
    {:log_limit, 1_000_000}
  ], opts)
end

def init(args) do
  [{:ets_table_name, ets_table_name}, {:log_limit, log_limit}] = args

  :ets.new(ets_table_name, [:named_table, :set, :private])

  {:ok, %{log_limit: log_limit, ets_table_name: ets_table_name}}
end

These functions prepare LinkCache.Cache to be supervised (GenServer.start_link/3) with its configuration (a table named :link_cache_table and a log_limit of 1,000,000 pairs, which is carried in the server’s state but not enforced by this code), and create the table itself (:ets.new/2) as a named, private table with “set” storage.
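
The :private option means only the owning process (our GenServer) can read or write the table, which is why every lookup funnels through GenServer.call/3. If that single process ever became a bottleneck, one alternative, sketched below and not what this application does, is a :public table with read_concurrency enabled so callers can read ETS directly and only writes go through the server:

# Hypothetical variant of init/1: any process can read the named table directly.
def init(args) do
  [{:ets_table_name, ets_table_name}, {:log_limit, log_limit}] = args

  :ets.new(ets_table_name, [:named_table, :set, :public, read_concurrency: true])

  {:ok, %{log_limit: log_limit, ets_table_name: ets_table_name}}
end

# A read helper could then skip the GenServer entirely:
def get_direct(slug), do: :ets.lookup(:link_cache_table, slug)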

Fault-tolerance with a supervisor

With the underlying cache built, we now want to ensure Phoenix keeps the cache running. Elixir promotes a strategy of letting things crash - and ensuring recovery is quick and straightforward - with Supervisors at the helm.

Let’s build out a supervisor to ensure LinkCache.Cache is monitored and runs correctly:

# lib/link_cache/supervisor.ex

defmodule LinkCache.Supervisor do
  use Supervisor

  def start_link do
    Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  def init(:ok) do
    children = [
      worker(LinkCache.Cache, [[name: LinkCache.Cache]])
    ]

    supervise(children, strategy: :one_for_one)
  end
end

Our LinkCache.Supervisor supervises only LinkCache.Cache, with a :one_for_one strategy.

We’ll also need to update our application to include this supervisor:

# lib/redirect_to.ex

defmodule RedirectTo do
  use Application

  def start(_type, _args) do
    import Supervisor.Spec, warn: false

    children = [
      supervisor(RedirectTo.Endpoint, []),
      supervisor(RedirectTo.Repo, []),
      supervisor(LinkCache.Supervisor, [])
    ]

    opts = [strategy: :one_for_one, name: RedirectTo.Supervisor]
    Supervisor.start_link(children, opts)
  end

  # ...
end

So, in addition to our RedirectTo.Endpoint and RedirectTo.Repo, our LinkCache.Supervisor will also be managed by RedirectTo; this ensures that if anything happens to the cache, our application will restart it.
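
You can watch this in action from IEx: kill the cache process, and the supervisor starts a fresh one registered under the same name (the PIDs below are illustrative):

Process.whereis(LinkCache.Cache)
# => #PID<0.324.0>
Process.whereis(LinkCache.Cache) |> Process.exit(:kill)
# => true
Process.whereis(LinkCache.Cache)
# => a new PID, restarted by LinkCache.Supervisor

Because the ETS table is owned by the cache process, its contents die with it; the restarted cache simply starts out empty.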

You can find a more thorough walkthrough of supervisors, supervision trees, and how to tie various components together on Elixir’s discussion of OTP topics.

Testing LinkCache.Cache

To test the cache, I wrote a small test ensuring the cache sets the value based on the return value of a function and reuses that value on a subsequent lookup.

# test/lib/link_cache/cache_test.exs

defmodule LinkCacheTest do
  use ExUnit.Case

  test "caches and finds the correct data" do
    assert LinkCache.Cache.fetch("A", fn ->
      %{id: 1, long_url: "http://www.example.com"}
    end) == %{id: 1, long_url: "http://www.example.com"}

    assert LinkCache.Cache.fetch("A", fn -> "" end) == %{id: 1, long_url: "http://www.example.com"}
  end
end

Again, the cache can store structs, maps, or all sorts of other data structures. As is the case with any cache, invalidation might be tricky. The architecture of this application is such that the links themselves cannot be updated, so I’m not concerned about invalid data. Your mileage may vary.
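
If your data could change, one approach to invalidation would be a delete function alongside fetch/2; a sketch of what that might look like (not something this application needs):

# Hypothetical single-entry invalidation.
def delete(slug) do
  GenServer.call(__MODULE__, {:delete, slug})
end

def handle_call({:delete, slug}, _from, state) do
  :ets.delete(state.ets_table_name, slug)
  {:reply, :ok, state}
end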

Memory usage with 1 million 2-tuples

I’d admittedly chosen a log_limit of 1 million pairs without knowing the memory implications of that particular cache size. And while lookups are O(1) (constant time, regardless of the number of pairs in the cache), I did want to verify I wouldn’t be running out of memory.

How do we verify this?

First, fire up IEx running the Phoenix application:

iex -S mix phoenix.server

Next, run :observer.start() within IEx to see a GUI outlining various metrics of the running application.

Assuming ETS doesn’t make any sort of optimization around duplicate values, we can fill the cache with a million copies of the same link:

link = RedirectTo.Queries.Link.by_slug("A") |> RedirectTo.Repo.one
1..1_000_000 |> Enum.each(fn(i) -> LinkCache.Cache.fetch((i |> to_string), fn -> link end) end)

This inserts one million pairs into the cache, with the keys being string versions of each integer ("1", "1250", etc.) and the values being the Struct directly from Ecto.

[Figure: Memory usage graph of full struct]

Over the course of ~40s, all 1 million pairs were stored in memory (roughly 25,000 insertions per second), with a resulting size of around 856MB, or just under 1kB per pair.
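
If you’d rather skip the GUI, :ets.info/2 reports the table’s size and memory footprint (measured in machine words) straight from IEx; a quick sanity check along these lines:

:ets.info(:link_cache_table, :size)
# => 1_000_000
:ets.info(:link_cache_table, :memory) * :erlang.system_info(:wordsize)
# => approximate bytes used by the table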

Storing a subset of the data results in significantly less memory usage:

link = RedirectTo.Queries.Link.by_slug("A") |> RedirectTo.Repo.one |> Map.take([:slug, :long_url])

[Figure: Memory usage graph of small map]

Benchmarks on Heroku

With the cache in place, I deployed both the cache-read and database-read versions of the application to Heroku, running on a free dyno and the free database tier.

In both tests, I ran Siege against the HTTP version of the application, configured not to follow redirects, hitting the same slug with 500 concurrent users over 60 seconds. In both cases, the Heroku dyno was already running (no spin-up time).

DB read results

Transactions:                  29508 hits
Availability:                 100.00 %
Elapsed time:                  59.75 secs
Data transferred:               2.67 MB
Response time:                  0.70 secs
Transaction rate:             493.86 trans/sec
Throughput:                     0.04 MB/sec
Concurrency:                  343.44
Successful transactions:       29508
Failed transactions:               0
Longest transaction:            5.98
Shortest transaction:           0.04

I was impressed by these numbers: almost 30,000 requests per minute on free hardware, with an average response time of roughly 700ms, and every one of those requests hit the database. No connections were dropped.

Cache read results

Transactions:                  44829 hits
Availability:                 100.00 %
Elapsed time:                  59.38 secs
Data transferred:               4.06 MB
Response time:                  0.38 secs
Transaction rate:             754.95 trans/sec
Throughput:                     0.07 MB/sec
Concurrency:                  284.19
Successful transactions:       44829
Failed transactions:               0
Longest transaction:            8.39
Shortest transaction:           0.03

The throughput when reading from the cache was significantly better, serving roughly 50% more requests (44,829 versus 29,508). Additionally, the average response time was about half that of the run hitting the database.

Wrapping up

Even with Elixir’s incredible speed and concurrency support out of the box (30k RPM on free Heroku hardware with sub-second response times is pretty great), there are always ways to play to specific technologies’ strengths. By combining Erlang’s ETS, GenServer, and supervisors, we were able to build a fault-tolerant, scalable in-memory cache that improved both response time and throughput.