---
title: Developing a voice AI app in Rails for drive-through ordering
teaser: In this post we will explore the code for an AI assistant in our application.
tags: artificial intelligence,rails,openai,actioncable,websocket,realtime,ai
author: Jose Blanco
published_on: 2025-05-06
---

# Building a Real Time AI Drive-Thru Experience with Rails and OpenAI

If you attended last week's live coding session with Chad Pytel and Svenja
Schäfer, you are probably familiar with thoughtbot's favourite fast food
restaurant: Dinorex. If you missed it, you can watch the recording on
[YouTube](https://www.youtube.com/watch?v=h0MbROcEq5s).

In this post, we will explore some of the technical details of this proof of
concept, and add two fun features. First we will add the transcription
of what the assistant is saying to the screen, and then we will get our
assistant to greet the user as soon as we click the `start` button.

## The Idea

![A T-Rex having a burger](https://images.thoughtbot.com/og78ogv5lgy3lw11wujyyap3b2ps_image.png)

This proof of concept simulates a drive-thru ordering experience at "Dinorex", a
fictional fast-food restaurant with a dinosaur theme. We use OpenAI's GPT-4
model to create a conversational AI assistant that can take orders, respond with
dinosaur-themed puns, and process customer requests in real time.

For this reason we are using the
[real time endpoint of OpenAI](https://platform.openai.com/docs/api-reference/realtime).
This endpoint lets us communicate with the model in real time using web sockets.
The model supports text and audio, and it even includes transcriptions in the
event messages.

## Technical Stack

We are building this application using
[Rails 8 and Action Cable](https://guides.rubyonrails.org/action_cable_overview.html).
If you are not familiar with Action Cable, I recommend having a look at the
Rails guides. They helped me a lot when I first started looking at the code from
the live stream!

As a quick explanation, web sockets are a way for the browser and the server to
keep a connection open, so they can talk to each other continuously without
needing to reload the page.

Normally, in a web app, the browser sends a request, the server responds, and
that’s it. With Action Cable, the connection stays open, so the server can push
data to the browser whenever something happens — like a new message,
a notification, or a live update.

In our case we will use this to keep the connection open so we can communicate
in real time with OpenAI.

In the livestream we mentioned that the initial starting code was created by
Marcus Schappim; you can find the gist
[here](https://gist.github.com/schappim/544b3bae95699a92396be8c58417af01).
This code, plus the additions made by Chad and Svenja, will be the starting
point of our application.

## Welcome to the Dinorex code!

To better understand how everything is connected, we need to talk about the most
important classes and JavaScript files in the project.

On the server side we have the `OpenAiWebsocket` class. This class manages a
real time WebSocket connection between your Rails app and the model.
When it’s initialized, it sets up everything needed to connect to our model and
handle messages.

It connects in a background thread using [`EventMachine`](https://github.com/eventmachine/eventmachine),
then listens for updates from the model. When messages come in, it reacts to
specific types like audio transcripts or function calls, and uses Action Cable to
broadcast updates to the frontend in real time. For example, if a user says
something, the AI transcribes it, and that text or audio is instantly pushed to
the client.

Messages are sent to the model using a queue system to avoid conflicts, and the
class provides helper methods to build and enqueue different types of messages
— like starting a session, appending audio, or adding a conversation message.
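
The queueing idea can be sketched in plain Ruby. This is a minimal illustration under stated assumptions, not the code from the gist: `OutboundQueue` and `append_audio_event` are hypothetical names, and the block passed to the constructor stands in for the real WebSocket `send`. The event shape follows the Realtime API's `input_audio_buffer.append` client event.

```ruby
require "json"

# Hypothetical sketch: serialise outgoing events through a single queue so
# that only one thread ever writes to the socket.
class OutboundQueue
  def initialize(&sender)
    @queue = Queue.new   # thread-safe FIFO from Ruby's core library
    @sender = sender     # stands in for the real WebSocket `send`
    @worker = Thread.new { drain }
  end

  # Callers on any thread push a Hash; it is turned into JSON on the way out.
  def enqueue_message(event)
    @queue << event
  end

  # Stop accepting messages and wait until everything queued has been sent.
  def close
    @queue.close
    @worker.join
  end

  private

  def drain
    # Queue#pop returns nil once the queue is closed and empty.
    while (event = @queue.pop)
      @sender.call(JSON.generate(event))
    end
  end
end

# Hypothetical helper that builds one kind of event (assumed shape).
def append_audio_event(base64_audio, event_id:)
  { event_id: event_id, type: "input_audio_buffer.append", audio: base64_audio }
end

sent = []
queue = OutboundQueue.new { |json| sent << json }
queue.enqueue_message(append_audio_event("AAAA", event_id: "evt_1"))
queue.enqueue_message(append_audio_event("BBBB", event_id: "evt_2"))
queue.close
```

Because a single worker thread drains the queue, messages leave in the order they were enqueued, even when several threads call `enqueue_message` at once.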

In this class we need to pay attention to the `handle_message(message)` method.
Here is where we decide what to do and how to handle the data depending on the
message type we are receiving.

```ruby

  def handle_message(message)
    if message["type"] == "input_audio_buffer.speech_started"
      handle_input_audio_buffer_speech_started(message)
    elsif message["type"] == "response.audio.delta"
      broadcast_audio_delta(message["delta"])
    elsif message["type"] == "response.function_call_arguments.done"
      handle_response_function_call_arguments_done(message)
    else
      log_message(message)
    end
  end

```
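
As the number of event types grows, the same routing can be expressed as a lookup table. The following is only a sketch of that alternative, not how the project is written: `EventRouter`, `on`, and `otherwise` are hypothetical names, and the raw JSON strings stand in for frames received from the socket.

```ruby
require "json"

# Hypothetical sketch: route parsed Realtime events by their "type" field.
class EventRouter
  def initialize
    @handlers = {}
    @fallback = ->(_event) {}
  end

  # Register a block for one event type.
  def on(type, &block)
    @handlers[type] = block
    self
  end

  # Register a block for every type without a dedicated handler.
  def otherwise(&block)
    @fallback = block
    self
  end

  # Parse one raw JSON frame and call the matching handler.
  def dispatch(raw_json)
    event = JSON.parse(raw_json)
    @handlers.fetch(event["type"], @fallback).call(event)
  end
end

played = []
logged = []
router = EventRouter.new
  .on("response.audio.delta") { |event| played << event["delta"] }
  .otherwise { |event| logged << event["type"] }

router.dispatch('{"type":"response.audio.delta","delta":"AAAA"}')
router.dispatch('{"type":"rate_limits.updated"}')
```

After these two calls, `played` holds the audio delta and `logged` holds the type of the unhandled event, which mirrors the `else` branch that logs everything else.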

We also have the `OpenAiChannel` class. This class handles the Action Cable
channel that connects a user to a live conversation with OpenAI’s API. When a
client subscribes, it creates a new order, sets up a web socket connection using
the `OpenAiWebsocket` class, and starts streaming updates to a channel named
after that order.

It also sends an initial system message to our model setting up how the
assistant needs to behave.

```ruby

  def subscribed
    order = Order.create!
    puts "Subscribed to open_ai_#{order.id}"
    stream_from "open_ai_#{order.id}"
    @openai_client = OpenAiWebsocket.new(order.id)
    @openai_client.connect
    @openai_client.session_update(event_id: SecureRandom.uuid)
    item = {
            "id": "msg_001",
            "type": "message",
            "status": "completed",
            "role": "system",
            "content": [
                {
                    "type": "input_text",
                    "text": "You are a drive-thru attendant at the fast food restaurant Dinorex. You are taking the order of a customer who has just pulled up to the drive-thru at Dinorex. IMPORTANT: Keep responses SHORT and curt and polite and use dinosaur sounds and puns!!! #{background_info}"
                }
            ]
        }
    @openai_client.conversation_item_create(item: item, event_id: SecureRandom.uuid)
  end

  def background_info
    text = <<~TEXT
        Do not allow customers to order things not on the above menu.
        The items available for ordering at Dinorex are:

        #{MenuItem.all.map(&:to_json)}
      TEXT

    text
  end

```

The channel defines several methods that let the frontend interact with the AI.
For example, `append_audio` streams audio to the model, and `receive` handles both
audio and text input. When the connection is closed, the WebSocket is also
cleanly shut down.

So how do we tie it all together with our frontend?

For that we need to turn our attention to the JavaScript files that handle the
messages. The `open_ai_channel.js` file connects the frontend to the
`OpenAiChannel` over Action Cable and handles live audio streaming and real time
responses from the server.

When the page loads, it sets up a subscription to the channel. Once connected,
it can send audio, start or stop an order, or restart the AI session by calling
the corresponding server-side methods through `perform`.

It listens for data coming from the server — like AI-generated audio, a new
message, or a speech event — and reacts accordingly. For example, it can play
back audio or update the UI with new messages.

The core feature here is live voice interaction. When the user clicks the
“start” button, the app captures microphone input using the Web Audio
API, converts it to the right audio format, and continuously streams it as
base64-encoded chunks to the backend through the channel.
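
In the browser this conversion is done in JavaScript by the `base64EncodeAudio` helper. The same arithmetic is shown here in Ruby purely as an illustration. It assumes the target is 16-bit little-endian PCM, matching the `pcm16` format name the API reports; the helper names `encode_pcm16` and `decode_pcm16` and the exact scaling factor are assumptions, not the project's code.

```ruby
require "base64"

# Hypothetical helper: convert Web Audio style float samples (-1.0..1.0)
# into base64-encoded 16-bit little-endian PCM.
def encode_pcm16(samples)
  ints = samples.map do |sample|
    clamped = sample.clamp(-1.0, 1.0)   # guard against out-of-range input
    (clamped * 32_767).round            # scale to the signed 16-bit range
  end
  Base64.strict_encode64(ints.pack("s<*"))  # "s<" = int16, little-endian
end

# Inverse operation, handy for checking the round trip.
def decode_pcm16(encoded)
  Base64.strict_decode64(encoded).unpack("s<*")
end

chunk = encode_pcm16([0.0, 1.0, -1.0, 2.0])
puts decode_pcm16(chunk).inspect
```

Clamping before scaling matters: a sample slightly above 1.0 would otherwise overflow the 16-bit range and produce a loud click.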

When the user stops recording, it closes the audio stream and tells the server
to end the session. Here is where the fun starts!

It's very cool to be able to speak with an AI assistant and get your order
placed, but how annoying would it be if you could not hear the assistant
properly? What if you cannot hear well? Wouldn't it be cool to read what the
assistant is saying? Here is where the transcript can help!

## Adding the transcription

When an event is processed by the application, we have access to different
attributes of the event. You can see how we parse the data in
`setup_event_handlers`, and that for each message we call the
`handle_message(message)` method I mentioned earlier.

For example, a `response.done` message has this format:

```ruby

{
  "type" => "response.done",
  "event_id" => "event_BKNIpipMdMQwwxSqoY3lu",
  "response" => {
    "object" => "realtime.response",
    "id" => "resp_BKNInr0CZfV49lMSzOCps",
    "status" => "completed",
    "status_details" => nil,
    "output" => [{
      "id" => "item_BKNInXwJoIYRorEj0NiwN",
      "object" => "realtime.item",
      "type" => "message",
      "status" => "completed",
      "role" => "assistant",
      "content" => [{
        "type" => "audio",
        "transcript" => "Rawr! Welcome to Dinorex! What can I triceratops for you today?"
      }]
    }],
    "conversation_id" => "conv_BKNIkEtOGCc0EBpxcN2IM",
    "modalities" => ["audio", "text"],
    "voice" => "sage",
    "output_audio_format" => "pcm16",
    "temperature" => 0.8,
    "max_output_tokens" => "inf",
    "usage" => {
      "total_tokens" => 1077,
      "input_tokens" => 930,
      "output_tokens" => 147,
      "input_token_details" => {
        "text_tokens" => 930,
        "audio_tokens" => 0,
        "cached_tokens" => 0,
        "cached_tokens_details" => { "text_tokens" => 0, "audio_tokens" => 0 }
      },
      "output_token_details" => { "text_tokens" => 34, "audio_tokens" => 113 }
    },
    "metadata" => nil
  }
}

```

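In order to capture the transcript and send it to the frontend we need to modify
the `handle_message` method. The finished item is also delivered on its own in a
`response.output_item.done` event, under the `item` key, so an easy approach
could look like this: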

```ruby

#[...rest of the method...]

elsif message["type"] == "response.output_item.done"
  if message["item"]["type"] == "function_call"
    puts "Function call done: #{message}"
  else
    transcript = message["item"]["content"][0]["transcript"]

    ActionCable.server.broadcast("open_ai_#{@session_id}", {
      type: "new_message",
      message: transcript
    })
  end
else

#[...rest of the method...]

```

We first wait for an output item to be finished, and then we check whether it
is a function call, as we do not want to send those to our channel. If the
item's `type` is not `function_call`, we broadcast a message to the channel
that includes the transcription.
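
Chained `[]` calls raise a `NoMethodError` if any level is missing, for instance when an item arrives with an empty `content` array. A slightly more defensive variant, sketched here with a hypothetical `transcript_from` helper that is not part of the original code, uses `Hash#dig` and returns `nil` when there is nothing to show:

```ruby
# Hypothetical helper: pull the assistant transcript out of a
# `response.output_item.done` event, or return nil if there is none.
def transcript_from(message)
  return unless message["type"] == "response.output_item.done"
  return if message.dig("item", "type") == "function_call"

  # Pick the first content part that actually carries a transcript.
  parts = message.dig("item", "content") || []
  parts.filter_map { |part| part["transcript"] }.first
end

event = {
  "type" => "response.output_item.done",
  "item" => {
    "type" => "message",
    "role" => "assistant",
    "content" => [{ "type" => "audio", "transcript" => "Rawr! Welcome to Dinorex!" }]
  }
}

puts transcript_from(event)
```

Inside `handle_message`, the broadcast would then only happen when the helper returns a non-`nil` value.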

Then, in our `open_ai_channel.js` file, we need to handle this message type in
the `received` function:

```javascript

// Within received(data) function

else if (data.type === "new_message") {
  // Create a new DOM element to show the new message
  const messageContainer = document.getElementById("messages");
  const messageElement = document.createElement("div");
  messageElement.innerText = data.message;
  messageContainer.appendChild(messageElement);
}

```

In our view we just need a `div` container with the id `messages`, where each
message is appended as it arrives. With just these two bits of code we are
transcribing the AI assistant in real time!

## Get the assistant to greet the customer after clicking start

Another cool feature we can add to our AI drive-thru assistant is that the
bot welcomes the user as soon as they click the `start` button.

To do this we need to extend our `OpenAiWebsocket` class a bit more. We need to
add a new method that we will call at the beginning of each session, once we
are subscribed to the channel. It can look like this:

```ruby

  def new_session(event_id:)
    enqueue_message(
      {
        "event_id": event_id,
        "type": "response.create",
        "response": {
          "modalities": ["audio", "text"],
          "instructions": "Please greet the customer and ask them what they would like to order.",
          "voice": "sage",
          "tools": []
        }
      }
    )
  end

  alias_method :greet_customer, :new_session

```

Then we can use this new method in the `OpenAiChannel` class, for example:

```ruby

  def start_order
    @openai_client.greet_customer(event_id: SecureRandom.uuid)
  end

```

Let's tie this up with our subscription. In our OpenAI channel JavaScript file
we need to add this new function to the subscription, which is created as soon
as the DOM loads.

```javascript

document.addEventListener("DOMContentLoaded", () => {
    subscription = consumer.subscriptions.create("OpenAiChannel", {
        connected() {
            console.log("Connected to OpenAI channel");
        },

        disconnected() {
            console.log("Disconnected from OpenAI channel");
        },

        sendAudio(audioData) {
            this.perform("append_audio", { type: "audio", audio: audioData });
        },
    
        // We can add here our new function!
        startOrder() {
            this.perform("start_order");
        },

        // [...rest of the code]

```

Then, within our start button handler, we need to call the `startOrder` function
on the subscription.

```javascript

  startButton.onclick = async () => {
    audioContext = new (window.AudioContext || window.webkitAudioContext)({
        sampleRate: 24000,
    });
    const stream = await navigator.mediaDevices.getUserMedia({
        audio: {
            channelCount: 1,
            sampleRate: 24000,
            sampleSize: 16,
        },
    });
    const source = audioContext.createMediaStreamSource(stream);
    const processor = audioContext.createScriptProcessor(1024, 1, 1);

    source.connect(processor);
    processor.connect(audioContext.destination);

    // Send audio chunks continuously as they are processed
    processor.onaudioprocess = (e) => {
        const inputData = e.inputBuffer.getChannelData(0);
        const encodedAudio = base64EncodeAudio(new Float32Array(inputData));
        subscription.sendAudio(encodedAudio); // Send audio data via WebSocket
    };
    // Here is the new code :)
    subscription.startOrder();

    startButton.disabled = true;
    stopButton.disabled = false;
};

```

Almost done! Now every time we click the start button we should see a new
event hitting our new `start_order` method 🎉 but... wait, why
can't we hear the assistant?

## Modern browser quirks

This was a bit challenging and I spent some time debugging until I discovered
what was going on. Modern browsers block auto-play of audio unless the user has
interacted with the page first (like clicking or tapping).

To solve this we need to make some changes to our `application.js` file. First
we need to make the `audioContext`, `audioBuffer` and `isPlaying` variables
available globally.

```javascript

window.audioContext = null;
window.audioBuffer = new Float32Array(0);
window.isPlaying = false;

```

And now the most important change. `initAudioContext` was private and
synchronous. To get the assistant talking as soon as we click the `start`
button, we expose it globally, make it `async`, and handle the browser
requirement that a suspended `AudioContext` can only be resumed after a user
gesture.

```javascript

window.initAudioContext = async function() {
  ...
  if (window.audioContext.state === 'suspended') {
    await window.audioContext.resume();
    console.log("AudioContext resumed from suspended state");
  }
}

```

Just with this change we prevent silent failures after clicking on our `start`
button 🎉! Below I've attached a video example of our application:

<div class="wistia_responsive_padding" style="padding:53.13% 0 0 0;position:relative;"><div class="wistia_responsive_wrapper" style="height:100%;left:0;position:absolute;top:0;width:100%;"><iframe src="https://fast.wistia.net/embed/iframe/svwoei629c?web_component=true&seo=false" title="drive-thru-example Video" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" width="100%" height="100%"></iframe></div></div>
<script src="https://fast.wistia.net/player.js" async></script>

Well done if you have made it this far! I hope you have found this post
interesting and that it has helped you to play around with the original code
from the livestream by adding two cool new features!
