Building a Real Time AI Drive-Thru Experience with Rails and OpenAI
If you attended the live coding session last week with Chad Pytel and Svenja Schäfer, you are probably familiar with thoughtbot's favourite fast food restaurant: Dinorex. If you missed it, you can watch the recording on YouTube.
In this post, we will explore some of the technical details of this proof of
concept and add two fun features. First, we will display a transcription of
what the assistant is saying on the screen, and then we will get our assistant
to greet the user as soon as we click the start button.
The Idea
This proof of concept simulates a drive-thru ordering experience at “Dinorex”, a fictional fast-food restaurant with a dinosaur theme. We use OpenAI’s GPT-4 model to create a conversational AI assistant that can take orders, respond with dinosaur-themed puns, and process customer requests in real time.
For this reason, we are using OpenAI’s realtime endpoint. This endpoint allows us to communicate with the model in real time using web sockets. The model supports both text and audio, and it even includes transcriptions in the event messages.
Technical Stack
We are building this application using Rails 8 and Action Cable. If you are not familiar with Action Cable, I would recommend having a look at the Rails guides. They helped me a lot when I first started looking at the code from the live stream!
As a quick explanation, web sockets are a way for the browser and the server to keep a connection open, so they can talk to each other continuously without needing to reload the page.
Normally, in a web app, the browser sends a request, the server responds, and that’s it. With Action Cable, the connection stays open, so the server can push data to the browser whenever something happens — like a new message, a notification, or a live update.
In our case we will use this to keep the connection open so we can communicate in real time with OpenAI.
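As a quick illustration of the pattern (the channel and stream names below are placeholders for this example, not the ones used in the project), this is roughly what pushing data to the browser looks like with Action Cable:
# app/channels/example_channel.rb (an illustrative channel, not part of Dinorex)
class ExampleChannel < ApplicationCable::Channel
  def subscribed
    # The browser subscribes once; from then on the server can push to this stream.
    stream_from "example_stream"
  end
end

# Later, anywhere on the server, push data to every subscribed browser:
ActionCable.server.broadcast("example_stream", { type: "notification", message: "Hello!" })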
In the livestream we mentioned that the initial starting code was created by Marcus Schappim; you can find the gist here. This code, plus the additions made by Chad and Svenja, will be the starting point of our application.
Welcome to the Dinorex code!
To better understand how everything is connected, we need to talk about the most important classes and JavaScript files in the project.
On the server side we have the OpenAiWebsocket
class. This class manages a
real time WebSocket connection between your Rails app and the model.
When it’s initialized, it sets up everything needed to connect to our model and
handle messages.
It connects in a background thread using EventMachine, then listens for
updates from the model. When messages come in, it reacts to
specific types like audio transcripts or function calls, and uses Action Cable to
broadcast updates to the frontend in real time. For example, if a user says
something, the AI transcribes it, and that text or audio is instantly pushed to
the client.
Messages are sent to the model using a queue system to avoid conflicts, and the class provides helper methods to build and enqueue different types of messages — like starting a session, appending audio, or adding a conversation message.
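The exact implementation lives in the gist, but the shape of it is roughly this (a sketch: enqueue_message shows up again later in this post, while the queue variable and the append_audio helper name are assumptions on my part):
# Every outgoing message goes through a single queue; a background sender drains
# it and writes to the socket, so nothing else writes to the socket concurrently.
def enqueue_message(message)
  @outgoing_messages << message
end

# Example of a helper that builds one specific message type: wrapping a chunk of
# base64 audio in the Realtime API's input_audio_buffer.append event.
def append_audio(audio:, event_id:)
  enqueue_message(
    {
      "event_id": event_id,
      "type": "input_audio_buffer.append",
      "audio": audio
    }
  )
end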
In this class we need to pay attention to the handle_message(message) method.
This is where we decide what to do and how to handle the data depending on the
type of message we are receiving.
def handle_message(message)
if message["type"] == "input_audio_buffer.speech_started"
handle_input_audio_buffer_speech_started(message)
elsif message["type"] == "response.audio.delta"
broadcast_audio_delta(message["delta"])
elsif message["type"] == "response.function_call_arguments.done"
handle_response_function_call_arguments_done(message)
else
log_message(message)
end
end
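For context, broadcast_audio_delta is one of those Action Cable broadcasts mentioned above. A simplified sketch could look like this (the payload keys are an assumption; the gist may use slightly different ones):
def broadcast_audio_delta(delta)
  # Push the base64-encoded audio chunk straight to the browser over Action Cable.
  ActionCable.server.broadcast("open_ai_#{@session_id}", {
    type: "audio",
    delta: delta
  })
end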
We also have the OpenAiChannel
class. This class handles the Action Cable
channel that connects a user to a live conversation with OpenAI’s API. When a
client subscribes, it creates a new order, sets up a web socket connection using
the OpenAiWebsocket
class, and starts streaming updates to a channel named
after that order.
It also sends an initial system message to our model that sets up how the assistant should behave.
def subscribed
order = Order.create!
puts "Subscribed to open_ai_#{order.id}"
stream_from "open_ai_#{order.id}"
@openai_client = OpenAiWebsocket.new(order.id)
@openai_client.connect
@openai_client.session_update(event_id: SecureRandom.uuid)
item = {
"id": "msg_001",
"type": "message",
"status": "completed",
"role": "system",
"content": [
{
"type": "input_text",
"text": "You are a drive-thru attendant at the fast food restaurant Dinorex. You are taking the order of a customer ho has just pulled up to the drive-thru at Dinorex. IMPORTANT: Keep responses SHORT and curt and polite and use dinosaur sounds and puns!!! #{background_info}"
}
]
}
@openai_client.conversation_item_create(item: item, event_id: SecureRandom.uuid)
end
def background_info
text = <<~TEXT
Do not allow customers to order things not on the above menu.
The items available for ordering at Dinorex are:
#{MenuItem.all.map(&:to_json)}
TEXT
text
end
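The conversation_item_create call in subscribed is another of the enqueue helpers from OpenAiWebsocket. Assuming it simply wraps the item in the Realtime API's conversation.item.create event, a sketch of it could look like this:
def conversation_item_create(item:, event_id:)
  enqueue_message(
    {
      "event_id": event_id,
      "type": "conversation.item.create",
      "item": item
    }
  )
end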
The channel defines several methods that let the frontend interact with the AI.
For example, append_audio
streams audio to the model, and receive
handles both
audio and text input. When the connection is closed, the WebSocket is also
cleanly shut down.
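As a rough sketch (the exact bodies live in the gist; the close method name is an assumption, and the "audio" key matches the sendAudio call we will see in the JavaScript later), those channel actions could look something like this:
def append_audio(data)
  # Forward the base64 chunk coming from the browser to the model.
  @openai_client.append_audio(audio: data["audio"], event_id: SecureRandom.uuid)
end

def unsubscribed
  # Cleanly shut down the OpenAI WebSocket when the browser disconnects.
  @openai_client&.close
end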
Now, how do we tie it all together with our frontend?
We need to turn our attention to the JavaScript files where we handle these
messages. open_ai_channel.js connects the frontend to the OpenAiChannel over
Action Cable and handles live audio streaming and real time responses from the
server.
When the page loads, it sets up a subscription to the channel. Once connected, it can send audio, start or stop an order, or restart the AI session by calling the corresponding server-side methods through perform.
It listens for data coming from the server — like AI-generated audio, a new message, or a speech event — and reacts accordingly. For example, it can play back audio or update the UI with new messages.
The core feature here is live voice interaction. When the user clicks the “start” button, the app captures microphone input using the Web Audio API, converts it to the right audio format, and continuously streams it as base64-encoded chunks to the backend through the channel.
When the user stops recording, it closes the audio stream and tells the server to end the session. Here is where the fun starts!
It’s very cool to be able to speak with an AI assistant and get your order placed, but how annoying would it be if you couldn’t hear the assistant properly? And what if you can’t hear well at all? Wouldn’t it be nice to read what the assistant is saying? This is where the transcript can help!
Adding the transcription
When an event is processed by the application we have access to different
attributes of the event. You can see how we parse the data in
setup_event_handlers, and that on each message we call the
handle_message(message) method I was talking about earlier.
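If you are curious, that handler has roughly this shape (a sketch assuming a Faye::WebSocket client running on EventMachine; the gist may wire it up slightly differently):
def setup_event_handlers
  @ws.on :message do |event|
    # Every Realtime API event arrives as a JSON string.
    message = JSON.parse(event.data)
    handle_message(message)
  end
end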
Typically, a message will have this format:
{
  "type" => "response.done",
  "event_id" => "event_BKNIpipMdMQwwxSqoY3lu",
  "response" => {
    "object" => "realtime.response",
    "id" => "resp_BKNInr0CZfV49lMSzOCps",
    "status" => "completed",
    "status_details" => nil,
    "output" => [{
      "id" => "item_BKNInXwJoIYRorEj0NiwN",
      "object" => "realtime.item",
      "type" => "message",
      "status" => "completed",
      "role" => "assistant",
      "content" => [{
        "type" => "audio",
        "transcript" => "Rawr! Welcome to Dinorex! What can I triceratops for you today?"
      }]
    }],
    "conversation_id" => "conv_BKNIkEtOGCc0EBpxcN2IM",
    "modalities" => ["audio", "text"],
    "voice" => "sage",
    "output_audio_format" => "pcm16",
    "temperature" => 0.8,
    "max_output_tokens" => "inf",
    "usage" => {
      "total_tokens" => 1077,
      "input_tokens" => 930,
      "output_tokens" => 147,
      "input_token_details" => {
        "text_tokens" => 930,
        "audio_tokens" => 0,
        "cached_tokens" => 0,
        "cached_tokens_details" => {
          "text_tokens" => 0,
          "audio_tokens" => 0
        }
      },
      "output_token_details" => {
        "text_tokens" => 34,
        "audio_tokens" => 113
      }
    },
    "metadata" => nil
  }
}
In order to capture this message and send it to the frontend we need to modify
the handle_message
method. An easy approach could be like this:
#[...rest of the method...]
elsif message["type"] == "response.output_item.done"
if message["item"]["type"] == "function_call"
puts "Function call done: #{message}"
else
transcript = message["item"]["content"][0]["transcript"]
ActionCable.server.broadcast("open_ai_#{@session_id}", {
type: "new_message",
message: transcript
})
end
else
#[...rest of the method...]
We first wait for an output item to be finished, and then check whether it is
a function call, as we do not want to send those to our channel. If the item
type is not a function call, we can broadcast the message to the channel and
include the transcription in it.
Then, in our open_ai_channel.js file, we need to handle this data type in the received(data) function:
// Within received(data) function
else if (data.type === "new_message") {
// Create a new DOM element to show the new message
const messageContainer = document.getElementById("messages");
const messageElement = document.createElement("div");
messageElement.innerText = data.message;
messageContainer.appendChild(messageElement);
}
In our view we just need a div container with the id messages to append each
message to as it arrives, and with just these two bits of code we are
transcribing the AI assistant in real time!
Get the assistant to greet the customer after clicking start
Another cool feature we can add to our AI drive-thru assistant is that the
bot welcomes the user as soon as they click the start
button.
To do this we need to extend our OpenAiWebsocket class a bit more. We need to
add a new method that we will use at the beginning of each session, when we
subscribe to the channel. It can look like this:
def new_session(event_id:)
enqueue_message(
{
"event_id": event_id,
"type": "response.create",
"response": {
"modalities": ["audio", "text"],
"instructions": "Please great the customer and ask them what they would like to order.",
"voice": "sage",
"tools": []
}
}
)
end
alias_method :greet_customer, :new_session
Then we can use this new method in the OpenAiChannel class. Here is an example:
def start_order
@openai_client.greet_customer(event_id: SecureRandom.uuid)
end
Let’s tie this up with our subscriber. In our OpenAI channel JavaScript file we need to add this new function to the subscription we create as soon as the DOM loads.
document.addEventListener("DOMContentLoaded", () => {
subscription = consumer.subscriptions.create("OpenAiChannel", {
connected() {
console.log("Connected to OpenAI channel");
},
disconnected() {
console.log("Disconnected from OpenAI channel");
},
sendAudio(audioData) {
this.perform("append_audio", { type: "audio", audio: audioData });
},
// We can add here our new function!
startOrder() {
this.perform("start_order");
},
// [...rest of the code]
Then, within our start button handler, we need to call the startOrder function
on the subscription.
startButton.onclick = async () => {
audioContext = new (window.AudioContext || window.webkitAudioContext)({
sampleRate: 24000,
});
const stream = await navigator.mediaDevices.getUserMedia({
audio: {
channelCount: 1,
sampleRate: 24000,
sampleSize: 16,
},
});
const source = audioContext.createMediaStreamSource(stream);
const processor = audioContext.createScriptProcessor(1024, 1, 1);
source.connect(processor);
processor.connect(audioContext.destination);
// Send audio chunks continuously as they are processed
processor.onaudioprocess = (e) => {
const inputData = e.inputBuffer.getChannelData(0);
const encodedAudio = base64EncodeAudio(new Float32Array(inputData));
subscription.sendAudio(encodedAudio); // Send audio data via WebSocket
};
// Here is the new code :)
subscription.startOrder();
startButton.disabled = true;
stopButton.disabled = false;
};
Almost done! Now every time we click the start button you should see a new
event hitting our new start_order method 🎉 but…wait, why
can’t we hear the assistant?
Modern browser quirks
This was a bit challenging and I spent some time debugging until I discovered what was going on. Modern browsers block auto-play of audio unless the user has interacted with the page first (like clicking or tapping).
To solve this we need to make some changes to our application.js file. First
we need to make the audioContext, audioBuffer and isPlaying variables
available globally.
window.audioContext = null;
window.audioBuffer = new Float32Array(0);
window.isPlaying = false;
And now the most important change. initAudioContext was private and
synchronous. In order to get the assistant talking as soon as we click on the
start button, we need to handle the browser requirement that an AudioContext
can only be started or resumed from a user gesture.
window.initAudioContext = async function() {
...
if (window.audioContext.state === 'suspended') {
await window.audioContext.resume();
console.log("AudioContext resumed from suspended state");
}
}
With just this change we prevent silent failures after clicking our start
button 🎉! Below I’ve attached a video example of our application.
Well done if you have made it this far! I hope you found this post interesting and that it helps you play around with the original code from the livestream by adding two cool new features!