Building a Real Time AI Drive-Thru Experience with Rails and OpenAI
If you attended the live coding session last week with Chad Pytel and Svenja Schäfer, you are probably familiar with thoughtbot's favourite fast food restaurant: Dinorex. If you missed it, you can watch the recording on YouTube.
In this post, we will explore some of the technical details of this proof of
concept and add two fun features. First, we will display a transcription of
what the assistant is saying on the screen, and then we will get our assistant
to greet the user as soon as we click the start button.
The Idea
This proof of concept simulates a drive-thru ordering experience at “Dinorex”, a fictional fast-food restaurant with a dinosaur theme. We use OpenAI’s GPT-4 model to create a conversational AI assistant that can take orders, respond with dinosaur-themed puns, and process customer requests in real time.
For this reason, we are using OpenAI’s realtime endpoint. This endpoint allows us to communicate with the model in real time using web sockets. The model supports both text and audio, and it even includes transcriptions in the event messages.
Technical Stack
We are building this application using Rails 8 and Action Cable. If you are not familiar with Action Cable, I would recommend having a look at the Rails guides. They helped me a lot when I first started looking at the code from the live stream!
As a quick explanation, web sockets are a way for the browser and the server to keep a connection open, so they can talk to each other continuously without needing to reload the page.
Normally, in a web app, the browser sends a request, the server responds, and that’s it. With Action Cable, the connection stays open, so the server can push data to the browser whenever something happens — like a new message, a notification, or a live update.
In our case we will use this to keep the connection open so we can communicate in real time with OpenAI.
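As a quick illustration of the pattern (the channel and stream names below are placeholders for this example, not the ones used in the project), this is roughly what pushing data to the browser looks like with Action Cable:
# app/channels/example_channel.rb (an illustrative channel, not part of Dinorex)
class ExampleChannel < ApplicationCable::Channel
  def subscribed
    # The browser subscribes once; from then on the server can push to this stream.
    stream_from "example_stream"
  end
end

# Later, anywhere on the server, push data to every subscribed browser:
ActionCable.server.broadcast("example_stream", { type: "notification", message: "Hello!" })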
In the livestream we mentioned that the initial starting code was created by Marcus Schappim; you can find the gist here. This code, plus the additions made by Chad and Svenja, will be the starting point of our application.
Welcome to the Dinorex code!
To better understand how everything is connected, we need to talk about the most important classes and JavaScript files in the project.
On the server side we have the OpenAiWebsocket
class. This class manages a
real time WebSocket connection between your Rails app and the model.
When it’s initialized, it sets up everything needed to connect to our model and
handle messages.
It connects in a background thread using EventMachine, then listens for
updates from the model. When messages come in, it reacts to
specific types like audio transcripts or function calls, and uses Action Cable to
broadcast updates to the frontend in real time. For example, if a user says
something, the AI transcribes it, and that text or audio is instantly pushed to
the client.
Messages are sent to the model using a queue system to avoid conflicts, and the class provides helper methods to build and enqueue different types of messages — like starting a session, appending audio, or adding a conversation message.
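The exact implementation lives in the gist, but the shape of it is roughly this (a sketch: enqueue_message shows up again later in this post, while the queue variable and the append_audio helper name are assumptions on my part):
# Every outgoing message goes through a single queue; a background sender drains
# it and writes to the socket, so nothing else writes to the socket concurrently.
def enqueue_message(message)
  @outgoing_messages << message
end

# Example of a helper that builds one specific message type: wrapping a chunk of
# base64 audio in the Realtime API's input_audio_buffer.append event.
def append_audio(audio:, event_id:)
  enqueue_message(
    {
      "event_id": event_id,
      "type": "input_audio_buffer.append",
      "audio": audio
    }
  )
end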
In this class we need to pay attention to the handle_message(message) method.
This is where we decide what to do and how to handle the data depending on the
type of message we are receiving.
def handle_message(message)
if message["type"] == "input_audio_buffer.speech_started"
handle_input_audio_buffer_speech_started(message)
elsif message["type"] == "response.audio.delta"
broadcast_audio_delta(message["delta"])
elsif message["type"] == "response.function_call_arguments.done"
handle_response_function_call_arguments_done(message)
else
log_message(message)
end
end
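For context, broadcast_audio_delta is one of those Action Cable broadcasts mentioned above. A simplified sketch could look like this (the payload keys are an assumption; the gist may use slightly different ones):
def broadcast_audio_delta(delta)
  # Push the base64-encoded audio chunk straight to the browser over Action Cable.
  ActionCable.server.broadcast("open_ai_#{@session_id}", {
    type: "audio",
    delta: delta
  })
end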
We also have the OpenAiChannel
class. This class handles the Action Cable
channel that connects a user to a live conversation with OpenAI’s API. When a
client subscribes, it creates a new order, sets up a web socket connection using
the OpenAiWebsocket
class, and starts streaming updates to a channel named
after that order.
It also sends an initial system message to our model that sets up how the assistant should behave.
def subscribed
order = Order.create!
puts "Subscribed to open_ai_#{order.id}"
stream_from "open_ai_#{order.id}"
@openai_client = OpenAiWebsocket.new(order.id)
@openai_client.connect
@openai_client.session_update(event_id: SecureRandom.uuid)
item = {
"id": "msg_001",
"type": "message",
"status": "completed",
"role": "system",
"content": [
{
"type": "input_text",
"text": "You are a drive-thru attendant at the fast food restaurant Dinorex. You are taking the order of a customer ho has just pulled up to the drive-thru at Dinorex. IMPORTANT: Keep responses SHORT and curt and polite and use dinosaur sounds and puns!!! #{background_info}"
}
]
}
@openai_client.conversation_item_create(item: item, event_id: SecureRandom.uuid)
end
def background_info
text = <<~TEXT
Do not allow customers to order things not on the above menu.
The items available for ordering at Dinorex are:
#{MenuItem.all.map(&:to_json)}
TEXT
text
end
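The conversation_item_create call in subscribed is another of the enqueue helpers from OpenAiWebsocket. Assuming it simply wraps the item in the Realtime API's conversation.item.create event, a sketch of it could look like this:
def conversation_item_create(item:, event_id:)
  enqueue_message(
    {
      "event_id": event_id,
      "type": "conversation.item.create",
      "item": item
    }
  )
end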
The channel defines several methods that let the frontend interact with the AI.
For example, append_audio
streams audio to the model, and receive
handles both
audio and text input. When the connection is closed, the WebSocket is also
cleanly shut down.
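As a rough sketch (the exact bodies live in the gist; the close method name is an assumption, and the "audio" key matches the sendAudio call we will see in the JavaScript later), those channel actions could look something like this:
def append_audio(data)
  # Forward the base64 chunk coming from the browser to the model.
  @openai_client.append_audio(audio: data["audio"], event_id: SecureRandom.uuid)
end

def unsubscribed
  # Cleanly shut down the OpenAI WebSocket when the browser disconnects.
  @openai_client&.close
end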
Now, how do we tie it all together with our frontend?
We need to turn our attention to the JavaScript files where we handle these
messages. open_ai_channel.js connects the frontend to the OpenAiChannel over
Action Cable and handles live audio streaming and real time responses from the
server.
When the page loads, it sets up a subscription to the channel. Once connected, it can send audio, start or stop an order, or restart the AI session by calling the corresponding server-side methods through perform.
It listens for data coming from the server — like AI-generated audio, a new message, or a speech event — and reacts accordingly. For example, it can play back audio or update the UI with new messages.
The core feature here is live voice interaction. When the user clicks the “start” button, the app captures microphone input using the Web Audio API, converts it to the right audio format, and continuously streams it as base64-encoded chunks to the backend through the channel.
When the user stops recording, it closes the audio stream and tells the server to end the session. Here is where the fun starts!
It’s very cool to be able to speak with an AI assistant and get your order placed, but how annoying would it be if you couldn’t hear the assistant properly? And what if you can’t hear well at all? Wouldn’t it be nice to read what the assistant is saying? This is where the transcript can help!
Adding the transcription
When an event is processed by the application we have access to different
attributes of the event. You can see how we parse the data in
setup_event_handlers, and that on each message we call the
handle_message(message) method I was talking about earlier.
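If you are curious, that handler has roughly this shape (a sketch assuming a Faye::WebSocket client running on EventMachine; the gist may wire it up slightly differently):
def setup_event_handlers
  @ws.on :message do |event|
    # Every Realtime API event arrives as a JSON string.
    message = JSON.parse(event.data)
    handle_message(message)
  end
end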
Typically, a message will have this format:
{
  "type" => "response.done",
  "event_id" => "event_BKNIpipMdMQwwxSqoY3lu",
  "response" => {
    "object" => "realtime.response",
    "id" => "resp_BKNInr0CZfV49lMSzOCps",
    "status" => "completed",
    "status_details" => nil,
    "output" => [{
      "id" => "item_BKNInXwJoIYRorEj0NiwN",
      "object" => "realtime.item",
      "type" => "message",
      "status" => "completed",
      "role" => "assistant",
      "content" => [{
        "type" => "audio",
        "transcript" => "Rawr! Welcome to Dinorex! What can I triceratops for you today?"
      }]
    }],
    "conversation_id" => "conv_BKNIkEtOGCc0EBpxcN2IM",
    "modalities" => ["audio", "text"],
    "voice" => "sage",
    "output_audio_format" => "pcm16",
    "temperature" => 0.8,
    "max_output_tokens" => "inf",
    "usage" => {
      "total_tokens" => 1077,
      "input_tokens" => 930,
      "output_tokens" => 147,
      "input_token_details" => {
        "text_tokens" => 930,
        "audio_tokens" => 0,
        "cached_tokens" => 0,
        "cached_tokens_details" => {
          "text_tokens" => 0,
          "audio_tokens" => 0
        }
      },
      "output_token_details" => {
        "text_tokens" => 34,
        "audio_tokens" => 113
      }
    },
    "metadata" => nil
  }
}
In order to capture this message and send it to the frontend we need to modify
the handle_message
method. An easy approach could be like this:
#[...rest of the method...]
elsif message["type"] == "response.output_item.done"
if message["item"]["type"] == "function_call"
puts "Function call done: #{message}"
else
transcript = message["item"]["content"][0]["transcript"]
ActionCable.server.broadcast("open_ai_#{@session_id}", {
type: "new_message",
message: transcript
})
end
else
#[...rest of the method...]
We first wait for an output item to be finished, and then check whether it is
a function call, as we do not want to send those to our channel. If the item
type is not a function call, we can broadcast the message to the channel and
include the transcription in it.
Then, in our open_ai_channel.js file, we need to handle this data type in the received(data) function:
// Within received(data) function
else if (data.type === "new_message") {
// Create a new DOM element to show the new message
const messageContainer = document.getElementById("messages");
const messageElement = document.createElement("div");
messageElement.innerText = data.message;
messageContainer.appendChild(messageElement);
}
In our view we just need a div container with the id messages to append each
message to as it arrives, and with just these two bits of code we are
transcribing the AI assistant in real time!
Get the assistant to greet the customer after clicking start
Another cool feature we can add to our AI drive-thru assistant is that the
bot welcomes the user as soon as they click the start
button.
To do this we need to extend our OpenAiWebsocket class a bit more. We need to
add a new method that we will use at the beginning of each session, when we
subscribe to the channel. It can look like this:
def new_session(event_id:)
enqueue_message(
{
"event_id": event_id,
"type": "response.create",
"response": {
"modalities": ["audio", "text"],
"instructions": "Please great the customer and ask them what they would like to order.",
"voice": "sage",
"tools": []
}
}
)
end
alias_method :greet_customer, :new_session
Then we can use this new method in the OpenAiChannel class. Here is an example:
def start_order
@openai_client.greet_customer(event_id: SecureRandom.uuid)
end
Let’s tie this up with our subscriber. In our OpenAI channel JavaScript file we need to add this new function to the subscription we create as soon as the DOM loads.
document.addEventListener("DOMContentLoaded", () => {
subscription = consumer.subscriptions.create("OpenAiChannel", {
connected() {
console.log("Connected to OpenAI channel");
},
disconnected() {
console.log("Disconnected from OpenAI channel");
},
sendAudio(audioData) {
this.perform("append_audio", { type: "audio", audio: audioData });
},
// We can add here our new function!
startOrder() {
this.perform("start_order");
},
// [...rest of the code]
Then, within our start button handler, we need to call the startOrder function
on the subscription.
startButton.onclick = async () => {
audioContext = new (window.AudioContext || window.webkitAudioContext)({
sampleRate: 24000,
});
const stream = await navigator.mediaDevices.getUserMedia({
audio: {
channelCount: 1,
sampleRate: 24000,
sampleSize: 16,
},
});
const source = audioContext.createMediaStreamSource(stream);
const processor = audioContext.createScriptProcessor(1024, 1, 1);
source.connect(processor);
processor.connect(audioContext.destination);
// Send audio chunks continuously as they are processed
processor.onaudioprocess = (e) => {
const inputData = e.inputBuffer.getChannelData(0);
const encodedAudio = base64EncodeAudio(new Float32Array(inputData));
subscription.sendAudio(encodedAudio); // Send audio data via WebSocket
};
// Here is the new code :)
subscription.startOrder();
startButton.disabled = true;
stopButton.disabled = false;
};
Almost done! Now every time we click the start button you should see a new
event hitting our new start_order method 🎉 but…wait, why
can’t we hear the assistant?
Modern browser quirks
This was a bit challenging and I spent some time debugging until I discovered what was going on. Modern browsers block auto-play of audio unless the user has interacted with the page first (like clicking or tapping).
To solve this we need to make some changes to our application.js file. First
we need to make the audioContext, audioBuffer and isPlaying variables
available globally.
window.audioContext = null;
window.audioBuffer = new Float32Array(0);
window.isPlaying = false;
And now the most important change. initAudioContext was private and
synchronous. In order to get the assistant talking as soon as we click on the
start button, we need to handle the browser requirement that an AudioContext
can only be started or resumed from a user gesture.
window.initAudioContext = async function() {
...
if (window.audioContext.state === 'suspended') {
await window.audioContext.resume();
console.log("AudioContext resumed from suspended state");
}
}
With just this change we prevent silent failures after clicking our start
button 🎉! Below I’ve attached a video example of our application.
Well done if you have made it this far! I hope you found this post interesting and that it helps you play around with the original code from the livestream by adding two cool new features!