Running AI client-side

Everyone is doing AI nowadays. While calling LLMs or diffusion models through an API is a great way to add AI to your product, it is also possible to run models client-side. This article will show you how to do it and some of the trade-offs involved.

What can be done?

Actually, quite a lot. We’re going to use the Transformers.js library to run AI client-side. It supports a variety of tasks across different categories. Some examples are:

  • Natural language processing: text classification, summarization, translation, and text generation.
  • Computer vision: image classification, object detection, and image-to-text.
  • Audio: speech recognition and text-to-speech.

You can see the full list of tasks and models here. I’ll use a Rails app to demonstrate some of the features. The full code is available in this repo.
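
Most of these tasks are exposed through a single pipeline() helper: you name a task (and optionally a model), await the download, and get back a function you can call with your input. As a quick taste of the API, here’s a minimal sentiment-analysis sketch (the input sentence and the score are just illustrative):

import { pipeline } from '@xenova/transformers';

// Download (and cache) a default model for the task, then run it on some text.
const classifier = await pipeline('sentiment-analysis');
const result = await classifier('Running models in the browser is surprisingly easy.');
// e.g. [{ label: 'POSITIVE', score: 0.99 }]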

Describing images

Writing alt text for images is important for accessibility, and AI can help give us a head start. Install the @xenova/transformers package (I used yarn, so yarn add @xenova/transformers, but you might have a different setup in your project), and you’re good to go.

The idea is to use an image-to-text model to automatically generate a description for an image when the user uploads it. Unfortunately, Action Text doesn’t have events for when an image finishes uploading (that should land in Rails 8, though), so we’ll put the description in an <output> tag and let the user copy it from there.

Assuming we have a basic Article model with a title and content, we can add the following to the app/views/articles/_form.html.erb view:

<%= form_with(
  model: article,
  data: {
    controller: "autocaption",
    action: "trix-attachment-add->autocaption#saveAttachment"
  }) do |form| %>

  <!-- the rest of the form -->

  <div>
    <%= button_tag "Describe image", type: :button, class: "secondary", data: {action: "autocaption#describeImage", autocaption_target: "trigger"} %>
    <span><strong>Description: </strong><output data-autocaption-target="output"></output></span>
  </div>
<% end %>

After setting up the controller and the targets, we save the attachment when the user uploads one. The autocaption_controller.js file will look like this:

import { Controller } from "@hotwired/stimulus"
import { pipeline, RawImage } from '@xenova/transformers';

export default class extends Controller {
  static targets = ["output", "trigger"]

  async initialize() {
    // Keep the button disabled until the user attaches an image.
    this.triggerTarget.disabled = true;
    this.captioner = await pipeline('image-to-text');
  }

  saveAttachment(event) {
    // Called via the trix-attachment-add action; keep a reference to the upload.
    this.attachment = event.attachment;
    this.triggerTarget.disabled = false;
  }

  async describeImage() {
    const previousLabel = this.triggerTarget.textContent;
    this.triggerTarget.textContent = 'Analyzing...';
    this.triggerTarget.disabled = true;

    const img = await RawImage.fromBlob(this.attachment.file);
    const caption = (await this.captioner(img))[0].generated_text;

    this.outputTarget.textContent = caption;
    this.triggerTarget.textContent = previousLabel;
  }
}
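
A quick note on the pipeline call above: pipeline('image-to-text') fetches a default checkpoint for the task. If you’d rather pin a specific model, you can pass its id as the second argument in initialize; the checkpoint below is one of the image-captioning models published under the Xenova namespace, shown as an example rather than a recommendation:

// Pin a specific image-captioning checkpoint instead of relying on the task default.
this.captioner = await pipeline('image-to-text', 'Xenova/vit-gpt2-image-captioning');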

After loading the model, we use it to describe the image and update the output text with the description. Here’s that in action:

Text to speech

Let’s use a text-to-speech model to read the article content to the users. The view is straightforward:

<article data-controller="text-to-speech">
  <header class="flex items-center gap-4">
    <h1><%= @article.title %></h1>

    <%= button_tag "Read aloud", type: :button, class: "secondary", data: { action: "text-to-speech#play", text_to_speech_target: "trigger" } %>
  </header>

  <div data-text-to-speech-target="text">
    <%= @article.content %>
  </div>

  <!-- ... -->
</article>

Again, we set up the controller and the targets. We also have a button to start the reading. The text_to_speech_controller.js file will look like this:

import { Controller } from "@hotwired/stimulus"
import { pipeline } from '@xenova/transformers';

// Pre-computed speaker embeddings that determine the voice used by the text-to-speech model.
const speaker_embeddings = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/speaker_embeddings.bin';

export default class extends Controller {
  static targets = ["text", "trigger"]

  async initialize() {
    this.triggerTarget.disabled = true;
    this.synthesizer = await pipeline('text-to-speech');
    this.triggerTarget.disabled = false;
  }

  async play() {
    const previousLabel = this.triggerTarget.textContent;
    this.triggerTarget.textContent = 'Preparing...';
    this.triggerTarget.disabled = true;

    const audioData = await this.synthesizer(this.textTarget.textContent, { speaker_embeddings });

    this.triggerTarget.textContent = 'Reading...';
    await this.#playAudio(audioData);

    this.triggerTarget.textContent = previousLabel;
    this.triggerTarget.disabled = false;
  }

  // Builds an AudioBuffer from the raw samples returned by the pipeline and
  // resolves once playback has finished, so the button stays in the "Reading..." state.
  #playAudio(audioData) {
    return new Promise((resolve) => {
      const audioContext = new (window.AudioContext || window.webkitAudioContext)();
      const audioBuffer = audioContext.createBuffer(1, audioData.audio.length, audioData.sampling_rate);
      audioBuffer.copyToChannel(audioData.audio, 0, 0);

      const source = audioContext.createBufferSource();
      source.buffer = audioBuffer;
      source.connect(audioContext.destination);
      source.onended = resolve;

      source.start();
    });
  }
}

Pretty similar to the previous one, but with some extra work to play the audio; in a production scenario you could instead save the audio as a WAV file and play that (there’s a sketch of this after the demo). It will look like this:
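
As mentioned above, instead of building an AudioBuffer by hand you could encode the generated samples as a WAV blob and hand it to an <audio> element, or upload it to the server. Here’s a rough sketch under that assumption; the encodeWav helper is hand-rolled 16-bit mono PCM (not part of Transformers.js), and the usage lines reuse the audioData object from the controller above:

// Hand-rolled 16-bit mono PCM WAV encoding; encodeWav is illustrative, not a library function.
function encodeWav(samples, sampleRate) {
  const buffer = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buffer);
  const writeString = (offset, text) => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeString(0, 'RIFF');
  view.setUint32(4, 36 + samples.length * 2, true);  // total file size minus 8
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true);                      // fmt subchunk size
  view.setUint16(20, 1, true);                       // PCM format
  view.setUint16(22, 1, true);                       // mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true);          // byte rate
  view.setUint16(32, 2, true);                       // block align
  view.setUint16(34, 16, true);                      // bits per sample
  writeString(36, 'data');
  view.setUint32(40, samples.length * 2, true);

  // Clamp each float sample to [-1, 1] and convert to signed 16-bit integers.
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
  }

  return new Blob([view], { type: 'audio/wav' });
}

// Usage with the pipeline output: play it directly, or upload the blob instead.
const wav = encodeWav(audioData.audio, audioData.sampling_rate);
new Audio(URL.createObjectURL(wav)).play();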

When is client-side AI useful?

Let’s look at the pros and cons of running AI client-side to understand when it is useful.

Pros

  • Privacy: no data is sent to a server. Users don’t have to worry about their data being stored or shared. This is especially important for sensitive data or in countries with strict data protection laws.
  • Cost: no need to pay for GPU servers or per-request API fees, since the model runs in the users’ browser. Useful for small projects or prototypes.
  • Offline: once the model is downloaded, it can run offline. This is useful for PWAs, mobile apps, or as a fallback for an external API not being available.

Cons

  • Performance: client-side AI is often slower than server-side AI, and it depends heavily on the user’s device, so the experience can range from snappy to painfully slow.
  • Quality: client-side AI is often less accurate than server-side AI, since the models have to be much smaller to run on the client. If you need high accuracy, this is not the way to go.
  • Bandwidth: client-side AI requires the model to be downloaded, which can be a large transfer on a slow connection. The models aren’t shared across websites, which is taxing on the user’s bandwidth; this is one of the reasons why Chrome is experimenting with shipping an AI model with the browser. One way to soften the download cost is sketched after this list.
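
On that last point, the download only has to happen once per origin: as far as I can tell from the @xenova/transformers docs, downloaded model files are kept in the browser cache, and the env settings let you serve the files from your own app instead of the Hugging Face Hub. Treat the property names below (localModelPath, allowRemoteModels, useBrowserCache) as assumptions to verify against the version you’re using:

import { env } from '@xenova/transformers';

// Serve model files from your own origin instead of fetching them from the Hub.
env.localModelPath = '/models/';   // where you host your copies of the model files
env.allowRemoteModels = false;     // don't fall back to downloading from the Hub
env.useBrowserCache = true;        // keep downloads in the browser cache between visits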

You are now equipped to make a better decision on whether or not to use client-side AI. If privacy and performance are both concerns, you might also consider hosting your own models, or even using transformers in your backend.