How to use an open source LLM model locally and remotely

Jose Blanco

When I first started researching how to use an open source AI model, it seemed daunting. There is a lot of fragmented information on the internet from many different sources, making it difficult to start your project quickly.

The goal of this post is to be one easy-to-read article that helps you set up and run an open source AI model locally using Ollama, a wrapper around the model. We will also cover how to install Ollama on a virtual machine and access it remotely.

What is Ollama?

Ollama is a tool that helps us run large language models on our local machine and makes experimentation more accessible. Among other features, it exposes an endpoint that we can use to interact with a model. In this tutorial, we will use the /api/chat endpoint.

Let’s start!

First, we will need to download and install Ollama:

curl -fsSL https://ollama.com/install.sh | sh

Once it is installed, we need to pull one of the models that Ollama supports and that we would like to run.

In our case, we will use openhermes2.5-mistral. OpenHermes 2.5 is a fine-tuned version of Mistral 7B. It was trained on 1,000,000 entries of primarily GPT-4 generated data, as well as other high-quality datasets, and according to its model card it outperforms many comparable open models of its size on various benchmarks.

To pull this model, we need to run the following command in our terminal:

ollama pull openhermes2.5-mistral
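
To confirm that the download finished, we can ask Ollama to list the models it has available locally:

ollama list

This prints the name, size, and last-modified time of every model stored on the machine.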

To test that everything is working as expected, we can use our terminal. First, start Ollama by running ollama serve. Then, in a separate terminal window, let's try the following:

curl http://localhost:11434/api/chat -d '{
    "model": "openhermes2.5-mistral",
    "messages": [{ "role": "user", "content": "Are you a robot?" }]
}'

This is the type of response we will get; by default, the reply is streamed token by token:

{"model":"openhermes2.5-mistral","created_at":"2024-02-06T19:02:41.653849Z",
"message":{"role":"assistant","content":"I"},"done":false}
{"model":"openhermes2.5-mistral","created_at":"2024-02-06T19:02:41.736723Z",
"message":{"role":"assistant","content":" am"},"done":false}
{"model":"openhermes2.5-mistral","created_at":"2024-02-06T19:02:41.82815Z",
"message":{"role":"assistant","content":" not"},"done":false}

Thankfully, in this call to the /api/chat endpoint we can add some parameters. In our case, we will pass the stream: false parameter so that we get the whole response in one go instead of a stream of tokens.

curl http://localhost:11434/api/chat -d '{
    "model": "openhermes2.5-mistral",
    "messages": [{ "role": "user", "content": "Are you a robot?" }],
    "stream": false
}'

This is the typical response we will get back from the endpoint:

{"model":"openhermes2.5-mistral",
  "created_at":"2024-01-30T11:52:56.244775Z",
  "message":{"role":"assistant",
  "content":"No, I'm not a robot. I am an AI-powered chatbot designed to provide
   helpful information and engage in conversation with users like yourself. 😊},
  "done":true,
  "total_duration":22872800167,
  "load_duration":7669185084,
  "prompt_eval_count":21,
  "prompt_eval_duration":284746000,
  "eval_count":195,"eval_duration":14899744000}

Amazing! That is how easily we can interact with an open source model on our local machine and start playing around with it. This response is generated using our local machine's computing power, so what about running the model in a virtual machine?

Using Digital Ocean to install any LLM on our server

One of the easiest (and cheapest) ways I’ve found to set up Ollama with an open source model on a virtual machine is by using Digital Ocean’s droplets. Droplet is simply Digital Ocean’s name for a virtual machine.

First, we need to open an account with them and add a payment method. Normally, adding $5 is more than enough to play around and launch a virtual machine.

Once we have done that, we will have access to our projects. By default, there is already one project created for us named first-project. Let’s click on it and then on Spin up a Droplet. The following page is where we can configure our virtual machine.

These machines are CPU-based and lack a GPU, so you can anticipate a slightly slower response from the model compared to your own machine.

With this done, we just need to set up a password or an SSH key and create the virtual machine by clicking the Create Droplet button!
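
As a side note, if you prefer the command line over the web console, Digital Ocean's doctl CLI can create the same droplet. This is only a rough sketch; the name, region, image slug, and size below are placeholders that you should check against doctl's own listing commands:

doctl compute droplet create ollama-server \
    --region nyc1 \
    --image ubuntu-22-04-x64 \
    --size s-2vcpu-4gb \
    --ssh-keys your_key_fingerprint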

Now the fun starts!

Once our virtual machine is created, it will be assigned an IP address that we can access via SSH. To do this, just open your terminal and run ssh root@ip_of_your_address.
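
If the SSH key you registered with Digital Ocean is not your default one, you can point ssh at it explicitly (the key path here is just an example):

ssh -i ~/.ssh/id_ed25519 root@ip_of_your_address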

Depending on what type of security we have set up, we will be asked to enter the password or it will log us in using our SSH key. Nice, you are inside your virtual machine!

How to set up Ollama in the virtual machine

Setting up Ollama in the virtual machine is quite similar to installing it locally. Access the virtual machine with ssh root@ip_of_your_address and download Ollama:

curl -fsSL https://ollama.com/install.sh | sh

Notice that after the installation we get a log line telling us where we can access the Ollama API: >>> The Ollama API is now available at 0.0.0.0:11434.

We need to stop the Ollama service, as we will restart it with an environment variable set:

service ollama stop

Now we need to set the OLLAMA_HOST environment variable and start Ollama. Binding it to 0.0.0.0:11434 makes Ollama listen on all network interfaces, so the API can be reached from outside the droplet.

OLLAMA_HOST=0.0.0.0:11434 ollama serve
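
Before going any further, you can quickly check from your local machine that the API is now reachable from outside the droplet; it should answer with a short message along the lines of "Ollama is running":

curl http://my_virtual_machine_ip:11434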

Nice! Ollama is now running in the virtual machine. Without closing that window, open another terminal window, access the virtual machine again with ssh root@ip_of_your_address, and pull the model:

ollama pull openhermes2.5-mistral

Once this is completed, let’s open another terminal window on your local machine and try the following:

curl http://my_virtual_machine_ip:11434/api/chat -d '{
    "model": "openhermes2.5-mistral",
    "messages": [{ "role": "user", "content": "Hello" }],
    "stream": false
}'

Amazing! You will get a response, but this time Ollama is running in a virtual machine instead of on your local machine.

What if we want to run our model on the server forever?

Currently, when we close the virtual machine terminal, the ollama serve process we started in the foreground stops and we can no longer call our endpoint.

If we want the virtual machine to keep running Ollama with the open source model non-stop, we can do the following. Run each command separately:

curl https://ollama.ai/install.sh | sh

ollama pull openhermes2.5-mistral

service ollama stop

nohup env OLLAMA_HOST=0.0.0.0:11434 ollama serve &

nohup is a command available on Unix-based systems, such as our Ubuntu droplet, that keeps a process running even after you exit the terminal by preventing it from receiving the HUP (hangup) signal. The trailing & sends the process to the background, and its output is appended to a file named nohup.out.

This way, Ollama keeps running in the background and we can close the terminal window without stopping the service. Now go ahead and call the endpoint from your local machine again.

Voilà! You will get a response from the model running in your virtual machine. This is great, as we can now access our model from anywhere, at any time!
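
As an alternative to nohup: the install script registers Ollama as a systemd service on Ubuntu (which is why service ollama stop works), so we can instead configure the service itself to listen on all interfaces and let systemd keep it running and restart it after a reboot. A sketch, assuming the standard ollama.service unit created by the installer:

systemctl edit ollama.service
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0:11434"
systemctl daemon-reload
systemctl restart ollama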

Conclusion

As we have seen, getting started with Ollama and an open source LLM to poke around and see what we can do is straightforward. This is a great first step towards creating an application that uses real AI.
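
As a small next step, the same /api/chat endpoint also accepts a system message that shapes the model's behaviour, which is handy when you start building on top of it. For example (the prompt text here is just an illustration):

curl http://my_virtual_machine_ip:11434/api/chat -d '{
    "model": "openhermes2.5-mistral",
    "messages": [
        { "role": "system", "content": "You are a concise assistant that answers in one sentence." },
        { "role": "user", "content": "What is Ollama?" }
    ],
    "stream": false
}'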