The context of this story is me joining an experiment with LLMs run by colleagues. They have Mac computers; I have a computer that runs on PopOS. These days, this is very rarely an issue; we have tools to handle different environments. For some reason, this is not entirely the case with LLMs. If I were to guess, it is because we’re still in the early days.
So the application is communicating with an LLM via Ollama. Ollama seems amazing; it’s a terminal app that allows you to run quite a few LLMs in your terminal and can also be used as a server locally.
Except that it doesn’t work on all Linux computers. You might have heard that LLMs work best on GPUs. Well, NVIDIA GPU drivers have compatibility issues with Linux computers. Installation can be challenging, and there are sometimes compatibility issues with kernels. So I installed (again) an NVIDIA driver on my laptop, and I quickly got an ‘EOF error’ when playing with Ollama, an issue that is known by the maintainers.
To fix the EOF error, I thought of a simple solution. My colleagues are using Ollama as an API. As we have an Ollama:Client class in our code base and the API URL as an ENV variable, I could deploy Ollama on a server instead of pointing to my localhost port.
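The idea of switching hosts through configuration can be sketched as follows. This is a minimal Python sketch, not the code from our code base: the `OLLAMA_URL` variable name is the one used later in this story, port 11434 and the `/api/generate` route are Ollama's defaults, and the function names and the model name are made up for illustration.

```python
import json
import os
import urllib.request

# Ollama listens on port 11434 by default; the env variable overrides it.
DEFAULT_OLLAMA_URL = "http://localhost:11434"


def ollama_base_url() -> str:
    """Return the Ollama base URL, preferring the OLLAMA_URL env variable."""
    return os.environ.get("OLLAMA_URL", DEFAULT_OLLAMA_URL).rstrip("/")


def build_generate_request(prompt: str, model: str = "mistral") -> urllib.request.Request:
    """Build (but do not send) a POST request for Ollama's /api/generate route."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{ollama_base_url()}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout: float = 500.0) -> str:
    """Send the request and return the generated text from the JSON body."""
    with urllib.request.urlopen(build_generate_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With this shape, moving from a local Ollama to a remote one only means changing `OLLAMA_URL`; nothing in the calling code has to know where the server lives.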
“Get up and running with large language models locally.” That’s how Ollama describes itself on the website and in its GitHub repository. Somehow, I didn’t read that correctly. Or I decided to ignore it. Or maybe it was just denial.
Ollama has a Dockerfile, so I thought I’d try to deploy with Docker on Heroku. Heroku was not happy: the image was too big for Heroku’s taste. So I looked, but not for too long, for another simple way to deploy somewhere. I found an article stating that fly.io now offered GPUs. That is not available to everybody yet, as the offering is still in beta.
So I thought that I’d rely less on a service and give AWS a try. My main experience is with the AWS EC2 service, so I decided to run an instance. This instance was a simple Linux instance that wasn’t too big. From AWS, I connect to the instance terminal, and I run the curl command that downloads the Ollama installer and installs it.
I decide to run ollama after it is installed, and then I have the same thing I have
require "net/http"
require "uri"

uri = URI.parse(ENDPOINT_URL)
http = Net::HTTP.new(uri.host, uri.port)
http.read_timeout = 500 # seconds to wait for a response before giving up
That was not an acceptable solution.
Huggingface has become the de facto hub for LLMs. Think of Huggingface as a GitHub for LLMs. And it does more: Huggingface offers ways to deploy models. This was the solution, I thought.
Deploying straight from the platform: that’s amazing. Just three clicks later, I had a URL I could communicate with, on an instance that shuts itself off automatically if inactive for 15 minutes (meaning lower costs).
I made a little curl request to the URL Huggingface provided for me, and everything ran smoothly.
So I’m happy and optimistic, and I change the OLLAMA_URL in my .env file and restart the app.
It simply couldn’t be that simple. After all this time, the simple solution was not even that simple.
The issue was that Ollama has an API contract, and Huggingface inference endpoints have another.
Unlike Ollama, Huggingface requires an authorization key in the headers.
That’s the first issue, and that’s an easy one. I’ll just add another variable in the .env file, write the code in the request to add that header attribute, and we’re good to go.
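A minimal sketch of that change, written in Python for illustration rather than in the language of our app. The variable names `HF_ENDPOINT_URL` and `HF_API_TOKEN` and the function name are hypothetical; the `Authorization: Bearer` header and the `inputs` field follow what Huggingface inference endpoints expect for text generation.

```python
import json
import os
import urllib.request


def build_hf_request(prompt: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated request to an inference endpoint."""
    url = os.environ["HF_ENDPOINT_URL"]  # hypothetical variable name
    token = os.environ["HF_API_TOKEN"]   # hypothetical variable name
    return urllib.request.Request(
        url,
        data=json.dumps({"inputs": prompt}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Keeping the token in the environment means it stays out of the code base, the same way the API URL does.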
And no, it’s still not that simple.
The whole contract is different. The format of the response is different. Not totally different but different enough to be incompatible.
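To make the mismatch concrete: for text generation, a Huggingface endpoint typically answers with a list such as `[{"generated_text": "..."}]`, while Ollama's `/api/generate` route (with streaming off) answers with an object whose text sits under `response`. A small adapter could bridge the two. This is a sketch under those assumptions, with a hypothetical function name, and it only covers the handful of fields shown.

```python
from datetime import datetime, timezone


def hf_to_ollama(hf_response: list, model: str) -> dict:
    """Reshape a Huggingface text-generation response into an Ollama-like one."""
    # Huggingface returns a list of generations; take the first one.
    generated = hf_response[0]["generated_text"]
    return {
        "model": model,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "response": generated,
        "done": True,
    }
```

Something like this would have to sit between the endpoint and the client, which is one more piece to write and maintain.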
So why not create my own Python API that I can run locally?
It’s simple enough, combining code given by Huggingface with an API framework like FastAPI.
The code is only a few lines long and easy to understand.
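from typing import Union

from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()
model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
# Downloads the weights on first run, then loads them from the local cache.
model = AutoModelForCausalLM.from_pretrained(model_id)
# The tokenizer is loaded from the model id, not from the model object.
tokenizer = AutoTokenizer.from_pretrained(model_id)

@app.post("/generate")
def read_generate(message: str):
    inputs = tokenizer(message, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    # generate returns a batch of sequences; decode the first one.
    return_message = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Return the generated text as JSON (the key name is arbitrary).
    return {"message": return_message}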
With these 13 lines, in theory, I can query http://127.0.0.1:8000/generate with a POST request, pass a message, and receive a generated message from this model.
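For reference, the request could look like the sketch below. FastAPI treats a plain `str` parameter such as `message` as a query parameter, so the message travels in the URL rather than in a JSON body. The helper names are hypothetical, and the shape of the returned JSON depends on what the endpoint returns.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:8000"


def build_generate_url(message: str) -> str:
    """Put the message in the query string, URL-encoded."""
    return f"{BASE_URL}/generate?{urllib.parse.urlencode({'message': message})}"


def ask(message: str) -> dict:
    """POST to the local API and return the decoded JSON body."""
    request = urllib.request.Request(build_generate_url(message), method="POST")
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```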
I say this in theory, because if you run this code, you’ll find yourself downloading 19 files of about 4.98 GB each to get the model from Huggingface. Which is easy enough with a stable and fast connection.
Another way is to find a gguf version of the model, which is lighter than the usual trained model format. And instead of using the Huggingface transformers library, you can use the ctransformers library. But then again, doing this, I ran into NVIDIA issues.
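For the curious, loading a gguf file with ctransformers could look like the sketch below. It is not the code I ran: the function name is hypothetical, the repository and file names are left for you to fill in with the gguf build you picked, and `gpu_layers=0` is an assumption that running on the CPU only is acceptable, which sidesteps the GPU driver at the cost of speed.

```python
def load_gguf_model(repo_id: str, model_file: str, model_type: str = "mistral"):
    """Load a gguf model on the CPU with ctransformers."""
    # Imported here so this module can be read without ctransformers installed.
    from ctransformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        repo_id,                # a Huggingface repository id or a local path
        model_file=model_file,  # which gguf file to pick inside the repository
        model_type=model_type,
        gpu_layers=0,           # 0 keeps every layer on the CPU
    )


# Usage, once a model is chosen:
#   llm = load_gguf_model("<repo id>", "<file name>.gguf")
#   print(llm("Tell me a joke.", max_new_tokens=20))
```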
Locally, I’m now using LM Studio, which downloads models in ‘gguf’ format and runs them on my Linux machine without issues. It can also run a local web server, with yet another response format. I’m not going to deploy an application like LM Studio, though, even if that would mean having the same response locally as in production. The reason is that its server is not customizable and, just like my issue with Ollama, I would at some point need a token in my headers to secure the API.
Since then, I have also come across other solutions like AWS SageMaker, the AWS service dedicated to anything machine learning. It is still the Wild West out there, though, when it comes to testing locally and having the same thing locally as you have in production. I’d guess that things will improve in the near future, and that these issues of inconsistent contracts and local environment ease of use will be addressed as the tooling continues to develop quickly.