Understanding open source LLMs

You’ve been exploring Large Language Models (LLMs) and tinkering with them locally, but you’re still not clear on some of the associated terminology. If you’re new to experimentation, consider reading my wonderful colleague Jose Blanco’s post on using open-source LLMs both locally and remotely first. This post aims to guide developers in making informed decisions when selecting LLM models that meet their needs. It assumes familiarity with the developer-friendly Ollama or an equivalent LLM backend, as well as some experience running a model on your machine.

Base, Instruct, Chat and Code models

  1. Base (or Text) - the foundational model, trained on large amounts of raw text; it excels at completing text by predicting the next set of words. It’s relatively easy to fine-tune this model into an Instruct, Chat, or Code model. If we’re writing a novel, this model can be particularly useful as it tends to be more creative in its output.
  2. Instruct - a fine-tuned version of the Base model, optimised to follow instructions and answer questions. It’s designed to provide accurate and concise responses to single-turn questions (see the quick comparison after this list). If we’re looking for precise answers in domains such as math, science, or coding, this model would be an ideal choice due to its ability to provide targeted and informative responses.
  3. Chat - designed for conversational purposes, excelling in multi-turn dialogues where maintaining contextual awareness is crucial. This type is well-suited for applications that require sustained conversations, such as chatbots or role-playing scenarios.
  4. Code - shares similarities with Instruct, as it’s also a fine-tuned version of the Base model. However, this type has been specifically trained on code and optimised for “Fill-In-the-Middle” (FIM) tasks, where it completes partially written code snippets. Tools such as GitHub Copilot, which assist developers with intelligent suggestions and completions, are built on models of this kind. CodeQwen, DeepSeek Coder and Codestral are some popular open source LLMs that offer coding capabilities.
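
A quick way to feel the difference, assuming the mistral tags shown later in this post are available locally, is to give the same style of prompt to the text (Base) and instruct variants; the prompts below are just placeholders:

# Base (text) model: simply continues the text it is given
ollama run mistral:text "Once upon a time in a quiet village"

# Instruct model: treats the input as a request and answers it
ollama run mistral:instruct "Summarise the story of Cinderella in two sentences."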

In Introducing Meta Llama 3, Meta explains how the Base model was fine-tuned into an Instruct model; the post is worth a read on one of the best open source LLMs available as of now.

As Ollama users, we can easily distinguish between these variants by opening a model’s list of tags and checking the tag names associated with each version.

Examples: mistral with tags and command-r with tags

# Mistral
latest
2ae6f6dd7a3d • 4.1GB • Updated 2 weeks ago

7b
2ae6f6dd7a3d • 4.1GB • Updated 2 weeks ago

instruct
2ae6f6dd7a3d • 4.1GB • Updated 2 weeks ago

text
495ae085225b • 4.1GB • Updated 2 months ago

When we run any of the following ollama commands, the same model is downloaded from the repository.

ollama run mistral
ollama run mistral:latest
ollama run mistral:instruct

This means that unless we’re interested in fine-tuning the base (text for mistral) model, Ollama by default downloads and runs the instruct model of mistral, as that is the most helpful model for end users. This can be confirmed from the digest 2ae6f6dd7a3d, which is identical for the latest, 7b and instruct tags. In contrast, the text tag has a different digest, 495ae085225b, because it points to the base model, which is mainly useful as a starting point for fine-tuning.
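
If both variants are already pulled, we can confirm this locally as well: ollama list prints the ID (the digest prefix) that each tag resolves to, so mistral, mistral:latest and mistral:instruct should all show the same ID, while mistral:text shows a different one. The output below is illustrative; columns and dates will differ on your machine.

ollama pull mistral:instruct
ollama pull mistral:text
ollama list
# NAME                ID              SIZE      MODIFIED
# mistral:instruct    2ae6f6dd7a3d    4.1 GB    ...
# mistral:text        495ae085225b    4.1 GB    ...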

Parameters or Model size (8B, 14B, 70B and more)

Large AI models boast billions of parameters, which makes them better at capturing complex patterns and relationships in data. However, they require significant computing resources, involving GPUs and TPUs, during training, and also need expensive hardware to run once trained. Smaller models, in contrast, require fewer resources to train and run, but they’re generally less capable. They struggle with complex inferences that demand a deep understanding of context, but are well suited to running on everyday machines such as our laptops. Having many parameters doesn’t always make a model the best; the quality of the data used to train it matters at least as much as the quantity. For now, though, let’s assume that more parameters mean a better model.
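
As a very rough rule of thumb (an assumption that ignores runtime overhead), the memory a model needs is its parameter count multiplied by the bytes stored per parameter. At 16-bit precision that works out to roughly the following; the downloads listed below are much smaller because Ollama ships quantised weights by default, which a later section covers.

# Rough estimate: parameters x bytes per parameter (2 bytes per weight at 16-bit precision)
echo "8B  model: ~$((8 * 2)) GB of memory at 16-bit precision"
echo "70B model: ~$((70 * 2)) GB of memory at 16-bit precision"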

When we open Llama 3 on Ollama, we notice that there are two Llama 3 models: 8B and 70B

# Llama 3
70b
786f3184aec0 • 40GB • Updated 2 weeks ago

8b
365c0bd3c000 • 4.7GB • Updated 2 weeks ago

The 70 billion parameter model is larger and more capable, but it also needs far more resources to run (the download alone is 40 GB, and it needs at least that much memory), whereas the 8 billion parameter model is smaller and lighter and can run with far fewer resources (a 4.7 GB download).

# To download the 8B model
ollama run llama3:8b

# To download the 70B model
ollama run llama3:70b
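
Once a model has been loaded by one of the commands above, recent versions of Ollama can also report how much memory it actually occupies and whether it is running on the GPU or falling back to the CPU; the output shown here is purely illustrative.

ollama run llama3:8b "Hello"

# In another terminal (or straight after), list the loaded models
ollama ps
# NAME         ID              SIZE      PROCESSOR    UNTIL
# llama3:8b    365c0bd3c000    ...       100% GPU     ...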

Quantisation

Quantisation is a technique that reduces the precision of a model’s weights (for example, from 16-bit floating point numbers to 4-bit integers), shrinking its memory footprint so that large language models can run on devices with limited memory and resources, often with faster inference as well. Quantisation is a complex topic, but as a general rule, higher-bit quantisation levels tend to produce more accurate results. For most of the models available on Ollama, we find quantisation levels such as fp16, q8_0, q6_K, q5_0, q4_0 and so on.
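
A back-of-envelope calculation (treating each level as simply the number of bits stored per weight, which is a simplification) shows why the file sizes in the listing below line up the way they do for an 8B-parameter model:

# Approximate size of an 8B-parameter model at different quantisation levels
echo "fp16 (2 bytes per weight): ~$((8 * 2)) GB"    # the 8b-instruct-fp16 tag is 16GB
echo "q8_0 (1 byte per weight):  ~$((8 * 1)) GB"    # the 8b-instruct-q8_0 tag is 8.5GB
echo "q4_0 (half a byte per weight): ~4 GB"         # the 8b-instruct-q4_0 tag is 4.7GB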

On Llama3 with tags, we can find various versions of the “Llama 3 8B” model at different levels of quantisation.

# Llama 3 8B
8b-instruct-q4_0
365c0bd3c000 • 4.7GB • Updated 2 weeks ago

8b-instruct-q4_1
ec0a6ea1fb9b • 5.1GB • Updated 2 weeks ago

8b-instruct-q5_0
a13deb0590c7 • 5.6GB • Updated 2 weeks ago

8b-instruct-q5_1
662158bc9277 • 6.1GB • Updated 2 weeks ago

8b-instruct-q8_0
1b8e49cece7f • 8.5GB • Updated 2 weeks ago

8b-instruct-q2_K
2745d547f376 • 3.2GB • Updated 2 weeks ago

8b-instruct-q3_K_S
0891d012c467 • 3.7GB • Updated 2 weeks ago

8b-instruct-q3_K_M
726a1960bfed • 4.0GB • Updated 2 weeks ago

8b-instruct-q3_K_L
df7624806f55 • 4.3GB • Updated 2 weeks ago

8b-instruct-q4_K_S
469db3d82576 • 4.7GB • Updated 2 weeks ago

8b-instruct-q4_K_M
9b8f3f3385bf • 4.9GB • Updated 2 weeks ago

8b-instruct-q5_K_S
7eae3176f5b5 • 5.6GB • Updated 2 weeks ago

8b-instruct-q5_K_M
af3083b17cd4 • 5.7GB • Updated 2 weeks ago

8b-instruct-q6_K
e4a3943fcd76 • 6.6GB • Updated 2 weeks ago

8b-instruct-fp16
c666fe422df7 • 16GB • Updated 2 weeks ago

Going by the quantisation levels above, 8b-instruct-q8_0 should be more accurate than 8b-instruct-q6_K, which in turn should be more accurate than 8b-instruct-q4_0, and so on. By default, Ollama uses a 4-bit (q4_0) quantised model for any LLM. For example, running ollama run llama3 will download and run the 4-bit (q4_0) quantised model of “Llama 3 8B”, but we can choose the 8-bit (q8_0) quantised model by running ollama run llama3:8b-instruct-q8_0.
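
Putting that together, switching quantisation level is just a matter of picking a different tag:

# Default: the 4-bit (q4_0) quantised instruct model
ollama run llama3

# Explicitly pick the 8-bit quantisation for better accuracy at a larger size
ollama run llama3:8b-instruct-q8_0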

Context size (8k, 32k, 128k)

In the LLM world, context size is a critical parameter that refers to the amount of text the model can take into account at once. The context size is typically measured in tokens, where a token is usually a word, part of a word, or a punctuation mark.

Example: “I enjoyed reading this informative blog post!” breaks down into roughly 8 tokens, with each word and the exclamation mark counted as individual tokens (the exact split depends on the model’s tokeniser).
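
If Ollama is running locally, one way to see how many tokens a prompt becomes for a particular model is the prompt_eval_count field returned by the /api/generate endpoint; the exact count depends on the model’s tokeniser (and any prompt template applied), so it may not be exactly 8 for the sentence above.

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "I enjoyed reading this informative blog post!",
  "stream": false
}'
# The JSON response includes "prompt_eval_count": the number of prompt tokens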

LLM tasks such as coding that require a deeper understanding of a codebase might benefit from a larger context size. Similarly, for conversational tasks, a larger context size can help the LLM capture the nuances of the conversation and respond more coherently and consistently as it progresses. For simpler single-turn or fact-based questions, a smaller context size is often sufficient and typically yields a faster response. A larger context size isn’t automatically better, but it generally gives an LLM more room to perform well when the task calls for it.
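
To make the conversational case concrete: with Ollama’s /api/chat endpoint the full message history is sent on every request, so the whole conversation has to fit within the model’s context window. A minimal sketch, assuming llama3 is available locally (the messages are placeholders):

curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    {"role": "user", "content": "My name is Asha and I am planning a trip to Japan."},
    {"role": "assistant", "content": "Great! How long will you be travelling for?"},
    {"role": "user", "content": "Two weeks. By the way, what was my name again?"}
  ],
  "stream": false
}'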

Using Phi3 with tags, we can download and test “Phi3” models that support different context sizes.

# Phi-3-medium-14b
14b-medium-128k-instruct-q4_0
3aeb385c7040 • 7.9GB • Updated 13 days ago

14b-medium-4k-instruct-q4_0
1e67dff39209 • 7.9GB • Updated 2 weeks ago

# To download the 128k context size model
ollama pull phi3:14b-medium-128k-instruct-q4_0

# To download the 4k context size model
ollama pull phi3:14b-medium-4k-instruct-q4_0
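
Note that pulling a model tagged 128k does not by itself mean Ollama will use the full window; at the time of writing Ollama loads models with a small default context and lets us raise it ourselves. One way is a custom Modelfile using the documented num_ctx parameter; the 32768 below is just an example value, and larger windows need correspondingly more memory.

# Create a variant of the 128k Phi-3 model with a 32k context window
cat > Modelfile <<'EOF'
FROM phi3:14b-medium-128k-instruct-q4_0
PARAMETER num_ctx 32768
EOF

ollama create phi3-32k -f Modelfile
ollama run phi3-32k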

Licenses

Before utilising such open source LLMs for any work or project (language assistance, coding assistance, commercial deployment), it is essential to review the license terms to ensure compliance. Some examples:

MIT License
Mistral AI Non-Production License 
META LLAMA 3 COMMUNITY LICENSE AGREEMENT

Licenses can vary widely, ranging from permissive open-source licenses like MIT or Apache to more restrictive proprietary licenses. Understanding the license terms ensures that developers and product owners can use a model without legal repercussions.
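
For models that are already pulled, recent versions of Ollama can print the license text bundled with the model, which is a convenient starting point (the model’s page on Ollama and the upstream license remain the authoritative sources):

# Show the license shipped with a local model
ollama show llama3 --license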

Conclusion

If, at the end of this post, codeqwen:7b-code-v1.5-q8_0 holds meaning for you, then this post has achieved its goal. We should hopefully feel empowered to experiment with open-source large language models on our machines, confident in our understanding of the type, parameter count, quantisation and context size that define them. Happy experimenting with LLMs!