---
title: Understanding open source LLMs
teaser: Do you think you can run any Large Language Model (LLM) on your machine?
tags: open source,artificial intelligence,language models,ollama
author: Rakesh Arunachalam
published_on: 2024-06-17
---

You've been exploring Large Language Models (LLMs) and tinkering with them
locally, but you're still not clear on some of the associated terminology. If
you're new to experimentation, consider reading my wonderful colleague
[Jose Blanco]'s post on [using open-source LLMs both locally and remotely]
first. This post aims to guide developers in making informed decisions when
selecting LLMs that meet their needs. It assumes familiarity with the
developer-friendly [Ollama] or an equivalent LLM backend, as well as some
experience running a model on your machine.

[Jose Blanco]: https://thoughtbot.com/blog/authors/jose-blanco
[using open-source LLMs both locally and remotely]: https://thoughtbot.com/blog/how-to-use-open-source-LLM-model-locally
[Ollama]: https://ollama.com/

## Base, Instruct and Chat models

1. **Base (or Text)** - designed for foundational purposes, excels at completing
   text and predicting the next set of words. It's relatively easy to train and
   adapt this model into an **Instruct**, **Chat**, or **Code** model. If we're
   writing a novel, this model can be particularly useful as it tends to be more
   creative in its output.
2. **Instruct** - fine-tuned version of the **Base** model, specifically
   optimised for Question & Answer tasks. It's designed to provide accurate and
   concise responses to single-turn questions. If we're looking for precise
   answers in domains such as math, science, or coding, this model would be an
   ideal choice due to its ability to provide targeted and informative
   responses.
3. **Chat** - designed for conversational purposes, excelling in multi-turn
   dialogues where maintaining contextual awareness is crucial. This type is
   well-suited for applications that require sustained conversations, such as
   chatbots or role-playing scenarios.
4. **Code** - shares similarities with **Instruct**, as it's also a fine-tuned
   version of the **Base** model. However, this type has been specifically
   trained on code data and optimised for "Fill-In-the-Middle" (FIM) tasks,
   where it completes partially written code snippets. Tools such as GitHub
   Copilot rely on models of this kind to assist developers in writing code by
   providing intelligent suggestions and completions. [CodeQwen],
   [DeepSeek Coder] and [Codestral] are some popular open source LLMs that
   offer coding capabilities.

In [Introducing Meta Llama 3], Meta explains how the **Base** model was trained
to become an **Instruct** model. The post is worth a read, as it covers one of
the best open source LLMs available as of now.

As Ollama users, we can easily distinguish between these variants by opening a
model's tags page and looking at the tags listed for it.

Examples: [mistral with tags] and [command-r with tags]

```sh
# Mistral
latest
2ae6f6dd7a3d • 4.1GB • Updated 2 weeks ago

7b
2ae6f6dd7a3d • 4.1GB • Updated 2 weeks ago

instruct
2ae6f6dd7a3d • 4.1GB • Updated 2 weeks ago

text
495ae085225b • 4.1GB • Updated 2 months ago
```

When we run any of the following `ollama` commands, the same model is downloaded
from the repository.

```sh
ollama run mistral
ollama run mistral:latest
ollama run mistral:instruct
```

This means that Ollama by default downloads and runs the **instruct** variant
of mistral, as that is the most helpful one for end users. We can confirm this
from the hash `2ae6f6dd7a3d`, which is shared by the `latest`, `7b` and
`instruct` tags. In contrast, the **text** tag has a different hash,
`495ae085225b`, which tells us it points to a different model: the **base**
model, which is mainly of interest if we want to fine-tune it ourselves.
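
The hash comparison above can also be done mechanically. The following is a
minimal sketch that groups tags by their hash. The tag and hash pairs are copied
by hand from the listing above, and `group_tags_by_digest` is a hypothetical
helper name used for illustration, not something Ollama provides.

```sh
# Group tags that share a hash.
# Input (stdin): one "<tag> <hash>" pair per line.
# Output: one "<hash>: <tag> <tag> ..." line per hash, in first-seen order.
group_tags_by_digest() {
  awk '
    {
      if ($2 in tags) {
        tags[$2] = tags[$2] " " $1
      } else {
        order[++n] = $2
        tags[$2] = $1
      }
    }
    END { for (i = 1; i <= n; i++) print order[i] ": " tags[order[i]] }
  '
}

# Pairs copied by hand from the mistral listing.
mistral_tags='latest 2ae6f6dd7a3d
7b 2ae6f6dd7a3d
instruct 2ae6f6dd7a3d
text 495ae085225b'

printf '%s\n' "$mistral_tags" | group_tags_by_digest
```

Tags that end up on the same output line point to the same model, so `latest`,
`7b` and `instruct` appear together while `text` sits on a line of its own.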

[Introducing Meta Llama 3]: <https://ai.meta.com/blog/meta-llama-3/>
[mistral with tags]: https://ollama.com/library/mistral/tags
[command-r with tags]: https://ollama.com/library/command-r/tags
[CodeQwen]: https://ollama.com/library/codeqwen
[DeepSeek Coder]: https://ollama.com/library/deepseek-coder
[Codestral]: https://ollama.com/library/codestral

## Parameters or Model size (8B, 14B, 70B and more)

Large AI models boast billions of parameters, which makes them better at
capturing complex patterns and relationships in data. However, they require
significant computing resources such as GPUs and TPUs during training, and they
also need expensive hardware to run once trained. Smaller models, in contrast,
require fewer resources to train and run, but they tend to be less accurate.
They struggle with complex inferences that demand a deep understanding of
context, but they are well suited to run on everyday machines such as our
laptops. Having many parameters in a model doesn't always mean it's the best.
What matters most is the quality of the data used to train it, not just the
quantity. For now, let's assume that more parameters mean a better model.

When we open [Llama 3] on Ollama, we notice that there are two Llama 3 models:
**8B** and **70B**.

```sh
# Llama 3
70b
786f3184aec0 • 40GB • Updated 2 weeks ago

8b
365c0bd3c000 • 4.7GB • Updated 2 weeks ago
```

The 70 billion parameter model is larger and more capable, but it also needs
more resources to run: the download alone is 40 GB, and it needs roughly that
much GPU memory or RAM once loaded. The 8 billion parameter model is smaller
and lighter, and it runs with far fewer resources (a 4.7 GB download).

```sh
# To download (if needed) and run the 8B model
ollama run llama3:8b

# To download (if needed) and run the 70B model
ollama run llama3:70b
```
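
The Llama 3 listing above also lets us work out roughly how much storage each
parameter takes. This is a minimal sketch using only the sizes and parameter
counts quoted in this section. `bytes_per_param` is a hypothetical helper, and
the parameter counts are the rounded figures from the tag names.

```sh
# Print the approximate number of bytes stored per parameter.
# Args: <download size in GB> <parameter count in billions>
bytes_per_param() {
  awk -v size_gb="$1" -v params_b="$2" \
    'BEGIN { printf "%.2f\n", size_gb / params_b }'
}

bytes_per_param 4.7 8   # llama3:8b
bytes_per_param 40 70   # llama3:70b
```

Both models come out at a little over half a byte per parameter, so the
download size grows roughly in proportion to the parameter count.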

[Llama 3]: https://ollama.com/library/llama3

## Quantisation

Quantisation is a technique that stores a model's weights at a lower numerical
precision. This allows users to run large language models on devices with
limited memory and resources, and it often results in faster inference.
Quantisation is a complex topic, but as a general rule, larger quantisation
values (more bits per weight) tend to produce more accurate results at the cost
of a larger model. For most of the models available on Ollama, we find
quantisation values such as `fp16`, `q8_0`, `q6_K`, `q5_0`, `q4_0` and so on.

Based on [Llama3 with tags], we can find various versions of the "Llama 3 8B"
model at different levels of quantisation.

```sh
# Llama 3 8B
8b-instruct-q4_0
365c0bd3c000 • 4.7GB • Updated 2 weeks ago

8b-instruct-q4_1
ec0a6ea1fb9b • 5.1GB • Updated 2 weeks ago

8b-instruct-q5_0
a13deb0590c7 • 5.6GB • Updated 2 weeks ago

8b-instruct-q5_1
662158bc9277 • 6.1GB • Updated 2 weeks ago

8b-instruct-q8_0
1b8e49cece7f • 8.5GB • Updated 2 weeks ago

8b-instruct-q2_K
2745d547f376 • 3.2GB • Updated 2 weeks ago

8b-instruct-q3_K_S
0891d012c467 • 3.7GB • Updated 2 weeks ago

8b-instruct-q3_K_M
726a1960bfed • 4.0GB • Updated 2 weeks ago

8b-instruct-q3_K_L
df7624806f55 • 4.3GB • Updated 2 weeks ago

8b-instruct-q4_K_S
469db3d82576 • 4.7GB • Updated 2 weeks ago

8b-instruct-q4_K_M
9b8f3f3385bf • 4.9GB • Updated 2 weeks ago

8b-instruct-q5_K_S
7eae3176f5b5 • 5.6GB • Updated 2 weeks ago

8b-instruct-q5_K_M
af3083b17cd4 • 5.7GB • Updated 2 weeks ago

8b-instruct-q6_K
e4a3943fcd76 • 6.6GB • Updated 2 weeks ago

8b-instruct-fp16
c666fe422df7 • 16GB • Updated 2 weeks ago
```

For the above model, going by the quantisation values, `8b-instruct-q8_0`
should be more accurate than `8b-instruct-q6_K`, which in turn should be more
accurate than `8b-instruct-q4_0`, and so on. By default, Ollama uses a 4-bit
(`q4_0`) quantised model for most LLMs. For example, running
`ollama run llama3` will download and run the 4-bit (`q4_0`) quantised model of
"Llama 3 8B", which is why the `8b` and `8b-instruct-q4_0` tags share the hash
`365c0bd3c000`. But we can choose to run the 8-bit (`q8_0`) quantised model by
running `ollama run llama3:8b-instruct-q8_0`.
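
As a rough check on the sizes listed above, a model's size is approximately its
parameter count multiplied by the bits stored per weight. The sketch below makes
a few assumptions that are not from the listing: `estimate_size_gb` is a
hypothetical helper, the parameter count is rounded to 8 billion, and the
bits-per-weight figures (4.5 for `q4_0`, 8.5 for `q8_0`, 16 for `fp16`) assume
that the quantised formats store one 16-bit scale for every block of 32 weights.

```sh
# Estimate the size of a model's weights in GB.
# Args: <parameter count in billions> <bits per weight>
estimate_size_gb() {
  awk -v params_b="$1" -v bits="$2" \
    'BEGIN { printf "%.1f\n", params_b * bits / 8 }'
}

estimate_size_gb 8 4.5   # q4_0 (assumed 4.5 bits per weight)
estimate_size_gb 8 8.5   # q8_0 (assumed 8.5 bits per weight)
estimate_size_gb 8 16    # fp16
```

The estimates land close to the 4.7GB, 8.5GB and 16GB shown in the listing. They
are only approximate, since the real parameter count is not exactly 8 billion
and a model file holds more than just the weights.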

[Llama3 with tags]: https://ollama.com/library/llama3/tags

## Context size (8k, 32k, 128k)

In the LLM world, context size is a critical parameter that refers to the amount
of text that the model can process at once. The context size is measured in
tokens, where a token is typically a word, part of a word, or a punctuation
mark.

Example: with a simple word-level tokenizer, "I enjoyed reading this informative
blog post!" is broken down into 8 tokens, where each word and the exclamation
mark are counted as individual tokens. Real tokenizers may split the text
differently.
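
The word-and-punctuation counting in the example above can be sketched in the
shell. This is only an approximation for illustration: real LLM tokenizers split
text into subword pieces using a learned vocabulary, so actual token counts
differ from model to model. `count_naive_tokens` is a hypothetical helper.

```sh
# Count naive tokens: each run of letters, digits or apostrophes is one token,
# and each remaining punctuation character is one token.
count_naive_tokens() {
  printf '%s\n' "$1" \
    | grep -oE "[[:alnum:]']+|[[:punct:]]" \
    | wc -l \
    | tr -d ' '
}

count_naive_tokens "I enjoyed reading this informative blog post!"
```

For the example sentence this prints `8`: seven words plus the exclamation mark.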

LLM tasks such as coding, which require a deeper understanding of a codebase,
might benefit from a larger context size. Similarly, for conversational tasks, a
larger context size helps the LLM capture the nuances of the conversation and
respond more coherently and consistently as the conversation progresses. For
simpler single-turn or fact-based questions, a smaller context size can be
sufficient, as it typically yields a faster response that is still accurate. A
larger context size isn't always better, since it also uses more memory, but an
LLM with a larger context size can generally handle longer inputs better.

From [Phi3 with tags], we can download and test "Phi3" models that support
different context sizes.

```sh
# Phi-3-medium-14b
14b-medium-128k-instruct-q4_0
3aeb385c7040 • 7.9GB • Updated 13 days ago

14b-medium-4k-instruct-q4_0
1e67dff39209 • 7.9GB • Updated 2 weeks ago
```

```sh
# To download the 128k context size model
ollama pull phi3:14b-medium-128k-instruct-q4_0

# To download the 4k context size model
ollama pull phi3:14b-medium-4k-instruct-q4_0
```
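
Tags like the ones above pack most of the properties covered so far into a
single string. The following is a minimal sketch that splits a tag into its
parts and labels each one. `describe_tag` is a hypothetical helper, and its
classification rules are assumptions based only on the tags shown in this post,
so other models may name their tags differently.

```sh
# Describe the parts of a model tag such as "llama3:8b-instruct-q8_0".
# The body runs in a subshell "( ... )" so the IFS and globbing changes
# do not leak into the calling shell.
describe_tag() (
  model=${1%%:*}   # text before the first ":"
  rest=${1#*:}     # text after the first ":"
  echo "model: $model"

  IFS=-            # split the remainder on "-"
  set -f           # disable filename globbing while splitting
  for part in $rest; do
    case $part in
      [0-9]*b)                      echo "parameters: $part" ;;
      [0-9]*k)                      echo "context: $part" ;;
      base|text|instruct|chat|code) echo "type: $part" ;;
      q[0-9]*|fp16)                 echo "quantisation: $part" ;;
      *)                            echo "other: $part" ;;
    esac
  done
)

describe_tag "phi3:14b-medium-128k-instruct-q4_0"
```

For the tag above, this labels `14b` as the parameters, `128k` as the context
size, `instruct` as the type and `q4_0` as the quantisation, with `medium`
falling through to "other".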

[Phi3 with tags]: https://ollama.com/library/phi3/tags

## Licenses and legal

Before utilising such open source LLMs for any work or project (language
assistance, coding assistance, commercial deployment), it is essential to review
the license terms to ensure compliance.

```
MIT License
Mistral AI Non-Production License 
META LLAMA 3 COMMUNITY LICENSE AGREEMENT
```

Licenses can vary widely, ranging from permissive open-source licenses like MIT
or Apache to more restrictive proprietary licenses. Understanding the license
terms ensures that developers and product owners can use a model without legal
repercussions.

## Conclusion

At the end of this post, if `codeqwen:7b-code-v1.5-q8_0` holds any meaning, then
this post has achieved its goal. We should hopefully feel empowered to
experiment with open-source large language models on our machines, confident in
our understanding of the types, parameters, quantisation, and context sizes that
define them. Happy experimenting with LLMs!
