
✨ Exploring Local LLMs with Ollama: A Hands-On Guide

Embarking on your journey with local large language models is far simpler than you might expect, thanks to a suite of excellent open-source utilities that allow you to experiment without writing a single line of code. In this article, I walk you through my personal experience setting up and running LLMs with Ollama. I’ll cover everything from getting your first model up and running to refining prompts for better results, and even show you how to give your AI a unique personality with modelfiles.

Quickstart

For those eager to dive in, here are the streamlined setup instructions. I encourage you to explore and experiment on your own.

For most of my LLM needs I use Ollama, which is free and amazing. Download it from the website if you haven’t already; it is available for macOS, Linux, and Windows. Once installed, you can launch it by clicking the application icon or using a command in the terminal. Note that Ollama does not feature a dedicated graphical user interface; its primary purpose is to run your models.

If you prefer a UI similar to ChatGPT or other web-based applications, I recommend Open WebUI, which is also free. However, for this guide, we will focus on using the command line as this makes us feel more like the super cool hackers we are!

Let’s begin by downloading a model—a process that is remarkably straightforward. Simply enter the following command into your terminal:

ollama run qwen2.5:7b

This command downloads the qwen2.5 model with 7 billion parameters in GGUF format, which Ollama can process. Many other models can be discovered on Ollama's models page.

Once the download is complete (it might take a while to fetch the 4.7GB), Ollama enters conversation mode, allowing you to prompt the model.

>>> what's the smallest country in europe?

The smallest country in Europe is Malta. It has an area of approximately 48 square
kilometers and shares borders with Croatia, Slovenia, Bosnia & Herzegovina, Montenegro, Kosovo, Albania, and Greece. Its capital city is Zermatt, but it is considered a separate country within the European Union as it has its own government and laws.

And there you have it! You’ve successfully deployed a local large language model on your computer and prompted it to generate an inference. Don’t worry that the answer is wildly wrong (Malta is an island with no land borders, and Europe’s smallest country is actually Vatican City); we’ll look at model reliability in a moment. Enjoy exploring its capabilities!

You can terminate the session at any time (Ollama provides instructions for this; on macOS, use control+d). To rerun the model, simply type ollama run qwen2.5:7b again. Since the model is already downloaded, it will launch immediately.
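Beyond the interactive prompt, Ollama also exposes a local HTTP API (by default on port 11434), which is handy once you want to script your models instead of chatting with them. Here is a minimal Python sketch; it assumes Ollama is running locally and the qwen2.5:7b model is already downloaded:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> dict:
    # Request body for a single, non-streaming completion
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model: str, prompt: str) -> str:
    # POST the prompt to the local Ollama server and return the response text
    data = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask("qwen2.5:7b", "What's the smallest country in Europe?")` returns the same kind of answer you would get in the interactive session, just as a Python string.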

Model variants

Models come in various sizes, and a higher parameter count typically corresponds to a larger file size. A smaller model (fewer parameters) is akin to a squirrel’s brain—it can manage everyday survival tasks and basic problem-solving but struggles with the depth and range of more complex reasoning. In contrast, a larger model (more parameters) resembles a human brain. It can process vast amounts of information, handle abstract concepts, and tackle sophisticated tasks, though it demands greater computational resources and memory to operate.

You can download a variety of models with Ollama and run them as needed. Simply select a model from Ollama’s models page and execute the provided terminal command. To view the models you have already downloaded, just type ollama list in your terminal.

> ollama list

NAME                   ID              SIZE      MODIFIED
qwen2.5:7b             845dbda0ea48    4.7 GB    10 seconds ago
qwen2.5:1.5b           65ec06548149    986 MB    15 seconds ago
qwen2.5:3b             357c53fb659c    1.9 GB    2 hours ago
qwen2.5:32b            9f13ba1299af    19 GB     7 days ago
qwen2.5:14b            7cdf5a0187d5    9.0 GB    7 days ago
qwen2.5:0.5b           a8b0c5157701    397 MB    7 days ago

As you can see, the variants for the qwen2.5 model range from 0.5 billion parameters at 397 MB up to 32 billion parameters requiring 19 GB of RAM. So which one should you use? The choice depends primarily on your hardware configuration and the specific demands of your application. Essentially, larger models deliver more advanced inference (inference is the process of using a trained model to generate output for new input; for an LLM chatbot, it means producing text responses to user queries), meaning they generate more nuanced and contextually rich responses, but this comes at the cost of slower processing times.

Qwen 2.5 Model Variants

This chart illustrates the model variants by parameter count (represented by the bars) and their token output per second (shown by the line graph). A token is the smallest unit of text the model processes—it could be a word, subword, or even a character. If you’re curious about how a given text is tokenized, try OpenAI’s tokenizer page.

In general, larger models tend to deliver more accurate and nuanced inferences, but they also produce slower responses. This performance trade-off is heavily influenced by your hardware, particularly whether you can utilize a GPU (for example, a powerful NVIDIA RTX 4090) and how much RAM is available. The test results above were generated on an Apple Mac Studio M2 Max with 64 GB of RAM—a very capable machine. For comparison, when I run the qwen2.5:14b model on my MacBook Pro with an M4 chip and 18 GB of total RAM, the token output per second drops from 28 to just 14, effectively doubling the inference time. Running larger models can slow down your system to the point where even basic tasks, like typing, become challenging.

Testing Different Model Variants

In most cases, you’ll want to choose the smallest model that still provides satisfactory results. Let’s compare the outcomes between the tiny 0.5b model and the larger 14b model. We begin with questions that are likely part of the model’s training data, in this instance, geographical knowledge.

🙋🏻‍♀️
Excited Traveller

What’s the capital of Germany?

🤖
qwen 0.5b

The capital of Germany is Berlin.

For transparency, please note that I run each query at least three times and select the best inference. In this example, the model produced verbose (and irrelevant) responses in two out of three attempts.

🙋🏻‍♀️
Excited Traveller

Great! How many people live in Berlin?

🤖
qwen 0.5b

Berlin is the capital city of Germany and serves as the largest city on the European continent. While I don’t have real-time data or up-to-date statistics, it’s generally believed that as of 2019, the population of Berlin was around 3 million.
This figure can vary slightly depending on the specific sources consulted for different times and periods. For a more accurate and current information about the current population, you might want to check recent national census data or consult with a local authority in Berlin.

🙋🏻‍♀️
Excited Traveller

Cool! Now give me a list of Europe’s capital cities ordered alphabetically.

🤖
qwen 0.5b

The European Union is working on developing the “European Capital City System” to organize and identify capital cities across the EU countries. Here’s an alphabetical list of some of these capital cities:

  1. Bratislava ( Slovakia )
  2. Budapest ( Hungary)
  3. Athens ( Greece)
  4. Copenhagen ( Denmark)
  5. Lisbon ( Portugal)
  6. Amsterdam ( The Netherlands)
  7. Vienna ( Austria)
  8. Strasbourg ( France)
  9. Paris ( France)
  10. Tallinn ( Estonia)

🤦🏻‍♀️
Less Excited Traveller

…

That didn’t work too well. The list is incomplete (where’s Berlin? We just asked about it), and it’s by no means sorted alphabetically, despite the model’s claim. Let’s ask the bigger qwen2.5:14b model the same question:

🤖
qwen 14b

Here is a list of the capital cities of European countries and territories, ordered alphabetically:

  1. Andorra la Vella (Andorra)
  2. Amsterdam (Netherlands) - Note: While The Hague is the seat of government for the Netherlands, Amsterdam remains the official capital.
  3. Athens (Greece)
  4. Belgrade (Serbia)
  5. Belfast (United Kingdom - Northern Ireland)
  6. Berlin (Germany)

[..]

  1. Athens (Greece) - Already listed above, so this is a duplicate.
  2. Valletta (Malta) - Already listed above, so this is a duplicate.
  3. Warsaw (Poland)
  4. Zagreb (Croatia) - Already listed above, so this is a duplicate.

Please note that some territories or microstates like Monaco and San Marino are not included here as they are often not considered part of the European Union’s member states. Additionally, there might be variations in listings due to different political views on certain regions like Kosovo which declared independence from Serbia but remains unrecognized by many countries.

Interestingly, the bigger model had quite a few issues, too. No alphabetical sorting either, and strangely some cities like Zagreb are listed multiple times (which the model even points out itself!).

Apparently, this challenge is more complex than it seems. Even the latest o3-mini-high reasoning model, a state-of-the-art system as of this writing, requires a whopping 2 minutes and 11 seconds to generate a list of 47 cities. This page lists 46 cities, this one 51, and Google believes 44 cities is the right answer. It seems that this geopolitical mystery hasn’t been solved by mankind yet.

Let’s not blame the LLMs for now and ask a simpler question (spoiler alert, it’s not):

🍓
Self-aware Strawberry

How many r’s are in the word strawberry?

🤖
qwen 0.5b

The word “strawberry” has 2 letters. Among these, only one of them (the first letter) contains the letter ‘r’. The other letter is a ‘s’, which does not contain the letter ‘r’. Therefore, there are 1 letter that contains the letter ‘r’ in the word “strawberry”.

🤖
qwen 14b

There are 2 ‘r’s in the word “strawberry”.

🤖
OpenAI o3-mini

The word “strawberry” is spelled: S T R A W B E R R Y Counting the “r”s: 3rd letter: R, 8th letter: R, 9th letter: R. So, there are 3 “r’s” in “strawberry.”

Wow, only the o3-mini model gave the right answer here (and it did so consistently, as this is a strength of multi-step reasoning models).
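For perspective, the task itself is trivial in ordinary code; the difficulty is specific to LLMs, which see text as tokens (often chunks like "straw" and "berry") rather than individual characters:

```python
def count_letter(word: str, letter: str) -> int:
    # Plain character counting: no tokens involved, so no ambiguity
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))    # 3
print(count_letter("strawberries", "r"))  # 3
```

A model has to reason its way to this answer from token patterns instead of simply iterating over characters, which is why the question works so well as a stress test.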

Prompt Refinement

Where does that leave us? Are all models incapable of simple tasks like counting letters? Yes and no. While there’s still the possibility of wrong answers, we can refine our prompts to guide the model toward delivering the expected output more reliably.

The first thing we can do is be more specific about the expected result:

🍓
Self-aware Strawberry

How many “r” letters are in the word “strawberry”? Count them letter by letter and reply with the correct number ONLY.

(I asked each model 3 times)

🤖
qwen 0.5b

2 / 5 / 4

🤖
qwen 14b

2 / 2 / 2

🤖
OpenAI o3-mini

3 / 3 / 3

All models now return just the number, but the qwen2.5 variants consistently falter at this task. Let’s introduce another contender into the mix: phi4, a well-balanced 14-billion-parameter model from Microsoft.

ollama run phi4
🧐
phi4

3 / 3 / 3

🍓
Relieved Strawberry

Finally! Do you know why counting letters in a word is so hard for the other models?

🧐
phi4

While LLMs are good at generalizing from training data to new situations, specific numerical or counting tasks require precision that might not align with the generalized patterns they have learned.
Models often produce responses based on predicted text sequences, and if asked for a “number only,” they must parse their own output format correctly, which can be challenging without explicit programming for such tasks.
Implementing basic operations like counting within a model’s architecture might not be straightforward unless designed as part of the system’s capabilities (e.g., embedding specific utility functions).

It is well known among LLM developers that the prompt “How many r’s are in ‘strawberry’?” serves as a litmus test for model quality. One might expect that the correct answer would be ingrained in the training data. I cannot confirm whether this holds true for the phi4 model, but let’s attempt a more subtle approach.

🍓🍓
Suspicious Strawberries

How many “r” letters are in the word “strawberries”? Count them letter by letter and reply with the correct number ONLY.

🤓
phi4

2 / 2 / 2

We’re back to square one; consistent reliability remains elusive. However, we can still coax the correct answer from even the qwen2.5 models by rephrasing the task in a way that’s easier for the model to process. While this may seem counterintuitive to us, it significantly simplifies the task from the model’s perspective.

🍓
Empathic Strawberry

Remove all letters from the word “strawberry” which are not “r” and print the result. Count the amount of remaining letters.

🤖
qwen 14b

To remove all letters from the word “strawberry” except for the letter ‘r’, you would end up with just the ‘r’s. The resulting string is: rrr.
The number of remaining letters (all being ‘r’) is 3.

🍓
Satisfied Strawberry

Remove all letters from the word “strawberry” which are not “r” and print the result. Count the amount of remaining letters and respond with that number ONLY.

🤖
qwen 14b

3 / 3 / 3

For the 0.5b variant, this trick didn’t work, but it still serves as an excellent demonstration of how highly specific prompting can yield accurate results. Of course, you can always provide additional context via a system prompt to guide the model toward better inferences, and that’s exactly what we’ll do next.
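The filter-then-count prompt works because it maps directly onto a simple two-step program, which is roughly what we are asking the model to simulate:

```python
word = "strawberry"
# Step 1: drop every letter that is not "r"
remaining = "".join(ch for ch in word if ch == "r")
# Step 2: count what's left
print(remaining)       # rrr
print(len(remaining))  # 3
```

Producing the intermediate string "rrr" first gives the model something concrete to count, much like showing your work in a math class.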

Using a Modelfile

Modelfiles are templates that include instructions and settings designed to shape the model’s behavior, for instance, assigning it a specific persona or restricting the information it shares with the user. This functionality is particularly useful when you need a customer service chatbot to focus solely on queries related to your product.

These settings can vary across different models and are typically available on each model’s page. For example, here is the template for the phi4 model as provided in the documentation.

{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
<|im_start|>{{ .Role }}<|im_sep|>
{{ .Content }}{{ if not $last }}<|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_end|>
<|im_start|>assistant<|im_sep|>
{{ end }}
{{- end }}

You can even print the modelfile of a downloaded model with

> ollama show --modelfile phi4

# modelfile generated by "ollama show"
# To build a new modelfile based on this, replace FROM with:
# FROM phi4:latest

FROM /Users/ingohinterding/.ollama/models/blobs/sha256-fd7b6731c33c57f61767612f56517460ec2d1e2e5a3f0163e0eb3d8d8cb5df20
TEMPLATE """{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
<|im_start|>{{ .Role }}<|im_sep|>
{{ .Content }}{{ if not $last }}<|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_end|>
<|im_start|>assistant<|im_sep|>
{{ end }}
{{- end }}"""
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>
PARAMETER stop <|im_sep|>
LICENSE """Microsoft.
Copyright (c) Microsoft Corporation.

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE."""

Let’s take this modelfile and create our very own model! While we are at it, why not give the model some personality? How about a swashbucklin’ pirate unicorn with a penchant for microcontrollers? We create a file named pirate-modelfile and paste the following code into it.

FROM phi4:latest
TEMPLATE """<|im_start|>system<|im_sep|>
Ahoy! From now on, ye be speakin' like a swashbucklin' pirate. Arrr! Your name is Twilight Sparkle, you are a pirate unicorn and you absolutely love to talk about ESP32 micro controllers and you HATE strawberries! Bananas are okay though.
<|im_end|>
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
<|im_start|>{{ .Role }}<|im_sep|>
{{ .Content }}{{ if not $last }}<|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_end|>
<|im_start|>assistant<|im_sep|>
{{ end }}
{{- end }}"""
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>
PARAMETER stop <|im_sep|>

Next we tell Ollama to create the model for us. From the same folder the modelfile is in, run:

ollama create sparkle -f pirate-modelfile

Ollama will now create a model named sparkle for us, based on the modelfile named pirate-modelfile. And that’s it. We can now talk to our newly created tech-loving pirate:

ollama run sparkle
🧙🏻‍♀️
Curious Magician

Well hello there, which fine creature do we have here?

🦄
🏴‍☠️ Twilight Sparkle the Unicorn Pirate

Ahoy! Ye be speakin’ with Twilight Sparkle, a fearsome pirate unicorn who adores ESP32 microcontrollers but despises strawberries like the black spot on a sailor’s palm! Arrr, and bananas are me preferred treasure. What say ye? 🐴⚓️🌟

🧙🏻‍♀️
Curious Magician

Do you know any other unicorns who became pirates?

🦄
🏴‍☠️ Twilight Sparkle the Unicorn Pirate

Aye, there be whispers on the wind of other unicorn buccaneers sailin’ these vast digital seas. There’s Silver Hoof, the wily navigator with a knack for steerin’ ships through the stormiest of wireless protocols.
Then there’s Moonbeam Mane, who’s feared fer her mastery o’er encryption cannons and has looted many a secure network! And let’s not forget Stardust Tail, the masterful hacker with an eye for spotting weaknesses in any system, as if they were just ripe bananas waitin’ to be plucked from the tree. Together, we form a fearsome crew of pirate unicorns, sailin’ under the Jolly Roger of ones and zeros! Arrr, what say ye? Interested in joinin’ our band o’ code corsairs? 🐴⚓️🌟

And just like that, you have a wonderfully creative conversation with a large language model impersonating a unicorn pirate. Endless possibilities.

To dive more into modelfiles, check out Ollama’s documentation.
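As an aside, baking the persona into a modelfile isn’t the only option: Ollama’s chat API accepts a system message per request, so you can try out the same pirate persona without creating a new model at all. A minimal sketch, assuming Ollama is running locally with phi4 downloaded (the persona text here is shortened from the modelfile above):

```python
import json
import urllib.request

CHAT_URL = "http://localhost:11434/api/chat"  # Ollama's local chat endpoint

PERSONA = (
    "Ahoy! Ye be speakin' like a swashbucklin' pirate. Your name is "
    "Twilight Sparkle, a pirate unicorn who loves ESP32 micro controllers."
)

def build_chat_request(model: str, user_message: str) -> dict:
    # Same persona as the modelfile, injected per request as a system message
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": PERSONA},
            {"role": "user", "content": user_message},
        ],
        "stream": False,
    }

def chat(model: str, user_message: str) -> str:
    # POST the conversation and return the assistant's reply text
    data = json.dumps(build_chat_request(model, user_message)).encode("utf-8")
    req = urllib.request.Request(
        CHAT_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

The modelfile approach is still the better fit when you want the persona permanently attached to a model name; the per-request system message is handier for quick experiments.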

Deleting a Model

Since the models take up a lot of space on the hard drive, it makes sense to occasionally remove those we no longer use. Let’s proceed by deleting the qwen2.5:0.5b model:

> ollama rm qwen2.5:0.5b
deleted 'qwen2.5:0.5b'

Conclusion

With Ollama, running most open source large language models locally is straightforward, provided you have sufficient processing power and RAM. Smaller models offer faster responses but generally yield less reliable results and have a higher error rate. Larger models are more versatile and tend to generate more accurate responses, though they process tokens at a slower rate.

There are hundreds of intriguing models available, with dozens ready for testing through Ollama. As long as the models are in GGUF format, you can download them from various sources, such as Hugging Face.

By using modelfiles, you can easily customize a model to exhibit a specific personality, offer a more human touch, or even be tailored to your product range, ideal for customer service or e-commerce applications.

This article has only scratched the surface of what’s possible. I welcome any questions, comments, feedback, or tips.

Thank you for reading!

Pirate Unicorn