Embarking on your journey with local large language models is far simpler than you might expect, thanks to a suite of excellent open-source utilities that allow you to experiment without writing a single line of code. In this article, I walk you through my personal experience setting up and running LLMs with Ollama. I'll cover everything from getting your first model up and running to refining prompts for better results, and even show you how to give your AI a unique personality with modelfiles.
Quickstart
For those eager to dive in, here are the streamlined setup instructions. I encourage you to explore and experiment on your own.
For most of my LLM needs I use Ollama, which is free and amazing. Download it from the website if you haven't already; it is available for macOS, Linux, and Windows. Once installed, you can launch it by clicking the application icon or using a command in the terminal. Note that Ollama does not feature a dedicated graphical user interface; its primary purpose is to run your models.
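Normally the app starts the Ollama server for you, but you can also launch it by hand from the terminal (for instance on a headless Linux machine), assuming the ollama binary is on your PATH:

ollama serve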
If you prefer a UI similar to ChatGPT or other web-based applications, I recommend Open WebUI, which is also free. However, for this guide, we will focus on using the command line as this makes us feel more like the super cool hackers we are!
Let's begin by downloading a model, a process that is remarkably straightforward. Simply enter the following command into your terminal:
ollama run qwen2.5:7b
This command downloads the qwen2.5 model with 7 billion parameters in gguf format, which Ollama can process. Many other models can be discovered on Ollama's models page.
Once the download is complete (it might take a while to fetch the 4.7GB), Ollama enters conversation mode, allowing you to prompt the model.
>>> what's the smallest country in europe?
The smallest country in Europe is Malta. It has an area of approximately 48 square kilometers and shares borders with Croatia, Slovenia, Bosnia & Herzegovina, Montenegro, Kosovo, Albania, and Greece. Its capital city is Zermatt, but it is considered a separate country within the European Union as it has its own government and laws.
And there you have it! You've successfully deployed a local large language model on your computer and prompted it to generate an inference. (Don't worry if the answer looks off: the smallest country in Europe is actually Vatican City, Malta is an island with no land borders, and Zermatt is a village in Switzerland. We'll get to the reliability of small models in a moment.) Enjoy exploring its capabilities!
You can terminate the session at any time (Ollama provides instructions for this; on macOS, use control+d). To rerun the model, simply type ollama run qwen2.5:7b again. Since the model is already downloaded, it will launch immediately.
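By the way, while Ollama is running it also exposes a small HTTP API on localhost, which is what web frontends like Open WebUI talk to. As a quick sketch (assuming the default port 11434 and the model we just pulled), you can request a completion with curl:

curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "What is the smallest country in Europe?",
  "stream": false
}'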
Model Variants
Models come in various sizes, and a higher parameter count typically corresponds to a larger file size. A smaller model (fewer parameters) is akin to a squirrel's brain: it can manage everyday survival tasks and basic problem-solving but struggles with the depth and range of more complex reasoning. In contrast, a larger model (more parameters) resembles a human brain. It can process vast amounts of information, handle abstract concepts, and tackle sophisticated tasks, though it demands greater computational resources and memory to operate.
You can download a variety of models with Ollama and run them as needed. Simply select a model from Ollama's models page and execute the provided terminal command. To view the models you have already downloaded, just type ollama list in your terminal.
> ollama list
NAME            ID              SIZE      MODIFIED
qwen2.5:7b      845dbda0ea48    4.7 GB    10 seconds ago
qwen2.5:1.5b    65ec06548149    986 MB    15 seconds ago
qwen2.5:3b      357c53fb659c    1.9 GB    2 hours ago
qwen2.5:32b     9f13ba1299af    19 GB     7 days ago
qwen2.5:14b     7cdf5a0187d5    9.0 GB    7 days ago
qwen2.5:0.5b    a8b0c5157701    397 MB    7 days ago
As you can see, the variants for the qwen2.5 model range from 0.5 billion parameters at 397 MB up to 32 billion parameters requiring 19 GB of RAM. So which one should you use? The choice depends primarily on your hardware configuration and the specific demands of your application. Essentially, larger models deliver more advanced inference (inference is the process of using a trained model to make predictions or generate output given new input data; for an LLM chatbot, it means producing text responses for user queries), meaning they generate more nuanced and contextually rich responses, but this comes at the cost of slower processing times.
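Incidentally, if you want to fetch a variant without immediately dropping into a chat session, ollama pull downloads it and returns you straight to the shell:

ollama pull qwen2.5:1.5b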
Qwen 2.5 Model Variants
This chart illustrates the model variants by parameter count (represented by the bars) and their token output per second (shown by the graph). A token is the smallest unit of text the model processes: it could be a word, subword, or even a character. If you're curious about how a given text is tokenized, try OpenAI's tokenizer page.
In general, larger models tend to deliver more accurate and nuanced inferences, but they also produce slower responses. This performance trade-off is heavily influenced by your hardware, particularly whether you can utilize a GPU (for example, a powerful NVIDIA RTX 4090) and how much RAM is available. The test results above were generated on an Apple Mac Studio M2 Max with 64 GB of RAM, a very capable machine. For comparison, when I run the qwen2.5:14b model on my MacBook Pro with an M4 chip and 18 GB of total RAM, the token output per second drops from 28 to just 14, effectively doubling the inference time. Running larger models can slow down your system to the point where even basic tasks, like typing, become challenging.
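If you'd like to measure throughput on your own machine, ollama run accepts a --verbose flag that prints timing statistics after each response, including the evaluation rate in tokens per second:

ollama run qwen2.5:14b --verbose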
Testing Different Model Variants
In most cases, you'll want to choose the smallest model that still provides satisfactory results. Let's compare the outcomes between the tiny 0.5b model and the larger 14b model. We begin with questions that are likely part of the model's training data, in this instance, geographical knowledge.
What's the capital of Germany?
The capital of Germany is Berlin.
For transparency, please note that I run each query at least three times and select the best inference. In this example, the model produced verbose (and irrelevant) responses in two out of three attempts.
Great! How many people live in Berlin?
Berlin is the capital city of Germany and serves as the largest city on the European continent. While I don't have real-time data or up-to-date statistics, it's generally believed that as of 2019, the population of Berlin was around 3 million.
This figure can vary slightly depending on the specific sources consulted for different times and periods. For a more accurate and current information about the current population, you might want to check recent national census data or consult with a local authority in Berlin.
Cool! Now give me a list of Europe's capital cities ordered alphabetically.
The European Union is working on developing the "European Capital City System" to organize and identify capital cities across the EU countries. Here's an alphabetical list of some of these capital cities:
- Bratislava (Slovakia)
- Budapest (Hungary)
- Athens (Greece)
- Copenhagen (Denmark)
- Lisbon (Portugal)
- Amsterdam (The Netherlands)
- Vienna (Austria)
- Strasbourg (France)
- Paris (France)
- Tallinn (Estonia)
âŚ
That didn't work too well. The list is incomplete (e.g. where's Berlin? We just asked about it), and it's by no means sorted alphabetically, despite the model's claim to have done so.
Let's ask the bigger qwen2.5:14b model the same question:
Here is a list of the capital cities of European countries and territories, ordered alphabetically:
- Andorra la Vella (Andorra)
- Amsterdam (Netherlands) - Note: While The Hague is the seat of government for the Netherlands, Amsterdam remains the official capital.
- Athens (Greece)
- Belgrade (Serbia)
- Belfast (United Kingdom - Northern Ireland)
- Berlin (Germany)
[..]
- Athens (Greece) - Already listed above, so this is a duplicate.
- Valletta (Malta) - Already listed above, so this is a duplicate.
- Warsaw (Poland)
- Zagreb (Croatia) - Already listed above, so this is a duplicate.
Please note that some territories or microstates like Monaco and San Marino are not included here as they are often not considered part of the European Union's member states. Additionally, there might be variations in listings due to different political views on certain regions like Kosovo which declared independence from Serbia but remains unrecognized by many countries.
Interestingly, the bigger model had quite some issues, too. No alphabetical sorting either, and strangely some cities like Zagreb are listed multiple times (which the model even mentions!).
Apparently, this challenge is more complex than it seems. Even the latest o3-mini-high reasoning model, a state-of-the-art system as of this writing, requires a whopping 2 minutes and 11 seconds to generate a list of 47 cities. Other reference pages list 46 or 51 cities, and Google believes 44 cities is the right answer. It seems that this geopolitical mystery hasn't been solved by mankind yet.
Let's not blame the LLMs for now and ask a simpler question (spoiler alert: it's not):
How many r's are in the word strawberry?
The word "strawberry" has 2 letters. Among these, only one of them (the first letter) contains the letter "r". The other letter is a "s", which does not contain the letter "r". Therefore, there are 1 letter that contains the letter "r" in the word "strawberry".
There are 2 "r"s in the word "strawberry".
The word "strawberry" is spelled: S T R A W B E R R Y. Counting the "r"s: 3rd letter: R, 8th letter: R, 9th letter: R. So, there are 3 "r"s in "strawberry."
Wow, only the o3-mini model gave the right answer here (and it did so consistently, as this is a strength of multi-step reasoning models).
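A practical note before we refine the prompt: you don't have to stay in interactive mode for experiments like these. ollama run also accepts the prompt as a command-line argument, so repeated runs are easy to script. A minimal shell sketch:

# ask the same question three times in a row, non-interactively
for i in 1 2 3; do
  ollama run qwen2.5:7b 'How many r letters are in the word strawberry? Reply with the number ONLY.'
done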
Prompt Refinement
Where does that leave us? Are all models incapable of simple tasks like counting letters? Yes and no. While there's still the possibility of wrong answers, we can refine our prompts to guide the model toward delivering the expected output more reliably.
The first thing we can do is to be more specific about the expected result:
How many "r" letters are in the word "strawberry"? Count them letter by letter and reply with the correct number ONLY.
(I asked each model 3 times)
2 / 5 / 4
2 / 2 / 2
3 / 3 / 3
All models now return just the number, but the qwen2.5 variants consistently falter at this task. Let's introduce another contender into the mix: phi4, a well-balanced 14-billion-parameter model from Microsoft.
ollama run phi4
3 / 3 / 3
Finally! Do you know why counting letters in a word is so hard for the other models?
- While LLMs are good at generalizing from training data to new situations, specific numerical or counting tasks require precision that might not align with the generalized patterns they have learned.
- Models often produce responses based on predicted text sequences, and if asked for a "number only," they must parse their own output format correctly, which can be challenging without explicit programming for such tasks.
- Implementing basic operations like counting within a model's architecture might not be straightforward unless designed as part of the system's capabilities (e.g., embedding specific utility functions).
A big part of the answer is tokenization: the model never sees individual letters, only tokens, so a word like "strawberry" may be processed as just one or two chunks. Beyond that, it is well known among LLM developers that the prompt "How many r's are in 'strawberry'?" serves as a litmus test for model quality, so one might expect the correct answer to be ingrained in the training data. I cannot confirm whether this holds true for the phi4 model, but let's attempt a more subtle approach.
How many "r" letters are in the word "strawberries"? Count them letter by letter and reply with the correct number ONLY.
2 / 2 / 2
We're back to square one; consistent reliability remains elusive. However, we can still coax the correct answer from even the qwen2.5 models by rephrasing the task in a way that's easier for the model to process. While this may seem counterintuitive to us, it significantly simplifies the task from the model's perspective.
Remove all letters from the word "strawberry" which are not "r" and print the result. Count the amount of remaining letters.
To remove all letters from the word "strawberry" except for the letter "r", you would end up with just the "r"s. The resulting string is: rrr.
The number of remaining letters (all being "r") is 3.
Remove all letters from the word "strawberry" which are not "r" and print the result. Count the amount of remaining letters and respond with that number ONLY.
3 / 3 / 3
For the 0.5b variant, this trick didn't work, but it still serves as an excellent demonstration of how highly specific prompting can yield accurate results. Of course, you can always provide additional context via a system prompt to guide the model toward better inferences, and that's exactly what we'll do next.
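Incidentally, if you just want to experiment, recent Ollama versions let you set a system prompt ad hoc from inside an interactive session with the /set command (type /? in the session for the full list):

>>> /set system Always count letters one by one before answering.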
Using a Modelfile
Modelfiles are templates that include instructions and settings designed to shape the model's behavior, for instance, assigning it a specific persona or restricting the information it shares with the user. This functionality is particularly useful when you need a customer service chatbot to focus solely on queries related to your product.
These settings can vary across different models and are typically available on each model's page. For example, here is the template for the phi4 model as provided in the documentation.
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
<|im_start|>{{ .Role }}<|im_sep|>
{{ .Content }}{{ if not $last }}<|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_end|>
<|im_start|>assistant<|im_sep|>
{{ end }}
{{- end }}
You can even print the modelfile of a downloaded model with
> ollama show --modelfile phi4
# modelfile generated by "ollama show"
# To build a new modelfile based on this, replace FROM with:
# FROM phi4:latest
FROM /Users/ingohinterding/.ollama/models/blobs/sha256-fd7b6731c33c57f61767612f56517460ec2d1e2e5a3f0163e0eb3d8d8cb5df20
TEMPLATE """{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
<|im_start|>{{ .Role }}<|im_sep|>
{{ .Content }}{{ if not $last }}<|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_end|>
<|im_start|>assistant<|im_sep|>
{{ end }}
{{- end }}"""
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>
PARAMETER stop <|im_sep|>
LICENSE """Microsoft.
Copyright (c) Microsoft Corporation.
MIT License
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE."""
Let's take this modelfile and create our very own model! While we are at it, why not give the model some personality? How about a swashbucklin' pirate unicorn with a soft spot for microcontrollers? We create a file named pirate-modelfile and paste the following code into it.
FROM phi4:latest
TEMPLATE """<|im_start|>system<|im_sep|>
Ahoy! From now on, ye be speakin' like a swashbucklin' pirate. Arrr! Your name is Twilight Sparkle, you are a pirate unicorn and you absolutely love to talk about ESP32 microcontrollers and you HATE strawberries! Bananas are okay though.
<|im_end|>
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
<|im_start|>{{ .Role }}<|im_sep|>
{{ .Content }}{{ if not $last }}<|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_end|>
<|im_start|>assistant<|im_sep|>
{{ end }}
{{- end }}"""
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>
PARAMETER stop <|im_sep|>
Next we tell Ollama to create the model for us. From the same folder the modelfile is in, run:
ollama create sparkle -f pirate-modelfile
Ollama will now create a model named sparkle for us, based on the modelfile named pirate-modelfile.
And that's it. We can now talk to our newly created tech-loving pirate:
ollama run sparkle
Well hello there, which fine creature do we have here?
Ahoy! Ye be speakin' with Twilight Sparkle, a fearsome pirate unicorn who adores ESP32 microcontrollers but despises strawberries like the black spot on a sailor's palm! Arrr, and bananas are me preferred treasure. What say ye?
Do you know any other unicorns who became pirates?
Aye, there be whispers on the wind of other unicorn buccaneers sailin' these vast digital seas. There's Silver Hoof, the wily navigator with a knack for steerin' ships through the stormiest of wireless protocols. Then there's Moonbeam Mane, who's feared fer her mastery o'er encryption cannons and has looted many a secure network! And let's not forget Stardust Tail, the masterful hacker with an eye for spotting weaknesses in any system, as if they were just ripe bananas waitin' to be plucked from the tree. Together, we form a fearsome crew of pirate unicorns, sailin' under the Jolly Roger of ones and zeros! Arrr, what say ye? Interested in joinin' our band o' code corsairs?
And just like that you have a wonderfully creative conversation with a large language model impersonating a unicorn pirate. Endless possibilities.
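One note on the template: we embedded the persona directly in the chat template above, but modelfiles also support a dedicated SYSTEM instruction, which is usually the simpler route when all you want is a persona (the pirate prompt is abbreviated here as a sketch):

FROM phi4:latest
SYSTEM """Ahoy! From now on, ye be speakin' like a swashbucklin' pirate. Arrr!"""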
To dive deeper into modelfiles, check out Ollama's documentation.
Deleting a Model
Since the models take up a lot of space on the hard drive, it makes sense to occasionally remove those we no longer use. Let's proceed by deleting the qwen2.5:0.5b model:
> ollama rm qwen2.5:0.5b
deleted 'qwen2.5:0.5b'
Conclusion
With Ollama, running most open source large language models locally is straightforward, provided you have sufficient processing power and RAM. Smaller models offer faster responses but generally yield less reliable results and have a higher error rate. Larger models are more versatile and tend to generate more accurate responses, though they process tokens at a slower rate.
There are hundreds of intriguing models available, with dozens ready for testing through Ollama. As long as the models are in gguf format, you can download them from various sources, such as Hugging Face.
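Newer Ollama versions can even pull GGUF repositories directly from Hugging Face using the hf.co/ prefix, and a local .gguf file can be imported via a one-line modelfile. A sketch (the repository path is a placeholder; substitute a real GGUF repo):

# pull a GGUF model straight from Hugging Face
ollama run hf.co/<user>/<repo>

# or import a .gguf file you downloaded manually
echo 'FROM ./downloaded-model.gguf' > gguf-modelfile
ollama create my-gguf-model -f gguf-modelfile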
By using modelfiles, you can easily customize a model to exhibit a specific personality, offer a more human touch, or even be tailored to your product range, ideal for customer service or e-commerce applications.
This article has only scratched the surface of what's possible. I welcome any questions, comments, feedback, or tips.
Thank you for reading!