Learning LLM Basics with Ollama

We have recently started learning about LLMs and how we can integrate them into our development projects, and we will be documenting our learning in a series of blog posts. In our first installment of the series, we learned about the OpenAI API. In this one, we will experiment with Ollama, which is an open-source application that allows you to download, run, and interact with LLMs on your own hardware. By running models locally, you maintain complete control over your data and can use LLMs without an internet connection. It also allows you to easily experiment with different models.

If you’d like to follow along, all the code is available in Jupyter Notebooks on GitHub.

Prerequisites

First, install Ollama on your computer. You can then add any of the models available in the Ollama library using the ollama pull command. Models come in different sizes, indicated by the number of parameters (2B, 7B, and so on) they learned during training. Larger models are usually more capable but also require more computational resources, such as RAM. Ollama’s GitHub repo has some advice on selecting a model based on the available RAM.

For the code examples in this notebook, we will be using Meta’s Llama 3.2, which is a relatively small but capable model (3B parameters, 2GB in size). Download it with:

ollama pull llama3.2

Also, install the following packages from PyPI: ollama (the official Python SDK), llm (the CLI tool and Python library we use later), and llm-ollama (the plugin that lets llm use models served by Ollama).
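
To check that everything is wired up, we can ask the local server which models it has. This is a minimal sketch that assumes a recent version of the ollama Python package, where list() returns an object with a models attribute:

import ollama

# Ask the local Ollama server (http://localhost:11434 by default)
# which models have been pulled.
models = ollama.list()

# Each entry describes one local model; llama3.2 should appear here
# after the ollama pull above.
for m in models.models:
    print(m.model)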

Using ollama run

We will use Ollama’s run command to ask the Llama model to tell us a kid-friendly joke, just like in the OpenAI blog post. In a terminal, run the following:

ollama run llama3.2 "Tell me a silly joke for a kid."
Here's one:

What do you call a group of cows playing instruments?

A moo-sical band!

I hope that made the kids giggle!

Raw JSON response using curl

Ollama also exposes an HTTP API, available at http://localhost:11434 by default. Let’s use curl to make the same request and see the raw JSON response from the model.

curl http://localhost:11434/api/generate -s -d '{
  "model": "llama3.2",
  "prompt": "Tell me a silly joke for a kid.",
  "stream": false
}'

There’s a lot in the JSON response, but the actual generated text is in the response field:

{
  "model": "llama3.2",
  "created_at": "2025-10-23T23:27:10.499906849Z",
  "response": "Here's one:\n\nWhat do you call a group of cows playing instruments?\n\nA moo-sical band!\n\nI hope the kid laughs!",
  "done": true,
  "done_reason": "stop",
  "context": [
    128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366,
    18, 271, 128009, 128006, 882, 128007, 271, 41551, 757, 264, 30571, 22380,
    369, 264, 10585, 13, 128009, 128006, 78191, 128007, 271, 8586, 596, 832,
    1473, 3923, 656, 499, 1650, 264, 1912, 315, 46266, 5737, 24198, 1980, 32,
    46971, 1355, 950, 7200, 2268, 40, 3987, 279, 10585, 49336, 0
  ],
  "total_duration": 2365373600,
  "load_duration": 79954127,
  "prompt_eval_count": 34,
  "prompt_eval_duration": 85957773,
  "eval_count": 28,
  "eval_duration": 2198897279
}
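
Most of the remaining fields are timing and token statistics (the context array is an encoding of the conversation that can be sent back in a follow-up request). The durations are reported in nanoseconds and eval_count is the number of tokens generated, so we can estimate the generation speed with a quick calculation using the numbers above:

# Values copied from the JSON response above.
eval_count = 28              # tokens generated
eval_duration = 2198897279   # nanoseconds spent generating them

tokens_per_second = eval_count / (eval_duration / 1e9)
print(f"{tokens_per_second:.1f} tokens/s")  # roughly 12.7 tokens/s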

By default, the /api/generate endpoint returns a stream of responses. Adding "stream": false in our request ensures we get a single JSON response.
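
We can also make the same request from Python without any Ollama-specific library. Here is a sketch using the requests package (assuming it is installed), posting the same payload we sent with curl:

import requests

# POST the same payload to the local Ollama API.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Tell me a silly joke for a kid.",
        "stream": False,  # ask for a single JSON object instead of a stream
    },
)

# Print just the generated text from the JSON response.
print(resp.json()["response"])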

Using the Python SDK

The same can be done using Ollama’s Python SDK:

import ollama

response = ollama.generate(
    model="llama3.2",
    prompt="Tell me a silly joke for a kid.",
)
print(response)

The response object will be similar to the JSON we saw above:

GenerateResponse(
    model='llama3.2',
    created_at='2025-10-23T23:41:04.109314989Z',
    done=True,
    done_reason='stop',
    total_duration=3061086297,
    load_duration=50503219,
    prompt_eval_count=34,
    prompt_eval_duration=85522560,
    eval_count=38,
    eval_duration=2924453316,
    response="Here's one that's sure to make a kid giggle:\n\nWhat do you call a group of cows playing instruments?\n\nA moo-sical band!\n\nHope the little ones enjoy it!",
    thinking=None,
    context=[128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 271, 128009, 128006, 882, 128007, 271, 41551, 757, 264, 30571, 22380, 369, 264, 10585, 13, 128009, 128006, 78191, 128007, 271, 8586, 596, 832, 430, 596, 2771, 311, 1304, 264, 10585, 23401, 3491, 1473, 3923, 656, 499, 1650, 264, 1912, 315, 46266, 5737, 24198, 1980, 32, 46971, 1355, 950, 7200, 2268, 39115, 279, 2697, 6305, 4774, 433, 0]
)

As we can see, the response is very similar to what we got using curl, and the API is quite straightforward to use. We can also extract just the generated text from the response object.

print(response.response)
Here's one that's sure to make a kid giggle:

What do you call a group of cows playing instruments?

A moo-sical band!

Hope the little ones enjoy it!
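
As noted above, the API streams its output by default, and the Python SDK exposes the same behavior through a stream parameter. A minimal sketch, assuming stream=True makes generate() yield partial responses as they are produced:

import ollama

# With stream=True, generate() returns an iterator of partial responses;
# each chunk carries the next piece of the generated text.
for chunk in ollama.generate(
    model="llama3.2",
    prompt="Tell me a silly joke for a kid.",
    stream=True,
):
    print(chunk["response"], end="", flush=True)
print()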

Using the llm CLI tool and Python library

llm is an open-source CLI tool and Python library for interacting with LLMs, created by Simon Willison. It works with both local models and remote APIs (from OpenAI, Anthropic’s Claude, Google’s Gemini, etc.).

Let’s use the CLI tool to run the same prompt as before. The output should be similar to what we got from ollama run.

llm -m llama3.2 "Tell me a silly joke for a kid."
Here's one:

What do you call a group of cows playing instruments?

A moo-sical band!

Kids love puns, and this one is sure to make them giggle!

We can also use the llm Python package to interact with the model.

import llm

model = llm.get_model("llama3.2")
response = model.prompt("Tell me a silly joke for a kid.")

The prompt will not be evaluated until you call response.text() or print(response).

print(response)
Here's one that kids usually love:

What do you call a group of cows playing instruments?

A moo-sical band!

I hope that made you giggle!
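
Since evaluation is lazy, we can also capture the text explicitly with response.text(), which runs the prompt if it hasn’t run yet and returns the reply as a plain string. A small sketch (each prompt here is independent; there is no shared conversation state):

# text() forces evaluation (if it hasn't happened yet) and returns a plain str.
joke = response.text()

# The same model object can be reused for further, independent prompts.
print(model.prompt("Tell me a silly joke about penguins.").text())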

Unfortunately, we cannot get the raw JSON response with this tool, unlike with the ollama Python SDK. According to the docs, some model plugins make the JSON available through the response.json() method, but it seems the llm-ollama plugin does not do that.

print(response.json())
None

Conclusion

Ollama is indeed an easy-to-use tool that lets us run LLMs locally. Pairing it with llm adds some powerful features, such as the ability to log all prompts and responses to a SQLite database. It also lets us switch between local models and remote APIs when necessary, which can be quite useful when experimenting with various LLMs.

Next up, we’ll explore function calling.