Ollama
Overview
Ollama is the easiest way to get up and running with large language models such as gpt-oss, Gemma 3, DeepSeek-R1, Qwen3 and more.
Please note that it is for running LLMs only, not for training them. You can, however, create customised models from existing models and run those customised models.
Restrictions on use
Ollama is open source and freely distributed under the MIT License.
Please note that each LLM's license is different from that of Ollama. Please check the respective license terms before using them.
Set up procedure
To access the software you must first load the appropriate modulefile.
apps/binapps/ollama/0.12.3
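For example:
module purge
module load apps/binapps/ollama/0.12.3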
Location of downloaded LLMs
When you run any model in Ollama, it will first download the model to the directory ~/.ollama in your home directory and then run it from there. This is the default location. Over time this directory can grow large as you run different LLMs. It is therefore advised to change the default location where Ollama stores LLMs to somewhere within your ~/scratch directory.
The location which Ollama uses for storing LLMs is controlled by the environment variable OLLAMA_MODELS.
To change the default storage location, first create a directory inside your ~/scratch directory:
mkdir ~/scratch/ollama_models
Next, set that path as OLLAMA_MODELS by adding the following line to your ~/.bashrc file:
export OLLAMA_MODELS="$HOME/scratch/ollama_models"
Finally, re-login to CSF3 or source your edited ~/.bashrc file:
source ~/.bashrc
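As a quick sanity check, you can confirm the new location is in effect and (after pulling a model) see how much space it is using:
echo $OLLAMA_MODELS     # Should print the path under your ~/scratch directory
du -sh $OLLAMA_MODELS   # Shows how much space downloaded models are using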
Interactive mode testing
Please do NOT run Ollama on the login node. Jobs should be submitted to the compute nodes via batch.
You can test-run Ollama in an interactive job/session. However, we strongly advise that you use batch jobs rather than interactive jobs. Use this for short-duration testing only.
# Start an interactive job.
# Using long flag names
srun --partition=gpuL --gpus=1 --ntasks=1 --time=1-0 --pty bash
# Or using short flag names
srun -p gpuL -G 1 -n 1 -t 1-0 --pty bash
# Once resources are assigned for interactive job and you are logged in to an interactive node, run the following:
module purge
module load apps/binapps/ollama/0.12.3
unset ROCR_VISIBLE_DEVICES
ollama serve & # This will start the ollama server
# Uses port 11434 by default
ollama -v # Verify that ollama server is running
ollama run llama3.2 # This will download Llama 3.2 LLM and run it
ollama ps # This will list the running LLM
# You will be able to interact with the running LLM at this stage.
Hello # Interact with LLM
/bye # Exit interaction when done testing
ollama stop llama3.2 # Stop the running LLM
ollama ps # Verify that the LLM has been stopped
exit # This will end the interactive job/session and
# ollama server will be stopped.
GPU batch job submission
Single GPU job
Write a job submission script, for example:
#!/bin/bash --login
#SBATCH -p gpuL       # GPU partition. Available options for all: gpuL (L40s-48GB), gpuA (A100-80GB)
                      # GPU partitions with restricted access: gpuA40GB
#SBATCH -G 1          # (or --gpus=N) Number of GPUs
#SBATCH -t 1-0        # Wallclock time limit (1-0 is one day, 4-0 is the maximum permitted)

### Optional flags
#SBATCH -n 1          # (or --ntasks=) Number of CPU (host) cores (default is 1).
                      # Up to 12 CPU cores per GPU are permitted for gpuL, gpuA and gpuA40GB.
                      # Also affects host RAM allocated to the job unless --mem=num is used.

# Load the modules
module purge
module load cuda/12.6.2
module load apps/binapps/ollama/0.12.3

# Without the following line Ollama will run on the CPU instead of the GPU
unset ROCR_VISIBLE_DEVICES

# Start the ollama server process and run the desired LLM
export OLLAMA_HOST=0.0.0.0:11434   # This enables you to interact with the API remotely
ollama serve &
ollama run llama3.2

# The lines below stop the LLM gracefully.
# Adjust the sleep duration as per the wallclock time you have set for the job.
sleep 23h              # Delay execution of the command on the next line
ollama stop llama3.2   # Stop the Llama 3.2 model. However, this does not stop
                       # the ollama server, which is stopped when the job ends.
Submit the jobscript using: sbatch scriptname
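For example, if the script above is saved as ollama_gpu.sh (an illustrative filename), submit it and note which compute node it lands on; you will need the node name later to talk to the API:
sbatch ollama_gpu.sh   # Submit the job
squeue -u $USER        # Once running, the NODELIST column shows the compute node (nodeNNN)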
Interacting with the LLMs running on CSF3 compute nodes
Once your desired LLM is running on a GPU compute node in CSF3, you will want to interact with it.
You can use the different APIs provided by Ollama to interact with the LLMs.
You can either:
- Interact from CSF3 login node itself
- Interact directly from your laptop
Here are some examples of how to do that:
1. Interacting from CSF3 login node
Once your job is running, check which node the job is running on using the squeue command.
Then you can interact from the CSF3 login node by running:
curl http://nodeNNN:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Tell me a joke"
}'
You will get the response in chunks instead of a single blob, since the default behaviour is to stream JSON chunks.
If you prefer the answer in a single blob, which is easier to read, run the following instead, which disables streaming:
curl http://nodeNNN:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Tell me a joke",
"stream": false
}'
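The server also provides other endpoints. For example, to list the models that have already been downloaded on that node (same nodeNNN as above):
curl http://nodeNNN:11434/api/tags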
2. Interacting from your laptop
To interact with an LLM running on a compute node of CSF3, you will first need to set up an SSH tunnel from your laptop to that compute node.
You can set up the SSH tunnel by running the following command from your laptop (terminal or Windows command prompt), replacing nodeNNN with the compute node name and username with your CSF3 username:
ssh -L 11434:nodeNNN:11434 username@csf3.itservices.manchester.ac.uk
After completing the authentication, you will be logged in to CSF3. Keep this terminal/window open but set it aside; it maintains the tunnel.
Linux/Mac laptop example
Next, open a new terminal on your Linux/Mac laptop and run the same commands as above, changing nodeNNN to localhost this time.
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Tell me a joke",
"stream": false
}'
Windows laptop example
The curl command syntax on Windows differs from that on Linux/Mac. Open a new Command Prompt window on your Windows laptop and run the following command.
curl -X POST http://localhost:11434/api/generate -H "Content-Type: application/json" ^
-d "{\"model\":\"llama3.2\",\"prompt\":\"Tell me a joke\",\"stream\":false}"
Other APIs
The above examples use the GENERATE API, which is fine for individual single-turn interactions.
For longer multi-turn conversations, which need the previous messages or context to be remembered, you can use the CHAT API.
Here's an example:
#Linux/MAC Example
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [
{ "role": "user", "Content": "Tell me a nerd joke" }
],
"stream": false
}'
#Windows Example
curl -X POST http://localhost:11434/api/chat -H "Content-Type: application/json" ^
-d "{\"model\":\"llama3.2\",\"messages\":[{\"role\":\"user\",\"content\":\"Tell me a nerd joke\"}],\"stream\":false}"
Further info
Updates
None.
