How to run LLMs locally?

Authors
  Alex Zurka

Large language models like ChatGPT or Gemini simplify our tasks drastically, but they are run by big tech companies and don't allow us to perform our computations locally. Thankfully, small teams (and some companies as well) bring us open models with open weights that we can run on our own hardware. The easiest way to get started is to serve models via ollama.

Installation

If you're not using Arch Linux, then you need to have docker installed, as well as the appropriate drivers for your GPU.

Nvidia

Deb-based distros:

1 Add repository

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
    | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
    | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
    | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update

2 Install the container toolkit

sudo apt-get install -y nvidia-container-toolkit

RPM-based distros:

1 Add repository

curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
    | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

2 Install the container toolkit

sudo yum install -y nvidia-container-toolkit

Configure docker to use the GPU

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
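
To sanity-check that containers can now see the GPU, you can run nvidia-smi in a throwaway container (the ubuntu image here is just an example; with --gpus=all the toolkit mounts the driver utilities into it):

docker run --rm --gpus=all ubuntu nvidia-smi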

Running the container

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
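
Once the container is up, the server listens on port 11434 and answers a plain HTTP request, which is a quick way to confirm it started correctly:

curl http://localhost:11434   # expected to reply with a short "Ollama is running" message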

AMD

Here we need to pull the image with the rocm tag

docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm
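
Since ROCm support is more hardware-dependent, it's worth checking the container logs to see whether the GPU was actually detected (the exact log lines vary between ollama versions):

docker logs ollama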

Running

To start a chat, execute the following command

docker exec -it ollama ollama run llama3
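
llama3 here is just one choice; the same pattern works for any model from the library, and the CLI inside the container can also pull and list models without starting a chat:

docker exec -it ollama ollama pull mistral   # download another model (example name)
docker exec -it ollama ollama list           # show models stored in the ollama volume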

docker compose

To use ollama as part of a multi-container application (or just for convenience), we can use a docker-compose.yml file

services:
    ollama:
        container_name: ollama
        image: ollama/ollama:latest # or rocm
        volumes:
            - ollama:/root/.ollama
        ports:
            - 11434:11434
        networks:
            - app_network

networks:
  app_network:
    driver: bridge
    
volumes:
  ollama:

The ollama volume persists downloaded models between container runs. Note that this minimal file does not expose a GPU to the service: for Nvidia you additionally need a device reservation (deploy.resources.reservations.devices), and for AMD you need to pass through /dev/kfd and /dev/dri, mirroring the docker run flags above.

To bring it up, we can use the following command:

docker compose up -d
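
From there, the same CLI works through compose, for example to pull a model and start a chat inside the service container:

docker compose exec ollama ollama pull llama3
docker compose exec ollama ollama run llama3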

Arch Linux

Nvidia

Ensure that the CUDA toolkit is installed

sudo pacman -Syu cuda --needed

Next, install the ollama-cuda package

sudo pacman -S ollama-cuda

AMD

Ensure that the ROCm libraries are in place

paru -S rocm-opencl-runtime rocm-smi

Next, install the ollama-rocm package

sudo pacman -S ollama-rocm

Running

To start a chat, execute the following command

ollama run llama3
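
Note that ollama run only talks to a local server; the Arch packages ship a systemd unit (ollama.service in the current packaging) that you can enable if the server isn't already running:

sudo systemctl enable --now ollama.service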

Additional notes

  1. We can browse the available models in the ollama library
  2. We can access the API at http://localhost:11434 (see the example request below)
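
For instance, a minimal completion request against the HTTP API looks like this (using the /api/generate endpoint, with streaming disabled so the reply comes back as a single JSON object):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'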

Summary

Today we covered how to use ollama, a tool for running large language models locally. We explored different ways to run it: in docker, with docker compose, and natively on Arch Linux, for both Nvidia and AMD GPUs. We also learned how to access ollama's API and where to find available models.