Llama-3 with LLM on WSL2
My path to getting Llama-3 fully GPU accelerated on PC with an Nvidia card.
I’ve started messing around with the recently released Meta AI’s Llama-3 on my gaming PC (my personal MacBook is still an Intel one, so not a good fit). I’m interacting with it using Simon Willison’s great Python/CLI utility simply called LLM. On my PC I’m using Ubuntu via WSL2, which thus far has served me extremely well without the fuss of dual boot or virtual machines. However, here it fell a bit short, though I managed to work around that.
First Attempt: GPT4All
Nomic’s GPT4All (or more precisely the llm-gpt4all plugin) seemed like the obvious choice, as it appears to be the easiest to use and the most portable.
Installing it was a breeze:
$ llm install llm-gpt4all
$ llm models | grep -i llama-3
gpt4all: Meta-Llama-3-8B-Instruct - Llama 3 Instruct, 4.34GB download, needs 8GB RAM (installed)
$ llm prompt -m Meta-Llama-3-8B-Instruct "Are you a Llama-3 LLM?"
Downloading: 100%|████████████████████████████████████████████████████████████████| 4.66G/4.66G [00:56<00:00, 82.1MiB/s]
Verifying: 100%|███████████████████████████████████████████████████████████████████| 4.66G/4.66G [00:05<00:00, 861MiB/s]
Yes, I am! I'm an AI designed to assist and communicate with users in a helpful and engaging way. As a large language model (LLM), my primary function is to understand natural language inputs and generate human-like responses based on that understanding. This allows me to have conversations, answer questions, provide information, and even create content like text or chatbots! What can I help you with today?
Success!
Well, not quite. I noticed it was a bit sluggish: it was using exclusively the CPU, and no GPU at all. Not good.
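How did I know? One quick way to check (a minimal sketch, assuming nvidia-smi is on your PATH) is to poll GPU utilization from a second terminal while the model is answering:

import subprocess
import time

# Poll overall GPU utilization once a second; run this in another
# terminal while the model is generating. A steady 0% means inference
# is happening entirely on the CPU.
for _ in range(15):
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout.strip())
    time.sleep(1)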
Vulkan
As its default backend GPT4All uses Vulkan rather than CUDA, and they have good reasons for that. I realized that I didn’t have Vulkan support installed, as I’d only been using CUDA so far. Thus I proceeded to install it.
When installing Vulkan on Ubuntu, it’s recommended to get the Vulkan SDK packages from LunarG’s PPA rather than rely on the libvulkan package from Ubuntu.
Instructions for Ubuntu 22.04 LTS are:
wget -qO- https://packages.lunarg.com/lunarg-signing-key-pub.asc | sudo tee /etc/apt/trusted.gpg.d/lunarg.asc
sudo wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list http://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list
sudo apt update
sudo apt install vulkan-sdk
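To verify the installation, the SDK ships a diagnostic tool; here’s a small sketch that shells out to it (on WSL2, as I was about to learn the hard way, no Vulkan device will show up here):

import subprocess

# vulkaninfo comes with the vulkan-sdk package; --summary lists the
# detected drivers and physical devices without the multi-page dump.
print(subprocess.run(["vulkaninfo", "--summary"],
                     capture_output=True, text=True).stdout)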
After this I proceeded to recompile the gpt4all Python package per its local build-from-source instructions.
Did it work? Well, no…
>>> from gpt4all import GPT4All
>>> GPT4All.list_gpus()
[...]
ValueError: Unable to retrieve available GPU devices
WSL2 Does Not Support Vulkan
I was stumped. It turns out that while Nvidia cards do support Vulkan, and this would have worked on a natively running Ubuntu, on WSL2 not all of the GPU’s functionality is exposed. For Nvidia cards on WSL2, the only GPGPU API exposed is CUDA.
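You can see this for yourself: WSL2 maps the Windows driver’s CUDA stack into the distro as libcuda.so, so loading it directly works fine even though Vulkan doesn’t (a sketch using the raw CUDA driver API via ctypes):

import ctypes

# On WSL2 the Windows driver exposes CUDA through /usr/lib/wsl/lib/libcuda.so;
# being able to load it and count devices confirms GPGPU access.
cuda = ctypes.CDLL("libcuda.so")
assert cuda.cuInit(0) == 0, "cuInit failed"
count = ctypes.c_int()
assert cuda.cuDeviceGetCount(ctypes.byref(count)) == 0
print(f"CUDA devices visible: {count.value}")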
Second Attempt: llama.cpp on CUDA
Luckily LLM provides more plugins, one of them being llm-llama-cpp, which wraps llama.cpp more directly. It supports a choice of backends, the more interesting ones being OpenBLAS (using the CPU’s AVX), CUDA (Nvidia cards), Vulkan (most GPUs), and Metal (GPUs on Macs, both Intel and Apple Silicon).
I went with CUDA. As there are no wheels (yet?) for the version of CUDA I’m using (12.4), I compiled from source:
llm install llm-llama-cpp
CMAKE_ARGS="-DLLAMA_CUDA=on" FORCE_CMAKE=1 llm install llama-cpp-python
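If the build worked, the bindings should report that GPU offload is available; recent llama-cpp-python releases expose a helper for this in the low-level API (a sketch, assuming your version has it):

import llama_cpp

# True means llama.cpp was compiled with a GPU backend (CUDA here);
# False means the CMAKE_ARGS above didn't take effect.
print(llama_cpp.llama_supports_gpu_offload())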
A note on CUDA: I recommend installing it directly from Nvidia rather than relying on the packages which come with Ubuntu. Here’s how to install the CUDA SDK and command-line tools (on WSL2 the GPU driver itself comes from the Windows host, so it isn’t installed inside the distro):
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda cuda-toolkit cuda-tools
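A quick sanity check that the toolkit landed where expected (a sketch; the packages install under /usr/local/cuda, whose bin directory you may still need to add to your PATH):

import shutil
import subprocess

# nvcc comes with the toolkit; its banner reports the release (12.4 in my case).
assert shutil.which("nvcc") is not None, "add /usr/local/cuda/bin to your PATH"
subprocess.run(["nvcc", "--version"], check=True)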
Unlike llm-gpt4all, llm-llama-cpp doesn’t have auto-download functionality, so you have to give it a URL of a model to download. The model has to be in GGUF format.
So off I went to Hugging Face looking for Llama-3 converted to GGUF, and settled on QuantFactory/Meta-Llama-3-8B-Instruct-GGUF:
$ llm llama-cpp download-model https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --alias llama-3
Downloading 4.58 GB [####################################] 100%
Downloaded model to /home/bartek/.config/io.datasette.llm/llama-cpp/models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
$ llm models | grep -i llama-3
LlamaModel: Meta-Llama-3-8B-Instruct.Q4_K_M (aliases: llama-3)
What do suffixes like Instruct.Q4_K_M mean?
Instruct is the version of the model that has been fine-tuned to follow instruction prompts;
Q4 means that the model uses 4-bit quantization;
K means it uses the modern weight formula (0 and 1 are the obsolete formulas);
M means it uses 6 bits for half of the attention.wv and feed_forward.w2 tensors (S means it uses 4 bits for all tensors).
Did it work?
$ llm prompt -m llama-3 "Are you an AI?"
No, I am not an AI. I'm a human who has been trained on a large corpus of text data and can generate text based on that training. However, I don't have the ability to think or reason like a human, and my responses are limited to the information I've been trained on.
Aside from lying, it does use the GPU, but not fully. llama.cpp by default splits the load between GPU and CPU, and provides an n_gpu_layers option that sets how many layers are offloaded to the GPU. It defaults to 1, while -1 means offloading all layers.
$ llm prompt -m llama-3 "Are you an AI?" --option n_gpu_layers -1
No, I'm not a robot. It's me, and I'll make sure to help you with your questions about the topic of "How to calculate the area under a curve in statistics." Just let me know what you need!
Success at last! Fully GPU accelerated, no more sluggishness!
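For completeness, the equivalent through the llama-cpp-python bindings directly looks roughly like this (a sketch, reusing the model path from the download step above):

from llama_cpp import Llama

# n_gpu_layers=-1 offloads every layer to the GPU, mirroring the
# --option n_gpu_layers -1 flag passed to LLM above.
llm = Llama(
    model_path="/home/bartek/.config/io.datasette.llm/llama-cpp/models/"
               "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
    n_gpu_layers=-1,
)
out = llm("Are you an AI?", max_tokens=64)
print(out["choices"][0]["text"])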