Running Mixtral 8x7B Instruct on a Single Nvidia GPU using Llama CPP Python

Requirements

  • Python 3.8+
  • C compiler
    • Linux: gcc or clang
    • Windows: Visual Studio or MinGW
    • macOS: Xcode

This installation guide was tested with the following system specifications and package versions:

OS            Arch Linux
Linux kernel  6.12.1-arch1-1 x86_64
CPU           AMD Ryzen 7950X
GPU           Nvidia RTX 4090
RAM           64GB 4800MHz

python            3.12.4
gcc               14.2.1 20240910
cuda              12.5.1
cuda-tools        12.5.1
cudnn             9.5.1.17
llama_cpp_python  0.3.2

Setup

It is recommended to create and activate a virtual environment with the desired Python version before continuing further. Using virtual environments keeps project Python packages separate from system packages.
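
For example, a virtual environment can be created with Python's built-in venv module. The ~/dev/venvs/mixtral location below is simply a choice that matches the library path used later in this guide:

python -m venv ~/dev/venvs/mixtral
source ~/dev/venvs/mixtral/bin/activate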

Set Up a Project Directory and Test File

Create a directory for your project and models:

mkdir -p ~/dev/mixtral/models

Create a test.py file to be used later for testing.

touch ~/dev/mixtral/test.py

Install the Llama CPP Python Binding Package

Install llama-cpp-python with CUDA support by passing -DGGML_CUDA=on to CMake via the CMAKE_ARGS environment variable when installing:

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

If you have already installed a version of llama-cpp-python that was compiled for the CPU only, force a recompilation with CUDA support and reinstall:

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --no-cache-dir --force-reinstall --upgrade
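
To check which version of llama-cpp-python ended up installed, pip's standard show command can be used:

pip show llama-cpp-python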

Confirm Support for GPU Offload

Once llama-cpp-python is installed, confirm that GPU offloading is available by running the following Python code within your virtual environment.

Open the test.py file:

nvim ~/dev/mixtral/test.py

Paste the following code, update the hard-coded path so it points to the llama_cpp/lib directory inside your virtual environment's site-packages, then save and quit:

import pathlib

from llama_cpp.llama_cpp import load_shared_library


def is_gpu_available() -> bool:
    # Load the shared llama library bundled with llama-cpp-python and ask it
    # whether it was built with GPU offload support.
    lib = load_shared_library(
        "llama",
        pathlib.Path(
            "~/dev/venvs/mixtral/lib/python3.12/site-packages/llama_cpp/lib"
        ).expanduser(),
    )
    return bool(lib.llama_supports_gpu_offload())


print(is_gpu_available())

Run the above code:

python test.py

The above code should print something similar to the following if successful:

...
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
True
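
If you would rather not hard-code the site-packages path, the lib directory can also be derived from the installed package's location. A minimal sketch of the same check, assuming the shared libraries live in llama_cpp/lib as in the path above:

import pathlib

import llama_cpp
from llama_cpp.llama_cpp import load_shared_library

# Locate the bundled shared libraries relative to the installed llama_cpp package
lib_dir = pathlib.Path(llama_cpp.__file__).parent / "lib"
lib = load_shared_library("llama", lib_dir)
print(bool(lib.llama_supports_gpu_offload()))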

Download a Mixtral 8x7B Instruct GGUF Model

Find a GGUF version of the Mixtral 8x7B Instruct model with the desired quantisation. Note that a quantisation level with a larger number of bits requires more memory but also produces higher-quality output.

Models can be found on the mradermacher/Mixtral-8x7B-Instruct-v0.1-GGUF Hugging Face page. Try different quantisation sizes and methods. There are some reports that the newer I-quants (IQ2_XXS, IQ3_S, etc.) are only better when the whole model fits into VRAM, while K-quants (Q3_K_S, Q5_K_M, etc.) still offer good performance when the model is only partially offloaded to VRAM.
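
To see how much VRAM you have to work with before choosing a quantisation level (and, later, a value for n_gpu_layers), nvidia-smi, which ships with the Nvidia driver, can report total and used GPU memory:

nvidia-smi --query-gpu=memory.total,memory.used --format=csv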

Download the desired model. Censored version:

curl -L -O --output-dir ~/dev/mixtral/models https://huggingface.co/mradermacher/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/Mixtral-8x7B-Instruct-v0.1.Q5_K_M.gguf

Uncensored version:

curl -L -O --output-dir ~/dev/mixtral/models/ https://huggingface.co/TheBloke/dolphin-2.5-mixtral-8x7b-GGUF/resolve/main/dolphin-2.5-mixtral-8x7b.Q5_K_M.gguf
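
The downloads are large, so it is worth confirming that the files arrived in full, for example by comparing the sizes reported below against those listed on the Hugging Face pages:

ls -lh ~/dev/mixtral/models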

Run the Model

Edit the test.py file:

nvim ~/dev/mixtral/test.py

Clear any content in the file, then paste the following code. Be sure to update the model_path to your downloaded model, and adjust the number of GPU layers according to the amount of VRAM available. Note that the model_path is relative, so the script expects to be run from the project directory. Save and quit when done.

from llama_cpp import Llama

llm = Llama(
    model_path="./models/Mixtral-8x7B-Instruct-v0.1.Q5_K_M.gguf",
    n_gpu_layers=21,  # GPU acceleration (fewer layers = less VRAM used)
    # seed=1337,  # Uncomment to set a specific seed
    # n_ctx=2048,  # Uncomment to increase the context window
)

output = llm(
    "Q: Generate example sentences that contain multiple embedded clauses? A: ",  # Prompt
    max_tokens=None,  # None generates up to the end of the context window
    stop=[
        "Q:",
        "\n",
    ],  # Stop generating just before the model would generate a new question
    echo=True,  # Echo the prompt back in the output
)  # Generate a completion; create_completion can also be called directly

# print(output)  # Full output
print(output["choices"][0]["text"])  # Partial output showing only the prompt and response

Run the model from the project directory so that the relative model_path resolves:

cd ~/dev/mixtral
python test.py

The following should be produced at the end of the output:

Q: Generate example sentences that contain multiple embedded clauses? A: 1. "Although I had studied all night, I didn't think I would pass the exam because I had trouble understanding the material." 2. "If you visit New York, I recommend that you see a Broadway show, especially if you enjoy musicals, because they are a unique experience." 3. "Because she didn't want to disturb her sleeping baby, she tiptoed quietly to the kitchen to make herself a cup of tea." 4. "The teacher praised the student who had worked hard throughout the semester and had earned an A in the class." 5. "Whenever I see a beautiful sunset, I am reminded of how fortunate I am to be alive and to witness the beauty of nature."
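
As an optional follow-up, llama-cpp-python also exposes a chat-style interface on the same Llama object via create_chat_completion. A minimal sketch, again run from the project directory; the messages are illustrative, and the chat template is normally picked up from the GGUF metadata:

from llama_cpp import Llama

llm = Llama(
    model_path="./models/Mixtral-8x7B-Instruct-v0.1.Q5_K_M.gguf",
    n_gpu_layers=21,  # Adjust to the amount of VRAM available
)

output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful writing assistant."},
        {"role": "user", "content": "Give me one sentence with multiple embedded clauses."},
    ],
    max_tokens=256,
)

print(output["choices"][0]["message"]["content"])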