Running Mixtral 8x7B Instruct on a Single Nvidia GPU using Llama CPP Python
Requirements
- Python 3.8+
- C compiler
  - Linux: gcc or clang
  - Windows: Visual Studio or MinGW
  - MacOS: Xcode
This installation guide was tested with the following system specifications and package versions:
- OS: Arch Linux (kernel 6.12.1-arch1-1, x86_64)
- CPU: AMD Ryzen 7950X
- GPU: Nvidia RTX 4090
- RAM: 64GB 4800MHz
- python: 3.12.4
- gcc: 14.2.1 20240910
- cuda: 12.5.1
- cuda-tools: 12.5.1
- cudnn: 9.5.1.17
- llama_cpp_python: 0.3.2
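You can compare these against your own system with version checks such as the following (assuming the CUDA toolkit and Nvidia driver are already installed and on your PATH):
python --version
gcc --version
nvcc --version
nvidia-smi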
Setup
It is recommended to create and activate a virtual environment with the desired Python version before continuing further. Using virtual environments keeps project Python packages separate from system packages.
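For example, using Python's built-in venv module (the ~/dev/venvs/mixtral location is only an example; it matches the site-packages path used later in this guide):
python -m venv ~/dev/venvs/mixtral
source ~/dev/venvs/mixtral/bin/activate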
Setup a Project Directory and Test File
Create a directory for your project and models:
mkdir -p ~/dev/mixtral/models
Create a test.py file to be used later for testing.
touch ~/dev/mixtral/test.py
Install the Llama CPP Python Binding Package
Install llama-cpp-python with CUDA support by passing the -DGGML_CUDA=on CMake flag via the CMAKE_ARGS environment variable when installing:
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
If you have accidentally already installed a version of llama-cpp-python that was compiled for your CPU, force a recompilation with CUDA support and reinstall:
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --no-cache-dir --force-reinstall --upgrade
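To confirm which version of the binding was installed (this guide was tested with 0.3.2), you can print the package version:
python -c "import llama_cpp; print(llama_cpp.__version__)"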
Confirm Support for GPU Offload
Once llama-cpp-python is installed, confirm that GPU offloading is available by running the following Python code within your virtual environment.
Open the test.py file:
nvim ~/dev/mixtral/test.py
Paste the following code, update the path to your Python libraries inside your virtual environment, then save and quit:
import pathlib

from llama_cpp.llama_cpp import load_shared_library


def is_gpu_available() -> bool:
    # Load the bundled llama shared library and ask it whether GPU offload
    # support was compiled in. Update the path below to point at the
    # llama_cpp/lib directory inside your own virtual environment.
    lib = load_shared_library(
        "llama",
        pathlib.Path(
            "~/dev/venvs/mixtral/lib/python3.12/site-packages/llama_cpp/lib"
        ).expanduser(),  # expanduser() resolves the leading "~"
    )
    return bool(lib.llama_supports_gpu_offload())


print(is_gpu_available())
Run the above code:
python test.py
The above code should print something similar to the following if successful:
...
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
True
Download a Mixtral 8x7B Instruct GGUF Model
Find a GGUF version of the Mixtral 8x7B Instruct model with the desired quantisation. Note that a quantisation level with a larger number of bits requires more memory but also gives better output quality.
Models can be found on the mradermacher/Mixtral-8x7B-Instruct-v0.1-GGUF Hugging Face page. Try different quantisation sizes and methods. There are some reports that the newer I-quants (IQ2_XXS, IQ3_S, etc.) are only better if the whole model can be loaded into VRAM, while K-quants (Q3_K_S, Q5_K_M, etc.) still offer good performance when the model is only partially offloaded to VRAM.
Download the desired model. Censored version:
curl -L -O --output-dir ~/dev/mixtral/models https://huggingface.co/mradermacher/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/Mixtral-8x7B-Instruct-v0.1.Q5_K_M.gguf
Uncensored version:
curl -L -O --output-dir ~/dev/mixtral/models/ https://huggingface.co/TheBloke/dolphin-2.5-mixtral-8x7b-GGUF/resolve/main/dolphin-2.5-mixtral-8x7b.Q5_K_M.gguf
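As an alternative to curl, the huggingface_hub command-line tool can download a single file from a repository; a sketch, assuming huggingface_hub is installed in the active virtual environment:
pip install huggingface_hub
huggingface-cli download mradermacher/Mixtral-8x7B-Instruct-v0.1-GGUF Mixtral-8x7B-Instruct-v0.1.Q5_K_M.gguf --local-dir ~/dev/mixtral/models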
Run the Model
Edit the test.py file:
nvim ~/dev/mixtral/test.py
Clear any content in the file, then paste the following code. Be sure to update the model_path to your downloaded model and adjust the number of GPU layers according to the amount of VRAM available. Save and quit when done.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/Mixtral-8x7B-Instruct-v0.1.Q5_K_M.gguf",  # Path is relative to the directory you run the script from
    n_gpu_layers=21,  # GPU acceleration (fewer layers = less VRAM used)
    # seed=1337,  # Uncomment to set a specific seed
    # n_ctx=2048,  # Uncomment to increase the context window
)
output = llm(
    "Q: Generate example sentences that contain multiple embedded clauses? A: ",  # Prompt
    max_tokens=None,  # None generates up to the end of the context window
    stop=[
        "Q:",
        "\n",
    ],  # Stop generating just before the model would generate a new question
    echo=True,  # Echo the prompt back in the output
)  # Generate a completion; create_completion can also be called directly
# print(output)  # Full output, including metadata
print(output["choices"][0]["text"])  # Only the prompt and the generated response
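Since Mixtral 8x7B Instruct is an instruction-tuned model, llama-cpp-python's chat interface can also be used; it formats the messages with the model's chat template and returns an OpenAI-style response. A minimal sketch, reusing the same model file and GPU layer count as above:
from llama_cpp import Llama

llm = Llama(
    model_path="./models/Mixtral-8x7B-Instruct-v0.1.Q5_K_M.gguf",
    n_gpu_layers=21,  # Adjust to the amount of VRAM available
)
# create_chat_completion() applies the chat template and returns an
# OpenAI-style response dictionary.
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "Generate example sentences that contain multiple embedded clauses.",
        },
    ],
    max_tokens=512,
)
print(response["choices"][0]["message"]["content"])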
Run the model:
python ~/dev/mixtral/test.py
The following should be produced at the end of the output:
Q: Generate example sentences that contain multiple embedded clauses? A: 1. "Although I had studied all night, I didn't think I would pass the exam because I had trouble understanding the material." 2. "If you visit New York, I recommend that you see a Broadway show, especially if you enjoy musicals, because they are a unique experience." 3. "Because she didn't want to disturb her sleeping baby, she tiptoed quietly to the kitchen to make herself a cup of tea." 4. "The teacher praised the student who had worked hard throughout the semester and had earned an A in the class." 5. "Whenever I see a beautiful sunset, I am reminded of how fortunate I am to be alive and to witness the beauty of nature."