Running Mixtral 8x7B Instruct on a Single Nvidia GPU using Llama CPP Python
Requirements
- Python 3.8+
- C compiler
  - Linux: gcc or clang
  - Windows: Visual Studio or MinGW
  - MacOS: Xcode
This installation guide was tested with the following system specifications and package versions:
OS: Arch Linux (kernel 6.12.1-arch1-1, x86_64)
CPU: AMD Ryzen 7950X
GPU: Nvidia RTX 4090
RAM: 64GB 4800MHz
python: 3.12.4
gcc: 14.2.1 20240910
cuda: 12.5.1
cuda-tools: 12.5.1
cudnn: 9.5.1.17
llama_cpp_python: 0.3.2
Setup
It is recommended to create and activate a virtual environment with the desired Python version before continuing further. Using virtual environments keeps project Python packages separate from system packages.
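For example, a virtual environment can be created and activated with Python's built-in venv module. The path ~/dev/venvs/mixtral below is only a suggestion, but it matches the library path used by the test script later in this guide:
python -m venv ~/dev/venvs/mixtral
source ~/dev/venvs/mixtral/bin/activate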
Setup a Project Directory and Test File
Create a directory for your project and models:
mkdir -p ~/dev/mixtral/models
Create a test.py file to be used later for testing:
touch ~/dev/mixtral/test.py
Install the Llama CPP Python Binding Package
Install llama-cpp-python with CUDA support by setting the GGML_CUDA=on CMake flag via the CMAKE_ARGS environment variable before installing:
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
If you have accidentally already installed a version of llama-cpp-python that was compiled to use only your CPU, force a recompilation with CUDA support and reinstall:
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --no-cache-dir --force-reinstall --upgrade
Confirm Support for GPU Offload
Once llama-cpp-python is installed, confirm that GPU offloading is available by running the following Python code within your virtual environment.
Open the test.py file:
nvim ~/dev/mixtral/test.py
Paste the following code, update the path to your Python libraries inside your virtual environment, then save and quit:
import pathlib

from llama_cpp.llama_cpp import load_shared_library


def is_gpu_available_v3() -> bool:
    lib = load_shared_library(
        "llama",
        pathlib.Path(
            "~/dev/venvs/mixtral/lib/python3.12/site-packages/llama_cpp/lib"
        ),
    )
    return bool(lib.llama_supports_gpu_offload())


print(is_gpu_available_v3())
Run the above code:
python ~/dev/mixtral/test.py
The above code should print something similar to the following if successful:
...
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
True
Download a Mixtral 8x7B Instruct GGUF Model
Find a GGUF version of the Mixtral 8x7B model with the desired quantisation. Note that a quantisation level with a larger number of bits requires more memory but generally produces higher-quality output.
Models can be found on the mradermacher/Mixtral-8x7B-Instruct-v0.1-GGUF Hugging Face page. Try different quantisation sizes and methods. There are some reports that the newer I-quants (IQ2_XXS, IQ3_S, etc.) are only better if the whole model can be loaded into VRAM. K-quants (Q3_K_S, Q5_K_M, etc.) still offer good performance when the model is only partially offloaded to VRAM.
Download the desired model. Censored version:
curl -L -O --output-dir ~/dev/mixtral/models https://huggingface.co/mradermacher/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/Mixtral-8x7B-Instruct-v0.1.Q5_K_M.gguf
Uncensored version:
curl -L -O --output-dir ~/dev/mixtral/models/ https://huggingface.co/TheBloke/dolphin-2.5-mixtral-8x7b-GGUF/resolve/main/dolphin-2.5-mixtral-8x7b.Q5_K_M.gguf
Run the Model
Edit the test.py file:
nvim ~/dev/mixtral/test.py
Clear any content in the file then paste the following code. Be sure to update the model_path to your downloaded model, and adjust the number of GPU layers according to the amount of VRAM available. Save and quit when done.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/Mixtral-8x7B-Instruct-v0.1.Q5_K_M.gguf",
    n_gpu_layers=21,  # GPU acceleration (fewer layers = less VRAM used)
    # seed=1337,  # Uncomment to set a specific seed
    # n_ctx=2048,  # Uncomment to increase the context window
)

# Generate a completion, can also call create_completion
output = llm(
    "Q: Generate example sentences that contain multiple embedded clauses? A: ",  # Prompt
    max_tokens=None,  # Set to None to generate up to the end of the context window
    stop=["Q:", "\n"],  # Stop generating just before the model would generate a new question
    echo=True,  # Echo the prompt back in the output
)

# print(output)  # for full output
print(output["choices"][0]["text"])  # for partial output that only shows the prompt and response
Run the model:
python ~/dev/mixtral/test.py
The following should be produced at the end of the output:
Q: Generate example sentences that contain multiple embedded clauses? A: 1. "Although I had studied all night, I didn't think I would pass the exam because I had trouble understanding the material." 2. "If you visit New York, I recommend that you see a Broadway show, especially if you enjoy musicals, because they are a unique experience." 3. "Because she didn't want to disturb her sleeping baby, she tiptoed quietly to the kitchen to make herself a cup of tea." 4. "The teacher praised the student who had worked hard throughout the semester and had earned an A in the class." 5. "Whenever I see a beautiful sunset, I am reminded of how fortunate I am to be alive and to witness the beauty of nature."
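The same model can also be queried through llama-cpp-python's chat completion API, which formats the instruct prompt for you. The snippet below is a minimal sketch that reuses the model path and layer count from the example above; adjust both to your setup, and note that llama-cpp-python falls back to a default chat format if it cannot read a chat template from the GGUF metadata.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/Mixtral-8x7B-Instruct-v0.1.Q5_K_M.gguf",  # same model as above
    n_gpu_layers=21,  # adjust to your available VRAM, as before
)

# create_chat_completion formats the conversation for the model and
# returns an OpenAI-style response dictionary.
output = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Explain what an embedded clause is, with one example."},
    ],
    max_tokens=256,
)

print(output["choices"][0]["message"]["content"])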