


How to install LLAMA CPP with CUDA (on Windows)

As LLMs such as OpenAI's GPT have become very popular, there have been many attempts to run LLMs in a local environment. The best-known LLMs that we can install locally are the LLaMA models. However, running LLMs requires a lot of computing power, even just to generate text, so we need GPUs to speed up generation.

Recently, a C/C++ port of the LLaMA model, llama.cpp, has been developed. Since it is written in C/C++, a high-performance language, it can run very fast, and on a high-performance computing platform it could even respond faster than ChatGPT.

Although I don't have such a high-performance computing platform, I tried installing some llama.cpp models with the GPU enabled.

Zephyr 7B

It is a fine-tuned version of Mistral 7B, and it shows strong performance on extraction, coding, STEM, and writing compared to other similarly sized open models. The llama.cpp team introduced a new format called GGUF for llama.cpp models. The repo below contains the model in GGUF format, and this is the model I installed.

https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF
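
If you want to fetch one of the quantized files from that repo programmatically, the huggingface_hub package can do it. This is a minimal sketch; the exact filename is an assumption on my part, so pick whichever quantization you prefer from the repo's file list.

from huggingface_hub import hf_hub_download

# download one quantized GGUF file from the repo above
# (filename assumed; check the repo for the available quantizations)
model_path = hf_hub_download(
    repo_id="TheBloke/zephyr-7B-beta-GGUF",
    filename="zephyr-7b-beta.Q4_K_M.gguf",
)
print(model_path)  # local cache path of the downloaded file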

To use llama.cpp from Python, the llama-cpp-python package should be installed. But to use the GPU, we must set an environment variable first. Make sure there are no spaces or quotation marks (" " or ' ') in the value when you set the environment variable.

Since I use Anaconda, run the commands below in an Anaconda prompt to install llama-cpp-python.

# on anaconda prompt!
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
# some setups also require: set FORCE_CMAKE=1
pip install llama-cpp-python

# if the build somehow fails and you need to re-install, run the command below.
# it ignores previously downloaded files and re-installs with fresh ones.
pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir --verbose
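
As a quick sanity check that the package itself is importable (a minimal sketch; the version string will vary with your install):

# verify that llama-cpp-python is importable after the build
import llama_cpp
print(llama_cpp.__version__)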

Running the commands above showed no errors, but you still have to check whether everything was installed properly. When you actually run the model (with the verbose=True option), you can observe the startup logs, and BLAS must be reported as 1. Otherwise the LLaMA model will not use the GPU.
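
To trigger those logs, load the model with verbose=True and offload layers to the GPU via n_gpu_layers. This is a minimal sketch under my assumptions: the model path points at wherever you saved the GGUF file, and n_gpu_layers=-1 offloads every layer (use a smaller number if your VRAM is limited).

from llama_cpp import Llama

# load the GGUF model with GPU offloading enabled
# (model_path is an assumed location; adjust it to your setup)
llm = Llama(
    model_path="./models/zephyr-7b-beta.Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU
    verbose=True,     # print startup logs, which include the BLAS flag
)

# run a short completion; the startup logs should report BLAS = 1
output = llm("Q: What is the capital of France? A:", max_tokens=32)
print(output["choices"][0]["text"])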
