Since the release of generative AI LLMs, most of us have started using them in one way or another. The most common ways are through websites such as OpenAI’s ChatGPT, through APIs such as OpenAI’s GPT-3.5 API and Google’s PaLM API, or through other sites like Hugging Face and Perplexity.ai, all of which let us interact with these Large Language Models.
In all these approaches, our data is sent outside our computer, where it may be exposed to cyber-attacks (these websites promise the highest security, but we cannot know what might happen). Sometimes we want to run these Large Language Models locally and, if possible, tune them locally. In this article, we will do exactly that: set up LLMs locally with Oobabooga.
This article was published as a part of the Data Science Blogathon.
Oobabooga is a text-generation web interface for Large Language Models. It is built with Gradio, a Python library widely used by machine learning enthusiasts to build web applications. Oobabooga abstracts away the complicated setup otherwise needed to run a large language model locally, and it comes with a range of extensions for integrating other features.
With Oobabooga, you can provide the path of a model on Hugging Face, and it will download the model so that you can start running inference right away. Oobabooga has many functionalities and supports different model formats and backends, such as GGML, GPTQ, ExLlama, and llama.cpp. You can even load a LoRA (Low-Rank Adaptation) on top of an LLM with this UI. Oobabooga also lets you train a large language model to create chatbots or LoRAs. In this article, we will go through the installation of this software with Conda.
In this section, we will create a virtual environment using conda. To create a new environment, open the Anaconda Prompt and type the following.
conda create -n textgenui python=3.10.9
conda activate textgenui
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
The command above downloads the GPU build of the PyTorch Python library. Note that the CUDA (GPU) version we are downloading here is cu117. This changes over time, so it is best to visit the official PyTorch page for the command that installs the latest version. If you have no access to a GPU, you can go ahead with the CPU version.
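Before moving on, it can help to confirm which PyTorch build actually landed in the environment. The snippet below is a minimal sketch, not part of Oobabooga; the helper name `describe_torch_backend` is made up for illustration. It reports whether PyTorch is installed and whether it can see a CUDA GPU.

```python
import importlib.util


def describe_torch_backend() -> str:
    """Report whether PyTorch is installed and whether it can see a CUDA GPU."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"

    import torch  # imported lazily so the check works even without PyTorch

    if torch.cuda.is_available():
        # Index 0 is the first GPU visible to PyTorch
        gpu_name = torch.cuda.get_device_name(0)
        return f"PyTorch {torch.__version__} with CUDA GPU: {gpu_name}"
    return f"PyTorch {torch.__version__} (CPU only, no CUDA device visible)"


print(describe_torch_backend())
```

If this reports a CPU-only setup on a machine that has an NVIDIA GPU, the installed wheel probably does not match the local CUDA driver, and the install command from the PyTorch page is worth rechecking.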
Now, in the Anaconda Prompt, change to the directory where you want to download the code. You can either download it from GitHub or use the git clone command. Here, I will use git clone to clone Oobabooga’s repository into the directory I want, with the command below.
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
The above command installs all the packages required to run the large language model, such as Hugging Face Transformers, bitsandbytes, and Gradio. We are now ready to launch the web UI, which we can do with the command below.
python server.py
In the Anaconda Prompt, you will now see a URL, http://localhost:7860 or http://127.0.0.1:7860. Open this URL in your browser, and the UI will appear, looking as follows:
We have now successfully installed all the necessary libraries to start working with text-generation-webui, and our next step is to download a large language model.
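If the page does not load in the browser, a quick reachability check can tell whether the server is listening at all. This is a small sketch using only the Python standard library; the function name `is_ui_up` is hypothetical, and the default URL is the one printed in the Anaconda Prompt above.

```python
import urllib.error
import urllib.request


def is_ui_up(url: str = "http://127.0.0.1:7860", timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at `url` with a non-error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status
        return False
```

Calling `is_ui_up()` from a second Anaconda Prompt while `python server.py` is running should return `True`; `False` suggests the server has not finished starting or is listening on a different port.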
In this section, we will download a large language model from Hugging Face and then try running inference and chatting with it. To do this, navigate to the Model tab in the top bar of the UI. This opens the model page, which looks as follows:
Here, on the right side, we see “Download Custom model or LoRA”, and below it a text field with a Download button. In this text field, we must provide the model’s path from the Hugging Face website, and the UI will download it. Let’s try this with an example. I will download the Nous-Hermes model based on the newly released Llama 2, so I will go to its model card on Hugging Face, which can be seen below.
I will be downloading a 13B GPTQ model, a quantized version of the Nous-Hermes 13B model, which is itself based on Llama 2 (GPTQ models require a GPU to run; if you only want to run on the CPU, you can go with GGML models). To copy the path, click the copy button. Then scroll down to see the different quantized versions of the Nous-Hermes 13B model.
Here, for example, we will choose the gptq-4bit-32g-actorder_True version of the Nous-Hermes-GPTQ model. So now the path for this model will be “TheBloke/Nous-Hermes-Llama2-GPTQ:gptq-4bit-32g-actorder_True”, where the part before the “:” indicates the model name and the part after the “:” indicates the quantized version type of the model. Now, we will paste this into the text box we saw earlier.
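The “name:version” convention described above can be sketched in a few lines of Python. This is an illustration only, not the UI’s own downloader: `split_model_path` and `download_model` are hypothetical helpers, and `download_model` assumes the `huggingface_hub` package is installed and that the part after the “:” corresponds to a branch (revision) of the model repository. Unlike the UI, `snapshot_download` stores files in the Hugging Face cache rather than in the web UI’s models folder.

```python
from typing import Optional, Tuple


def split_model_path(path: str) -> Tuple[str, Optional[str]]:
    """Split 'user/model:version' into (model name, version or None)."""
    repo, sep, branch = path.partition(":")
    return repo, (branch if sep and branch else None)


def download_model(path: str) -> str:
    """Download the given model version and return the local folder path.

    Assumption: the version after ':' is a branch of the Hugging Face repo.
    """
    from huggingface_hub import snapshot_download  # lazy import, optional dependency

    repo, branch = split_model_path(path)
    # revision=None falls back to the repository's default branch
    return snapshot_download(repo_id=repo, revision=branch)


print(split_model_path("TheBloke/Nous-Hermes-Llama2-GPTQ:gptq-4bit-32g-actorder_True"))
# ('TheBloke/Nous-Hermes-Llama2-GPTQ', 'gptq-4bit-32g-actorder_True')
```

Note that `download_model` is defined but not called here, since calling it would start a download of several gigabytes.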
Now click the Download button to download the model. This will take some time, as the download is about 8 GB. After the model is downloaded, click the refresh button to the left of the Load button, then select the model you want to use from the drop-down. If the model is a CPU version, you can click the Load button directly, as shown below.
If you use a GPU-type model, like the GPTQ one we downloaded here, you must first allocate GPU VRAM for the model. As the model size is around 8 GB, we allocate around 10 GB of memory to it (I have sufficient GPU VRAM, so I am providing 10 GB). Then we click the Load button, as shown in the screenshot below.
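As a rough sanity check on these memory figures, the size of the raw weights can be estimated from the parameter count and the bit width. The helper below is a back-of-the-envelope sketch with a made-up name, `quantized_weight_gb`; it ignores quantization metadata, any layers kept at higher precision, and the memory needed for activations and context during inference, which is presumably why the actual download (about 8 GB) and the VRAM allocation (10 GB) are larger than the estimate.

```python
def quantized_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the raw weights in GB (1 GB = 1e9 bytes)."""
    # billions of parameters * bits per parameter / 8 bits per byte = GB
    return params_billion * bits_per_weight / 8


print(quantized_weight_gb(13, 4))   # 6.5  (13B model at 4 bits per weight)
print(quantized_weight_gb(13, 16))  # 26.0 (same model at 16 bits per weight)
```

The comparison shows why quantization matters for local use: at 16 bits per weight, the same 13B model would need roughly four times as much memory for its weights alone.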
After clicking the Load button, go to the Session tab and change the mode from default to chat. Then click the Apply and restart button, as shown in the picture.
Now we are ready to run inference, i.e., we can start interacting with the model that we have downloaded. Go to the Text Generation tab, which will look something like this:
So, it’s time to test our Nous-Hermes-13B Large Language Model that we downloaded from Hugging Face through the Text Generation UI. Let’s start the conversation.
We can see from the above that the model is indeed working fine. It did not get overly creative, i.e., it did not hallucinate, and it answered my questions correctly. We asked the large language model to generate Python code for the Fibonacci series, and it wrote working Python code that matches the request. Along with that, it even explained how the code works. This way, you can download and run any model through the Text Generation UI, all of it locally, ensuring the privacy of your data.
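The model’s exact output is not reproduced here. For reference, a typical correct answer to a “Fibonacci series in Python” prompt looks something like the sketch below, which is written for this illustration and is not the model’s response.

```python
from typing import List


def fibonacci(n: int) -> List[int]:
    """Return the first n Fibonacci numbers, starting from 0."""
    if n < 0:
        raise ValueError("n must be non-negative")
    series: List[int] = []
    a, b = 0, 1
    for _ in range(n):
        series.append(a)
        a, b = b, a + b  # advance to the next pair of consecutive terms
    return series


print(fibonacci(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

Comparing a generated answer against a known-good version like this is a simple way to check a local model’s code output.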
In this article, we have gone through a step-by-step process of setting up text-generation-webui, which allows us to interact with large language models directly within our local environment without being connected to the network. We have looked at how to download a specific version of a model from Hugging Face and have learned which quantization methods the application currently supports. This way, anyone can access a large language model, even one based on the newly released Llama 2, as we have seen in this article.
Some of the key takeaways from this article include:
A. Oobabooga’s text-generation-webui is a UI built with the Gradio package in Python that allows anyone to download and run large language models locally.
A. We can download models with this UI just by providing the model’s path. We can obtain this path from the Hugging Face website, which hosts thousands of large language models.
A. No. Here, we are running the large language model completely on our local machine. We only need the internet when downloading the model; after that, we can run inference without it, so everything happens locally on our computer. The data you use in the chat is not stored or sent anywhere on the internet.
A. Yes, absolutely. You can either fully train a model that you download or create a LoRA on top of it. We can download a vanilla large language model like Llama or Llama 2, train it further on our custom data for any application, and then run inference with the resulting model.
A. Yes, we can run quantized models, such as 2-bit, 4-bit, 6-bit, and 8-bit models, on it. It supports models quantized with GPTQ and GGML, along with backends like ExLlama and llama.cpp. If you have a larger GPU, you can run the whole model without quantization.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.