Fine-Tuning Llama 2 with QLoRa: Create Your Personalized Chatbot
Chapter 1: Introduction to Llama 2
In the realm of advanced language processing, Llama 2 stands out as a remarkable large language model (LLM) introduced by Meta. The model showcases exceptional performance across various public benchmarks, excelling in both natural language generation and coding tasks. Meta has also launched chat-oriented versions of Llama 2, resembling the capabilities of OpenAI's ChatGPT, thus allowing users to build effective chatbots.
Llama 2 comes in several configurations, including 7 billion, 13 billion, and 70 billion parameters, with a 34 billion version discussed in the research paper but not yet available. The 7B and 13B models are particularly suitable for local deployment due to advancements in quantization techniques like GPTQ or QLoRa, enabling users to fine-tune and operate these models on standard consumer hardware.
To guide you through the fine-tuning process of Llama 2 with QLoRa on instruction datasets, I’ll utilize Hugging Face's TRL library, which streamlines the fine-tuning of LLMs. By the end of this guide, you’ll have your very own Llama 2 chat model functioning on your machine.
The first video, titled "Llama 2 Fine-Tune with QLoRA [Free Colab]", explains the process of fine-tuning Llama 2 using QLoRa in a Colab environment, making it accessible for everyone.
Chapter 2: Accessing Llama 2
To begin, you need to register with Meta to gain access to Llama 2. Complete the registration form, and you should receive an email confirmation from Meta within an hour. Additionally, if you plan to use the Hugging Face Hub, you’ll need to create an account there. Ensure that the email linked to your Hugging Face account matches the one used for obtaining Llama 2 weights.
After logging into Hugging Face, navigate to a Llama 2 model card, follow the on-screen instructions, and check the necessary boxes to gain access to the model. This process may take a bit longer, but you should have access to Llama 2 on Hugging Face Hub within a day. Make sure to generate an access token in your Hugging Face settings, as it will be essential for further steps.
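If you prefer to authenticate from Python rather than the command line, a minimal way to do it with the huggingface_hub library looks like this (the token string is a placeholder for your own access token):
from huggingface_hub import login

# Stores the token locally so later from_pretrained calls can find it
login(token="hf_mytoken")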
Section 2.1: Understanding QLoRa
QLoRa is a cutting-edge method for fine-tuning quantized LLMs. It functions similarly to LoRa but incorporates several enhancements to optimize memory efficiency.
In brief, the procedure is as follows:
- The model is loaded and quantized on the fly with a novel 4-bit data type (4-bit NormalFloat, NF4).
- A second round of quantization (double quantization) is applied to the quantization constants themselves, saving a bit more memory.
- The parameters of the 4-bit LLM remain frozen while low-rank adapters are added and initialized on top of the model.
- The adapter parameters are trained with a paged Adam optimizer, whose states can be paged to CPU memory to avoid running out of VRAM.
This approach allows for the training of additional LoRa parameters while retaining the original model in the background, resulting in a fine-tuned LLM that performs comparably to one fine-tuned via traditional methods.
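To make the steps above concrete, here is a minimal sketch of how a 4-bit NF4 model with double quantization and LoRa adapters can be set up with bitsandbytes and PEFT. The compute dtype, LoRa rank, and target modules below are illustrative choices of mine, not prescribed values:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-hf"

# 4-bit NormalFloat quantization with double quantization of the constants
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0},
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters added on top of the frozen 4-bit model
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
In practice, you would combine this with the padding setup described in the next section, i.e., pass the quantization_config when loading the model there.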
The second video, "Efficient Fine-Tuning for Llama 2 on Custom Dataset with QLoRA on a Single GPU in Google Colab," dives into efficient fine-tuning techniques for Llama 2, providing insights into utilizing custom datasets effectively.
Section 2.2: Padding Llama 2
Llama 2 lacks a built-in padding token, which presents a challenge since most fine-tuning libraries expect one. Many tutorials suggest creating a pad token by duplicating the end-of-sequence (EOS) token. However, while functional, this is technically incorrect as the EOS token serves a crucial role in indicating the end of a sequence.
An alternative solution is to assign the padding token as an unknown (UNK) token, allowing it to appear anywhere in the sequence without complications related to padding direction.
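If you opt for that workaround, it only takes two lines (using the tokenizer and model loaded as in the snippet below):
# Reuse the existing UNK token as the padding token instead of adding a new one
tokenizer.pad_token = tokenizer.unk_token
model.config.pad_token_id = tokenizer.pad_token_id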
The recommended method involves adding a dedicated padding token through the following code:
model_name = "meta-llama/Llama-2-7b-hf"
ACCESS_TOKEN = "hf_mytoken"
# Tokenizer setup
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, use_auth_token=ACCESS_TOKEN)
# Create and add a new padding token to the tokenizer
tokenizer.add_special_tokens({"pad_token": ""})
model = LlamaForCausalLM.from_pretrained(model_name, use_auth_token=ACCESS_TOKEN)
# Adjust token embeddings
model.resize_token_embeddings(len(tokenizer))
# Set the padding token in the model configuration
model.config.pad_token_id = tokenizer.pad_token_id
Now, Llama 2 will effectively utilize the new padding token.
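As a quick sanity check of the setup above, tokenizing a small batch with padding enabled should now fill the shorter sequence with the new pad token instead of raising an error:
# The shorter prompt is padded with the new <pad> token id
batch = tokenizer(
    ["Hello!", "A somewhat longer prompt used for padding."],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"])
print(tokenizer.pad_token_id in batch["input_ids"][0].tolist())  # True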
Chapter 3: Fine-Tuning Llama 2 on Guanaco
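As mentioned in the introduction, the fine-tuning itself goes through Hugging Face's TRL library. The sketch below shows what that training step can look like with TRL's SFTTrainer, reusing the quantized model and tokenizer set up in the previous sections; the dataset identifier (timdettmers/openassistant-guanaco, a common Guanaco instruction dataset) and all hyperparameters are illustrative assumptions, not the exact settings of the notebook:
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Guanaco-style instruction data with "### Human:"/"### Assistant:" turns in a single text field
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

training_arguments = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    max_steps=500,
    logging_steps=10,
    save_steps=10,
    optim="paged_adamw_8bit",
    fp16=True,
)

trainer = SFTTrainer(
    model=model,                   # the 4-bit model with LoRa adapters from the earlier sketch
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_arguments,
)
trainer.train()
Training writes checkpoints containing only the adapter weights into the output directory, which is what the inference code below loads.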
The full notebook for fine-tuning Llama 2 on Guanaco is available through The Kaitchup. For inference, simply load the saved adapter on top of the original model; keep in mind that the Llama 2 7B model should be loaded the same way it was loaded for training.
Here’s a sample code snippet to generate responses:
from peft import PeftModel
from transformers import GenerationConfig

# Load the saved QLoRa adapter on top of the original model
model = PeftModel.from_pretrained(model, "./results/checkpoint-10")

def generate(instruction):
    prompt = "### Human: " + instruction + "### Assistant: "
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=GenerationConfig(temperature=1.0, top_p=1.0, top_k=50, num_beams=1),
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=256,
    )
    # Decode and print only the assistant's part of the completion
    for seq in generation_output.sequences:
        output = tokenizer.decode(seq)
        print(output.split("### Assistant: ")[1].strip())
generate("Tell me about gravitation.")
The output should resemble this:
"Gravitation is the force attracting all objects towards one another in the universe..."
Conclusion
Guanaco is a relatively compact dataset; for potentially better results, you may consider larger datasets like Alpaca. If you possess a more powerful GPU with 24 GB of VRAM, such as an RTX 3090 or newer, you can experiment with the 13B version of Llama 2 as well.
This guide was initially published in The Kaitchup. Subscribe for more in-depth articles and insights on AI and technology.