4Bit QLoRA - 13B & 33B on 24GB VRAM or less?
Author: Elinas
Preface
I had heard of training quantized models for some time now, specifically 4bit/int4. I shrugged them off as most likely being low quality - since you’re taking a model that has already been quantized (for inference) and then training on it. I was pleasantly surprised and will give you a quick overview on how I replicated Chronos-13B using a single 3090 with a ~22% speed increase over 8bit/int8.
To note - LLaMA 7B and 13B can be trained well under 24GB VRAM. Training 30/33B on a single 3090 was the original goal.
Bitsandbytes nf4 Format is Added to Transformers
Since I wanted to try int4 training, and I had a 3090 sitting around doing nothing, I decided to do a bit of research on how the process works and how to set it up. I won’t go into the technical details, but you can read this blog post for more info. If you look at that blog post, the title is “Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA”, which is exactly what I am going to explain.
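As a rough sketch of what that blog post describes, loading a model in the bitsandbytes nf4 format with Transformers looks something like this (the dtype and double-quantization choices here are illustrative defaults, not necessarily the exact setup used for this run):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# nf4 quantization config as introduced by QLoRA; bfloat16 compute
# and double quantization are the commonly recommended choices.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# The base weights are quantized to 4bit on load, while the LoRA
# adapters added on top of them train in higher precision.
model = AutoModelForCausalLM.from_pretrained(
    "elinas/llama-13b-hf-transformers-4.29",
    quantization_config=bnb_config,
    device_map="auto",
)
```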
Some Technical Details & Getting to Training (Soon)
Previously, to train a LoRA model, you’d need to use the 8bit or int8 type (not to be confused with FP8 and other floating-point formats, which are more precise than integers). Nonetheless, the prospect of training LLaMA 33B on my 3090 was exciting, and the added speed was a bonus.
Since LLaMA and many other models have a default “memory,” or otherwise known as “context length,” I always try to train my models at a length of 2048 tokens which is what LLaMA was trained at. This includes getting a high quality dataset, with many samples that are around the range of 2048 tokens for the model’s max potential.
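As a quick illustration of that kind of dataset curation, here is a sketch of a length filter (this helper is my own, not part of any trainer, and the 1.3 tokens-per-word ratio is a rough heuristic - a real check would run the model’s actual tokenizer):

```python
def approx_token_count(text: str) -> int:
    # Rough heuristic: LLaMA-style tokenizers average roughly 1.3
    # tokens per English word. Use the real tokenizer for exact counts.
    return int(len(text.split()) * 1.3)

def filter_by_length(samples, max_tokens=2048, min_tokens=256):
    # Keep samples long enough to be useful but still fitting inside
    # the model's 2048-token context window.
    kept = []
    for s in samples:
        text = s["instruction"] + s.get("input", "") + s["output"]
        if min_tokens <= approx_token_count(text) <= max_tokens:
            kept.append(s)
    return kept
```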
Rough Tests & Findings (LLaMA)
- You can train LLaMA 13B at 2048 context on a 3090 (24GB) - though you CANNOT train 30/33B at 2048 context. Trust me, I tried every optimization I could think of short of offloading.
- You CAN train LLaMA 30/33B at 1024 tokens on a 3090 (and I was able to push that to 1500).
- Caveat: You’re losing 500-1024 tokens of context if you decide to train 30/33B, so I would not recommend it.
- Pro: I would recommend training LLaMA 7B or 13B as they can easily fit on a <24GB card.
Experiment - Replicating Chronos-13B in 4bit
I wanted to try this method, and since I did not want to train 30/33B (Chronos versions of that size were not out yet either), I decided to replicate the 8bit version of the model as closely as I could in 4bit. For this I used my own trainer, in which I implemented QLoRA. The trainer was originally adapted from the Stanford Alpaca LoRA repo but has since evolved to include numerous features.
Setting up your Environment
I used WSL as a PoC for this and I highly recommend you use Linux or WSL for simplicity.
You can use my trainer to accomplish training easily, it can be found here: Zeus LLM Trainer
I’ll provide the instructions as I did myself originally:
- Create the venv -
python -m venv venv
- Activate the venv -
source venv/bin/activate
- Install the requirements -
pip install -r requirements.txt
That should have you covered, now you will need a dataset, which there are many to choose from if you browse Hugging Face.
For this demo we’ll use the GPT4 Alpaca LoRA Dataset, but you can use any dataset as long as it is in json or jsonl format with the following fields:
- instruction
- input (can be an empty string but is required)
- output
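A minimal sanity check for that format might look like this (my own sketch, not part of the trainer):

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_dataset(path: str) -> int:
    # Load a json dataset and confirm every sample carries the three
    # required keys; "input" may be empty but must be present.
    with open(path) as f:
        samples = json.load(f)
    for i, sample in enumerate(samples):
        missing = REQUIRED_KEYS - sample.keys()
        if missing:
            raise ValueError(f"sample {i} is missing keys: {sorted(missing)}")
    return len(samples)
```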
Running the Model
Note: I changed the base model, as it was originally pointing to the wrong one; the correct base model for this demo is elinas/llama-13b-hf-transformers-4.29.
Here is the run configuration I used:
python finetune.py \
--base_model='elinas/llama-13b-hf-transformers-4.29' \
--data_path='dataset.json' \
--train_4bit \
--num_train_epochs=3 \
--cutoff_len=2048 \
--val_set_size=0 \
--output_dir='./13b-4bit-qlora-chronos' \
--lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
--lora_r=128 \
--lora_alpha=256 \
--gradient_accumulation_steps=8 \
--per_device_train_batch_size=2 \
--save_and_eval_steps=500 \
--warmup_ratio=0.04 \
--group_by_length \
--save_total_limit=2 \
--use_xformers
I won’t go through every parameter but there are some you should be familiar with.
--base_model='elinas/llama-13b-hf-transformers-4.29'
is just saying to use the LLaMA 13B model I created a while ago, which will be downloaded automatically.
--train_4bit
simply signifies that you are using QLoRA to train your model; this flag is required.
--num_train_epochs=3
generally we train for 3 epochs, sometimes more, but rarely less unless the model is overfitting.
--cutoff_len=2048
is where we want the model to cut samples off. Currently, without alternate methods, 2048 is the max.
--use_xformers
is a nice “hack” to reduce VRAM usage quite significantly at nil cost to the end result.
--per_device_train_batch_size=2
should be kept at 2 for this demo, BUT you may be able to increase --gradient_accumulation_steps=8 to a higher value like 16, as I purposely under-provisioned VRAM to ensure there were no crashes.
With these settings, I was using ~18GB VRAM, and that includes 2 additional LoRA attention layers, so this might fit on a 16GB card like an A4000. Additionally, 7B should be trainable with a 12GB card.
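As a back-of-envelope check on that ~18GB figure, here is my own rough arithmetic (the weights themselves are only part of the story; activations, gradients, optimizer state, and CUDA overhead make up the rest of the measured usage):

```python
def lora_param_count(hidden=5120, layers=40, targets=4, r=128):
    # LLaMA 13B: hidden size 5120, 40 layers. Each LoRA adapter on a
    # (hidden x hidden) projection adds two low-rank matrices:
    # A (hidden x r) and B (r x hidden).
    return layers * targets * 2 * hidden * r

def rough_weight_vram_gb(base_params=13e9, lora_r=128):
    # 4bit base weights: ~0.5 bytes per parameter.
    base_gb = base_params * 0.5 / 1e9
    # LoRA adapters plus their gradients in 16bit: ~2 bytes each, x2.
    lora_gb = lora_param_count(r=lora_r) * 2 * 2 / 1e9
    return base_gb + lora_gb
```

This puts the 4bit base weights around 6.5GB and the r=128 adapters around 210M parameters, which is why the remaining headroom goes to activations and training state rather than the model itself.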
Please read the rest of my documentation here for more information on the hyperparameters.
The Results
Compared to training the same configuration on 2x RTX A6000 GPUs, I estimated a 22%+ increase in speed alone. Now, how good is the model compared to the original? Not as good, but if I had never used the original model, or did a completely blind test, it would be harder to tell the differences apart.
The quality drop is not significant enough to warrant detailed comparisons here, as the models are simply different. I have not been able to pin down the exact reason behind the results, other than that it follows instructions... not quite as well. That does not mean 4bit QLoRA is bad; quite the opposite, actually. I’ve demonstrated that LoRAs are comparable to finetunes when paired with great datasets.
Conclusion
Should you bother? - It’s really up to you and what your goal is. I enjoy experimenting with bleeding-edge tech, and while this might not be as good as an FP16 model, neither is the quantized 4bit model most of you use anyway. It’s just another method, and should be seen as that, not dismissed as “worse” because it uses lower precision. In the end, if you’re going to train, you should have a dataset prepared.
Will I personally do it again? - Yes! I am planning to try it on LLaMA 65B and see the results from a large parameter model. Though, current priority is on extending context length on the Chronos models.
If you are interested in LLMs, Deep Learning, ML, etc., please join the Zeus Labs Discord Server
Note that this post may be updated with a follow-up in the future. Thanks for reading.
– Elinas