A complete kit for LoRA fine-tuning with Unsloth: 2-5x faster training with 70% less memory usage.

- Optimized kernels deliver 2-5x faster training than standard Hugging Face implementations.
- Cuts VRAM usage by 70%, letting you fine-tune larger models on consumer GPUs.
- Native export to GGUF (for llama.cpp/Ollama), 16-bit merged, or 4-bit merged formats.
```bash
pip install -r requirements.txt
```
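After installing, it is worth a quick check that PyTorch can see your GPU, since both training and the 4-bit loading path below assume a CUDA device (a minimal sanity check, not part of the kit's scripts):

```python
import torch

# Both should succeed on a working CUDA install.
print(torch.cuda.is_available())         # expected: True
print(torch.cuda.get_device_name(0))     # e.g. your GPU's name
```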
Create a JSONL file where each line is a single JSON object with a `messages` array in chat format: an optional `system` message, followed by alternating `user`/`assistant` turns.
{"messages": [{"role": "user", "content": "Explain quantum computing"}, {"role": "assistant", "content": "Quantum computing uses..."}]}
{"messages": [{"role": "system", "content": "You are a helpful pirate."}, {"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Ahoy matey!"}]}
```bash
python train.py    # fine-tune the LoRA adapter
python eval.py     # evaluate the trained adapter
python export.py   # export to GGUF / merged formats
```
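Under the hood, `train.py` follows the standard Unsloth workflow. A condensed sketch of that setup (the base model name is a placeholder, and the actual script may differ in details):

```python
from unsloth import FastLanguageModel

# Load a base model in 4-bit (placeholder model name).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                      # LoRA rank
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)
```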
| Setting | Value |
|---|---|
| Learning Rate | 0.0002 |
| Batch Size | 4 |
| Grad Accumulation | 4 |
| Max Seq Length | 2048 |
| Optimizer | AdamW 8-bit |
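These defaults correspond to a standard TRL `SFTTrainer` configuration. A rough sketch of how they map, assuming the older TRL API used in Unsloth's examples (newer TRL versions move these arguments into `SFTConfig`, and depending on version you may need to render the `messages` column into text with the chat template first; `model`, `tokenizer`, and `dataset` come from the steps above):

```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="./output",
        learning_rate=2e-4,                # 0.0002
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,     # effective batch size 16
        optim="adamw_8bit",                # 8-bit AdamW via bitsandbytes
        num_train_epochs=1,                # illustrative; tune for your data
        logging_steps=10,
    ),
)
trainer.train()
```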
```python
from unsloth import FastLanguageModel

# Load the fine-tuned model from the training output directory.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./output",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's fast inference path

# Build the prompt with the model's chat template.
messages = [{"role": "user", "content": "Hello!"}]
formatted = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(formatted, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
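For interactive use you may prefer token-by-token output. `transformers` ships a `TextStreamer` that plugs into the same `generate` call (a small sketch reusing `model`, `tokenizer`, and `inputs` from above):

```python
from transformers import TextStreamer

# Prints tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=200)
```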
If training runs out of VRAM:

- Reduce `per_device_train_batch_size` to 1
- Increase `gradient_accumulation_steps` to 8 to compensate for the smaller batch
- Reduce `max_seq_length` to 1024
- Make sure `load_in_4bit=True` is set

If generations look wrong, note that models use different chat templates. Check the base model's Hugging Face card; some templates require `add_generation_prompt=True`.
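A quick way to see which template your tokenizer will apply, using the standard `transformers` tokenizer attribute:

```python
# Inspect the Jinja chat template baked into the tokenizer.
print(tokenizer.chat_template)

# Render a sample conversation to verify the formatting looks right.
sample = [{"role": "user", "content": "Hi"}]
print(tokenizer.apply_chat_template(sample, tokenize=False, add_generation_prompt=True))
```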
| Format | Approx. Size | Use With |
|---|---|---|
| LoRA Adapter | ~50 MB | Base model (applied at load time) |
| GGUF q4_k_m | ~2-4 GB | llama.cpp / Ollama |
| GGUF q8_0 | ~4-8 GB | llama.cpp / Ollama (higher quality) |
| 16-bit Merged | ~6-14 GB | Hugging Face Transformers |
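`export.py` drives these conversions. For reference, the Unsloth save helpers it builds on look like this (a sketch of the standard Unsloth API, not necessarily the exact options `export.py` exposes; output directory names are placeholders):

```python
# GGUF for llama.cpp / Ollama (pick a quantization method, e.g. q4_k_m or q8_0).
model.save_pretrained_gguf("gguf_out", tokenizer, quantization_method="q4_k_m")

# Full 16-bit merge for use with Hugging Face Transformers.
model.save_pretrained_merged("merged_out", tokenizer, save_method="merged_16bit")

# Adapter-only save (~50 MB) to apply on top of the base model.
model.save_pretrained_merged("lora_out", tokenizer, save_method="lora")
```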