Powered by Unsloth

Finetune Llama 3.2
Faster & More Efficiently

A complete kit for LoRA fine-tuning using Unsloth. Experience 2-5x faster training with 70% less memory usage.

Base Model      Llama-3.2-3B-Instruct
Method          LoRA (Unsloth)
Memory Saving   ~70% less
Quantization    4-bit native

Why Unsloth?

Blazing Fast

Optimized kernels deliver 2-5x faster training compared to standard Hugging Face implementations.

Memory Efficient

Cuts VRAM usage by roughly 70%, letting you train larger models on consumer GPUs.

Easy Export

Native support for exporting to GGUF (for llama.cpp/Ollama), 16-bit merged, or 4-bit merged formats.

Quick Start Guide

1. Install Dependencies

pip install -r requirements.txt

2. Prepare Dataset

Create a JSONL file where each line is one training example: a JSON object with a messages list of role/content turns.

data.jsonl
{"messages": [{"role": "user", "content": "Explain quantum computing"}, {"role": "assistant", "content": "Quantum computing uses..."}]}
{"messages": [{"role": "system", "content": "You are a helpful pirate."}, {"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Ahoy matey!"}]}

3. Train

python train.py
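
The training logic lives in the kit's train.py. As a rough sketch of what an equivalent script looks like with Unsloth (the base-model checkpoint and paths here are assumptions, and SFTTrainer's keyword names vary a little between trl versions; this mirrors the style of Unsloth's example notebooks):

from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Load the base model in 4-bit.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",  # assumed base checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; values mirror the LoRA Configuration table below.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
)

# Render each messages list into a single chat-formatted string.
dataset = load_dataset("json", data_files="data.jsonl", split="train")
dataset = dataset.map(lambda ex: {
    "text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)
})

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="./output",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        optim="adamw_8bit",
        num_train_epochs=1,
        logging_steps=10,
    ),
)
trainer.train()

# Save the LoRA adapter and tokenizer for evaluation and export.
model.save_pretrained("./output")
tokenizer.save_pretrained("./output")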

4. Evaluate

python eval.py
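
The kit ships its own eval.py. One simple hand-rolled check is average loss and perplexity on a held-out file (a sketch; "eval.jsonl" is a hypothetical held-out split in the same messages format):

import math

import torch
from datasets import load_dataset
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./output",
    max_seq_length=2048,
    load_in_4bit=True,
)
model.eval()

dataset = load_dataset("json", data_files="eval.jsonl", split="train")

losses = []
with torch.no_grad():
    for ex in dataset:
        text = tokenizer.apply_chat_template(ex["messages"], tokenize=False)
        enc = tokenizer(text, return_tensors="pt",
                        truncation=True, max_length=2048).to("cuda")
        # Using the inputs as labels gives the standard shifted next-token loss.
        out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())

mean = sum(losses) / len(losses)
print(f"mean loss {mean:.3f}  |  perplexity {math.exp(mean):.2f}")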

5. Export

python export.py

LoRA Configuration

config.json settings:

Setting         Value
r (rank)        16
lora_alpha      16
lora_dropout    0
target_modules  q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
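
For reference, a config.json carrying these values could look like the following (a sketch; the exact schema the kit's scripts read may differ):

{
  "r": 16,
  "lora_alpha": 16,
  "lora_dropout": 0,
  "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"]
}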

Training Settings

Setting            Value
Learning Rate      2e-4 (0.0002)
Batch Size         4
Grad Accumulation  4
Max Seq Length     2048
Optimizer          AdamW 8-bit
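
A per-device batch size of 4 with 4 gradient-accumulation steps gives an effective batch size of 16 per optimizer step, and the 8-bit AdamW optimizer keeps optimizer-state memory well below that of standard 32-bit AdamW.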

Using Your Model

Recommended: Unsloth inference
from unsloth import FastLanguageModel

# Load the fine-tuned model from the training output directory.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./output",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's optimized inference mode

# Build the prompt with the model's chat template.
messages = [{"role": "user", "content": "Hello!"}]
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(formatted, return_tensors="pt").to("cuda")

# Generate and decode the reply.
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
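
To watch the reply appear token by token instead of waiting for generate() to return, transformers' TextStreamer can be plugged in (a small variation reusing model, tokenizer, and inputs from the snippet above):

from transformers import TextStreamer

# skip_prompt=True prints only the newly generated tokens, not the prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True)
model.generate(**inputs, streamer=streamer, max_new_tokens=200)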

Troubleshooting & Export

Running Out of Memory (OOM)?
  • Reduce per_device_train_batch_size to 1
  • Increase gradient_accumulation_steps to 8 to keep a reasonable effective batch size
  • Reduce max_seq_length to 1024 (the sketch after this list shows these settings together)
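
The same low-memory settings in TrainingArguments form (a sketch showing only the memory-relevant fields):

from transformers import TrainingArguments

# Effective batch size stays at 1 x 8 = 8 while peak VRAM drops sharply.
args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    optim="adamw_8bit",
)
# Also lower max_seq_length to 1024 in from_pretrained and SFTTrainer.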
Training is slow?
  • Ensure GPU acceleration is active.
  • Verify Unsloth installation.
  • Ensure load_in_4bit=True is set.
Chat output looks wrong?

Models use different chat templates; check the base model's Hugging Face model card for the expected format. Some templates also require add_generation_prompt=True so generation begins a fresh assistant turn.

Export Options

Format         Size      Target
LoRA Adapter   ~50 MB    Base model (adapter applied on top)
GGUF q4_k_m    ~2-4 GB   llama.cpp / Ollama
GGUF q8_0      ~4-8 GB   High quality (llama.cpp / Ollama)
16-bit Merged  ~6-14 GB  Hugging Face
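
In code, each row maps to one of Unsloth's save helpers (a sketch of what an export script can call; save_pretrained_gguf and save_pretrained_merged are Unsloth's documented export methods, and the output directory names here are placeholders):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./output",
    max_seq_length=2048,
    load_in_4bit=True,
)

# LoRA adapter only (~50 MB); must be applied on top of the base model.
model.save_pretrained("lora_adapter")
tokenizer.save_pretrained("lora_adapter")

# GGUF quantizations for llama.cpp / Ollama.
model.save_pretrained_gguf("gguf_q4", tokenizer, quantization_method="q4_k_m")
model.save_pretrained_gguf("gguf_q8", tokenizer, quantization_method="q8_0")

# Full 16-bit merged weights, ready to push to Hugging Face.
model.save_pretrained_merged("merged_16bit", tokenizer, save_method="merged_16bit")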