A complete kit for LoRA fine-tuning with Unsloth: 2-5x faster training with 70% less memory usage.

- Optimized kernels deliver 2-5x faster training than standard Hugging Face implementations.
- Cuts VRAM usage by 70%, letting you fine-tune larger models on consumer GPUs.
- Native export to GGUF (for llama.cpp/Ollama), 16-bit merged, or 4-bit merged formats.
```bash
pip install -r requirements.txt
```
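After installing, it is worth a quick check that PyTorch can see your GPU, since both training and the 4-bit loading path below assume a CUDA device (a minimal sanity check, not part of the kit's scripts):

```python
import torch

# Both should succeed on a working CUDA install.
print(torch.cuda.is_available())         # expected: True
print(torch.cuda.get_device_name(0))     # e.g. your GPU's name
```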
Create a JSONL file where each line is a single JSON object with a `messages` array in chat format: an optional `system` message, followed by alternating `user`/`assistant` turns.
{"messages": [{"role": "user", "content": "Explain quantum computing"}, {"role": "assistant", "content": "Quantum computing uses..."}]}
{"messages": [{"role": "system", "content": "You are a helpful pirate."}, {"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Ahoy matey!"}]}
```bash
python train.py    # fine-tune the LoRA adapter
python eval.py     # evaluate the trained adapter
python export.py   # export to GGUF / merged formats
```
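Under the hood, `train.py` follows the standard Unsloth workflow. A condensed sketch of that setup (the base model name is a placeholder, and the actual script may differ in details):

```python
from unsloth import FastLanguageModel

# Load a base model in 4-bit (placeholder model name).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                      # LoRA rank
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)
```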
| Setting | Value |
|---|---|
| Learning Rate | 0.0002 |
| Batch Size | 4 |
| Grad Accumulation | 4 |
| Max Seq Length | 2048 |
| Optimizer | AdamW 8-bit |
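These defaults correspond to a standard TRL `SFTTrainer` configuration. A rough sketch of how they map, assuming the older TRL API used in Unsloth's examples (newer TRL versions move these arguments into `SFTConfig`, and depending on version you may need to render the `messages` column into text with the chat template first; `model`, `tokenizer`, and `dataset` come from the steps above):

```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="./output",
        learning_rate=2e-4,                # 0.0002
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,     # effective batch size 16
        optim="adamw_8bit",                # 8-bit AdamW via bitsandbytes
        num_train_epochs=1,                # illustrative; tune for your data
        logging_steps=10,
    ),
)
trainer.train()
```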
```python
from unsloth import FastLanguageModel

# Load the fine-tuned model from the training output directory.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./output",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's fast inference path

# Build the prompt with the model's chat template.
messages = [{"role": "user", "content": "Hello!"}]
formatted = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(formatted, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
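For interactive use you may prefer token-by-token output. `transformers` ships a `TextStreamer` that plugs into the same `generate` call (a small sketch reusing `model`, `tokenizer`, and `inputs` from above):

```python
from transformers import TextStreamer

# Prints tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=200)
```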
If training runs out of VRAM:

- Reduce `per_device_train_batch_size` to 1
- Increase `gradient_accumulation_steps` to 8 to compensate for the smaller batch
- Reduce `max_seq_length` to 1024
- Make sure `load_in_4bit=True` is set

If generations look wrong, note that models use different chat templates. Check the base model's Hugging Face card; some templates require `add_generation_prompt=True`.
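A quick way to see which template your tokenizer will apply, using the standard `transformers` tokenizer attribute:

```python
# Inspect the Jinja chat template baked into the tokenizer.
print(tokenizer.chat_template)

# Render a sample conversation to verify the formatting looks right.
sample = [{"role": "user", "content": "Hi"}]
print(tokenizer.apply_chat_template(sample, tokenize=False, add_generation_prompt=True))
```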
| Format | Approx. Size | Use With |
|---|---|---|
| LoRA Adapter | ~50 MB | Base model (applied at load time) |
| GGUF q4_k_m | ~2-4 GB | llama.cpp / Ollama |
| GGUF q8_0 | ~4-8 GB | llama.cpp / Ollama (higher quality) |
| 16-bit Merged | ~6-14 GB | Hugging Face Transformers |
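`export.py` drives these conversions. For reference, the Unsloth save helpers it builds on look like this (a sketch of the standard Unsloth API, not necessarily the exact options `export.py` exposes; output directory names are placeholders):

```python
# GGUF for llama.cpp / Ollama (pick a quantization method, e.g. q4_k_m or q8_0).
model.save_pretrained_gguf("gguf_out", tokenizer, quantization_method="q4_k_m")

# Full 16-bit merge for use with Hugging Face Transformers.
model.save_pretrained_merged("merged_out", tokenizer, save_method="merged_16bit")

# Adapter-only save (~50 MB) to apply on top of the base model.
model.save_pretrained_merged("lora_out", tokenizer, save_method="lora")
```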