A technical guide explaining the fine-tuning process for domain-specific LLMs, with rationale for each step.
Goal: Fine-tune Qwen2.5-Coder-14B for TagUI (browser automation DSL)
Why Fine-Tuning?: Base models don't know domain-specific syntax. Qwen-Coder is excellent at Python/JS but generates invalid TagUI code. Fine-tuning teaches the specific DSL patterns.
Hardware: NVIDIA DGX Spark (GB10 Grace Blackwell Superchip, 128GB unified memory)
| Framework | Pros | Cons |
|---|---|---|
| HuggingFace + PEFT | Simple API, great docs, large community | Manual optimization, lower GPU efficiency |
| Axolotl | Easy configs, good for beginners | Less control, harder to debug |
| LLaMA-Factory | All-in-one, many templates | Opinionated, harder customization |
| NVIDIA NeMo | Native DGX optimization, production-grade | Steeper learning curve, NVIDIA-specific |
Why NeMo?:
- Hardware match: DGX Spark is NVIDIA hardware, NeMo is NVIDIA-optimized
- FSDP2 + Triton: Native distributed training, memory-efficient
- Production path: Same framework used for NVIDIA's own models
- Container ecosystem: Pre-built nvcr.io containers with all dependencies
- LoRA support: First-class PEFT integration with NeMo Automodel
Trade-off accepted: Steeper learning curve, but better GPU utilization and production readiness.
Why not full fine-tuning?:
- Catastrophic forgetting: Model loses general knowledge
- Memory intensive: Weights, gradients, and optimizer states add up to several times the model size in GPU memory
- Slow iteration: Full model updates are expensive
Why LoRA instead?:
- Small adapters: Only trains ~500MB of adapter weights vs the 28GB full model
- Base model frozen: Preserves general capabilities
- Fast training: Small checkpoints, quick iteration
- Easy swapping: Can switch adapters for different domains
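The parameter savings are easy to see outside of NeMo. Below is a minimal sketch using HuggingFace PEFT (not the NeMo Automodel path this guide trains with; the rank and alpha simply mirror the config that follows) that attaches an adapter and reports how little of the model actually trains:

```python
# Minimal PEFT sketch: attach a rank-32 LoRA adapter and count trainable weights.
# Illustrative only - training in this guide runs through NeMo Automodel, not raw PEFT.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-14B-Instruct",
    torch_dtype="auto",          # load in the checkpoint's native precision (bf16)
)

lora = LoraConfig(
    r=32,                        # same rank as the NeMo config (dim: 32)
    lora_alpha=64,               # alpha = 2x rank
    target_modules="all-linear", # mirrors match_all_linear: true
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # roughly 1% of parameters train; the base stays frozen
```

In the NeMo Automodel config, the same choices are expressed declaratively: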
```yaml
peft:
  dim: 32                  # LoRA rank - good capacity/size trade-off
  alpha: 64                # 2x rank is standard
  match_all_linear: true   # target all linear layers

precision: bf16-mixed      # NOT fp8 - causes issues on some hardware
```

NeMo requires JSONL format - one JSON object per line, easy to stream.
Each example maps directly to chat format; NeMo's column_mapping converts the fields to prompts.
{"instruction": "Generate a TagUI script for the following task.", "input": "Log into website and download report", "output": "https://example.com/login\nwait 2\ntype #username as myuser\ntype #password as mypass\nclick #login\nwait 3\nclick #download"}Only compute loss on outputs, not prompts. Prevents model from learning to repeat instructions.
Only compute loss on outputs, not prompts - this prevents the model from learning to repeat instructions:

```yaml
dataset:
  answer_only_loss_mask: true
```

The NeMo container ships a pre-configured environment with:
- CUDA drivers
- PyTorch with NVIDIA optimizations
- NeMo framework
- All dependencies resolved
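A quick sanity check inside the container, before launching a run, confirms the GPU stack is visible (illustrative only; it checks the PyTorch/CUDA stack, nothing NeMo-specific):

```python
# Quick environment check inside the NeMo container: confirm the GPU stack is visible.
import torch

print("PyTorch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```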
Training hyperparameters:

| Parameter | Value | Why |
|---|---|---|
| `lr` | 2e-5 | Conservative - prevents catastrophic forgetting |
| `epochs` | 2-3 | Prevents overfitting on small datasets |
| `gradient_clip` | 1.0 | Stability - prevents gradient explosion |
| `global_batch_size` | 16 | Memory-safe within the 128GB of unified memory |
```bash
# Inside NeMo container
python /opt/Automodel/examples/llm_finetune/finetune.py \
  --config /workspace/configs/nemo/tagui_qwen_14b_lora.yaml
```

What to monitor:
- Loss curve: Should decrease smoothly, final ~0.2-0.6
- Grad norm: Should stay <50; spikes indicate instability
- TensorBoard: `tensorboard --logdir=/workspace/logs --port=6006`
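To pull these curves programmatically instead of through the TensorBoard UI, TensorBoard's event reader can parse the log directory; the scalar tag names below are assumptions, so check them against what your NeMo run actually logs:

```python
# Read training scalars straight from the TensorBoard event files.
# "train_loss" and "grad_norm" are assumed tag names - substitute what Tags() reports.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator("/workspace/logs")  # point at the run's log directory
acc.Reload()

print(acc.Tags()["scalars"])  # discover which scalar tags this run actually logged

for tag in ("train_loss", "grad_norm"):
    if tag in acc.Tags()["scalars"]:
        events = acc.Scalars(tag)
        print(tag, "last value:", events[-1].value, "at step", events[-1].step)
```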
This is the critical path from training output to production deployment.
Problem: LoRA produces adapter weights (~500MB), not a standalone model. The base model + adapters must be loaded separately.
Solution: Merge adapter weights into base model weights.
Result: Single self-contained model with fine-tuned behavior.
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base + adapter
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-14B-Instruct")
model = PeftModel.from_pretrained(base_model, "path/to/lora/adapter")

# Merge adapter weights into the base weights and drop the PEFT wrapper
model = model.merge_and_unload()

# Save merged model (plus tokenizer - the GGUF converter expects it alongside the weights)
model.save_pretrained("path/to/merged")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-14B-Instruct")
tokenizer.save_pretrained("path/to/merged")
```

Output: ~28GB HuggingFace model directory with merged weights.
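Before converting, a quick smoke test confirms the merged checkpoint still generates TagUI-style output. A minimal sketch with the transformers text-generation pipeline (path and prompt are placeholders; production prompting goes through the chat template via Ollama later):

```python
# Quick sanity check of the merged model - path and prompt are placeholders.
from transformers import pipeline

generate = pipeline(
    "text-generation",
    model="path/to/merged",
    torch_dtype="auto",     # keep the merged bf16 weights
    device_map="auto",      # requires accelerate; loads onto the available GPU
)

prompt = (
    "Generate a TagUI script for the following task.\n"
    "Log into website and download report\n"
)
result = generate(prompt, max_new_tokens=128, do_sample=False)
print(result[0]["generated_text"])
```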
Problem: HuggingFace format is not optimized for inference:
- Multiple safetensor shards
- Requires full PyTorch/transformers stack
- Not portable to other runtimes
What is GGUF?: Binary format designed for llama.cpp and Ollama:
- Single file containing all weights
- Optimized memory layout
- Supports CPU, GPU, and hybrid inference
- Built-in quantization support
Benefits:
- Portability: Single file, works with Ollama/llama.cpp
- Efficiency: Optimized memory access patterns
- Flexibility: Can run on CPU or GPU
Conversion Command:
```bash
cd llama.cpp
python convert_hf_to_gguf.py \
  /path/to/merged-model \
  --outfile /path/to/model.gguf \
  --outtype bf16
```

bf16 vs Quantized:
| Format | Size | Quality | Speed / Hardware |
|---|---|---|---|
| bf16 | 28GB | Best | GPU only |
| q8_0 | 15GB | Very good | Faster |
| q4_k_m | 8GB | Good | Fastest |
For domain-specific work, bf16 preserves maximum accuracy.
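The GGUF file can also be exercised directly, without Ollama, through the optional llama-cpp-python bindings; a minimal sketch (the package is a separate install, and the path and settings are illustrative):

```python
# Load the converted GGUF directly with llama-cpp-python (pip install llama-cpp-python).
# Path and generation settings are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/model.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU; set 0 for CPU-only inference
)

out = llm("Generate a TagUI script that logs into a website.", max_tokens=128)
print(out["choices"][0]["text"])
```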
Problem: GGUF file alone has no:
- Chat template (how to format prompts)
- System prompt (domain-specific instructions)
- Generation parameters (temperature, stop tokens)
Solution: Modelfile wraps GGUF with configuration.
Modelfile Example:
```
FROM ./model-bf16.gguf

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER stop "<|im_end|>"
PARAMETER temperature 0.7

SYSTEM """You are a TagUI automation expert. Generate valid TagUI scripts.
CRITICAL RULES:
1. URLs are written directly (NO 'navigate' command)
2. No try/catch - use 'if exist()' for error handling
3. count() must be pre-assigned before loops
"""
```

Create Ollama Model:
```bash
ollama create my-model -f Modelfile
ollama run my-model "Write a script to..."
```

Result: Production-ready API endpoint at localhost:11434.
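The same endpoint can be called from code. A minimal sketch against Ollama's /api/generate REST API (the prompt is a placeholder; "my-model" matches the name used in `ollama create`):

```python
# Call the local Ollama REST API (default port 11434) with the fine-tuned model.
import json
import urllib.request

payload = {
    "model": "my-model",
    "prompt": "Log into example.com and download the monthly report",
    "stream": False,  # return one JSON response instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```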
Full pipeline:

```
Training → LoRA Adapters → Merge → HF Model → GGUF → Ollama
   |             |           |        |        |       |
 NeMo         ~500MB     28GB+500MB  28GB     28GB  API ready
```
Why this pipeline?
- LoRA: Efficient training, small checkpoints
- Merge: Self-contained model, no adapter loading
- GGUF: Portable format, single file
- Ollama: Production serving with API
```bash
# Start NeMo container
docker run --gpus all -it \
  -v $HOME/project:/workspace \
  nvcr.io/nvidia/nemo:25.09 bash

# Run training
python /opt/Automodel/examples/llm_finetune/finetune.py \
  --config /workspace/configs/nemo/finetune_lora.yaml
```

```bash
# Merge LoRA adapters (inside container with transformers + peft)
python scripts/deployment/merge_lora.py
```

```bash
# Convert merged model to GGUF
cd llama.cpp
python convert_hf_to_gguf.py \
  /workspace/checkpoints/merged-model \
  --outfile /workspace/models/model-bf16.gguf \
  --outtype bf16
```

```bash
# Create and test the Ollama model
cd models
ollama create my-model -f Modelfile
ollama run my-model "test prompt"
```

Resources:
- Model: raoulbia/tagui-qwen-gguf
- TagUI: github.com/aisingapore/TagUI
- NeMo: docs.nvidia.com/nemo-framework
- llama.cpp: github.com/ggml-org/llama.cpp