A technical guide explaining the fine-tuning process for domain-specific LLMs, with rationale for each step.
Goal: Fine-tune Qwen2.5-Coder-14B for TagUI (browser automation DSL)
Why Fine-Tuning?: Base models don't know domain-specific syntax. Qwen-Coder is excellent at Python/JS but generates invalid TagUI code. Fine-tuning teaches the specific DSL patterns.
Hardware: NVIDIA DGX Spark (GB10 Grace Blackwell Superchip, 128GB unified memory)
| Framework | Pros | Cons |
|---|---|---|
| HuggingFace + PEFT | Simple API, great docs, large community | Manual optimization, lower GPU efficiency |
| Axolotl | Easy configs, good for beginners | Less control, harder to debug |
| LLaMA-Factory | All-in-one, many templates | Opinionated, harder customization |
| NVIDIA NeMo | Native DGX optimization, production-grade | Steeper learning curve, NVIDIA-specific |
Why NeMo?:
- Hardware match: DGX Spark is NVIDIA hardware, NeMo is NVIDIA-optimized
- FSDP2 + Triton: Native distributed training, memory-efficient
- Production path: Same framework used for NVIDIA's own models
- Container ecosystem: Pre-built nvcr.io containers with all dependencies
- LoRA support: First-class PEFT integration with NeMo Automodel
Trade-off accepted: Steeper learning curve, but better GPU utilization and production readiness.
Why not full fine-tuning?:
- Catastrophic forgetting: Model loses general knowledge
- Memory intensive: Weights, gradients, and optimizer states add up to several times the model size in GPU memory
- Slow iteration: Full model updates are expensive
Why LoRA instead?:
- Small adapters: Only trains ~500MB of adapter weights vs the 28GB full model
- Base model frozen: Preserves general capabilities
- Fast training: Small checkpoints, quick iteration
- Easy swapping: Can switch adapters for different domains
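The parameter savings are easy to see outside of NeMo. Below is a minimal sketch using HuggingFace PEFT (not the NeMo Automodel path this guide trains with; the rank and alpha simply mirror the config that follows) that attaches an adapter and reports how little of the model actually trains:

```python
# Minimal PEFT sketch: attach a rank-32 LoRA adapter and count trainable weights.
# Illustrative only - training in this guide runs through NeMo Automodel, not raw PEFT.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-14B-Instruct",
    torch_dtype="auto",          # load in the checkpoint's native precision (bf16)
)

lora = LoraConfig(
    r=32,                        # same rank as the NeMo config (dim: 32)
    lora_alpha=64,               # alpha = 2x rank
    target_modules="all-linear", # mirrors match_all_linear: true
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # roughly 1% of parameters train; the base stays frozen
```

In the NeMo Automodel config, the same choices are expressed declaratively: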
```yaml
peft:
  dim: 32                  # LoRA rank - good capacity/size trade-off
  alpha: 64                # 2x rank is standard
  match_all_linear: true   # target all linear layers

precision: bf16-mixed      # NOT fp8 - causes issues on some hardware
```

NeMo requires JSONL format - one JSON object per line, easy to stream.
Each example maps directly to chat format; NeMo's column_mapping converts the fields to prompts.
{"instruction": "Generate a TagUI script for the following task.", "input": "Log into website and download report", "output": "https://example.com/login\nwait 2\ntype #username as myuser\ntype #password as mypass\nclick #login\nwait 3\nclick #download"}Only compute loss on outputs, not prompts. Prevents model from learning to repeat instructions.
Only compute loss on outputs, not prompts - this prevents the model from learning to repeat instructions:

```yaml
dataset:
  answer_only_loss_mask: true
```

The NeMo container ships a pre-configured environment with:
- CUDA drivers
- PyTorch with NVIDIA optimizations
- NeMo framework
- All dependencies resolved
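A quick sanity check inside the container, before launching a run, confirms the GPU stack is visible (illustrative only; it checks the PyTorch/CUDA stack, nothing NeMo-specific):

```python
# Quick environment check inside the NeMo container: confirm the GPU stack is visible.
import torch

print("PyTorch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```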
Training hyperparameters:

| Parameter | Value | Why |
|---|---|---|
| `lr` | 2e-5 | Conservative - prevents catastrophic forgetting |
| `epochs` | 2-3 | Prevents overfitting on small datasets |
| `gradient_clip` | 1.0 | Stability - prevents gradient explosion |
| `global_batch_size` | 16 | Memory-safe within the 128GB of unified memory |
```bash
# Inside NeMo container
python /opt/Automodel/examples/llm_finetune/finetune.py \
  --config /workspace/configs/nemo/tagui_qwen_14b_lora.yaml
```

What to monitor:
- Loss curve: Should decrease smoothly, final ~0.2-0.6
- Grad norm: Should stay <50; spikes indicate instability
- TensorBoard: `tensorboard --logdir=/workspace/logs --port=6006`
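To pull these curves programmatically instead of through the TensorBoard UI, TensorBoard's event reader can parse the log directory; the scalar tag names below are assumptions, so check them against what your NeMo run actually logs:

```python
# Read training scalars straight from the TensorBoard event files.
# "train_loss" and "grad_norm" are assumed tag names - substitute what Tags() reports.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator("/workspace/logs")  # point at the run's log directory
acc.Reload()

print(acc.Tags()["scalars"])  # discover which scalar tags this run actually logged

for tag in ("train_loss", "grad_norm"):
    if tag in acc.Tags()["scalars"]:
        events = acc.Scalars(tag)
        print(tag, "last value:", events[-1].value, "at step", events[-1].step)
```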
This is the critical path from training output to production deployment.
Problem: LoRA produces adapter weights (~500MB), not a standalone model. The base model + adapters must be loaded separately.
Solution: Merge adapter weights into base model weights.
Result: Single self-contained model with fine-tuned behavior.
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base + adapter
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-14B-Instruct")
model = PeftModel.from_pretrained(base_model, "path/to/lora/adapter")

# Merge adapter weights into the base weights and drop the PEFT wrapper
model = model.merge_and_unload()

# Save merged model (plus tokenizer - the GGUF converter expects it alongside the weights)
model.save_pretrained("path/to/merged")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-14B-Instruct")
tokenizer.save_pretrained("path/to/merged")
```

Output: ~28GB HuggingFace model directory with merged weights.
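Before converting, a quick smoke test confirms the merged checkpoint still generates TagUI-style output. A minimal sketch with the transformers text-generation pipeline (path and prompt are placeholders; production prompting goes through the chat template via Ollama later):

```python
# Quick sanity check of the merged model - path and prompt are placeholders.
from transformers import pipeline

generate = pipeline(
    "text-generation",
    model="path/to/merged",
    torch_dtype="auto",     # keep the merged bf16 weights
    device_map="auto",      # requires accelerate; loads onto the available GPU
)

prompt = (
    "Generate a TagUI script for the following task.\n"
    "Log into website and download report\n"
)
result = generate(prompt, max_new_tokens=128, do_sample=False)
print(result[0]["generated_text"])
```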
Problem: HuggingFace format is not optimized for inference:
- Multiple safetensor shards
- Requires full PyTorch/transformers stack
- Not portable to other runtimes
What is GGUF?: Binary format designed for llama.cpp and Ollama:
- Single file containing all weights
- Optimized memory layout
- Supports CPU, GPU, and hybrid inference
- Built-in quantization support
Benefits:
- Portability: Single file, works with Ollama/llama.cpp
- Efficiency: Optimized memory access patterns
- Flexibility: Can run on CPU or GPU
Conversion Command:
```bash
cd llama.cpp
python convert_hf_to_gguf.py \
  /path/to/merged-model \
  --outfile /path/to/model.gguf \
  --outtype bf16
```

bf16 vs Quantized:
| Format | Size | Quality | Speed / Hardware |
|---|---|---|---|
| bf16 | 28GB | Best | GPU only |
| q8_0 | 15GB | Very good | Faster |
| q4_k_m | 8GB | Good | Fastest |
For domain-specific work, bf16 preserves maximum accuracy.
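The GGUF file can also be exercised directly, without Ollama, through the optional llama-cpp-python bindings; a minimal sketch (the package is a separate install, and the path and settings are illustrative):

```python
# Load the converted GGUF directly with llama-cpp-python (pip install llama-cpp-python).
# Path and generation settings are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/model.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU; set 0 for CPU-only inference
)

out = llm("Generate a TagUI script that logs into a website.", max_tokens=128)
print(out["choices"][0]["text"])
```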
Problem: GGUF file alone has no:
- Chat template (how to format prompts)
- System prompt (domain-specific instructions)
- Generation parameters (temperature, stop tokens)
Solution: Modelfile wraps GGUF with configuration.
Modelfile Example:
```
FROM ./model-bf16.gguf

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER stop "<|im_end|>"
PARAMETER temperature 0.7

SYSTEM """You are a TagUI automation expert. Generate valid TagUI scripts.
CRITICAL RULES:
1. URLs are written directly (NO 'navigate' command)
2. No try/catch - use 'if exist()' for error handling
3. count() must be pre-assigned before loops
"""
```

Create Ollama Model:
```bash
ollama create my-model -f Modelfile
ollama run my-model "Write a script to..."
```

Result: Production-ready API endpoint at localhost:11434.
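The same endpoint can be called from code. A minimal sketch against Ollama's /api/generate REST API (the prompt is a placeholder; "my-model" matches the name used in `ollama create`):

```python
# Call the local Ollama REST API (default port 11434) with the fine-tuned model.
import json
import urllib.request

payload = {
    "model": "my-model",
    "prompt": "Log into example.com and download the monthly report",
    "stream": False,  # return one JSON response instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```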
Full pipeline:

```
Training → LoRA Adapters → Merge → HF Model → GGUF → Ollama
   |             |           |        |        |       |
 NeMo         ~500MB     28GB+500MB  28GB     28GB  API ready
```
Why this pipeline?
- LoRA: Efficient training, small checkpoints
- Merge: Self-contained model, no adapter loading
- GGUF: Portable format, single file
- Ollama: Production serving with API
```bash
# Start NeMo container
docker run --gpus all -it \
  -v $HOME/project:/workspace \
  nvcr.io/nvidia/nemo:25.09 bash

# Run training
python /opt/Automodel/examples/llm_finetune/finetune.py \
  --config /workspace/configs/nemo/finetune_lora.yaml
```

```bash
# Merge LoRA adapters (inside container with transformers + peft)
python scripts/deployment/merge_lora.py
```

```bash
# Convert merged model to GGUF
cd llama.cpp
python convert_hf_to_gguf.py \
  /workspace/checkpoints/merged-model \
  --outfile /workspace/models/model-bf16.gguf \
  --outtype bf16
```

```bash
# Create and test the Ollama model
cd models
ollama create my-model -f Modelfile
ollama run my-model "test prompt"
```

Resources:
- Model: raoulbia/tagui-qwen-gguf
- TagUI: github.com/aisingapore/TagUI
- NeMo: docs.nvidia.com/nemo-framework
- llama.cpp: github.com/ggml-org/llama.cpp