LLM Fine-Tuning: Training & Conversion Guide

A technical guide explaining the fine-tuning process for domain-specific LLMs, with rationale for each step.

1. Overview

Goal: Fine-tune Qwen2.5-Coder-14B for TagUI (browser automation DSL)

Why Fine-Tuning?: Base models don't know domain-specific syntax. Qwen-Coder is excellent at Python/JS but generates invalid TagUI code. Fine-tuning teaches the specific DSL patterns.

Hardware: NVIDIA DGX Spark (GB10 GPU, 128GB RAM)


2. Framework Choice: Why NeMo?

Options Considered

Framework          | Pros                                       | Cons
HuggingFace + PEFT | Simple API, great docs, large community    | Manual optimization, less GPU efficiency
Axolotl            | Easy configs, good for beginners           | Less control, harder to debug
LLaMA-Factory      | All-in-one, many templates                 | Opinionated, harder customization
NVIDIA NeMo        | Native DGX optimization, production-grade  | Steeper learning curve, NVIDIA-specific

Why We Chose NeMo

  • Hardware match: DGX Spark is NVIDIA hardware, NeMo is NVIDIA-optimized
  • FSDP2 + Triton: Native distributed training, memory-efficient
  • Production path: Same framework used for NVIDIA's own models
  • Container ecosystem: Pre-built nvcr.io containers with all dependencies
  • LoRA support: First-class PEFT integration with NeMo Automodel

Trade-off accepted: Steeper learning curve, but better GPU utilization and production readiness.


3. Training Approach: Why LoRA?

Full Fine-Tuning Problems

  • Catastrophic forgetting: Model loses general knowledge
  • Memory intensive: Requires several times the model's footprint (weights, gradients, optimizer state)
  • Slow iteration: Full model updates are expensive

LoRA Advantages

  • Small adapters: Only trains ~500MB vs 28GB full model
  • Base model frozen: Preserves general capabilities
  • Fast training: Small checkpoints, quick iteration
  • Easy swapping: Can switch adapters for different domains

Configuration Choices

peft:
  dim: 32              # LoRA rank - well-tested default at this model size
  alpha: 64            # 2x rank is standard
  match_all_linear: true   # target all linear layers

precision: bf16-mixed  # NOT fp8 - causes issues on some hardware
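
To make the rank/alpha choice concrete, here is a minimal PyTorch sketch of how a LoRA update modifies a frozen weight matrix (illustrative only, not NeMo internals; the layer dimensions are hypothetical):

import torch

d_out, d_in, rank, alpha = 4096, 4096, 32, 64    # hypothetical layer size; rank/alpha from the config above

W = torch.randn(d_out, d_in)           # frozen base weight - never updated
A = torch.randn(rank, d_in) * 0.01     # trainable low-rank factor
B = torch.zeros(d_out, rank)           # trainable low-rank factor, zero-init so training starts at the base model

# Effective weight at inference: base plus scaled low-rank update.
# Only A and B (rank * (d_in + d_out) parameters per layer) are trained.
W_eff = W + (alpha / rank) * (B @ A)

With dim 32 and alpha 64, the scaling factor alpha/rank is 2, which is the "2x rank" convention noted in the comment above.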

4. Training Data Format

Why JSONL?

NeMo requires JSONL format - one JSON object per line, easy to stream.

Why instruction/input/output Fields?

The fields map directly to a chat-style prompt; NeMo's column_mapping setting maps these columns into the model's prompt template.

Example Format

{"instruction": "Generate a TagUI script for the following task.", "input": "Log into website and download report", "output": "https://example.com/login\nwait 2\ntype #username as myuser\ntype #password as mypass\nclick #login\nwait 3\nclick #download"}

Why answer_only_loss?

Loss is computed only on the output tokens, not the prompt. This prevents the model from learning to repeat the instruction text.

dataset:
  answer_only_loss_mask: true
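
Conceptually, the mask replaces every prompt-token label with the ignore index so that only answer tokens contribute to the loss. A framework-agnostic Python sketch (not NeMo's actual implementation; the token ids are made up):

IGNORE_INDEX = -100                     # PyTorch cross-entropy skips positions with this label

prompt_ids = [101, 202, 303]            # hypothetical token ids for instruction + input
output_ids = [404, 505, 606, 707]       # hypothetical token ids for the TagUI script

input_ids = prompt_ids + output_ids
labels = [IGNORE_INDEX] * len(prompt_ids) + output_ids   # loss computed only on the answer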

5. Training Execution

Why NeMo Container?

Pre-configured environment with:

  • CUDA drivers
  • PyTorch with NVIDIA optimizations
  • NeMo framework
  • All dependencies resolved

Key Hyperparameters

Parameter         | Value | Why
lr                | 2e-5  | Conservative - prevents catastrophic forgetting
epochs            | 2-3   | Prevents overfitting on small datasets
gradient_clip     | 1.0   | Stability - prevents gradient explosion
global_batch_size | 16    | Memory-safe on 128GB GPU
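
For reference, the global batch size is the product of per-device micro-batch, gradient-accumulation steps, and data-parallel ranks; illustrative arithmetic only (the micro-batch and accumulation values below are assumptions, and the exact NeMo config field names may differ):

micro_batch_size = 4      # assumed per-step batch on the single GB10 GPU
grad_accum_steps = 4      # assumed accumulation steps
data_parallel_size = 1    # one GPU in the DGX Spark

global_batch_size = micro_batch_size * grad_accum_steps * data_parallel_size
assert global_batch_size == 16   # matches the table above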

Training Command

# Inside NeMo container
python /opt/Automodel/examples/llm_finetune/finetune.py \
  --config /workspace/configs/nemo/tagui_qwen_14b_lora.yaml

Monitoring

  • Loss curve: Should decrease smoothly, final ~0.2-0.6
  • Grad norm: Should stay <50, spikes indicate instability
  • TensorBoard: tensorboard --logdir=/workspace/logs --port=6006 (the same scalars can also be read programmatically, as sketched below)
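
A small Python sketch for reading those scalars directly from the event files (the scalar tag name train_loss is an assumption; actual tags depend on the NeMo config):

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator("/workspace/logs")   # same logdir passed to tensorboard
acc.Reload()

losses = [e.value for e in acc.Scalars("train_loss")]   # tag name is an assumption
print(f"final loss: {losses[-1]:.3f}")                  # expect roughly 0.2-0.6 at the end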

6. Post-Training Conversions

This is the critical path from training output to production deployment.

6.1 LoRA Merge - WHY?

Problem: LoRA produces adapter weights (~500MB), not a standalone model. The base model + adapters must be loaded separately.

Solution: Merge adapter weights into base model weights.

Result: Single self-contained model with fine-tuned behavior.

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base + adapter (native bf16 keeps memory at ~28GB instead of fp32's ~56GB)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-14B-Instruct", torch_dtype="auto"
)
model = PeftModel.from_pretrained(base_model, "path/to/lora/adapter")

# Merge the low-rank update into the base weights and drop the PEFT wrapper
model = model.merge_and_unload()

# Save merged weights plus tokenizer (the GGUF converter needs the tokenizer files)
model.save_pretrained("path/to/merged")
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-14B-Instruct").save_pretrained("path/to/merged")

Output: ~28GB HuggingFace model directory with merged weights.


6.2 GGUF Conversion - WHY?

Problem: HuggingFace format is not optimized for inference:

  • Multiple safetensor shards
  • Requires full PyTorch/transformers stack
  • Not portable to other runtimes

What is GGUF?: Binary format designed for llama.cpp and Ollama:

  • Single file containing all weights
  • Optimized memory layout
  • Supports CPU, GPU, and hybrid inference
  • Built-in quantization support

Benefits:

  • Portability: Single file, works with Ollama/llama.cpp
  • Efficiency: Optimized memory access patterns
  • Flexibility: Can run on CPU or GPU

Conversion Command:

cd llama.cpp
python convert_hf_to_gguf.py \
  /path/to/merged-model \
  --outfile /path/to/model.gguf \
  --outtype bf16
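
To sanity-check the converted file outside Ollama, it can be loaded directly with the llama-cpp-python bindings; a minimal sketch, assuming the package is installed and built with GPU support:

from llama_cpp import Llama     # pip install llama-cpp-python

llm = Llama(model_path="/path/to/model.gguf", n_ctx=4096, n_gpu_layers=-1)
out = llm.create_completion(
    "Generate a TagUI script for the following task.\nLog into website and download report\n",
    max_tokens=128,
)
print(out["choices"][0]["text"])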

bf16 vs Quantized:

Format | Size | Quality   | Speed
bf16   | 28GB | Best      | GPU only
q8_0   | 15GB | Very good | Faster
q4_k_m | 8GB  | Good      | Fastest

For domain-specific work, bf16 preserves maximum accuracy.


6.3 Ollama Model - WHY?

Problem: GGUF file alone has no:

  • Chat template (how to format prompts)
  • System prompt (domain-specific instructions)
  • Generation parameters (temperature, stop tokens)

Solution: Modelfile wraps GGUF with configuration.

Modelfile Example:

FROM ./model-bf16.gguf

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER stop <|im_end|>
PARAMETER temperature 0.7

SYSTEM """You are a TagUI automation expert. Generate valid TagUI scripts.

CRITICAL RULES:
1. URLs are written directly (NO 'navigate' command)
2. No try/catch - use 'if exist()' for error handling
3. count() must be pre-assigned before loops
"""

Create Ollama Model:

ollama create my-model -f Modelfile
ollama run my-model "Write a script to..."

Result: Production-ready API endpoint at localhost:11434.
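
The endpoint can then be called from any HTTP client; a minimal sketch using Python's requests library (model name my-model as created above):

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "my-model",      # name passed to `ollama create`
        "prompt": "Write a TagUI script that logs into a site and downloads a report.",
        "stream": False,          # return a single JSON object instead of a token stream
    },
    timeout=300,
)
print(resp.json()["response"])    # the generated TagUI script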


7. The Conversion Pipeline

Training → LoRA Adapters → Merge → HF Model → GGUF → Ollama
   |           |              |         |         |        |
  NeMo     ~500MB        28GB+500MB   28GB      28GB   API ready

Why this pipeline?

  1. LoRA: Efficient training, small checkpoints
  2. Merge: Self-contained model, no adapter loading
  3. GGUF: Portable format, single file
  4. Ollama: Production serving with API

8. Key Commands (Copy-Paste Ready)

Training

# Start NeMo container
docker run --gpus all -it \
  -v $HOME/project:/workspace \
  nvcr.io/nvidia/nemo:25.09 bash

# Run training
python /opt/Automodel/examples/llm_finetune/finetune.py \
  --config /workspace/configs/nemo/finetune_lora.yaml

LoRA Merge

# Inside container with transformers + peft
python scripts/deployment/merge_lora.py

GGUF Conversion

cd llama.cpp
python convert_hf_to_gguf.py \
  /workspace/checkpoints/merged-model \
  --outfile /workspace/models/model-bf16.gguf \
  --outtype bf16

Ollama Creation

cd models
ollama create my-model -f Modelfile
ollama run my-model "test prompt"

9. Resources
