This guide sets up hardware-accelerated AI on the Radxa Dragon Q6A. We will run Llama 3.2 (an LLM) on the NPU and Whisper (speech recognition) on the CPU to create a fully voice-interactive system.
Hardware: Radxa Dragon Q6A (QCS6490)
OS: Ubuntu 24.04 Noble (T7 Image or newer)
Status: ✅ Verified Working (Jan 2026)
Run these commands once to install drivers and set permissions.
sudo apt update
sudo apt install -y fastrpc fastrpc-dev libcdsprpc1 radxa-firmware-qcs6490 \
python3-pip python3.12-venv libportaudio2 ffmpeg git alsa-utils

Next, add a udev rule that opens up the FastRPC and DMA heap device nodes. This ensures you don't get "Permission Denied" errors after rebooting.
sudo tee /etc/udev/rules.d/99-fastrpc.rules << 'EOF'
KERNEL=="fastrpc-*", MODE="0666"
SUBSYSTEM=="dma_heap", KERNEL=="system", MODE="0666"
EOF
# Apply immediately
sudo udevadm control --reload-rules
sudo udevadm trigger
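To confirm the rules took effect, check the device nodes (names can vary between kernel builds, so adjust the glob if needed):

# Both should report mode 0666 (crw-rw-rw-)
ls -l /dev/fastrpc-*
ls -l /dev/dma_heap/system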
We use a virtual environment to prevent dependency conflicts with system packages.

# Create and activate
python3 -m venv ~/qai-venv
source ~/qai-venv/bin/activate
# Upgrade pip
pip install --upgrade pip
# Install AI tools (Whisper via Transformers, Audio libraries)
pip install "transformers[torch]" librosa soundfile sounddevice accelerateNote: We use HuggingFace Transformers for Whisper instead of qai_hub_models because the QCS6490 NPU requires quantized models, and the qai_hub_models Whisper variants don't support quantization for this device.
We use the 4096-token-context model for better conversation memory. (Note: the download requires ~2 GB of disk space.)
# Ensure you are NOT in the venv for this part (using system tools for binary download)
deactivate 2>/dev/null
# Install downloader
pip3 install modelscope --break-system-packages
# Download
mkdir -p ~/llama-4k && cd ~/llama-4k
modelscope download --model radxa/Llama3.2-1B-4096-qairt-v68 --local_dir .
# Make the runner executable
chmod +x genie-t2t-run
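Confirm the runner binary arrived and is executable (the file name comes from this ModelScope package and may change in future releases):

ls -l ~/llama-4k/genie-t2t-run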
Create a simple script to run the NPU model.

cd ~/llama-4k
cat << 'EOF' > chat
#!/bin/bash
cd ~/llama-4k
export LD_LIBRARY_PATH="$(pwd):$LD_LIBRARY_PATH"
# Llama 3 Prompt Format
PROMPT="<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n$1<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
./genie-t2t-run -c htp-model-config-llama32-1b-gqa.json -p "$PROMPT"
EOF
chmod +x chat

Test it:

~/llama-4k/chat "What is the capital of France?"
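The script wraps your text in the standard Llama 3 chat template. To steer the model's tone or persona, you can prepend a system turn. A minimal sketch, replacing the PROMPT line in the chat script; the tags follow the public Llama 3 prompt spec, and we are assuming this runner treats a system header the same way it treats user and assistant headers:

SYSTEM="You are a concise assistant. Answer in one sentence."
PROMPT="<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n$SYSTEM<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n$1<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"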
We run Whisper on the CPU using HuggingFace Transformers. This sidesteps the NPU limitations discussed under "Why not Whisper on NPU?" below and produces accurate transcriptions.
cat << 'EOF' > ~/transcribe.sh
#!/bin/bash
# Transcribe audio using Whisper via HuggingFace Transformers
source ~/qai-venv/bin/activate
python3 << PYTHON
from transformers import pipeline
import warnings
# Suppress deprecation warnings
warnings.filterwarnings("ignore")
# Use whisper-tiny for speed, or whisper-small for accuracy
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny",
    device="cpu"
)
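# Note: for clips longer than ~30 s, add chunk_length_s=30 to the
# pipeline(...) call above to enable chunked long-form transcription.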
result = pipe("$1", generate_kwargs={"language": "en", "task": "transcribe"})
print(result["text"].strip())
PYTHON
EOF
chmod +x ~/transcribe.sh

Model Options:

- openai/whisper-tiny (39M params) - Fastest, ~1x realtime
- openai/whisper-base (74M params) - Balanced
- openai/whisper-small (244M params) - Most accurate, ~2.3x realtime
Download a sample file to test the system.
cd ~
wget https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac -O test_audio.wav  # FLAC content; decoders sniff the format from the header, so the .wav name is harmless
~/transcribe.sh test_audio.wav

Expected Output: "He hoped there would be stew for dinner, turnips and carrots and bruised potatoes..."
To measure throughput, create a benchmark script that reports the realtime factor (elapsed time divided by audio duration, so lower is better):

cat << 'EOF' > ~/benchmark_whisper.sh
#!/bin/bash
source ~/qai-venv/bin/activate
python3 << 'PYTHON'
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa
import time
# Load model
print("Loading Whisper Small...")
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
# Load audio
audio, sr = librosa.load("test_audio.wav", sr=16000)
duration = len(audio) / sr
print(f"Audio: {duration:.2f} seconds")
# Transcribe
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
start = time.time()
with torch.no_grad():
    ids = model.generate(inputs.input_features, language="en", task="transcribe")
elapsed = time.time() - start
text = processor.batch_decode(ids, skip_special_tokens=True)[0]
print(f"\nTranscription: {text}")
print(f"Time: {elapsed:.2f}s | Realtime factor: {elapsed/duration:.2f}x")
PYTHON
EOF
chmod +x ~/benchmark_whisper.sh
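Run it against the sample clip from the previous step:

~/benchmark_whisper.sh

A realtime factor below 1.0 means transcription finishes faster than the audio plays.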
Combine both tools! This script records your voice, converts it to text, sends it to Llama, and prints the answer.

cat << 'EOF' > ~/voice-chat.sh
#!/bin/bash
RECORDING="$HOME/my_voice.wav"
echo "π΄ Recording... (Press Ctrl+C to stop, or wait 5 seconds)"
arecord -d 5 -f S16_LE -r 16000 -c 1 -t wav "$RECORDING" 2>/dev/null
echo "β
Processing..."
# 1. Speech to Text (Whisper on CPU)
echo "ποΈ Transcribing..."
USER_TEXT=$(~/transcribe.sh "$RECORDING")
echo "π£οΈ You said: $USER_TEXT"
if [ -z "$USER_TEXT" ]; then
echo "β No speech detected."
exit 1
fi
# 2. Text to Intelligence (Llama on NPU)
echo "π€ AI Thinking..."
~/llama-4k/chat "$USER_TEXT"
EOF
chmod +x ~/voice-chat.sh
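Plug in a USB microphone, then confirm ALSA detects a capture device (card and device numbers are system-specific):

arecord -l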
Now run the assistant:

~/voice-chat.sh

Performance summary:

| Component | Model | Processor | Performance |
|---|---|---|---|
| Brain | Llama 3.2 1B (4096) | NPU (Hexagon) | ~15 tokens/sec |
| Ears | Whisper Tiny | CPU (Kryo) | ~1x realtime |
| Ears | Whisper Small | CPU (Kryo) | ~2.3x realtime |
| Memory | System RAM | Shared | ~2.5 GB Total |
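To verify the footprint on your own board, watch RAM while a chat is running (exact numbers depend on model and context length):

free -h   # or: watch -n 1 free -h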
Why not Whisper on NPU?
The QCS6490 NPU requires quantized (INT8) model I/O, but the qai_hub_models Whisper variants only support float precision. Qualcomm's pre-quantized Whisper models require AIMET-ONNX which isn't available for aarch64 Linux. CPU inference via Transformers is the most reliable path.
For better accuracy with longer audio or difficult accents:
# Edit ~/transcribe.sh and change the model line:
# whisper-tiny β whisper-base β whisper-small β whisper-medium
# Or create a high-accuracy version:
cat << 'EOF' > ~/transcribe-accurate.sh
#!/bin/bash
source ~/qai-venv/bin/activate
python3 << PYTHON
from transformers import pipeline
import warnings
warnings.filterwarnings("ignore")
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    device="cpu"
)
result = pipe("$1", generate_kwargs={"language": "en", "task": "transcribe"})
print(result["text"].strip())
PYTHON
EOF
chmod +x ~/transcribe-accurate.sh

Troubleshooting:

| Issue | Solution |
|---|---|
| Permission denied (/dev/fastrpc) | Run the Step 1 udev commands and reboot. |
| genie-t2t-run: not found | Ensure you are in ~/llama-4k and run chmod +x genie-t2t-run. |
| ModuleNotFoundError (Python) | Run source ~/qai-venv/bin/activate before using scripts. |
| Whisper outputs gibberish | Audio may be corrupt. Check with aplay your_audio.wav. |
| ALSA lib... warnings | Safe to ignore; audio still records correctly. |
| Slow Whisper performance | Use whisper-tiny instead of whisper-small. |
- Radxa Dragon Q6A Wiki
- HuggingFace Whisper Models
- Qualcomm AI Hub
- ai-hub-models GitHub - File issues for NPU Whisper support