This guide sets up hardware-accelerated AI on the Radxa Dragon Q6A. We will run Llama 3.2 (an LLM) on the NPU and Whisper (speech recognition) on the CPU to create a fully voice-interactive system.
Hardware: Radxa Dragon Q6A (QCS6490)
OS: Ubuntu 24.04 Noble (T7 Image or newer)
Status: ✅ Verified Working (Jan 2026)
Run these commands once to install drivers and set permissions.
sudo apt update
sudo apt install -y fastrpc fastrpc-dev libcdsprpc1 radxa-firmware-qcs6490 \
python3-pip python3.12-venv libportaudio2 ffmpeg git alsa-utils

Next, add a udev rule that opens up the FastRPC and DMA heap device nodes. This ensures you don't get "Permission Denied" errors after rebooting.
sudo tee /etc/udev/rules.d/99-fastrpc.rules << 'EOF'
KERNEL=="fastrpc-*", MODE="0666"
SUBSYSTEM=="dma_heap", KERNEL=="system", MODE="0666"
EOF
# Apply immediately
sudo udevadm control --reload-rules
sudo udevadm trigger
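To confirm the rules took effect, check the device nodes (names can vary between kernel builds, so adjust the glob if needed):

# Both should report mode 0666 (crw-rw-rw-)
ls -l /dev/fastrpc-*
ls -l /dev/dma_heap/system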
We use a virtual environment to prevent dependency conflicts with system packages.

# Create and activate
python3 -m venv ~/qai-venv
source ~/qai-venv/bin/activate
# Upgrade pip
pip install --upgrade pip
# Install AI tools (Whisper via Transformers, Audio libraries)
pip install "transformers[torch]" librosa soundfile sounddevice accelerateNote: We use HuggingFace Transformers for Whisper instead of qai_hub_models because the QCS6490 NPU requires quantized models, and the qai_hub_models Whisper variants don't support quantization for this device.
We use the 4096-token-context model for better conversation memory. (Note: the download requires ~2 GB of disk space.)
# Ensure you are NOT in the venv for this part (using system tools for binary download)
deactivate 2>/dev/null
# Install downloader
pip3 install modelscope --break-system-packages
# Download
mkdir -p ~/llama-4k && cd ~/llama-4k
modelscope download --model radxa/Llama3.2-1B-4096-qairt-v68 --local_dir .
# Make the runner executable
chmod +x genie-t2t-run
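Confirm the runner binary arrived and is executable (the file name comes from this ModelScope package and may change in future releases):

ls -l ~/llama-4k/genie-t2t-run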
Create a simple script to run the NPU model.

cd ~/llama-4k
cat << 'EOF' > chat
#!/bin/bash
cd ~/llama-4k
export LD_LIBRARY_PATH="$(pwd):$LD_LIBRARY_PATH"
# Llama 3 Prompt Format
PROMPT="<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n$1<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
./genie-t2t-run -c htp-model-config-llama32-1b-gqa.json -p "$PROMPT"
EOF
chmod +x chat

Test it:

~/llama-4k/chat "What is the capital of France?"
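The script wraps your text in the standard Llama 3 chat template. To steer the model's tone or persona, you can prepend a system turn. A minimal sketch, replacing the PROMPT line in the chat script; the tags follow the public Llama 3 prompt spec, and we are assuming this runner treats a system header the same way it treats user and assistant headers:

SYSTEM="You are a concise assistant. Answer in one sentence."
PROMPT="<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n$SYSTEM<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n$1<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"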
We run Whisper on the CPU using HuggingFace Transformers. This sidesteps the NPU limitations discussed under "Why not Whisper on NPU?" below and produces accurate transcriptions.
cat << 'EOF' > ~/transcribe.sh
#!/bin/bash
# Transcribe audio using Whisper via HuggingFace Transformers
source ~/qai-venv/bin/activate
python3 << PYTHON
from transformers import pipeline
import warnings
# Suppress deprecation warnings
warnings.filterwarnings("ignore")
# Use whisper-tiny for speed, or whisper-small for accuracy
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny",
    device="cpu"
)
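# Note: for clips longer than ~30 s, add chunk_length_s=30 to the
# pipeline(...) call above to enable chunked long-form transcription.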
result = pipe("$1", generate_kwargs={"language": "en", "task": "transcribe"})
print(result["text"].strip())
PYTHON
EOF
chmod +x ~/transcribe.sh

Model Options:

- openai/whisper-tiny (39M params) - Fastest, ~1x realtime
- openai/whisper-base (74M params) - Balanced
- openai/whisper-small (244M params) - Most accurate, ~2.3x realtime
Download a sample file to test the system.
cd ~
wget https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac -O test_audio.wav  # FLAC content; decoders sniff the format from the header, so the .wav name is harmless
~/transcribe.sh test_audio.wav

Expected Output: "He hoped there would be stew for dinner, turnips and carrots and bruised potatoes..."
To measure throughput, create a benchmark script that reports the realtime factor (elapsed time divided by audio duration, so lower is better):

cat << 'EOF' > ~/benchmark_whisper.sh
#!/bin/bash
source ~/qai-venv/bin/activate
python3 << 'PYTHON'
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa
import time
# Load model
print("Loading Whisper Small...")
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
# Load audio
audio, sr = librosa.load("test_audio.wav", sr=16000)
duration = len(audio) / sr
print(f"Audio: {duration:.2f} seconds")
# Transcribe
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
start = time.time()
with torch.no_grad():
    ids = model.generate(inputs.input_features, language="en", task="transcribe")
elapsed = time.time() - start
text = processor.batch_decode(ids, skip_special_tokens=True)[0]
print(f"\nTranscription: {text}")
print(f"Time: {elapsed:.2f}s | Realtime factor: {elapsed/duration:.2f}x")
PYTHON
EOF
chmod +x ~/benchmark_whisper.sh
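Run it against the sample clip from the previous step:

~/benchmark_whisper.sh

A realtime factor below 1.0 means transcription finishes faster than the audio plays.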
Combine both tools! This script records your voice, converts it to text, sends it to Llama, and prints the answer.

cat << 'EOF' > ~/voice-chat.sh
#!/bin/bash
RECORDING="$HOME/my_voice.wav"
echo "π΄ Recording... (Press Ctrl+C to stop, or wait 5 seconds)"
arecord -d 5 -f S16_LE -r 16000 -c 1 -t wav "$RECORDING" 2>/dev/null
echo "β
Processing..."
# 1. Speech to Text (Whisper on CPU)
echo "ποΈ Transcribing..."
USER_TEXT=$(~/transcribe.sh "$RECORDING")
echo "π£οΈ You said: $USER_TEXT"
if [ -z "$USER_TEXT" ]; then
echo "β No speech detected."
exit 1
fi
# 2. Text to Intelligence (Llama on NPU)
echo "π€ AI Thinking..."
~/llama-4k/chat "$USER_TEXT"
EOF
chmod +x ~/voice-chat.sh
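Plug in a USB microphone, then confirm ALSA detects a capture device (card and device numbers are system-specific):

arecord -l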
Now run the assistant:

~/voice-chat.sh

Performance summary:

| Component | Model | Processor | Performance |
|---|---|---|---|
| Brain | Llama 3.2 1B (4096) | NPU (Hexagon) | ~15 tokens/sec |
| Ears | Whisper Tiny | CPU (Kryo) | ~1x realtime |
| Ears | Whisper Small | CPU (Kryo) | ~2.3x realtime |
| Memory | System RAM | Shared | ~2.5 GB Total |
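To verify the footprint on your own board, watch RAM while a chat is running (exact numbers depend on model and context length):

free -h   # or: watch -n 1 free -h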
Why not Whisper on NPU?
The QCS6490 NPU requires quantized (INT8) model I/O, but the qai_hub_models Whisper variants only support float precision. Qualcomm's pre-quantized Whisper models require AIMET-ONNX which isn't available for aarch64 Linux. CPU inference via Transformers is the most reliable path.
For better accuracy with longer audio or difficult accents:
# Edit ~/transcribe.sh and change the model line:
# whisper-tiny β whisper-base β whisper-small β whisper-medium
# Or create a high-accuracy version:
cat << 'EOF' > ~/transcribe-accurate.sh
#!/bin/bash
source ~/qai-venv/bin/activate
python3 << PYTHON
from transformers import pipeline
import warnings
warnings.filterwarnings("ignore")
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    device="cpu"
)
result = pipe("$1", generate_kwargs={"language": "en", "task": "transcribe"})
print(result["text"].strip())
PYTHON
EOF
chmod +x ~/transcribe-accurate.sh

Troubleshooting:

| Issue | Solution |
|---|---|
| Permission denied (/dev/fastrpc) | Run the Step 1 udev commands and reboot. |
| genie-t2t-run: not found | Ensure you are in ~/llama-4k and run chmod +x genie-t2t-run. |
| ModuleNotFoundError (Python) | Run source ~/qai-venv/bin/activate before using scripts. |
| Whisper outputs gibberish | Audio may be corrupt. Check with aplay your_audio.wav. |
| ALSA lib... warnings | Safe to ignore; audio still records correctly. |
| Slow Whisper performance | Use whisper-tiny instead of whisper-small. |
- Radxa Dragon Q6A Wiki
- HuggingFace Whisper Models
- Qualcomm AI Hub
- ai-hub-models GitHub - File issues for NPU Whisper support