@relic-yuexi
Last active February 11, 2026 08:30
Hugging Face Model/Dataset Downloader

An efficient Bash script for downloading models and datasets from Hugging Face, with support for resumable downloads, multi-threaded transfers, and flexible file filtering.

Features

  • 🚀 Multi-threaded downloads: uses aria2c for parallel connections, greatly improving download speed
  • 🔄 Resumable downloads: interrupted transfers can be resumed without re-downloading completed files
  • 🎯 File filtering: include/exclude specific files via wildcard patterns
  • 🔐 Authentication: access gated repositories with a username and token
  • 📦 Models and datasets: downloads both Hugging Face models and datasets
  • 🌍 Mirror support: set HF_ENDPOINT to a mirror such as hf-mirror.com for faster downloads in some regions
  • 💾 Custom paths: choose a local storage directory
  • 🏷️ Revision control: download a specific revision of a model/dataset

Requirements

The script depends on the following tools:

  • curl - fetches repository metadata
  • aria2c or wget - performs the file downloads (aria2c recommended)
  • jq (optional) - faster JSON parsing

Install the dependencies on Ubuntu/Debian:

sudo apt update
sudo apt install curl aria2 jq
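
After installing, a quick loop (not part of the script itself) can confirm which of the tools hfd.sh can use are actually on your PATH:

```shell
# Print OK/missing for each tool hfd.sh can use
for tool in curl aria2c wget jq; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: OK"
  else
    echo "$tool: missing"
  fi
done
```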

Usage

Basic syntax

./hfd.sh <REPO_ID> [options]

Arguments

Parameter  Required  Description
REPO_ID    Yes       Hugging Face repo ID, in the form org_name/repo_name, or a legacy single name such as gpt2

Options

Option  Description
--include PATTERN [PATTERN ...]  Include files matching the wildcard patterns (multiple patterns supported)
--exclude PATTERN [PATTERN ...]  Exclude files matching the wildcard patterns (multiple patterns supported)
--hf_username USERNAME  Hugging Face username for authentication (not your email)
--hf_token TOKEN  Hugging Face access token
--tool aria2c|wget  Download tool to use, default aria2c
-x THREADS  Per-file download threads for aria2c, default 4, maximum 10
-j JOBS  Concurrent downloads for aria2c, default 5, maximum 10
--dataset  Treat the repo as a dataset (default is model)
--local-dir PATH  Local storage directory, defaults to repo_name under the current directory
--revision REV  Revision to download, default main
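
Internally the script converts these wildcard patterns into an extended regex for filtering (sed escapes literal dots, then rewrites * as .*). The conversion can be reproduced on its own:

```shell
# Convert wildcard patterns into one grep -E regex, as the script does
patterns=("*.safetensors" "*.md")
regex=$(printf '%s\n' "${patterns[@]}" | sed 's/\./\\./g; s/\*/.*/g' | paste -sd '|' -)
echo "$regex"   # .*\.safetensors|.*\.md
echo "model.safetensors" | grep -qE "$regex" && echo "matched"
```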

Examples

Example 1: download a public model

# Download the GPT-2 model
./hfd.sh gpt2

# Download LLaMA-2-7B (requires authentication)
./hfd.sh meta-llama/Llama-2-7b --hf_username your_username --hf_token your_token

Example 2: file filtering

# Exclude all .safetensors files
./hfd.sh bigscience/bloom-560m --exclude "*.safetensors"

# Exclude multiple file types
./hfd.sh meta-llama/Llama-2-7b --exclude "*.safetensors" "*.md"

# Only include files under specific directories
./hfd.sh stabilityai/stable-diffusion-xl-base-1.0 --include "vae/*" "unet/*"

Example 3: datasets

# Download a dataset
./hfd.sh lavita/medical-qa-shared-task-v1-toy --dataset

Example 4: custom download parameters

# Use 8 threads and 10 concurrent jobs
./hfd.sh meta-llama/Meta-Llama-3-8B --hf_username user --hf_token token -x 8 -j 10

# Use wget as the download tool
./hfd.sh gpt2 --tool wget

Example 5: custom download directory

# Download into a specific directory
./hfd.sh meta-llama/Llama-2-7b --local-dir /path/to/models

Example 6: specific revision

# Download a specific revision of a model
./hfd.sh bartowski/Phi-3.5-mini-instruct-exl2 --revision 5_0

Example 7: combined options

# Download a specific revision, exclude some files, use multiple threads
./hfd.sh openbmb/MiniCPM-V-2_6 \
  --exclude "*.safetensors" "*.bin" \
  --hf_username your_username \
  --hf_token your_token \
  -x 8 -j 8 \
  --revision v2.6

How it works

  1. Fetch metadata: the script first fetches the repository's metadata from the Hugging Face API
  2. Generate file list: files are filtered by the include/exclude patterns and a list of download URLs is written
  3. Caching: the metadata and file list are cached so later runs can resume quickly
  4. Multi-threaded download: aria2c or wget performs the actual file transfers
  5. Resume: interrupted downloads continue where they left off, skipping files that are already complete
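
For aria2c, each entry in the generated URL list follows aria2's input-file format: the URL on one line, then indented dir=/out= options. The script builds entries roughly like this, shown here with a hypothetical file path for illustration:

```shell
# Build one aria2c input-file entry (hypothetical repo/file for illustration)
endpoint="https://huggingface.co"; repo="gpt2"; revision="main"
file="onnx/decoder_model.onnx"
printf '%s\n dir=%s\n out=%s\n\n' \
  "$endpoint/$repo/resolve/$revision/$file" \
  "$(dirname "$file")" "$(basename "$file")"
```

The dir= option tells aria2c to recreate the repo's subdirectory layout; out= sets the local filename.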

Environment variables

Variable  Description  Default
HF_ENDPOINT  Hugging Face API endpoint  https://huggingface.co

Set HF_ENDPOINT to use a mirror instead of the default endpoint:

export HF_ENDPOINT="https://hf-mirror.com"
./hfd.sh gpt2

Troubleshooting

Problem 1: command not found: aria2c or jq

# Install the missing dependencies
sudo apt install aria2 jq

Problem 2: downloading a gated repository fails

Make sure you pass correct --hf_username and --hf_token values. Get an access token at https://huggingface.co/settings/tokens.

Problem 3: slow downloads

  • Make sure aria2c is installed (it is faster than wget)
  • Increase the thread count: -x 8 -j 8
  • Check your network connection
  • Try a mirror by setting HF_ENDPOINT, e.g. https://hf-mirror.com

Problem 4: file list generation is slow

Installing jq speeds up JSON parsing:

sudo apt install jq
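
Without jq, the script falls back to grep/awk to pull the rfilename fields out of the metadata JSON. The extraction step looks like this on a minimal inline sample:

```shell
# Extract rfilename values without jq (sample JSON inlined for illustration)
response='{"siblings":[{"rfilename":"config.json"},{"rfilename":"model.bin"}]}'
printf '%s' "$response" | grep -o '"rfilename":"[^"]*"' | awk -F'"' '{print $4}'
# prints:
# config.json
# model.bin
```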

Notes

  • Gated repositories require both --hf_username and --hf_token
  • For large files, use aria2c with an appropriate thread count
  • The script creates a .hfd directory to cache metadata and the file list
  • Press Ctrl+C to interrupt a download; re-run the same command to resume
  • Set the HF_ENDPOINT environment variable to use a mirror endpoint such as hf-mirror.com

Origin

This script is based on the prototype at https://gist.github.com/padeoe/697678ab8e528b85a2a7bddafea1fa4f

It extends the original with:

  • Improved user interface and messages
  • Stronger error handling
  • include/exclude pattern support
  • Revision selection
  • Improved caching and resume behavior

License

This script is distributed under the same license as the original prototype.

Acknowledgments

Thanks to padeoe, the author of the original script.

#!/usr/bin/env bash
# Color definitions
RED='\033[0;31m'; GREEN='\033[0;32m'; YELLOW='\033[1;33m'; NC='\033[0m' # No Color
trap 'printf "${YELLOW}\nDownload interrupted. You can resume by re-running the command.\n${NC}"; exit 1' INT
display_help() {
cat << EOF
Usage:
hfd <REPO_ID> [--include include_pattern1 include_pattern2 ...] [--exclude exclude_pattern1 exclude_pattern2 ...] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [-j jobs] [--dataset] [--local-dir path] [--revision rev]
Description:
Downloads a model or dataset from Hugging Face using the provided repo ID.
Arguments:
REPO_ID The Hugging Face repo ID (Required)
Format: 'org_name/repo_name' or legacy format (e.g., gpt2)
Options:
include/exclude_pattern The patterns to match against file path, supports wildcard characters.
e.g., '--exclude *.safetensor *.md', '--include vae/*'.
--include (Optional) Patterns to include files for downloading (supports multiple patterns).
--exclude (Optional) Patterns to exclude files from downloading (supports multiple patterns).
--hf_username (Optional) Hugging Face username for authentication (not email).
--hf_token (Optional) Hugging Face token for authentication.
--tool (Optional) Download tool to use: aria2c (default) or wget.
-x (Optional) Number of download threads for aria2c (default: 4).
-j (Optional) Number of concurrent downloads for aria2c (default: 5).
--dataset (Optional) Flag to indicate downloading a dataset.
--local-dir (Optional) Directory path to store the downloaded data.
Defaults to the current directory with a subdirectory named 'repo_name'
if REPO_ID is composed of 'org_name/repo_name'.
--revision (Optional) Model/Dataset revision to download (default: main).
Example:
hfd gpt2
hfd bigscience/bloom-560m --exclude *.safetensors
hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
hfd lavita/medical-qa-shared-task-v1-toy --dataset
hfd bartowski/Phi-3.5-mini-instruct-exl2 --revision 5_0
EOF
exit 1
}
[[ -z "$1" || "$1" =~ ^-h || "$1" =~ ^--help ]] && display_help
REPO_ID=$1
shift
# Default values
TOOL="aria2c"
THREADS=4
CONCURRENT=5
HF_ENDPOINT=${HF_ENDPOINT:-"https://huggingface.co"}
INCLUDE_PATTERNS=()
EXCLUDE_PATTERNS=()
REVISION="main"
validate_number() {
[[ "$2" =~ ^[1-9][0-9]*$ && "$2" -le "$3" ]] || { printf "${RED}[Error] $1 must be 1-$3${NC}\n"; exit 1; }
}
# Argument parsing
while [[ $# -gt 0 ]]; do
case $1 in
--include) shift; while [[ $# -gt 0 && ! ($1 =~ ^--) && ! ($1 =~ ^-[^-]) ]]; do INCLUDE_PATTERNS+=("$1"); shift; done ;;
--exclude) shift; while [[ $# -gt 0 && ! ($1 =~ ^--) && ! ($1 =~ ^-[^-]) ]]; do EXCLUDE_PATTERNS+=("$1"); shift; done ;;
--hf_username) HF_USERNAME="$2"; shift 2 ;;
--hf_token) HF_TOKEN="$2"; shift 2 ;;
--tool)
case $2 in
aria2c|wget)
TOOL="$2"
;;
*)
printf "%b[Error] Invalid tool. Use 'aria2c' or 'wget'.%b\n" "$RED" "$NC"
exit 1
;;
esac
shift 2
;;
-x) validate_number "threads (-x)" "$2" 10; THREADS="$2"; shift 2 ;;
-j) validate_number "concurrent downloads (-j)" "$2" 10; CONCURRENT="$2"; shift 2 ;;
--dataset) DATASET=1; shift ;;
--local-dir) LOCAL_DIR="$2"; shift 2 ;;
--revision) REVISION="$2"; shift 2 ;;
*) display_help ;;
esac
done
# Generate current command string
generate_command_string() {
local cmd_string="REPO_ID=$REPO_ID"
cmd_string+=" TOOL=$TOOL"
cmd_string+=" INCLUDE_PATTERNS=${INCLUDE_PATTERNS[*]}"
cmd_string+=" EXCLUDE_PATTERNS=${EXCLUDE_PATTERNS[*]}"
cmd_string+=" DATASET=${DATASET:-0}"
cmd_string+=" HF_USERNAME=${HF_USERNAME:-}"
cmd_string+=" HF_TOKEN=${HF_TOKEN:-}"
cmd_string+=" HF_ENDPOINT=${HF_ENDPOINT:-}"
cmd_string+=" REVISION=$REVISION"
echo "$cmd_string"
}
# Check if aria2, wget, curl are installed
check_command() {
if ! command -v "$1" &>/dev/null; then
printf "%b%s is not installed. Please install it first.%b\n" "$RED" "$1" "$NC"
exit 1
fi
}
check_command curl; check_command "$TOOL"
if [[ -n "$LOCAL_DIR" ]]; then
LOCAL_DIR="$LOCAL_DIR/$REPO_ID"
else
LOCAL_DIR="$REPO_ID"
fi
mkdir -p "$LOCAL_DIR/.hfd"
if [[ "$DATASET" == 1 ]]; then
METADATA_API_PATH="datasets/$REPO_ID"
DOWNLOAD_API_PATH="datasets/$REPO_ID"
CUT_DIRS=5
else
METADATA_API_PATH="models/$REPO_ID"
DOWNLOAD_API_PATH="$REPO_ID"
CUT_DIRS=4
fi
# Modify API URL, construct based on revision
if [[ "$REVISION" != "main" ]]; then
METADATA_API_PATH="$METADATA_API_PATH/revision/$REVISION"
fi
API_URL="$HF_ENDPOINT/api/$METADATA_API_PATH"
METADATA_FILE="$LOCAL_DIR/.hfd/repo_metadata.json"
# Fetch and save metadata
fetch_and_save_metadata() {
status_code=$(curl -L -s -w "%{http_code}" -o "$METADATA_FILE" ${HF_TOKEN:+-H "Authorization: Bearer $HF_TOKEN"} "$API_URL")
RESPONSE=$(cat "$METADATA_FILE")
if [ "$status_code" -eq 200 ]; then
printf "%s\n" "$RESPONSE"
else
printf "%b[Error] Failed to fetch metadata from $API_URL. HTTP status code: $status_code.%b\n$RESPONSE\n" "${RED}" "${NC}" >&2
rm "$METADATA_FILE"
exit 1
fi
}
check_authentication() {
local response="$1"
if command -v jq &>/dev/null; then
local gated
gated=$(echo "$response" | jq -r '.gated // false')
if [[ "$gated" != "false" && ( -z "$HF_TOKEN" || -z "$HF_USERNAME" ) ]]; then
printf "${RED}The repository requires authentication, but --hf_username and --hf_token are not passed. Please get a token from https://huggingface.co/settings/tokens.\nExiting.\n${NC}"
exit 1
fi
else
if echo "$response" | grep -q '"gated":[^f]' && [[ -z "$HF_TOKEN" || -z "$HF_USERNAME" ]]; then
printf "${RED}The repository requires authentication, but --hf_username and --hf_token are not passed. Please get a token from https://huggingface.co/settings/tokens.\nExiting.\n${NC}"
exit 1
fi
fi
}
if [[ ! -f "$METADATA_FILE" ]]; then
printf "%bFetching repo metadata...%b\n" "$YELLOW" "$NC"
RESPONSE=$(fetch_and_save_metadata) || exit 1
check_authentication "$RESPONSE"
else
printf "%bUsing cached metadata: $METADATA_FILE%b\n" "$GREEN" "$NC"
RESPONSE=$(cat "$METADATA_FILE")
check_authentication "$RESPONSE"
fi
should_regenerate_filelist() {
local command_file="$LOCAL_DIR/.hfd/last_download_command"
local current_command=$(generate_command_string)
# If file list doesn't exist, regenerate
if [[ ! -f "$LOCAL_DIR/$fileslist_file" ]]; then
echo "$current_command" > "$command_file"
return 0
fi
# If command file doesn't exist, regenerate
if [[ ! -f "$command_file" ]]; then
echo "$current_command" > "$command_file"
return 0
fi
# Compare current command with saved command
local saved_command=$(cat "$command_file")
if [[ "$current_command" != "$saved_command" ]]; then
echo "$current_command" > "$command_file"
return 0
fi
return 1
}
fileslist_file=".hfd/${TOOL}_urls.txt"
if should_regenerate_filelist; then
# Remove existing file list if it exists
[[ -f "$LOCAL_DIR/$fileslist_file" ]] && rm "$LOCAL_DIR/$fileslist_file"
printf "%bGenerating file list...%b\n" "$YELLOW" "$NC"
# Convert include and exclude patterns to regex
INCLUDE_REGEX=""
EXCLUDE_REGEX=""
if ((${#INCLUDE_PATTERNS[@]})); then
INCLUDE_REGEX=$(printf '%s\n' "${INCLUDE_PATTERNS[@]}" | sed 's/\./\\./g; s/\*/.*/g' | paste -sd '|' -)
fi
if ((${#EXCLUDE_PATTERNS[@]})); then
EXCLUDE_REGEX=$(printf '%s\n' "${EXCLUDE_PATTERNS[@]}" | sed 's/\./\\./g; s/\*/.*/g' | paste -sd '|' -)
fi
# Check if jq is available
if command -v jq &>/dev/null; then
process_with_jq() {
if [[ "$TOOL" == "aria2c" ]]; then
printf "%s" "$RESPONSE" | jq -r \
--arg endpoint "$HF_ENDPOINT" \
--arg repo_id "$DOWNLOAD_API_PATH" \
--arg token "$HF_TOKEN" \
--arg include_regex "$INCLUDE_REGEX" \
--arg exclude_regex "$EXCLUDE_REGEX" \
--arg revision "$REVISION" \
'
.siblings[]
| select(
.rfilename != null
and ($include_regex == "" or (.rfilename | test($include_regex)))
and ($exclude_regex == "" or (.rfilename | test($exclude_regex) | not))
)
| [
($endpoint + "/" + $repo_id + "/resolve/" + $revision + "/" + .rfilename),
" dir=" + (.rfilename | split("/")[:-1] | join("/")),
" out=" + (.rfilename | split("/")[-1]),
if $token != "" then "header=Authorization:Bearer " + $token else empty end,
""
]
| join("\n")
'
else
printf "%s" "$RESPONSE" | jq -r \
--arg endpoint "$HF_ENDPOINT" \
--arg repo_id "$DOWNLOAD_API_PATH" \
--arg include_regex "$INCLUDE_REGEX" \
--arg exclude_regex "$EXCLUDE_REGEX" \
--arg revision "$REVISION" \
'
.siblings[]
| select(
.rfilename != null
and ($include_regex == "" or (.rfilename | test($include_regex)))
and ($exclude_regex == "" or (.rfilename | test($exclude_regex) | not))
)
| ($endpoint + "/" + $repo_id + "/resolve/" + $revision + "/" + .rfilename)
'
fi
}
result=$(process_with_jq)
printf "%s\n" "$result" > "$LOCAL_DIR/$fileslist_file"
else
printf "%b[Warning] jq not installed, using grep/awk for metadata json parsing (slower). Consider installing jq for better parsing performance.%b\n" "$YELLOW" "$NC"
process_with_grep_awk() {
local include_pattern=""
local exclude_pattern=""
local output=""
if ((${#INCLUDE_PATTERNS[@]})); then
include_pattern=$(printf '%s\n' "${INCLUDE_PATTERNS[@]}" | sed 's/\./\\./g; s/\*/.*/g' | paste -sd '|' -)
fi
if ((${#EXCLUDE_PATTERNS[@]})); then
exclude_pattern=$(printf '%s\n' "${EXCLUDE_PATTERNS[@]}" | sed 's/\./\\./g; s/\*/.*/g' | paste -sd '|' -)
fi
local files=$(printf '%s' "$RESPONSE" | grep -o '"rfilename":"[^"]*"' | awk -F'"' '{print $4}')
if [[ -n "$include_pattern" ]]; then
files=$(printf '%s\n' "$files" | grep -E "$include_pattern")
fi
if [[ -n "$exclude_pattern" ]]; then
files=$(printf '%s\n' "$files" | grep -vE "$exclude_pattern")
fi
while IFS= read -r file; do
if [[ -n "$file" ]]; then
if [[ "$TOOL" == "aria2c" ]]; then
output+="$HF_ENDPOINT/$DOWNLOAD_API_PATH/resolve/$REVISION/$file"$'\n'
output+=" dir=$(dirname "$file")"$'\n'
output+=" out=$(basename "$file")"$'\n'
[[ -n "$HF_TOKEN" ]] && output+=" header=Authorization: Bearer $HF_TOKEN"$'\n'
output+=$'\n'
else
output+="$HF_ENDPOINT/$DOWNLOAD_API_PATH/resolve/$REVISION/$file"$'\n'
fi
fi
done <<< "$files"
printf '%s' "$output"
}
result=$(process_with_grep_awk)
printf "%s\n" "$result" > "$LOCAL_DIR/$fileslist_file"
fi
else
printf "%bResume from file list: $LOCAL_DIR/$fileslist_file%b\n" "$GREEN" "$NC"
fi
# Perform download
printf "${YELLOW}Starting download with $TOOL to $LOCAL_DIR...\n${NC}"
cd "$LOCAL_DIR" || exit 1
if [[ "$TOOL" == "aria2c" ]]; then
aria2c --console-log-level=error --file-allocation=none -x "$THREADS" -j "$CONCURRENT" -s "$THREADS" -k 1M -c -i "$fileslist_file" --save-session="$fileslist_file"
elif [[ "$TOOL" == "wget" ]]; then
wget -x -nH --cut-dirs="$CUT_DIRS" ${HF_TOKEN:+--header="Authorization: Bearer $HF_TOKEN"} --input-file="$fileslist_file" --continue
fi
if [[ $? -eq 0 ]]; then
printf "${GREEN}Download completed successfully. Repo directory: $PWD\n${NC}"
else
printf "${RED}Download encountered errors.\n${NC}"
exit 1
fi