
@padeoe
Last active September 19, 2024 03:30
CLI tool for downloading Huggingface models and datasets with aria2/wget + git

🤗Huggingface Model Downloader

Because the official huggingface-cli lacks multi-threaded download support and hf_transfer has inadequate error handling, this command-line tool uses wget or aria2 for LFS files and git clone for everything else.
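
The core idea, reduced to a minimal sketch (the full script below adds authentication, include/exclude filtering, mirror support and error handling; the repo name is just an illustration):

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/bigscience/bloom-560m && cd bloom-560m
git lfs ls-files | cut -d ' ' -f 3- | while IFS= read -r f; do
    truncate -s 0 "$f"   # empty the LFS pointer file so the real download replaces it
    aria2c -x 4 -s 4 -k 1M -c "https://huggingface.co/bigscience/bloom-560m/resolve/main/$f" \
        -d "$(dirname "$f")" -o "$(basename "$f")"
done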

Features

  • ⏯️ Resume from breakpoint: you can interrupt with Ctrl+C and re-run the command at any time to resume.
  • 🚀 Multi-threaded Download: Utilize multiple threads to speed up the download process.
  • 🚫 File Exclusion: Use --exclude or --include to skip or select files, saving time for models published in duplicate formats (e.g., *.bin and *.safetensors).
  • 🔐 Auth Support: For gated models that require Huggingface login, use --hf_username and --hf_token to authenticate.
  • 🪞 Mirror Site Support: Set via the HF_ENDPOINT environment variable.
  • 🌍 Proxy Support: Set via the HTTPS_PROXY environment variable (see the example after this list).
  • 📦 Simple: depends only on git and aria2c/wget.
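
For example, both variables can simply be exported before running the script (the mirror URL and proxy address below are placeholders; substitute your own):

export HF_ENDPOINT=https://hf-mirror.com
export HTTPS_PROXY=http://127.0.0.1:7890
hfd bigscience/bloom-560m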

Usage

First, download hfd.sh or clone this repo, then grant execute permission to the script.

chmod a+x hfd.sh

You can create an alias for convenience:

alias hfd="$PWD/hfd.sh"
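
To keep the alias across shell sessions, you could append it to your shell startup file (assuming bash with ~/.bashrc; adjust for your shell):

echo "alias hfd='$PWD/hfd.sh'" >> ~/.bashrc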

Usage Instructions:

$ ./hfd.sh -h
Usage:
  hfd <repo_id> [--include include_pattern1 include_pattern2 ...] [--exclude exclude_pattern1 exclude_pattern2 ...] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id        The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify string patterns to include files for downloading. Supports multiple patterns.
  --exclude       (Optional) Flag to specify string patterns to exclude files from downloading. Supports multiple patterns.
  include/exclude_pattern The patterns to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor *.txt', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset

Download a model:

hfd bigscience/bloom-560m

Download a model that requires login

Get your Hugging Face token from https://huggingface.co/settings/tokens, then:

hfd meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME_NOT_EMAIL --hf_token YOUR_HF_TOKEN

Download a model and exclude certain files (e.g., .safetensors):

hfd bigscience/bloom-560m --exclude *.safetensors

Download with aria2c and multiple threads:

hfd bigscience/bloom-560m --tool aria2c -x 8

Output: During the download, the file URLs will be displayed:

$ hfd bigscience/bloom-560m --tool wget --exclude *.safetensors
...
Start Downloading lfs files, bash script:

wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/flax_model.msgpack
# wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/model.safetensors
wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx
...
#!/usr/bin/env bash
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
trap 'printf "${YELLOW}\nDownload interrupted. If you re-run the command, you can resume the download from the breakpoint.\n${NC}"; exit 1' INT
display_help() {
    cat << EOF
Usage:
  hfd <repo_id> [--include include_pattern1 include_pattern2 ...] [--exclude exclude_pattern1 exclude_pattern2 ...] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id        The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify string patterns to include files for downloading. Supports multiple patterns.
  --exclude       (Optional) Flag to specify string patterns to exclude files from downloading. Supports multiple patterns.
  include/exclude_pattern The patterns to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor *.txt', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset
EOF
    exit 1
}
MODEL_ID=$1
shift
# Default values
TOOL="aria2c"
THREADS=4
HF_ENDPOINT=${HF_ENDPOINT:-"https://huggingface.co"}
INCLUDE_PATTERNS=()
EXCLUDE_PATTERNS=()
while [[ $# -gt 0 ]]; do
    case $1 in
        --include)
            shift
            while [[ $# -gt 0 && ! $1 =~ ^-- ]]; do
                INCLUDE_PATTERNS+=("$1")
                shift
            done
            ;;
        --exclude)
            shift
            while [[ $# -gt 0 && ! $1 =~ ^-- ]]; do
                EXCLUDE_PATTERNS+=("$1")
                shift
            done
            ;;
        --hf_username) HF_USERNAME="$2"; shift 2 ;;
        --hf_token) HF_TOKEN="$2"; shift 2 ;;
        --tool) TOOL="$2"; shift 2 ;;
        -x) THREADS="$2"; shift 2 ;;
        --dataset) DATASET=1; shift ;;
        --local-dir) LOCAL_DIR="$2"; shift 2 ;;
        *) shift ;;
    esac
done
# Check if aria2, wget, curl, git, and git-lfs are installed
check_command() {
    if ! command -v $1 &>/dev/null; then
        echo -e "${RED}$1 is not installed. Please install it first.${NC}"
        exit 1
    fi
}
# Mark current repo safe when using shared file system like samba or nfs
ensure_ownership() {
    if git status 2>&1 | grep "fatal: detected dubious ownership in repository at" > /dev/null; then
        git config --global --add safe.directory "${PWD}"
        printf "${YELLOW}Detected dubious ownership in repository, mark ${PWD} safe using git, edit ~/.gitconfig if you want to reverse this.\n${NC}"
    fi
}
[[ "$TOOL" == "aria2c" ]] && check_command aria2c
[[ "$TOOL" == "wget" ]] && check_command wget
check_command curl; check_command git; check_command git-lfs
[[ -z "$MODEL_ID" || "$MODEL_ID" =~ ^-h ]] && display_help
if [[ -z "$LOCAL_DIR" ]]; then
LOCAL_DIR="${MODEL_ID#*/}"
fi
if [[ "$DATASET" == 1 ]]; then
MODEL_ID="datasets/$MODEL_ID"
fi
echo "Downloading to $LOCAL_DIR"
if [ -d "$LOCAL_DIR/.git" ]; then
printf "${YELLOW}%s exists, Skip Clone.\n${NC}" "$LOCAL_DIR"
cd "$LOCAL_DIR" && ensure_ownership && GIT_LFS_SKIP_SMUDGE=1 git pull || { printf "${RED}Git pull failed.${NC}\n"; exit 1; }
else
REPO_URL="$HF_ENDPOINT/$MODEL_ID"
GIT_REFS_URL="${REPO_URL}/info/refs?service=git-upload-pack"
echo "Testing GIT_REFS_URL: $GIT_REFS_URL"
response=$(curl -s -o /dev/null -w "%{http_code}" "$GIT_REFS_URL")
if [ "$response" == "401" ] || [ "$response" == "403" ]; then
if [[ -z "$HF_USERNAME" || -z "$HF_TOKEN" ]]; then
printf "${RED}HTTP Status Code: $response.\nThe repository requires authentication, but --hf_username and --hf_token is not passed. Please get token from https://huggingface.co/settings/tokens.\nExiting.\n${NC}"
exit 1
fi
REPO_URL="https://$HF_USERNAME:$HF_TOKEN@${HF_ENDPOINT#https://}/$MODEL_ID"
elif [ "$response" != "200" ]; then
printf "${RED}Unexpected HTTP Status Code: $response\n${NC}"
printf "${YELLOW}Executing debug command: curl -v %s\nOutput:${NC}\n" "$GIT_REFS_URL"
curl -v "$GIT_REFS_URL"; printf "\n${RED}Git clone failed.\n${NC}"; exit 1
fi
echo "GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR"
GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR && cd "$LOCAL_DIR" || { printf "${RED}Git clone failed.\n${NC}"; exit 1; }
ensure_ownership
while IFS= read -r file; do
truncate -s 0 "$file"
done <<< $(git lfs ls-files | cut -d ' ' -f 3-)
fi
printf "\nStart Downloading lfs files, bash script:\ncd $LOCAL_DIR\n"
files=$(git lfs ls-files | cut -d ' ' -f 3-)
declare -a urls
file_matches_include_patterns() {
    local file="$1"
    for pattern in "${INCLUDE_PATTERNS[@]}"; do
        if [[ "$file" == $pattern ]]; then
            return 0
        fi
    done
    return 1
}
file_matches_exclude_patterns() {
    local file="$1"
    for pattern in "${EXCLUDE_PATTERNS[@]}"; do
        if [[ "$file" == $pattern ]]; then
            return 0
        fi
    done
    return 1
}
while IFS= read -r file; do
    url="$HF_ENDPOINT/$MODEL_ID/resolve/main/$file"
    file_dir=$(dirname "$file")
    mkdir -p "$file_dir"
    if [[ "$TOOL" == "wget" ]]; then
        download_cmd="wget -c \"$url\" -O \"$file\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="wget --header=\"Authorization: Bearer ${HF_TOKEN}\" -c \"$url\" -O \"$file\""
    else
        download_cmd="aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="aria2c --header=\"Authorization: Bearer ${HF_TOKEN}\" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
    fi
    if [[ ${#INCLUDE_PATTERNS[@]} -gt 0 ]]; then
        file_matches_include_patterns "$file" || { printf "# %s\n" "$download_cmd"; continue; }
    fi
    if [[ ${#EXCLUDE_PATTERNS[@]} -gt 0 ]]; then
        file_matches_exclude_patterns "$file" && { printf "# %s\n" "$download_cmd"; continue; }
    fi
    printf "%s\n" "$download_cmd"
    urls+=("$url|$file")
done <<< "$files"
for url_file in "${urls[@]}"; do
    IFS='|' read -r url file <<< "$url_file"
    printf "${YELLOW}Start downloading ${file}.\n${NC}"
    file_dir=$(dirname "$file")
    if [[ "$TOOL" == "wget" ]]; then
        [[ -n "$HF_TOKEN" ]] && wget --header="Authorization: Bearer ${HF_TOKEN}" -c "$url" -O "$file" || wget -c "$url" -O "$file"
    else
        [[ -n "$HF_TOKEN" ]] && aria2c --header="Authorization: Bearer ${HF_TOKEN}" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")" || aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")"
    fi
    [[ $? -eq 0 ]] && printf "Downloaded %s successfully.\n" "$url" || { printf "${RED}Failed to download %s.\n${NC}" "$url"; exit 1; }
done
printf "${GREEN}Download completed successfully.\n${NC}"
@padeoe
Author

padeoe commented May 15, 2024

Some updates and related contributors.

@char-1ee

Do I need to install aria2c in advance, or am I using it incorrectly?

$ ./hfd.sh deepseek-ai/DeepSeek-V2-Chat --tool aria2c -x 4 
aria2c is not installed. Please install it first.                

@MintGrass

Hello, I have an issue when downloading a large dataset like oscar-corpus/OSCAR-2301. I only want to download a sub-folder, such as its Chinese part. How should I run the command? I have tried --include reolve/main/en_meta or en_meta_*.zst, but it didn't work.

try --include en_meta/*, it should match the full path of target files.

For example, I only want to download en_meta/* under the repository root and nothing else, similar to a git sparse checkout. How should I set that up?

Using --include en_meta/*, the other files are still created locally as well (as 0-byte files).

@jane00

jane00 commented Jun 12, 2024

Downloading to Qwen2-7B-Instruct-GGUF
Qwen2-7B-Instruct-GGUF exists, Skip Clone.
error: cannot pull with rebase: You have unstaged changes.
error: please commit or stash them.
Git pull failed.

If the download is interrupted halfway, what command should I use to resume? Thanks. Simply re-running the original command fails.

@jane00

jane00 commented Jun 13, 2024

Downloading to Qwen2-7B-Instruct-GGUF Qwen2-7B-Instruct-GGUF exists, Skip Clone. error: cannot pull with rebase: You have unstaged changes. error: please commit or stash them. Git pull failed.

If the download is interrupted halfway, what command should I use to resume? Thanks. Simply re-running the original command fails.

Solved. After entering the directory, run
git add .
git commit
then re-run the download and resuming works again.

@edwardzjl

The copy of this script on the hf-mirror website is not the latest version.

@padeoe
Author

padeoe commented Jun 14, 2024

The copy of this script on the hf-mirror website is not the latest version.

Thanks for the reminder, it has been updated.

@O3smatvkr26

For each file, the download starts out at around 100 MB/s, but the last few tens of MB become extremely slow (<100 KB/s) or the download simply stalls. What could be causing this?
export HF_ENDPOINT="https://hf-mirror.com"
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download --resume-download Qwen/Qwen-VL-Chat --local-dir Qwen#Qwen-VL-Chat --local-dir-use-symlinks False
downloading https://hf-mirror.com/Qwen/Qwen-VL-Chat/resolve/f57cfbd358cb56b710d963669ad1bcfb44cdcdd8/pytorch_model-00001-of-00010.bin to /root/.cache/huggingface/hub/models--Qwen--Qwen-VL-Chat/blobs/d63e4b4238be3897d3b44d0f604422fc07dfceaf971ebde7adadd7be7a2a35bb.incomplete
pytorch_model-00001-of-00010.bin: 99%|██████████████████████████████████████████████████████████████████████████████████████████████████▉ | 1.94G/1.96G [1:27:23<30:13, 11.6kB/s]

It feels like a problem with the mirror site. Even without this tool, downloading from hf-mirror with huggingface-cli also stalls. Lately I often have to download a model twice, or just turn off the hf_transfer feature.

@Alpha-Rome0

Any way to add retry logic if a download fails?

@zodiacg

zodiacg commented Jul 4, 2024

Is it possible to specify multiple exclude or include patterns?

@Au3C2

Au3C2 commented Jul 4, 2024

Do I need to install aria2c in advance, or am I using it incorrectly?

$ ./hfd.sh deepseek-ai/DeepSeek-V2-Chat --tool aria2c -x 4 
aria2c is not installed. Please install it first.                

apt install aria2
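
The package name should be the same on other platforms, e.g. (not verified here):

brew install aria2    # macOS with Homebrew
dnf install aria2     # Fedora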

@yzf072

yzf072 commented Jul 23, 2024

My download was interrupted halfway; when I re-run it, every file gets checked again. There are 1108 files and the check is very slow. Is there any solution?

@xhx1022

xhx1022 commented Jul 24, 2024

Is there a way to make hfd.sh download into the ~/.cache directory?

@threegold116

How can I make it retry automatically when a download is interrupted?

@zhaoxin-web

Could someone tell me what causes this?
Downloading to cinepile
cinepile exists, Skip Clone.
Already up to date.

Start Downloading lfs files, bash script:
cd cinepile
aria2c --header="Authorization: Bearer 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" --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/datasets/tomg-group-umd/cinepile/resolve/main/data/test-00000-of-00001.parquet" -d "data" -o "test-00000-of-00001.parquet"
aria2c --header="Authorization: Bearer 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" --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/datasets/tomg-group-umd/cinepile/resolve/main/data/train-00000-of-00003.parquet" -d "data" -o "train-00000-of-00003.parquet"
aria2c --header="Authorization: Bearer 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" --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/datasets/tomg-group-umd/cinepile/resolve/main/data/train-00001-of-00003.parquet" -d "data" -o "train-00001-of-00003.parquet"
aria2c --header="Authorization: Bearer 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" --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/datasets/tomg-group-umd/cinepile/resolve/main/data/train-00002-of-00003.parquet" -d "data" -o "train-00002-of-00003.parquet"
Start downloading data/test-00000-of-00001.parquet.
[#127d73 0B/0B CN:1 DL:0B]
08/01 11:36:59 [ERROR] CUID#7 - Download aborted. URI=https://hf-mirror.com/datasets/tomg-group-umd/cinepile/resolve/main/data/test-00000-of-00001.parquet
Exception: [AbstractCommand.cc:403] errorCode=18 URI=https://cdn-lfs-us-1.hf-mirror.com/repos/27/86/27864067717b3f938d06d61f89fe8d38e30ab1e533a7f05f541f53d5abb17e44/eef38ba3fbf349b42bafe6dea4af6316bd6ff8e0a3e25701e0678bfdbf2ed274?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27test-00000-of-00001.parquet%3B+filename%3D%22test-00000-of-00001.parquet%22%3B&Expires=1722742619&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcyMjc0MjYxOX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzI3Lzg2LzI3ODY0MDY3NzE3YjNmOTM4ZDA2ZDYxZjg5ZmU4ZDM4ZTMwYWIxZTUzM2E3ZjA1ZjU0MWY1M2Q1YWJiMTdlNDQvZWVmMzhiYTNmYmYzNDliNDJiYWZlNmRlYTRhZjYzMTZiZDZmZjhlMGEzZTI1NzAxZTA2NzhiZmRiZjJlZDI3ND9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSoifV19&Signature=WvxqnAW2VUmMx3Smn4La9k41J-HrmTKMAt1Bd9zSQcSytx6W6-K3jtaXOSmvmNPxsfJyUUOCoWeDTn5TmXa7c2d2-eVXRIzdU3-J0CfNYUl3awBSljWpK2cBaIev7ZuSeOtPbWems1VZ4ZbbGsh0y5UTtdm4cNB9RTjPy7oLbUkAhV5g%7EbE%7EdQ1hSa2a7hvoSN1NVvUJ6GLPk11z11gx4t9w%7EsM7fsJZnyUyGkaZmhIkyGYLC4tdJ9SOmAMPf-ndOnP1woswKUDVpOPohpNd1Tue0%7Eext9nscpfQzxhxjGBNIDmc6AaBfxpErUeJKWNmH433Nc82Hclxt4whgbusQQ__&Key-Pair-Id=K24J24Z295AEI9
-> [RequestGroup.cc:760] errorCode=18 Download aborted.
-> [util.cc:1951] errNum=13 errorCode=18 Failed to make the directory data, cause: Permission denied

Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
127d73|ERR | 0B/s|data/test-00000-of-00001.parquet

Status Legend:
(ERR):error occurred.

aria2 will resume download if the transfer is restarted.
If there are any errors, then see the log file. See '-l' option in help/man page for details.

08/01 11:37:00 [ERROR] CUID#7 - Download aborted. URI=https://hf-mirror.com/datasets/tomg-group-umd/cinepile/resolve/main/data/test-00000-of-00001.parquet
Exception: [AbstractCommand.cc:351] errorCode=24 URI=https://hf-mirror.com/datasets/tomg-group-umd/cinepile/resolve/main/data/test-00000-of-00001.parquet
-> [HttpSkipResponseCommand.cc:215] errorCode=24 Authorization failed.

Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
f7b3d0|ERR | 0B/s|data/test-00000-of-00001.parquet

Status Legend:
(ERR):error occurred.

aria2 will resume download if the transfer is restarted.
If there are any errors, then see the log file. See '-l' option in help/man page for details.
Failed to download https://hf-mirror.com/datasets/tomg-group-umd/cinepile/resolve/main/data/test-00000-of-00001.parquet.

@padeoe
Author

padeoe commented Aug 1, 2024

@zhaoxin-web It's a permission problem; it looks like you don't have write permission in that directory.

@zhaoxin-web

@zhaoxin-web It's a permission problem; it looks like you don't have write permission in that directory.

The previous problem is solved, but now it stops after running for a while. Even though I've switched to a new token, it still reports an error. What could be the reason?

remote: Access to dataset mlfoundations/MINT-1T-PDF-CC-2023-40 is restricted. You must be authenticated to access it.
fatal: Authentication failed for 'https://hf-mirror.com/datasets/mlfoundations/MINT-1T-PDF-CC-2023-40/'
Git pull failed.

@verigle

verigle commented Aug 6, 2024

How can I make it retry automatically when a download is interrupted?

Same question.

Status Legend:
(ERR):error occurred.

aria2 will resume download if the transfer is restarted.
If there are any errors, then see the log file. See '-l' option in help/man page for details.
Failed to download https://hf-mirror.com/internlm/internlm2_5-7b-chat/resolve/main/model-00004-of-00008.safetensors.

@tianbuwei

Is it possible to specify multiple exclude or include patterns?

I have the same issue. Have you solved it?

@padeoe
Author

padeoe commented Aug 7, 2024

@tianbuwei @zodiacg 🎉 Multiple exclude/include patterns are now supported. Usage example:

hfd facebook/opt-125m --tool wget --local-dir facebook/opt-125m --exclude flax_model.msgpack tf_model.h5

@Peng154

Peng154 commented Aug 7, 2024

My download was interrupted halfway; when I re-run it, every file gets checked again. There are 1108 files and the check is very slow. Is there any solution?

I use Python to check whether each downloaded folder contains all of its data. If a folder is complete, I add it to an exclude list; finally I print the whole list and pass it after hfd's --exclude parameter.

@Peng154

Peng154 commented Aug 7, 2024

My download was interrupted halfway; when I re-run it, every file gets checked again. There are 1108 files and the check is very slow. Is there any solution?

I use Python to check whether each downloaded folder contains all of its data. If a folder is complete, I add it to an exclude list; finally I print the whole list and pass it after hfd's --exclude parameter.

from pathlib import Path
from glob import glob
import re

data_dir_path = Path("../data/lotsa_data2")
exclude_dirs = []
include_dirs = []
for sud_dir in glob(str(data_dir_path / "*")):
    if Path(sud_dir).is_dir():
        print(sud_dir)
        # check the number of .arrow files
        is_all_arrow_files_exist = False
        is_all_json_files_exist = False
        arrow_file_count = len(glob(str(Path(sud_dir) / "*.arrow")))
        if arrow_file_count !=0:
            print(glob(str(Path(sud_dir) / "*.arrow"))[0])
            total_arrow_file_count = int(re.match(r".+-([0-9]+).arrow",
                                                  glob(str(Path(sud_dir) / "*.arrow"))[0]
                                                  ).group(1))
        else:
            total_arrow_file_count = -1
        if arrow_file_count == total_arrow_file_count:
            print(f"all arrow files exist")
            is_all_arrow_files_exist = True
        
        # check the number of .json files
        json_file_count = len(glob(str(Path(sud_dir) / "*.json")))
        if json_file_count == 2:
            print(f"all json files exist")
            is_all_json_files_exist = True
        
        if is_all_arrow_files_exist and is_all_json_files_exist:
            exclude_dirs.append(str(Path(sud_dir).name) + "/*")
            print(f"exclude {sud_dir}")
        else:
            include_dirs.append(str(Path(sud_dir).name) + "/*")
            print(f"include {sud_dir}")

print(" ".join(exclude_dirs))

You can refer to my approach.

@zhaoxin-web

I ran into this problem halfway through downloading, and re-running still gives the error. Is there a fix?
remote: Access to dataset mlfoundations/MINT-1T-PDF-CC-2023-06 is restricted. You must be authenticated to access it.
fatal: Authentication failed for 'https://hf-mirror.com/datasets/mlfoundations/MINT-1T-PDF-CC-2023-06/'
Git pull failed.

@Peng154

Peng154 commented Aug 8, 2024

My download was interrupted halfway; when I re-run it, every file gets checked again. There are 1108 files and the check is very slow. Is there any solution?

I use Python to check whether each downloaded folder contains all of its data. If a folder is complete, I add it to an exclude list; finally I print the whole list and pass it after hfd's --exclude parameter.

[Python snippet quoted from the previous comment]

You can refer to my approach.

I later found this doesn't quite work... In the end I went back to huggingface_hub's snapshot_download function; code below:

from huggingface_hub import hf_hub_download, snapshot_download
from huggingface_hub import constants
constants._HF_DEFAULT_ENDPOINT = "https://hf-mirror.com"  # use mirror for faster download
try:
    import hf_transfer
    constants.HF_HUB_ENABLE_HF_TRANSFER = True  # enable transfer from Hugging Face
except ImportError:
    constants.HF_HUB_ENABLE_HF_TRANSFER = False
    
# Download a single file
# hf_hub_download(repo_id=f"Salesforce/moirai-1.0-R-{SIZE}", local_dir=f"../pretrained_models/moirai-1.0-R-{SIZE}")
# snapshot_download(repo_id=f"Salesforce/moirai-1.0-R-{SIZE}",
#                   local_dir=f"../pretrained_models/moirai-1.0-R-{SIZE}")

# Download the entire LOTSA dataset
while True:
    try:
        snapshot_download(repo_id="Salesforce/lotsa_data",
                        local_dir="../data/lotsa_data2",
                        repo_type="dataset",
                        max_workers=4)
        break
    except Exception as e:
        print(e)
        print("retrying...")
        continue

@DavinciEvans

I'd like to ask: if I want to download the model files into the Hugging Face cache, i.e. under ./cache/huggingface, how should I do it? From the code it looks like the weights are downloaded directly into the current directory.

@i-square

@tianbuwei @zodiacg 🎉 Multiple exclude/include patterns are now supported. Usage example:

hfd facebook/opt-125m --tool wget --local-dir facebook/opt-125m --exclude flax_model.msgpack tf_model.h5

Hi, the wildcard exclude syntax from the script's examples, --exclude *.safetensors, never seems to work for me; it fails to match any files, and only full filenames work.

@T-Atlas

T-Atlas commented Aug 16, 2024

@tianbuwei @zodiacg 🎉 Multiple exclude/include patterns are now supported. Usage example:

hfd facebook/opt-125m --tool wget --local-dir facebook/opt-125m --exclude flax_model.msgpack tf_model.h5

Hi, the wildcard exclude syntax from the script's examples, --exclude *.safetensors, never seems to work for me; it fails to match any files, and only full filenames work.

Same problem here.
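
One possible cause, offered here only as a guess: an unquoted *.safetensors can be expanded by your shell before hfd ever sees it if matching files exist in the current directory. Quoting the pattern rules that out:

hfd bigscience/bloom-560m --exclude '*.safetensors'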

@achristianson

Is there a way to specify the tag/branch/revision? Many repos store things like different quant levels as different branches in the repo. An example with huggingface-cli would be:

huggingface-cli download ${MODEL_ID} --revision ${MODEL_REVISION}

@haukzero

I used to be able to download normally with hfd. Why do I get this error now? I tried re-downloading hfd, but that doesn't seem to help either.

Downloading to gpt2
Testing GIT_REFS_URL: https://hf-mirror.com/gpt2/info/refs?service=git-upload-pack
Unexpected HTTP Status Code: 000
Executing debug command: curl -v https://hf-mirror.com/gpt2/info/refs?service=git-upload-pack
Output:
* Host hf-mirror.com:443 was resolved.
* IPv6: (none)
* IPv4: 153.121.57.40, 160.16.199.204, 133.242.169.68
*   Trying 153.121.57.40:443...
* Connected to hf-mirror.com (153.121.57.40) port 443
* schannel: disabled automatic use of client certificate
* using HTTP/1.x
> GET /gpt2/info/refs?service=git-upload-pack HTTP/1.1
> Host: hf-mirror.com
> User-Agent: curl/8.8.0
> Accept: */*
>
* Request completely sent off
* schannel: remote party requests renegotiation
* schannel: renegotiating SSL/TLS connection
* schannel: SSL/TLS connection renegotiated
< HTTP/1.1 200 OK
< Access-Control-Allow-Origin: https://hf-mirror.com
< Access-Control-Expose-Headers: X-Repo-Commit,X-Request-Id,X-Error-Code,X-Error-Message,X-Total-Count,ETag,Link,Accept-Ranges,Content-Range
< Alt-Svc: h3=":443"; ma=2592000
< Content-Type: application/x-git-upload-pack-advertisement
< Cross-Origin-Opener-Policy: same-origin
< Date: Thu, 12 Sep 2024 05:17:56 GMT
< Referrer-Policy: strict-origin-when-cross-origin
< Server: hf-mirror
< Vary: Origin
< Via: 1.1 746d9b263e5f72ff5dc6d5120e20f00e.cloudfront.net (CloudFront)
< X-Amz-Cf-Id: 5P8WslKSnkIpXLeQt4eXOpfG2c4uSa2-VsKpVcCoAv_otrQ4PtHvHg==
< X-Amz-Cf-Pop: NRT51-P2
< X-Cache: Miss from cloudfront
< X-Powered-By: huggingface-moon
< X-Request-Id: Root=1-66e27984-366b3db27298ceb702252a26
< Transfer-Encoding: chunked
<
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.
* client returned ERROR on write of 3561 bytes
* Failed reading the chunked-encoded stream
* Closing connection
* schannel: shutting down SSL/TLS connection with hf-mirror.com port 443

Git clone failed.

@zhang-ziang

Would it be possible to temporarily skip files that fail to download when fetching a dataset? I'm downloading a dataset with many files, and a single file failing to download seems to abort the whole process.
