Rocky Linux 9 server instructions for enthusiastic people who wish to run or train the LLaMA 3.1 models using HuggingFace.

Rocky Linux 9 | Chatbot Edition (Llama 3.1)

The following was tested on Google GCP using a g2-standard-48 instance and the Rocky Linux 9 image (GCP optimized, x86_64).
It has 192 GB RAM, 48 vCPU cores, and 4 NVIDIA L4 24GB GPUs attached.
A 384 GB SSD disk or larger is recommended, depending on the workload.

NOTICE: Make sure you have a positive bank balance before trying.

Update the system:

sudo dnf update -y

Install my favorite editor:

sudo dnf install -y nano

Install some basic development tools:

sudo dnf groupinstall -y "Development Tools"
sudo dnf install -y python3-pip

NVIDIA Drivers

Next you need to install drivers for your GPU. I am of course using the NVIDIA L4 GPUs here, but this should work for almost any recent NVIDIA datacenter GPU.

Enable the CRB repository, add the EL9-compatible EPEL repositories (Fedora), and add the NVIDIA CUDA repository:

sudo dnf config-manager --set-enabled crb
sudo dnf install -y \
  https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm \
  https://dl.fedoraproject.org/pub/epel/epel-next-release-latest-9.noarch.rpm
sudo dnf config-manager --add-repo \
  http://developer.download.nvidia.com/compute/cuda/repos/rhel9/$(uname -i)/cuda-rhel9.repo
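Optionally, verify the new repositories are enabled (a quick sanity check; the grep pattern is just an example):

dnf repolist enabled | grep -Ei 'epel|cuda'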

Install some monitoring and execution tools:

sudo dnf install -y htop tmux

Install driver dependencies:

sudo dnf install -y \
  kernel-headers-$(uname -r) kernel-devel-$(uname -r) \
  tar bzip2 make automake gcc gcc-c++ \
  pciutils elfutils-libelf-devel libglvnd-opengl libglvnd-glx libglvnd-devel acpid pkgconfig dkms

Install NVIDIA GPU driver:

sudo dnf module install -y nvidia-driver:latest-dkms

Now it's a good time to reboot the system:

sudo reboot

Check the driver installation worked:

nvidia-smi

You should see something like this a second later:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   62C    P8             14W /   72W |       1MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L4                      Off |   00000000:00:04.0 Off |                    0 |
| N/A   61C    P8             15W /   72W |       1MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L4                      Off |   00000000:00:05.0 Off |                    0 |
| N/A   54C    P8             13W /   72W |       1MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L4                      Off |   00000000:00:06.0 Off |                    0 |
| N/A   59C    P8             14W /   72W |       1MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
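If you prefer a compact, scriptable check, nvidia-smi also has a CSV query mode, for example:

nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv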

Install Llama 3.1 Using HuggingFace CLI Tool

Install the CLI tool using pip:

pip install -U "huggingface_hub[cli]"

Set your HuggingFace access token as an environment variable (replace <your_access_token> with your token):

mkdir -p ~/.bashrc.d \
  && echo 'export HUGGINGFACE_TOKEN=<your_access_token>' >> ~/.bashrc.d/hf \
  && source ~/.bashrc.d/hf

Configure git credential storage:

git config --global credential.helper store

Log in with the CLI tool:

huggingface-cli login --token $HUGGINGFACE_TOKEN --add-to-git-credential

You should see something similar after login:

Token is valid (permission: fineGrained).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /home/user/.cache/huggingface/token
Login successful
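You can also confirm which account the token belongs to:

huggingface-cli whoami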

Llama 3.1 Python Project Creation

Create new project directory:

mkdir llama31_playground && cd llama31_playground

Create requirements.txt; its contents are included at the end of this gist.

nano requirements.txt

Create a Python virtual environment and install the dependencies:

python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
deactivate

Create hello.py; its contents are included at the end of this gist.

nano hello.py

Run a 'hello world' program:

tmux
source env/bin/activate
python hello.py
exit
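The first run of hello.py downloads the model weights from HuggingFace into ~/.cache/huggingface. If you prefer to pre-fetch (or resume) the download, the CLI tool can do it; note the 70B weights are roughly 140 GB in bf16, so make sure the disk is big enough:

huggingface-cli download meta-llama/Meta-Llama-3.1-70B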

Llama-based Vicuna-13b Model

Install Git large file support (Git LFS):

sudo dnf install -y git-lfs
git lfs install

Clone LLaMA-13b model weights:

git clone https://huggingface.co/huggyllama/llama-13b

Create Vicuna-13b weights output directory:

mkdir vicuna-13b

FastChat (Vicuna 13B)

Clone FastChat repository:

git clone https://github.com/lm-sys/FastChat.git && cd FastChat

Upgrade pip (to enable PEP 660 support):

pip3 install --upgrade pip

Install the FastChat package (editable mode) and its dependencies:

pip3 install -e .

Apply delta weights (will download repository):

python3 -m fastchat.model.apply_delta \
  --base-model-path ../llama-13b \
  --target-model-path ../vicuna-13b \
  --delta-path lmsys/vicuna-13b-delta-v1.1

Confirm weights output:

ls -alh ../vicuna-13b/

Run CLI prompt (single GPU):

python3 -m fastchat.serve.cli --model-path ../vicuna-13b
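Note that the Vicuna-13b weights in fp16 may not fit on a single 24 GB L4. The FastChat README documents a --num-gpus option for splitting a model across GPUs (verify against your installed version), for example:

python3 -m fastchat.serve.cli --model-path ../vicuna-13b --num-gpus 2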

Run web interface

Install tmux so you can easily run multiple processes (it was already installed earlier, so skip this if you followed that step):

sudo dnf install -y tmux

Quick tmux Tutorial

To run tmux just type tmux in the shell.
The first window is created automatically.
To create another window press ctrl + b then c.
To switch windows press ctrl + b then w and pick the window with the arrow keys.
To detach press ctrl + b then d.
To reattach the latest session type tmux attach (or tmux at for short) in the shell.
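For example, to keep the FastChat servers in a named session and reattach to it by name later (the session name fastchat is just an example):

tmux new -s fastchat
tmux attach -t fastchat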

Starting controller, worker(s), web interface

Run each of the servers in a different tmux window so you can switch between them and leave them running after you log out or disconnect.

Start the controller server:

python3 -m fastchat.serve.controller

Start the worker server (you can run multiple workers with different models):

python3 -m fastchat.serve.model_worker --model-path ../vicuna-13b/
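To run an additional worker (for example with a different model), the FastChat README gives each worker its own port and worker address, roughly like this (the port below is an example; verify the flags against your installed version):

python3 -m fastchat.serve.model_worker --model-path ../vicuna-13b/ --port 31001 --worker-address http://localhost:31001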

Add the default web interface HTTP port (7860) to the firewall:

sudo firewall-cmd --add-port=7860/tcp
sudo firewall-cmd --add-port=7860/tcp --permanent

If you're using Google GCP you will probably also need to allow ingress traffic to port 7860 in your VPC firewall rules!
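A minimal sketch with the gcloud CLI (the rule name is just an example; 0.0.0.0/0 opens the port to the whole internet, so restrict the source range if you can):

gcloud compute firewall-rules create allow-gradio-7860 --allow=tcp:7860 --source-ranges=0.0.0.0/0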

Start the GUI web interface:

python3 -m fastchat.serve.gradio_web_server
hello.py

import transformers
import torch

model_id = "meta-llama/Meta-Llama-3.1-70B"

# Transformer input sequences
inputs = [
    "Hello bot! How are you doing today?",
    # "What is the date today, and how is the weather in New York?",
    # "What is the answer to the ultimate question of life, the universe, and everything?"
]

# Load the tokenizer separately, needed for batching
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

# Set the pad token to be the 'End of Sequence' token.
# Used for: sequence termination, generation control,
# input formatting, length normalization, and multi-sequence handling.
tokenizer.pad_token = tokenizer.eos_token

# This sets the padding to be applied on the left side of the input sequences,
# which is the correct approach for decoder-only models like Llama.
# Left-padding is important for decoder-only models because:
#
# These models are trained to generate text from left to right.
# With left-padding, the actual input text always appears at the end of the padded sequence,
# which aligns better with how the model was trained.
# It ensures that the model's attention mechanism focuses on the
# relevant parts of the input when generating new tokens.
tokenizer.padding_side = 'left'

# Create transformer pipeline
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# do_sample=False (default): uses greedy decoding
# do_sample=True: enables sampling
# When do_sample=True, you can use additional parameters like:
#
# temperature: controls randomness (higher values increase randomness)
# top_k: limits selection to the k most probable tokens
# top_p: uses nucleus sampling to dynamically select top tokens
outputs = pipeline(
    inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    batch_size=2,  # process 2 inputs at a time
)

for input_text, output in zip(inputs, outputs):
    print(f"Input: {input_text}")
    print(f"Output: {output[0]['generated_text']}")
    print()
requirements.txt

accelerate==0.33.0
certifi==2024.7.4
charset-normalizer==3.3.2
colorama==0.4.6
filelock==3.15.4
fsspec==2024.6.1
huggingface-hub==0.24.5
idna==3.7
Jinja2==3.1.4
MarkupSafe==2.1.5
mpmath==1.3.0
networkx==3.2.1
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.20
nvidia-nvtx-cu12==12.1.105
packaging==24.1
psutil==6.0.0
PyYAML==6.0.1
regex==2024.7.24
requests==2.32.3
safetensors==0.4.3
sympy==1.13.1
tokenizers==0.19.1
torch==2.4.0
tqdm==4.66.4
transformers==4.43.3
triton==3.0.0
typing_extensions==4.12.2
urllib3==2.2.2