
Build Conda Env:

conda create -n mpi -c conda-forge -c nvidia mpich gcc=11 gxx=11 make automake ipython cuda-toolkit cuda-version=12.3 nccl cuda-cudart-static --yes
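After the environment solves, it is worth activating it and confirming that the compilers and CUDA toolkit it provides are the ones on PATH. A quick optional sanity check (expect GCC 11 and CUDA 12.3 if the pins above were honored):

conda activate mpi
which mpicc nvcc          # both should resolve inside $CONDA_PREFIX/bin
mpicc --version           # reports the underlying GCC (11.x)
nvcc --version            # reports the CUDA toolkit (12.3)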

Build NCCL-Tests

git clone https://github.com/NVIDIA/nccl-tests.git

cd nccl-tests
CUDA_HOME=$CONDA_PREFIX NCCL_HOME=$CONDA_PREFIX MPI_HOME=$CONDA_PREFIX MPI=1 make

Test with 2 GPUs

mpirun --hostfile hosts -np 2 ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
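Here -b 8 -e 8G -f 2 sweeps message sizes from 8 B to 8 GB, doubling each step, and -g 1 uses one GPU per rank, so -np 2 exercises two GPUs in total. The hosts file itself is not shown; assuming the MPICH/Hydra launcher from the env above, a minimal hostfile is just one hostname per line (the names below are placeholders):

node01
node02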

============================= test session starts ==============================
platform darwin -- Python 3.11.9, pytest-8.2.0, pluggy-1.5.0 -- /Users/bzaitlen/miniforge3/envs/test-environment/bin/python3.11
cachedir: .pytest_cache
rootdir: /Users/bzaitlen/Documents/GitHub/dask
configfile: pyproject.toml
plugins: cov-5.0.0, rerunfailures-14.0, xdist-3.5.0, timeout-2.3.1
timeout: 300.0s
timeout method: thread
timeout func_only: False
collecting ... collected 486 items
#!/bin/bash
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=1
#SBATCH --account=dasrepo_g
#SBATCH --constraint=gpu
#SBATCH --gpus-per-node=4
#SBATCH --qos=early_science
#SBATCH --time 00:09:00
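The job step is not included in this snippet. A minimal sketch of what could follow the directives, assuming the script drives the all_reduce_perf binary built above with one rank per node and all four GPUs per rank (16 nodes, 64 GPUs):

# hypothetical job step: inherits the allocation above (16 tasks, 1 per node)
srun ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 4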
-------------------------------
backend | dask
merge type | gpu
rows-per-chunk | 50000000
base-chunks | 4
other-chunks | 4
broadcast | default
protocol | ucx
device(s) | 0
rmm-pool | False
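This header matches the output of the dask-cuda cuDF merge benchmark. An invocation along these lines would produce it; the exact flags are an assumption, so check python -m dask_cuda.benchmarks.local_cudf_merge --help:

# hypothetical run: one GPU (device 0), UCX protocol, 50M rows per chunk
python -m dask_cuda.benchmarks.local_cudf_merge -d 0 -p ucx -c 50000000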
import cupy
import numpy as np
import xarray as xr
import dask.array as da
from dask.array import stats
import fsspec
n = 10000 # Number of variants (i.e. genomic locations)
m = 100000 # Number of individuals (i.e. people)
c = 3 # Number of covariates (i.e. confounders)
benchmark: 48 tests (one row shown; times in us)

test_Array_Slicing[shape0-cupy]
  Min       23.8450 (1.0)
  Max       68.3790 (2.03)
  Mean      34.2448 (1.19)
  StdDev    19.1792 (8.08)
  Median    25.4280 (1.0)
  IQR       14.3270 (5.18)
  Outliers  1;1
  OPS       29,201.5031 (0.84)
  Rounds    5
## Start Dataproc cluster
REGION="us-east1"
CLUSTER_NAME="dask-rapids-test"
NUM_GPUS=2
NUM_WORKERS=2
gcloud dataproc clusters create $CLUSTER_NAME \
--region $REGION \
--image-version=2.0.0-RC22-ubuntu18 \
--master-machine-type n1-standard-16 \
--num-workers $NUM_WORKERS \
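The command is cut off here. For a RAPIDS-capable cluster the remaining flags would typically attach GPUs to the workers and run an initialization action; the sketch below is an assumption (accelerator type, init-action path, and metadata are placeholders, not taken from the original):

  --worker-machine-type n1-standard-16 \
  --worker-accelerator type=nvidia-tesla-t4,count=$NUM_GPUS \
  --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/rapids/rapids.sh \
  --metadata rapids-runtime=DASK \
  --enable-component-gateway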
diff --git a/src/ucp/core/ucp_types.h b/src/ucp/core/ucp_types.h
index 458317530..e2047d339 100644
--- a/src/ucp/core/ucp_types.h
+++ b/src/ucp/core/ucp_types.h
@@ -38,7 +38,7 @@ typedef uint8_t ucp_lane_map_t;
/* Worker configuration index for endpoint and rkey */
typedef uint8_t ucp_worker_cfg_index_t;
-#define UCP_WORKER_MAX_EP_CONFIG 16
+#define UCP_WORKER_MAX_EP_CONFIG 64
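This raises UCX's per-worker endpoint-configuration limit from 16 to 64. The change only takes effect after rebuilding UCX from source; a typical rebuild, with the prefix and CUDA paths below as placeholders:

# rebuild UCX with the patched header (paths are placeholders)
./autogen.sh
./contrib/configure-release --prefix=$CONDA_PREFIX --with-cuda=$CONDA_PREFIX --enable-mt
make -j && make install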