- A simple note on how to start multi-node training on a Slurm scheduler with PyTorch; a minimal job-script sketch follows this list.
- Especially useful when the scheduler is so busy that you cannot get multiple GPUs allocated on a single node, or when you need more than 4 GPUs for a single job.
- Requirement: you have to use PyTorch DistributedDataParallel (DDP) for this purpose.
- Warning: you might need to refactor your own code.
- Warning: you might be secretly condemned by your colleagues for using too many GPUs.
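
A minimal sketch of what the submission script can look like. The node and GPU counts, the port, and the script name `train.py` are illustrative assumptions, not fixed by this note:

```bash
#!/bin/bash
#SBATCH --job-name=ddp-train
#SBATCH --nodes=2               # more than one node: multi-node training
#SBATCH --ntasks-per-node=4     # one task (process) per GPU
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=4

# Use the first node in the allocation as the rendezvous host for torch.distributed.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=12345

# srun launches nodes * ntasks-per-node copies of the script, one per GPU.
srun python train.py
```

Each copy can then read its global and local rank from Slurm's environment variables (`SLURM_PROCID`, `SLURM_LOCALID`), which is what the Python sketch further down relies on.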
The training script starts from the usual DDP imports; the example model is `BertForMaskedLM` from Hugging Face `transformers`:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from argparse import ArgumentParser

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler
from transformers import BertForMaskedLM
```
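
Only the imports survive above; the body of the script is not shown. Here is a minimal sketch, under stated assumptions, of the DDP setup those imports imply: one process per GPU launched by `srun`, `MASTER_ADDR`/`MASTER_PORT` exported by the job script, and a recent `transformers` version whose model output exposes `.loss`. The toy dataset and hyperparameters are illustrative.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler
from transformers import BertForMaskedLM


class ToyMLMDataset(Dataset):
    """Hypothetical stand-in: random token ids, used as their own labels."""

    def __init__(self, n=256, seq_len=32, vocab_size=30522):
        self.ids = torch.randint(0, vocab_size, (n, seq_len))

    def __len__(self):
        return self.ids.size(0)

    def __getitem__(self, i):
        return self.ids[i]


def main():
    # srun starts one task per GPU and sets these variables for each task.
    rank = int(os.environ['SLURM_PROCID'])        # global rank across all nodes
    world_size = int(os.environ['SLURM_NTASKS'])  # total number of processes
    local_rank = int(os.environ['SLURM_LOCALID']) # rank within this node

    # Default env:// rendezvous reads MASTER_ADDR / MASTER_PORT from the environment.
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    model = BertForMaskedLM.from_pretrained('bert-base-uncased').cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    dataset = ToyMLMDataset()
    # DistributedSampler gives each rank a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for batch in loader:
            batch = batch.cuda(local_rank)
            loss = model(input_ids=batch, labels=batch).loss
            optimizer.zero_grad()
            loss.backward()   # DDP all-reduces gradients across ranks here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == '__main__':
    main()
```

Passing `device_ids=[local_rank]` pins each process to a single GPU, which is the standard one-process-per-GPU layout for multi-node DDP.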
Bonus: a small helper that converts the conda `environment.yml` into a pip `requirements.txt`. The original snippet is cut off after the build-string check; the rest of the loop and the file write below are completed under the assumption that conda dependencies are `name=version=build` pins:

```python
import ruamel.yaml

yaml = ruamel.yaml.YAML()
data = yaml.load(open('environment.yml'))
requirements = []
for dep in data['dependencies']:
    if isinstance(dep, str):
        # conda pins look like "name=version=build"
        package, package_version, python_version = dep.split('=')
        if python_version == '0':
            continue
        requirements.append(package + '==' + package_version)
    elif isinstance(dep, dict):
        requirements.extend(dep.get('pip', []))  # nested "pip:" entries pass through
with open('requirements.txt', 'w') as fp:
    fp.write('\n'.join(requirements) + '\n')
```
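
For example, a pin like `numpy=1.19.2=py38h6367917_0` becomes `numpy==1.19.2`, while entries under a nested `pip:` key are copied through unchanged. Note that the three-way split assumes every conda dependency carries an explicit build string; a plain `name=version` pin would raise a `ValueError` and would need a more tolerant split.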