@sparticlesteve
Created May 4, 2020 18:39
Script for splitting the climate benchmark dataset (until we have a better procedure)
#!/bin/bash
# Config
inDir=/global/cscratch1/sd/amahesh/gb_data/All-Hist
nTrain=65536
nValid=8192
nTest=8192
nTotal=$((nTrain + nValid + nTest))
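# Shuffle and select all files we need
find "$inDir" -type f -name "data-*.h5" | sort -R | head -n "$nTotal" > all_files.txt
# Abort if too few files were found; otherwise the splits below would overlap
if [ "$(wc -l < all_files.txt)" -lt "$nTotal" ]; then
    echo "Error: found fewer than $nTotal files in $inDir" >&2
    exit 1
fi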
# Split into train/val/test
head -n $nTrain all_files.txt > train_files.txt
head -n $((nTrain+nValid)) all_files.txt | tail -n $nValid > valid_files.txt
tail -n $nTest all_files.txt > test_files.txt
# Copy the files into this directory
mkdir -p train validation test
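# Read the paths from the list files: expanding tens of thousands of paths on
# the command line can exceed the system's argument-length limit.
# --no-relative drops the source directory structure, so the files land
# directly in each split directory.
rsync -vau --no-relative --files-from=train_files.txt / train/
rsync -vau --no-relative --files-from=valid_files.txt / validation/
rsync -vau --no-relative --files-from=test_files.txt / test/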
# Install stats file
cp "$inDir/stats.h5" train/
cp "$inDir/stats.h5" validation/
cp "$inDir/stats.h5" test/
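
The head/tail arithmetic above gives disjoint splits only when all_files.txt holds exactly nTotal lines. If find returns fewer files, the tail used for the test set reaches back into the validation entries. A minimal sketch of a sanity check follows, assuming the three list files are still in the working directory. The helper name count_overlaps is hypothetical and not part of the original script.

```bash
#!/bin/bash
# Hypothetical helper (not in the original script): count the paths that
# appear more than once across the given list files. This assumes each list
# has no internal duplicates, which holds for the output of find.
count_overlaps() {
    # Concatenate the lists, sort them, and keep one line per repeated path.
    cat "$@" | sort | uniq -d | wc -l
}

# Run the check only when the list files written by the split script exist.
if [ -f train_files.txt ] && [ -f valid_files.txt ] && [ -f test_files.txt ]; then
    nDup=$(count_overlaps train_files.txt valid_files.txt test_files.txt)
    echo "Paths shared between splits: $nDup"
    # Print the size of each split for comparison with nTrain/nValid/nTest.
    wc -l train_files.txt valid_files.txt test_files.txt
fi
```

A count of zero means that no file was assigned to more than one split.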