@kikairoya
Created January 31, 2022 12:38
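# Lines per chunk; each chunk is shuffled on its own by shuf.
BLOCKSIZE=100
# Remove chunks left over from a previous run (-f: no error if none exist).
rm -f x*.gz shuf-x*.gz
# Split the decompressed source into gzip-compressed chunks (x00.gz, x01.gz, ...).
bzcat source.txt.bz2 | split -l "$BLOCKSIZE" -d --filter='gzip --fast > "${FILE}.gz"'
# Shuffle the lines within each chunk.
for f in x*.gz; do zcat "$f" | shuf | gzip --fast > "shuf-$f"; done
# Interleave: pick a random chunk, emit its next line, drop the chunk once exhausted.
echo shuf-*.gz | gawk -v seed="${RANDOM}" 'BEGIN{srand(seed);} {while(NF) { i=int(rand()*NF)+1; cmd=sprintf("zcat %s", $i); if ((cmd | getline l) > 0) { print l; } else { close(cmd); $i=""; $0=$0; } } }' | gzip > shuf.gz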