@mkweskin
Created October 24, 2022 17:57
Randomize columns in a CSV

Method to randomize column order in a CSV using Bash

I received a question about how to randomize column order in a text file. I came up with a method that uses common Unix command-line tools (sort, sed, tr, join) and the Bash shell. It has been tested with the BSD command-line tools on macOS 12 (running zsh) and the GNU command-line tools on CentOS (running bash).

NOTE: This method does not handle quoted fields that contain commas. Every comma in the file is treated as a delimiter, so the only commas should be the delimiters.
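
Because of that limitation, it may be worth checking the input for quoted fields before running the method. The sketch below is one way to do that; `has_quotes` is a hypothetical helper name, and it flags any double quote in the file, which is stricter than strictly necessary.

```shell
# Hypothetical helper: succeed (exit status 0) if the file contains any double
# quote, which suggests quoted fields that this method cannot handle.
has_quotes() {
  grep -q '"' "$1"
}

# Example use before running the randomization:
# if has_quotes input.csv; then
#   echo "input.csv appears to contain quoted fields" >&2
# fi
```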

Randomize all columns

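NUM_COLS=10               # number of columns in the input file
NUM_RANDOMIZED_OUTPUT=5   # number of shuffled copies to write
INPUT_FILE=input.csv
OUTPUT_PREFIX=output

# One column number per line, ready to be shuffled.
seq 1 "${NUM_COLS}" > cols

# Build one comma-separated field list per output file, in the form that
# join -o expects, e.g. 1.3,1.8,1.1 (file 1 field 3, file 1 field 8, ...).
for x in $(seq 1 "${NUM_RANDOMIZED_OUTPUT}"); do
  sort --random-sort cols | sed -r 's/^/1./' | tr "\n" , | sed -r 's/,$//'
  echo
done > randomized_col_order

# Join the input with itself and print its fields in each shuffled order.
for y in $(seq 1 "${NUM_RANDOMIZED_OUTPUT}"); do
  current_order=$(sed -n "${y}p" randomized_col_order)
  join -t, -o "${current_order}" "${INPUT_FILE}" "${INPUT_FILE}" > "${OUTPUT_PREFIX}_${y}"
done

# Remove the temporary files.
rm cols randomized_col_order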
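
The key step is the self-join: join pairs each row with itself by matching on the first field, and -o then prints the fields of file 1 in whatever order is listed. A minimal sketch of that step on its own is shown here; `reorder_cols` is a hypothetical helper name, and the demo file is made up for illustration. Note that join expects its input sorted on the join field, and duplicate values in that field would produce extra rows, so this assumes the first column is sorted and unique.

```shell
# Hypothetical helper: print FILE with its columns in the order given by ORDER,
# a join -o field list such as 1.3,1.1,1.2.
reorder_cols() {
  order=$1
  file=$2
  # Joining the file with itself pairs each row with itself (matching on the
  # first field); -o then selects fields of file 1 in the requested order.
  join -t, -o "$order" "$file" "$file"
}

# Demo on a two-row file whose first field is sorted and unique.
demo=$(mktemp)
printf 'a,b,c\nd,e,f\n' > "$demo"
reorder_cols 1.3,1.1,1.2 "$demo"
# prints:
# c,a,b
# f,d,e
rm -f "$demo"
```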

Example output

input.csv:

1,2,3,4,5,6,7,8,9,10

Output files:

ls output_*
output_1
output_2
output_3
output_4
output_5

cat output_*
3,8,1,9,2,4,6,5,10,7
3,1,7,4,10,9,6,2,8,5
6,1,10,8,7,3,5,2,4,9
5,3,2,4,7,6,1,9,8,10
6,4,3,2,7,9,8,1,5,10
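
To confirm that an output file holds the same values as the input and differs only in column order, the fields of each line can be sorted and compared. This is a sketch of such a check, not part of the original method; `sorted_fields` and `is_permutation` are hypothetical helper names.

```shell
# Hypothetical helper: print each line of a file with its comma-separated
# fields sorted, so that lines differing only in column order become equal.
sorted_fields() {
  while IFS= read -r line; do
    printf '%s\n' "$line" | tr ',' '\n' | sort | paste -sd, -
  done < "$1"
}

# Hypothetical helper: succeed if every line of the two files contains the
# same values, ignoring their order within the line.
is_permutation() {
  [ "$(sorted_fields "$1")" = "$(sorted_fields "$2")" ]
}

# Example use:
# is_permutation input.csv output_1 && echo "output_1 is a column shuffle"
```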

Randomize columns, but keep the first column in place

This version leaves the first column in place, which is useful when the first column holds an identifier. Two changes to the code are needed: seq 1 ${NUM_COLS} > cols becomes seq 2 ${NUM_COLS} > cols, and echo -n "1.1," is added so that each line of randomized_col_order begins with the fixed first field.

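NUM_COLS=10               # number of columns in the input file
NUM_RANDOMIZED_OUTPUT=5   # number of shuffled copies to write
INPUT_FILE=input.csv
OUTPUT_PREFIX=output

# Column numbers 2..NUM_COLS, one per line; column 1 is left out of the shuffle.
seq 2 "${NUM_COLS}" > cols

# Build one field list per output file, always starting with field 1.1.
for x in $(seq 1 "${NUM_RANDOMIZED_OUTPUT}"); do
  echo -n "1.1,"
  sort --random-sort cols | sed -r 's/^/1./' | tr "\n" , | sed -r 's/,$//'
  echo
done > randomized_col_order

# Join the input with itself and print its fields in each shuffled order.
for y in $(seq 1 "${NUM_RANDOMIZED_OUTPUT}"); do
  current_order=$(sed -n "${y}p" randomized_col_order)
  join -t, -o "${current_order}" "${INPUT_FILE}" "${INPUT_FILE}" > "${OUTPUT_PREFIX}_${y}"
done

# Remove the temporary files.
rm cols randomized_col_order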
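
The two variants differ only in where the shuffled range starts, so they could be folded into a single helper that keeps any number of leading columns fixed. The following is a sketch under that assumption; `make_order` and its `keep` argument are hypothetical names that do not appear in the scripts above, and paste is used here in place of the tr and sed pair to join the lines with commas.

```shell
# Hypothetical helper: print a join -o field list for NUM_COLS columns in which
# the first KEEP columns stay in place and the remaining ones are shuffled.
make_order() {
  keep=$1
  num_cols=$2
  {
    # Leading columns that stay in place, in their original order.
    if [ "$keep" -gt 0 ]; then
      seq 1 "$keep"
    fi
    # Remaining columns, shuffled. The guard avoids calling seq with a start
    # above its end, which some seq implementations treat as a countdown.
    if [ "$keep" -lt "$num_cols" ]; then
      seq "$((keep + 1))" "$num_cols" | sort --random-sort
    fi
  } | sed 's/^/1./' | paste -sd, -
}

# Example use (keep the first column, shuffle the other nine):
# join -t, -o "$(make_order 1 10)" input.csv input.csv > output_1
```

With `keep` set to 0 this should behave like the first script, and with `keep` set to 1 like the second.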

Example

cat output_*
1,3,4,6,2,5,10,9,7,8
1,5,7,10,8,4,6,2,9,3
1,2,4,7,3,5,10,9,8,6
1,6,2,3,7,4,5,8,9,10
1,6,3,7,9,4,5,2,8,10