A bigram is a pair of adjacent letters. Keyboard layout designers commonly analyze bigram frequencies to optimize a layout for comfort and speed. Here's a quick-and-dirty bigram analysis using a few shell commands.
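To make the definition concrete, here is a minimal sketch that lists the bigrams of a single word. The word `hello` and the variable name `pairs` are only illustrative choices.

```shell
# Slide a 2-character window across the word and print each pair.
pairs=$(echo hello | awk '{ for (i = 1; i < length($0); i++) print substr($0, i, 2) }')
printf '%s\n' "$pairs"
```

A word of n characters yields n - 1 bigrams, so `hello` produces four.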
Download Shai's corpus for Colemak:
cd ~/Downloads
curl -fsSLo corpus.txt.xz https://colemak.com/pub/corpus/iweb-corpus-samples-cleaned.txt.xz
Extract the .txt file:
unxz corpus.txt.xz
Strip non-printable characters (under LC_ALL=C this also drops non-ASCII bytes), then split the text into one word per line:
LC_ALL=C sed 's/[^[:print:][:space:]]//g' corpus.txt | tr '[:space:]' '\n' > corpus_word_list.txt
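To see what this step produces, the same pipeline can be tried on a small sample first. This is only a sketch: the sample text and the variable name `words` are made up for illustration, and `tr -s` is a variation on the command above that squeezes runs of whitespace so no blank lines are emitted.

```shell
# Hypothetical sample text standing in for corpus.txt.
# tr -s turns each run of whitespace into a single newline.
words=$(printf 'The quick  brown\tfox.\nThe end\n' \
    | LC_ALL=C sed 's/[^[:print:][:space:]]//g' \
    | tr -s '[:space:]' '\n')
printf '%s\n' "$words"
```

Punctuation stays attached to its word (`fox.`), so pairs such as `x.` will be counted as bigrams later on.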
Run this awk script to count every pair of adjacent characters within each word and write the counts, sorted from most to least frequent, to results.txt:
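awk '{
    # Lowercase the word so "Th" and "th" count as the same bigram
    $0 = tolower($0)
    # Slide a 2-character window across the word
    for (i = 1; i < length($0); i++) {
        pair = substr($0, i, 2)
        pairs[pair]++
    }
}
END {
    # Print each bigram with its count; sort then orders them by count, descending
    for (pair in pairs) {
        printf "%s: %d\n", pair, pairs[pair]
    }
}' corpus_word_list.txt | sort -k2,2nr > results.txt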
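The script above counts every pair of printable characters, including digits and punctuation. If only letter pairs matter for a layout, one possible variation is to filter inside awk. The sketch below is an assumption about how that could look, not part of the original script: the function name `letter_bigrams` is hypothetical, and the sample words stand in for corpus_word_list.txt.

```shell
# Hypothetical helper: read one word per line on stdin and
# print counts for bigrams made of two ASCII letters only.
letter_bigrams() {
    LC_ALL=C awk '{
        word = tolower($0)
        for (i = 1; i < length(word); i++) {
            pair = substr(word, i, 2)
            # Skip pairs containing digits or punctuation
            if (pair ~ /^[a-z][a-z]$/) pairs[pair]++
        }
    }
    END {
        for (pair in pairs) printf "%s: %d\n", pair, pairs[pair]
    }' | LC_ALL=C sort -k2,2nr -k1,1
}

# Try it on a tiny sample; "x." from "fox." is skipped.
printf 'the\nthen\nfox.\n' | letter_bigrams
```

On the real data this would be run as `letter_bigrams < corpus_word_list.txt | head -n 20` to show the 20 most frequent letter bigrams. The extra `-k1,1` sort key breaks ties alphabetically so the output order is stable.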