Top N most-used words in a text
#!/usr/bin/env sh
#
# Simple script that prints out the top N most-used words in a text from standard input.
#
# Inspired by https://buttondown.email/hillelwayne/archive/donald-knuth-was-framed/. The short
# Unix pipeline was first published in https://www.cs.tufts.edu/~nr/cs257/archive/don-knuth/pearls-2.pdf

topw() {
    # One word per line, lowercased, counted, sorted by count (descending), cut to N lines.
    tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed "$N"q
}

help() {
    cat << EOF
Usage: $(basename "$0") [OPTION]...
Print out the top N most-used words in a text from standard input. A word is
a non-zero-length sequence of letters ('A-Za-z') delimited by any non-letter
characters.

  -n  value of N (number of top words to list) [default 5]
  -h  display this help and exit
EOF
}

N=5

while getopts "hn:" option; do
    case $option in
        h)
            help
            exit 0;;
        n)
            # getopts puts the option's argument in OPTARG, which also covers "-n10".
            # Shifting inside the loop would desynchronize OPTIND, so no shift here.
            N="$OPTARG";;
        \?)
            # getopts has already printed a diagnostic for the invalid option.
            help
            exit 1;;
    esac
done

topw
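
The pipeline inside `topw` is easier to follow when each stage is run separately. The sketch below pushes a made-up sample sentence (an assumption for illustration, not part of the gist) through the same commands and keeps every intermediate result in a variable so it can be inspected.

```sh
# Hypothetical sample text, chosen only for illustration.
sample='The cat and the dog. The end!'

# Stage 1: turn every run of non-letters into a single newline (one word per line).
words=$(printf '%s\n' "$sample" | tr -cs A-Za-z '\n')

# Stage 2: fold to lowercase so that "The" and "the" count as the same word.
lower=$(printf '%s\n' "$words" | tr A-Z a-z)

# Stage 3: sort so identical words are adjacent, then count each run.
counts=$(printf '%s\n' "$lower" | sort | uniq -c)

# Stage 4: sort numerically by count, descending, and keep the first line (N=1).
top=$(printf '%s\n' "$counts" | sort -rn | sed 1q)

printf '%s\n' "$top"
```

The last line printed is the count followed by the word, here 3 and "the". The amount of leading padding that `uniq -c` puts before the count varies between implementations.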
To install, copy the latest gist to a location in your PATH, e.g.:
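
The original command for this step is not preserved here, so the following is only a sketch of one way to do it. It assumes the gist has already been saved to a local file; the helper name `install_script` and the paths in the example call are hypothetical.

```sh
# install_script SRC DEST_DIR
# Copy SRC into DEST_DIR under the name "topw" and mark it executable.
install_script() {
    mkdir -p "$2" &&
    cp "$1" "$2/topw" &&
    chmod +x "$2/topw"
}

# Example call (hypothetical paths; DEST_DIR must be listed in your PATH):
# install_script ./topw.sh "$HOME/.local/bin"
```

Any directory listed in `PATH` works as the destination. `$HOME/.local/bin` is a common per-user choice, but it is not on `PATH` by default on every system.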
Then pipe any text into it:
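
As a rough illustration of the piping step, the snippet below defines a stand-in `topw` function so that it runs without the gist installed. The stand-in takes N as a positional argument, which is an assumption made for brevity; the real script takes `-n N`. The sample text is also made up.

```sh
# Stand-in for the installed script (assumption: same pipeline, N passed as $1).
# With the gist installed you would run `topw -n 2` instead of `topw 2`.
topw() {
    tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed "${1:-5}q"
}

# Any command that writes text to standard output can act as the source.
result=$(printf 'to be or not to be\nthat is the question\n' | topw 2)
printf '%s\n' "$result"
```

This prints two lines, one for "to" and one for "be", each with a count of 2. The relative order of words with equal counts depends on how the local `sort` breaks ties.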
Example input: the complete works of Shakespeare (note that the "dataset" also includes some legal notes from Project Gutenberg).