@MichaelDimmitt
Last active September 10, 2024 13:07
quick command line word frequency

Note to self: is there a command-line tool on Homebrew to use instead?

copied from: https://unix.stackexchange.com/a/174421/188491 which was copied from https://tomayko.com/blog/2011/awkward-ruby 😅

find . -type f | grep -v git | grep -v mov | xargs cat | tr -c '[:alnum:]' '[\n*]' | sort -f | uniq -ci | sort -n
function wordfrequency() {
  awk '
     BEGIN { FS="[^a-zA-Z]+" } {
         for (i=1; i<=NF; i++) {
             word = tolower($i)
             # skip the empty fields produced when a line starts or ends with a non-letter
             if (word != "") words[word]++
         }
     }
     END {
         for (w in words)
              printf("%3d %s\n", words[w], w)
     } ' | sort -rn
}
cat file1 file2 file3 | wordfrequency | grep -vwE "to|a|in|git|for|help|ui|the" | head -10
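
Stop words can also be filtered by comparing the word column exactly rather than by pattern-matching the whole line. A minimal sketch, assuming the count-then-word output format that wordfrequency prints; dropwords is a hypothetical helper name, not part of the original snippet:

```shell
# Hypothetical helper: drop lines whose second column (the word) is a stop word.
# Stop words are passed as arguments; matching is exact, not substring.
dropwords() {
  awk -v list="$*" '
    BEGIN {
        # build a lookup table from the space-separated argument list
        n = split(list, arr, " ")
        for (i = 1; i <= n; i++) stop[arr[i]] = 1
    }
    # print only the lines whose word is not in the table
    !($2 in stop)
  '
}
```

Possible usage: cat file1 file2 | wordfrequency | dropwords to a in git for help ui the | head -10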

Use all files in the current directory and its subdirectories:

find . -type f | grep -v git | xargs cat | wordfrequency | grep -vwE "to|a|in|git|for|help|ui|the" | head -10

How to find bad files (i.e. files with spaces in their names, which break the xargs cat step):

find . -type f | grep -vE "git|mov" | grep " "
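
File names with spaces can also be handled directly by passing null-delimited names from find to xargs. This is a sketch, not the original approach: cat_tree is a hypothetical helper name, and it assumes a find and xargs that support -print0 and -0 (GNU and BSD/macOS both do). Unlike grep -v git, it skips only directories named .git and files ending in .mov, not every path containing those strings.

```shell
# Hypothetical helper: concatenate every regular file under a directory
# (default: current directory), skipping .git directories and *.mov files.
# Null-delimited names keep paths with spaces intact.
cat_tree() {
  find "${1:-.}" -type d -name .git -prune -o -type f ! -name '*.mov' -print0 |
    xargs -0 cat
}
```

Possible usage: cat_tree . | wordfrequency | head -10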
