# identifying stopwords using ngrams and vowels

we have: street names in Richmond, VA
we want to: match street names to the names of Confederate generals and Civil Rights leaders
first we must: remove small pesky elements of names that are irrelevant to the match
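
to make the removal step concrete, here is a minimal sketch of stripping stopwords from names once a list exists. the stopword list, the sample names, and the `strip_stopwords` function name are all made up for illustration; they are assumptions, not output from the pipeline in this README.

```bash
#!/usr/bin/env bash
# hypothetical, hand-written stopword list (illustration only)
stopwords="st ave rd n e"

# hypothetical helper: read names on stdin, drop any token in the stopword list
strip_stopwords() {
  awk -v list="$stopwords" '
    BEGIN {
      # load the space-separated list into a lookup table
      k = split(list, words, " ")
      for (i = 1; i <= k; i++) stop[words[i]] = 1
    }
    {
      out = ""
      for (i = 1; i <= NF; i++)
        # compare case-insensitively, but print the token as written
        if (!(tolower($i) in stop))
          out = out (out == "" ? "" : " ") $i
      print out
    }'
}

# made-up sample input
printf '%s\n' "N Monument Ave" "E Leigh St" | strip_stopwords
```

with this sample list, only `Monument` and `Leigh` survive, which is the part of each name worth matching. writing such a list by hand does not scale, which is where the next step comes in.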

the solution: ngrams to build a stopword list!

prerequisites

# name of input plain text file - in this case, a list of TIGER/Line road names in Richmond, VA
in=tiger.csv
# set max string length to be considered a stopword
n=6
# identify unigrams from the input file
ngrams "$in" 1 |\
# get two column TSV of frequency, unigram
sortfreq |\
# strip out header
sed '1d' |\
# keep any unigram of length <= n, OR that has no (lowercase) vowels
awk -v n="$n" 'length($2) <= n || $2 ~ /^[^aeiou]*$/' |\
# get top 40 most frequent
head -n 40 |\
# now use ggplot to make a graph of our results
# the graph is 6in in height and 9in wide, with size 20 title font!
plotbars stopword count "Potential Stopwords Identified Using ngrams\nTIGER/Line Road Names, Richmond, VA" 20 6 9
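
`ngrams`, `sortfreq`, and `plotbars` are external helper scripts. if you do not have them, the counting and filtering steps (not the plot) can be approximated with standard tools. this is only a sketch under assumptions: one road name per line, whitespace-separated unigrams, lowercased before counting. the `candidates` function name and the sample input are hypothetical, and the stand-in commands are not the helpers' actual implementation.

```bash
#!/usr/bin/env bash
# max string length to be considered a stopword (same role as n above)
n=6

# hypothetical helper: read road names on stdin and print "count<TAB>unigram"
# for unigrams that are short or vowel-free, most frequent first
candidates() {
  tr -s '[:space:]' '\n' |        # one unigram per line
  tr '[:upper:]' '[:lower:]' |    # fold case so "St" and "st" count together
  sort | uniq -c |                # count each distinct unigram
  awk -v n="$n" -v OFS='\t' '
    $2 != "" && (length($2) <= n || $2 ~ /^[^aeiou]*$/) { print $1, $2 }' |
  sort -t "$(printf '\t')" -k1,1nr -k2,2   # highest count first, ties by unigram
}

# made-up sample input, not real TIGER/Line data
printf '%s\n' "N Monument Ave" "W Broad St" "E Leigh St" "Chamberlayne Ave" | candidates
```

on this sample, `ave` and `st` rank first with a count of 2, while `monument` and `chamberlayne` are dropped. `broad` and `leigh` also pass the length test, which is why the output is only a list of potential stopwords: it still needs a human to review it.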

results!