@telvis07
Last active July 1, 2016 14:18
NLP - prune ngrams by finding the minimum number of ngrams that cover X percent of the word instances
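prune_ngram_df_by_cover_percentage <- function(df, percentage) {
  # Prune n-grams by keeping the smallest prefix of rows whose frequencies
  # sum to at least `percentage` (a fraction in [0, 1]) of all word instances.
  # Assumes df contains columns (word, freq).
  # Assumes df is sorted by freq in descending order.
  stopifnot(nrow(df) > 0, percentage >= 0, percentage <= 1)
  sums <- cumsum(df$freq)
  # Take the total from the cumulative sums so the last row always reaches 100%.
  total <- sums[length(sums)]
  # Index of the first row at which the cumulative frequency reaches the target.
  cover <- which(sums >= total * percentage)[1]
  print(sprintf("%d of %d (%.2f%%) cover %s%% of word instances",
                cover, nrow(df), cover / nrow(df) * 100, percentage * 100))
  # drop = FALSE keeps the result a data frame even if it has a single column.
  df[seq_len(cover), , drop = FALSE]
}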
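A small worked example may help show what the cutoff computes. The sketch below builds a toy frequency table (the words and counts are invented for illustration) and applies the same cumulative-sum logic inline, so it runs without the function above.

```r
# Toy unigram counts, invented for illustration only.
toy <- data.frame(
  word = c("the", "of", "and", "cat", "dog"),
  freq = c(50, 20, 15, 10, 5),
  stringsAsFactors = FALSE
)

# The function assumes descending order by freq, so sort explicitly.
toy <- toy[order(toy$freq, decreasing = TRUE), ]

# Cumulative counts: 50, 70, 85, 95, 100
sums <- cumsum(toy$freq)

# First row at which at least 80% of the 100 word instances are covered.
cover <- which(sums >= sum(toy$freq) * 0.8)[1]

# Keep only the rows up to and including the cutoff.
pruned <- toy[seq_len(cover), , drop = FALSE]
print(pruned$word)
```

Here the first two words cover only 70 of the 100 instances, so the cutoff lands on the third row: "the", "of", and "and" together cover 85%, the smallest prefix that reaches the 80% target.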