import nltk, os, json
from nltk import FreqDist
from nltk.stem import WordNetLemmatizer
from scipy.stats import scoreatpercentile

READ = 'r'
#read the stopword file; strip newlines and use a set for faster membership tests
#do stopwords have to be in the same file as the json files?
stopwords = set(line.strip() for line in open('../data/stopwords', READ))
#lemmatizer
lmtzr = WordNetLemmatizer()
#get the names of the files in a list
json_dir = 'C:\\Users\\Carrie0731\\Desktop\\JSON FILES'  #probably should modify this so everyone can open it on his/her computer
json_list = [os.path.join(json_dir, name) for name in os.listdir(json_dir)]
#lemmatize each word, dropping stopwords
sanitize = lambda word: lmtzr.lemmatize(word) if word not in stopwords else ''
tokens = [sanitize(word) for filename in json_list
                         for tweet in json.load(open(filename, READ))
                         for word in tweet['text'].split()]
freq = FreqDist(token for token in tokens if token)  #count tokens, not characters
cutoff = scoreatpercentile(list(freq.values()), 15)
vocab = [word for word, f in freq.items() if f > cutoff]  #items() returns the tuple (key, value), in this case (word, frequency)
As to the comment about how much of the FreqDist to use: On the one hand, any truncation of the FreqDist decreases the accuracy of P(token=t). On the other hand, our estimates of the frequency of occurrence of rare words are inaccurate, so it's not clear that including them helps us better estimate P(token=t).
My inclination is to use a cutoff established from the empirical cumulative distribution function (Reference: http://en.wikipedia.org/wiki/Empirical_distribution_function). In Python, you don't have to estimate the full function. If you use the standard 85% cutoff, use the SciPy function scoreatpercentile to find the word frequency at the 15th percentile, then include all words with frequencies greater than that.
A benefit of using the empirical cumulative distribution function is that we make no assumption about the distribution of word frequencies. Btw, the 85th percentile thing arises because if we assume that the frequencies are normally distributed then the 85th percentile of the CDF corresponds to a signal-to-noise ratio of 5-to-1, which itself is an old rule of thumb in electrical engineering.
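The percentile-based cutoff described above can be sketched in a few lines. This is a minimal illustration with a toy frequency table (the real counts would come from the tweet corpus); it uses the same scoreatpercentile call as the gist.

```python
from scipy.stats import scoreatpercentile

# Toy word-frequency table standing in for the corpus FreqDist.
freq = {'the': 120, 'on': 90, 'cat': 40, 'sat': 35, 'mat': 3, 'xyzzy': 1}

# The 15th percentile of the observed frequencies: words at or below this
# value fall in the noisy tail and are dropped.
cutoff = scoreatpercentile(list(freq.values()), 15)
vocab = [word for word, f in freq.items() if f > cutoff]
```

Note that no distributional assumption is made here; the cutoff is taken directly from the empirical frequencies.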
In Python, testing membership in sets is faster than testing membership in lists. (Reference: http://stackoverflow.com/questions/7110276/faster-membership-testing-in-python-than-set)
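A quick way to see the difference yourself, using the standard library's timeit (exact numbers vary by machine; only the relative ordering matters):

```python
import timeit

words = [str(i) for i in range(10000)]
word_list = words        # membership test scans the list, O(n)
word_set = set(words)    # membership test hashes the key, O(1) on average

# Look up a word near the end of the list, where the scan cost is worst.
list_time = timeit.timeit(lambda: '9999' in word_list, number=1000)
set_time = timeit.timeit(lambda: '9999' in word_set, number=1000)
```

For a stopword filter applied once per token over a large corpus, this is exactly the lookup pattern where the set pays off.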
You may have to remove newlines from the stopwords file after reading it.
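One way to handle that is to strip each line as it is read. The sketch below simulates the file with io.StringIO so it is self-contained; the real code would open the stopwords file instead.

```python
import io

# Stand-in for open('../data/stopwords'): three stopwords, one per line.
raw = io.StringIO("the\nand\nof\n")

# strip() removes the trailing newline; without it the entries would be
# 'the\n', 'and\n', ..., and tests like 'the' in stopwords would silently fail.
stopwords = set(line.strip() for line in raw)
```

Building a set here also gives the faster membership testing mentioned above.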