import math
from text.blob import TextBlob as tb

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

document1 = tb("""Python is a 2000 made-for-TV horror movie directed by Richard
Clabaugh. The film features several cult favorite actors, including William
Zabka of The Karate Kid fame, Wil Wheaton, Casper Van Dien, Jenny McCarthy,
Keith Coogan, Robert Englund (best known for his role as Freddy Krueger in the
A Nightmare on Elm Street series of films), Dana Barron, David Bowe, and Sean
Whalen. The film concerns a genetically engineered snake, a python, that
escapes and unleashes itself on a small town. It includes the classic final
girl scenario evident in films like Friday the 13th. It was filmed in Los Angeles,
California and Malibu, California. Python was followed by two sequels: Python
II (2002) and Boa vs. Python (2004), both also made-for-TV films.""")

document2 = tb("""Python, from the Greek word (πύθων/πύθωνας), is a genus of
nonvenomous pythons[2] found in Africa and Asia. Currently, 7 species are
recognised.[2] A member of this genus, P. reticulatus, is among the longest
snakes known.""")

document3 = tb("""The Colt Python is a .357 Magnum caliber revolver formerly
manufactured by Colt's Manufacturing Company of Hartford, Connecticut.
It is sometimes referred to as a "Combat Magnum".[1] It was first introduced
in 1955, the same year as Smith & Wesson's M29 .44 Magnum. The now discontinued
Colt Python targeted the premium revolver market segment. Some firearm
collectors and writers such as Jeff Cooper, Ian V. Hogg, Chuck Hawks, Leroy
Thompson, Renee Smeets and Martin Dougherty have described the Python as the
finest production revolver ever made.""")

bloblist = [document1, document2, document3]
for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("Word: {}, TF-IDF: {}".format(word, round(score, 5)))
Thanks for the code. One small error:
from text.blob import TextBlob as tb
should be
from textblob import TextBlob as tb
I'm an NLP noob. How could I use this with the TextBlob classifiers (Bayes/MaxEnt)?
I don't know if this is a Python 2 thing, but the division in your tf
routine is operating on integers...
@eggie5, you can add this line to the top to coerce float division:
from __future__ import division
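For anyone comparing the two behaviours, here is a minimal sketch in plain Python 3 that contrasts floor division (what Python 2's `/` does on two ints) with true division. It does not use TextBlob, and the word list is made up for illustration.

```python
# Hypothetical tokenized document, standing in for blob.words.
words = ["python", "is", "a", "snake", "python"]

count = words.count("python")   # 2
total = len(words)              # 5

# Python 2's `/` on two ints behaves like `//` here: the fraction is lost.
tf_floor = count // total

# True division (the Python 3 default, or Python 2 with
# `from __future__ import division`) keeps the fraction.
tf_true = count / total

print(tf_floor, tf_true)
```

With floor division every term frequency below 1 collapses to 0, which is why the scores come out as zero on Python 2 without the import.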
I ran this script in Python 2.7 and got a math domain error.
The root cause is that len(bloblist) / (1 + n_containing(word, bloblist)) is integer division,
so it will often be 0, and math.log(0) raises an exception.
The same integer division affects this function:
def tf(word, blob):
    return blob.words.count(word) / len(blob.words)
As a fix, convert to float before dividing, for example:
def tf(word, blob):
    return float(blob.words.count(word)) / float(len(blob.words))
I'm not sure what the result is in Python 3...
def idf(word, bloblist):
    return math.log(len(bloblist) / float(1 + n_containing(word, bloblist)))
The idf function was throwing a math domain error as well, so I modified it as shown, and it worked. Of course, I also incorporated the suggestion by RangetWolf.
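To see where the math domain error comes from, here is a small sketch in Python 3 that uses `//` to imitate Python 2's integer division. The counts are assumptions chosen to match the three-document example, where a word such as "Python" appears in every document.

```python
import math

n_docs = 3          # len(bloblist) in the gist
n_with_word = 3     # the word appears in every document

# Floor division mimics Python 2's integer `/`: 3 // 4 == 0.
ratio_py2 = n_docs // (1 + n_with_word)

try:
    math.log(ratio_py2)          # log(0) is undefined
    error = None
except ValueError as exc:
    error = str(exc)

# With true division the ratio is 0.75, so the log is defined (and negative).
idf = math.log(n_docs / (1 + n_with_word))
print(error, idf)
```

Note that the `1 +` in the denominator makes idf negative for words that occur in every document, so a negative score is expected there and is not a bug in itself.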
Thanks, very useful the comments!
Hi, I am getting the wrong output after following the given steps.
My output:
Top words in document 1
Word: Van, TF-IDF: 0.0
Word: both, TF-IDF: 0.0
Word: including, TF-IDF: 0.0
Top words in document 2
Word: and, TF-IDF: -0.0
Word: among, TF-IDF: 0.0
Word: snakes, TF-IDF: 0.0
Top words in document 3
Word: premium, TF-IDF: 0.0
Word: and, TF-IDF: -0.0
Word: Ian, TF-IDF: 0.0
Please help.
Thank you
Convert everything to float, as below:
def tf(word, blob):
    return float(blob.words.count(word)) / float(len(blob.words))
def n_containing(word, bloblist):
    return float(sum(1 for blob in bloblist if word in blob))
def idf(word, bloblist):
    # The division has to happen inside math.log, not after it.
    return math.log(len(bloblist) / float(1 + n_containing(word, bloblist)))
def tfidf(word, blob, bloblist):
    return float(tf(word, blob)) * float(idf(word, bloblist))
Really helpful stepping into the NLP world. Thanks!
Hello. Is there any way to sum the counts of the same word across multiple documents?
I use this function to count a word in a single document:
def tf(word, blob):
    return blob.words.count(word)
Thank you
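One possible approach, sketched with plain token lists instead of TextBlob objects (the documents and the helper name `total_count` are made up for illustration): sum the per-document counts, or build a `collections.Counter` over all tokens to get every word's total at once.

```python
from collections import Counter

# Hypothetical tokenized documents, standing in for each blob.words.
docs = [
    ["python", "is", "a", "snake"],
    ["the", "python", "is", "a", "revolver"],
    ["python", "python", "movie"],
]

def total_count(word, docs):
    """Total occurrences of `word` summed over every document."""
    return sum(doc.count(word) for doc in docs)

# Counter gives the same totals for every word in one pass.
totals = Counter(word for doc in docs for word in doc)

print(total_count("python", docs), totals["is"])
```

With TextBlob, the same idea should carry over by replacing `doc.count(word)` with `blob.words.count(word)`.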
I have a few documents stored in a folder. Instead of writing the document text into the .py file, I want to access the documents through code. Please help!
Thanks in advance.
Hi @nikhilcheke, I have a similar situation to yours. I am using this solution:
import os, glob
folder = "/path/to/folder/"
os.chdir(folder)
files = glob.glob("*.txt")  # Makes a list of all .txt files in folder
bloblist = []
for file1 in files:
    with open(file1, 'r') as f:
        data = f.read()  # Reads document content into a string
    # Makes TextBlob object (.decode() is Python 2 only; Python 3 strings have no .decode())
    document = tb(data.decode("utf-8"))
    bloblist.append(document)
It's working for me.
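On Python 3 the `.decode()` call is not available on strings, so here is a hedged sketch of the same idea using `pathlib`. The function name `read_documents` and the folder path are placeholders, not part of the gist.

```python
from pathlib import Path

def read_documents(folder):
    """Return the text of every .txt file in `folder`, sorted by filename."""
    texts = []
    for path in sorted(Path(folder).glob("*.txt")):
        # Decode while reading, so no separate .decode() step is needed.
        texts.append(path.read_text(encoding="utf-8"))
    return texts

# With TextBlob installed, the list from the gist could then be built as:
# bloblist = [tb(text) for text in read_documents("/path/to/folder/")]
```

Unlike the `os.chdir` version, this leaves the working directory unchanged.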
After applying the corrections suggested above, I get no error, but nothing is printed in the Jupyter notebook.
Is it possible to incorporate lemmatizing into this process? using TextBlob, for instance?
I am using Python 2.7 and I got this error:
scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
AttributeError: 'unicode' object has no attribute 'words'
Hi prabhatntpc,
I think this is too late, but others can benefit from it.
import glob
import os
files = glob.glob(os.path.join(os.getcwd(), ':/folder', '*.txt' ))
# iterate over the list getting each file
for file in files:
    # open the file and then call .read() to get the text
    with open(file) as f:
        text = f.read()
I have this error -
File "C:\Users\megha\Local\Programs\Python\Python37-32\lib\site-packages\textblob\decorators.py", line 38, in decorated
raise MissingCorpusError()
textblob.exceptions.MissingCorpusError:
Looks like you are missing some required data for this feature.
How would it be to read data from a txt file?
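with open("abc.txt", "r") as myfile:
    # Replace newlines with spaces so words on adjacent lines are not glued together
    data = myfile.read().replace('\n', ' ')
Here, data will store the contents of your text file, with newlines replaced by spaces.
You can then use the variable "data" as required.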
return log(len(bloblist) / (1 + n_containing(word, bloblist)))
ValueError: math domain error
Along with the TF-IDF score, I also want the TF score. How can I do it?
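One way this might be done is to keep the TF value next to the TF-IDF value when building the scores. The sketch below uses plain token lists rather than TextBlob, and `score_table` and the sample documents are made up for illustration. Note that it tests token membership, whereas the gist's `word in blob` is evaluated against the TextBlob object itself.

```python
import math

def tf(word, words):
    """Term frequency: share of tokens in `words` equal to `word`."""
    return words.count(word) / len(words)

def idf(word, docs):
    """Inverse document frequency with the same 1 + n smoothing as the gist."""
    n_containing = sum(1 for words in docs if word in words)
    return math.log(len(docs) / (1 + n_containing))

def score_table(words, docs):
    """Rows of (word, tf, tf-idf), highest tf-idf first."""
    rows = [(w, tf(w, words), tf(w, words) * idf(w, docs))
            for w in sorted(set(words))]
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Hypothetical tokenized documents.
docs = [
    ["python", "is", "a", "snake"],
    ["python", "is", "a", "revolver"],
    ["python", "is", "a", "movie"],
]

for word, tf_score, tfidf_score in score_table(docs[0], docs)[:3]:
    print("Word: {}, TF: {}, TF-IDF: {}".format(
        word, round(tf_score, 5), round(tfidf_score, 5)))
```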
def score_tf(query, tokenized_document):
    print('query:', query)
    result = 0.0
    for q in query:
        count = term_frequency(q, tokenized_document)
        tf = 1 + math.log(count)
        print("count:", count, "\tterm:", q, "\ttf:", tf)
        result = result + tf
    return result
def inverse_document_frequencies(term, documents):
    df = 0
    for d in documents:
        tokenized_d = text2list(d)
        if term in tokenized_d:
            df = df + 1
    return math.log(len(documents)/df)
How do I calculate tf-idf?
def score_tfidf(query)
...........
Please help me!
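A possible shape for `score_tfidf`, offered as a sketch only: the poster's `text2list` and `term_frequency` are not shown, so the versions below are hypothetical stand-ins, and the extra `document` and `documents` parameters are an assumption (the score needs both a document and the collection). Terms with a zero count or zero document frequency are given a weight of 0 to avoid `log(0)` and division by zero.

```python
import math

def text2list(text):
    # Hypothetical stand-in for the poster's text2list: lowercase whitespace split.
    return text.lower().split()

def term_frequency(term, tokens):
    # Hypothetical stand-in for the poster's term_frequency: a raw count.
    return tokens.count(term)

def log_tf(term, tokens):
    """1 + log(count), or 0.0 when the term is absent (log(0) is undefined)."""
    count = term_frequency(term, tokens)
    return 1 + math.log(count) if count > 0 else 0.0

def inverse_document_frequency(term, documents):
    """log(N / df), or 0.0 when no document contains the term."""
    df = sum(1 for d in documents if term in text2list(d))
    return math.log(len(documents) / df) if df > 0 else 0.0

def score_tfidf(query, document, documents):
    """Sum of tf * idf over the query terms for one document."""
    tokens = text2list(document)
    return sum(log_tf(q, tokens) * inverse_document_frequency(q, documents)
               for q in query)

documents = [
    "the python is a snake",
    "the colt python is a revolver",
    "python the movie",
]
print(score_tfidf(["snake", "python"], documents[0], documents))
```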
Can someone help me understand line number 8?
return sum(1 for blob in bloblist if word in blob)
I don't understand how the sum(1 for ...) expression works. What is the purpose of the 1
in there?
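A short sketch that may help: the generator expression produces a `1` for each document that passes the `if` filter, and `sum` adds those ones up, so the result is a count of matching documents. The strings below are made up, and `word in doc` is a substring test on plain strings here.

```python
# Hypothetical documents as plain strings.
docs = ["python is a snake", "colt python revolver", "a horror movie"]
word = "python"

# One 1 is produced for every document that contains the word...
ones = [1 for doc in docs if word in doc]

# ...so summing those 1s counts the matching documents.
count = sum(1 for doc in docs if word in doc)

# Equivalent spelled-out loop.
loop_count = 0
for doc in docs:
    if word in doc:
        loop_count += 1

print(ones, count, loop_count)
```

The generator form does the same thing as the loop without building an intermediate list.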
Dear Sir,
If I have to compute TF-IDF for multiple files stored in a folder, how would this program change?