These are two solutions for a topic extraction task. The sample data is loaded into a variable by the script. I've included running times for both solutions, so we have precise information about the cost of each one in addition to its results. According to Pazienza et al. (2005), two trends in textual information extraction can be identified: one based on linguistic and syntactic information, and another based on statistical analysis of frequency patterns (which usually treats text as a bag of words). Whilst the first solution here is purely statistical, the second incorporates information about syntactic categories into the analysis (hence a hybrid approach).
1 – Set-up used:
*Ubuntu 11.04 Natty AMD64
*Python 2.7.3
*Python re library
*Python nltk 2.0 library and the required NumPy and PyYAML (for NLP tasks)
*Python tweetstream 1.1.1 library (for Twitter manipulation)
*Python simplejson library (for JSON manipulation)
*Python sklearn 0.11 library (for machine learning tasks)
*Python time module (for benchmarking the different solutions)
Installation Instructions:
- Python and Python installation packages: from the command prompt run:
sudo apt-get install python python-pip python-setuptools
- NLTK, NumPy, PyYAML libraries: from the command prompt run:
sudo pip install -U numpy
sudo pip install -U pyyaml nltk
Test the installation: run python, then type import nltk
- Sklearn 0.11: first install the dependencies:
sudo apt-get install python-dev python-numpy python-numpy-dev python-setuptools python-scipy libatlas-dev g++ python-matplotlib python-pip
then from the command prompt run:
sudo pip install -U scikit-learn
Theoretical assumptions:
The data manipulated is a JSON file containing articles with a variety of fields. Since the solutions implemented here are based on machine learning techniques (involving the frequency distribution of tokens in a bag of words), I only used the title and content of each article. The title was duplicated, so the frequency of the words contained in it would be increased (this is a simple way to capture the intuition that words in the title are a bit more important).
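The loading step described above can be sketched as follows. This is a minimal sketch, not the script itself: the field names 'title' and 'content' are assumptions about the layout of data.json.

```python
# Sketch of the loading step. The field names 'title' and 'content' are
# assumptions about the layout of data.json described above.
import json

def build_documents(articles):
    """One string per article: the title twice, then the content."""
    documents = []
    for article in articles:
        # Repeating the title doubles the counts of its words downstream.
        documents.append('%s %s %s' % (article['title'], article['title'],
                                       article['content']))
    return documents

# Typical use, assuming the file sits in the working directory:
# with open('data.json') as f:
#     documents = build_documents(json.load(f))
```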
Running instructions:
Simply run either script (nnmf_no_datatreatment.py or nnmf_noun_extraction.py); each one displays its results and running time. The data.json file should be placed in the same directory.
The problem was broken down into three subtasks: (i) loading and pre-processing the data, (ii) running a TF-IDF algorithm on the data, (iii) running a non-negative matrix factorization of the TF-IDF matrix.
Step (i)
Data Loading:
For both solutions, data loading is identical. The JSON file is loaded, and for each article we generate a string containing the title (duplicated) and the content of the article. Each string is then appended to an initially empty list; the resulting list contains each document as a row (vector).
Preprocessing:
For the solution in nnmf_no_datatreatment.py, no data treatment was performed. The aim was to have a comparison between the pros and cons of pre-processing options.
For the solution in nnmf_noun_extraction.py, we created a filter that tokenizes the text, performs POS tagging on it, and removes all but the noun elements. Each document in the list can then be regarded as a bag-of-nouns. Many authors have proposed that nouns are the linguistic units that best represent the topics of a text (for a recent example, Mihalcea 2004, but there are many others). Personal note: I thought that verbs in the gerund (e.g. ‘bombing’), which are subject to nominalization rules, e.g. ‘the bombing in King’s Cross’, have a similarly nouny nature, but unfortunately the results are not so interesting.
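The core of the filter can be sketched like this. The NLTK pipeline is shown commented out because it requires the NLTK data packages to be downloaded; the filter itself simply keeps tokens whose Penn Treebank tag marks a noun.

```python
# Minimal sketch of the noun filter. The NLTK calls are shown commented out
# because they require the NLTK data packages to be downloaded; the filter
# itself keeps only tokens whose Penn Treebank tag marks a noun.
def filter_nouns(tagged_tokens):
    """Keep words tagged NN, NNS, NNP or NNPS from (word, tag) pairs."""
    return [word for word, tag in tagged_tokens if tag.startswith('NN')]

# In the script this would be fed by NLTK's tokenizer and tagger:
# import nltk
# tagged = nltk.pos_tag(nltk.word_tokenize(document))
# bag_of_nouns = ' '.join(filter_nouns(tagged))

print(filter_nouns([('the', 'DT'), ('bombing', 'NN'), ('in', 'IN'),
                    ("King's", 'NNP'), ('Cross', 'NNP')]))
```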
Step (ii) TF*IDF
Roughly, TF-IDF is the product of two components: (i) the term frequency of a term t in a document d – tf(t,d) – and (ii) the inverse document frequency – idf(t,D) – which penalizes terms that occur in many documents of the corpus D. So if ‘the’ appears many times in a document but also appears in most documents of the corpus, its TF-IDF value will be low. A term that is frequent in a document but not very frequent in the corpus will have a higher score, which means that it is statistically significant (for discussion, see Salton and Buckley 1988).
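As a toy illustration of the formula in its textbook form, tf * log(N/df) (scikit-learn's implementation uses a smoothed variant, so exact numbers differ, but the intuition is the same):

```python
# Toy illustration of tf-idf in its textbook form, tf * log(N/df).
# scikit-learn uses a smoothed variant, so exact numbers differ.
import math

docs = [
    ["the", "bombing", "the", "attack"],
    ["the", "election", "results"],
    ["the", "attack", "aftermath"],
]

def tfidf(term, doc, corpus):
    tf = doc.count(term) / float(len(doc))       # relative frequency in this document
    df = sum(1 for d in corpus if term in d)     # documents containing the term
    idf = math.log(len(corpus) / float(df))      # penalize corpus-wide terms
    return tf * idf

# 'the' occurs in every document, so idf = log(3/3) = 0 and the score
# vanishes; 'bombing' is document-specific, so it gets a positive score.
print(tfidf("the", docs[0], docs))      # 0.0
print(tfidf("bombing", docs[0], docs))
```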
Implementation of this step involved just calls to the sklearn.feature_extraction.text.CountVectorizer module (combined with a TF-IDF transformer) on our data. Good documentation can be found on the library’s website: http://scikit-learn.org/stable/modules/feature_extraction.html .
Note: max_df was set to 0.95, discarding terms that appear in more than 95% of the documents (without this the resulting data is a bit noisy, e.g. lots of pronouns), and binary=False (setting it to True is useful for small chunks of text, but it didn’t work well for our data).
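Putting step (ii) together, the vectorization might look like this with today's scikit-learn API (the post used version 0.11, whose class names differ slightly; the documents here are made-up stand-ins for the article strings built in step (i)):

```python
# Sketch of the vectorization step with the current scikit-learn API.
# The documents are made-up stand-ins for the article strings of step (i).
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "bombing attack london transport",
    "election results polling day",
    "attack aftermath london police",
]

# max_df=0.95 drops terms appearing in more than 95% of the documents;
# binary=False keeps real counts rather than 0/1 indicators.
vectorizer = TfidfVectorizer(max_df=0.95, binary=False)
tfidf_matrix = vectorizer.fit_transform(documents)   # sparse (n_docs, n_terms)
print(tfidf_matrix.shape)
```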
Step (iii) Non-Negative Matrix factorization
Non-negative Matrix Factorization (NNMF) can be used as a technique for reducing the complexity of the analysis of a term-document matrix D (as produced by TF-IDF), hence mitigating some problems in information retrieval (see Chang et al. 2002). The original matrix D is approximated by the product of two low-rank matrices: the document-topic matrix W and the topic-word matrix H. The rank of these matrices is given by the number k of topics. The entry W(i,k) represents the importance of topic k to document i, and H(k,j) represents the importance of term j to topic k. My implementation was a re-work of the examples given on the sklearn library’s website (http://scikit-learn.org).
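The factorization step can be sketched as follows with scikit-learn's NMF class (current API; the 0.11 code the post is based on differs slightly). D here is a small stand-in for the tf-idf matrix of step (ii).

```python
# Sketch of step (iii): factorize a tf-idf matrix into W and H.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "bombing attack london transport",
    "election results polling day",
    "attack aftermath london police",
]
vectorizer = TfidfVectorizer(max_df=0.95)
D = vectorizer.fit_transform(documents)

k = 2                                    # number of topics
nmf = NMF(n_components=k, init='nndsvd', random_state=0)
W = nmf.fit_transform(D)                 # (n_docs, k): topic weights per document
H = nmf.components_                      # (k, n_terms): term weights per topic

# Show the top terms of each topic (vocabulary_ maps term -> column index).
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for topic_idx, row in enumerate(H):
    top = [terms[i] for i in row.argsort()[::-1][:3]]
    print("Topic %d: %s" % (topic_idx, ", ".join(top)))
```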
Discussion:
As expected, solution 2, which extracts nouns first, takes much longer (more than a minute) than the solution without noun extraction (less than 3 seconds), but it yields much cleaner data.
As a final note, I’d like to point out that the solutions presented here are not exhaustive. Graph-based hybrid solutions like the TextRank algorithm give very good results. I haven’t presented one because of (i) my unfamiliarity with graph-based information retrieval, and (ii) the availability of a very good implementation which would only have required me to load the data (you can find it here: https://gist.github.com/1646117 , too good to be true really).
Hi,
I believe norm=None also needs to be passed to text.TfidfVectorizer(), otherwise some topics may end up having the same set of words. Normalization needs to occur after nmf.fit, according to this article: http://web.stanford.edu/class/ee378b/papers/xu-lin-gong-nmf.pdf .
from sklearn import preprocessing
# Factorize the unnormalized tf-idf matrix first, then L2-normalize the
# resulting factors row-wise.
W = preprocessing.normalize(nmf.fit_transform(tfidf_matrix), norm='l2', copy=False)
H = preprocessing.normalize(nmf.components_, norm='l2', copy=False)