This gist is an extract from the article Detecting Similar News. It uses data retrieved by a crawler to detect similar articles across different domains.
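To make "similar articles across different domains" concrete, here is a minimal sketch. It is not the article's method. It assumes articles are available as (domain, title) pairs and compares titles with Jaccard similarity over their word sets. The function names, the 0.5 threshold, and the sample domain example.org are all hypothetical, chosen only for illustration.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_pairs(articles, threshold=0.5):
    """Return (title1, title2, score) for cross-domain pairs scoring >= threshold.

    `articles` is a list of (domain, title) tuples. This input shape is an
    assumption, not the crawler's actual output format.
    """
    pairs = []
    for (d1, t1), (d2, t2) in combinations(articles, 2):
        if d1 == d2:
            continue  # only compare articles from different domains
        score = jaccard(tokens(t1), tokens(t2))
        if score >= threshold:
            pairs.append((t1, t2, score))
    return pairs


if __name__ == "__main__":
    sample = [
        ("techcrunch.com", "Apple unveils new iPhone"),
        ("example.org", "Apple unveils the new iPhone"),
        ("example.org", "Local bakery wins award"),
    ]
    for t1, t2, score in similar_pairs(sample):
        print(f"{score:.2f}  {t1!r} ~ {t2!r}")
```

Comparing every pair is quadratic in the number of articles, which is fine for a sketch but would need indexing or hashing tricks on a large crawl.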
Start by running the crawler to retrieve the data. The first run takes about 50 minutes to retrieve everything.
$ python run.py
retrieving url... [techcrunch.com] /