Automating news discovery in real time

How media works
- There's a difference in positioning: in-depth vs breaking news
- Talent crunch and margin pressure: not enough staff to 'break news'
- Sources of breaking news: agencies, in-house, competition, social media
- Increasingly, social media is a dominant source
How can we source social media data at scale
- Twitter vs Facebook vs Google Trends vs ...: accessibility vs reach
- Streaming in real time (importance of sub-second responses for TV)
- Parallel extraction: sockets & threads -- importance of async (and why Node.js handles concurrent I/O better than Python 2)
- Storage: JSON and the coming of age of RDBMSs (and why Postgres is as good as MongoDB for storing JSON)
- Distributed scraping -- building a headless browser farm
- Client-side scraper farms as alternatives -- building Chrome plugins
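To make the async point above concrete, here is a minimal sketch in Python 3 using `asyncio`, which is in the standard library from Python 3.4 onward and has no built-in equivalent in Python 2. It is not the code from the talk. The source names and latencies are hypothetical, and `asyncio.sleep` stands in for real socket or HTTP calls so the example runs without network access.

```python
import asyncio
import time

# Hypothetical sources with simulated latencies in seconds.
# These are placeholders, not real API endpoints.
SOURCES = {"twitter": 0.10, "facebook": 0.20, "google_trends": 0.15}


async def fetch(source: str, latency: float) -> dict:
    """Simulate one I/O-bound request.

    A real version would await a socket or an async HTTP client here.
    """
    await asyncio.sleep(latency)
    return {"source": source, "items": [f"{source}-item-{i}" for i in range(2)]}


async def fetch_all(sources: dict) -> list:
    """Start every request at once and wait for all of them on one thread."""
    tasks = [fetch(name, latency) for name, latency in sources.items()]
    # gather() returns results in the same order as the tasks passed in.
    return await asyncio.gather(*tasks)


def run() -> tuple:
    """Run all fetches concurrently and report the wall-clock time taken."""
    start = time.perf_counter()
    results = asyncio.run(fetch_all(SOURCES))
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run()
    for result in results:
        print(result["source"], result["items"])
    print(f"elapsed: {elapsed:.2f}s, sequential would be {sum(SOURCES.values()):.2f}s")
```

Because the waits overlap, total time tracks the slowest source rather than the sum of all sources, which is the property that matters when many feeds are polled at once.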
Filtering sources for insights
- Why traditional entity extraction fails
- Fuzzy matching in the Indian context: key-collision vs distance-based methods
- How visuals help flexibly identify topic clusters -- k-means and beyond
- Determining the importance and relevance of topics
- Manual vs automated filtering -- negative lists
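The two fuzzy-matching families named above, plus a negative list, can be sketched as follows. This is an illustration under stated assumptions, not the talk's implementation: the key-collision method is a simple fingerprint (lowercase, strip punctuation, sort unique tokens), the distance-based method is plain Levenshtein edit distance, and the names, the negative-list entries, and the `max_edits` threshold are all hypothetical. It only handles Latin-script spellings.

```python
import re
from collections import defaultdict


def fingerprint(name: str) -> str:
    """Key-collision key: lowercase, drop punctuation, sort unique tokens."""
    text = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(sorted(set(text.split())))


def cluster_by_key(names: list) -> dict:
    """Group names whose fingerprints collide (same words, any order or case)."""
    clusters = defaultdict(list)
    for name in names:
        clusters[fingerprint(name)].append(name)
    return dict(clusters)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def is_close(a: str, b: str, max_edits: int = 2) -> bool:
    """Distance-based match on fingerprints; max_edits is an assumed threshold."""
    return edit_distance(fingerprint(a), fingerprint(b)) <= max_edits


# Hypothetical negative list: recurring phrases that trend but are not news.
NEGATIVE_LIST = {"good morning", "happy birthday"}


def drop_negatives(topics: list) -> list:
    """Remove topics whose fingerprint matches a negative-list entry."""
    blocked = {fingerprint(entry) for entry in NEGATIVE_LIST}
    return [topic for topic in topics if fingerprint(topic) not in blocked]


if __name__ == "__main__":
    print(cluster_by_key(["Narendra Modi", "Modi, Narendra", "Rahul Gandhi"]))
    print(is_close("Chidambaram", "Chidambram"))
    print(drop_negatives(["Good Morning!", "Narendra Modi"]))
```

Key collision is cheap and catches reordering, casing and punctuation differences, while edit distance catches transliteration variants such as a dropped vowel, at the cost of comparing pairs of names.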
Structure of the final solution -- what it looked like, and what it resulted in

sanand0/automating-news-discovery-in-real-time.md