-
Bare stemming may be too coarse-grained.
- Possessive pronouns and contractions reduce to the same form. For example, "[shake its] head" and the song "Shake It" both reduce to "shak it", so topics interfere from one form to the other.
- Latin plurals (borrowings) reduce like native plurals, e.g. "Illuminati" reduces to "illumit" and we get the topic Illuminated.
- Acronyms can reduce to blacklisted words, e.g. "IT" => "it".
- Verbal endings, e.g. "whisking up" => "whisk up", which we then disambiguate as Whiskey.
- "According" => "accord", which is then disambiguated as Honda Accord.
- "The timing" => Time.
- "The energy" => Energis (telecom company).
- Specific acronyms that, once normalised, are matched against general words: "the news" => NEWS (Japanese band).
- No distinction between different meanings of adjectives: "Portuguese" is very likely to be disambiguated to Portugal, and we won't be able to distinguish between the language and the nationality.
- "s share" => S Chip.
- "a nice" => Nice (the city).
- "this is" => This Is Spinal Tap.
- "nabbed" => "nab" => Neutralizing antibody (NAb).
- "a clarion" => CLARION (artificial intelligence).
- "go on", "keep on" => topics related to TV and music, even though the SFs occur as verbs.
- "Who I am" => music album
- "mislead" => "misl" => Misl (the 12 sovereign Sikh states).
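To make the failure mode concrete, here is a hand-rolled toy stemmer reproducing a few of the collisions above. This is illustrative only; the rules are simplified stand-ins for whatever stemmer we actually run, but they show how "its", "nabbed", and "energy" collapse onto ambiguous keys.

```python
def naive_stem(word: str) -> str:
    """Toy suffix-stripper illustrating coarse-grained stemming collisions."""
    w = word.lower()
    if w.endswith("ing") and len(w) > 5:
        w = w[:-3]                          # "according" -> "accord"
    elif w.endswith("ed") and len(w) > 4:
        w = w[:-2]
        if len(w) > 2 and w[-1] == w[-2]:   # undouble: "nabb" -> "nab"
            w = w[:-1]
    elif w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]                          # "its" -> "it"
    if w.endswith("y"):
        w = w[:-1] + "i"                    # "energy" -> "energi"
    return w
```

With these rules, "its" => "it", "according" => "accord", "nabbed" => "nab", and "energy" => "energi", which is exactly the shape of the collisions listed above.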
-
Better smoothing for lowercase surface forms. Case in point: the SF "Neymar" occurs many times and normally has a good enough annotation probability, but its lowercase counterpart "neymar" does not. If we force spots to come from the lowercase/stem store, we miss many topics whose uppercase counterparts occur frequently. We need a better way to transfer probability mass from uppercase to lowercase/stem forms if we are using such metrics for spottability. (This is the general case for place and people names.)
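One possible direction, sketched below (the store layout and function names are hypothetical, not our actual code): pool the annotated/total counts of all cased variants under their lowercase key, so the lowercase form inherits the uppercase mass.

```python
from collections import defaultdict

def backoff_annotation_prob(sf_counts, key_of):
    """Sketch: pool annotation counts from cased surface forms into a shared
    lowercase/stem key, so e.g. "neymar" inherits mass from "Neymar".

    sf_counts: {surface_form: (annotated_count, total_count)}
    key_of:    function mapping a surface form to its backoff key
    """
    pooled = defaultdict(lambda: [0, 0])
    for sf, (annotated, total) in sf_counts.items():
        key = key_of(sf)
        pooled[key][0] += annotated
        pooled[key][1] += total
    return {k: a / t for k, (a, t) in pooled.items() if t}

# Hypothetical counts: the cased form is well-annotated, the lowercase one is not.
counts = {"Neymar": (90, 100), "neymar": (1, 10)}
probs = backoff_annotation_prob(counts, str.lower)
# "neymar" is now backed by the cased counts: (90 + 1) / (100 + 10)
```

Whether a plain sum is the right pooling (versus a discounted back-off) is open; the point is only that spottability for "neymar" should not be computed from its own sparse counts alone.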
-
Tokenization issues. Examples with numeric SFs:
- "7-1" is tokenized as "7" and "-1", and the second is linked to the topic for the numeral negative one.
- "3-0" is tokenized as "3-", which is then linked to the 3-manifold concept.
- Since we do not normalise the hyphen as a device for noun-noun composition, we get "child abuse" but not "child-abuse".
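A minimal sketch of a tokenizer tweak that would avoid both problems (the regex and helper names are illustrative, not our actual pipeline): keep score-like patterns such as "7-1" as single tokens, and index hyphenated compounds under both their joined and split variants.

```python
import re

# Score-like numerics ("7-1", "3-0") match as one token before the
# general word/compound alternative gets a chance.
TOKEN = re.compile(r"\d+-\d+|\w+(?:-\w+)*")

def tokenize(text):
    """Split text, keeping score-like numerics and hyphenated compounds whole."""
    return TOKEN.findall(text)

def hyphen_variants(token):
    """Index "child-abuse" under both "child-abuse" and "child abuse"."""
    return [token, token.replace("-", " ")] if "-" in token else [token]
```

This keeps "7-1" and "3-0" intact instead of producing "-1" or "3-", and lets the hyphenated compound be looked up against the space-separated SF store.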
-
Problems when calculating the relevance of bigrams.
- "Dermot Nolan" => ["Dermot", "Nolan", "Dermot Nolan"]. Only the first name is associated with an entity, so we get Dermot O'Leary. The same happens for "Costa Concordia" => "Costa" => Costa Crociere (the right concept is Costa Concordia). It seems that, in the same way we prefer longer matches, we should prefer matching the rightmost noun over the leftmost one [this could be language-specific].
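The longest-then-rightmost preference could be sketched like this (the span structures are hypothetical stand-ins for our candidate matches):

```python
def pick_candidate(matches):
    """Pick among (start, end, entity) spans found inside one mention:
    prefer the longest span, and break ties by the rightmost end offset."""
    return max(matches, key=lambda m: (m[1] - m[0], m[1]))

# Token spans over the mention ["Dermot", "Nolan"]:
spans = [(0, 1, "Dermot O'Leary"),   # match on the first token only
         (1, 2, "Nolan (surname)")]  # match on the last token only
# Both spans have length 1, so the rightmost one wins.
```

If a full-span candidate existed (e.g. a "Dermot Nolan" entity over (0, 2)), the length criterion would select it first, preserving the existing longer-match preference.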
-
Contextual issues. At first our contextual score seems really good (when used as one component of our relevance score), but when we let many surface forms through the spotting phase we get a lot of bad candidates. More surprising still, these wrong candidates score quite high despite being bad disambiguations, so we can't filter on the relevance of the output. I think we need a better representation of the context.
-
Bad topic candidates. Lots of topics like "List_of_Intel_processor". I think we should simply nuke every topic whose title starts with "List_of".
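A trivial sketch of that filter (the function name is hypothetical):

```python
def drop_list_topics(candidates):
    """Discard Wikipedia-style list pages from a candidate topic set."""
    return [t for t in candidates if not t.startswith("List_of")]
```

Applied at candidate-generation time this is cheap; the open question is whether any "List_of_..." topic is ever a correct annotation target.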
-
Problems introduced by our SF-matching. Examples:
- "a fair deal" => Fair Deal [contextual problem? not sure, since the context is business].
- "the spike" => Spike (fictional character) [might be a contextual problem].
- "reached a record" => Reach Records [related to our tokenization/stemming: ["reach", "<fake_token>", "record"]].
- "eight weeks" => Eights Week [stemming on the numeral].
- Problems related to our duplication of titles, e.g. from "Energy firm complaints hit record high Energy firm complaints hit record high" we have extracted "high energy" => Particle Physics.
- "the referral" => "refer" => HTTP referrer.
- "the Burger" => Warren E. Burger.
-
Still missing general words:
- "lender" not attached to Creditor.
- "transformers"
- "How to Train Your Dragon" (movie)
- "Lender" not in the SF store.
-
Associations with bad words. Lots of topics are associated with pronouns, wh-words, prepositions, and discourse connectives. Examples:
- "due" => Tax.
- "general election" => United Kingdom general election, 2010; "midterm elections" => United States elections, 2010 (more specific concepts).
- Member-group metonymies. For example, there is an association between "Mark Reilly" (musician) and Matt Bianco (band); we extract "Mark Reilly" referring to a company's president and get the band topic as a result.
Other examples (surface form => topic):
- "in" => Inch
- "a short" => Short Film
- "made" => Professional wrestling authority figures
- "Napoleon" => Napoleon Bonaparte
- "interviews" => Interview
- "Steven Spielberg" => Steven Spielberg
- "Stanley Kubrick" => Stanley Kubrick
- "is" => James Bond
- "A.I" => Artificial intelligence
- "on" => Broadway theatre
- "abandoned" => Ghost town
- "archival" => Archive
- "of" => Schizophrenia
- "here" => Wisconsin
- "drew" => Drawing
- "production" => Filmmaking
- "documentary" => Documentary
- "he" => Hebrew Language
- "for" => The Lion King
- "them" => Breast implant
- "materials" => Materials Science
- "projects" => Project
- "Malcolm McDowell" => Malcolm McDowell
- "Lost" => Lost comet
- "collaborators" => Collaboration
- "know" => Knowledge
- "Narrated" => Narrative
- "discover" => Discovery
- "Some" => Ground beetle
- "featuring" => Guest appearance
- "lost" => Lost film
- "the work" => The Divine Comedy
- "were" => Vârcolac
- "why" => Decimation
- "never" => Nevers
- "many" => Righteous Among the Nations
- "finally" => Teleology
- "and" => Kevin Steen and El Generico
- "Lost" => Lost
- "abundant" => Abundance of the chemical elements
- "Aryan Papers" => Wartime Lies
- "examines" => Los Angeles Herald-Examiner
- "unfinished films" => Unfinished creative work
- "ultimately" => Ultimate Marvel
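Many of the examples above involve closed-class words ("in", "of", "is", "were", ...) acting as surface forms on their own. A first mitigation could be a function-word blacklist applied at spotting time; a sketch (the word list here is illustrative, not exhaustive):

```python
# Illustrative blacklist of closed-class words that should never spot alone:
# pronouns, prepositions, copulas, discourse connectives.
FUNCTION_WORDS = {
    "in", "on", "of", "for", "and", "is", "he", "them", "here",
    "why", "never", "many", "some", "were", "finally", "ultimately",
}

def spottable(surface_form):
    """Reject a candidate surface form if it is a bare function word."""
    return surface_form.lower() not in FUNCTION_WORDS
```

This would not help with content-word cases like "abandoned" => Ghost town, which probably need the contextual or annotation-probability fixes discussed earlier.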