
@darthbhyrava
Last active July 2, 2019 09:58
On Spellcheckers for Chatbots

Requirements

  • Real-time Spell Checker
  • Can be used in a chatbot
  • Preferably using ML/DL

ML/DL Based Solutions

  • (Supervised) Spelling transformation vectors, which capture the patterns in the differences between GloVe embeddings of correct and incorrect spellings. link If we use sub-word level embeddings (with fasttext, say), things get much better, as shown in this article by Haptik.ai, a large-scale Indian chatbot maker.
  • (Semi-supervised) A pipeline which preprocesses correct spellings into incorrect counterparts and trains a seq2seq model on the pairs. link. There is also another approach with code
  • (Unsupervised) My friend and college senior has an ACL workshop paper with my lab on automatic spelling correction for resource-scarce languages using seq2seq. link. I could ask her for advice.
  • (Unsupervised) An EMNLP paper which uses an n-gram language model, requires no annotation, and claims a 3.8% error rate in English. link
  • (Semi-supervised) Tal Weiss' article on a deep learning ensemble which leverages search engine queries link code. Another implementation code
  • Under Armour's context sensitive deep learning approach (only discussed in theory) link
  • A Medium article on using bidirectional LSTMs for spelling correction link
  • A phonemic approach for a resource-scarce language, as proposed in an answer here
  • A paper which employs clustering algorithms for spell checking. link
  • A paper which discusses an ensemble which leverages Restricted Boltzmann machines in a deep autoencoder for spelling correction. link
  • An auto-encoder+CNN approach for spell correction link
  • Another attention based seq2seq model for grammatical correction link
  • This company just has an online API which works well. I'd be very pleased if we built something like that.
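The sub-word intuition behind the fasttext-based approaches above can be illustrated without any trained embeddings: represent each word by its character n-grams and pick the vocabulary word with the most overlap. A minimal sketch (the toy vocabulary and Jaccard similarity here are illustrative placeholders, not what fasttext actually computes):

```python
from collections import Counter

def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, as fasttext does for subwords."""
    padded = f"<{word}>"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def similarity(a, b, n=3):
    """Jaccard overlap of the two words' character n-gram multisets."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    inter = sum((ga & gb).values())
    union = sum((ga | gb).values())
    return inter / union if union else 0.0

def correct(word, vocabulary, n=3):
    """Return the vocabulary word with the highest n-gram overlap."""
    return max(vocabulary, key=lambda v: similarity(word, v, n))

vocab = ["spelling", "checker", "chatbot", "correction"]
print(correct("speling", vocab))  # -> "spelling"
```

Because misspellings share most of their character n-grams with the intended word, this degrades gracefully on out-of-vocabulary typos, which is exactly why subword embeddings beat whole-word GloVe vectors in the Haptik.ai write-up.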

Rule Based Solutions

  • As far as I know, Hunspell is the current industry standard.
  • Chatbot maker DeepPavlov uses a language model, a statistical error model, and a simple edit-distance checker for their spell checker, as shown here
  • The really awesome SymSpell, which most people use when they're not using Hunspell.
  • The less impressive Aspell, which gives us some data
  • Peter Norvig's build-from-the-basics approach link. It has many different implementations and was the industry standard in its day. link
  • An old tutorial by LingPipe here
  • A CICLING paper on a weighted finite-state language model for spell checking link
  • A COLING paper on a contextual LM-based spell checker and diacritic completer. link
  • A simple rule-based approach for a CLIN28 task. code
  • An aptly named paper on semi-character-level RNNs for "Robsut Wrod Reocginiton" (the scrambled title is deliberate). link
  • Optimized solutions for Norvig's approach link1 link2
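Norvig's build-from-the-basics approach listed above fits in a few lines: generate all candidates within a small edit distance, keep the ones seen in a corpus, and rank by frequency. A condensed sketch, with a toy corpus standing in for the large word-count file Norvig actually uses:

```python
import re
from collections import Counter

# Toy corpus; Norvig's original builds WORDS from ~1M words of running text.
CORPUS = "the quick brown fox jumps over the lazy dog the dog barks"
WORDS = Counter(re.findall(r"\w+", CORPUS.lower()))

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    return {w for w in words if w in WORDS}

def correction(word):
    """Most frequent known candidate at edit distance 0, then 1, then 2."""
    candidates = (known([word]) or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=lambda w: WORDS[w])

print(correction("teh"))  # -> "the"
```

The optimized solutions linked above mostly attack the same bottleneck: `edits1` generates tens of thousands of candidates at distance 2, which SymSpell avoids by precomputing deletes only.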

As a follow-up, I found this forum thread where Sebastian Ruder himself chips in with some input. Lots of leads there that we could discuss. There is a Bing blog post which gives some pointers as well. If we have the time, I could also go through Google's videos on search to see if I can learn anything. And, quite predictably, there are SO threads talking about how Google does it. link. Based on what I've read, Google uses seq2seq methods in its core ML algorithms as well, as evidenced by this post. There are also some Reddit posts which talk about vector transformations using Facebook's fasttext vectors.

Some thoughts

To start off, I think an architecture built around a seq2seq model is the way to go. The fasttext word embeddings, Peter Norvig's basic rules, some edit-distance metrics, etc. are all things to pad around that core, IMO. We also need to adapt to whatever constraints we have, while leveraging the vast amounts of data we possess. I have a few more questions as well:

  • How fast do we want the corrections to be?
  • Apart from time, what other constraints does the in-chat spell checker operate under?
  • How can we leverage the search queries that we already have? (for example, identifying the most common type of errors and then building our model primarily to address those errors)
  • How can we leverage the catalogue data that we already have?
  • How should we store previous spelling corrections, and would doing so be beneficial?
  • How will the UI and UX work with this spell checker for Moi?
  • Are we looking for just spelling corrections, or do we need to have a grammar check as well?
  • What is the existing spell-complete algorithm? Are we trying to improve that as well?
  • What should our evaluation metrics be?
  • How deep do we go into the whole process? Should I be exploring state-of-the-art architectures from top NLP conferences?
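The edit-distance metrics mentioned above reduce to a short dynamic program; a minimal Levenshtein sketch (insert, delete, substitute, each at unit cost):

```python
def levenshtein(a, b):
    """Edit distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

print(levenshtein("speling", "spelling"))  # -> 1
print(levenshtein("kitten", "sitting"))    # -> 3
```

Whatever ranking model sits at the core, a cheap distance like this is useful both for pruning candidates before scoring and as one of the evaluation metrics asked about above.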