Spell correction using seq2seq models is nothing new. Tal Weiss tried to tackle spell correction by training a seq2seq model on data generated from the Google News corpus. His original implementation, though it doesn’t work, inspired many to tackle spell correction with seq2seq. There were some unsuccessful attempts at this by others (Matthew Relich, Pavel Surmenok).
Unfazed by this, I decided to try and build a working spell corrector with deep learning. I started by identifying the major problems with the above implementations.
- Use of attention: Attention is very important for any sequence to sequence task. Without attention, the decoder has to rely on a single fixed-size encoding of the entire input while making predictions. So, I built a seq2seq model with an LSTM encoder, an LSTM decoder and “Luong” attention. I used the same model for punctuation correction in my previous post and achieved excellent results.
- Training data: Instead of relying only on generated data, I felt my goal would be better achieved by adding real world spelling mistakes too. So, I collected some data that reflects real world spelling mistakes, and for generating training data I built a clean text corpus of roughly 80 million sentences.
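To make the attention point concrete, here is a minimal sketch of one attention step in plain Python. It assumes the dot-product variant of Luong attention; the post does not say which score function the model uses, and the function name and toy vectors are made up for illustration.

```python
import math

def luong_dot_attention(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step using dot-product scores."""
    # Score every encoder state against the current decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    # Numerically stable softmax over the source positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(decoder_state)
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context

# Toy example: three source positions, hidden size 2 (illustrative values only).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = luong_dot_attention(decoder_state, encoder_states)
```

The decoder gets a fresh context vector at every step, so it can focus on the characters around the word it is currently correcting instead of squeezing the whole sentence into one vector.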
For the generation of data, I used a per word edit distance limit of 0.3 and also introduced real world spelling mistakes.
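A rough sketch of what this kind of data generation could look like is below. It is not the actual generation code: reading the 0.3 limit as "edits divided by word length" is an assumption, `REAL_MISTAKES` is a hypothetical stand-in for the collected real world misspellings, and a transposition is counted as a single edit.

```python
import random
import string

MAX_EDIT_RATIO = 0.3  # value from the post; treating it as edits / word length is an assumption

# Hypothetical stand-in for the collected real world misspellings.
REAL_MISTAKES = {"definitely": ["definately", "definitly"], "receive": ["recieve"]}

def random_edit(word, rng):
    """Apply one random character edit: delete, insert, substitute or transpose."""
    op = rng.choice(["delete", "insert", "substitute", "transpose"])
    i = rng.randrange(len(word))
    if op == "delete" and len(word) > 1:
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    if op == "substitute":
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i + 1:]
    if op == "transpose" and i < len(word) - 1:
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word  # the chosen edit was not applicable at this position

def corrupt_word(word, rng, real_mistake_prob=0.5):
    """Return a noisy version of word, preferring a known real world misspelling."""
    if word in REAL_MISTAKES and rng.random() < real_mistake_prob:
        return rng.choice(REAL_MISTAKES[word])
    budget = int(MAX_EDIT_RATIO * len(word))  # words shorter than 4 characters stay intact
    for _ in range(rng.randint(0, budget)):
        word = random_edit(word, rng)
    return word

def corrupt_sentence(sentence, seed=None):
    """Return the noisy input; the original sentence is the training target."""
    rng = random.Random(seed)
    return " ".join(corrupt_word(w, rng) for w in sentence.split())
```

Each clean sentence from the corpus then gives one (noisy input, clean target) pair for training.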
After 37 hours of training on a GTX 1080Ti the model achieved 0.99907 validation accuracy and the sample predictions looked excellent.
On the held out test data (30000 sentences), this model achieved a 0.9892 absolute accuracy, which is excellent. Just to test this model in the real world, I hosted it on a test server and asked some of my friends to give it a try. I was pretty sure it was going to work amazingly well.
To my dismay, the response from the “testers” was pretty underwhelming. Most of them reported the same thing. The model was excellent at contextual spell correction:
- “i wll b there for u” → “i will be there for you”
- “these is not that gre at” → “this is not that great”
- “i dotlike this” → “i dont like this”
But it sucked at recognising names of people and some other trivial things. For instance, it changed
- “i am jayadeep” → “i am a jay deep”
- “i work at reckonsys” → “i work in reckoning”
- “bedapudi works at abzooba he is a great person” → “beds works about he is a great person.”
Even though the network’s precision and recall were excellent on the held-out data, on real world input the number of false positives was higher than ideal. This indicated that the model’s predictions were bad for words it had never seen during training. One reason is that the model isn’t able to understand that proper nouns shouldn’t be spell corrected.
This is because we introduced errors in proper nouns too while generating the training data, so the model memorised the “corrected” proper nouns and tries to correct unseen words into one of the memorised words. This could most probably be fixed either by applying a max edit distance rule for each word while decoding, or by not introducing errors in proper nouns while training. But using logic based decoding defeats the purpose of using DL, and teaching the model not to correct proper nouns would leave it unable to correct simple mistakes like “jahn” → “john”.
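For reference, a max edit distance rule of the kind mentioned above could be sketched as a post-processing step like the one below. This is an illustration and not something the model does: the threshold of 2 and the function names are assumptions, and the sketch only works when the input and the prediction have the same number of words.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def apply_edit_limit(source, predicted, max_distance=2):
    """Keep a predicted word only if it is within max_distance edits of the input word."""
    src_words, pred_words = source.split(), predicted.split()
    if len(src_words) != len(pred_words):
        # No word-by-word alignment is possible, so the rule cannot be applied.
        return predicted
    kept = [p if levenshtein(s, p) <= max_distance else s
            for s, p in zip(src_words, pred_words)]
    return " ".join(kept)

fixed = apply_edit_limit("i work at reckonsys", "i work in reckoning")
```

On this example the rule restores “reckonsys” (3 edits away from “reckoning”) but still lets “at” → “in” through, because that change is only 2 edits. It also does nothing for “i am jayadeep” → “i am a jay deep”, where the word counts differ, which shows how quickly such hand-written rules pile up.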
This made me re-evaluate the approach of data generation for spelling correction and realise that for generic spelling correction, going with classical, i.e. edit distance based, models might be a better idea.
I evaluated SymSpell and JamSpell and the results were excellent. Though they cannot do (good) contextual correction, for simple spell correction these libraries are simply phenomenal. SymSpell by Wolf Garbe in particular is blazing fast and can perform word segmentation too. But these algorithms can’t effectively correct homonyms and other simple contextual errors (“I ate a apple” → “I ate an apple”).
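To illustrate both the strength and the blind spot of edit distance based correction, here is a toy dictionary-lookup corrector in the style of Peter Norvig's well-known spelling corrector. It is not SymSpell's or JamSpell's algorithm, and the tiny `WORD_FREQ` dictionary is made up for the example.

```python
import string

# Toy frequency dictionary; a real one would be built from a large corpus.
WORD_FREQ = {"i": 100, "ate": 20, "a": 90, "an": 60, "apple": 15}

def edits1(word):
    """All strings that are one delete, transpose, replace or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct_word(word):
    """Leave known words alone; otherwise pick the most frequent word one edit away."""
    if word in WORD_FREQ:
        return word
    candidates = [w for w in edits1(word) if w in WORD_FREQ]
    return max(candidates, key=WORD_FREQ.get) if candidates else word

def correct_sentence(sentence):
    return " ".join(correct_word(w) for w in sentence.lower().split())

fixed_typo = correct_sentence("i ate an aple")
missed_context = correct_sentence("I ate a apple")
```

The misspelt “aple” is fixed because it is not in the dictionary, while “a apple” is left untouched because every word in it is a valid dictionary word. Words are looked at one at a time, so the context never comes into play.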
Alex Paino’s Deep Text Corrector serves as an excellent proof of concept for context based text correction. But its functionality is very limited because
- It is a word level model (thus limiting its ability to expand to unseen words and making training on large datasets very, very tough)
- It uses a logic based decoding step where the word predicted is ignored if it’s not present in a set of pre-defined corrective tokens.
- The data generation used was limited to article correction and a few homophone corrections.
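The logic based decoding step from the list above can be pictured roughly like this. It is a simplified sketch and not Deep Text Corrector's actual code: the token set, the ranked candidate lists, and the choice to also allow words from the input sentence are assumptions made for illustration.

```python
# Hypothetical set of pre-defined corrective tokens.
CORRECTIVE_TOKENS = {"a", "an", "the", "their", "there", "then", "than"}

def constrained_decode(source_tokens, ranked_candidates):
    """Pick, at every step, the best candidate that is an input word or a corrective token.

    ranked_candidates[t] holds the model's candidate words for step t, best first.
    """
    allowed = set(source_tokens) | CORRECTIVE_TOKENS
    output = []
    for candidates in ranked_candidates:
        # Any predicted word outside the allowed set is ignored.
        choice = next((w for w in candidates if w in allowed), None)
        if choice is not None:
            output.append(choice)
    return output

source = ["i", "ate", "a", "apple"]
ranked = [["i"], ["ate"], ["an", "a"], ["apples", "apple"]]
decoded = constrained_decode(source, ranked)
```

The filter lets the “a” → “an” fix through but throws away “apples”, so the model can never produce a correction outside the hand-picked token set.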
Planning to expand on this, I created a correction dataset by introducing homonym, homophone and common grammatical errors. Since I had already exhausted my resources training the spell correction model, I was only able to train this model for 5 hours on data generated from 1.4 million sentences (Tatoeba + Cornell movie lines).
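A minimal sketch of this kind of error injection is shown below. The confusion sets, the `error_prob` parameter and the function name are hypothetical; the sets used for the real dataset are not listed in this post.

```python
import random

# Hypothetical confusion sets: words that are commonly used in place of each other.
CONFUSION_SETS = [
    {"a", "an"},
    {"their", "there"},
    {"then", "than"},
    {"to", "too"},
]
# Map every word to the alternatives it can be wrongly swapped with.
CONFUSIONS = {w: sorted(s - {w}) for s in CONFUSION_SETS for w in s}

def make_training_pair(sentence, error_prob=0.5, seed=None):
    """Return (noisy_sentence, clean_sentence) with confusable words swapped at random."""
    rng = random.Random(seed)
    noisy = []
    for word in sentence.split():
        if word in CONFUSIONS and rng.random() < error_prob:
            noisy.append(rng.choice(CONFUSIONS[word]))
        else:
            noisy.append(word)
    return " ".join(noisy), sentence

pair = make_training_pair("i ate an apple", error_prob=1.0, seed=0)
```

With `error_prob=1.0` every confusable word is swapped, so the example turns into the “i ate a apple” → “i ate an apple” pair. A lower probability keeps some sentences clean, so the model also sees inputs that need no correction.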
Since I trained the model on Tatoeba and Cornell, I tested it with data generated from the Google News 2013 corpus. This model achieved 0.8921 absolute accuracy (number of perfectly corrected lines / total number of lines). Since the data generation technique is the same for the training and testing data, I also ran the same real world test I used for spell correction. Of the 134 sentences tried by real world users, the model was able to perfectly correct the grammar in 62 sentences and the punctuation in 131 sentences. Of the 72 sentences where the model failed to correct the grammar, 48 were “out of scope”, i.e. they contained verb form or singular/plural errors.
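The absolute accuracy metric and the user-test numbers above can be worked through in a few lines. The function name is made up for illustration, and the “in scope” rate at the end is a derived figure that assumes the 48 out of scope sentences are simply excluded.

```python
def absolute_accuracy(predictions, references):
    """Fraction of lines where the prediction matches the reference exactly."""
    if not references or len(predictions) != len(references):
        raise ValueError("need two non-empty lists of equal length")
    exact = sum(p == r for p, r in zip(predictions, references))
    return exact / len(references)

# Figures reported above for the sentences tried by real world users.
total, grammar_ok, punctuation_ok, out_of_scope = 134, 62, 131, 48

failed = total - grammar_ok                           # 72 sentences
grammar_rate = grammar_ok / total                     # about 0.4627
punctuation_rate = punctuation_ok / total             # about 0.9776
# Derived, not reported: accuracy if out of scope sentences are left out.
in_scope_rate = grammar_ok / (total - out_of_scope)   # 62 / 86, about 0.7209
```

So roughly 46% of the user sentences were fully corrected, rising to about 72% when only the error types the model was trained on are counted.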
I plan to train the same model on the 80 million sentence dataset that I curated along with further additions to the scope of correction including word form correction.
The code and pre-trained models are available at https://github.com/bedapudi6788/deepcorrect. (The data and the pre-trained model will be made available by 25-Dec. I am currently trying to re-train the model with some more data). The demo will be available at http://bpraneeth.com/projects after I train the model on 80 million sentences.