DeepCorrection 1: Sentence Segmentation of unpunctuated text.


Sentence segmentation, or sentence boundary detection, is one of the foremost problems of NLP and is generally considered to be solved. While working on a text correction module that includes punctuation correction, spell correction and common grammar error correction, I realised that to do any of these, my model(s) should first be able to correctly segment the input text.

There are various libraries, including some of the most popular ones like NLTK, Spacy and Stanford CoreNLP, that provide excellent, easy-to-use functions for sentence segmentation. Before you start thinking that this is just another survey post that adds nothing new to the topic, let’s take a look at how these libraries segment the text “I am Batman. I live in Gotham.”

NLTK works perfectly on this text.
So does Spacy.
CoreNLP too, works perfectly.

Now, let’s see how these algorithms deal with the text “I am Batman I live in Gotham” which is but a small modification of our original sentence.

NLTK and Spacy fail (to no one’s surprise)
and so does CoreNLP

This is, of course, well-known and expected behaviour, since all the popular modules use either statistical models or heavily language-dependent patterns to perform sentence boundary detection. These libraries work exactly as they are supposed to, i.e. they work near perfectly on well-formatted text and fail miserably on text with bad punctuation, wrong capitalisation, etc.

So, I decided to give this a try. My prerequisites were that

Obtaining the Data:

My idea was to obtain a set of gold-standard English sentences and combine them randomly to generate my training data. But obtaining a large multi-domain corpus of gold-standard sentences proved tough, and I decided to go with a Wikipedia dump segmented using Spacy/CoreNLP to get the sentences. But again, this didn’t feel right, since 1. Wikipedia has very few examples of general conversational and first-person text, and 2. Spacy/CoreNLP can’t do 100% perfect segmentation. So my data would be nowhere near gold standard.

This is when I remembered an open data initiative aimed at translation and speech recognition. Their English corpus has close to one million sentences covering a broad variety of writing styles, which can be used to generate a huge amount of training data. The data for all the languages can be downloaded from

I used simple logic that combines a random number of sentences from the corpus and removes or changes the punctuation and casing to generate the training text. The data generation scripts and pre-trained models will be available at The data generated can be found at
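As a rough sketch of the generation logic described above (not the actual script from the repo): the function name `make_example`, the probabilities and the set of stripped punctuation characters are all assumptions made for illustration.

```python
import random

# Characters treated as punctuation in this sketch (an assumption).
PUNCT_TABLE = str.maketrans("", "", ".,!?;:")

def make_example(sentences, rng, max_sentences=4,
                 drop_punct_prob=0.5, lower_prob=0.5):
    """Join a random number of sentences into one noisy training text.

    Returns the noisy text and the original sentences it was built from.
    """
    n = rng.randint(1, min(max_sentences, len(sentences)))
    chosen = rng.sample(sentences, n)
    noisy = []
    for sent in chosen:
        if rng.random() < drop_punct_prob:
            sent = sent.translate(PUNCT_TABLE)  # remove punctuation
        if rng.random() < lower_prob:
            sent = sent.lower()                 # change the casing
        noisy.append(sent)
    return " ".join(noisy), chosen

corpus = ["I am Batman.", "I live in Gotham.", "Who are you?"]
text, gold = make_example(corpus, random.Random(0))
print(text)
print(gold)
```

The original sentences are returned alongside the noisy text so that the sentence boundaries can later be turned into training labels.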

Deciding on the Model:

After some thinking about the architecture(s) that would be best suited for this task, I concluded that either sequence-to-sequence models or sequence tagging models could be used for the task at hand.

If I went with sequence-to-sequence models, the training data would need to be huge, in the tens of millions of sentences, since the model has to learn the context of each character from scratch, and the training time would be very high. Leveraging pre-trained word embeddings such as Glove/ELMo would either not be possible or would make training very tough. There is also another approach to this problem, where we restore punctuation with a seq2seq model and then apply general segmentation techniques. This approach isn’t scalable, as training seq2seq requires a lot of data and gets considerably harder with large texts. In fact, the correct way of doing punctuation restoration would be the exact reverse, i.e. perform sentence segmentation on the unpunctuated text and use seq2seq for punctuation correction at the sentence level instead of on the whole text.

So, I decided to start with a BiLSTM+CRF sequence tagging model, along with pre-trained word embeddings, and observe the results before planning the next step. Although we could implement an arguably better-performing model than this particular implementation (e.g. ELMo + BiLSTM CRF or BERT + BiLSTM CRF), the idea here is to get a baseline model as quickly as possible, so I decided to start with the Glove + BiLSTM CRF implementation.

I started the training with 1 million examples as the training data and 100,000 examples as validation data. I followed the standard BIO format for labelling the data. The beginning of a sentence is labelled “B-sent” and all other words are assigned the label “O”. So, for our example text “I am Batman I live in Gotham”, both occurrences of “I” are labelled “B-sent” and the remaining five words are labelled “O”.
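As a minimal sketch of that labelling scheme, the following assigns a label to every word given the gold sentences. The helper name `bio_labels` is hypothetical and not taken from the project's code.

```python
def bio_labels(sentences):
    """Label the first word of each sentence B-sent and every other word O."""
    pairs = []
    for sentence in sentences:
        for i, word in enumerate(sentence.split()):
            pairs.append((word, "B-sent" if i == 0 else "O"))
    return pairs

# The example text, given as its two gold sentences.
for word, label in bio_labels(["I am Batman", "I live in Gotham"]):
    print(f"{word}\t{label}")
```

Each printed line holds one word and its label, so the tagger sees the text as a flat word sequence and only has to decide where a new sentence starts.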

On the 1080Ti the training took ~24 minutes per epoch and reached convergence in 14 epochs at a validation F1 score of 0.9639.

Measuring the accuracy:

Since the validation F1 score doesn’t tell us much about the actual sentence segmentation accuracy in comparison to the existing algorithms, I devised the testing methodology as follows.

Test data:

Note that these texts are created at random from 20,000 sentences that were set aside at random.

The absolute accuracy in each case is scored as the total number of correctly segmented texts divided by the total number of examples, i.e. a text is considered correctly segmented only if the module splits it into exactly the sentences into which it is supposed to be split. The tester scripts are available in the git repo.
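A minimal sketch of this metric, assuming each text's segmentation is represented as a list of sentence strings (the function name `absolute_accuracy` is illustrative and may differ from the tester scripts):

```python
def absolute_accuracy(predicted, expected):
    """Fraction of texts whose predicted sentences match the expected ones exactly."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)

expected = [["I am Batman", "I live in Gotham"], ["Hello there"]]
predicted = [["I am Batman I live in Gotham"], ["Hello there"]]
print(absolute_accuracy(predicted, expected))  # 0.5
```

A text with a single missed or extra boundary counts as entirely wrong, which is what makes this score much stricter than per-word metrics.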

Comparison of absolute accuracy

DeepSegment achieved an average absolute accuracy of 73.35%, outperforming both Spacy and NLTK by a wide margin.

The Performance:

DeepSegment took 139.57 seconds to run on the entire dataset, compared to NLTK’s 0.53 seconds and Spacy’s 54.63 seconds, on a dual-core i5 MacBook Air. When run on a modest 4 GB GTX 960M with batch inference (batch size 64), DeepSegment took 2.6 seconds on the same test data. Note that Spacy and NLTK can’t benefit from the use of a GPU.


A few people messaged me enquiring about the performance of DeepSegment on unpunctuated text. The reported 52.637% is the absolute accuracy score (see above for the definition). I avoid reporting precision/recall wherever possible, because these scores make the accuracy look much higher than it is. I honestly believe that people building real-world systems should keep in mind that the F1 score is usually going to look much higher than the actual accuracy. (Not true for all cases, but in general.)

For example, take a look at the precision, recall and F1 scores of the above model.

Precision, Recall and F1-scores for labels B-sent and O

For the completely unpunctuated test case, the absolute accuracy is 52.637% (as reported) and the F1 score for the label B-sent is 91.33 (precision: 93.242, recall: 89.506). Similarly, for the other test cases the F1 score is much higher than the absolute accuracy (as can be seen in the image above).
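To see why the two numbers diverge, here is a small sketch on made-up toy tags (not the actual test data): two texts of four sentences each, where the model misses a single boundary in the first text. The helper name `prf` is hypothetical.

```python
def prf(gold, pred, label="B-sent"):
    """Precision, recall and F1 for one label over flat tag sequences."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy data: each text has four three-word sentences.
gold_text = ["B-sent", "O", "O"] * 4
missed_one = list(gold_text)
missed_one[6] = "O"  # the third sentence boundary is missed

gold_texts = [gold_text, gold_text]
pred_texts = [missed_one, gold_text]

# Per-word scores are computed over all words of all texts.
flat_gold = [tag for text in gold_texts for tag in text]
flat_pred = [tag for text in pred_texts for tag in text]
precision, recall, f1 = prf(flat_gold, flat_pred)

# Absolute accuracy only counts texts that are tagged perfectly.
exact = sum(g == p for g, p in zip(gold_texts, pred_texts)) / len(gold_texts)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"absolute accuracy={exact:.3f}")
```

In this toy case one missed boundary out of eight leaves the F1 score above 0.93, while the absolute accuracy drops to 0.5 because one of the two texts is no longer segmented exactly.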

Next steps:


Check out for multi-lingual segmentation with a single model.



Senior NLP Engineer - DeepAffects
