DeepSegment 2.0: Multilingual text segmentation with vector alignment
My first post (https://github.com/bedapudi6788/deepsegment) tackled the problem of sentence segmentation for text with bad or no punctuation. Although the absolute accuracy reported there might seem low, the model performs excellently in the real world (as explained in the update to the original post).
While exploring vector alignment, I realised that for some combinations of languages, aligned vectors can be used to build multilingual models. Before going into the results of the multilingual DeepSegment, I will briefly touch upon what vector alignment is and its advantages.
Vector alignment is a simple but effective concept. When we train a word-vector model (e.g. FastText, word2vec, GloVe) on a corpus, the vectors we get encode the semantic similarity of words within that corpus. But when we train FastText or GloVe on two different corpora, the two vector spaces are not directly comparable: the same word can end up at completely different coordinates in each space.
Take a look at the image below.
Since there is no universal ground truth, the word-vector models can't know that “apple” in corpus 1 is the same as “apple” in corpus 2. To overcome this, we need a list of ground-truth pairs stating that “word_1” in corpus_1 corresponds to “word_m” in corpus_m. Once we have these pairs, we rotate and transform the vector spaces to minimise the distance between the paired words.
Note that vector-space alignment only rotates and transforms the vector spaces. The individual word vectors change, but the distances between them do not, i.e. if d(apple, orange) = 0.8 before alignment, it remains 0.8 after alignment, even though the vectors themselves change.
A method for vector alignment, along with example scripts and ground-truth dictionaries, can be found here, together with the accompanying paper.
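To make the idea concrete, here is a minimal sketch of one standard alignment technique, orthogonal Procrustes, assuming we already have embedding matrices X (source) and Y (target) whose rows correspond to the ground-truth word pairs. The names and toy data below are illustrative, not the actual scripts or dictionaries linked above.

import numpy as np

def procrustes_align(X, Y):
    # Return the orthogonal W minimising ||X @ W - Y||, via SVD of the
    # cross-covariance matrix (the classic orthogonal Procrustes solution).
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 50))                  # target-space vectors for 200 word pairs
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))  # a random orthogonal map
X = Y @ Q.T                                     # source space: same geometry, rotated

W = procrustes_align(X, Y)
print(np.allclose(X @ W, Y))                    # True: the spaces now line up
# As noted above, rotation leaves pairwise distances untouched:
print(np.isclose(np.linalg.norm(X[0] - X[1]),
                 np.linalg.norm(X[0] @ W - X[1] @ W)))  # True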
Vector alignment is very important for multilingual models since all layers share the same weights for every input, regardless of language. Imagine a layer with weights w1, w2, w3, …, wn. If the inputs are not normalised into a common space, the same concept arrives at the layer as unrelated vectors in each language, so the weights (w1, …) can't be learned effectively and the model will not converge.
With this in place, I trained DeepSegment on English, French and Italian, using their aligned vectors as input, to get a single model that performs sentence segmentation for all three languages. In my next post, I will show how to build zero-shot named entity recognition and zero-shot text classification, and demonstrate that pre-training with aligned vectors helps for languages with little data.
As with DeepSegment v1, I generated one million examples for each of these languages, with the label B-sent marking the beginning of each sentence.
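The generation procedure itself is described in the original post; as a rough illustration, one plausible scheme is to strip punctuation and casing from real sentences, concatenate a few of them, and tag each sentence-initial word with B-sent (all other words get O). The helper below is a hypothetical sketch, not the actual training-data script.

import random
import re

def make_example(sentences, max_sents=3):
    # Join up to max_sents cleaned sentences and emit per-word labels:
    # 'B-sent' for a sentence-initial word, 'O' for everything else.
    chunk = random.sample(sentences, random.randint(1, max_sents))
    words, tags = [], []
    for sent in chunk:
        tokens = re.sub(r"[^\w\s']", "", sent).lower().split()
        if tokens:
            words.extend(tokens)
            tags.extend(["B-sent"] + ["O"] * (len(tokens) - 1))
    return words, tags

corpus = ["I am Batman.", "Where is the remote?", "It rained all day."]
print(make_example(corpus))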
After 26 epochs, the model converged with a validation F1 score of 95.56 for the B-sent label.
The test results can be seen in the image below.
Although the scores for the single-language models are slightly higher, the difference is minuscule and the multilingual model performs excellently for all three languages. In future iterations, I intend to make DeepSegment available for most major (and not-so-major) languages, along with DeepPunct.
The pre-trained model can be downloaded from here. Alternatively, just install the latest version of DeepSegment (to be released by Feb 05) and use the built-in download function.
pip install --upgrade deepsegment

import deepsegment
deepsegment.download('eng_fra_ita')
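Once the model is downloaded, segmentation should work through DeepSegment's usual interface; the snippet below follows the v1 API from the repo README, and the exact language identifier to pass is an assumption here.

from deepsegment import DeepSegment

segmenter = DeepSegment('en')  # identifier is an assumption; check the README for exact names
print(segmenter.segment('I am Batman I live in Gotham'))
# expected output: ['I am Batman', 'I live in Gotham']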