Tokenization is a common task in natural language processing that takes a character sequence as input and splits it into pieces called tokens:
In our village, folks say God crumbles up the old moon into stars.
In our village , folks say God crumbles up the old moon into stars .
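To make the transformation concrete, here is a minimal tokenizer sketch using only the standard library. Real tools such as those in NLTK handle many more cases (contractions, abbreviations, quotes); this regex-based version is just an illustration of splitting words from punctuation.

```python
import re

def tokenize(text):
    # \w+ matches runs of word characters; [^\w\s] matches a single
    # punctuation mark, so "village," splits into "village" and ","
    return re.findall(r"\w+|[^\w\s]", text)

raw = "In our village, folks say God crumbles up the old moon into stars."
print(" ".join(tokenize(raw)))
# → In our village , folks say God crumbles up the old moon into stars .
```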
One natural language processing application where tokenization is useful is statistical machine translation (SMT). SMT finds the sentence in the target language that has the highest probability of being the correct translation of a given sentence in the source language. For this purpose, we use a set of statistical models trained on large sets of translation examples in both languages.
Tokenization improves both model estimation and the subsequent translation process, since texts are broken into atomic units that can be analyzed independently (e.g., after tokenizing the raw sentence above, "village," is no longer considered a single word but two different ones: "village" and ",").
After tokenizing the input sentence and translating it with an SMT system, the result is in tokenized format, and should be converted to raw text. This process is called detokenization.
It is relatively easy to find publicly available tokenization tools, such as those included in the NLTK library. By contrast, detokenizers are not so common. Typically, they are implemented as a very complex set of if statements and regular expressions, resulting in code that is difficult to create, maintain, and extend.
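To see why the rule-based approach gets unwieldy, here is a deliberately naive detokenizer sketch: each punctuation case needs its own regular expression, and the rule list grows quickly once quotes, currency symbols, contractions, and language-specific conventions are considered.

```python
import re

def detokenize(tokens):
    text = " ".join(tokens)
    # Attach closing punctuation to the word on its left
    text = re.sub(r" ([,.;:!?])", r"\1", text)
    # Opening brackets attach to the right, closing brackets to the left
    text = re.sub(r"\( ", "(", text)
    text = re.sub(r" \)", ")", text)
    # ...and so on, one fragile rule per case
    return text

tokens = "In our village , folks say God crumbles up the old moon into stars .".split()
print(detokenize(tokens))
# → In our village, folks say God crumbles up the old moon into stars.
```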
In this two-part article, we will describe how a detokenizer can be implemented following a machine learning approach, where a detokenizing model is first estimated (part I) and later applied in a decoding process to convert tokenized text into detokenized text (part II).
More specifically, the strategy proposed here is to build a very simple SMT system whose purpose is to translate tokenized text into detokenized text. This system is included in the Thot toolkit for SMT.
Statistical Machine Translation Models
SMT is based on statistical models to infer new translations. Among the different models used, the main ones are the language model and the phrase alignment model.
Language models measure the fluency of a given translation as a sentence in the target language. One popular implementation is the so-called n-gram language model, which assigns probabilities to individual words given the n-1 preceding ones. The most basic n-gram models are obtained by collecting n-gram word counts from a monolingual corpus:
In our village     1
our village ,      1
village , folks    1
...
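The counts above can be collected with a few lines of Python; a real language model would additionally smooth these counts and turn them into probabilities, but this sketch shows where the raw statistics come from.

```python
from collections import Counter

def ngram_counts(sentences, n=3):
    # Collect counts of every n consecutive tokens in the corpus
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

corpus = ["In our village , folks say God crumbles up the old moon into stars ."]
counts = ngram_counts(corpus)
print(counts[("In", "our", "village")])  # → 1
```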
On the other hand, phrase alignment models measure the adequacy of a target sentence as a translation of the source sentence. In this context, a phrase is a sequence of consecutive words from the source or target sentence. Typically, a phrase model is a dictionary of phrase pairs with probabilities (let's assume we're translating from Spanish to English):
En nuestra aldea ||| In our village ||| 1
luna ||| moon ||| 1
estrellas ||| stars ||| 1
...
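Conceptually, such a phrase table is just a nested dictionary from source phrases to target phrases with probabilities. The sketch below mirrors the Spanish-English entries above (the data structure is an illustration, not Thot's internal representation).

```python
# Phrase table as a dictionary: source phrase -> {target phrase: probability}
phrase_table = {
    ("En", "nuestra", "aldea"): {("In", "our", "village"): 1.0},
    ("luna",): {("moon",): 1.0},
    ("estrellas",): {("stars",): 1.0},
}

# Look up the probability of translating "luna" as "moon"
print(phrase_table[("luna",)][("moon",)])  # → 1.0
```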
One important problem when working with SMT models is data scarcity. Language model estimation requires large monolingual texts in the target language, while phrase alignment models are trained from bilingual examples that are even harder to obtain. However, as will be explained below, this difficulty disappears for the detokenization SMT system we propose, since training corpora can be generated easily with nothing more than a tokenizer.
Language Model for Detokenization
We propose using an n-gram language model trained on detokenized text to implement the language model of our detokenizer, since in this setting detokenized text constitutes the target language. The following is an example of how this language model would look:
In our village,    1
our village, folks 1
village, folks say 1
...
Phrase Alignment Model for Detokenization
Our phrase alignment model for detokenization translates from tokenized to detokenized text. Raw and tokenized texts similar to the example shown at the beginning of this article constitute the training data we should use to estimate this model. Generating such data is trivial, since we only need raw text and a tokenizer. Once the model is estimated, its entries would resemble the following:
In ||| In ||| 1
village , ||| village, ||| 1
stars . ||| stars. ||| 1
...
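The claim that generating training data is trivial can be made concrete: every raw sentence is simply paired with its tokenized version, so a monolingual corpus and a tokenizer are all that is needed. A sketch, using a simple regex tokenizer as a stand-in for whatever tokenizer is actually in use:

```python
import re

def tokenize(text):
    # Stand-in tokenizer: separate words from punctuation
    return " ".join(re.findall(r"\w+|[^\w\s]", text))

raw_corpus = ["In our village, folks say God crumbles up the old moon into stars."]

# Each training example pairs a tokenized (source) sentence
# with its raw (target) counterpart
training_pairs = [(tokenize(line), line) for line in raw_corpus]
for tokenized, raw in training_pairs:
    print(tokenized, "|||", raw)
```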
Improving Model Generalization Capability
The generalization capability of the models can be improved by categorizing the text. In this context, categorization consists of replacing certain words by a label designating their category. In our implementation, we have defined the following categories:
- Digits: numbers below 10 are replaced by a digit label.
- Numbers: numbers above 10 are replaced by a number label.
- Alphanumeric strings: strings containing both numbers and letters are replaced by an alphanumeric label.
- Common words: words not classified under any of the previous categories and composed of more than five characters are replaced by the <common_word> label.
When a word does not fall into any of the previous categories, it is left unmodified. This categorization process is applied when training the models, which improves model quality and substantially reduces model size. Here is how the phrase model entries shown above would look after categorization:
In ||| In ||| 1
<common_word> , ||| <common_word>, ||| 1
stars . ||| stars. ||| 1
...
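The categorization step can be sketched as a small function. The <common_word> label matches the phrase-model example above; the other label names (<digit>, <number>, <alfanum>) are illustrative assumptions, not necessarily the strings Thot uses internally.

```python
def categorize(word):
    # Purely numeric tokens: digit label below 10, number label otherwise
    if word.isdigit():
        return "<digit>" if int(word) < 10 else "<number>"
    # Mixed letters and digits (label name is an assumption)
    if any(c.isdigit() for c in word) and any(c.isalpha() for c in word):
        return "<alfanum>"
    # Long ordinary words become <common_word>, as in the example above
    if len(word) > 5:
        return "<common_word>"
    # Short words and punctuation are left unmodified
    return word

print([categorize(w) for w in "village , folks crumbles 7 42 A380 .".split()])
# → ['<common_word>', ',', 'folks', '<common_word>', '<digit>', '<number>', '<alfanum>', '.']
```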
The detokenizer we are proposing is implemented by the Thot toolkit. In particular, the toolkit provides the thot_detokenize tool, which detokenizes input text given a raw text file in the language of interest that is used for training the models:
thot_detokenize -f <tokenized_text> -r <raw_training_text>
thot_detokenize may also be given an additional text file obtained by tokenizing the raw text file with a particular tokenizer. When used in this way, the detokenizer learns to detokenize the output of that tokenizer instead of its native one. Internally, thot_detokenize uses the Python tool thot_train_detok_model to train the language and translation models. For more details, interested readers can inspect the model classes, such as TransModel, in the toolkit's Python library.