Machine translation (MT) is a fundamental technology that is gaining more importance each day in our multilingual society. Companies and individuals alike are turning their attention to MT to dramatically cut down their expenses on translation. One of its key advantages is that the rules governing the translation process are automatically estimated from examples without human intervention. This drastically reduces the cost of developing new systems in comparison to other approaches where the translation rules are defined manually by expert linguists. In exchange, the problem of how to obtain suitable examples becomes a cornerstone in the development of an MT system.

Examples in our case are pairs of sentences, original and translation, in the two languages between which we want to translate. Given these examples, MT systems are able to estimate the linguistic rules to generate the translations from the sources. For example, they can estimate that the English word bank has different translations in Spanish depending on the context: orilla if we are talking about rivers, ladera if talking about mountains, or banco if the source sentence refers to a building. Of course, the accuracy of the estimation depends on the quality of the examples provided to the system. If the set of examples is inadequate, the system will learn undesirable translations. This is a common issue that has affected even the largest technology companies.

Manual generation of high-quality examples is out of the question since it would be almost as costly as manually defining the translation rules. In light of this dilemma, the Web stands as a priceless resource, with non-profit projects such as Common Crawl providing billions of pages in more than 40 languages for free. These petabytes of raw data are, however, quite noisy. Incorrect detection of the language of a sentence, or sentences that are not translations of each other, are errors commonly found in data automatically crawled from the web. Accurately filtering out such inadequate data to obtain high-quality examples is a key problem of great interest for the MT research community.

The problem at hand can be stated as follows:

“Given a large set of potentially noisy sentence translation examples, filter it down to a smaller set of high-quality sentence pairs suitable for developing a translation system”

We approach this problem as a binary classification task where a statistical model is used to predict whether a given example is adequate or noisy. We start by transforming each example into a list of features that quantify important characteristics of the pair of sentences. From these features, using machine learning techniques, a statistical model can be trained to produce a score that represents the adequacy of the example.

We use a very rich set of features to represent each example. These features can be broadly classified into two groups:

  • Fluency: these features aim to capture whether the sentences are grammatically well-formed, contain correct spellings, adhere to common use of terms, titles and names, are intuitively acceptable and can be sensibly interpreted by a native speaker. Language models are extensively used in this context. In particular, we use the probability and perplexity of n-gram language models. For a given sentence $w_1, \ldots, w_m$, the language model probability is given by approximating the chain rule over the words of the sentence:

    $$P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})$$

    Here, it is assumed that the probability of observing the i-th word $w_i$ in the context of the preceding $i-1$ words can be approximated by the probability of observing it in the shortened context of the preceding $n-1$ words; $n$, the order of the model, is usually chosen to be 3 or 5. The conditional probability of each word can be calculated from frequency counts:

    $$P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1}, w_i)}{\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1})}$$

    (A code sketch of these language model scores is given after this list.)

  • Adequacy: these features measure how much of the meaning of the original is expressed in the translation and vice versa. To this end, we use probabilistic lexicons with different formulations on how to align the words of the two sentences, and compute how accurately each sentence translates the other. For instance, one of the scores we use is the average translation probability of the words of the translation $t_1, \ldots, t_{|t|}$ given the source sentence $s_1, \ldots, s_{|s|}$:

    $$\mathrm{adequacy}(s, t) = \frac{1}{|t|} \sum_{j=1}^{|t|} p(t_j \mid s)$$

    where $p(t_j \mid s)$ is obtained from the probabilistic lexicon by aligning $t_j$ with the source words, for instance by taking the maximum lexicon probability over the words of $s$. (A sketch of this score is also given after the list.)
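
To make the fluency features concrete, here is a minimal Python sketch, assuming whitespace-tokenised sentences and no smoothing (which any practical language model would need). It estimates n-gram probabilities from frequency counts and returns the log-probability and perplexity of a sentence; it is an illustration, not the exact implementation we use.

```python
import math
from collections import Counter

N = 3  # order of the n-gram model (3 or 5 are typical choices)

def ngrams(tokens, n):
    """Pad a tokenised sentence and enumerate its n-grams."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

def train_lm(corpus, n=N):
    """Collect n-gram and (n-1)-gram counts from a list of tokenised sentences."""
    ngram_counts, context_counts = Counter(), Counter()
    for tokens in corpus:
        for gram in ngrams(tokens, n):
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts

def sentence_scores(tokens, ngram_counts, context_counts, n=N):
    """Return (log-probability, perplexity) of a sentence under the counting LM."""
    log_prob, num_words = 0.0, 0
    for gram in ngrams(tokens, n):
        context = gram[:-1]
        if ngram_counts[gram] > 0 and context_counts[context] > 0:
            p = ngram_counts[gram] / context_counts[context]
        else:
            p = 1e-6  # unseen n-gram: crude floor instead of real smoothing
        log_prob += math.log(p)
        num_words += 1
    perplexity = math.exp(-log_prob / num_words)
    return log_prob, perplexity
```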

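A comparable sketch for the adequacy score follows, assuming a probabilistic lexicon already estimated elsewhere (for example from word-aligned data) and the max-over-source-words alignment formulation mentioned above as one possibility.

```python
def lexical_adequacy(source_tokens, target_tokens, lexicon, floor=1e-6):
    """Average, over target words, of the best p(target_word | source_word)
    found in the probabilistic lexicon (one possible alignment formulation).

    `lexicon` is assumed to map source words to {target word: probability}.
    """
    if not target_tokens:
        return 0.0
    total = 0.0
    for t in target_tokens:
        best = max(
            (lexicon.get(s, {}).get(t, 0.0) for s in source_tokens),
            default=0.0,
        )
        total += max(best, floor)  # floor avoids zeros for out-of-lexicon words
    return total / len(target_tokens)
```
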
Regarding the machine learning algorithm, there is a huge number of libraries implementing the vast array of available classifiers. We chose scikit-learn, a Python library that integrates almost every classifier proposed in the literature. With extensive documentation and a quite intuitive common interface for all classifiers, scikit-learn is arguably the best machine learning library for Python. After trying different classifiers, we chose to use gradient boosting classifiers in our experiments.
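
As an illustration (not our exact setup), training and applying a gradient boosting classifier with scikit-learn looks roughly like this; the random matrix stands in for real feature vectors, and the hyperparameters shown are generic defaults rather than tuned values.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Placeholder data: in practice, each row of X is the feature vector of one
# sentence pair (LM log-probability, perplexity, lexical adequacy in both
# directions, length ratio, ...) and y is 1 for "keep", 0 for "filter out".
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
y = (X[:, 0] + X[:, 3] > 0).astype(int)  # toy labels, just to make the sketch runnable

X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.1, random_state=0)

clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_dev, y_dev))

# The probability of the "keep" class serves as an adequacy score that can be
# thresholded or used to rank new, noisy sentence pairs.
keep_scores = clf.predict_proba(X_dev)[:, 1]
```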

Whichever classification algorithm we choose, this task poses an additional important learning problem. For a binary classification task, we need data from both classes (adequate examples to keep and noisy examples to filter out). We can use a curated list of sentence pairs to represent the “keep” class, but data for the negative class (pairs of sentences that should be filtered out) is usually not available. Without examples from both classes, the model has no way of telling how the features differ between them, e.g. what properties of a sentence pair make it more or less likely to be kept or filtered out.

The good news is that negative examples can be created on demand from the positive examples. The process involves perturbing one or both of the sentences in a pair to create a new synthetic pair that, by construction, should be filtered out. For instance, we can randomly pair two sentences that do not belong together. Of course, the sentences in these synthetic pairs are not translations of each other. Thus, they are suitable negative examples for the classification model to distinguish from the sentence pairs that should be kept.
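
A minimal sketch of this random-mismatching strategy, assuming `positive_pairs` is a list of (source, target) sentence tuples; other perturbations, such as deleting or duplicating words, would follow the same pattern.

```python
import random

def make_negative_pairs(positive_pairs, num_negatives, seed=0):
    """Create synthetic negative examples by pairing a source sentence with
    the target side of a different, randomly chosen pair."""
    rng = random.Random(seed)
    negatives = []
    while len(negatives) < num_negatives:
        (src, _), (_, wrong_tgt) = rng.sample(positive_pairs, 2)
        negatives.append((src, wrong_tgt))
    return negatives

# Usage: label the original pairs as 1 ("keep") and the synthetic
# negatives as 0 ("filter out") before training the classifier.
```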

Once the model is trained, we can apply it to any noisy data to identify adequate translation examples. In our experiments, this methodology allows us to filter out synthetic negative data with over 90% accuracy. This has allowed us to effectively leverage the Web to improve our localisation systems.