AI-based translations used to be an object of ridicule whenever they produced something unintentionally funny. Consequently, AI researchers focused on improving translation accuracy and fluency to avoid the embarrassment caused by faulty translations. The situation gradually improved, especially with better and larger language models that surpassed humans on various benchmarks.
But these language models still amplify the statistical biases found in their training data, and those biases affect not only the translations themselves but also their linguistic richness. Researchers from the University of Maryland and Tilburg University have tried to study this effect quantitatively through a grammatical and linguistic analysis of machine translations.
A translated work differs from the original because of intentional factors, like explicitation and normalization, and unintentional ones, like the unconscious influence of the source language on the target-language output. Linguists study these factors under the name translationese to assess what the translator has added. Similarly, the elements introduced by a machine translator are analyzed under the name machine translationese.
In the study, the researchers carried out a linguistic analysis of sequential neural models like LSTMs and Transformers, as well as phrase-based statistical translation models, to highlight the above factors. These models were tasked with translating between English, French, and Spanish. The researchers found that the statistical distribution of terms in the training data dictates the loss of morphological variety in the machine translations.
The translation systems do not distinguish between synonymous and grammatical variants. This directly reduces the number of grammatically correct but diverse options. In layman's terms, the diversity of words and sentence structures was drastically lower in the translations because the systems favor consistency and simplification.
The authors also investigated the sociolinguistic impact of this loss, because machine translations affect how the masses use language. No solution to the problem has been proposed. The authors believe that different metrics will contribute to further investigation: language acquisition metrics to analyze lexical sophistication, and Shannon entropy and Simpson diversity to study morphological diversity.
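To make the last two metrics more concrete, here is a minimal Python sketch of how Shannon entropy and Simpson diversity could be computed over the surface forms of a single word. It is an illustration only, not the authors' code. The function names and the toy word lists are made up for this example, and the Simpson variant used here (one minus the sum of squared proportions) is an assumption, since the study may define the index differently.

```python
import math
from collections import Counter


def shannon_entropy(forms):
    """Shannon entropy, in bits, of the distribution of observed forms."""
    counts = Counter(forms)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def simpson_diversity(forms):
    """Simpson diversity as 1 - sum(p_i ** 2).

    Assumed variant: the probability that two forms drawn at random
    (with replacement) are different. Other definitions exist.
    """
    counts = Counter(forms)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


# Hypothetical toy data: surface forms of one verb as they might appear
# in a human translation and in a machine translation of the same text.
human_forms = ["go", "goes", "went", "gone", "going", "go", "went", "goes"]
machine_forms = ["go", "go", "go", "goes", "go", "go", "went", "go"]

for label, forms in (("human", human_forms), ("machine", machine_forms)):
    print(label, shannon_entropy(forms), simpson_diversity(forms))
```

On this toy data, the machine list scores lower on both measures because one form dominates. That is the kind of reduced morphological variety the study describes.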