Researchers from Adobe and Auburn University have pointed out that current BERT-based language models may be little more than sophisticated n-gram models, because they are largely insensitive to word order when making decisions.
In natural language, word order is highly constrained by many linguistic factors, including syntactic structure, subcategorization, and discourse. Arranging words in the correct order is considered a critical problem in language modeling. Earlier, statistical models like n-grams were used for basic Natural Language Understanding (NLU) tasks such as sentiment analysis and sentence completion. But those models suffer from several problems: they fail to preserve long-term dependencies, lose context, and are plagued by sparsity, so they cannot produce convincing long sentences in the correct word order. Thanks to attention modules, the feats achieved by language models like Microsoft’s DeBERTa, GPT-3, and Google’s Switch Transformer suggested that the word-order problem had been solved for good.
Unfortunately, the researchers found that these language models rely heavily on the words themselves, not their order, to make decisions. The root cause is the self-attention matrices, which explicitly map word correspondences between two input sentences regardless of the order of those words. To demonstrate this, the researchers used a pre-trained RoBERTa model that achieves 91.12% accuracy on the Quora Question Pairs dataset by correctly labeling pairs of Quora questions as “duplicate” or not. Using that model, they show the following effect of shuffled words (a minimal sketch reproducing this setup follows the list below):
- Words not shuffled: the model correctly labels the question pair “duplicate.”
- All words in question Q2 shuffled at random: interestingly, the model’s prediction remains almost unchanged.
- In other words, the model incorrectly labels a real question and its scrambled counterpart as “duplicate.”
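This behaviour is easy to reproduce in a few lines with the Hugging Face transformers library. The sketch below assumes a RoBERTa checkpoint fine-tuned on Quora Question Pairs is available; the checkpoint path, the example questions, and the assumption that class index 1 corresponds to “duplicate” are illustrative placeholders, not details from the paper.

```python
import random
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder: any RoBERTa model fine-tuned on Quora Question Pairs (hypothetical path).
MODEL_NAME = "path/to/roberta-finetuned-on-qqp"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

def shuffle_words(sentence: str, seed: int = 0) -> str:
    """Return the sentence with its words randomly permuted."""
    words = sentence.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def duplicate_probability(q1: str, q2: str) -> float:
    """Probability assigned to the 'duplicate' class (assumed to be index 1)."""
    inputs = tokenizer(q1, q2, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

# Illustrative question pair (not from the paper).
q1 = "How do I improve my English speaking skills?"
q2 = "What should I do to speak English more fluently?"

print("original pair:", duplicate_probability(q1, q2))
print("Q2 shuffled  :", duplicate_probability(q1, shuffle_words(q2)))
```

If the model were genuinely sensitive to word order, the second probability should drop sharply; the paper’s observation is that it barely moves.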
BERT-based language models, which use transformer architectures that learn representations via a bidirectional encoder, are good at exploiting superficial cues such as the sentiment of keywords and the word-wise similarity between sequence-pair inputs, and they use these hints to make “correct” decisions even when tokens appear in random order. The researchers tested BERT-, RoBERTa-, and ALBERT-based models on six GLUE binary classification tasks, checking whether the models’ predictions change when the words of an input sentence are shuffled. Any human, or any model claimed to surpass humans, would be expected to choose a “reject” option when asked to classify a sentence whose words are randomly shuffled.
Their experiments showed that, on five of the six GLUE tasks, 75% to 90% of the correct predictions of BERT-derived classifiers remain unchanged even after the input words are shuffled. In other words, roughly 65% of the ground-truth labels on those five GLUE tasks can still be predicted when the words of one sentence in each example are shuffled. The behavior persists even though BERT embeddings are famously contextual.
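A hedged sketch of how such an agreement rate can be measured: classify each validation pair once intact and once with one question shuffled, then count how many correct predictions survive. The checkpoint path is again a placeholder, and loading QQP through the datasets library is an assumption about tooling, not the authors’ setup.

```python
import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "path/to/roberta-finetuned-on-qqp"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

def shuffle_words(text: str, rng: random.Random) -> str:
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)

def predict(q1: str, q2: str) -> int:
    inputs = tokenizer(q1, q2, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits.argmax(dim=-1).item()

rng = random.Random(42)
data = load_dataset("glue", "qqp", split="validation[:500]")  # small slice for illustration

correct, unchanged = 0, 0
for ex in data:
    original = predict(ex["question1"], ex["question2"])
    if original == ex["label"]:
        correct += 1
        shuffled = predict(ex["question1"], shuffle_words(ex["question2"], rng))
        unchanged += int(shuffled == original)

print(f"correct predictions that survive shuffling: {unchanged / max(correct, 1):.1%}")
```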
They also showed that in the sentiment analysis task (SST-2), a single salient word can predict an entire sentence’s label more than 60% of the time. It can therefore safely be assumed that the models rely heavily on a few keywords to classify a complete sentence.
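One way to probe this keyword reliance is to feed each word of a sentence to a sentiment classifier on its own and check whether any single word already reproduces the full-sentence prediction. The sketch below uses a publicly available SST-2 DistilBERT checkpoint purely for illustration; the paper’s own experiments used BERT-, RoBERTa-, and ALBERT-based models.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# A publicly available SST-2 classifier, used here only as a stand-in.
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

def predict(text: str) -> int:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits.argmax(dim=-1).item()

# Illustrative sentence, not an SST-2 example.
sentence = "The plot is thin but the performances are absolutely wonderful."
full_label = predict(sentence)

# Which single words, on their own, already yield the full-sentence label?
salient = [w for w in sentence.split() if predict(w) == full_label]
print("full-sentence label:", full_label)
print("words that alone reproduce it:", salient)
```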
The researchers also found that models trained on sequence-pair GLUE tasks use a set of self-attention heads to find similar tokens shared between the two inputs. For instance, in ≥ 50% of their correct predictions, QNLI models rely on a set of 15 specific self-attention heads to find similar words shared between question and answer, regardless of word order.
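This attention behaviour can be inspected directly by requesting the per-head attention matrices and measuring how much attention each head places on tokens that occur in both segments. The sketch below uses a generic pretrained roberta-base rather than the paper’s fine-tuned QNLI models, and its “shared token” heuristic is an illustrative simplification.

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "roberta-base"  # stand-in; the paper inspects its fine-tuned QNLI models
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_attentions=True).eval()

# Illustrative question/answer pair.
question = "What year did the bridge open to traffic?"
answer = "The bridge opened to traffic in 1937 after four years of construction."

inputs = tokenizer(question, answer, return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions  # one (1, heads, seq, seq) tensor per layer

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Token positions whose surface form occurs in both segments
# (a crude proxy for "similar words shared between question and answer").
sep = tokens.index("</s>")  # RoBERTa uses </s> as the segment separator
seg_a, seg_b = set(tokens[1:sep]), set(tokens[sep + 1:-1])
shared = {t for t in seg_a & seg_b if t.strip("Ġ").isalpha()}
shared_pos = [i for i, t in enumerate(tokens) if t in shared]

# Average attention mass each head places on the shared tokens.
for layer, attn in enumerate(attentions):
    per_head = attn[0, :, :, shared_pos].sum(dim=-1).mean(dim=-1)  # shape: (heads,)
    top = per_head.argmax().item()
    print(f"layer {layer:2d}: head {top} puts {per_head[top]:.2f} of its attention on shared tokens")
```

Heads that concentrate their attention on shared tokens are candidates for the word-matching behaviour the paper describes, since a bag of matching words is enough to trigger them no matter how the words are ordered.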
Modifying the training regime of RoBERTa-based models to be more sensitive to word order improves performance on SQuAD 2.0, most GLUE tasks (except SST-2), and out-of-sample data. These findings suggest that existing high-performing NLU models have a naive understanding of text and readily misbehave on out-of-distribution inputs. They behave much like n-gram models, since each word’s contribution to a downstream task remains largely intact even after the word’s context is lost. In the end, as far as benchmarking of language models is concerned, many GLUE tasks no longer seem challenging enough to test whether a machine truly understands a sentence’s meaning.
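The article does not spell out how the training regime was modified. One common way to instil word-order sensitivity, shown below purely as an illustration and not as the authors’ procedure, is to augment the training data with shuffled copies of positive pairs that are relabeled as negatives, so that the model is penalised for treating a scrambled sentence as equivalent to the original.

```python
import random

def augment_with_shuffled_negatives(examples, text_key="question2",
                                    label_key="label", negative_label=0, seed=13):
    """Illustrative augmentation (an assumption, not the paper's method):
    for every positive pair, add a copy whose second sentence is word-shuffled
    and which is relabeled as negative."""
    rng = random.Random(seed)
    augmented = list(examples)
    for ex in examples:
        if ex[label_key] == negative_label:
            continue
        words = ex[text_key].split()
        rng.shuffle(words)
        corrupted = dict(ex)
        corrupted[text_key] = " ".join(words)
        corrupted[label_key] = negative_label
        augmented.append(corrupted)
    return augmented

# Example: one duplicate pair becomes two training examples,
# the second one scrambled and labeled "not duplicate".
train = [{"question1": "How can I learn Python fast?",
          "question2": "What is the quickest way to learn Python?",
          "label": 1}]
print(augment_with_shuffled_negatives(train))
```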