Hate-speech detection models are among the most glaring examples of biased models, as a linguistic study by researchers at the Allen Institute for Artificial Intelligence shows. A previous post highlighted the effects of statistical bias in machine translation; in this post, you will see how dataset bias affects models. The researchers studied the behavior of hate-speech detectors using lexical markers (swear words, slurs, and identity mentions) and dialectal markers (specifically African-American English). They also proposed an automated dialect-aware data-correction method, which uses synthetic labels to weaken the association between dialect and toxicity scores.
The dataset creation process inevitably captures biases that are inherent to humans. This dataset bias consists of spurious correlations between surface patterns and annotated toxicity labels, and it gives rise to two distinct types of bias: lexical and dialectal. Lexical bias associates toxicity with certain words (profanity and identity mentions), while dialectal bias correlates toxicity with dialects spoken by minority groups, such as African-American English. All of these biases propagate freely during the training phase of hate-speech models.
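To make the idea of a spurious lexical correlation concrete, here is a minimal sketch (with a made-up toy corpus, not the study's data) that compares how often texts containing a given word are labeled toxic against the base rate. A naive model trained on such data can latch onto the word itself as a toxicity cue, even when many of its occurrences are benign.

```python
# Toy labeled corpus: (text, label) with 1 = toxic, 0 = non-toxic.
# These examples are hypothetical, purely for illustration.
corpus = [
    ("you are a damn genius", 0),
    ("damn you are awful", 1),
    ("have a nice day", 0),
    ("damn this traffic", 0),
    ("you awful damn person", 1),
    ("what a lovely morning", 0),
]

def toxicity_rate_with_word(corpus, word):
    """Return P(toxic | word present) and the base rate P(toxic)."""
    with_word = [label for text, label in corpus if word in text.split()]
    base = sum(label for _, label in corpus) / len(corpus)
    cond = sum(with_word) / len(with_word) if with_word else 0.0
    return cond, base

cond, base = toxicity_rate_with_word(corpus, "damn")
print(f"P(toxic | 'damn') = {cond:.2f}, P(toxic) = {base:.2f}")
```

Here "damn" appears in both toxic and harmless sentences, yet its conditional toxicity rate exceeds the base rate, which is exactly the kind of surface pattern an annotator-biased dataset teaches a model to exploit.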
Researchers have proposed numerous debiasing techniques in the past, some applied by internet giants such as Google, Facebook, and Twitter in their systems. In this study, the researchers found that these methods are not good enough: the supposedly "debiased" models still disproportionately flag text in particular dialects as toxic. As the researchers noted, "mitigating dialectal bias through current debiasing methods does not mitigate a model's propensity to label tweets by black authors as more toxic than by white authors."
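The "disproportionate flagging" finding can be measured as a false-positive-rate gap: among texts annotated as non-toxic, how much more often does the classifier flag one dialect group than another? A minimal sketch, with hypothetical data and a deliberately biased toy classifier standing in for a real model:

```python
# Toy bias check: compare false-positive rates on non-toxic text across
# dialect groups. The examples and the `predict` function are hypothetical
# stand-ins, not the study's models or data.

def false_positive_rate(examples, predict):
    """Share of non-toxic examples (label 0) that the classifier flags as toxic."""
    negatives = [text for text, label in examples if label == 0]
    return sum(predict(text) for text in negatives) / len(negatives)

# Non-toxic examples in two dialect groups (labels are all 0).
aae_examples = [("ain't nothing wrong", 0), ("finna grab some food", 0)]
sae_examples = [("nothing is wrong", 0), ("going to grab some food", 0)]

# A toy classifier that (wrongly) treats dialect markers as toxicity signals.
predict = lambda text: int("ain't" in text or "finna" in text)

gap = (false_positive_rate(aae_examples, predict)
       - false_positive_rate(sae_examples, predict))
print(gap)  # a positive gap means the dialect group is flagged disproportionately
```

A debiasing method that truly removed dialectal bias would drive this gap toward zero; the study's point is that current methods leave it largely intact.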
The Allen researchers proposed a proof-of-concept solution to ward off the problem. The idea is to translate flagged hate speech into the majority's dialect, which the classifier deems non-toxic. This accounts for the speech's dialectal context, giving the model common ground on which to predict toxicity scores reasonably and making it less prone to dialectal and racial biases.
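The pipeline can be sketched in a few lines. Everything below is a stand-in: `to_majority_dialect` is a naive word-level rewrite in place of a learned dialect translator, and `toxicity_score` is a toy classifier that (wrongly) fires on dialect markers, mimicking the biased behavior described above.

```python
# Toy word-level rewrites standing in for a learned dialect translator.
REWRITES = {"finna": "going to", "ain't": "is not"}

def to_majority_dialect(text: str) -> str:
    """Naively rewrite dialect markers into the majority dialect."""
    return " ".join(REWRITES.get(word, word) for word in text.split())

def toxicity_score(text: str) -> float:
    """Stand-in biased classifier that fires on dialect markers."""
    markers = {"finna", "ain't"}
    return 0.9 if any(word in markers for word in text.split()) else 0.1

tweet = "i'm finna head home"
print(toxicity_score(tweet))                       # biased score on the original
print(toxicity_score(to_majority_dialect(tweet)))  # score after translation
```

Scoring the translated text instead of the original strips away the dialectal surface cues the classifier spuriously keys on, which is the intuition behind the researchers' proof of concept.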