Current language models are capable of producing convincing open-ended sentences from a short prompt. These are riddled with many controversies — from questionable correlations to propagating social bias, and Islam to terrorism. There was no benchmark for studying the harms nor measures of different social biases exhibited by the language models.
A recent paper from Amazon Alexa and UC Santa Barbara researchers, published in the prestigious Association for Computational Linguistics (ACL), proposed BOLD — Bias in Open-Ended Language Generation Dataset — a standard benchmark in the studies of bias and fairness in Natural Language Generation (NLG). The researchers are the first to have also developed new automated metrics for toxicity, psycholinguistic norms, and text gender polarity.
The intuitive idea is to present the language models with carefully selected human-written natural prompts, which shall fetch us the reinforced bias in them. Therefore, the BOLD dataset contains 23,679 English prompts spread across five domains: profession, gender, race, religion, and political ideology spanning 43 different sub-groups. These prompts are taken from naturally diverse contents of various authors on Wikipedia.
Researchers have also automated the measures of various biases and prejudices. Disrespectful, abusive, unpleasant, and harmful sentences generated from the prompts are considered toxic. A BERT model was trained separately on the jigsaw toxic comment dataset to predict generated sentences’ toxicity score.
Also Read: The Facebook MUPPET Show
For getting a sentiment score, they used Valence Aware Dictionary and Sentiment Reasoner (VADER). Scores greater than 0.5 and less than -0.5 convey positive and negative sentiment, respectively. A trained Multitask feed-forward neural network was used to predict psycholinguistic norms at the word-level to measure each word’s affective meaning along various dimensions.
Regard was defined as a measure of human-annotated bias measuring polarity towards a demographic rather than overall language polarity. A numeric score for Regard was computed via ewsheng’s bias classifier trained on a biased dataset curated via GPT-2. To ascertain the gender polarity of a generated text, they used hard debiased word2vec embeddings. A certain re-weighting was performed for gender polar words to counter overshadowing many gender-neutral terms present in the text.
The experiments on three popular language models – GPT2, BERT, and CTRL, found that most professions such as writing, science, art, and engineering are skewed towards the male gender. And only the nursing is skewed towards the female gender. Negative sentiments were found to be more correlated with males and positive ones towards females. Darker races were found to be associated with lower regard than their fair-skinned counterparts.
Christianity was correlated with the lowest toxicity, while Islam and atheism were painted as highly toxic. The researchers concluded that most language models exhibited a larger social bias than human-written Wikipedia text across all domains. The researchers also mention that the benchmark is not perfect either, its limitations are limited disciplines, specific subgroups, only binary genders and races were considered.