
Dealing With Racially-Biased Hate-Speech Detection Models


Hate-speech detection models are among the most glaring examples of biased models, as researchers from the Allen Institute for Artificial Intelligence show in their linguistic study. A recent post highlighted the effects of statistical bias in machine translation; this post looks at how dataset bias affects models. The researchers studied the behavior of hate-speech detectors using lexical markers (swear words, slurs, identity mentions) and dialectal markers (specifically African-American English). They also proposed an automated dialect-aware data correction method, which uses synthetic labels to reduce dialectal associations with toxicity scores.

The dataset creation process always captures biases that are inherent to humans. This dataset bias consists of spurious correlations between surface patterns and annotated toxicity labels, and it gives rise to two different types of bias: lexical and dialectal. Lexical bias associates toxicity with certain words, such as profanity and identity mentions, while dialectal bias correlates toxicity with the dialects spoken by minority groups. These biases then propagate freely while the hate-speech models are trained.

Researchers have proposed numerous debiasing techniques in the past, some of which internet giants such as Google, Facebook, and Twitter apply in their systems. In this study, the researchers found that these techniques are not good enough: the so-called "debiased" models still disproportionately flag text in particular dialects as toxic. The researchers noted that "mitigating dialectal bias through current debiasing methods does not mitigate a model's propensity to label tweets by black authors as more toxic than by white authors."

The Allen researchers proposed a proof-of-concept solution that wards off the problem. The idea is to translate the reported posts into the majority dialect, which the classifier deems non-toxic, before scoring them. This accounts for the posts' dialectal context and gives the model common ground to predict toxicity scores reasonably, making it less prone to dialectal and racial biases.
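As a rough illustration of that idea, here is a minimal sketch that re-scores a flagged post after mapping it into the majority dialect; the translate_to_majority_dialect helper and the classifier interface are hypothetical placeholders, not the Allen AI implementation.

```python
# Illustrative sketch (not the Allen AI implementation): re-score a flagged post
# after rewriting it into the majority dialect, and soften the label if the
# dialect-neutral version is no longer considered toxic.

def dialect_aware_label(post, classifier, translate_to_majority_dialect, threshold=0.5):
    """Return (is_toxic, corrected_score) for `post`.

    `classifier(text)` is assumed to return a toxicity probability in [0, 1].
    `translate_to_majority_dialect(text)` is a hypothetical helper that rewrites
    the post into the majority dialect while preserving its meaning.
    """
    original_score = classifier(post)
    neutralized = translate_to_majority_dialect(post)
    neutral_score = classifier(neutralized)

    # If toxicity disappears once the dialectal markers are removed, the original
    # score was likely driven by dialect rather than content: keep the lower score.
    corrected_score = min(original_score, neutral_score)
    return corrected_score >= threshold, corrected_score
```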


AutoML Made Easy With Symbolic Programming Using PyGlove


Google AI researchers have released PyGlove, a library that brings symbolic programming to Automated Machine Learning (AutoML) and lets developers experiment with the search spaces, search algorithms, and search flows of an AutoML program in only a few lines of code. Developers can now make Python classes and functions self-mutable through brief annotations, which makes it much easier to write AutoML programs.

Developers previously had data and the outputs; they fed that into a machine learning algorithm, which automated the learning of rules governing input to output. Researchers later automated the selection and hyper-parameter tuning of those machine learning algorithms as well. One of the sub-classes of machine learning algorithms is neural networks, which are highly sensitive to architecture and hyper-parameters.

The possible combinations of architecture and hyper-parameter choices become humongous as researchers aim to build larger and larger neural models, and they can waste months hand-crafting neural network architectures and selecting the right hyper-parameters. AutoML automates these aspects by formulating the problem as a search problem.

Also Read: What Is Liquid Machine Learning?

A search space is defined to represent all possible choices, and a search algorithm is used to find the best options. Neural Architecture Search (NAS) algorithms such as ENAS and DARTS fall under the purview of AutoML. But current NAS implementations do not expose components such as the search space and the search algorithm as modular pieces, so researchers have had difficulty modifying the search space, search algorithm, or search flow independently.

The Google researchers introduced an AutoML approach based on symbolic programming, a paradigm in which a program can mutate itself by manipulating its own components, which keeps those components decoupled. This decoupling makes it easy for practitioners to change the search space and the search algorithm (with or without weight sharing), add search capabilities to existing code, and implement complex search flows.
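To make the decoupling concrete, here is a conceptual sketch in plain Python (deliberately not PyGlove's actual API): the search space, the search algorithm, and the search flow are separate pieces, so any one of them can be swapped without touching the others.

```python
import random

# Conceptual sketch of the decoupling PyGlove advocates (plain Python, not the
# PyGlove API): search space, search algorithm, and search flow are independent.

# 1. Search space: every tunable choice is declared symbolically.
search_space = {
    "num_layers": [2, 4, 8],
    "hidden_units": [64, 128, 256],
    "activation": ["relu", "gelu"],
}

# 2. Search algorithm: a random searcher here; an evolutionary or RL-based
#    searcher could be dropped in without changing the space or the flow.
def random_searcher(space):
    return {name: random.choice(options) for name, options in space.items()}

# 3. Search flow: propose, evaluate, keep the best.
def evaluate(config):
    # Placeholder objective; in practice this would train and score a model.
    return -config["num_layers"] * 0.1 + config["hidden_units"] * 0.001

best_config, best_score = None, float("-inf")
for _ in range(20):
    config = random_searcher(search_space)
    score = evaluate(config)
    if score > best_score:
        best_config, best_score = config, score

print(best_config, best_score)
```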

On the ImageNet and NAS-Bench-101 benchmarks, they showed that the symbolic-programming-based PyGlove can convert a static program into a search space, quickly iterate on search spaces and search algorithms, and craft complex search flows to achieve better results. PyGlove also allows easy plug-and-play of AutoML techniques in existing ML pipelines while benefiting open-ended research.


Microsoft Introduces Viva To Help People Work From Home


Microsoft has unveiled a new employee experience platform, Viva, that acts as an integrated place to manage employee well-being, learning, engagement, and knowledge discovery within the flow of work. With close integration with Teams and Office 365 technologies, Microsoft wants to be the market leader in employee engagement.

Microsoft is betting that remote work culture is here to stay, and it is targeting organizations and employees of all kinds. During the pandemic, almost every company has struggled with a patchwork of platforms for onboarding and training employees. The new platform promises to smooth the journey for employees and companies alike.

Also Read: AWS Will Host Free Virtual Classes On ML, Blockchain, Big Data, And More

Currently, the platform has four foundations, Viva Connections, Viva Insights, Viva Learning, and Viva Topics, each representing a different aspect of the employee workflow, whether inside the company or outside it. Viva Connections provides a personalized gateway for every employee to access all internal communications and company resources. It also helps employees participate in communities such as employee resource groups, all from a single customizable app in Microsoft Teams. Viva Insights helps executives identify where teams are struggling, especially in balancing productivity and well-being.

Viva Learning brings the company's learning resources, such as courses and guided projects from edX, Coursera, and many more, into one platform. It helps employees manage all their training and micro-courses along with their accomplishments. Viva Topics uses AI to surface knowledge from third-party sources, from documents across Microsoft 365, and from conversations in Teams.

Microsoft has partnered with Accenture, Avanade, PwC, and EY to help other companies adopt the homegrown employee experience environment, providing consulting and advisory services. Microsoft's CEO Satya Nadella summed up the benefits of Viva in a vision statement for the new initiative: "We have participated in the largest at-scale remote work experiment the world has seen, and it has had a dramatic impact on the employee experience. Every organization will require a unified employee experience from onboarding and collaboration to continuous learning and growth. Viva brings together everything an employee needs to be successful, from day one, in a single, integrated experience directly in Teams."


AWS Will Host Free Virtual Classes On ML, Blockchain, Big Data, And More


AWS will offer free virtual classes for learners who want to gain in-demand skills in machine learning, blockchain, big data, containers, and more. The AWS virtual classes are an ideal way to get started with the latest technologies on AWS.

The webinar-based online classes run about 90 minutes and are mostly aimed at beginners and professionals who want to explore new technologies.

Learners from across the world can register at their convenience, as the lessons are delivered across time zones. AWS keeps hosting training through virtual classes to help ensure the world has a workforce that can work with cutting-edge technologies. Last month, AWS conducted the free AWS AI Conclave, delivering more than 20 sessions from industry experts.

In February 2021, AWS included topics such as blockchain, containers, Kubernetes, machine learning, and big data, among others. With these webinars, aspirants can discover new interests and learn the fundamentals of these technologies.

As organizations increasingly expect data scientists to have a fundamental knowledge of product development, an understanding of containers can set candidates apart. In addition to learning new technologies, familiarity with the AWS platform can also help beginners streamline their workflow when they start working at organizations.

Every month, AWS delivers webinars on the latest technologies, and in the coming months, AWS will also focus on Cloud Security, Data Analytics, and more.

To know more about the upcoming AWS Virtual Classes click here.


Measuring Weirdness In AI-Based Language-Translations


AI-based language translations have long been an object of ridicule whenever they coughed up something funny. Consequently, AI researchers focused on translation accuracy and fluency to set aside the embarrassment caused by faulty translations. The situation gradually improved, especially with better and larger language models that surpassed humans on various benchmarks.

But these language models still amplify the statistical biases found in their training data. And the biases affect not only the translations but also their linguistic richness. Researchers from the University of Maryland and Tilburg University have tried to study this effect quantitatively in terms of grammar and linguistic analysis of machine translations. 

A translated text differs from the original due to intentional factors, such as explicitation and normalization, and unintentional ones, such as the unconscious influence of the source language on the target text produced. These factors are studied in a field of linguistics called translationese, which assesses the translator's unique additions. Similarly, linguists analyze the elements introduced by a machine translator under the label machine translationese.

Also Read: Language Models Exhibit Larger Social Bias Than Human-Written Texts

In the study, the researchers performed a linguistic analysis of sequential neural models such as LSTMs and Transformers, as well as phrase-based statistical translation models, to highlight the above factors. The models were tasked with translating between English, French, and Spanish. The researchers found that the statistical distribution of terms in the training data dictates the loss of morphological variety in the machine translations.

The translation systems do not distinguish between synonymous and grammatical variants, which directly reduces the number of grammatically correct but diverse options. In layman's terms, the diversity of words and sentence structures in the translations was drastically lower because of consistency and simplification.

The authors also investigated the sociolinguistic impact of this loss, because machine translations affect language usage among the masses. No solution to the problem has been proposed yet. The authors believe that metrics such as language-acquisition measures of lexical sophistication, and Shannon entropy and Simpson diversity for morphological diversity, will aid further investigation.
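For a sense of what those diversity measures capture, here is a minimal sketch that computes Shannon entropy and Simpson diversity over the word frequencies of a translated sentence; the tokenization and the example text are placeholders of mine, not from the paper.

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the token frequency distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def simpson_diversity(tokens):
    """Simpson diversity: probability that two randomly drawn tokens differ."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

# Toy example: lower values indicate a more repetitive, less diverse output.
text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.split()
print(shannon_entropy(tokens), simpson_diversity(tokens))
```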


Google Introduces Interpretable Ranking Via Generalized Additive Models


We are building ever more complex AI models to get our predictions right. In the end, we have very accurate predictions but no interpretation of the models' internal workings. These AI models have been introduced, in a controlled manner, into sensitive areas such as determining bail or parole, assessing loan eligibility, targeting advertisements, or guiding medical treatment decisions.

But the lack of interpretability has made model maintenance harder and allowed social bias to persist in predictions, so the models' participation in higher-stakes decision processes remains limited. Google researchers are trying to change this accuracy-versus-interpretability trade-off. They have introduced interpretable ranking based on Generalized Additive Models (Neural RankGAMs), which explain their decisions and outperform previous ranking methods.

The research ecosystem around explainability is still in its infancy. Most research has focused on post-hoc analysis, that is, analyzing a black-box model's decisions after prediction. Even these post-hoc analyses are not perfect: they offer limited interpretations of decisions for out-of-dataset instances and, in some cases, fail to explain model behavior. The other way to solve the interpretability problem is to build intrinsically interpretable models with a transparent, self-explanatory structure, in which every feature's effect on the prediction is visible and understandable.

Also Read: Data Labeling And The Hidden Costs In Machine Learning

Generalized Additive Models (GAMs) seem to fit the bill. They are interpretable models that have been tried and tested on both regression and classification tasks. A GAM outputs the sum of multiple sub-models' predictions, where each sub-model takes only one feature as input, so each sub-model reflects the contribution of one input feature to the final prediction. The Google researchers are the first to use them for ranking tasks, where the goal is to rank a list of items given some objective. They instantiate the ranking GAMs with neural networks and propose two different architectures: context-free ranking and context-present ranking.
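Here is a minimal sketch of that additive structure, with each per-feature sub-model implemented as a tiny neural network; it illustrates the GAM idea only and is not Google's Neural RankGAM code.

```python
import torch
import torch.nn as nn

class TinyGAM(nn.Module):
    """Additive model: the score is the sum of per-feature sub-model outputs."""

    def __init__(self, num_features, hidden=16):
        super().__init__()
        # One small sub-network per input feature, each seeing only that feature.
        self.submodels = nn.ModuleList(
            nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(num_features)
        )

    def forward(self, x):  # x: (batch, num_features)
        contributions = [
            f(x[:, j:j + 1]) for j, f in enumerate(self.submodels)
        ]  # each: (batch, 1), interpretable as feature j's contribution
        return torch.stack(contributions, dim=0).sum(dim=0).squeeze(-1)

# Score a batch of 4 items with 3 features each; items can then be ranked by score.
model = TinyGAM(num_features=3)
scores = model(torch.randn(4, 3))
print(scores)
```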

Each sub-model was then individually distilled to produce smaller models with higher inference speed, a lower memory footprint, and a more straightforward structure. The central intuition is to train a smaller, simpler model by minimizing the loss between its outputs and those of the larger, more complex model.
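A toy sketch of that distillation intuition, assuming a made-up one-dimensional teacher and student rather than the paper's actual sub-models:

```python
import torch
import torch.nn as nn

# Toy distillation sketch (not the paper's exact procedure): fit a smaller
# student to reproduce the outputs of a larger teacher on the same inputs.
teacher = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
student = nn.Sequential(nn.Linear(1, 4), nn.ReLU(), nn.Linear(4, 1))

optimizer = torch.optim.Adam(student.parameters(), lr=1e-2)
for step in range(500):
    x = torch.rand(256, 1) * 2 - 1             # inputs drawn from [-1, 1]
    with torch.no_grad():
        target = teacher(x)                    # teacher outputs serve as labels
    loss = nn.functional.mse_loss(student(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```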

Neural RankGAMs outperformed various other ranking models by a considerable margin on the YAHOO and Chrome Web Service benchmarks, and the researchers showed that the performance boost comes from the models' capacity to learn item-level features and list-level contexts.


Language Models Exhibit Larger Social Bias Than Human-Written Texts


Current language models are capable of producing convincing open-ended sentences from a short prompt. But they are riddled with controversies, from questionable correlations, such as linking Islam to terrorism, to propagating social bias. Until now, there was no benchmark for studying these harms, nor standard measures of the different social biases language models exhibit.

A recent paper from Amazon Alexa and UC Santa Barbara researchers, published at the Association for Computational Linguistics (ACL), proposed BOLD (Bias in Open-Ended Language Generation Dataset), a standard benchmark for studies of bias and fairness in Natural Language Generation (NLG). The researchers are also the first to develop new automated metrics for toxicity, psycholinguistic norms, and text gender polarity.

The intuitive idea is to present the language models with carefully selected, human-written natural prompts, which should surface the biases reinforced in them. The BOLD dataset therefore contains 23,679 English prompts spread across five domains, profession, gender, race, religion, and political ideology, spanning 43 different subgroups. The prompts are taken from the naturally diverse content of various authors on Wikipedia.

The researchers also automated the measurement of various biases and prejudices. Disrespectful, abusive, unpleasant, and harmful sentences generated from the prompts are considered toxic; a BERT model was trained separately on the Jigsaw toxic comment dataset to predict the toxicity score of generated sentences.

Also Read: The Facebook MUPPET Show

For sentiment scores, they used the Valence Aware Dictionary and sEntiment Reasoner (VADER); scores greater than 0.5 convey positive sentiment and scores less than -0.5 convey negative sentiment. A trained multitask feed-forward neural network was used to predict psycholinguistic norms at the word level, measuring each word's affective meaning along various dimensions.
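As a reference point, here is a minimal sketch of scoring a generated sentence with the open-source vaderSentiment package and applying the thresholds above; the example sentence is a placeholder, not taken from BOLD.

```python
# pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def sentiment_label(text, pos_threshold=0.5, neg_threshold=-0.5):
    """Label a generation as positive/negative/neutral using VADER's compound score."""
    compound = analyzer.polarity_scores(text)["compound"]
    if compound > pos_threshold:
        return "positive", compound
    if compound < neg_threshold:
        return "negative", compound
    return "neutral", compound

print(sentiment_label("The nurse was praised for her brilliant and caring work."))
```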

Regard was defined as a human-annotated measure of polarity towards a demographic, rather than overall language polarity. A numeric regard score was computed via ewsheng's bias classifier, trained on a biased dataset curated via GPT-2. To ascertain the gender polarity of a generated text, they used hard-debiased word2vec embeddings, re-weighting gender-polar words so they are not overshadowed by the many gender-neutral terms present in the text.

Experiments on three popular language models, GPT-2, BERT, and CTRL, found that most professions, such as writing, science, art, and engineering, are skewed towards the male gender, while only nursing is skewed towards the female gender. Negative sentiments were found to be more correlated with males and positive ones with females. Darker races were found to be associated with lower regard than their fair-skinned counterparts.

Christianity was correlated with the lowest toxicity, while Islam and atheism were painted as highly toxic. The researchers concluded that most language models exhibit a larger social bias than human-written Wikipedia text across all domains. They also note that the benchmark is not perfect: it covers a limited set of disciplines and specific subgroups, and it considers only binary genders and a limited set of races.


Microsoft’s Gooseberry Treat For Quantum Computing


In collaboration with the University of Sydney, Microsoft has built a cryogenic quantum control chip, Gooseberry, for controlling thousands of qubits. They placed the whole control structure near the qubits themselves, in a near absolute-zero environment, a first in the field. Their work was featured in the prestigious journal Nature Electronics.

Quantum computing is in its infancy right now, comparable to the early days of computers. Quantum computers promise a considerable amount of computing power and an entirely novel set of algorithms for some of the most troubling problems in cryptography, chemistry, weather forecasting, and many more fields. Their basic computing unit, the qubit, can encode much more information via the superposition of 0 and 1, but has a terrible reputation for reacting to any perturbation. Information is still encoded in and read from the qubits via electrical signals, so manipulating them is a delicate matter that calls for a control chip to reduce the error margins in information handling.

Also Read: IBM And Daimler Simulates Materials With Fewer Qubits

It is common practice in the quantum industry to place the control structures away from the qubits, to safeguard the information stored in the qubits from electronic noise. The Microsoft researchers instead designed their chip interface so the control chip can sit with the qubits themselves. Rather than packing a rack of room-temperature electronics to generate electrical pulses for qubits held in refrigerators at 0.3 kelvin, the Gooseberry chip is placed in the refrigerator with the qubits. This arrangement results in a tightly regulated and stable environment.

The Microsoft researchers have also built a cryogenic compute core that operates at somewhat warmer temperatures and performs the classical calculations essential for determining the instructions for the Gooseberry chip, which then feeds the electrical signals to the qubits directly. With more room to generate heat and perform computations, the core enables general-purpose computing like any other CPU.


The Facebook MUPPET Show


Facebook researchers have scaled up a relatively new technique, pre-finetuning (PFT), in their MUPPET paper, applying it to multi-task learning over more than 50 tasks at a vast scale of 4.8 million instances. They showed that PFT increases both the performance and the sample efficiency of fine-tuned models such as BERT, RoBERTa, and others, even setting new records on the RTE and HellaSWAG benchmarks.

The usual workflow in large-scale language modeling is pre-training via self-supervision over massive unlabeled datasets, then fine-tuning on the task at hand with relatively little labeled data. This arrangement works fine as long as the datasets and tasks are closely related. But for low-resource languages or individual tasks with very little labeled data, this training scheme leaves the language models starved.

Also Read: Data Labeling And The Hidden Costs In Machine Learning

In 2019, a group of researchers introduced a pre-finetuning (PFT) stage, in a paper named 'Tri-Train', that sits between pre-training and fine-tuning to overcome this problem. They constructed another small corpus by selecting sentences from the unlabeled pre-training data that are relevant to the labeled training data. They then fine-tuned the pre-trained model on merely two tasks: predicting the next word in sentences from the small corpus, and predicting the start and end words of those sentences.

Facebook’s MUPPET — Massive Multi-task Representations with Pre-Finetuning — extends the above work to new levels. The researchers used 50 diverse tasks that include classification, summarization, question answering, and common sense reasoning. Their investigation showed that general multi-task learning schemes fail to learn useful representations and are unstable. However, their experiments also showed that scale plays a significant role in multitask learning. 

With only a few tasks, pre-finetuning degrades representation quality relative to the pre-trained model; beyond a critical point, usually above 15 tasks, performance improves roughly linearly with the number of tasks.

The researchers used loss scaling and task-heterogeneous batches so that learning remains balanced across the competing tasks, significantly improving training stability and overall performance. To train on several tasks, the model contains task-specific heads, each optimizing a task-specific loss. Each data-point loss is scaled so that, if the class distribution and the model's predictions were both uniform, all of the task-specific losses would have equivalent values.

Similarly, the researchers proposed task-heterogeneous batches to optimize several potentially competing objectives and create a global representation across the training tasks. During gradient descent, moving along the gradient of a single task may not be the optimal direction for learning a single unified representation across tasks. To overcome this, the model optimizes over batches that consist of several tasks: each worker samples a random batch from the set of tasks and computes a gradient, which is accumulated for the final update.
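Here is a minimal sketch of those two tricks: each task's cross-entropy loss is divided by log(number of classes), the value a uniform prediction would incur, and each update accumulates gradients from several randomly sampled tasks. The encoder, heads, and data below are toy placeholders, not Facebook's implementation.

```python
import math
import random
import torch
import torch.nn as nn

# Toy setup: a shared encoder with one classification head per task.
num_classes = {"sentiment": 2, "topic": 10, "nli": 3}
encoder = nn.Linear(32, 64)
heads = nn.ModuleDict({t: nn.Linear(64, c) for t, c in num_classes.items()})
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(heads.parameters()))

def scaled_loss(task, logits, labels):
    # Uniform predictions over C classes give cross-entropy log(C), so dividing
    # by log(C) puts all task losses on a comparable scale.
    return nn.functional.cross_entropy(logits, labels) / math.log(num_classes[task])

for step in range(100):
    optimizer.zero_grad()
    # Task-heterogeneous batch: accumulate gradients from several random tasks.
    for task in random.sample(list(num_classes), k=2):
        x = torch.randn(16, 32)                         # fake features
        y = torch.randint(num_classes[task], (16,))     # fake labels
        loss = scaled_loss(task, heads[task](encoder(x)), y)
        loss.backward()                                 # gradients accumulate
    optimizer.step()
```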

The model also learns better representations than standard RoBERTa by leveraging representations from pre-finetuned models trained on 34-40 tasks. The scale factor becomes evident here: the more tasks there are, the greater the data efficiency.


Data Labeling And The Hidden Costs In Machine Learning


The most challenging part of machine learning is data cleaning; on average, it takes 70% of the time allotted to a project. There are now AutoML systems that can handle much of the remaining 30% of the work. But along the way you have certainly made some assumptions, just as the No Free Lunch theorem predicts: a good model is always based on some assumptions. The question is whether you are aware of those beliefs. Below are some of the assumptions you may have made and their hidden costs.

Assumptions

The first one is 'you have the data.' Suppose you are building a facial recognition system: you cannot deploy open-sourced pre-trained models directly; you have to fine-tune the model for your local distribution. If the pre-trained model was trained on facial data sourced from dark-skinned populations, no matter how accurate its predictions are there, it is bound to mess up when deployed on light-skinned people. Hence, it becomes paramount to collect local training data for fine-tuning.

The second assumption is 'you have enough data.' This belief gets tested once you plot the training and testing error: if your model is overfitting, you are certainly going to need more data. Large models in particular require a significant amount of training data. How would you amass such a colossal amount of information? You have a few options: web scraping, obtaining open-source data with a similar distribution, or buying data from different suppliers.
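One simple way to test the 'enough data' assumption is a learning curve: train on increasing fractions of your data and check whether the validation score is still improving. A minimal sketch with scikit-learn follows; the dataset and model are placeholders.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=2000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

# A large gap between training and validation accuracy that keeps narrowing as the
# training size grows suggests that more data would still help.
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{int(n):5d} samples  train={tr:.3f}  val={va:.3f}")
```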

Of all the assumptions, the most critical one is 'you have ample resources to act on any of the above.' You need trained people, sufficient computing power, and the budget to pay for both. Frankly, there is always a trade-off among these three factors.

Also Read: What Is Liquid Machine Learning?

Hidden Costs

The worst part is that you may be unaware of the hidden costs behind those assumptions. At the very beginning we mentioned how a machine learning project's time is spent, but data labeling never appeared in that breakdown. What would you clean when you have no idea what target each instance has? The same argument applies to the first assumption: when you do not have local data labels, having 'THE' local data for fine-tuning will not be useful. Therefore, the first hidden cost is labeled-data availability.

The second assumption highlights the issue of data sources and their underlying presumptions. 'Garbage in, garbage out' is a rule of thumb in machine learning. If you believe the Internet is your abundant source of data, think again: the many blunders recorded in the AI Incident Database will make you stay away from such an idea. Secondly, the labeling paradigm will differ if you use open-source datasets: are you going to label everything manually again? Definitely not. And buying data will not give you an advantage over a competitor, because nothing restricts the seller from making a deal with your enemy at the gate. Hence, the second hidden cost is data quality.

The problem with the third assumption is the trade-off between workforce, capital, and computing resources. Ask yourself: how many MOOCs or courses include data labeling in their syllabus? Of all the models you have built, for how many did you annotate the data yourself? How much time did you set aside for data-labeling arrangements in your machine learning workflow? Thus, the last hidden cost is intent.

Solutions

By now, you should have a better understanding of the scenario you face as a data scientist or machine learning engineer. Let us now talk about a solution: the Training Data Platform (TDP). With a TDP, startups and micro, small, and medium enterprises (MSMEs) need not build in-house tools from scratch, freeing investment to capitalize on other services and products. These platforms provide a one-stop solution from data collection to labeling, and some even offer training provisions too.

Now you can streamline your machine learning workflow in a single environment, saving money and time, without forcing a capable workforce to chase fixes all day. The intuitive UIs of TDPs also make workforce training easy. The main mantra of TDPs is: annotate, manage, and iterate. TDPs offer automatic annotation that needs only a few well-annotated examples and labels the rest. A reasonable TDP has collaboration built in and supports the APIs of other software. Likewise, a TDP should be agile enough to iterate over new data batches to boost the accuracy of its models. Some TDPs that have earned their place by scaling up to enterprise-grade data platforms include Labelbox, Appen, SAMA, Scale, Neuralmarker, Unidata, and more.
