
Google Introduces Interpretable Ranking Via Generalized Additive Models


We are building ever more complex AI models to get our predictions right. In the end, we have very accurate predictions but no interpretation of the models' internal workings. These AI models have been introduced, in a controlled manner, into sensitive areas like determining bail or parole, assessing loan eligibility, targeting advertisements, and guiding medical treatment decisions.

But the lack of interpretability has hampered model maintenance and allowed social bias to persist in their predictions. To date, their participation in high-stakes decision processes remains limited. Google researchers are trying to change this accuracy-versus-interpretability trade-off. They have introduced interpretable rankings based on GAMs, Generalized Additive Models (Neural RankGAMs), which explain their decisions and outperform previous ranking methods.

The research ecosystem around explainability is still in its infancy. Most research has focused on post-hoc analysis, that is, analyzing the decisions of black-box models after prediction. Even these post-hoc analyses are not perfect; they offer limited interpretations of decisions for out-of-dataset instances, and in some cases they fail to explain model behavior at all. The other way to solve the interpretability problem is to build intrinsically interpretable models with a transparent, self-explanatory structure, in which every feature's effect on the prediction is visible and understandable.

Also Read: Data Labeling And The Hidden Costs In Machine Learning

Generalized Additive Models (GAMs) seem to fit the bill. They are interpretable models that have been tried and tested on both regression and classification tasks. A GAM outputs the sum of multiple sub-models' predictions, where each sub-model takes only one feature as input; each sub-model therefore reflects the contribution of its input feature to the final prediction. The Google researchers are the first to use GAMs for ranking tasks, where the goal is to order a list of items given some objective. They instantiate the ranking GAMs with neural networks and propose two different architectures: context-free ranking and context-present ranking.
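Concretely, a ranking GAM scores each item by summing independent per-feature contributions. A minimal sketch in our own notation (not the paper's exact formulation):

score(x) = f1(x1) + f2(x2) + ... + fn(xn)

where each sub-model fj is a small neural network that sees only feature xj, so its learned contribution curve can be plotted and inspected on its own. In the context-present variant, list-level context features additionally weight these per-feature contributions before they are summed.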

Each sub-model was individually distilled to produce smaller models with higher inference speed, a lower memory footprint, and a more straightforward structure. The central intuition is to train a smaller, simpler model by minimizing the loss between its output and that of the larger, more complex model.
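As a rough illustration of that intuition, here is a hedged PyTorch-style sketch of distilling one per-feature sub-model; the teacher and student sizes, the random feature samples, and the MSE objective are our assumptions, not details taken from the paper.

import torch
import torch.nn as nn

# Hypothetical teacher: the larger per-feature sub-model already trained inside the GAM.
teacher = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1)).eval()

# Smaller student with the same one-feature-in, one-score-out interface.
student = nn.Sequential(nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, 1))

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(1000):
    x = torch.rand(256, 1)              # samples of the single input feature
    with torch.no_grad():
        target = teacher(x)             # teacher's contribution for this feature
    loss = loss_fn(student(x), target)  # match the teacher's output
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()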

Neural RankGAMs outperformed various other ranking models by a considerable margin on the YAHOO and Chrome Web Store benchmarks, and the researchers showed that the performance boost comes from the models' capacity to learn item-level features and list-level contexts.


Language Models Exhibit Larger Social Bias Than Human-Written Texts


Current language models can produce convincing open-ended sentences from a short prompt. But these generations are riddled with controversies, from questionable correlations to propagated social biases, such as associating Islam with terrorism. Until now, there was no benchmark for studying these harms, nor measures of the different social biases exhibited by language models.

A recent paper from Amazon Alexa and UC Santa Barbara researchers, published in the prestigious Association for Computational Linguistics (ACL), proposed BOLD — Bias in Open-Ended Language Generation Dataset — a standard benchmark for studying bias and fairness in Natural Language Generation (NLG). The researchers are also the first to develop new automated metrics for toxicity, psycholinguistic norms, and text gender polarity.

The intuitive idea is to present the language models with carefully selected, human-written natural prompts, which should surface the biases reinforced in them. The BOLD dataset therefore contains 23,679 English prompts spread across five domains (profession, gender, race, religion, and political ideology) spanning 43 different sub-groups. These prompts are drawn from naturally diverse content written by various authors on Wikipedia.

The researchers also automated the measurement of various biases and prejudices. Disrespectful, abusive, unpleasant, and harmful sentences generated from the prompts are considered toxic. A BERT model was trained separately on the Jigsaw toxic-comment dataset to predict a toxicity score for generated sentences.
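For illustration, a hedged sketch of how such a toxicity scorer could be wired up with Hugging Face Transformers; the checkpoint name, label layout, and scoring function below are assumptions rather than the paper's actual classifier.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Stand-in checkpoint; the paper trained its own BERT on the Jigsaw toxic-comment data.
MODEL_NAME = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def toxicity_score(sentence: str) -> float:
    """Return the probability that a generated sentence is toxic."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()  # assume label index 1 == "toxic"

print(toxicity_score("An example generated sentence."))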

Also Read: The Facebook MUPPET Show

For the sentiment score, they used the Valence Aware Dictionary and sEntiment Reasoner (VADER). Scores greater than 0.5 and less than -0.5 convey positive and negative sentiment, respectively. To measure each word's affective meaning along various dimensions, a trained multitask feed-forward neural network was used to predict psycholinguistic norms at the word level.
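For reference, a minimal sketch of scoring a generation with the open-source vaderSentiment package, using the same 0.5 and -0.5 cutoffs described above:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def sentiment_label(text: str) -> str:
    # 'compound' is VADER's normalized overall score in [-1, 1].
    compound = analyzer.polarity_scores(text)["compound"]
    if compound > 0.5:
        return "positive"
    if compound < -0.5:
        return "negative"
    return "neutral"

print(sentiment_label("The nurse was kind and thoughtful."))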

Regard was defined as a human-annotated measure of bias that captures polarity towards a demographic rather than overall language polarity. A numeric regard score was computed with ewsheng's bias classifier, trained on a bias dataset curated with GPT-2. To ascertain the gender polarity of a generated text, the researchers used hard-debiased word2vec embeddings, re-weighting gender-polar words so that the many gender-neutral terms in a text do not overshadow them.
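A hedged sketch of the gender-polarity idea, projecting words onto a she-minus-he embedding direction; the pre-trained (non-debiased) embeddings and the simple averaging below are stand-ins for the paper's hard-debiased word2vec and re-weighting scheme.

import numpy as np
import gensim.downloader as api

# Pre-trained embeddings as a stand-in for the paper's hard-debiased word2vec.
wv = api.load("word2vec-google-news-300")

gender_direction = wv["she"] - wv["he"]
gender_direction /= np.linalg.norm(gender_direction)

def gender_polarity(text: str) -> float:
    """Average signed projection onto the she-he direction;
    positive leans female, negative leans male."""
    scores = []
    for word in text.lower().split():
        if word in wv:
            vec = wv[word] / np.linalg.norm(wv[word])
            scores.append(float(np.dot(vec, gender_direction)))
    # A fuller implementation would re-weight strongly gendered words so the
    # many near-neutral words do not wash them out.
    return sum(scores) / len(scores) if scores else 0.0

print(gender_polarity("The engineer finished her design ahead of schedule"))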

Experiments on three popular language models (GPT-2, BERT, and CTRL) found that most professions, such as writing, science, art, and engineering, are skewed towards the male gender; only nursing is skewed towards the female gender. Negative sentiments were found to be more correlated with males and positive ones with females. Darker-skinned races were associated with lower regard than their fair-skinned counterparts.

Christianity was correlated with the lowest toxicity, while Islam and atheism were painted as highly toxic. The researchers concluded that most language models exhibit larger social bias than human-written Wikipedia text across all domains. They also note that the benchmark is not perfect: it covers a limited set of disciplines and specific sub-groups, and considers only binary genders and a handful of races.


Microsoft’s Gooseberry Treat For Quantum Computing


In collaboration with the University of Sydney, Microsoft has built Gooseberry, a cryogenic quantum-control chip for controlling thousands of qubits. They placed the whole control structure near the qubits themselves, in a near absolute-zero environment, which is a first in the field. The work was featured in the prestigious journal Nature Electronics.

Quantum computing is in its infancy right now, comparable to the early days of classical computers. It promises a great deal of computing power and an entirely novel set of algorithms for some of the most troubling problems in cryptography, chemistry, weather forecasting, and more. Its basic computing unit, the qubit, can encode much more information via the superposition of 0 and 1, but has a terrible reputation for reacting to the slightest perturbation. Information is still encoded in and read from the qubits via electrical signals, so manipulating them is a delicate matter that calls for a control chip to reduce the error margins in information handling.

Also Read: IBM And Daimler Simulates Materials With Fewer Qubits

It is common practice in the quantum industry to place the control structures away from the qubits, to safeguard the information stored in the qubits from electronic noise. The Microsoft researchers instead designed their chip interface so that the control chip can sit with the qubits themselves. Rather than a rack of room-temperature electronics generating electrical pulses for qubits held in 0.3-kelvin refrigerators, the Gooseberry quantum chip is placed inside the refrigerator with the qubits. This arrangement results in a tightly regulated and stable environment.

The Microsoft researchers have also built a cryogenic compute core that operates at much warmer temperatures and performs the classical calculations needed to determine the instructions for the Gooseberry chip, which then feeds the electrical signals directly to the qubits. With more room to generate heat and perform computations, the core enables general computing like any other CPU.


The Facebook MUPPET Show


Facebook researchers have scaled up a relatively new technique, pre-finetuning (PFT), in their MUPPET paper to multi-task learning over 50 tasks at a vast scale of 4.8 million instances. They showed that PFT increases both the performance and the sample efficiency of fine-tuned models like BERT, RoBERTa, and others, and even set new records on the RTE and HellaSWAG benchmarks.

The usual workflow in large-scale language modeling is pre-training via self-supervision over massive unlabeled datasets, followed by fine-tuning on the task at hand with relatively little labeled data. This arrangement works fine as long as the datasets and tasks are related. But for low-resource languages, or for individual tasks with very little labeled data, this training scheme leaves the language models starved.

Also Read: Data Labeling And The Hidden Costs In Machine Learning

In 2019, a group of researchers introduced a pre-finetuning (PFT) stage, in a paper named 'Tri-Train', that sits between pre-training and fine-tuning to overcome this problem. They constructed a small corpus by selecting sentences from the unlabeled pre-training data that are relevant to the labeled training data, then fine-tuned the pre-trained model on just two tasks: predicting the next word in sentences from the small corpus, and predicting the start/end words of those sentences.

Facebook's MUPPET — Massive Multi-task Representations with Pre-Finetuning — extends this work to a new level. The researchers used 50 diverse tasks covering classification, summarization, question answering, and commonsense reasoning. Their investigation showed that naive multi-task learning schemes fail to learn useful representations and are unstable. However, their experiments also showed that scale plays a significant role in multitask learning.

Pre-finetuning with only a few tasks actually degrades representation quality relative to the pre-trained model; beyond a critical point, usually around 15 tasks, performance improves roughly linearly with the number of tasks.

The researchers used loss scaling and task-heterogeneous batches so that learning remains balanced across competing tasks, which significantly improves training stability and overall performance. To train on several tasks, the model carries task-specific heads, each optimizing a task-specific loss. Each data point's loss is scaled so that, if the class distributions and the model's predictions were uniform, all of the task-specific losses would have equivalent values.
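A hedged sketch of that loss-scaling idea: divide each task's cross-entropy by log(number of classes), the loss a uniform predictor would incur, so tasks with different label counts contribute comparably. The exact scaling used in the paper may differ.

import math
import torch
import torch.nn.functional as F

def scaled_task_loss(logits: torch.Tensor, labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Cross-entropy normalized so tasks with different label counts are comparable.

    log(num_classes) is the loss of a uniform predictor, so after scaling a
    chance-level model scores roughly 1.0 on every task regardless of class count.
    """
    return F.cross_entropy(logits, labels) / math.log(num_classes)

# Example: a 2-class task and a 20-class task now contribute comparable losses.
logits_a, labels_a = torch.randn(8, 2), torch.randint(0, 2, (8,))
logits_b, labels_b = torch.randn(8, 20), torch.randint(0, 20, (8,))
total = scaled_task_loss(logits_a, labels_a, 2) + scaled_task_loss(logits_b, labels_b, 20)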

Similarly, the researchers proposed task-heterogeneous batches to optimize several potentially competing objectives and create a single global representation across the training tasks. During gradient descent, moving along the gradient of one task may not be the optimal direction for learning a unified representation across all tasks. To overcome this, the model optimizes over batches that consist of several tasks: each worker samples a random batch from the set of tasks and computes a gradient, and these gradients are accumulated for the final update, as sketched below.
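A hedged, toy-sized sketch of task-heterogeneous batching via gradient accumulation; the tiny shared encoder, the two made-up tasks, and the uniform task weighting are placeholders, not Facebook's implementation.

import torch
import torch.nn as nn

# Toy multi-task model: shared encoder plus one head per task (stand-in for RoBERTa + heads).
class MultiTaskModel(nn.Module):
    def __init__(self, tasks):
        super().__init__()
        self.encoder = nn.Linear(16, 32)
        self.heads = nn.ModuleDict({t: nn.Linear(32, n) for t, n in tasks.items()})

    def forward(self, x, task):
        return self.heads[task](torch.relu(self.encoder(x)))

tasks = {"sentiment": 2, "topic": 20}
model = MultiTaskModel(tasks)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One heterogeneous step: accumulate gradients from a batch of every task,
# then apply a single parameter update.
optimizer.zero_grad()
for task, num_classes in tasks.items():
    x = torch.randn(8, 16)
    y = torch.randint(0, num_classes, (8,))
    loss = nn.functional.cross_entropy(model(x, task), y) / len(tasks)
    loss.backward()                 # gradients accumulate across tasks
optimizer.step()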

The model also learns better representations than standard RoBERTa when it leverages representations from pre-finetuned models trained on 34 to 40 tasks. The scale factor is evident: the more tasks, the better the data efficiency.


Data Labeling And The Hidden Costs In Machine Learning


The most challenging part of machine learning is data cleaning; on average it takes 70% of the time allotted to a project. There are now AutoML systems that can handle much of the remaining 30% of the work. But in getting there you have certainly made some assumptions; just as the No Free Lunch Theorem suggests, a good model is always based on some assumptions. The question is whether you are aware of those assumptions. Below are some of the assumptions you may have made, and their hidden costs.

Assumptions

The first one is 'you have the data.' Suppose you are building a facial recognition system: you cannot deploy open-sourced pre-trained models directly; you have to fine-tune the model for your local distribution. If the pre-trained model was trained on facial data sourced from dark-skinned populations, then no matter how accurate its predictions are, it is bound to fail when deployed on light-skinned people. It therefore becomes paramount to collect local training data for fine-tuning.

The second assumption is 'you have enough data.' This belief gets tested once you plot the training and testing error. If your model is overfitting, you will certainly need more data, and large models require a significant amount of training data. How would you amass such a colossal amount of information? You have a few options: web scraping, obtaining open-source data with a similar distribution, and/or buying data from suppliers.

The most critical assumption of all is 'you have ample resources to act on any of the above.' You need trained people, sufficient computing power, and the budget to pay for both. Frankly, there is always a trade-off among these three factors.

Also Read: What Is Liquid Machine Learning?

Hidden Costs

The worst part is that you may be unaware of the hidden costs of those assumptions. At the very beginning we mentioned the time-share of a machine learning project, but data labeling did not appear in it. What would you clean when you have no idea what target each instance has? The same argument follows for the first assumption: when you do not have local data labels, having 'the' local data for fine-tuning will not be useful. So the first hidden cost is labeled-data availability.

The second assumption highlights the issue of data sources and their underlying presumptions. 'Garbage in, garbage out' is a rule of thumb in machine learning. If you believe the Internet is your abundant source of data, think again: many blunders recorded in the AI Incident Database will make you stay away from that idea. Moreover, the labeling paradigm will differ if you use open-source datasets. Are you going to use manual labor to redo the labeling? Probably not. And buying data will not give you any advantage over your competitors, because the seller is not restricted from striking the same deal with them. Hence, the second hidden cost is data quality.

The problem with the third assumption is the trade-off between workforce, capital, and computing resources. Ask yourself: how many MOOCs or courses include data labeling in their syllabus? Of all the models you have built, for how many did you annotate the data yourself? How much time did you set aside for data labeling in your machine learning workflow? The last hidden cost, then, is intent.

Solutions

By now you should have a better understanding of the scenario you face as a data scientist or machine learning engineer. Let us now talk about a solution: the Training Data Platform (TDP). With a TDP, startups and micro, small, and medium enterprises (MSMEs) need not build in-house tools from scratch, saving investments that can be spent on other services and products. These platforms provide a one-stop solution from data collection to labeling, and some even offer training provisions.

You can then streamline your machine learning workflow in a single environment and save money and time, rather than forcing a capable workforce to scramble for fixes all day. The intuitive UI of a TDP also makes workforce training easy. The main mantra of TDPs is annotate, manage, and iterate: automatic annotation needs only a few well-annotated examples and then labels the rest; a reasonable TDP has collaboration built in, along with support for other software's APIs; and the TDP should be agile enough to iterate over new batches of data to boost the accuracy of its models. Some TDPs that have earned their place by scaling up to enterprise-grade data platforms include Labelbox, Appen, SAMA, Scale, Neuralmarker, Unidata, and more.


What Is Liquid Machine Learning?


Let us imagine for a minute the most beautiful moments of our lives. How do you mentally visualize those memories? Now look at your surroundings: do you see the constant changes taking place with time? The things you see, the sounds you hear, the activities you do, the decisions you make, and the thoughts in your mind always come in some sequence; these kinds of physical changes are sequential. Of all the tools we currently possess for modeling sequential data, which has the highest accuracy? You will find that the brain is the right answer.

Thankfully, we have a mathematical structure, the ordinary differential equation (ODE), to capture changes taking place over time. But do we have any computational analogue of such a magnificent learning model? Yes, the famous recurrent neural networks (RNNs). An obvious question then arises: is there a common link between the two? Absolutely. There is a different breed of neural models, hybrids of both structures: continuous-time (CT) models.

CT models come in three main types: continuous-time RNNs, continuous-time gated recurrent units (CT-GRUs), and Neural ODEs. These models have been very successful in modeling time-series data from financial to medical domains, but open questions remained about the scope for improving their expressivity and the limitations of their current forms in learning richer feature representations. To answer these questions, MIT researchers unveiled a new class of neural networks, accepted at the prestigious AAAI 2021 venue. These networks are flexible enough to capture the varying input distribution of the data stream fed into them, even after training is complete. That flexibility has earned them the 'liquid' label, and they are now popular as liquid machine learning models.

Also Read: Concept-Whitening For Interpreting Neural Networks At Ease

In these models, there are not many successive hidden layers but only a set of differential equations. The hidden states are calculated from the states of the differential equations via an ODE solver, and the parameters, shared across all layers, are updated based on the state of the 'liquid' model's innate differential equations.
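A hedged numerical sketch of that idea: the hidden state follows an ODE whose effective time constant depends on the current input, stepped here with a simple fused (semi-implicit) Euler update. The constants, shapes, and single-step solver are illustrative choices, not the authors' released code.

import numpy as np

def ltc_step(x, inp, W_in, W_rec, b, tau, A, dt=0.05):
    """One fused Euler step of a liquid time-constant cell.

    dx/dt = -(1/tau + f(x, I)) * x + f(x, I) * A
    where f is a bounded nonlinearity of the input and recurrent state,
    so the effective time constant changes with the incoming data.
    """
    f = 1.0 / (1.0 + np.exp(-(inp @ W_in + x @ W_rec + b)))  # sigmoid gate
    return (x + dt * f * A) / (1.0 + dt * (1.0 / tau + f))

# Illustrative dimensions: 4 inputs driving 8 hidden units.
rng = np.random.default_rng(0)
x = np.zeros(8)
W_in, W_rec = rng.normal(size=(4, 8)), rng.normal(size=(8, 8))
b, tau, A = np.zeros(8), np.ones(8), np.ones(8)

for t in range(100):                       # unroll over a toy input stream
    x = ltc_step(x, rng.normal(size=4), W_in, W_rec, b, tau, A)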

The design draws inspiration from biological neural systems. The lead author, Hasani, worked on designing worm-inspired neural networks in his doctoral thesis. In this paper, the researchers chose a similar set of differential equations for the 'liquid' machine learning model, modeled on the neural dynamics of the nematode Caenorhabditis elegans with its 302 neurons and roughly 8,000 synapses. In a related paper, the researchers had shown that such bio-inspired CT models form a Neural Circuit Policy in which neurons have increased expressivity and interpretability.

To test their model, the researchers open-sourced the code so that others can pit it against the other CT models. In the time-series modeling tasks, the model outperformed all CT models on four of eight benchmarks and lagged the others by only a minimal margin. The LTC model set the highest accuracy, 88.2%, on the Half-Cheetah kinematic modeling dataset. Measured by trajectory length, it travels the least distance among the CT models, reflecting the efficiency gains in parameter count. The smaller network translates into lower data requirements and energy cost for training, as well as faster inference.

As an impact evaluation, the researchers claim that liquid machine learning models can help with decision-making where uncertainty is greatest, such as medical diagnosis and autonomous driving. Because they learn richer representations with fewer nodes, the researchers are hopeful of computationally efficient applications in signal-processing domains like robot control, natural language processing, and video processing.

The researchers, however, pointed out the increased memory footprint. They also noted the absence of long-term dependencies, which remain at the core of all state-of-the-art sequential models. The most considerable drawback of all is the dependency on the numerical optimization of specific ODE solvers. Hence, the implementations may not be usable in an industrial setting just yet.


Baidu Receives Permit To Test Self-Driving Cars In California Without Safety Drivers


Baidu has become the sixth company to obtain approval for testing self-driving cars without safety drivers in Sunnyvale, joining Waymo, Zoox, AutoX, Nuro, and Cruise in testing fully autonomous vehicles in California. Baidu has deployed self-driving cars in California since 2016, but this permit from the California Department of Motor Vehicles allows the company to test three fully autonomous vehicles.

The company can test its vehicles on roads without heavy fog or rain, on streets with speed limits not exceeding 45 miles per hour. Like other fully autonomous vehicle providers, Baidu fulfilled the requirements of providing evidence of insurance or a bond equal to $5 million, demonstrating the capability of operating without a safety driver, and more.

Other compliance requirements for testing fully autonomous vehicles include continuously monitoring the status of the vehicles, reporting collisions within 10 days, and filing an annual disengagement report. According to the company, Baidu topped California's 2019 DMV disengagement report, covering a total of 108,300 miles.

With the permit, Baidu is now the first company approved for two different vehicle models, the Lincoln MKZ and the Chrysler Pacifica. Late last year, Baidu received approval for driverless tests on public streets in China, a first in its home country.

Baidu has been developing self-driving cars since 2013 and built the first open-source autonomous driving platform, Baidu Apollo. The platform was open-sourced in 2017 and has engaged more than 45,000 developers and 210 industry partners.


Facebook’s Single Model XLSR For Speech Recognition In 53 Languages


Facebook AI researchers recently open-sourced their unsupervised cross-lingual speech recognition model, XLSR, which can handle 53 languages at once. The model provides a 72% phoneme-error-rate reduction and a 16% word-error-rate reduction on the CommonVoice and BABEL benchmarks, respectively, compared to the previous best results.

Why Unsupervised Training?

Multilingual models leverage data from other languages to transcribe multiple languages. These models mostly use supervised cross-lingual training, which requires labeled data in various languages. But transcribed speech, which acts as the label, is often far scarcer than unlabeled speech and requires non-trivial human annotation. Hence, unsupervised representation learning, or pre-training, is used because it does not require labeled data. The choice is supported by previous experiments showing that cross-lingual pre-training is especially useful for low-resource languages.

What Is New This Time?

Unsupervised representation learning has mostly been studied in the monolingual setting. The Facebook researchers extend it to the cross-lingual setting by learning representations on unlabeled data that generalize across languages. Unlike previous works, they fine-tune the transformer part of the model instead of freezing all pre-trained representations or feeding them to a separate downstream model, packing everything into a single model.

Also Read: BERT-Based Language Models Are Deep N-Gram Models

How Does The XLSR Architecture Look?

The researchers build on the pre-training approach of the wav2vec 2.0 model. The model contains a convolutional feature encoder that maps raw audio to latent speech representations, which are fed to a transformer network that outputs context representations. The transformer architecture follows BERT, except for relative positional embeddings. The feature-encoder representations are discretized with a quantization module to provide the targets of the self-supervised training objective. A shared quantization module ensures multilingual quantized speech units, whose embeddings are then used as targets for a transformer trained by contrastive learning. XLSR therefore jointly learns contextualized speech representations and a discrete vocabulary of latent speech representations; the latter is used to train the model with a contrastive loss, and because these discrete speech representations are shared across languages, they create bridges among them.
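For orientation, a hedged usage sketch that loads the pretrained XLSR-53 encoder from the Hugging Face hub and extracts contextual representations; the checkpoint name reflects our best understanding and should be verified, and the zero waveform is only a placeholder.

import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Pretrained XLSR-53 checkpoint as published on the Hugging Face hub (name assumed; verify before use).
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large-xlsr-53")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")

waveform = torch.zeros(16000)  # one second of 16 kHz audio (placeholder)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # Convolutional encoder + transformer produce contextual representations.
    hidden = model(**inputs).last_hidden_state   # shape: (1, frames, 1024)
print(hidden.shape)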

How Is XLSR Pre-trained?

It is pre-trained on 56,000 hours of speech in 53 languages drawn from datasets like BABEL (conversational telephone data), CommonVoice (a corpus of read speech), and Multilingual LibriSpeech (MLS). When pre-training on L languages, multilingual batches are formed by sampling speech from a multinomial distribution over the languages. The model is trained by solving a contrastive task over masked latent speech representations while learning a quantization of the latent features that is shared across languages. Time steps are randomly chosen as mask start indices with probability 0.065, and the following ten time steps are masked. The objective requires identifying the true quantized latent for a masked time step among K = 100 distractor latents sampled from other masked time steps. A codebook diversity penalty is added to encourage the model to use all codebook entries.
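A hedged sketch of the contrastive objective at a single masked time step, with K = 100 distractors as described above; the feature dimension and temperature are illustrative, and this is not the released training code.

import torch
import torch.nn.functional as F

def contrastive_loss(context, true_latent, distractors, temperature=0.1):
    """wav2vec 2.0-style contrastive loss for one masked time step.

    context:      transformer output at the masked position, shape (d,)
    true_latent:  quantized latent for that position, shape (d,)
    distractors:  K quantized latents sampled from other masked positions, shape (K, d)
    """
    candidates = torch.cat([true_latent.unsqueeze(0), distractors], dim=0)   # (K+1, d)
    sims = F.cosine_similarity(context.unsqueeze(0), candidates, dim=-1)     # (K+1,)
    logits = sims / temperature
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

# Toy example with K = 100 distractors, as in the paper.
d = 256
loss = contrastive_loss(torch.randn(d), torch.randn(d), torch.randn(100, d))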

How Is XLSR-53 Fine-Tuned?

To fine-tune the model, a classifier representing the downstream task's output vocabulary is added on top and trained on the labeled data with a Connectionist Temporal Classification (CTC) loss. The weights of the feature encoder are not updated during fine-tuning. For CommonVoice they fine-tuned for 20k updates, and for BABEL 50k updates, on 2 GPUs for the Base model (XLSR-10) and 4 GPUs for the Large model (XLSR-53).
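A hedged sketch of the CTC fine-tuning step; the vocabulary size, the random stand-in for the transformer outputs, and the plain linear classifier are simplifications of the actual setup.

import torch
import torch.nn as nn

vocab_size = 32                       # downstream output vocabulary (illustrative)
hidden_dim = 1024                     # XLSR-53 transformer hidden size

classifier = nn.Linear(hidden_dim, vocab_size)   # head added on top of the transformer
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# Pretend transformer outputs for a batch of 2 utterances, 120 frames each.
hidden_states = torch.randn(2, 120, hidden_dim)  # would come from the XLSR model
log_probs = classifier(hidden_states).log_softmax(-1).transpose(0, 1)  # (T, N, C) for CTCLoss

targets = torch.randint(1, vocab_size, (2, 20))  # label token ids (0 reserved for blank)
input_lengths = torch.full((2,), 120)
target_lengths = torch.full((2,), 20)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # updates the classifier (and transformer), not the frozen feature encoder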

What Were The Results?

Cross-lingual pre-training significantly outperforms monolingual pre-training, thanks to the latent discrete speech representations shared across languages, with increased sharing between related languages. As a result, the multilingual pre-trained model implicitly learns to cluster related languages.

The researchers also demonstrated that XLSR representations can be fine-tuned on multiple languages simultaneously to obtain a multilingual speech recognition system whose performance is competitive with fine-tuning a separate model per language. The model's learned representations also transfer well to unseen languages. Cross-lingual transfer learning improves low-resource language understanding, but there is a transfer-interference trade-off: low-resource languages benefit, while high-resource languages are hurt.

To tinker with the code and read more, visit here.


The Statistics Ministry Adopts AI For Faster Economic Insights


India's statistics ministry is turning from manual labor to intelligent machines to speed up the gathering of economic data and the drawing of insights. India has been accused of fudging the numbers that indicate the country's economic buoyancy to meet political ends. On top of that, revisions to published data and an insufficient workforce to crunch the numbers have rubbed salt into the wound.

In the timely reporting of statistics, India lags behind major Asian counterparts like China and Japan: India's quarterly GDP data are reported with a lag of two months, whereas China reports within three weeks. Similarly, Indian employment data is always a year behind, compared with the US and Europe, which reported weekly unemployment rates during the pandemic. The ministry therefore wants to automate the collection and analysis of economic data to better monitor financial institutions.

Also Read: MeITY And AWS Sets Up Quantum Computing Application Lab

Statistics Secretary Kshatrapati Shivaji said, “Because of the changing landscape, there’s a growing need for more and more data, faster data and also more refined data products. With end-to-end computerization, this type of automation will enhance the quality, credibility, and timeliness of data.” 

A $60 million program with the World Bank has pushed the ministry to build an information portal that collates real-time data. The Indian government has announced the National Policy on Official Statistics to catalyze major reforms in the Indian Statistical System, meet the increased demands on data systems, and produce evidence-based forecasting. Real-time monitoring of the economy and governance using AI will be carried out via the National Integrated Information Portal (NIIP), an institute being established in the Ministry of Statistics and Programme Implementation to provide a high-end platform for data analytics and to help realize the objectives of real-time data availability and inference. The government also plans to set up data centers across India for in-house analysis of domestic economic data. It is now the implementation that will decide whether the statistics ministry's AI adoption succeeds.


DRDO Is Offering A 12-Week-Long Online Course On Artificial Intelligence


The Defence Research & Development Organisation (DRDO) is offering a 12-week online course on artificial intelligence as part of the training and certification programme of its Defence Institute of Advanced Technology (DIAT). Unlike many online courses, this programme involves 120 contact hours: two hours a day, five days a week. Not everyone can apply, however; learners must clear a test to qualify and enroll in the course.

The test covers modular mathematics, statistics, probability theory, basics of algorithms, data structures, databases, and knowledge of any programming language. Although the entrance test is free, you will have to pay ₹17,700 plus 18% GST to enroll in the course.

After completing the course, you will be awarded a certificate from DIAT to showcase your expertise in artificial intelligence and machine learning. The programme is designed for graduates of any discipline, and final-year students can apply too.

Focused on both fundamentals and advanced topics, the course framework includes probability, pattern recognition, machine learning, deep learning, computer vision, augmented reality, and natural language processing.

Registration for the test is open from today until 15 February 2021, and the entrance test will be held on 20 February. The paid course itself will start on 28 February 2021.

DIAT has also invited applications for its cybersecurity course, which has a programme structure similar to the AI and ML course; learners who know C/C++/Java or any object-oriented language plus a scripting language such as PHP, Python, Ruby, or Perl are eligible. This course covers cybersecurity essentials, forensics and incident response, system/driver programming and OS internals, reverse engineering and malware analysis, and more.

Devised by experts from DIAT, DRDO, and the Ministry of Defence, the course is a strong option for beginners who want to learn from some of the best in the industry.
