
What Is Liquid Machine Learning?


Imagine for a minute the most beautiful moments of your life. How do you mentally visualize those memories? Now look at your surroundings and notice the constant changes taking place with time. The things you see, the sounds you hear, the activities you do, the decisions you make, and the thoughts in your mind all arrive in some sequence; these physical changes are inherently sequential. Of all the tools we currently possess for modeling sequential data, which one models it most accurately? The brain.

Thankfully, we have a mathematical structure, the ordinary differential equation (ODE), to capture changes taking place over time. But do we have a computational analogue of such a magnificent learning model? Yes, the well-known recurrent neural networks (RNNs). An obvious question then arises: is there a common link between the two? Absolutely. There is a different breed of neural models that hybridizes both structures: continuous-time (CT) models.

CT models come mostly in three types: continuous-time RNNs, continuous-time gated recurrent units (CT-GRU), and Neural ODEs. These models have been very successful at modeling time-series data from the financial to the medical domain. Still, open questions remained about how much their expressibility could be improved and how limited their current forms are at learning richer feature representations. To address these questions, MIT researchers unveiled a new class of neural network, accepted at the prestigious AAAI 2021 venue. These networks are flexible enough to adapt to the varying input distribution of the data stream fed into them, even after training is complete. This flexibility earned them the name ‘liquid’, and they are now popular as liquid machine learning models.

Also Read: Concept-Whitening For Interpreting Neural Networks At Ease

In these models, there are no long stacks of successive hidden layers, only a set of differential equations. The hidden states are computed from the states of those differential equations via an ODE solver. The authors allow the parameters, which are shared across all layers, to be updated based on the state of the ‘liquid’ model’s innate differential equations.
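For a concrete picture, here is a minimal numerical sketch of such an ODE-defined hidden state, following the general liquid time-constant (LTC) formulation with a fused semi-implicit Euler step as the solver. The weights, sizes, and the activation f below are illustrative stand-ins, not the authors' released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_input = 8, 3

# Illustrative (untrained) parameters.
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))   # recurrent weights
U = rng.normal(scale=0.1, size=(n_hidden, n_input))    # input weights
b = np.zeros(n_hidden)
tau = np.ones(n_hidden)        # base time constants
A = np.ones(n_hidden)          # bias / reversal term

def f(x, u):
    """Nonlinear synaptic drive; squared tanh keeps the gate non-negative."""
    return np.tanh(W @ x + U @ u + b) ** 2

def ltc_step(x, u, dt=0.1):
    """One fused semi-implicit Euler step of dx/dt = -(1/tau + f)*x + f*A."""
    gate = f(x, u)
    return (x + dt * gate * A) / (1.0 + dt * (1.0 / tau + gate))

# The same small set of equations, unrolled by the solver, plays the role
# of the hidden layers for a toy input sequence.
x = np.zeros(n_hidden)
for u_t in rng.normal(size=(20, n_input)):
    x = ltc_step(x, u_t)
print(x)
```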

The design draws inspiration from biological nervous systems. The lead author, Hasani, had worked on designing worm-inspired neural networks in his doctoral thesis. In this paper, the researchers chose for the ‘liquid’ machine learning model a set of differential equations similar to those governing the neural dynamics of the nematode Caenorhabditis elegans, which has 302 neurons and about 8,000 synapses. In a related paper, the researchers had shown that such bio-inspired CT models form a Neural Circuit Policy whose neurons have increased expressibility and interpretability.

To test their model, the researchers open-sourced their code so others can pit it against other CT models. In the time-series modeling tasks, the model outperformed all other CT models on four out of eight benchmarks and lagged by only a small margin on the remaining four. The LTC model set the highest accuracy, 88.2%, on the Half-Cheetah kinematic modeling dataset. It also traverses the shortest trajectory length among the CT models, reflecting its efficiency in parameter count. The smaller network translates into lower data requirements and energy cost for training, as well as faster inference.

In their impact evaluation, the researchers claim that liquid machine learning models can help with decision making where uncertainty prevails the most, such as medical diagnosis and autonomous driving. Because the models learn richer representations with fewer nodes, the researchers are hopeful about computationally efficient applications in signal-processing domains such as robot control, natural language processing, and video processing.

The researchers, however, pointed out the increased memory footprint. They also noted the absence of a mechanism for long-term dependencies, which remain at the core of all state-of-the-art sequential models. The most considerable drawback of all is the dependency on the numerical optimization of specific ODE solvers. Hence, the implementation may not be ready for industrial settings as of now.


Baidu Receives Permit To Test Self-Driving Cars In California Without Safety Drivers


Baidu has become the sixth company to obtain approval to test self-driving cars without safety drivers in Sunnyvale, California. The company joins Waymo, Zoox, AutoX, Nuro, and Cruise in testing fully autonomous vehicles in the state. Baidu has deployed self-driving cars in California since 2016, but this permit from the California Department of Motor Vehicles will allow the company to test three autonomous vehicles without a safety driver.

The company can test its vehicles on roads with speed limits not exceeding 45 miles per hour, and only outside of heavy fog and rain conditions. Like other fully autonomous vehicle providers, Baidu fulfilled the requirements of providing evidence of insurance or a bond equal to $5 million, demonstrating the capability of operating without a safety driver, and more.

Other compliance requirements for testing fully autonomous vehicles include continuously monitoring the status of the vehicles, reporting collisions within 10 days, and filing an annual disengagement report. According to the company, Baidu topped California’s 2019 DMV disengagement report, covering a total of 108,300 miles.

With the permit, Baidu is now the first company to get approval for two different vehicle models, the Lincoln MKZ and the Chrysler Pacifica. Late last year, Baidu received approval to run driverless tests on public streets in China, making it the first company to do so in its home country.

Baidu has been developing self-driving cars since 2013 and launched the first open-source autonomous driving platform, Baidu Apollo. The platform was open-sourced in 2017 and has engaged more than 45,000 developers and 210 industry partners.


Facebook’s Single Model XLSR For Speech Recognition In 53 Languages


Facebook AI researchers recently open-sourced XLSR, their unsupervised cross-lingual speech recognition model that can handle 53 languages at once. Compared to the previous best results, the model delivers a 72% reduction in phoneme error rate on the CommonVoice benchmark and a 16% reduction in word error rate on the BABEL benchmark.

Why Unsupervised Training?

Multilingual models leverage data from other languages to transcribe multiple languages. Most of these models use supervised cross-lingual training, which requires labeled data in various languages. But transcribed speech, which serves as the label, is often far scarcer than unlabeled speech and requires non-trivial human annotation. Hence, unsupervised representation learning, or pre-training, is used, as it does not require labeled data. The choice is supported by previous experiments showing that cross-lingual pre-training is especially useful for low-resource languages.

What Is New This Time?

Unsupervised representation learning has mostly been explored in the monolingual setting. The Facebook researchers extend it to the cross-lingual setting by learning representations on unlabeled data that generalize across languages. Unlike previous works, the researchers fine-tuned the transformer part of the model instead of freezing all pre-trained representations or feeding them to a separate downstream model, managing to pack everything into a single model.

Also Read: BERT-Based Language Models Are Deep N-Gram Models

How Does The XLSR Architecture Look?

The researchers build on the pre-training approach of the wav2vec 2.0 model. The model contains a convolutional feature encoder that maps raw audio to latent speech representations, which are fed to a transformer network that outputs context representations. The transformer architecture follows BERT, except for its relative positional embeddings. The feature encoder representations are discretized with a quantization module to produce the targets of the self-supervised training objective. A shared quantization module ensures multilingual quantized speech units, whose embeddings are then used as targets for a transformer trained by contrastive learning. XLSR therefore jointly learns contextualized speech representations and a discrete vocabulary of latent speech representations. The latter is used to train the model with a contrastive loss, and these discrete speech representations are shared across languages, creating bridges among them.
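To make the pipeline concrete, the toy PyTorch sketch below wires a convolutional feature encoder, a stand-in quantizer, and a small transformer context network together in the wav2vec 2.0 style. All module sizes, the nearest-code quantizer, and the omission of masking are simplifying assumptions, not Facebook's implementation.

```python
import torch
import torch.nn as nn

class TinyWav2Vec(nn.Module):
    def __init__(self, dim=256, codebook_size=64):
        super().__init__()
        # Convolutional feature encoder: raw audio -> latent speech representations z.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.GELU(),
        )
        # Shared quantization module (the real model uses Gumbel-softmax over
        # codebook groups; nearest-code lookup stands in for it here).
        self.codebook = nn.Embedding(codebook_size, dim)
        # Transformer context network (BERT-like; relative positions omitted).
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, audio):                                   # audio: (B, samples)
        z = self.encoder(audio.unsqueeze(1)).transpose(1, 2)    # (B, T, dim)
        idx = torch.cdist(z.flatten(0, 1), self.codebook.weight).argmin(-1)
        q = self.codebook(idx).view_as(z)                       # quantized targets
        c = self.context(z)                                     # context representations
        return c, q  # a contrastive loss compares c at masked steps against q

model = TinyWav2Vec()
c, q = model(torch.randn(2, 16000))   # one second of fake 16 kHz audio
print(c.shape, q.shape)
```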

How Is XLSR Pre-trained?

XLSR is pre-trained on 56k hours of speech in 53 languages drawn from datasets such as BABEL (conversational telephone data), CommonVoice (a corpus of read speech), and Multilingual LibriSpeech (MLS). When pre-training on L languages, multilingual batches are formed by sampling speech from a multinomial distribution over the languages. The model is trained by solving a contrastive task over masked latent speech representations while learning a quantization of the latent features shared across languages. The researchers randomly sample time steps with probability 0.065 to be starting indices and mask the subsequent ten time steps. The objective requires identifying the true quantized latent for a masked time step among K = 100 distractor latents sampled from other masked time steps. A codebook diversity penalty augments this objective to encourage the model to use all codebook entries.
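The two sampling steps above are easy to sketch. Below, the languages for a batch are drawn from a multinomial over per-language data sizes (the smoothing exponent alpha and the hour counts are illustrative assumptions), and mask spans are chosen by picking starting frames with probability 0.065 and masking the next ten frames.

```python
import numpy as np

rng = np.random.default_rng(0)

def language_probs(hours_per_language, alpha=0.5):
    """Multinomial over languages; alpha < 1 upsamples low-resource languages."""
    h = np.asarray(hours_per_language, dtype=float)
    p = (h / h.sum()) ** alpha
    return p / p.sum()

def sample_mask(num_frames, p_start=0.065, span=10):
    """Pick starting frames with prob. p_start and mask the next `span` frames."""
    mask = np.zeros(num_frames, dtype=bool)
    for i in np.flatnonzero(rng.random(num_frames) < p_start):
        mask[i:i + span] = True
    return mask

probs = language_probs([50000, 3000, 200, 40])         # toy hours per language
batch_langs = rng.choice(len(probs), size=8, p=probs)  # languages for one batch
mask = sample_mask(num_frames=400)
print(batch_langs, round(mask.mean(), 3))
```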

How Is XLSR-53 Fine-Tuned?

To fine-tune the model, a classifier representing the respective downstream task’s output vocabulary is added on top of the model and trained on labeled data with a Connectionist Temporal Classification (CTC) loss. The weights of the feature encoder are not updated during fine-tuning. For CommonVoice, the researchers fine-tuned for 20k updates, and for BABEL for 50k updates, on 2 GPUs for the Base model (XLSR-10) and 4 GPUs for the Large model (XLSR-53).
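A minimal sketch of this fine-tuning step, assuming toy shapes and a random tensor as a stand-in for the transformer outputs: a linear classifier over the downstream vocabulary (plus a CTC blank) is trained with PyTorch's CTC loss.

```python
import torch
import torch.nn as nn

vocab_size, dim, T, B = 32, 256, 200, 4            # toy sizes
classifier = nn.Linear(dim, vocab_size + 1)        # +1 for the CTC blank token
ctc = nn.CTCLoss(blank=vocab_size, zero_infinity=True)

context = torch.randn(B, T, dim)                   # stand-in for transformer outputs
log_probs = classifier(context).log_softmax(-1).transpose(0, 1)   # (T, B, V+1)

targets = torch.randint(0, vocab_size, (B, 20))    # toy transcriptions
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 20, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # in practice this updates the classifier and transformer,
                  # while the convolutional feature encoder stays frozen
print(loss.item())
```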

What Were The Results?

Cross-lingual pre-training significantly outperforms monolingual pre-training, thanks to the latent discrete speech representations being shared across languages, with increased sharing for related languages. As a result, the multilingual pre-trained model implicitly learns to cluster related languages.

The researchers also demonstrated that XLSR representations can be fine-tuned on multiple languages simultaneously to obtain a multilingual speech recognition system whose performance is competitive with fine-tuning a separate model for each language. The model’s learned representations also transfer well to unseen languages. Cross-lingual transfer learning improves low-resource language understanding, but there is a transfer-interference trade-off: sharing benefits low-resource languages while hurting high-resource ones.

To tinker with the code and read more, visit here.


The Statistics Ministry Adopts AI For Faster Economic Insights


India’s statistics ministry is turning from manual labor to intelligent machines to speed up economic data gathering and the insights drawn from it. India has faced allegations of fudging the numbers that indicate the country’s economic buoyancy to meet political ends. On top of that, revisions to published data and an insufficient workforce to crunch the numbers have rubbed salt into the wound.

India lags behind major Asian counterparts such as China and Japan in the timely reporting of statistics. India’s quarterly GDP data are reported with a lag of two months, whereas China reports within three weeks. Similarly, Indian employment data is typically a year behind, compared to the US and Europe, which reported weekly unemployment rates during the pandemic. The ministry therefore wants to automate the collection and analysis of economic data to better monitor financial institutions.

Also Read: MeITY And AWS Sets Up Quantum Computing Application Lab

Statistics Secretary Kshatrapati Shivaji said, “Because of the changing landscape, there’s a growing need for more and more data, faster data and also more refined data products. With end-to-end computerization, this type of automation will enhance the quality, credibility, and timeliness of data.” 

A $60 million program with the World Bank has pushed the ministry to build an information portal that collates real-time data. The Indian government has announced the National Policy on Official Statistics to catalyze major reforms in the Indian Statistical System, meet the increased demands on data systems, and produce evidence-based forecasting. Real-time monitoring of the economy and governance using AI will be carried out via the National Integrated Information Portal (NIIP). The institute being established within the Ministry of Statistics and Programme Implementation will provide a high-end platform for data analytics to help realize the objectives of real-time data availability and inference. The government also plans to set up data centers across India for in-house analysis of domestic economic data. Whether the statistics ministry’s AI adoption succeeds will ultimately depend on how well these plans are implemented.


DRDO Is Offering A 12-Week-Long Online Course On Artificial Intelligence


The Defence Research & Development Organisation (DRDO) is offering a 12-week online course on artificial intelligence as part of the training and certification programme of its Defence Institute of Advanced Technology (DIAT). Unlike many other online courses, this program will have 120 contact hours: two hours a day, five days a week. However, not everyone can apply; learners must clear a test to qualify and enroll in the course.

The test will cover modular mathematics, statistics, probability theory, basics of algorithms, data structures, databases, and knowledge of any programming language. Although the entrance test is free, enrolling in the course costs ₹17,700 plus 18% GST.

After completing the course, you will be awarded a certificate from DIAT to showcase your expertise in artificial intelligence and machine learning. The program is designed for graduates from any discipline, but final-year students can apply too.

Focused on both fundamentals and advanced topics, the course framework includes probability, pattern recognition, machine learning, deep learning, computer vision, augmented reality, and natural language processing.

Registration for the test is open from today until 15 February 2021, and the entrance test will be held on 20 February. The paid course itself will start on 28 February 2021.

DIAT has also invited applications for its cybersecurity course, which has a program structure similar to the AI and ML course, but only learners who know C/C++/Java or any other OOP language, plus a scripting language such as PHP, Python, Ruby, or Perl, are eligible. This course includes cybersecurity essentials, forensics and incident response, system/driver programming and OS internals, reverse engineering and malware analysis, and more.

Devised by experts from DIAT, DRDO, and the Ministry of Defence, the course is a strong option for beginners who want to learn from some of the best in the industry.


IBM And Daimler Simulate Materials With Fewer Qubits


Researchers from IBM, Daimler AG, and Virginia Tech have simulated materials with fewer qubits. Their work was published in the October issue of the Royal Society of Chemistry’s journal Physical Chemistry Chemical Physics and featured as a 2020 “hot topic”.

“Nature isn’t classical, dammit, and if you want to make a simulation of nature, you’d better make it quantum mechanical, and by golly it’s a wonderful problem, because it doesn’t look so easy.” – Richard P. Feynman

The quote from the great physicist reminds us of the limitations of classical computers in simulating nature. Simulating intrinsically quantum mechanical systems such as molecules is exactly the kind of problem where quantum computers should outperform classical ones.

Also Read: MeITY And AWS Sets Up Quantum Computing Application Lab

Quantum computers can, in principle, predict the properties of molecules with precision on par with actual lab experiments. Doing so involves accurately modeling a compound’s molecules and the particles that make them up in order to simulate how they react in many different environments. There are an enormous number of molecular combinations to test before the right one is found, which requires large numbers of qubits and quantum operations.

In practice, there are two approaches to simulating with fewer quantum resources. One is to perform classically intractable calculations on the quantum computer, followed by classical post-processing to correct for the basis-set errors that come from using fewer qubits. The other is to reduce the quantum resources required for more accurate calculations, namely the number of qubits and quantum gates. The researchers adopted the latter approach.

The information needed to describe how negatively charged electrons repel each other usually does not fit on existing quantum computers because it requires too much extra computation. The researchers therefore incorporated the electrons’ behavior directly into a transcorrelated Hamiltonian. The result was an increase in the simulation’s accuracy without the need for more qubits.

Daimler, an IBM research partner, has invested heavily in designing better batteries for the electric vehicles that will occupy tomorrow’s roads. The company wants the capacity to search for new materials that can lead to higher-performing, longer-lasting, and less expensive batteries. Daimler therefore intends to simulate more and more orbitals to reproduce the results of actual experiments, since better modeling and simulation will ultimately make it possible to predict new materials with specific properties of interest.

Read more about how IBM simulated materials here.


Brain Storage Scheme Can Solve Artificial Networks’ Memory Woes


Neuroscientists recently showed that the brain’s storage scheme can store more information than artificial neural networks. The paper, from neuroscientists at SISSA in collaboration with the Kavli Institute for Systems Neuroscience and Norway’s Centre for Neural Computation, was featured in the prestigious Physical Review Letters.

The basic units of neural networks are neurons, which learn patterns by fine-tuning the connections among them. The stronger the connections, the smaller the chance of overlooking a pattern. Neural networks use the backpropagation algorithm to iteratively tune and optimize these connections during the training phase. In the end, the neurons recognize patterns through the mapping function they have approximated, i.e., the network’s memory.
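As a toy illustration of this connection-tuning, and not of the paper's model, the snippet below adjusts a single set of connection weights by gradient descent until a neuron reproduces a stored pattern's label.

```python
import numpy as np

rng = np.random.default_rng(1)
pattern_in = rng.choice([0.0, 1.0], size=16)    # an input pattern to store
target = 1.0                                    # label to associate with it

w = np.zeros(16)                                # connection strengths
for _ in range(200):
    y = 1.0 / (1.0 + np.exp(-(w @ pattern_in))) # neuron's response
    grad = (y - target) * pattern_in            # error signal propagated back
    w -= 0.5 * grad                             # strengthen/weaken connections
print(y)                                        # approaches the stored label
```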

This procedure works well in a static setting where no new data is being ingested. In a continual setting, where models learn new patterns across diverse tasks over extended periods, as humans do, neural networks suffer from catastrophic forgetting. So there must be something else that makes the brain much more powerful and efficient.

Also Read: VOneNets – Computer Vision meets Primate Vision

The answer lies in the brain’s more straightforward approach: the links between neurons determine how the pattern changes. Scientists had thought that this simpler process would permit fewer memories, based on the fundamental assumption that neurons are binary units. But the researchers showed that the lower capacity is a consequence of that unrealistic assumption. They combined the brain’s scheme for changing the connections with biologically plausible models of single-neuron responses, and found that the hybrid performs on par with, and even beyond, AI algorithms.

The researchers pinpointed the role of introducing errors in the performance boost. Usually, when a memory is retrieved correctly, it is identical to the original input to be memorized, or strongly correlated with it. The brain’s storage scheme, by contrast, retrieves memories that are not identical to the initial input.

Neurons that are barely active during memory retrieval, and that do not distinguish among the different memories stored within the same network, are silenced. The freed neural resources are focused on the neurons that matter for the input to be memorized, leading to a higher memory capacity. These findings are expected to feed into continual learning and multitask learning, producing more robust neural models that can handle catastrophic forgetting.


Microsoft’s FELICIA – A New Mechanism To Deal With Private Medical Data


FEderated LearnIng with a CentralIzed Adversary (FELICIA) — a federated generative mechanism enabling collaborative learning — has been recently proposed by researchers from Microsoft and the University of British Columbia to train models from private medical data.

What is the problem?

AI researchers have long called for easier access to medical data from varied sources to better train medical diagnosis models, such as those for disease detection and biomedical segmentation. Images from a single source are biased by its demographics, medical equipment types, and acquisition process, and would skew any model’s performance towards the source population. The model would then perform poorly for other populations.

Medical data owners, such as hospitals and research centers, therefore share their medical images to access differently sourced data and cut their data curation costs. They mostly use the additional data to counter the bias arising from their own limited data while keeping the source data private from others. But legal constraints complicate access to large external medical datasets: current legislation prevents datasets from being shared and processed outside the source in order to avoid privacy breaches. By limiting the data diversity available for diagnostics, the very laws that safeguard patients’ privacy can endanger their lives through less powerful AI models.

Also Read: OpenMined, In Collaboration With PyTorch, Introduces A Free Course Of “The Privacy AI Series”

What is the solution? Why are GANs involved? 

To set the data imbalance right, the researchers generate synthetic medical data, using Generative Adversarial Network (GAN) architectures to train models. GANs consist of two neural networks, adversaries, competing against each other. One network, the generator, produces fake data that is as realistic as possible; the other, the discriminator, distinguishes the fake data from the real data in a mixed input of generated and real samples. In this zero-sum game, both try harder and harder to beat each other, and the result is a generator that can produce fake data ever closer to the real thing.
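For readers unfamiliar with the setup, here is a minimal, generic GAN training loop on toy two-dimensional data. It illustrates only the generator-versus-discriminator game; it is not FELICIA or the paper's architecture.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))   # generator
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(500):
    real = torch.randn(64, 2) * 0.5 + 2.0         # stand-in for real samples
    fake = G(torch.randn(64, 8))                  # generator's fake samples

    # Discriminator: tell real (label 1) from generated (label 0) data.
    d_loss = (bce(D(real), torch.ones(64, 1))
              + bce(D(fake.detach()), torch.zeros(64, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: produce samples the discriminator scores as real.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

print(G(torch.randn(4, 8)))   # after training, samples land near the real cluster
```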

What does the best solution look like?

The best solution is to build upon the PrivGAN architecture, which works locally on a single dataset and generates synthetic images; another group of researchers has shown that PrivGAN can also be used in a federated learning setting. PrivGAN was designed to protect against membership inference attacks, i.e., noticeable patterns in outputs that leak training data. This robustness against training-data leakage makes PrivGAN the candidate on which Microsoft’s FELICIA builds while honoring medical data privacy constraints.

What is the best way to implement the solution?

Microsoft’s FELICIA extends any GAN to a federated learning setting using a centralized adversary: a central discriminator with limited access to shared data. How much data is shared with the central discriminator depends on factors such as the use case, regulation, business-value protection, and infrastructure. To test the mechanism, the researchers used multiple copies of the same discriminator and generator architectures of a ‘base GAN’ inside FELICIA. The central privacy discriminator (DP) is kept identical to the other discriminators except for the activation of its final layer. First, the base GANs are trained individually on the whole training data to generate realistic images. Then FELICIA’s parameters, jointly optimized with the base GANs’ parameters, are tuned to obtain realistic synthetic samples.

FELICIA’s federated loss function optimizes, in equal measure, the local utility on local data and the global utility on all users’ data. This means successive synthetic images have to improve on the previous ones at both the local and the global level. The hyperparameter λ, which balances participation in the global loss optimization, improves utility, in contrast to the original PrivGAN loss.
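The sketch below shows this overall arrangement under loud assumptions: the exact FELICIA and PrivGAN objectives differ in their details, and how much real data the central discriminator may see is a deployment choice (here it sees local batches for simplicity). What the sketch captures is the structure of per-site generator/discriminator pairs plus a centralized adversary whose feedback is weighted by λ.

```python
import torch
import torch.nn as nn

def mlp(n_in, n_out):
    return nn.Sequential(nn.Linear(n_in, 32), nn.ReLU(), nn.Linear(32, n_out))

SITES, LAM = 2, 0.5                      # LAM balances local vs. global feedback
G = [mlp(8, 2) for _ in range(SITES)]    # one generator per data owner
D = [mlp(2, 1) for _ in range(SITES)]    # one local discriminator per owner
D_central = mlp(2, 1)                    # centralized adversary
bce = nn.BCEWithLogitsLoss()

opt_g = [torch.optim.Adam(g.parameters(), lr=1e-3) for g in G]
opt_d = [torch.optim.Adam(d.parameters(), lr=1e-3) for d in D]
opt_c = torch.optim.Adam(D_central.parameters(), lr=1e-3)

# Toy stand-ins for each owner's private images (never pooled directly).
local_data = [torch.randn(256, 2) + i for i in range(SITES)]

for step in range(500):
    for i in range(SITES):
        real = local_data[i][torch.randint(0, 256, (64,))]
        fake = G[i](torch.randn(64, 8))

        # Local and central discriminators both learn to spot site i's fakes.
        d_loss = (bce(D[i](real), torch.ones(64, 1))
                  + bce(D[i](fake.detach()), torch.zeros(64, 1)))
        opt_d[i].zero_grad()
        d_loss.backward()
        opt_d[i].step()

        c_loss = (bce(D_central(real), torch.ones(64, 1))
                  + bce(D_central(fake.detach()), torch.zeros(64, 1)))
        opt_c.zero_grad()
        c_loss.backward()
        opt_c.step()

        # Generator i: local utility plus lambda-weighted global (central) utility.
        g_loss = (bce(D[i](fake), torch.ones(64, 1))
                  + LAM * bce(D_central(fake), torch.ones(64, 1)))
        opt_g[i].zero_grad()
        g_loss.backward()
        opt_g[i].step()
```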

Did Microsoft’s FELICIA work?

Yes. FELICIA’s images are clearer and more diverse than those of other GANs, and it generates synthetic images with more utility than is possible with access to local images alone. The improved utility suggests that the samples cover more of the input space than those of the local GANs.

Across multiple experiments, combining FELICIA with real data achieved performance on par with real data alone, and most results significantly improved utility even in the worst cases. The improvement is particularly significant when the data is most biased: the more biased a dataset is, the more its synthetic data benefits in utility. Excluding the first 10,000 epochs, a FELICIA-augmented dataset is almost always better than what is achieved with real images.

These results show that Microsoft’s FELICIA allows the owner of a rich medical image set to help the owner of a small and biased image set improve its utility while never sharing any image. Different data owners (e.g., hospitals) could now help each other by creating joint or disjoint synthetic datasets that carry more utility than any single dataset alone. Such a synthetic dataset could be freely shared within a local hospital while the real images stay secured and available only to a limited number of individuals. This arrangement produces powerful models trained on data shared among research groups while maintaining confidentiality, since a data owner can generate high-quality, high-utility synthetic images while providing no access to its own data.


Google Cloud Introduces Skills Challenge, Offering Free Training On AI, Data Analytics & More


Google Cloud has introduced the Skills Challenge, which offers free training to learners through its Qwiklabs hands-on labs. The initial four tracks in the skills challenge are getting started, machine learning and artificial intelligence, data analytics, and hybrid and multi-cloud. Learners can also earn skill badges to demonstrate their expertise in the latest technologies on social and professional media platforms.

With the Google Cloud Skills Challenge, learners can master skills to create machine learning models, deploy virtual machines, run applications on Kubernetes, manage cloud resources, use machine learning APIs, set up and configure cloud environments, and more.

One of the biggest trends of 2021 is acquiring skills in the latest technologies, such as AI, ML, and software development on the cloud. The pandemic has made cloud computing skills a necessity for aspirants as well as practitioners.

Companies have been relying heavily on cloud computing since the pandemic caused by COVID-19 began. As more professionals work from home, the cloud has become the go-to platform for data, AI, and software development tasks.

The Google Cloud Skills Challenge is an opportunity for learners to gain the most in-demand skills for free. The last date to register is January 31, 2021. Since each challenge can be completed in one month, Google Cloud is offering the free training for 30 days.

Over the last few months, Google Cloud has provided several opportunities for learners to master new skills; late last year, it offered free training through its Qwiklabs platform. As one of the top three cloud providers, Google Cloud skills can help technology enthusiasts advance their careers.

Register for the Skills Challenge here.


BERT-Based Language Models Are Deep N-Gram Models


Researchers from Adobe and Auburn University have pointed out that current BERT-based language models are effectively deep n-gram models, because they largely ignore word order when making decisions.

In natural language, word order is highly constrained by many linguistic factors, including syntactic structure, subcategorization, and discourse. Arranging words in the correct order is considered a critical problem in language modeling. Earlier, statistical models like n-grams were used for primitive Natural Language Understanding (NLU) tasks, such as sentiment analysis and sentence completion. But those models have many problems: they are ineffective at preserving long-term dependencies, lose context, suffer from sparsity, and cannot produce convincing long sentences in the correct word order. Thanks to attention modules, the feats achieved by language models like Microsoft’s DeBERTa, GPT-3, and Google’s Switch Transformer made us believe that the word-order problem was solved for good.

Sadly, the researchers found that these language models rely heavily on words, not their order, to make decisions. The root cause is the self-attention matrices that explicitly map word correspondences between two input sentences regardless of the order of those words. To demonstrate, the researchers used a pre-trained language model, RoBERTa, that achieves 91.12% accuracy on the Quora Question Pairs dataset by correctly labeling pairs of Quora questions as “duplicate.” Using that model, they show the following effect of shuffled words (a code sketch of the probe follows the list):

  • With the words not shuffled, the model correctly labels the question pair “duplicate.”

  • With all words in question Q2 shuffled at random, the model’s predictions remain almost unchanged.

  • The model thus incorrectly labels a real sentence paired with its shuffled version as “duplicate.”
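Here is a rough sketch of that shuffling probe. The checkpoint name is a placeholder for any RoBERTa-style model fine-tuned on Quora Question Pairs; the point is only to compare the model's output before and after scrambling Q2.

```python
import random
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "your-org/roberta-base-finetuned-qqp"     # placeholder checkpoint name
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

q1 = "How can I improve my coding skills?"
q2 = "What is the best way to get better at programming?"
words = q2.rstrip("?").split()
q2_shuffled = " ".join(random.sample(words, len(words))) + "?"

def duplicate_probs(a, b):
    """Class probabilities (not duplicate / duplicate) for a question pair."""
    inputs = tok(a, b, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.softmax(-1)

print(duplicate_probs(q1, q2))            # intact word order
print(duplicate_probs(q1, q2_shuffled))   # often nearly identical despite shuffling
```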

BERT-based language models, which use transformer architectures with a bidirectional encoder to learn representations, are good at exploiting superficial cues such as the sentiment of keywords and the word-wise similarity between sequence-pair inputs, and they use these hints to make “correct” decisions even when tokens are in random order. The researchers tested BERT-, RoBERTa-, and ALBERT-based models on six GLUE binary classification tasks, where the task was to classify whether the words in an input sentence were intact. Any human, or any model claimed to surpass humans, would be expected to choose the “reject” option when asked to classify a sentence whose words have been randomly shuffled.

Their experiments showed that 75% to 90% of the accurate predictions of BERT-derived classifiers trained on GLUE tasks (five out of the six) remain unchanged even after the input words are shuffled. In other words, 65% of the ground-truth labels of those five GLUE tasks can still be predicted when the words of one sentence in each example are shuffled. The behavior persists even in BERT embeddings that are famously contextual.

They also showed that, in the sentiment analysis task (SST-2), a single salient word can predict an entire sentence’s label more than 60% of the time. One can therefore safely assume that the models rely heavily on a few keywords to classify a complete sentence.

They found that models trained on sequence-pair GLUE tasks use a set of self-attention heads to find similar tokens shared between the two inputs. For instance, in at least 50% of correct predictions, QNLI models rely on a set of 15 specific self-attention heads to find similar words shared between questions and answers, regardless of word order.

Modifying the training regime of RoBERTa-based models to be more sensitive to word order improves performance on SQuAD 2.0, most GLUE tasks (except SST-2), and out-of-sample data. These findings suggest that existing, high-performing NLU models have a naive understanding of text and readily misbehave on out-of-distribution inputs. They behave like n-gram models, since each word’s contribution to downstream tasks remains intact even after the word’s context is lost. Finally, as far as benchmarking of language models is concerned, many GLUE tasks no longer appear challenging enough to test whether a machine understands a sentence’s meaning.
