
AV-MAP: From Short Video To The Entire Floor Plan Using ML

AV-MAP predicts the entire floor plan from short videos.

Researchers from Facebook AI and partner universities have developed AV-MAP, a framework that infers the general layout of a house's rooms from a short video clip. The framework predicts the house's whole structure with 66% accuracy from a clip covering merely 25% of its floor plan.

AV-MAP stands out from current methods, which mostly need extensive camera movement to map a floor using video from cameras and 3D sensors. These methods ignore the audio in the videos, which provides complementary information about distant free space and rooms beyond the camera's reach: an echo in the hall, a dishwasher's humming, a shower running in the bathroom, and more. Hence, current methods cannot predict beyond the visual field captured in the video.

The team from Facebook AI, Carnegie Mellon University, and the University of Texas designed AV-MAP so that it does not need movement to capture the house's layout. The basic intuition was to use sound alongside the video input. Sound is inherently driven by geometry: reflections reveal the distances between rooms, and identifying meaningful sounds from activities or objects coming from different directions reveals plausible room layouts. For instance, television sounds from the left and utensil sounds from the right suggest a drawing room on the left and a kitchen on the right.

Also Read: Computer Vision Has A New DeiT By Facebook

AV-MAP uses a novel multimodal encoder-decoder framework that jointly learns audio-visual features to reconstruct a floor plan from a given short video clip. The framework consists of three components: top-down feature extraction, feature alignment, and a sequence encoder-decoder architecture. The feature extractor, a modified ResNet, obtains top-down, floor-plan-aligned features for each modality (ambisonic audio and RGB) independently at each time-step.

AV-MAP’s internal structure.

These extracted features are mapped to a common coordinate frame using the relative motion of the camera. At the encoder, the entire feature sequence undergoes pixelwise self-attention operations and convolutions. Lastly, the two modalities are fused at the decoder via a series of self-attention and convolution layers. The AV-MAP model then predicts the interior structure of the environment and the associated rooms' semantic labels, like bathroom, kitchen, and more.
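The high-level flow described above can be sketched in a few dozen lines. Below is a minimal PyTorch sketch with toy layer sizes, a stand-in feature extractor, and the camera-motion alignment step omitted; it illustrates the encoder-decoder structure, not the authors' actual implementation.

```python
# A minimal sketch of AV-MAP's multimodal encoder-decoder, with assumed shapes.
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Per-modality top-down feature extractor (stand-in for the modified ResNet)."""
    def __init__(self, in_ch, feat_ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
    def forward(self, x):          # x: (T, C, H, W) per-timestep inputs
        return self.net(x)         # (T, feat_ch, H, W) top-down features

class AVMapSketch(nn.Module):
    def __init__(self, feat_ch=32, n_rooms=10):
        super().__init__()
        self.rgb_enc = ModalityEncoder(3, feat_ch)    # RGB frames
        self.audio_enc = ModalityEncoder(4, feat_ch)  # 4-channel ambisonics, as spectrogram "images"
        self.attn = nn.MultiheadAttention(feat_ch, num_heads=4, batch_first=True)
        self.decoder = nn.Sequential(
            nn.Conv2d(2 * feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, 1 + n_rooms, 1),  # interior mask + room-label logits
        )

    def forward(self, rgb, audio):
        f_rgb, f_aud = self.rgb_enc(rgb), self.audio_enc(audio)
        # (Registration to a common coordinate frame via camera motion is omitted.)
        T, C, H, W = f_rgb.shape
        def temporal_attend(f):
            # Pixelwise self-attention across the time sequence, one sequence per pixel.
            seq = f.permute(2, 3, 0, 1).reshape(H * W, T, C)
            out, _ = self.attn(seq, seq, seq)
            return out.mean(dim=1).reshape(H, W, C).permute(2, 0, 1)  # aggregate over time
        fused = torch.cat([temporal_attend(f_rgb), temporal_attend(f_aud)], dim=0)
        return self.decoder(fused.unsqueeze(0))  # (1, 1 + n_rooms, H, W)

# Toy usage: 5 timesteps of 64x64 top-down features.
model = AVMapSketch()
pred = model(torch.randn(5, 3, 64, 64), torch.randn(5, 4, 64, 64))
print(pred.shape)  # torch.Size([1, 11, 64, 64])
```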

The team created two experimental settings (active and passive) to test the framework using the Matterport3D and SoundSpaces datasets, which contain 3D-modeled houses hosted in Facebook's AI Habitat. In the active setting, a virtual camera emits a known sound while moving through the rooms of a model home; in the passive setting, the model relies on sounds made by objects and people inside the house. Overall, the researchers found that AV-MAP offers an 8% improvement in floor-plan accuracy over the state-of-the-art approach.

Read about the framework here.


Amazon Web Services (AWS) Is Hosting A Free AI Conclave

AWS AI Conclave

AWS will host a free two-day online AI conclave on 28 and 29 January. The conclave is power-packed with 20+ breakout sessions by Amazon and industry experts. Participants will receive training on building, training, and deploying sophisticated models with AWS framework solutions in the cloud and at the edge.

The AWS AI Conclave will feature prominent personalities like Swami Sivasubramanian, Vice President, AWS; Stefano Soatto, Director of Applied Science, AWS; and Rajeev Rastogi, Vice President, Machine Learning, Amazon India. The speakers will cover topics ranging from strategies and frameworks to real-world applications like anomaly detection, real-time personalization and recommendation, fraud detection, automation in pharmacovigilance (drug safety), and more.

The event has two editions, business and technical, based on participants' stage in the AI and ML adoption journey. While the business edition is earmarked for business and technology leaders, offering operational insights on the AWS ecosystem, the technical edition caters to everyone from beginners to experienced machine learning and data science practitioners. You can register for the technical edition here.

Also Read: Optical Chips Pave The Way For Faster Machine Learning

Participants will develop the skills to create new insights and make more informed predictions that translate into operational efficiencies and productivity gains. They will also learn proven best practices for applying AI in organizations from thought leaders across fields, like Amitabh Kant, CEO at NITI Aayog; Puneet Chandok, President, Commercial Business at AWS India and South Asia; and Eitan Medina, Chief Business Officer at Habana Labs.

At present, AWS holds the dominant position in cloud services and infrastructure, with a 33% market share. Hence, companies that use the AWS stack are always on the lookout for candidates who understand how to build smart, customer-centric, scalable solutions in the cloud and at the edge using Amazon's broad set of machine learning and AI services. The conclave also gives participants a platform to network with industry peers and foster new collaborations.


Microsoft Is Hosting A Free Virtual Workshop On Reinforcement Learning Day, Providing Job Opportunities At Its Research Labs

Reinforcement Learning Day 2021

Microsoft Research will observe Reinforcement Learning Day on 14 January 2021. On this day, Microsoft will host a free virtual workshop featuring prominent scientists such as Yoshua Bengio (one of the godfathers of deep learning), John Langford, and many others, bringing the research communities together to learn from each other and build on the latest knowledge.

Reinforcement learning studies how natural and artificial systems learn to make decisions in complex environments based on external, and possibly delayed, feedback. The topic amalgamates ideas from computer science, cognitive science, mathematics, economics, control theory, and neuroscience.

This virtual workshop will feature multidisciplinary talks spanning theory to practice, and will provide a common platform for researchers from industry and academia alike. The aim is to highlight emerging research opportunities for the reinforcement learning community, particularly those driven by the evolving need for robust decision-making in practical applications. Speakers will cover applications of reinforcement learning in recommender systems, robotics, healthcare, education, conversational AI, gaming, finance, neuroscience, and manufacturing.

Also Read: Microsoft’s DeBERTa Surpasses Humans On Natural Language Understanding

The workshop agenda has been chosen keeping in mind the latest developments in the field, like Hierarchical Reinforcement Learning, Active Imitation Learning with Noisy Guidance, META-Q-Learning, Fundamental Limits of Imitation Learning, and more.

Microsoft Research had earlier called for papers to be featured in a virtual poster session showcasing recent and ongoing research in all areas of reinforcement learning, such as deep RL, RL theory, bandit algorithms, and multi-agent RL. The poster session will run in parallel with the main workshop. Since top minds in the RL field will attend the workshop, the organisers have also created a job board listing research-based openings at Microsoft's labs around the world.

Register for Microsoft's Reinforcement Learning Day 2021 free virtual workshop here.


Optical Chips Pave The Way For Faster Machine Learning

Optical Chips

Silicon transistors, the basic unit of silicon processors, cannot be shrunk further without running into quantum-mechanical effects. Consequently, current silicon-based processors have hit their performance limits, spurring a quest for new architectures that can replace silicon chips with optical chips.

A team of researchers led by Prof. Wolfram Pernice from the Institute of Physics and the Center for Soft Nanoscience at the University of Münster has developed such an optical chip. It processes data in parallel at speeds of 50 to 100 GHz, whereas new-age graphics cards or specialized hardware like Google's TPU usually work in the low-GHz range.

These photonic chips achieve this breakneck speed thanks to a combination of vital structural components:

  • Frequency combs – Provide various optical wavelengths that are processed independently of one another on the same photonic chip.
  • Phase-change materials (PCMs) – Energy-efficient storage elements also used in optical data storage like DVDs. In the new processor, they store and preserve the matrix elements without the need for an energy supply.

The chip-based frequency combs are combined with phase-change materials to carry out matrix multiplications on multiple data sets in parallel via wavelength multiplexing, computing on all wavelengths simultaneously without extra energy supply. This combination permits data rates and computing densities (operations per unit of processor area) never previously attained in such an energy-efficient manner.
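Numerically, wavelength multiplexing amounts to batching: each comb wavelength carries its own input vector through the same stored weight matrix. A toy NumPy analogy (not a physics simulation) of that parallelism:

```python
# Each comb wavelength carries an independent input vector, so one pass
# through the PCM weight matrix performs many matrix-vector products at once.
import numpy as np

n_wavelengths = 16                         # channels provided by the frequency comb
weights = np.random.rand(8, 8)             # matrix "stored" in phase-change cells
inputs = np.random.rand(n_wavelengths, 8)  # one input vector per wavelength

# All wavelengths traverse the same photonic matrix simultaneously;
# numerically this is a single batched matrix multiplication.
outputs = inputs @ weights.T
print(outputs.shape)  # (16, 8): 16 matrix-vector products in one "pass"
```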

To test the optical chip, the researchers ran a convolutional neural network to recognize handwritten digits, since the convolution between input data and one or more filters transfers very well to the matrix architecture. As a result, the many matrix multiplications that dominate the workload each complete in just one timestep.
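The reason convolutions "transfer very well" to a matrix architecture is the standard im2col trick: unrolling image patches turns a convolution into one matrix multiplication, which the photonic crossbar computes in a single pass. A small NumPy sketch with assumed shapes:

```python
# Rewriting a 2-D convolution as a single matrix multiplication (im2col,
# valid padding, stride 1; shapes are illustrative).
import numpy as np

def im2col(img, k):
    h, w = img.shape
    cols = [img[i:i + k, j:j + k].ravel()
            for i in range(h - k + 1) for j in range(w - k + 1)]
    return np.stack(cols)                      # (n_patches, k*k)

img = np.random.rand(28, 28)                   # e.g. a handwritten digit
kernels = np.random.rand(4, 3, 3)              # 4 filters of size 3x3
patches = im2col(img, 3)                       # (676, 9)
out = patches @ kernels.reshape(4, -1).T       # one matmul = 4 convolutions
print(out.reshape(26, 26, 4).shape)            # feature maps (26, 26, 4)
```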

Optical chips promise to serve applications that require fast computation over large data volumes in energy-constrained environments: huge amounts of data can be processed simultaneously, at much higher speed and lower energy cost than previously possible.

Deeper neural networks that allow more accurate forecasts and more precise data analysis become possible because of the massive speed-up in matrix operations. These optical chips can also power the evaluation of large quantities of medical data, such as high-resolution 3D imaging data, enabling faster diagnosis.

Even self-driving vehicles, which depend on fast evaluation of sensor data, can use these optical chips for speedier inference. IT infrastructures such as cloud computing, which provide storage space, computing power, and application software, can likewise enhance their throughput and profitability.


Microsoft’s DeBERTa Surpasses Humans On Natural Language Understanding

Microsoft DeBERTa

Microsoft's DeBERTa (Decoding-enhanced BERT with disentangled attention) has surpassed the human baseline for natural language understanding on the SuperGLUE benchmark. While the human baseline stands at a macro-average score of 89.8, the DeBERTa ensemble scored 90.3. The model used only half as much training data as RoBERTa.

Microsoft's DeBERTa uses a Transformer architecture (48 transformer layers) with 1.5 billion parameters. Like other transformer-based language models, DeBERTa is pre-trained on considerably large datasets to learn universal language representations. The model is then fine-tuned on various downstream NLU tasks like Choice Of Plausible Alternatives (COPA), Multi-Sentence Reading Comprehension (MultiRC), Recognizing Textual Entailment (RTE), Word-in-Context (WiC), the Winograd Schema Challenge (WSC), and reading comprehension with commonsense reasoning.

The Microsoft researchers used three techniques to build upon RoBERTa:

  1. A disentangled attention mechanism (contextual and positional information is represented by two separate vectors for each word)
  2. An enhanced mask decoder (training comprises predicting the correct word for a masked position)
  3. A virtual adversarial training method (used during fine-tuning to learn more robust word representations)

Encoding context and position separately lets the model correctly assign attention weights between words, accounting for their dependencies. The enhanced mask decoder forces the model to predict masked words while accounting for surrounding words and their absolute positions, which is critical for handling syntactic nuances.
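For intuition, here is a simplified single-head sketch of disentangled attention, following the published formulation (content-to-content, content-to-position, and position-to-content terms over a shared relative-position table); the dimensions and random projections are toy values, not DeBERTa's actual configuration:

```python
# Disentangled attention: content and relative-position information get
# separate vectors, and their cross terms contribute to the attention score.
import torch
import torch.nn.functional as F

T, d, R = 6, 16, 4                        # tokens, hidden size, max relative distance
Hc = torch.randn(T, d)                    # content vectors (one per token)
P  = torch.randn(2 * R, d)                # shared relative-position embeddings
Wqc, Wkc, Wqr, Wkr = (torch.randn(d, d) for _ in range(4))

qc, kc = Hc @ Wqc, Hc @ Wkc               # content projections
qr, kr = P @ Wqr, P @ Wkr                 # position projections

# delta(i, j): relative-distance bucket of key j with respect to query i.
rel = torch.clamp(torch.arange(T)[:, None] - torch.arange(T)[None, :], -R, R - 1) + R

c2c = qc @ kc.T                           # content-to-content
c2p = (qc @ kr.T).gather(1, rel)          # content-to-position
p2c = (kc @ qr.T).gather(1, rel).T        # position-to-content
attn = F.softmax((c2c + c2p + p2c) / (3 * d) ** 0.5, dim=-1)
out = attn @ (Hc @ torch.randn(d, d))     # attention-weighted values
print(out.shape)                          # torch.Size([6, 16])
```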

Also Read: IISc Invites Applications For Its New Deep Learning Specialisation

For models as big as DeBERTa, improving generalization to adversarial examples is a challenge. Therefore, the researchers at Microsoft developed the Scale-Invariant Fine-Tuning (SiFT) method: adversarial perturbations are applied to the normalized word embeddings, and the model is regularized to produce the same output for an example as it did before the perturbation was added.
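A hedged sketch of the SiFT idea follows: perturb the normalized embeddings and penalize any change in the output distribution. Note that real SiFT computes the perturbation adversarially via gradients; random noise stands in here, and the model, epsilon, and shapes are illustrative assumptions.

```python
# Sketch of SiFT-style regularization: perturb *normalized* embeddings and
# keep the output distribution unchanged.
import torch
import torch.nn.functional as F

def sift_regularizer(model, embeddings, eps=1e-3):
    # Scale-invariance: perturbations are applied after normalization,
    # so their size is comparable across embedding magnitudes.
    normed = F.layer_norm(embeddings, embeddings.shape[-1:])
    with torch.no_grad():
        clean_logits = model(normed)
    # Real SiFT finds this perturbation via gradients; random noise stands in.
    noise = eps * F.normalize(torch.randn_like(normed), dim=-1)
    perturbed_logits = model(normed + noise)
    # KL divergence between clean and perturbed predictions.
    return F.kl_div(F.log_softmax(perturbed_logits, dim=-1),
                    F.softmax(clean_logits, dim=-1), reduction="batchmean")

# Toy usage with a linear "model" over 5 classes.
model = torch.nn.Linear(16, 5)
emb = torch.randn(8, 16)
print(sift_regularizer(model, emb).item())
```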

Microsoft's DeBERTa will become part of the Microsoft Turing natural language representation model (Turing NLRv4). These models will serve Microsoft products like Bing, Office, Dynamics, and Azure Cognitive Services, powering chatbots, recommendation, question answering, search, personal assistance, customer support automation, and content generation.


Uber AI Says You Can Increase Task Completion If You Are Polite With Virtual Agents

Uber AI

Uber AI researchers have published an overview of a deep learning framework that addresses customer engagement with 'polite and positive' assistants. Task-oriented conversational agents like Alexa, Siri, and Google Assistant fulfill tasks like booking cabs by conversing with users and retrieving information (current location, destination, and type of cab) from them.

As users, we prefer to engage with such virtual entities when they generate appropriate interpersonal responses and build an emotional connection with us. Consequently, Uber's AI team looked for ways to make assistants use appropriate social language (language interpreted differently in different social contexts). Based on an analysis of conversations between drivers and human agents, they examined how customer service representatives' use of social language relates to drivers' responsiveness and the completion of their first trip.

Interestingly, the scientists defined politeness as a strategy for avoiding awkwardness or embarrassment when the social distance between two parties is large. They trained an SVM classifier on a corpus of domain-independent lexical and syntactic features annotated with politeness labels.

Similarly, positivity, defined as "the quality or state of being positive," was evaluated with VADER, a rule-based sentiment analyzer. They found that social language norms like politeness and positivity, when used by human agents, are associated with greater user responsiveness and task completion.
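VADER is an off-the-shelf library, so the positivity half of this pipeline is easy to reproduce; the politeness SVM below is a toy stand-in trained on made-up examples, not Uber's corpus of lexical and syntactic politeness features.

```python
# Positivity via VADER (a real library) and a toy politeness SVM.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

positivity = SentimentIntensityAnalyzer()
print(positivity.polarity_scores("Happy to help, please let me know!"))
# -> {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}

# Hypothetical politeness classifier on simple lexical features.
texts = ["Could you please check this?", "Fix it now.",
         "Would you mind resending the form?", "Send it again."]
labels = [1, 0, 1, 0]                      # 1 = polite, 0 = not polite
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["Please could you retry?"]))  # likely [1]
```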

The paper proposes, for the first time, an end-to-end deep learning framework for task-oriented dialogue generation that jointly understands input utterances and generates output utterances infused with social language aimed at completing a particular task. By taking the conversation context and completed tasks into account, the model handles the varying meanings of positivity and politeness across contexts and over time, generating language with both the desired content and the desired social language norms.

A seq2seq model, built on an architecture inspired by Huber et al., was modified with an additional social-language understanding layer. The politeness and positivity features are extracted from responses using the pre-trained classifiers, SVM and VADER.

The model was evaluated on content preservation and social language level using both human judgment and automatic linguistic measures. The evaluation found that the model can generate responses that let agents address users' issues in a more socially appropriate way.


IISc Invites Applications For Its New Deep Learning Specialisation

Deep Learning Specialisation IISc

The Indian Institute of Science (IISc), currently ranked the best university in India, has announced a master's-level deep learning specialisation in a tie-up with TalentSprint. Faculty members from IISc and TalentSprint will jointly teach and mentor participants.

The Centre for Continuing Education wing of IISc offers this 10-month executive program, structured for machine learning enthusiasts and industry practitioners alike. The aim is to train the workforce in deep learning and develop applications in domains like text, video, speech, image, and more.

The program consists of live online faculty-led interactive sessions, curated capstone projects and hackathons, mentorship, case studies, and a campus visit. The current syllabus comprises:

  • Bridge Course (Programming and Mathematical Preliminaries) – 12 hrs
  • Mathematical Foundations and Data Visualization – 44 hrs
  • Paradigms of Machine Learning – 16 hrs
  • Deep Learning and its Applications – 80 hrs
  • Deploying AI Systems – 8 hrs

“Deep learning is increasingly being used to extract valuable insights from enormous amounts of data, build innovative products and improve customer experience, thereby enhancing revenue opportunities. This has led to massive growth in the need for professionals with expertise in deep learning. This program will fulfill that need. Our team of research faculty will teach and mentor participants and help them build expertise in both the fundamentals and applications of deep learning,” said Prof Chiranjib Bhattacharya, chair of the Department of Computer Science and Automation and dean of the advanced deep learning program at IISc.

The program also offers hands-on projects like brain tumor detection, fraud detection, expression identification, and more. At the end of the program, participants will possess a portfolio that demonstrates their mastery, and they will have connected with experts at the forefront of deep learning practice. For entrepreneurs, the program has provisions to boost startup ideas with professional mentorship.

Enrolments have begun for the first batch (of 50 or more participants), and classes will start in March 2021. The program expects some coding proficiency and sets a bachelor's degree with one year of work experience as the minimum eligibility requirement.


Facebook Releases Code Of Its State-Of-The-Art Voice Separation Model

Facebook Voice Separation Model

Facebook researchers have open-sourced the code for their work "Voice Separation with an Unknown Number of Multiple Speakers." Suppose there is only one mic and multiple people are talking simultaneously. Can you separate the voices? For a human, it is easy; but how does a machine do it?

The paper answers this question for single-microphone, multiple-speaker voice separation, extending the state of the art from two to five simultaneous speakers. In the past, the task was mostly addressed with Independent Component Analysis; with the recent advances in deep learning, it is now possible to separate mixed audio containing multiple unseen speakers.

The main contributions, as listed by the authors, are:

  1. A novel audio separation model that employs a specific RNN architecture,
  2. a set of losses for effective training of voice separation networks,
  3. a method for effective model selection in the context of voice separation with an unknown number of speakers, and
  4. results that show a sizable improvement over the current state of the art in an active and competitive domain.

Also Read: Computer Vision Has A New DeiT By Facebook

Previous methods were trained using masks for each voice, but this paper introduces a novel mask-free approach. Voice separation innately contains two subtasks: first, improving signal quality while screening out noise; second, identifying the speaker to maintain continuity in each voice sequence.

The authors used an utterance-level permutation invariant training (uPIT) loss for the first subtask and, for the second, the L2 distance between the network embeddings of the predicted audio channel and those of the corresponding source.
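The permutation-invariant idea is easy to state in code: try every assignment of predicted channels to ground-truth sources and keep the best-scoring one. Below is a minimal sketch using MSE for brevity (the paper's actual objectives are SI-SNR-style and embedding-based):

```python
# Permutation-invariant loss: score every channel-to-source assignment
# and keep the best.
from itertools import permutations
import torch

def pit_loss(preds, targets):
    """preds, targets: (n_speakers, n_samples) waveforms."""
    n = preds.shape[0]
    losses = []
    for perm in permutations(range(n)):
        # Mean squared error under this channel-to-source assignment.
        losses.append(torch.mean((preds[list(perm)] - targets) ** 2))
    return torch.min(torch.stack(losses))

preds = torch.randn(3, 16000)     # 3 predicted channels, 1 s at 16 kHz
targets = torch.randn(3, 16000)   # 3 ground-truth sources
print(pit_loss(preds, targets))
```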

To avoid biases arising from the distribution of the data, and to promote solutions in which the separation models are not detached from the selection process, model selection was based on a voice activity detection algorithm.

Starting from the model trained on the dataset with the largest number of speakers, C, speech detectors are applied to each output channel. If silence (no activity) is detected in any channel, the method moves to the model with C − 1 output channels and repeats the process until all output channels contain speech.
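In pseudocode terms, the selection procedure is a simple loop from the largest model downward. The sketch below assumes hypothetical `separate` and `is_silent` helpers; only the control flow mirrors the description above.

```python
# Model selection by voice activity: descend from the C-speaker model until
# every output channel contains speech.
import numpy as np

def separate(audio, c):
    # Stand-in for the c-speaker separation model: returns c channels.
    return np.random.randn(c, audio.shape[-1])

def is_silent(channel, thresh=1e-3):
    # Toy energy-based activity detector.
    return np.mean(channel ** 2) < thresh

def select_and_separate(audio, max_speakers):
    for c in range(max_speakers, 1, -1):       # start from the largest model, C
        channels = separate(audio, c)
        if not any(is_silent(ch) for ch in channels):
            return channels                    # every channel contains speech
    return separate(audio, 2)                  # smallest model as fallback

out = select_and_separate(np.random.randn(16000), max_speakers=5)
print(out.shape)
```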


Intel’s RealSense ID Now Offers Facial Recognition With Higher Accuracy

Intel RealSense

Intel is now offering its RealSense technology for facial recognition under the banner of RealSense ID. Complemented by LiDAR and infrared sensors, the RealSense 3D cameras are the industry's new game-changer, especially after Amazon's Rekognition fiasco. Customers get fast facial recognition without the usual failure modes around racial bias, lighting conditions, facial changes, or user height.

Intel claims that RealSense ID has an unprecedented true-acceptance rate (recognizing you as you) of 99.7%, with a one-in-1,000,000 chance of false acceptance. The spoofing rate (recognizing a recorded photo of you as you) stands at less than 1%. The reported timing per facial recognition is 1.5 s to sense a presence plus 0.8 s for facial authentication, so you do not have to wait in line for verification.

To address privacy concerns, captured images are stored on the device, and data is encrypted at every level using AES-256. A neural network at the base of the facial recognition system assigns each enrolled face an ID, and all further communication uses that designated ID without revealing any visual information.
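The described flow, in which a face becomes an opaque ID that downstream systems consume, might look roughly like the following purely hypothetical sketch (the embedding size, matching threshold, and ID scheme are all invented for illustration):

```python
# Hypothetical on-device flow: faces become embeddings, matched locally;
# only an opaque ID ever leaves the device.
import hashlib
import numpy as np

enrolled = {}  # opaque ID -> stored face embedding (kept encrypted on-device)

def enroll(embedding):
    user_id = hashlib.sha256(embedding.tobytes()).hexdigest()[:12]
    enrolled[user_id] = embedding
    return user_id

def authenticate(embedding, thresh=0.9):
    for user_id, ref in enrolled.items():
        cos = ref @ embedding / (np.linalg.norm(ref) * np.linalg.norm(embedding))
        if cos > thresh:
            return user_id        # downstream systems see only this ID
    return None

uid = enroll(np.random.rand(128))  # 128-d face embedding (assumed size)
print(authenticate(enrolled[uid]))
```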

Also Read: Intel India Launches An AI Research Center, INAI

Intel is currently offering two builds: the Intel RealSense ID Solution F455, a ready-to-deploy model, and the F450, which provides a custom solution for specialized use-cases. The company expects the technology to be used for verification at security checkpoints, ATMs, smart locks, kiosks, and point-of-sale systems. The reported numbers sound safe, but at the scale of use-cases like airports and ATM networks, which run into millions of authentications, the accuracy level becomes a challenge: even a one-in-1,000,000 false acceptance rate can cause security concerns.

The codebase behind the technology was open-sourced long ago, ensuring there are no corporate or government backdoors. However, neural systems are prone to adversarial inputs, so the adversarial security of these networks creates additional room for blunders.

Integration of Intel's RealSense with Windows Hello remains an open issue, so the models cannot yet be used with laptops or desktops for authentication. Still, Intel is trying hard to salvage its RealSense technology, which had been lying defunct until now.


OpenAI’s DALL·E Can Create Images From Text

OpenAI DALL·E

OpenAI has created a multimodal generative neural model, DALL·E, that creates images from text prompts. The network is a 12-billion-parameter version of GPT-3 trained to generate images from text descriptions using a dataset of text-image pairs.

Building on their experience with GPT-3 and ImageGPT, the researchers show that manipulating visual concepts through language is now within reach: language can instruct a large neural network to perform various text generation tasks and to generate high-fidelity images.

The samples released for each caption are the top 32 of 512 images after reranking with CLIP. This procedure can be seen as a kind of language-guided search, and it dramatically improves sample quality.

The demos released by the researchers showcase images of imaginary objects with both modified and preserved attributes. The model understands the three-dimensional visualization of items, with their internal and external structure, and can infer contextual details independently. It even shows zero-shot visual reasoning, geographic knowledge, and temporal knowledge.

The network architecture is a simple decoder-only transformer. In essence, it is a language model that receives both the text and the image as a single stream of up to 1280 tokens and is trained with maximum likelihood to generate all of the tokens one after another.

An attention mask at each of its 64 self-attention layers allows every image token to attend to all text tokens. OpenAI's DALL·E uses the standard causal mask for the text tokens, and sparse attention for the image tokens with a row, column, or convolutional attention pattern, depending on the layer. The researchers will soon publish a paper detailing other elements of DALL·E.
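The single-stream setup can be made concrete with a small mask-construction sketch: text tokens get a standard causal mask, while image positions may attend to all text plus earlier image tokens. Dense causal attention stands in here for DALL·E's sparse row/column/convolutional patterns, and the token counts and vocabulary size are toy values taken from or invented around the description above.

```python
# Building the combined text+image token stream and its attention mask.
import torch

n_text, n_image = 256, 1024                   # 1280 tokens total
T = n_text + n_image
stream = torch.randint(0, 8192, (T,))         # text + image tokens (toy vocab size)

mask = torch.zeros(T, T, dtype=torch.bool)
# Text tokens: standard causal mask (attend to earlier text only).
mask[:n_text, :n_text] = torch.tril(torch.ones(n_text, n_text, dtype=torch.bool))
# Image tokens: attend to ALL text tokens plus earlier image tokens
# (dense causal here; DALL·E uses sparse row/column/convolutional patterns).
mask[n_text:, :n_text] = True
mask[n_text:, n_text:] = torch.tril(torch.ones(n_image, n_image, dtype=torch.bool))
print(mask.shape, mask.sum().item())          # (1280, 1280) and the allowed positions
```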

There are always trade-offs involved. Here are some granular points the researchers highlighted:

  1. The phrasing of the caption determines the success rate.
  2. DALL·E confuses the associations between objects and their colors when more items are present.
  3. DALL·E can draw multiple copies of an object but cannot reliably count past three.

You can try creating images from text here.
