Large-scale training data has driven much of the recent progress in computer vision and natural language processing (NLP). Models such as CLIP, DALL-E, GPT-3, and Flamingo are pre-trained on massive task-agnostic data, which yields remarkable performance on downstream tasks, including zero-shot and few-shot settings. Embodied AI simulators have also been gaining attention, strengthened by support for physics, manipulators, object states, deformable objects, fluids, and real-sim counterparts. However, scaling them up to tens of thousands of scenes is challenging. To address this, researchers at the Allen Institute for AI developed ProcTHOR, a framework for the procedural generation of embodied AI environments. The name is short for procedural-THOR, where THOR stands for The House Of Interactions.
What is Embodied AI?
Embodied AI is AI that controls a physical thing, such as a robot or an autonomous vehicle. It is an interdisciplinary field combining natural language processing, reinforcement learning, computer vision, physics-based simulation, navigation, and robotics. The approach aims to give machines a relationship between mind and body similar to human embodiment, that is, to how our mind and body react together to complex movements and situations. Embodied AI starts with embodied agents, such as virtual robots and egocentric assistants, trained in a realistic 3D simulation environment. Much of embodied AI is built on reinforcement learning, a type of machine learning in which an agent learns to choose the actions that maximize its reward in a given situation. Researchers in embodied AI are trying to move away from purely algorithm-led approaches and instead understand how biological systems work, then derive principles of intelligent behavior that can be applied to artificial systems.
The embodiment hypothesis dates back to 2005, when Linda Smith proposed that intelligence emerges from an agent's interaction with its environment and is a result of sensorimotor activity. Even though the initial hypothesis was centered on psychology and cognitive science, the recent growth in embodied intelligence research comes from computer vision. While embodied AI appears to have great potential, so far it has benefited only a few manufacturers and startups. Some researchers believe embodied AI can be combined with existing internet of things (IoT) devices to make life-saving decisions on the spot within milliseconds.
What is ProcTHOR?
ProcTHOR is a machine learning framework, based on AI2-THOR, for the procedural generation of embodied AI environments. AI2-THOR is an open-source interactive environment containing four types of scenes for embodied AI. The ProcTHOR framework can procedurally construct entire interactive, physics-enabled environments for embodied AI research. The PRIOR team at the Allen Institute for AI developed it and described it in the research paper ‘ProcTHOR: Large-Scale Embodied AI Using Procedural Generation.’ ProcTHOR aims to train robots within a virtual environment and then apply what they learn in real life.
ProcTHOR allows random sampling of large datasets of diverse, interactive, customizable, and performant virtual environments to train and evaluate embodied agents. For example, given a room specification, say a 3BHK house (three bedrooms, a hall, and a kitchen), ProcTHOR can generate many different floor plans that meet the requirement. The environments in ProcTHOR are fully interactive and support navigation, object manipulation, and multi-agent interaction.
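The idea of sampling many houses from a single room specification can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not ProcTHOR's actual generator: the names `ROOM_SPEC_3BHK` and `sample_floor_plan`, the room dimensions, and the reading of "3bhk" as three bedrooms, a living room, and a kitchen are all hypothetical.

```python
import random

# Hypothetical room specification: the room types the house must contain.
# "3bhk" is read here as three bedrooms, one living room ("hall"), one kitchen.
ROOM_SPEC_3BHK = ["Bedroom", "Bedroom", "Bedroom", "LivingRoom", "Kitchen"]


def sample_floor_plan(room_spec, seed, min_side=3.0, max_side=6.0):
    """Sample one toy floor plan: a width and depth in metres for each room."""
    rng = random.Random(seed)  # seeded, so the same seed reproduces the same house
    plan = []
    for room_type in room_spec:
        width = round(rng.uniform(min_side, max_side), 1)
        depth = round(rng.uniform(min_side, max_side), 1)
        plan.append({"type": room_type, "width": width, "depth": depth})
    return plan


# Different seeds give different houses that all satisfy the same specification.
plans = [sample_floor_plan(ROOM_SPEC_3BHK, seed) for seed in range(3)]
for seed, plan in enumerate(plans):
    total_area = sum(room["width"] * room["depth"] for room in plan)
    print(f"seed {seed}: {len(plan)} rooms, {total_area:.1f} m^2")
```

Seeding the generator is the design choice that matters here: a house can be regenerated from its specification and seed instead of being stored in full.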
This framework is a state-of-the-art application of machine learning that extends AI2-THOR, inheriting its large asset library, robotic agents, and precise physics simulation. Pre-training with ProcTHOR improves downstream performance and yields strong zero-shot performance, meaning the agent performs well on a benchmark without being fine-tuned on that benchmark's training data. Zero-shot learning is a significant technique in machine learning in which a model handles classes or tasks for which it has seen no labeled examples during training. ProcTHOR has five key characteristics:
- Diversity: One can create several varieties of rich environments with ProcTHOR. The framework provides many options for every embodied AI task, like the diversity of floor plans, assets, materials, object placements, and lighting.
- Interactivity: The property of interacting with objects in the environment is fundamental to embodied AI tasks. ProcTHOR has agents with arms for manipulation of objects.
- Customizability: ProcTHOR gives users the complete power of customization from rooms to material and lighting specifications.
- Scale: ProcTHOR provides 16 different scene specifications and 18 semantic asset groups. Together these allow a practically unlimited number of asset and scene combinations for seeding the generation process, so houses can be generated at whatever scale a task requires.
- Efficiency: ProcTHOR represents each scene as a JSON file and loads it into AI2-THOR at runtime, which keeps the memory overhead of storing houses very low. Furthermore, scene generation is automatic and fast, and ProcTHOR delivers high framerates for training embodied AI models.
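The efficiency point above, storing a house as JSON and instantiating it at run time, can be illustrated with a minimal sketch. The house dictionary and its field names (`rooms`, `objects`, `asset`, and so on) are illustrative assumptions and do not reproduce the real ProcTHOR schema, which is considerably richer.

```python
import json

# Hypothetical, heavily simplified house description. The field names are
# assumptions made for this example, not ProcTHOR's actual schema.
house = {
    "rooms": [
        {"id": "room-1", "type": "Kitchen", "floor_material": "wood"},
        {"id": "room-2", "type": "Bedroom", "floor_material": "carpet"},
    ],
    "objects": [
        {"id": "fridge-1", "asset": "Fridge", "room": "room-1"},
        {"id": "bed-1", "asset": "Bed", "room": "room-2"},
    ],
}

# Serialising to text keeps a stored house small: in this sketch only names
# and parameters are saved, not the geometry the names refer to.
serialized = json.dumps(house)
print(f"{len(serialized)} bytes of JSON")

# At run time a simulator would parse the text and build the scene from it.
loaded = json.loads(serialized)
objects_by_room = {}
for obj in loaded["objects"]:
    objects_by_room.setdefault(obj["room"], []).append(obj["asset"])
print(objects_by_room)
```

Because the description round-trips through plain text, thousands of houses can sit on disk cheaply and be loaded one at a time during training.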
ProcTHOR-10K is a dataset of 10,000 fully interactive houses sampled by ProcTHOR's procedural generation process. In addition, it contains 1,000 validation and 1,000 test houses for evaluation. The assets are split across train, validation, and test, totaling 1,633 unique assets across 108 asset types.
Two properties are essential for large-scale training in an embodied AI simulator:
- Scene statistics: The houses in ProcTHOR-10K are generated from 16 different room specifications. Varying the room specification changes the distribution of house sizes and complexity. ProcTHOR covers a broader spectrum of scenes than other embodied AI environments, including AI2-iTHOR, RoboTHOR, Gibson, and HM3D.
- Rendering speed: High rendering speed is essential for large-scale training because training algorithms need millions of iterations to converge. The GPU experiments record how many processes were distributed across the GPUs: 15 processes in the 1-GPU experiment and 120 processes in the 8-GPU experiment. A comparison between ProcTHOR, iTHOR, and RoboTHOR concluded that ProcTHOR provides higher framerates and renders fast enough to train large models in a reasonable amount of time.
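The process counts reported for the GPU experiments (15 on 1 GPU, 120 on 8 GPUs) both work out to 15 processes per GPU. The sketch below shows one way such a distribution could be laid out. The function name `assign_processes` and the round-robin placement are assumptions made for illustration; the source states only the totals, not how processes were assigned.

```python
def assign_processes(num_gpus, processes_per_gpu=15):
    """Map each GPU index to the list of process ranks it would host."""
    assignment = {gpu: [] for gpu in range(num_gpus)}
    for rank in range(num_gpus * processes_per_gpu):
        # Round-robin placement (an assumption): rank r goes to GPU r mod num_gpus.
        assignment[rank % num_gpus].append(rank)
    return assignment


for num_gpus in (1, 8):
    assignment = assign_processes(num_gpus)
    total = sum(len(ranks) for ranks in assignment.values())
    print(f"{num_gpus} GPU(s): {total} processes")
# prints "1 GPU(s): 15 processes" then "8 GPU(s): 120 processes"
```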
Training and scalability in ProcTHOR
Earlier approaches to building embodied AI environments fall into two groups. The first demands a lot of work from 3D designers, who must create 3D elements, organize them in suitable configurations inside sizable spaces, and create the proper textures and lighting for these scenes. In the second, specialized cameras are moved through various real-world locations, and the resulting photos are pieced together to produce 3D reconstructions of the scenes. With these strategies, it was impractical to scale up scene repositories by several orders of magnitude. ProcTHOR, by contrast, can provide orders of magnitude more scenes than current simulators because it can generate an arbitrarily large collection of environments. Additionally, it supports dynamic material randomization, which randomizes the colors and materials of particular assets each time an environment is loaded into memory for training.
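Dynamic material randomization can be pictured as re-sampling each asset's material whenever a stored house is brought into memory for training. The following is a minimal sketch of that idea only: `MATERIALS`, `STORED_HOUSE`, and `load_with_random_materials` are hypothetical names, and the palettes are invented for the example.

```python
import copy
import random

# Hypothetical palettes and house; ProcTHOR's real material sets are not
# reproduced here.
MATERIALS = {
    "Sofa": ["red fabric", "grey fabric", "brown leather"],
    "Table": ["oak", "walnut", "white laminate"],
}

STORED_HOUSE = {
    "objects": [
        {"asset": "Sofa", "material": "grey fabric"},
        {"asset": "Table", "material": "oak"},
    ]
}


def load_with_random_materials(stored_house, rng):
    """Return a copy of the house with every object's material re-sampled."""
    house = copy.deepcopy(stored_house)  # the stored house is never modified
    for obj in house["objects"]:
        obj["material"] = rng.choice(MATERIALS[obj["asset"]])
    return house


rng = random.Random(0)
for episode in range(3):
    house = load_with_random_materials(STORED_HOUSE, rng)
    print(episode, [obj["material"] for obj in house["objects"]])
```

Randomizing at load time rather than at generation time means one stored layout can yield many visually distinct training environments.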
Generating training environments in ProcTHOR is a multi-stage process that includes room specification, connecting rooms, lighting, object placement, and more. The paper mentioned above demonstrates the potential of ProcTHOR with ProcTHOR-10K, a sample of 10,000 generated houses, and a simple neural network. An ablation analysis shows the advantages of scaling up from 10 to 100 to 1K and then to 10K scenes, and suggests that even more benefit could be obtained by using ProcTHOR to generate even larger sets of environments. Agents trained on ProcTHOR-10K with minimal neural architectures (no depth sensor, only RGB channels, no explicit mapping, and no human task supervision) produce state-of-the-art models for various navigation and interaction benchmarks. With no fine-tuning on the downstream benchmark, the authors also report strong zero-shot performance on these benchmarks, frequently outperforming earlier state-of-the-art systems that had access to the downstream training data. The code used in the ProcTHOR research is to be made publicly available shortly. Until then, ProcTHOR-10K is available through a Google Colab notebook.
Among frameworks for building embodied AI environments, ProcTHOR stands out for its procedural approach to generation. Furthermore, the dataset produced with ProcTHOR enables the training of simulated embodied agents in more diverse environments.