DALL·E, cleverly named after the artist Salvador Dalí and Pixar’s WALL·E, is a trained neural network from OpenAI, implemented in PyTorch, that creates images from user-input text captions.
According to OpenAI’s page introducing DALL·E:
“DALL·E is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text-image pairs. We’ve [DALL·E developers] found that it has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images.”
- OpenAI’s DALL·E Article
The captions are not just limited to a single word, such as “dog” or “cat.” With this network, the user can input a phrase such as “an illustration of a baby daikon radish in a tutu walking a dog” and get just that — an illustration of a baby daikon radish in a tutu walking a dog.
Not only does DALL·E work with text prompts, it also works with combined image and text prompts. The image below shows DALL·E attempting to replicate an image of a cat as a sketch after receiving an image and a text prompt based on that image.
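Under the hood, DALL·E treats text and image as one token stream: the caption is encoded as text tokens, the image as a grid of discrete codebook tokens, and a transformer predicts the whole sequence autoregressively. Below is a minimal plain-Python sketch of that sequence layout; the token ids and sizes are made-up stand-ins for illustration (the real model uses BPE text tokens and a discrete VAE codebook, with a much longer sequence).

```python
# Toy illustration of DALL·E-style joint text+image token sequences.
# All ids and sizes below are arbitrary stand-ins, not the real vocab.

TEXT_SEQ_LEN = 8   # the real model allots up to 256 text token slots
IMAGE_GRID = 4     # the real model uses a 32x32 grid of image tokens

def build_sequence(text_tokens, image_tokens):
    """Concatenate padded text tokens with flattened image tokens."""
    padded = text_tokens + [0] * (TEXT_SEQ_LEN - len(text_tokens))
    return padded + image_tokens

def next_token_targets(sequence):
    """Autoregressive training pairs: predict token i from tokens before it."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]

caption = [5, 17, 42]                            # hypothetical caption tokens
image = list(range(100, 100 + IMAGE_GRID ** 2))  # 16 hypothetical codebook ids
seq = build_sequence(caption, image)

print(len(seq))                    # 8 text slots + 16 image tokens = 24
print(next_token_targets(seq)[0])  # ([5], 17)
```

Generating an image then amounts to sampling the image-token positions one at a time, conditioned on the caption tokens, and decoding the resulting codebook grid back into pixels.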
Resources to get started with DALL·E:
Jraph is a graph neural network library developed by DeepMind and one of the newest members of the JAX family, a machine learning framework developed by Google Researchers. The developers of Jraph describe it as follows:
“Jraph (pronounced "giraffe") is a lightweight library for working with graph neural networks in jax. It provides a data structure for graphs, a set of utilities for working with graphs, and a 'zoo' of forkable graph neural network models.”
- Jraph GitHub Repository
Jraph takes inspiration from TensorFlow’s graph_nets library when defining its GraphsTuple data structure, a named tuple that contains one or more directed graphs.
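To make the GraphsTuple idea concrete, here is a minimal plain-Python sketch of a directed graph stored in that flat style: node features, edge features, and sender/receiver index arrays. Jraph’s real GraphsTuple also carries globals, n_node, and n_edge fields and stores JAX arrays rather than lists.

```python
from typing import NamedTuple

class MiniGraphsTuple(NamedTuple):
    """Simplified stand-in for jraph's GraphsTuple (which also has
    globals, n_node, and n_edge, and holds JAX arrays)."""
    nodes: list      # one feature per node
    edges: list      # one feature per edge
    senders: list    # edge i goes from node senders[i] ...
    receivers: list  # ... to node receivers[i]

# A 3-node directed graph with edges 0 -> 1, 0 -> 2, 1 -> 2
g = MiniGraphsTuple(
    nodes=[1.0, 2.0, 3.0],
    edges=[0.5, 0.5, 1.0],
    senders=[0, 0, 1],
    receivers=[1, 2, 2],
)

def aggregate_incoming(graph):
    """Sum each edge's feature into its receiver node - the basic
    message-passing step that graph networks are built from."""
    incoming = [0.0] * len(graph.nodes)
    for feature, r in zip(graph.edges, graph.receivers):
        incoming[r] += feature
    return incoming

print(aggregate_incoming(g))  # [0.0, 0.5, 1.5]
```

Storing edges as parallel sender/receiver index lists (rather than adjacency matrices) keeps memory proportional to the number of edges and makes batching many graphs into one big GraphsTuple straightforward.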
Resources to get started with Jraph:
Petting Zoo is a Python library for multi-agent reinforcement learning research created by Justin Terry, currently a Ph.D. student at the University of Maryland. It is similar to OpenAI’s Gym library, but instead of focusing on a single agent, it focuses on training multiple agents. If you have experience with Gym, or even if you don’t, and want to try your hand at training multiple agents, give Petting Zoo a try.
Petting Zoo offers six families of learning environments to train and test your agents, including:
For more information, check out the links below, as well as this interview with Justin hosted on Synthetic Intelligence Forum’s YouTube channel, for a deeper dive into Petting Zoo and how it can potentially help with your projects.
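The core pattern in Petting Zoo’s turn-based API is an agent-iteration loop: the environment yields whichever agent acts next, and your code supplies that agent’s action. The sketch below imitates that loop with a toy stdlib environment; the class and its methods are hypothetical stand-ins, not Petting Zoo’s actual classes.

```python
import random

class ToyMultiAgentEnv:
    """Toy stand-in for a turn-based multi-agent environment, where
    agents act one at a time rather than simultaneously."""

    def __init__(self, agents, max_steps=6):
        self.agents = agents
        self.max_steps = max_steps
        self.steps = 0

    def agent_iter(self):
        """Yield the agent whose turn it is, round-robin, until done."""
        while self.steps < self.max_steps:
            yield self.agents[self.steps % len(self.agents)]

    def step(self, action):
        """Apply the current agent's action and advance the turn."""
        self.steps += 1

env = ToyMultiAgentEnv(agents=["player_0", "player_1"])
turns = []
for agent in env.agent_iter():
    turns.append(agent)
    env.step(action=random.choice([0, 1]))  # each agent picks its own action

print(turns)  # alternates: player_0, player_1, player_0, ...
```

The contrast with single-agent Gym is that the loop variable is the *agent*, not just the timestep, so policies for different agents can be swapped in and out independently.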
Resources to get started with Petting Zoo:
Vision Transformer (ViT)
The use of transformers in image-based tasks traditionally handled by models such as convolutional neural networks (CNNs) has been gaining popularity. Imagining the possibilities this could bring to the world of image analysis, Google Researchers introduced Vision Transformer (ViT), a vision model based as closely as possible on the Transformer architecture used in text-based models.
In their Google AI article presenting the new model, the authors explained ViT as the following:
“ViT represents an input image as a sequence of image patches, similar to the sequence of word embeddings used when applying Transformers to text, and directly predicts class labels for the image. ViT demonstrates excellent performance when trained on sufficient data, outperforming a comparable state-of-the-art CNN with four times fewer computational resources.”
- Google Research Scientists, ViT Article
So, how does this model work? In the same article, the researchers go on to explain just that:
“ViT divides an image into a grid of square patches. Each patch is flattened into a single vector by concatenating all pixels’ channels in a patch and then linearly projecting it to the desired input dimension. Because Transformers are agnostic to the structure of the input elements we add learnable position embeddings to each patch, which allows the model to learn about the structure of the images. A priori, ViT does not know about the relative location of patches in the image, or even that the image has a 2D structure — it must learn such relevant information from the training data and encode structural information in the position embeddings.”
- Google Research Scientists, ViT Article
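The patch step the researchers describe can be sketched in a few lines of plain Python: split an H×W×C image into a grid of P×P squares and flatten each into a single vector. This covers only the patching and flattening; the real model then linearly projects each vector to the input dimension and adds a learnable position embedding.

```python
def image_to_patches(image, patch):
    """Split an image (nested lists, shape H x W x C) into a grid of
    patch x patch squares, flattening each into one vector of
    patch * patch * C values - the sequence ViT feeds its Transformer."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            vec = []
            for i in range(top, top + patch):
                for j in range(left, left + patch):
                    vec.extend(image[i][j])  # concatenate all channels
            patches.append(vec)
    return patches

# A tiny 4x4 RGB "image" whose pixel values encode their coordinates
img = [[[i, j, 0] for j in range(4)] for i in range(4)]
patches = image_to_patches(img, patch=2)

print(len(patches))     # 4 patches in a 2x2 grid
print(len(patches[0]))  # each is 2 * 2 * 3 = 12 values
```

Note that flattening discards the 2D layout entirely, which is exactly why the position embeddings mentioned in the quote are needed: without them the model has no idea where each patch came from.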
Resources to get started with Vision Transformer (ViT):
Room-Across-Room (RxR) - Google Research Dataset
One of the latest datasets from Google Research Scientists is called “Room-Across-Room (RxR).” RxR is a multilingual dataset for Vision-and-Language Navigation (VLN) in Matterport3D environments.
The RxR research scientists describe the dataset in their article as:
“the first multilingual dataset for VLN, containing 126,069 human-annotated navigation instructions in three typologically diverse languages — English, Hindi and Telugu. Each instruction describes a path through a photorealistic simulator populated with indoor environments from the Matterport3D dataset, which includes 3D captures of homes, offices and public buildings.”
For those who are unfamiliar with Matterport3D, it is a large, diverse RGB-D dataset for scene understanding that contains 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes. Google Researchers took this large dataset and made it even bigger — 10x bigger!
As you can see in the image below, the agent using the RxR dataset moves through the different rooms; its path is depicted by the differently colored “pose traces,” which are described in further detail in the RxR article. The agent then outputs a corresponding description of what it is seeing. If you compare the text to the agent’s trajectory, the color of the text matches the colored path in the image.
On top of introducing this new dataset, Google Researchers also announced the RxR Challenge to keep track of progress in VLN. The RxR Challenge is “a competition that encourages the machine learning community to train and evaluate their instruction following agents on RxR instructions.” If you’re interested in learning more or participating in the competition, visit the link in the resources below.
Resources to get started with RxR:
Full Paper: On Generating Extended Summaries of Long Documents
Do you wish there was a way to help summarize your latest paper or document instead of going back and forth trying to decide what is most important to include? Give ExtendedSumm a try.
Researchers Sajad Sotudeh, Arman Cohan, and Nazli Goharian from the IR Lab at Georgetown University created a method that expands on previous research into generating high-level summaries of short documents. Their method generates more in-depth, extended summaries for longer documents such as research papers, legal documents, and books.
They describe their method as one that “aims at jointly learning to predict sentence importance and its corresponding section,” and summarize the methodology in the abstract of their publication:
“Our method exploits hierarchical structure of the documents and incorporates it into an extractive summarization model through a multi-task learning approach. We then present our results on three long summarization datasets, arXiv-Long, PubMed-Long, and Longsumm.”
The study showed that their method matched or exceeded the BertSumExt baseline across summaries of varying lengths.
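To give a feel for the extractive setting their model operates in, here is a toy sketch of the generic selection step: score each sentence, keep the top-scoring ones, and report them with their section labels. This is *not* the authors’ model, which learns both the importance scores and the section predictions with a multi-task neural network; the scores and sentences below are hand-made for illustration.

```python
def extract_summary(doc, top_k=2):
    """Toy extractive selection: keep the top_k highest-scoring
    sentences, returned in document order with their section labels.
    (ExtendedSumm learns these scores jointly with section prediction;
    here they are hard-coded.)"""
    ranked = sorted(range(len(doc)), key=lambda i: doc[i]["score"], reverse=True)
    chosen = sorted(ranked[:top_k])  # restore document order
    return [(doc[i]["section"], doc[i]["sentence"]) for i in chosen]

paper = [
    {"section": "intro",   "sentence": "We study long-document summarization.",     "score": 0.9},
    {"section": "intro",   "sentence": "Prior work targets short documents.",       "score": 0.4},
    {"section": "method",  "sentence": "We jointly predict importance and section.", "score": 0.8},
    {"section": "results", "sentence": "We match or beat the baseline.",            "score": 0.6},
]

print(extract_summary(paper, top_k=2))
```

Tracking the section alongside each sentence is what lets an extended summary cover a long document’s whole structure, rather than pulling everything from the abstract and introduction.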
Datasets used for this model:
Resources to get started with ExtendedSumm: