Meta has just unveiled the basics of data2vec, a system that aims to revolutionize AI with an approach that is both more generalist and more autonomous than ever.
Like all humans, you have an impressive capacity for abstraction. Present the same information to several people in different forms (text, image, and video, for example) and each of them will reach more or less the same conclusion. The observation seems obvious, but it is in fact a genuine feat of biological computation.
This flexibility is one of the defining aspects of human intelligence. Conversely, the rigidity of machines remains a significant barrier that still keeps this technology out of sectors where it could work wonders. But thanks to Meta AI's research, we are probably not far from that point.
Indeed, the firm has announced data2vec, a program presented as the "first high-performance self-supervised algorithm that works with multiple modalities." In practice, data2vec works simultaneously on speech, vision, and text to try to produce a more accurate, complete, and coherent output.
What is self-supervision?
Today, the majority of AI applications still rely on so-called "supervised" systems. These work from huge databases in which every element must be carefully annotated: in essence, the goal is to show the machine what is expected of it so that it can later extrapolate to new, unlabeled data.
For example, to train an AI to recognize animals, you would need to collect thousands of images and start by telling the AI that image A shows a dog, image B a cow, and so on. As you can imagine, this process is extremely time-consuming at best. At worst, it may simply be impossible to collect enough labeled data to train an AI for the desired task.
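To make the idea concrete, here is a minimal toy sketch of what "supervised" means: every training example carries a human-provided label, and the model's only job is to extrapolate from those labeled examples to new, unlabeled inputs. The 2-D feature vectors and the nearest-neighbour "model" below are invented for illustration; a real vision system would learn from pixels.

```python
import math

# Hand-labeled training set: (feature vector, label).
# In a real system the features would come from images;
# these toy 2-D vectors stand in for them.
labeled = [
    ((0.9, 0.1), "dog"),
    ((0.8, 0.2), "dog"),
    ((0.1, 0.9), "cow"),
    ((0.2, 0.8), "cow"),
]

def classify(x):
    """1-nearest-neighbour: predict the label of the closest labeled example."""
    return min(labeled, key=lambda pair: math.dist(pair[0], x))[1]

# Extrapolating to a new, unlabeled point:
print(classify((0.85, 0.15)))  # dog
```

The point of the sketch is where the effort goes: the `labeled` list is the expensive part, because each entry had to be annotated by a human before any learning could happen.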
Self-taught experts, but too specialized
This is where self-supervised AIs come in. This is a branch of artificial intelligence research in which machines learn by themselves, directly from their environment and without any labeled data. The concept has already led to some spectacular advances; Google AI, for example, has developed a self-supervised system capable of classifying medical images with phenomenal precision.
But the concept also has its limits, one of which has very concrete implications. While humans seem able to learn in a similar way regardless of the format of the information, this is not the case for machines. Unlike us, self-supervised learning algorithms can reach wildly different conclusions when given the same information in different forms (what Meta calls "modalities"), such as text, sound, or video.
This means that self-supervised algorithms must be trained for a very specific task and are often limited to a single modality; with traditional methods, for example, it is impossible to train a text-generating AI in the same way as a text-to-speech program. The field was therefore still waiting for a holistic system, able to work both in a self-supervised way and across several modalities at once, much as human intelligence does. And that is what Meta's researchers have managed to produce with data2vec.
A generalist algorithmic mille-feuille
Here, the algorithm works in parallel on several inputs of different modalities at the same time. To do so, data2vec takes a step back and itself trains several "sub-algorithms". It can thus produce a result that is as coherent and relevant as possible from a wide range of very different information.
To illustrate the concept, imagine a fictitious working group made up of different experts, the subsystems responsible for each modality. The problem: while each of these experts is extremely competent in their own field, they understand absolutely nothing about their colleagues' work and therefore lack any global vision. What is needed is an impartial hub, able to synthesize the work of the different experts into a single result. And it is precisely this role of hub that data2vec assumes.
Specifically, the system starts by generating an abstract representation of an image, text, or audio clip; in practice, this representation corresponds to a layer of the neural network. It can be interpreted by all the other subsystems; in essence, it is more or less the group leader's briefing to the various experts.
Each subsystem can then work individually on this representation, using all the data rather than just the data that usually falls within its specialty. All of these contributions are then synthesized into a single, consistent result.
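The mechanism described above can be sketched numerically. In data2vec-style training, one pass over the full input (a "teacher" view) produces target representations, and a second pass over a partially hidden input (the "student" view) is trained to predict those same latent vectors, a regression in representation space rather than a prediction of raw pixels or words. The toy "encoder" and numbers below are invented for illustration and are far simpler than the real network.

```python
# Hedged numeric sketch of a data2vec-style objective: the student working
# on a masked view must match the teacher's latent representations of the
# full view. The toy encoder just mixes each position with its neighbours.

def encode(values, masked_positions=()):
    """Stand-in encoder: masked positions contribute nothing."""
    out = []
    for i in range(len(values)):
        v = 0.0 if i in masked_positions else values[i]
        left = values[i - 1] if i - 1 >= 0 and i - 1 not in masked_positions else 0.0
        right = values[i + 1] if i + 1 < len(values) and i + 1 not in masked_positions else 0.0
        out.append(0.5 * v + 0.25 * (left + right))
    return out

signal = [1.0, 2.0, 3.0, 4.0]
targets = encode(signal)                         # teacher: sees everything
student = encode(signal, masked_positions={2})   # student: position 2 hidden

# Training would minimise the regression loss between the two views:
loss = sum((t - s) ** 2 for t, s in zip(targets, student)) / len(targets)
print(round(loss, 3))
```

Because the prediction target is a latent representation rather than a word, a pixel, or an audio sample, the same objective can be applied unchanged to text, images, and speech, which is what makes the approach modality-agnostic.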
A revolutionary paradigm shift
Meta explains that this generalist, holistic approach can outperform single-purpose algorithms in certain critical areas, such as computer vision and speech. "Data2vec demonstrates that these self-supervised algorithms are able to work with different modalities, and even do it better than the best current algorithms," explains the press release.
With such a system, researchers will be able to free themselves from the tedious, time-consuming work of labeling, leaving them more time to work on theory and strengthen their algorithms: the promise of real scientific progress. This is therefore a major development, and the concept behind data2vec could have spectacular and very concrete repercussions in a host of fields. A case to follow!
The text of the study is available here.