Training AI Models: Lessons from Human Infants

[Xin Zhi Yuan Summary] To train AI models, Brenden Lake, a professor at New York University, had his daughter, who is under two years old, wear a camera to collect data! For comparison, Meta trained Llama 3 on a full 15 trillion tokens. If Lake can really get AI models to learn the way human infants do, from limited input, wouldn't that solve the global data shortage facing LLMs?

To train AI models, a professor at New York University actually strapped a GoPro-like camera to his daughter's head!

Although it sounds unbelievable, the professor had good reason to do it.

Training the complex neural networks behind LLMs requires massive amounts of data.

But is the current way of training LLMs really the most economical and efficient one?

Definitely not! Scientists have found that the brains of toddlers who are just learning to walk soak up information like sponges, quickly forming a coherent view of the world.

And although LLMs perform impressively, over time human children grow to be smarter and more creative than the models!

The Secret of How Children Master Language

How can we train LLMs better?

Just when scientists were at a loss, human infants showed them the way:

when it comes to acquiring language, infants are the true masters.

We all know stories like this: drop a young child into a country with a completely different language and culture, and within a few months their command of the local language may approach native level.

And large language models seem to be left behind.

First of all, they consume too much data!

Nowadays, the major companies training models have nearly drained the world's supply of data, because LLMs need astronomical amounts of text mined from the Internet and elsewhere to learn.

To make them master a language, they need to be fed tens of trillions of words.

Secondly, cramming so much data into the models does not necessarily make them learn accurately.

Many LLM outputs amount to predicting the next word with a certain accuracy, and that accuracy is becoming unsettlingly good.

In sharp contrast, children do not need anywhere near as much experience to learn to use a language fluently.

Brenden Lake, a psychologist at New York University who studies human and artificial intelligence, took note of this.

He decided to conduct an experiment with his 1-year-and-9-month-old daughter Luna.

For the past 11 months, Lake has had Luna wear a camera for an hour every week to record videos of her playing from her perspective.

Using the footage from Luna's camera, Lake hopes to train a model on the same data that children are exposed to.

Strapping a GoPro to a Toddler's Head

Although linguists and child-development experts have not reached a consensus on how children actually acquire language, Lake is confident that the key to making LLMs more efficient lies in how children learn!

So Lake launched a research project: studying the stimuli children receive as they learn their first sentences, in order to make LLM training more efficient.

To this end, Lake's team needs to collect video and audio data from 25 children across the United States.

That is what produced the scene described at the beginning of this article: they strapped GoPro-like cameras to the heads of these children, including Lake's daughter Luna.

Lake explained that their model tries to connect video clips from the child's perspective with what the caregiver is saying, in much the same way that OpenAI's CLIP model connects captions with images.

CLIP can take an image as input and, drawing on its training data of image-caption pairs, output a matching descriptive caption.

Link to the paper: https://openai.com/index/clip/
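For readers unfamiliar with CLIP, here is a minimal sketch of this style of image-caption matching, assuming the open-source "clip" package from OpenAI's CLIP GitHub repository is installed; the frame path and candidate captions are placeholders, not data from Lake's project.

```python
# A minimal sketch of CLIP-style image-caption matching, assuming the
# open-source "clip" package from the openai/CLIP repository is installed;
# the frame path and candidate captions are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# A hypothetical frame from a head-mounted camera, plus candidate captions.
image = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)
captions = ["a ball on the floor", "a cup of milk", "a toy car"]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize, then score every caption against the image by cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).softmax(dim=-1)

print("Best-matching caption:", captions[similarity.argmax().item()])
```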

In addition, Lake's team's model, trained on footage from the GoPro cameras and audio from caregivers, can take images of a scene as input and then output language describing that scene.

Moreover, the model can also go the other way, retrieving frames seen during training from a description.

At first glance it seems quite simple, right? Just get the model to match spoken language with the objects observed in video frames, the way human children do.

But in practice, many complications arise.

For example, children do not always look at the described objects or actions.

There are also more abstract situations, such as giving a child milk that is inside an opaque cup, which makes the association between word and object very loose.

So, Lake explains, the purpose of this experiment is not to prove that we can train a model to match objects in images with their corresponding words (OpenAI has already shown that).

Rather, the team wants to know whether a model can actually learn to identify objects from the sparse data available to a child (sparse to an almost unbelievable degree).

As you can see, this is the exact opposite of the approach taken by major companies such as OpenAI, Google, and Meta.

It is worth noting that Meta trained Llama 3 on 15 trillion tokens.

If Lake's team's experiment succeeds, perhaps the global LLM data shortage everyone faces can be solved, because training an LLM would no longer require so much data!

In other words, the new idea is to let AI models learn from limited input and then generalize from what they have seen.

I think our focus should not be limited to training bigger and bigger LLMs on more and more data. Yes, you can make LLMs perform amazingly that way, but it is drifting further and further from the wonders of human intelligence as we know it...

Early experiments have been successful

Early experimental results have already shown that Lake's team's approach may be correct.

In February of this year, they trained a neural network on 61 hours of video recording a toddler's experience.

The study found that the model could associate the various words and phrases spoken by the participants with the experiences captured in the video frames: whenever a word or phrase was presented, the model could recall the relevant images. The paper has been published in Science.

Link to the paper: https://www.science.org/doi/10.1126/science.adi1374
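As a rough illustration of what "recalling the relevant images" means computationally, here is a minimal sketch of nearest-neighbor retrieval in a shared embedding space; the embeddings below are random stand-ins, not outputs of the team's trained model.

```python
# A minimal sketch of "recall the relevant images" as nearest-neighbor
# retrieval in a shared embedding space. The embeddings are random stand-ins,
# not outputs of the team's trained model.
import torch
import torch.nn.functional as F

def retrieve_frames(query_emb: torch.Tensor,
                    frame_embs: torch.Tensor,
                    k: int = 5) -> torch.Tensor:
    """Return indices of the k frames closest to the query word/phrase embedding."""
    q = F.normalize(query_emb, dim=-1)
    f = F.normalize(frame_embs, dim=-1)
    sims = f @ q                  # cosine similarity of every frame to the query
    return sims.topk(k).indices

frame_embs = torch.randn(10_000, 512)   # embeddings of frames seen during training
query = torch.randn(512)                # embedding of a word such as "ball"
print(retrieve_frames(query, frame_embs))
```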

Lake said the most surprising thing was that the model could generalize the names of objects to images it had never been trained on!

Of course, the accuracy was not always great, but the model was only meant as a proof of concept.

The project is not yet complete, as the model has not learned everything a child would know.

After all, it had only about 60 hours of transcribed speech, roughly one percent of what a child experiences in two years, and the team needs more data to figure out what is learnable.

Lake also admitted that the method used in this first model has limitations:

it only analyzed video clips tied to caregiver speech, sampled at a rate of 5 frames per second, so the AI never really learned what verbs or abstract words are; it only got static slices of what the world looks like.

Because it has no knowledge of what happened before or after, nor of the context of the conversation, it is hard for it to learn what "walk," "run," and "jump" mean.
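To make the 5-frames-per-second setup concrete, here is a minimal sketch of how video frames might be paired with the caregiver utterances they co-occur with; the Utterance record, timestamps, and frame indexing are hypothetical illustrations, not the team's actual data format.

```python
# A minimal sketch of pairing caregiver utterances with video frames sampled at
# 5 frames per second. The Utterance record and timestamps are hypothetical.
from dataclasses import dataclass

FPS = 5  # frames per second: the static "slices" described above

@dataclass
class Utterance:
    text: str
    start: float  # seconds from the start of the recording
    end: float

def co_occurring_frames(utt: Utterance, fps: int = FPS) -> list[int]:
    """Indices of the video frames that overlap one utterance in time."""
    first = int(utt.start * fps)
    last = int(utt.end * fps)
    return list(range(first, last + 1))

utt = Utterance(text="look at the ball", start=12.4, end=13.6)
print(co_occurring_frames(utt))  # frames 62 through 68 at 5 fps
```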

But as the technology for modeling video matures, Lake believes the team will build more capable models.

If we can build a model that truly begins to acquire language, it will open up important applications for understanding human learning and development, perhaps helping us understand developmental disorders or how children learn language.

Ultimately, such models can also be used to test millions of different language therapies.

All in all, how do children solidify their grasp of a language through their own eyes and ears?

Let's take a closer look at the paper Lake's team published in Science.

Linking Words with Real Objects and Visual Images

How do human children go from knowing nothing about the world to acquiring knowledge? The mystery of this black box not only draws the continual exploration of education researchers, but also touches on the questions about individual intelligence that sit deep in our hearts.

In the story Symbiotic Hypothesis, the South Korean sci-fi writer Kim Cho-yeop imagines that the intelligence human children display in infancy actually carries a lost alien civilization, which has chosen to coexist with humans in this way. But its time is a brief five years: once humans grow up and form truly lasting memories, that splendid memory of childhood is wiped away.

Internet users often share stories about human infants who "forgot to drink Meng Po's soup," the broth of forgetfulness in Chinese folklore.

Mysterious childhood is a homeland we can neither clearly describe nor return to. It is like what Kim Cho-yeop wrote: "Don't go away. Don't take away that beautiful world. After I grow up, please stay with me."

How do children associate new words with specific objects or visual concepts? When they hear the word "ball," for example, how do they come to think of a bouncy, round object?

To find out, Lake's team put a head-mounted camera on a child and followed the child's development from 6 to 25 months of age, recording a 61-hour stream of visual and linguistic data.

On this curated dataset covering roughly a year and a half of the child's life (600,000 video frames and 37,500 transcribed utterances), the researchers trained Child's View for Contrastive Learning (CVCL), a model that performs contrastive learning from the child's point of view.

The model instantiates a form of cross-situational associative learning, identifying mappings between words and their possible visual referents.

It combines two neural networks, a vision encoder and a language encoder, under a contrastive objective and trains in a self-supervised way (using only the recordings from the child's perspective, with no external labels): embeddings (vectors) of video frames are contrasted with embeddings of the utterances that co-occur with them in time.
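To make this concrete, here is a minimal PyTorch sketch of a symmetric contrastive (InfoNCE-style) objective of this kind; the random tensors stand in for the outputs of the vision and language encoders, and the temperature and batch construction are illustrative rather than the exact CVCL implementation.

```python
# A minimal sketch of a symmetric contrastive (InfoNCE-style) objective that
# aligns frame embeddings with embeddings of co-occurring utterances. The
# random tensors stand in for encoder outputs; details are illustrative, not
# the exact CVCL implementation.
import torch
import torch.nn.functional as F

def contrastive_loss(frame_emb: torch.Tensor,
                     utt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of co-occurring (frame, utterance) pairs."""
    frame_emb = F.normalize(frame_emb, dim=-1)
    utt_emb = F.normalize(utt_emb, dim=-1)
    logits = frame_emb @ utt_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))         # matching pairs lie on the diagonal
    loss_f2u = F.cross_entropy(logits, targets)    # frame -> utterance direction
    loss_u2f = F.cross_entropy(logits.T, targets)  # utterance -> frame direction
    return (loss_f2u + loss_u2f) / 2

# Toy batch: 8 pairs of 512-dimensional embeddings from the two encoders.
frames = torch.randn(8, 512)
utterances = torch.randn(8, 512)
print(contrastive_loss(frames, utterances).item())
```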

Of course, this dataset, named SAYCam-S, is limited: it captures only about 1% of the child's waking hours and misses much of their experience.

Nevertheless, CVCL can still learn powerful multimodal representations from this limited slice of a child's experience!

The team demonstrated that the model acquired many of the word-referent mappings present in the child's everyday experience, could generalize to new visual referents it had never seen before, and aligned its visual and linguistic concept systems.

Evaluating the Learned Word-Meaning Mappings

Specifically, after training was completed, the team evaluated the quality of the word-referent mappings learned by CVCL and various alternative models.

The results show that CVCL's classification accuracy is 61.6%.

For 11 of the 22 concepts, CVCL's performance is within 5% of CLIP's, even though CLIP's training data is orders of magnitude larger (400 million image-text pairs from the web).

The results show that many of a child's earliest word-referent mappings can be acquired from as few as 10 to 100 naturally occurring word-referent pairs.
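As a rough illustration of how such word-referent mappings can be scored, here is a minimal sketch of an n-way classification evaluation by cosine similarity; the random embeddings and the 4-candidate setup are placeholders, not the paper's exact protocol.

```python
# A minimal sketch of an n-way word-referent evaluation by cosine similarity:
# given a word embedding, pick the matching image among a few candidates.
# Random embeddings and the 4-candidate setup are placeholders.
import torch
import torch.nn.functional as F

def n_way_accuracy(word_embs: torch.Tensor,
                   candidate_img_embs: torch.Tensor,
                   target_idx: torch.Tensor) -> float:
    """word_embs: (T, D); candidate_img_embs: (T, N, D); target_idx: (T,)."""
    w = F.normalize(word_embs, dim=-1).unsqueeze(1)   # (T, 1, D)
    c = F.normalize(candidate_img_embs, dim=-1)       # (T, N, D)
    sims = (w * c).sum(dim=-1)                        # (T, N) cosine similarities
    return (sims.argmax(dim=-1) == target_idx).float().mean().item()

T, N, D = 100, 4, 512                 # 100 trials, 4 candidate images each
words = torch.randn(T, D)
candidates = torch.randn(T, N, D)
targets = torch.randint(0, N, (T,))
print(n_way_accuracy(words, candidates, targets))  # about 0.25 for random vectors
```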

Generalizing to New Visual Exemplars

In addition, the researchers evaluated whether the words learned by CVCL could be extended to out-of-distribution visual stimuli.

On this test, CVCL also demonstrates some understanding of these visual concepts, with an overall accuracy of 34.7%.

Admittedly, the task involves a larger concept set and poses the additional challenge of out-of-distribution generalization.

The left side shows two randomly selected training cases and the right side shows four test cases; the percentage below each indicates the model's accuracy in recognizing the image. The cases shown, from left to right, are the two highest scores, the median, and the lowest. When a test case is more similar in color and shape to the training cases, the model's recognition accuracy is also higher.

Good Multi-Modal Consistency

Finally, the researchers tested the consistency of the visual and language concept systems of CVCL.

For example, both the visual embedding and the word embedding of "car" are closer to those of "road" than to those of "ball," indicating that the multimodal alignment works well.
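Here is a minimal sketch of one way to probe this kind of alignment, comparing the centroid of a concept's frame embeddings with different word embeddings via cosine similarity; all embeddings are random placeholders rather than CVCL outputs.

```python
# A minimal sketch of probing visual-language alignment: compare the centroid
# of a concept's frame embeddings with word embeddings via cosine similarity.
# All embeddings here are random placeholders.
import torch
import torch.nn.functional as F

def concept_word_similarity(frame_embs: torch.Tensor, word_emb: torch.Tensor) -> float:
    """Cosine similarity between a concept's visual centroid and a word embedding."""
    centroid = F.normalize(frame_embs, dim=-1).mean(dim=0)
    return F.cosine_similarity(centroid, word_emb, dim=0).item()

car_frames = torch.randn(100, 512)  # embeddings of 100 frames showing a car
words = {"car": torch.randn(512), "road": torch.randn(512), "ball": torch.randn(512)}

# In a well-aligned model, the "car" frames should score closer to related
# words ("car", "road") than to unrelated ones ("ball").
for name, emb in words.items():
    print(name, concept_word_similarity(car_frames, emb))
```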

The image below shows the high alignment of CVCL's visual and language systems.

The dashed lines represent the distance between each concept's visual centroid and its word embedding. Different visual concepts vary in how tightly their examples cluster. Because an infant's gaze drifts between objects that are physically close together, the model does not form a clean referential mapping when distinguishing "hand" from "toy," but "car" and "crib" perform well. For each concept, the researchers illustrate CVCL's predictions with t-SNE visualizations: the blue dots on the left correspond to 100 frames belonging to that category, and the green dots on the right correspond to the 100 frames with the highest activations (by cosine similarity to the concept's word embedding in CVCL). Below each plot are example frames from one or more sub-clusters within the concept, showing how word embeddings interact with image embeddings in the joint embedding space. For the word "staircase," for example, one cluster represents images of an indoor wooden staircase and another main cluster represents images of a set of blue outdoor stairs. All the t-SNE plots are derived from the same joint image-text embedding space.

The lower image shows whether the model can localize the referent across different views. In the normalized attention maps, yellow marks the areas of highest attention. For the first two categories (ball and car), the model can localize the referent across different views; for the next two (cat and paper), the attention maps sometimes drift away from the referent, indicating that localization ability is not consistent across categories.

Of course, there are still many differences between children's learning and machine learning models.

But the research by Lake's team is undoubtedly inspiring for us.

References:

https://www.nytimes.com/2024/04/30/science/ai-infants-language-learning.html

https://www.theregister.com/2024/05/12/boffins_hope_to_make_ai/

https://www.science.org/doi/10.1126/science.adi1374

This article is from the WeChat official account: Xin Zhi Yuan (ID: AI_era)
