AI learns language by studying infant’s headcam recordings

Scientists train AI using video frames captured from a child wearing a head-mounted camera (New York University/Wai Keen Vong)

A breakthrough new AI system has learned language concepts solely from headcam video and audio recordings of a single child, captured from when the infant was six months old until his second birthday.

Until now, AI systems like GPT-4, the model behind OpenAI’s ChatGPT, have learned to use human language from astronomical amounts of input data, far more than children receive when they learn to speak.

While the text used to train the best AI systems runs to trillions of words, children hear just millions of words per year as they grow up and learn a language.

The new study, published in the journal Science on Thursday, demonstrated that an AI system could be developed to learn a substantial number of words and concepts using the limited experiences of a child.

Researchers from New York University showed that the video captured during only one per cent of the child’s waking hours is sufficient for genuine language learning.

In the study, Sam, a baby boy living near Adelaide in Australia, wore a head-mounted camera for around one hour twice each week from the age of six months to around two years, gathering experiences from the infant’s perspective.

Researchers then trained the AI on frames from the video and on transcripts of the words spoken to Sam in the recordings.

The footage contained about 250,000 word-instances – including word repetitions – linked with video frames of what the child saw when the words were spoken.

It also included a range of different activities, including mealtimes, reading books, and the child playing.

“We show, for the first time, that a neural network trained on this developmentally realistic input from a single child can learn to link words to their visual counterparts,” study first author Wai Keen Vong said in a statement.

“Our results demonstrate how recent algorithmic advances paired with one child’s naturalistic experience has the potential to reshape our understanding of early language and concept acquisition,” Dr Vong said.

The research also sheds light on the ingredients of language that children need to learn new words – whether they need any innate knowledge, or just associative learning to get going.

Researchers trained the AI by building associations between the images the child saw and the words the infant heard at the same time.

The approach rests on the idea that when a parent says something, some of the words are likely to refer to what the child can see, so the infant’s comprehension builds up by linking visual and language cues.

“Combining these cues is what enables contrastive learning to gradually determine which words belong with which visuals and to capture the learning of a child’s first words,” Dr Vong explained.
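As a rough illustration of what contrastive learning means here, the sketch below scores every image against every word in a small batch and rewards the pairs that actually occurred together. It is a generic, simplified version of the technique and not the researchers’ own code: the function names, the toy two-number “embeddings” and the temperature value of 0.07 are all assumptions made for the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale a vector to unit length so the dot product is a cosine similarity.
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def cross_entropy(scores, target):
    # Numerically stable -log(softmax(scores)[target]).
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]

def contrastive_loss(image_embs, word_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of co-occurring pairs.

    image_embs[k] and word_embs[k] are assumed to come from the same
    moment in the recording; every other combination in the batch is
    treated as a mismatch. The temperature is an illustrative value.
    """
    imgs = [normalize(v) for v in image_embs]
    words = [normalize(v) for v in word_embs]
    n = len(imgs)
    sims = [[dot(i, w) / temperature for w in words] for i in imgs]
    # Each image should pick out its own word ...
    img_to_word = sum(cross_entropy(sims[k], k) for k in range(n)) / n
    # ... and each word should pick out its own image.
    word_to_img = sum(
        cross_entropy([sims[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (img_to_word + word_to_img) / 2
```

Training would adjust the embeddings to push this loss down, so that words and the scenes they were heard in end up close together while unrelated pairs drift apart.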

After training the AI, the researchers evaluated the system by presenting it with a target word and an array of four different images and asking it to select the image that matched the word.
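A test of this kind can be sketched in a few lines, assuming the trained system turns each word and each image into a list of numbers (an embedding) that can be compared. The names `pick_image` and `accuracy` and the similarity measure are hypothetical choices for illustration, not details taken from the study. With four options, a system guessing at random would be right 25 per cent of the time.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def pick_image(word_emb, candidate_embs):
    # Return the index of the candidate image most similar to the word.
    scores = [cosine(word_emb, c) for c in candidate_embs]
    return max(range(len(scores)), key=scores.__getitem__)

def accuracy(trials):
    """Fraction of trials where the chosen image is the correct one.

    Each trial is (word_emb, four candidate image embeddings,
    index of the correct image).
    """
    correct = sum(
        pick_image(word, candidates) == target
        for word, candidates, target in trials
    )
    return correct / len(trials)
```

Comparing the measured accuracy against the 25 per cent chance level is what shows whether the system has genuinely linked a word to the right kind of image.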

Not only was the AI model able to learn a substantial number of words, such as “crib” and “ball”, and concepts that are part of a child’s everyday experience, but it could even generalise those concepts to visual examples different from the ones it was trained on.

The findings can contribute to a better understanding of early language acquisition in children, scientists say.