The model decodes speech segments with up to 73% top-10 accuracy from a vocabulary of 793 words.
Meta has been applying its AI tools in different fields for some time, and its MyoSuite is one example of what AI can offer medicine. Now the company has presented a new AI model that could help people with medical conditions: it decodes speech from noninvasive recordings of brain activity.
Meta claims that from three seconds of brain activity, the model can decode the corresponding speech segments with up to 73% top-10 accuracy from a vocabulary of 793 words.
The deep learning model is trained with contrastive learning to align noninvasive brain recordings with speech sounds. To do this, the researchers use wav2vec 2.0, an open-source self-supervised learning model, which helps identify the complex representations of speech in the brains of volunteers listening to audiobooks.
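To make the idea of contrastive alignment concrete, here is a minimal sketch of a generic InfoNCE-style loss in plain Python. It is not Meta's implementation: the function names, the temperature value, the use of cosine similarity, and the two-dimensional toy embeddings are all assumptions made for illustration. The loss is low when each brain embedding is most similar to the embedding of the speech segment it was recorded with, and high otherwise.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def contrastive_loss(brain_embs, speech_embs, temperature=0.1):
    """Mean cross-entropy of picking the matching speech segment
    (same index) for each brain segment in the batch.
    The temperature of 0.1 is an assumed value, not one from the paper."""
    brain = [normalize(v) for v in brain_embs]
    speech = [normalize(v) for v in speech_embs]
    total = 0.0
    for i, b in enumerate(brain):
        # Cosine similarity to every speech segment in the batch.
        logits = [dot(b, s) / temperature for s in speech]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability assigned to the true pairing.
        total += log_sum_exp - logits[i]
    return total / len(brain)

# Hypothetical toy batch: three speech embeddings and three brain embeddings.
speech = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]    # each close to its own speech segment
shuffled = [aligned[1], aligned[2], aligned[0]]    # pairings deliberately broken

print("aligned loss: ", contrastive_loss(aligned, speech))
print("shuffled loss:", contrastive_loss(shuffled, speech))
```

Training would adjust the brain model so that real recordings behave like the aligned case rather than the shuffled one.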
"We focused on two noninvasive technologies: electroencephalography and magnetoencephalography (EEG and MEG, for short), which measure the fluctuations of electric and magnetic fields elicited by neuronal activity, respectively. In practice, both systems can take approximately 1,000 snapshots of macroscopic brain activity every second, using hundreds of sensors."
The researchers used four open-source EEG and MEG datasets, comprising more than 150 hours of recordings from 169 healthy volunteers listening to audiobooks and isolated sentences in English and Dutch. They then fed those EEG and MEG recordings into a “brain” model, which consists of a standard deep convolutional network with residual connections.
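For readers unfamiliar with the term, a residual connection simply adds a layer's input back to its output. The sketch below shows one such block over a sensors-by-time recording in plain Python. The layer sizes, the ReLU activation, the zero padding, and all function names are assumptions for illustration; the article does not describe the actual layers of Meta's network.

```python
def conv1d_same(x, weights, bias):
    """Zero-padded 1D convolution over time.
    x: [channels_in][time]; weights: [channels_out][channels_in][kernel] (odd kernel)."""
    n_time = len(x[0])
    pad = len(weights[0][0]) // 2
    out = []
    for w_out, b in zip(weights, bias):
        row = []
        for t in range(n_time):
            acc = b
            for c_in, w_k in enumerate(w_out):
                for k, w in enumerate(w_k):
                    idx = t + k - pad
                    if 0 <= idx < n_time:  # samples outside the recording count as zero
                        acc += w * x[c_in][idx]
            row.append(acc)
        out.append(row)
    return out

def relu(x):
    return [[max(0.0, v) for v in row] for row in x]

def residual_block(x, weights, bias):
    """Output = input + ReLU(conv(input)); requires channels_out == channels_in."""
    y = relu(conv1d_same(x, weights, bias))
    return [[a + b for a, b in zip(row_x, row_y)] for row_x, row_y in zip(x, y)]

# Hypothetical toy recording: 2 sensors, 3 time samples.
x = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]
bias = [0.0, 0.0]
# With all-zero weights the convolution contributes nothing and the block is the identity.
zero_weights = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
                [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
# A kernel that copies each channel, so the block returns x + ReLU(x).
identity_weights = [[[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],
                    [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]

print(residual_block(x, zero_weights, bias))
print(residual_block(x, identity_weights, bias))
```

The skip path means each block only has to learn a correction to its input, which is the usual reason deep convolutional networks are built this way.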
The architecture learns to align the output of this brain model to the deep representations of the speech sounds that were presented to the participants. After training, the system performs zero-shot classification: "given a snippet of brain activity, it can determine from a large pool of new audio clips which one the person actually heard. From there, the algorithm infers the words the person has most likely heard."
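The retrieval step and the top-10 accuracy figure can be illustrated with a short sketch: rank every candidate audio clip by its similarity to the brain embedding, then count a hit whenever the clip the person actually heard appears among the first k candidates. The cosine similarity, the function names, and the tiny two-dimensional embeddings below are assumptions for illustration, not details from Meta's system.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_candidates(brain_emb, candidate_embs):
    """Return candidate indices sorted from most to least similar."""
    scores = [cosine(brain_emb, c) for c in candidate_embs]
    return sorted(range(len(candidate_embs)), key=lambda i: scores[i], reverse=True)

def top_k_accuracy(brain_embs, candidate_embs, true_indices, k=10):
    """Fraction of brain segments whose true clip is among the k best-ranked candidates."""
    hits = 0
    for emb, truth in zip(brain_embs, true_indices):
        if truth in rank_candidates(emb, candidate_embs)[:k]:
            hits += 1
    return hits / len(brain_embs)

# Hypothetical pool of four candidate clips and three brain segments.
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
brain = [[0.9, 0.2], [0.6, 0.7], [-0.2, -0.9]]
heard = [0, 0, 3]  # index of the clip each person actually heard

print("top-1 accuracy:", top_k_accuracy(brain, candidates, heard, k=1))
print("top-2 accuracy:", top_k_accuracy(brain, candidates, heard, k=2))
```

In this toy pool the second segment is ranked wrongly at first place but correctly within the first two, which is why a top-k score is more forgiving than exact-match accuracy.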
The next step is to see if the team can extend this model to directly decode speech from brain activity without needing the pool of audio clips.
The paper concludes that this approach benefits from the pooling of large amounts of heterogeneous data, and could help improve the decoding of small datasets. The algorithms could be pretrained on large datasets that include many individuals and conditions, and then support the decoding of brain activity for a new patient with little data.