Microsoft’s new artificial intelligence makes the Mona Lisa rap. How does it work?
(CNN) — The Mona Lisa can do more than just smile, thanks to new artificial intelligence technology from Microsoft.
Last week, Microsoft researchers unveiled a new artificial intelligence model that can take a still image of a face and an audio clip of a person speaking and automatically generate a realistic video of that person talking. The videos, which can be created from photorealistic faces as well as cartoons or illustrations, feature convincing lip-syncing and natural facial and head movements.
In a demo video, the researchers showed how they animated the Mona Lisa to perform a comedic rap by actress Anne Hathaway.
The results from the AI model, called VASA-1, are as entertaining as they are a little jarring in their realism. According to Microsoft, the technology could be used in education, to “improve accessibility for people with communication difficulties” or even to create virtual companions. But it’s also easy to see how the tool could be abused to impersonate real people.
The problem extends beyond Microsoft: as new tools emerge for creating compelling AI-generated images, videos and audio, experts fear their misuse could fuel new forms of misinformation. Some are also concerned that the technology could further disrupt creative industries, from film to advertising.
For now, Microsoft has no plans to make the VASA-1 model public. The move is similar to how Microsoft partner OpenAI is handling its own AI-powered video tool, Sora: OpenAI unveiled Sora in February but has so far made it available only to a small number of professional users and cybersecurity educators for testing purposes.
“We oppose any behavior that seeks to generate misleading or harmful content from real people,” Microsoft researchers said in a blog post. But, they added, the company has “no plans to publish” the product “until we are confident that the technology will be used responsibly and in accordance with relevant regulations.”
Making faces move
Microsoft’s new AI model was trained on numerous videos of people’s faces as they speak, and it is designed to recognize natural facial and head movements, including “lip motion, (non-lip) expression, eye gaze and blinking, among others,” the researchers explained. The result is a more realistic video when VASA-1 animates a still photo.
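Microsoft has not released VASA-1’s code or weights, so there is no public API to demonstrate. Purely as an illustration of the kind of pipeline the researchers describe, here is a minimal Python sketch: the audio is chunked per video frame, a motion model predicts facial dynamics (lip motion, expression, gaze, blinking) for each chunk, and a renderer would then re-pose the still image accordingly. Every name and all of the stub logic below are hypothetical stand-ins, not Microsoft’s actual method.

```python
# Conceptual sketch only: VASA-1 is not public, so these functions are
# simplified stand-ins. A still portrait plus an audio clip yield one set
# of predicted facial dynamics per video frame; a real system would also
# render each frame from the portrait using those dynamics.

from dataclasses import dataclass

@dataclass
class FacialDynamics:
    """One frame of predicted motion applied to the still portrait."""
    lip_motion: float   # mouth openness driving the lip-sync
    expression: float   # non-lip expression intensity (brows, cheeks)
    gaze: tuple         # (yaw, pitch) of eye direction, in degrees
    blink: float        # 0.0 = eyes open, 1.0 = fully closed

def encode_audio(audio_samples: list[float], frame_rate: int = 25) -> list[list[float]]:
    """Stand-in audio encoder: chunk raw samples into per-video-frame windows."""
    window = max(1, len(audio_samples) // frame_rate)
    return [audio_samples[i:i + window] for i in range(0, len(audio_samples), window)]

def predict_dynamics(audio_window: list[float]) -> FacialDynamics:
    """Stand-in for the learned motion model: louder audio, wider mouth."""
    loudness = sum(abs(s) for s in audio_window) / max(1, len(audio_window))
    return FacialDynamics(lip_motion=min(1.0, loudness),
                          expression=0.3, gaze=(0.0, 0.0), blink=0.0)

def animate_portrait(face_image, audio_samples: list[float]) -> list[FacialDynamics]:
    """Driver loop: one set of facial dynamics per video frame."""
    return [predict_dynamics(w) for w in encode_audio(audio_samples)]

if __name__ == "__main__":
    fake_audio = [0.1, 0.8, -0.6, 0.4] * 25   # one second of toy samples
    frames = animate_portrait(face_image=None, audio_samples=fake_audio)
    print(f"{len(frames)} frames of facial dynamics, first: {frames[0]}")
```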
For example, in a demo video of someone sounding agitated, apparently while playing video games, the speaker’s brows furrow and their lips purse.
The AI tool can also create a video of the subject looking in a certain direction or expressing a certain emotion.
If you look closely, there are still signs that the videos were created by a machine, such as infrequent blinking and exaggerated eyebrow movements. But Microsoft believes its model is “far superior” to similar tools and “paves the way for real-time interactions with realistic avatars that mimic human conversational behavior.”