Google has unveiled its next major AI model, Gemini 2.0 Flash, designed to compete with the latest offerings from OpenAI. Announced on Wednesday, the new model can natively generate images and audio in addition to text. 2.0 Flash can also use third-party apps and services, letting it tap into Google Search, execute code, and more.
An experimental release of 2.0 Flash is now available through the Gemini API and Google’s AI developer platforms, AI Studio and Vertex AI. However, the audio and image generation capabilities are initially accessible only to “early access partners,” with a broader rollout planned for January.
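For developers with API access, a call to the experimental model might look roughly like the sketch below. It assumes Google's google-genai Python SDK; the API key, prompt, and experimental model ID "gemini-2.0-flash-exp" are placeholders for whatever your own setup uses.

```python
# A minimal sketch of calling experimental 2.0 Flash through the Gemini API,
# assuming the google-genai Python SDK. Key and prompt are placeholders.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",  # experimental release named in the announcement
    contents="Summarize the key features announced for Gemini 2.0 Flash.",
)
print(response.text)
```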
In the coming months, Google plans to integrate 2.0 Flash into a variety of products, including Android Studio, Chrome DevTools, Firebase, Gemini Code Assist, and others.
Flash, Upgraded
The first-generation Flash, 1.5 Flash, could only generate text and was not designed for particularly demanding workloads. The new 2.0 Flash model is significantly more versatile, partly because it can call tools like Search and interact with external APIs.
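As an illustration of that tool-calling flow, here is a minimal sketch assuming the google-genai SDK's Google Search grounding tool; the prompt is a placeholder, and the exact tool surface for 2.0 Flash may differ from what is shown:

```python
# A sketch of grounding a 2.0 Flash response with Google Search results,
# assuming the google-genai SDK's search tool. Prompt/key are placeholders.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="Who won the most recent Formula 1 race?",
    config=types.GenerateContentConfig(
        # The model decides when to issue a search and folds results
        # into its answer; code execution is exposed as a similar tool.
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```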
“We know Flash is extremely popular with developers for its balance of speed and performance,” said Tulsee Doshi, head of product for the Gemini model at Google, during a briefing on Tuesday. “And with 2.0 Flash, it’s just as fast as ever, but now it’s even more powerful.”
According to Google's own testing, 2.0 Flash is twice as fast as the company's Gemini 1.5 Pro model on certain benchmarks. The new model is "significantly" improved in areas such as coding and image analysis, and Google says it replaces 1.5 Pro as the flagship Gemini model thanks to its superior math skills and "factuality."
New Capabilities
2.0 Flash can generate and modify images alongside text. It can also ingest photos, videos, and audio recordings to answer questions about them, such as “What did he say?”
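As a rough sketch of the ingestion side, assuming the google-genai SDK and a hypothetical local recording named interview.mp3:

```python
# A sketch of asking 2.0 Flash about an audio recording, assuming the
# google-genai SDK. The file name is hypothetical.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Read a local recording and pass it inline alongside the question.
with open("interview.mp3", "rb") as f:
    audio_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[
        types.Part.from_bytes(data=audio_bytes, mime_type="audio/mp3"),
        "What did he say?",
    ],
)
print(response.text)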
Another key feature of 2.0 Flash is its audio generation capability, which Doshi described as “steerable” and “customizable.” For instance, the model can narrate text using one of eight voices optimized for different accents and languages. Users can ask it to adjust the speed of speech or even to speak in a specific style, like a pirate.
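Because audio generation is gated to early access partners for now, the interface may well change, but a request could plausibly be configured along the lines below through the Multimodal Live API described later in this piece. The voice name "Puck" and the speech-config fields are assumptions drawn from the SDK's types, not confirmed details of the early-access program.

```python
# A speculative sketch of steering 2.0 Flash's audio output, assuming the
# google-genai SDK's live client. Voice name and config are assumptions.
import asyncio

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Puck")
        )
    ),
)

async def main() -> None:
    async with client.aio.live.connect(
        model="gemini-2.0-flash-exp", config=config
    ) as session:
        # Style steering ("speak slowly", "like a pirate") rides along
        # in the prompt itself rather than in the config.
        await session.send(
            input="Say 'welcome aboard' slowly, in the voice of a pirate.",
            end_of_turn=True,
        )
        audio_chunks = []
        async for message in session.receive():
            if message.data:  # generated audio streams back as raw chunks
                audio_chunks.append(message.data)

asyncio.run(main())
```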
Despite these advancements, Google has not yet provided images or audio samples from 2.0 Flash, so the quality of its outputs compared to other models remains to be seen.
To address concerns about misuse, Google is using its SynthID technology to watermark all audio and images generated by 2.0 Flash. On software and platforms that support SynthID, the model’s outputs will be flagged as synthetic. This measure aims to mitigate the growing threat of deepfakes, which saw a fourfold increase in detections worldwide from 2023 to 2024, according to ID verification service Sumsub.
Multimodal API
The production version of 2.0 Flash is set to launch in January. In the meantime, Google is releasing an API called the Multimodal Live API to help developers build apps with real-time audio and video streaming functionality.
The Multimodal Live API lets developers build real-time, multimodal apps with audio and video inputs from cameras or screens. It supports tool integration for accomplishing tasks and can handle "natural conversation patterns" such as interruptions, putting it in the same territory as OpenAI's Realtime API.
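A minimal sketch of a streaming session, assuming the google-genai SDK's asynchronous live client; the prompt and key are placeholders, and a real app would stream microphone or camera input rather than a single text turn:

```python
# A sketch of a real-time session over the Multimodal Live API, assuming
# the google-genai SDK's async client. Prompt and key are placeholders.
import asyncio

from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

async def main() -> None:
    config = {"response_modalities": ["TEXT"]}
    async with client.aio.live.connect(
        model="gemini-2.0-flash-exp", config=config
    ) as session:
        # One user turn for illustration; a production app would stream
        # audio/video frames and could interrupt the model mid-response.
        await session.send(input="What can you do in real time?", end_of_turn=True)
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

asyncio.run(main())
```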
The Multimodal Live API is generally available starting today, giving developers a way to build real-time experiences on 2.0 Flash ahead of January's production release.