At a glance: Google has launched Gemini Embedding 2, the first multimodal embedding model in the Gemini API, which maps text, images, video, audio and documents into a single semantic space. The model supports more than 100 languages and targets use cases such as agentic RAG and visual search.
Google has made Gemini Embedding 2 generally available. It is the first embedding model in the Gemini API that brings text, images, video, audio and documents together in a unified semantic space. The model supports more than 100 languages and enables applications such as agentic multimodal Retrieval-Augmented Generation (RAG).
General availability covers both the Gemini API and the Gemini Enterprise Agent Platform. The model projects multiple modalities (text, images, video, audio and documents) into a shared semantic space and supports more than 100 languages.
A key strength of Gemini Embedding 2 is that it accepts several input types in a single request. The model can process up to 8,192 text tokens, 6 images, 120 seconds of video, 143 seconds of audio and 6 PDF pages. It also supports interleaved inputs, such as a combination of text and images in one request.
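To make these limits concrete, here is a minimal sketch of a client-side check that a developer might run before sending a mixed request. It only encodes the figures quoted above. The `EmbeddingRequest` class and the `limit_violations` function are hypothetical names chosen for illustration and are not part of the Gemini API or any Google SDK.

```python
from dataclasses import dataclass

# Per-request limits as quoted in the article, not read from any official SDK.
MAX_TEXT_TOKENS = 8192
MAX_IMAGES = 6
MAX_VIDEO_SECONDS = 120
MAX_AUDIO_SECONDS = 143
MAX_PDF_PAGES = 6


@dataclass
class EmbeddingRequest:
    """Hypothetical summary of what one mixed-modality request contains."""
    text_tokens: int = 0
    images: int = 0
    video_seconds: float = 0
    audio_seconds: float = 0
    pdf_pages: int = 0


def limit_violations(req: EmbeddingRequest) -> list[str]:
    """Return one message per exceeded limit; an empty list means the request fits."""
    checks = [
        ("text tokens", req.text_tokens, MAX_TEXT_TOKENS),
        ("images", req.images, MAX_IMAGES),
        ("video seconds", req.video_seconds, MAX_VIDEO_SECONDS),
        ("audio seconds", req.audio_seconds, MAX_AUDIO_SECONDS),
        ("PDF pages", req.pdf_pages, MAX_PDF_PAGES),
    ]
    return [
        f"{name}: {value} > {limit}"
        for name, value, limit in checks
        if value > limit
    ]


# A request that mixes text and images and stays within every limit.
within_limits = EmbeddingRequest(text_tokens=500, images=2)

# A request with one image too many and a video that runs one second over.
over_limits = EmbeddingRequest(images=7, video_seconds=121)
```

Calling `limit_violations(within_limits)` returns an empty list, while `limit_violations(over_limits)` reports the image and video limits. A check like this could decide whether to split a long video or a large PDF into several requests before embedding.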
Because the modalities are aligned in one semantic space, developers can build applications that “see” and “hear” proprietary data. This opens up use cases ranging from agentic multimodal RAG to visual search, along with others that embedding models previously could not cover.
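What a shared semantic space means in practice can be sketched with a small similarity search. The example assumes that every item, whatever its modality, has already been turned into a vector of the same dimension. The three-dimensional vectors and file names below are made-up placeholders, not output from Gemini Embedding 2, and cosine similarity is a common choice for comparing embeddings rather than something the announcement prescribes.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: list[float], index: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Score every indexed item against the query and sort best match first."""
    scored = [(item_id, cosine_similarity(query, vec)) for item_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder vectors standing in for embeddings of an image, an audio clip and a PDF.
index = {
    "photo_of_dog.jpg": [0.9, 0.1, 0.0],
    "podcast_clip.mp3": [0.1, 0.9, 0.1],
    "manual.pdf": [0.0, 0.2, 0.9],
}

# Placeholder vector standing in for the embedding of a text query.
query_vector = [0.8, 0.2, 0.1]

results = rank(query_vector, index)
for item_id, score in results:
    print(f"{item_id}: {score:.3f}")
```

Because all items live in the same vector space, one text query can rank images, audio and documents together, which is the basic building block behind visual search and multimodal RAG retrieval.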