Gemini Embedding 2: Multimodal AI for Intelligent Applications

At a glance: Google’s new Gemini Embedding 2 unifies text, images, video, audio and documents in a single embedding model. It supports over 100 languages and can process several media types in one request, which opens up new possibilities for multimodal AI applications and intelligent search systems.

Google has announced the general availability of Gemini Embedding 2. It is the first multimodal embedding model in the Gemini API: it maps text, images, video, audio and documents into a single semantic space and supports over 100 languages.

Patrick Löber, Lucia Loher and a team of Google experts present the new capabilities that Gemini Embedding 2 unlocks. A single call accepts a wide range of inputs: up to 8,192 text tokens, 6 images, 120 seconds of video, 143 seconds of audio, and 6 PDF pages. Because all of these media types are projected into one shared semantic space, developers can build applications that “see” and “hear” proprietary data. A particular strength of Gemini Embedding 2 is that it can process interleaved inputs, such as mixed text and image content, in a single request. This enables use cases ranging from agentic multimodal Retrieval-Augmented Generation (RAG) to visual search.
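
To make the idea of a shared semantic space more concrete, here is a minimal Python sketch of cross-modal search. It does not call the Gemini API. The three-dimensional vectors and the file names are made-up placeholders standing in for embeddings that a multimodal model would return, and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical. The sketch only illustrates the retrieval step: once a text query, images, audio and PDFs all live in the same vector space, ranking by cosine similarity works the same way for every media type.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as unrelated to everything.
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, index):
    """Sort the items of `index` (id -> vector) by similarity to the query, best first."""
    scored = [
        (item_id, cosine_similarity(query_vector, vector))
        for item_id, vector in index.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# ASSUMPTION: toy placeholder vectors, not real model output.
# Real embeddings would have far more dimensions.
index = {
    "photo_of_red_bicycle.jpg": [0.9, 0.1, 0.0],
    "podcast_about_cooking.mp3": [0.0, 0.2, 0.9],
    "bike_repair_manual.pdf": [0.7, 0.6, 0.1],
}

# ASSUMPTION: stands in for the embedding of the text query "red bicycle".
query_vector = [1.0, 0.2, 0.0]

results = rank_by_similarity(query_vector, index)
for item_id, score in results:
    print(f"{score:.3f}  {item_id}")
```

With these placeholder values, the bicycle photo ranks first, the repair manual second and the cooking podcast last, even though the query is text and the indexed items are an image, an audio file and a PDF. In a real application, a vector database would typically replace the in-memory dictionary.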