Google slashes Gemini AI model costs, boosts speed & efficiency

Thu, 26th Sep 2024

Google has introduced two updated production models, Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002, bringing significant improvements in speed, cost, and efficiency. The updates pair performance gains with reduced prices, both of which stand to benefit developers building on these tools.

"Today, we're releasing two updated production-ready Gemini models: Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002," said Logan Kilpatrick, Senior Product Manager. The new releases feature a 50 per cent price reduction on the 1.5 Pro model for both input and output tokens for prompts under 128,000 tokens. The models offer higher rate limits and reduced latency; the 1.5 Flash sees a rate limit increase to 2,000 requests per minute (RPM), while the 1.5 Pro sees an increase to 1,000 RPM. Both models also offer faster output, with a twofold increase in speed and a threefold decrease in latency.

Developers can access these models for free via Google AI Studio and the Gemini API, while large organisations and Google Cloud customers can use them on Vertex AI. The updates build on earlier experimental releases and deliver meaningful improvements over the Gemini models presented at Google I/O in May.
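For reference, calling one of the new models through the Gemini API's official Python SDK, google-generativeai, looks like the minimal sketch below; the API-key handling and prompt are illustrative.

```python
import os
import google.generativeai as genai

# Authenticate with an API key from Google AI Studio.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Target the updated production model by name.
model = genai.GenerativeModel("gemini-1.5-pro-002")

response = model.generate_content("Summarise the trade-offs of context caching.")
print(response.text)
```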

The Gemini 1.5 series models bring improvements across a wide range of tasks, including text synthesis, coding, and vision applications. "The Gemini models can be used to synthesise information from 1000-page PDFs, answer questions about repositories with more than 10,000 lines of code, take in hour-long videos, and create useful content from them," noted Shrestha Basu Mallick, Group Product Manager. The models show an increase of approximately seven per cent on MMLU-Pro, a more challenging variant of the popular MMLU benchmark, and gains of around 20 per cent on the MATH and HiddenMath benchmarks. Performance on visual understanding and Python code generation has improved by around two to seven per cent.
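The long-document scenario Basu Mallick describes can be reproduced with the SDK's File API. A minimal sketch, assuming a local report.pdf stands in for a large document:

```python
import google.generativeai as genai

# Upload a large document once; it can then be referenced in prompts.
doc = genai.upload_file("report.pdf")  # hypothetical 1,000-page PDF

model = genai.GenerativeModel("gemini-1.5-flash-002")
response = model.generate_content(
    [doc, "Synthesise the key findings of this document in ten bullet points."]
)
print(response.text)
```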

The models' default filter settings have also been updated, with the aim of balancing adherence to user instructions against the safety of outputs. Kilpatrick highlighted that developers using the latest versions (dubbed the -002 models) "will not have filters applied by default, allowing them to configure the models based on their specific needs".
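In the Python SDK, that per-application configuration is expressed through the safety_settings parameter; the thresholds below are illustrative choices rather than recommended defaults.

```python
import google.generativeai as genai
from google.generativeai.types import HarmBlockThreshold, HarmCategory

model = genai.GenerativeModel(
    "gemini-1.5-flash-002",
    # Opt in to the filtering level appropriate for the application.
    safety_settings={
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    },
)
```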

The Gemini-1.5-Pro-002 model is particularly noteworthy for its reduced pricing. Effective from the start of October, input tokens are 64 per cent cheaper, output tokens 52 per cent cheaper, and incremental cached tokens 64 per cent cheaper on prompts under 128,000 tokens. The price drop aims to further reduce the cost of using Gemini in production, particularly when coupled with context caching.
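To put rough numbers on the drop: assuming Gemini 1.5 Pro's list prices were USD $3.50 per million input tokens and $10.50 per million output tokens before the change (figures worth verifying against Google's current pricing page), the arithmetic works out as in this sketch.

```python
# Assumed list prices (USD per 1M tokens) for Gemini 1.5 Pro on prompts
# under 128K tokens, before and after the October change; verify against
# Google's current pricing page.
OLD_INPUT, NEW_INPUT = 3.50, 1.25     # ~64% reduction
OLD_OUTPUT, NEW_OUTPUT = 10.50, 5.00  # ~52% reduction

def cost(input_tokens: int, output_tokens: int, in_price: float, out_price: float) -> float:
    """Total cost in USD at per-million-token prices."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Example workload: 500M input and 100M output tokens per month.
before = cost(500_000_000, 100_000_000, OLD_INPUT, OLD_OUTPUT)
after = cost(500_000_000, 100_000_000, NEW_INPUT, NEW_OUTPUT)
print(f"before: ${before:,.2f}/mo, after: ${after:,.2f}/mo")  # $2,800.00 vs $1,125.00
```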

The updated models also reflect changes based on developer feedback. "We have made the models' responses more concise to reduce costs and make them easier to use," highlighted Kilpatrick. Compared with previous models, the default output length for summarisation, question answering, and extraction tasks has been reduced by approximately five to 20 per cent. For chat-based products requiring longer responses, customised prompting strategies can make the models more verbose and conversational.
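One such strategy is a system instruction that asks for fuller answers; the wording below is illustrative rather than Google-recommended.

```python
import google.generativeai as genai

# Counteract the more concise -002 defaults for chat-style products.
model = genai.GenerativeModel(
    "gemini-1.5-pro-002",
    system_instruction=(
        "You are a conversational assistant. Answer in full sentences, "
        "explain your reasoning, and keep a friendly, chatty tone."
    ),
)
response = model.generate_content("How does context caching reduce cost?")
print(response.text)
```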

Additional experimental updates include the release of Gemini-1.5-Flash-8B-Exp-0924, which offers significant performance improvements across text and multimodal use cases. "The overwhelmingly positive feedback developers have shared about 1.5 Flash-8B has been incredible to see," said Basu Mallick. The company plans to continue shaping its experimental-to-production release pipeline based on developer feedback.
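Trying the experimental model from the same SDK is a one-line change of model identifier; the lowercase name below follows the SDK's usual naming convention and should be checked against the current model list.

```python
import google.generativeai as genai

# Experimental models are addressed by their dated identifier.
model = genai.GenerativeModel("gemini-1.5-flash-8b-exp-0924")
response = model.generate_content("Classify this support ticket: 'app crashes on login'.")
print(response.text)
```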

These updates reflect Google's commitment to providing developers with powerful and cost-effective AI models, improving performance benchmarks while reducing operational costs. The improvements appear to reinforce Gemini's position in the competitive AI landscape while making application development more accessible and efficient.
