How to Utilize Apple's MM1 AI Model for Siri 2.0


Though it appears to be catching up quickly, Apple is something of a latecomer to the large language model (LLM) field, trailing Google, Microsoft, and Meta in the development of powerful AI tools.

CEO Tim Cook promised investors earlier this year that there would be a big announcement on AI, referring to it as a "major breakthrough." Many believe this will be a new Siri that runs on an LLM, much as Google replaced Assistant with Gemini.

A recent paper from Apple researchers offers insight into the potential foundation for the next iteration of Siri, which, if rumors are to be believed, may coexist with Gemini on the iPhone to give users a choice.

Published as a preprint research paper, MM1 presents a new approach to using AI-generated labels and data to speed up the training of new models, perhaps including Siri 2.0.

What is the Apple MM1?

Fundamentally, MM1 is a new approach to training multimodal models using synthetic data, including both text and images.

According to the researchers behind MM1, their approach improves performance and reduces the number of follow-up prompts needed to produce the intended outcome.

Better prompt comprehension, and reaching the intended result with as little back-and-forth with the AI as possible, is ideal for consumer technology, particularly for Siri, which will be used by a wide range of people with differing levels of technical skill.

MM1 appears to be a family of AI models, the largest of which has up to 30 billion parameters. Although this is far fewer than the trillion-plus parameters that GPT-4 and Claude 3 Opus are reported to have, the researchers claim that efficiency gains have allowed them to match key benchmarks.

The researchers wrote that by scaling up their recipe, they built MM1, a family of multimodal models of up to 30B parameters that achieves state-of-the-art pre-training metrics and competitive performance on multimodal benchmarks after fine-tuning.

The major advance is in vision: the ability to analyze images and other visual input and understand what they contain. I recently tested how well ChatGPT, Claude, and Gemini perform on this task.

How does the Apple MM1 operate?


The full title of the paper is "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training." It was released quietly, with little fanfare, and made openly available, complete with details of the training data and benchmarks.

In it, the researchers argue that state-of-the-art performance can be achieved by combining different types of training data and model architectures rather than relying on any single approach.

The researchers stated that in order to achieve such performance, a "diverse dataset spanning visual and linguistic information" is needed. They employed a combination of image-caption, image-text, and text-only data.
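As a rough illustration of what mixing data sources means in practice, here is a minimal Python sketch that draws training examples from three pools in fixed proportions. The pool names follow the three data types named above, but the weights and the `sample_sources` helper are hypothetical, chosen only for illustration; they are not the proportions or code used for MM1.

```python
import random

# Hypothetical mixing weights, for illustration only (not MM1's actual ratios).
MIX = {"image_caption": 0.4, "image_text": 0.4, "text_only": 0.2}

def sample_sources(n, mix=MIX, seed=0):
    """Pick which data pool each of n training examples is drawn from."""
    rng = random.Random(seed)  # seeded so the draw is repeatable
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

batch = sample_sources(1000)
counts = {name: batch.count(name) for name in MIX}
print(counts)  # roughly proportional to the weights above
```

Changing the weights changes how much of each kind of data the model sees, which is the sort of knob the researchers describe tuning.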

This covers image captioning, visual question answering, and natural language understanding, including one-shot or few-shot prompts to get the intended result.


Thanks to large-scale pre-training, the researchers said, MM1 has appealing properties such as enhanced in-context learning and multi-image reasoning, which enable few-shot chain-of-thought prompting.
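To make "few-shot chain-of-thought prompting" concrete, here is a minimal Python sketch that assembles such a prompt: a few worked examples, each with its reasoning spelled out, followed by the new question. The `build_few_shot_prompt` helper, the `<image_N>` placeholders, and the prompt layout are all assumptions made for illustration; the paper does not prescribe this format, and no MM1 API is being called.

```python
def build_few_shot_prompt(examples, question):
    """Join worked examples and a new question into one prompt string."""
    parts = []
    for ex in examples:
        parts.append(
            f"Q: {ex['question']}\n"
            f"Reasoning: {ex['reasoning']}\n"
            f"A: {ex['answer']}"
        )
    # The new question ends at "Reasoning:" so the model continues from there.
    parts.append(f"Q: {question}\nReasoning:")
    return "\n\n".join(parts)

# <image_N> stands in for an attached image (hypothetical placeholder syntax).
examples = [
    {"question": "<image_1> How many apples are on the table?",
     "reasoning": "There are two apples on the left and one on the right.",
     "answer": "3"},
    {"question": "<image_2> Which cup is taller?",
     "reasoning": "The red cup reaches the shelf and the blue cup does not.",
     "answer": "The red cup"},
]
prompt = build_few_shot_prompt(examples, "<image_3> How many people are wearing hats?")
print(prompt)
```

The two worked examples show the model both the answer format and the habit of reasoning first, which is what lets it handle a new image and question without further instructions.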

What distinguishes the Apple MM1?

MM1 approaches pre-training and labeling differently. It uses a new type of architecture to combine models, including image encoders with higher image resolution, and it focuses on using that mix of data to improve overall performance from a single prompt.

To scale up without a matching increase in processing demands, it also uses a mixture-of-experts (MoE) architecture, which suggests that it could run on devices such as laptops or iPhones rather than only on cloud servers.
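The idea behind a mixture-of-experts layer can be sketched in a few lines of Python: a small router scores every expert for a given input, and only the top-scoring few are actually run. This is a toy sketch of top-k routing in general, not Apple's implementation; the `moe_forward` function, the toy experts, and the router weights are all hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, router_weights, top_k=2):
    """Run the input through only the top_k highest-scoring experts."""
    # Router: one score per expert (dot product with that expert's gate vector).
    scores = [sum(w * xi for w, xi in zip(gate, x)) for gate in router_weights]
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    gate_probs = softmax([scores[i] for i in chosen])
    output = [0.0] * len(x)
    for p, i in zip(gate_probs, chosen):
        y = experts[i](x)  # the experts that were not chosen are never executed
        output = [o + p * yi for o, yi in zip(output, y)]
    return output, chosen

def make_expert(scale):
    """Toy expert: scales its input by a fixed factor."""
    return lambda x: [scale * xi for xi in x]

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
output, chosen = moe_forward([2.0, 1.0], experts, router_weights, top_k=2)
print(chosen, output)
```

Here the model holds four experts but computes only two per input, which is why total parameter count can grow faster than the compute needed for each request.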

Google's Gemini 1.5 Pro model, which has a context window of more than a million tokens, recently made use of an MoE design as well. This allowed it to work more efficiently across very large inputs.

Will Siri 2.0 run on Apple MM1?


The paper makes no mention of Siri or any planned products. Even so, its focus on performance and efficiency, on getting good results with minimal prompting, and on broad multimodal capabilities all point to the direction Apple is likely to take with Siri in the future.

Given Apple's long-standing privacy policies, it is expected that a large number of capabilities of any LLM-powered Siri will need to operate "on device," especially when it comes to processing personal data.

Being able to build a very powerful model that can learn from user interactions and is small enough to run on an iPhone is a significant step.

Apple appears to be taking a multifaceted approach to delivering the "big bang" in AI that Cook promised investors, given the recent report that the company may bring Gemini to the iPhone and earlier rumors that it is also in talks with ChatGPT developer OpenAI.
