Meta's New Llama 3.2 Multimodal Open-Source AI Model
Discover Meta's groundbreaking Llama 3.2 models, featuring advanced multimodal capabilities for text and image processing.
Meta has recently unveiled its latest innovation in artificial intelligence: the Llama 3.2 models, which introduce significant advancements in multimodal capabilities. This release marks a pivotal moment for developers and businesses seeking to leverage AI for various applications, particularly those involving both text and image processing.
What Is the Llama 3.2 Multimodal Vision AI Open-Source Model?
The Llama 3.2-Vision collection is an advanced multimodal large language model (LLM) family designed by Meta. It integrates text and image processing and is optimized for tasks such as image reasoning, visual recognition, captioning, and answering questions about images. The broader Llama 3.2 collection comes in four sizes: 1B, 3B, 11B, and 90B. The lightweight 1B and 3B models are designed for efficiency and can be deployed on mobile and edge devices, while the 11B and 90B multimodal models are more versatile and can perform complex reasoning over high-resolution images. All variants are openly available, so you can fine-tune, distill, and deploy them wherever you like.
Model | Training Data | Params | Input modalities | Output modalities | Context length | GQA | Data volume | Knowledge cutoff |
---|---|---|---|---|---|---|---|---|
Llama 3.2-Vision 11B | (Image, text) pairs | 11B (10.6) | Text + Image | Text | 128k | Yes | 6B (image, text) pairs | December 2023 |
Llama 3.2-Vision 90B | (Image, text) pairs | 90B (88.8) | Text + Image | Text | 128k | Yes | 6B (image, text) pairs | December 2023 |
Key Features of Llama 3.2
Multimodal Capabilities
The Llama 3.2 models are Meta's first to support multimodal functionality, allowing them to handle both text and images. Specifically, the 11B and 90B Vision Instruct models are optimized for tasks that require image reasoning and understanding, making them suitable for applications like visual question answering and image captioning. This capability is enabled by specially trained image-reasoning adapter weights that work in conjunction with the language model, improving the AI's ability to interpret and respond to visual inputs.
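As a concrete illustration, here is a minimal sketch of sending an image and a question to the 11B Vision Instruct model. It assumes the Hugging Face transformers Mllama integration (roughly transformers 4.45 or later), gated access to the meta-llama checkpoints, and a placeholder image URL:

```python
# Minimal sketch: image + text question to Llama 3.2 11B Vision Instruct via
# Hugging Face transformers. The image URL is a placeholder; checkpoint access is gated.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)

# The chat template interleaves an image placeholder with the text prompt.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same chat-template pattern extends to captioning and document question answering simply by changing the text prompt.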
Optimized for Edge and Mobile Devices
In addition to their multimodal features, Llama 3.2 includes smaller models (1B and 3B) designed for deployment on edge devices and mobile platforms. This focus on lightweight models ensures that developers can create applications that prioritize user privacy and reduce reliance on cloud computing, which is especially beneficial for multilingual summarization and real-time processing.
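For quick experimentation before targeting an actual device, a lightweight model can be exercised with the standard transformers text-generation pipeline. This is a minimal sketch using the 3B Instruct checkpoint; real mobile deployments would typically go through a dedicated on-device runtime rather than this workstation setup:

```python
# Minimal sketch: summarization with the lightweight 3B Instruct model through the
# Hugging Face transformers pipeline. Assumes gated checkpoint access and enough
# local memory for bfloat16 weights.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Summarize in two sentences: Llama 3.2 adds 11B and 90B "
                                "vision models plus lightweight 1B/3B models for edge devices."},
]
result = generator(messages, max_new_tokens=96)
print(result[0]["generated_text"][-1]["content"])  # last message is the assistant reply
```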
Performance Enhancements
The new models utilize an optimized transformer architecture that incorporates supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). These enhancements improve the models' alignment with human preferences, making them more helpful and safer for users. Furthermore, Llama 3.2 supports context lengths of up to 128k tokens, allowing for complex interactions and richer content generation.
Integration with Major Platforms
Meta has partnered with several major cloud providers, including Microsoft Azure and Amazon Web Services (AWS), to ensure that Llama 3.2 is easily accessible for developers. The models are available through managed compute services on these platforms, enabling seamless integration into existing workflows. Developers can utilize tools like Azure AI Content Safety and prompt flow to enhance their applications while adhering to ethical AI practices.
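For example, on AWS the models can be called through Amazon Bedrock's Converse API. The following is a minimal sketch using boto3, where the model ID shown is illustrative and may differ depending on region and account access:

```python
# Minimal sketch: calling Llama 3.2 through Amazon Bedrock's Converse API with boto3.
# The model ID below is an assumption for illustration; check what is enabled in your account.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-west-2")

response = client.converse(
    modelId="us.meta.llama3-2-11b-instruct-v1:0",  # illustrative model ID
    messages=[
        {"role": "user", "content": [{"text": "Give three use cases for an on-device 3B language model."}]}
    ],
    inferenceConfig={"maxTokens": 256, "temperature": 0.5},
)

print(response["output"]["message"]["content"][0]["text"])
```

The Azure AI Model Catalog exposes the same models through its own SDKs and managed endpoints.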
Community Engagement and Open Source Commitment
Meta continues its commitment to openness by making the Llama 3.2 models available for download on platforms like Hugging Face and its own llama.com site. This approach encourages collaboration within the developer community, fostering innovation while ensuring responsible use of AI technologies. The company has also introduced new tools and resources aimed at helping developers navigate the complexities of building with AI responsibly.
Energy Use for Training Llama 3.2 Model
Training was performed on custom hardware, requiring 2.02 million GPU hours. Meta reports zero market-based greenhouse gas emissions due to its renewable energy usage.
Benchmarks of Llama 3.2 - Image Reasoning
The model's benchmarks demonstrate strong performance in visual and mathematical reasoning tasks, with the larger 90B version offering higher accuracy across tasks compared to the 11B version.
- For Base Pretrained Llama 3.2 Models
Category | Benchmark | # Shots | Metric | Llama 3.2 11B | Llama 3.2 90B |
---|---|---|---|---|---|
Image Understanding | VQAv2 (val) | 0 | Accuracy | 66.8 | 73.6 |
Image Understanding | Text VQA (val) | 0 | Relaxed accuracy | 73.1 | 73.5 |
Image Understanding | DocVQA (val, unseen) | 0 | ANLS | 62.3 | 70.7 |
Visual Reasoning | MMMU (val, 0-shot) | 0 | Micro average accuracy | 41.7 | 49.3 |
Visual Reasoning | ChartQA (test) | 0 | Accuracy | 39.4 | 54.2 |
Visual Reasoning | InfographicsQA (val, unseen) | 0 | ANLS | 43.2 | 56.8 |
Visual Reasoning | AI2 Diagram (test) | 0 | Accuracy | 62.4 | 75.3 |
- For Instruction Tuned Llama 3.2 Models
Modality | Capability | Benchmark | # Shots | Metric | Llama 3.2 11B | Llama 3.2 90B |
---|---|---|---|---|---|---|
Image | College-level Problems and Mathematical Reasoning | MMMU (val, CoT) | 0 | Micro average accuracy | 50.7 | 60.3 |
Image | College-level Problems and Mathematical Reasoning | MMMU-Pro, Standard (10 opts, test) | 0 | Accuracy | 33.0 | 45.2 |
Image | College-level Problems and Mathematical Reasoning | MMMU-Pro, Vision (test) | 0 | Accuracy | 23.7 | 33.8 |
Image | College-level Problems and Mathematical Reasoning | MathVista (testmini) | 0 | Accuracy | 51.5 | 57.3 |
Image | Charts and Diagram Understanding | ChartQA (test, CoT) | 0 | Relaxed accuracy | 83.4 | 85.5 |
Image | Charts and Diagram Understanding | AI2 Diagram (test) | 0 | Accuracy | 91.1 | 92.3 |
Image | Charts and Diagram Understanding | DocVQA (test) | 0 | ANLS | 88.4 | 90.1 |
Image | General Visual Question Answering | VQAv2 (test) | 0 | Accuracy | 75.2 | 78.1 |
Text | General | MMLU (CoT) | 0 | Macro_avg/acc | 73.0 | 86.0 |
Text | Math | MATH (CoT) | 0 | Final_em | 51.9 | 68.0 |
Text | Reasoning | GPQA | 0 | Accuracy | 32.8 | 46.7 |
Text | Multilingual | MGSM (CoT) | 0 | em | 68.9 | 86.9 |
How to Use the Llama 3.2 Models
Developers can get started right away on all of the major cloud and model-hosting platforms (a minimal download sketch follows this list):
- Amazon AWS - Llama 3.2 models from Meta are now available in Amazon SageMaker JumpStart and Amazon Bedrock.
- Microsoft Azure - Meta's new Llama 3.2 SLMs and image reasoning models are now available in the Azure AI Model Catalog.
- Hugging Face - Llama 3.2 - a Meta Llama Collection.
- GitHub - The Meta Llama 3.2 collection of multilingual large language models (LLMs) is available on GitHub.
- Llama - Llama 3.2 models are available for download on Meta's official llama.com website.
- Google Cloud - Meta's Llama 3.2 is now available on Google Cloud.
Individuals and explorers can start using it for free on Creatosaurus.
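If you prefer to work with the weights locally, here is a minimal sketch of downloading a checkpoint from Hugging Face. It assumes you have accepted the Llama 3.2 Community License on the model page and authenticated with an access token; the target directory is just an example:

```python
# Minimal sketch: pull Llama 3.2 1B Instruct weights from Hugging Face for local use.
# Assumes license acceptance on the model page and prior authentication
# (e.g. `huggingface-cli login`); the local directory is an example path.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="meta-llama/Llama-3.2-1B-Instruct",
    local_dir="./llama-3.2-1b-instruct",
)
print("Model files downloaded to:", path)
```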
Llama 3.2 Model Evaluations Against Other AI Models
Meta's evaluation indicates that the Llama 3.2 vision models perform competitively with leading models like Claude 3 Haiku and GPT-4o-mini, particularly in image recognition and various visual understanding tasks. The Llama 3.2 3B model surpasses models like Gemma 2 (2.6B) and Phi 3.5-mini in tasks such as instruction following, summarization, prompt rewriting, and tool use, while the 1B model holds its own against Gemma.
Meta assessed performance across over 150 benchmark datasets covering multiple languages, with a specific focus on image understanding and visual reasoning benchmarks for the vision LLMs.
Meta also allows developers to fine-tune Llama 3.2 for specific applications, provided they comply with the Llama 3.2 Community License; a parameter-efficient fine-tuning sketch is shown below.
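As one way to approach such customization, here is a minimal sketch of LoRA fine-tuning the 1B Instruct model with the peft, datasets, and transformers libraries. The dataset, target modules, and hyperparameters are illustrative assumptions rather than a recommended recipe:

```python
# Minimal sketch: LoRA fine-tuning of Llama 3.2 1B Instruct. Dataset and
# hyperparameters are illustrative; use must comply with the Llama 3.2 Community License.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Attach low-rank adapters to the attention projections instead of updating all weights.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
))

# Example instruction dataset with "instruction" and "output" columns (swap in your own).
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")

def tokenize(example):
    text = example["instruction"] + "\n" + example["output"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama32-1b-lora", per_device_train_batch_size=2,
                           num_train_epochs=1, learning_rate=2e-4, bf16=True, logging_steps=50),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("llama32-1b-lora-adapter")  # save only the small adapter weights
```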
Conclusion
The release of Meta's Llama 3.2 models represents a significant leap forward in the realm of AI, particularly in terms of multimodal capabilities and accessibility for edge computing. By combining advanced image reasoning with robust language processing, these models open up new possibilities for developers looking to create innovative applications across various industries. As Meta continues to support an open-source ethos, the future looks promising for those eager to harness the power of Llama 3.2 in their projects.