
Meta's New Llama 3.2 Multimodal Open-Source AI Model

Discover Meta's groundbreaking Llama 3.2 models, featuring advanced multimodal capabilities for text and image processing.


Meta has recently unveiled its latest innovation in artificial intelligence: the Llama 3.2 models, which introduce significant advancements in multimodal capabilities. This release marks a pivotal moment for developers and businesses seeking to leverage AI for various applications, particularly those involving both text and image processing.

What Is the Llama 3.2 Multimodal Vision AI Open-Source Model?

Llama 3.2 is Meta's latest collection of open large language models (LLMs), and it introduces vision-capable variants for the first time. The 11B and 90B Vision models integrate text and image processing and are optimized for tasks such as image reasoning, visual recognition, captioning, and answering questions about images. The collection comes in four sizes: 1B, 3B, 11B, and 90B. The lightweight 1B and 3B text models are designed for efficiency and can be deployed on mobile and edge devices, while the 11B and 90B multimodal models are more versatile and can perform complex reasoning over high-resolution images. All variants are openly available, so you can fine-tune, distill, and deploy them wherever you need.

| Model | Training Data | Params | Input modalities | Output modalities | Context length | GQA | Data volume | Knowledge cutoff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 3.2-Vision 11B | (Image, text) pairs | 11B (10.6) | Text + Image | Text | 128k | Yes | 6B (image, text) pairs | December 2023 |
| Llama 3.2-Vision 90B | (Image, text) pairs | 90B (88.8) | Text + Image | Text | 128k | Yes | 6B (image, text) pairs | December 2023 |

Key Features of Llama 3.2

Multimodal Capabilities

The Llama 3.2 models are Meta's first to support multimodal functionality, allowing them to handle both text and images. Specifically, the 11B and 90B Vision Instruct models are optimized for tasks that require image reasoning and understanding, making them suitable for applications like visual question answering and image captioning. This capability is facilitated through specially trained image reasoning adapter weights that work in conjunction with the language model, enhancing the AI's ability to interpret and respond to visual inputs.
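As a concrete illustration, here is a minimal sketch of asking the 11B Vision Instruct model a question about an image through the Hugging Face transformers library. It assumes a recent transformers release with Llama 3.2 vision support and that you have been granted access to the gated meta-llama/Llama-3.2-11B-Vision-Instruct checkpoint; the image URL is a placeholder.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # gated checkpoint; requires accepting Meta's license
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image URL; swap in any image you want the model to reason about.
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

# Chat-style prompt that interleaves an image with a text question.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What does this image show? Answer in one sentence."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```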

Optimized for Edge and Mobile Devices

In addition to their multimodal features, Llama 3.2 includes smaller models (1B and 3B) designed for deployment on edge devices and mobile platforms. This focus on lightweight models ensures that developers can create applications that prioritize user privacy and reduce reliance on cloud computing, which is especially beneficial for multilingual summarization and real-time processing.
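As a rough sketch of how small these models are to work with, the 1B Instruct variant can be run locally with the Hugging Face transformers text-generation pipeline. This assumes access to the gated meta-llama/Llama-3.2-1B-Instruct checkpoint; the prompt is only an example.

```python
import torch
from transformers import pipeline

# The 1B instruct checkpoint is small enough to run on a laptop GPU or even CPU.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",  # gated checkpoint; requires accepting Meta's license
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Summarize in one sentence: Llama 3.2 adds vision models plus 1B and 3B models for on-device use."},
]
result = generator(messages, max_new_tokens=64)

# For chat-style input, the pipeline returns the full conversation; the last turn is the model's reply.
print(result[0]["generated_text"][-1]["content"])
```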

Performance Enhancements

The new models utilize an optimized transformer architecture that incorporates supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF). These enhancements improve the models' alignment with human preferences, making them more helpful and safe for users. Furthermore, Llama 3.2 supports long context lengths of up to 128k tokens, allowing for complex interactions and richer content generation.

Integration with Major Platforms

Meta has partnered with several major cloud providers, including Microsoft Azure and Amazon Web Services (AWS), to ensure that Llama 3.2 is easily accessible for developers. The models are available through managed compute services on these platforms, enabling seamless integration into existing workflows. Developers can utilize tools like Azure AI Content Safety and prompt flow to enhance their applications while adhering to ethical AI practices.

Community Engagement and Open Source Commitment

Meta continues its commitment to openness by making the Llama 3.2 models available for download on platforms like Hugging Face and its own llama.com site. This approach encourages collaboration within the developer community, fostering innovation while ensuring responsible use of AI technologies. The company has also introduced new tools and resources aimed at helping developers navigate the complexities of building with AI responsibly.

Energy Use for Training the Llama 3.2 Models

Training was performed on custom hardware, requiring 2.02 million GPU hours. Meta reports zero market-based greenhouse gas emissions due to its renewable energy usage.

Benchmarks of Llama 3.2 - Image Reasoning

The model's benchmarks demonstrate strong performance in visual and mathematical reasoning tasks, with the larger 90B version offering higher accuracy across tasks compared to the 11B version.

  1. For Base Pretrained Llama 3.2 Models

| Category | Benchmark | # Shots | Metric | Llama 3.2 11B | Llama 3.2 90B |
| --- | --- | --- | --- | --- | --- |
| Image Understanding | VQAv2 (val) | 0 | Accuracy | 66.8 | 73.6 |
| Image Understanding | Text VQA (val) | 0 | Relaxed accuracy | 73.1 | 73.5 |
| Image Understanding | DocVQA (val, unseen) | 0 | ANLS | 62.3 | 70.7 |
| Visual Reasoning | MMMU (val, 0-shot) | 0 | Micro average accuracy | 41.7 | 49.3 |
| Visual Reasoning | ChartQA (test) | 0 | Accuracy | 39.4 | 54.2 |
| Visual Reasoning | InfographicsQA (val, unseen) | 0 | ANLS | 43.2 | 56.8 |
| Visual Reasoning | AI2 Diagram (test) | 0 | Accuracy | 62.4 | 75.3 |
  2. For Instruction Tuned Llama 3.2 Models

| Modality | Capability | Benchmark | # Shots | Metric | Llama 3.2 11B | Llama 3.2 90B |
| --- | --- | --- | --- | --- | --- | --- |
| Image | College-level Problems and Mathematical Reasoning | MMMU (val, CoT) | 0 | Micro average accuracy | 50.7 | 60.3 |
| Image | College-level Problems and Mathematical Reasoning | MMMU-Pro, Standard (10 opts, test) | 0 | Accuracy | 33.0 | 45.2 |
| Image | College-level Problems and Mathematical Reasoning | MMMU-Pro, Vision (test) | 0 | Accuracy | 23.7 | 33.8 |
| Image | College-level Problems and Mathematical Reasoning | MathVista (testmini) | 0 | Accuracy | 51.5 | 57.3 |
| Image | Charts and Diagram Understanding | ChartQA (test, CoT) | 0 | Relaxed accuracy | 83.4 | 85.5 |
| Image | Charts and Diagram Understanding | AI2 Diagram (test) | 0 | Accuracy | 91.1 | 92.3 |
| Image | Charts and Diagram Understanding | DocVQA (test) | 0 | ANLS | 88.4 | 90.1 |
| Image | General Visual Question Answering | VQAv2 (test) | 0 | Accuracy | 75.2 | 78.1 |
| Text | General | MMLU (CoT) | 0 | Macro_avg/acc | 73.0 | 86.0 |
| Text | Math | MATH (CoT) | 0 | Final_em | 51.9 | 68.0 |
| Text | Reasoning | GPQA | 0 | Accuracy | 32.8 | 46.7 |
| Text | Multilingual | MGSM (CoT) | 0 | em | 68.9 | 86.9 |

How to Use the Llama 3.2 Models

Developers can start using Llama 3.2 through all the major cloud providers and model hubs:

  1. Amazon AWS - Llama 3.2 models from Meta are now available in Amazon SageMaker JumpStart and Amazon Bedrock (see the invocation sketch after this list).
  2. Microsoft Azure - Meta’s new Llama 3.2 SLMs and image reasoning models now available on Azure AI Model Catalog
  3. Hugging Face - Llama 3.2 - a Meta Llama Collection
  4. GitHub - The Meta Llama 3.2 collection of multilingual large language models (LLMs) on GitHub
  5. Llama - Llama 3.2 models available for download on Meta's llama official website.
  6. Google Cloud - Meta's Llama 3.2 is now available on Google Cloud
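For example, a minimal sketch of calling the 11B Instruct model through Amazon Bedrock's Converse API with boto3 might look like the following. The model ID shown is an assumption; check the exact ID available in your Bedrock console and region.

```python
import boto3

# Bedrock runtime client; Llama 3.2 availability varies by region.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    # Assumed model ID for Llama 3.2 11B Instruct; verify it in the Bedrock console.
    modelId="us.meta.llama3-2-11b-instruct-v1:0",
    messages=[
        {"role": "user", "content": [{"text": "Give me three use cases for a multimodal LLM."}]}
    ],
    inferenceConfig={"maxTokens": 256, "temperature": 0.5},
)
print(response["output"]["message"]["content"][0]["text"])
```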

Individuals who simply want to explore the models can start using them for free on Creatosaurus.

Llama 3.2 Evaluations Against Other AI Models

Meta's evaluation indicates that the Llama 3.2 vision models perform competitively with leading models such as Claude 3 Haiku and GPT-4o-mini, particularly in image recognition and a range of visual understanding tasks. The Llama 3.2 3B model surpasses models like Gemma 2 (2.6B) and Phi 3.5-mini on instruction following, summarization, prompt rewriting, and tool use, while the 1B model holds its own against Gemma.

Meta assessed performance across more than 150 benchmark datasets covering multiple languages, with a specific focus on image understanding and visual reasoning benchmarks for the vision LLMs.

Figure: Llama 3.2 Vision instruction-tuned benchmarks
Figure: Llama 3.2 lightweight instruction-tuned benchmarks

Meta also allows developers to fine-tune Llama 3.2 for specific applications, provided they comply with the Llama 3.2 Community License.
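As an illustration, a parameter-efficient LoRA fine-tune of the 1B text model could look roughly like the sketch below, assuming the Hugging Face trl and peft libraries. The dataset, hyperparameters, and output path are placeholders, and exact argument names can vary between trl versions.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Placeholder chat-style dataset; substitute your own task data.
dataset = load_dataset("trl-lib/Capybara", split="train[:1000]")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.2-1B-Instruct",  # gated checkpoint; requires accepting Meta's license
    train_dataset=dataset,
    args=SFTConfig(output_dir="llama32-1b-lora", max_steps=200),
    # LoRA adapters keep the base weights frozen and train only small low-rank matrices.
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
)
trainer.train()
```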

Conclusion

The release of Meta's Llama 3.2 models represents a significant leap forward in the realm of AI, particularly in terms of multimodal capabilities and accessibility for edge computing. By combining advanced image reasoning with robust language processing, these models open up new possibilities for developers looking to create innovative applications across various industries. As Meta continues to support an open-source ethos, the future looks promising for those eager to harness the power of Llama 3.2 in their projects.

Article by Harshita Sharma
