Meta's New Llama 3.2 Multimodal Open-Source AI Model
Discover Meta's groundbreaking Llama 3.2 models, featuring advanced multimodal capabilities for text and image processing.
Meta has recently unveiled its latest innovation in artificial intelligence: the Llama 3.2 models, which introduce significant advancements in multimodal capabilities. This release marks a pivotal moment for developers and businesses seeking to leverage AI for various applications, particularly those involving both text and image processing.
What Is the Llama 3.2 Multimodal Vision AI Open-Source Model?
The Llama 3.2-Vision collection is an advanced multimodal large language model (LLM) family designed by Meta. It integrates text and image processing and is optimized for tasks such as image reasoning, visual recognition, captioning, and answering questions about images. The broader Llama 3.2 collection comes in four sizes: 1B, 3B, 11B, and 90B. The lightweight 1B and 3B models are designed for efficiency and can be deployed on mobile and edge devices, while the 11B and 90B multimodal models are more versatile and can perform complex reasoning over high-resolution images. All variants are openly available, so you can fine-tune, distill, and deploy them wherever you like.
Model | Training Data | Params | Input modalities | Output modalities | Context length | GQA | Data volume | Knowledge cutoff |
---|---|---|---|---|---|---|---|---|
Llama 3.2-Vision 11B | (Image, text) pairs | 11B (10.6) | Text + Image | Text | 128k | Yes | 6B (image, text) pairs | December 2023 |
Llama 3.2-Vision 90B | (Image, text) pairs | 90B (88.8) | Text + Image | Text | 128k | Yes | 6B (image, text) pairs | December 2023 |
Key Features of Llama 3.2
Multimodal Capabilities
The Llama 3.2 models are Meta's first to support multimodal functionality, allowing them to handle both text and images. Specifically, the 11B and 90B Vision Instruct models are optimized for tasks that require image reasoning and understanding, making them suitable for applications like visual question answering and image captioning. This capability is enabled by specially trained image-reasoning adapter weights that work in conjunction with the language model, improving the AI's ability to interpret and respond to visual inputs.
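As a concrete illustration, here is a minimal sketch of sending an image and a question to the 11B Vision Instruct model. It assumes the Hugging Face transformers Mllama integration (roughly transformers 4.45 or later), gated access to the meta-llama checkpoints, and a placeholder image URL:

```python
# Minimal sketch: image + text question to Llama 3.2 11B Vision Instruct via
# Hugging Face transformers. The image URL is a placeholder; checkpoint access is gated.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)

# The chat template interleaves an image placeholder with the text prompt.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same chat-template pattern extends to captioning and document question answering simply by changing the text prompt.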
Optimized for Edge and Mobile Devices
In addition to their multimodal features, Llama 3.2 includes smaller models (1B and 3B) designed for deployment on edge devices and mobile platforms. This focus on lightweight models ensures that developers can create applications that prioritize user privacy and reduce reliance on cloud computing, which is especially beneficial for multilingual summarization and real-time processing.
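For quick experimentation before targeting an actual device, a lightweight model can be exercised with the standard transformers text-generation pipeline. This is a minimal sketch using the 3B Instruct checkpoint; real mobile deployments would typically go through a dedicated on-device runtime rather than this workstation setup:

```python
# Minimal sketch: summarization with the lightweight 3B Instruct model through the
# Hugging Face transformers pipeline. Assumes gated checkpoint access and enough
# local memory for bfloat16 weights.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Summarize in two sentences: Llama 3.2 adds 11B and 90B "
                                "vision models plus lightweight 1B/3B models for edge devices."},
]
result = generator(messages, max_new_tokens=96)
print(result[0]["generated_text"][-1]["content"])  # last message is the assistant reply
```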
Performance Enhancements
The new models utilize an optimized transformer architecture that incorporates supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). These enhancements improve the models' alignment with human preferences, making them more helpful and safer for users. Furthermore, Llama 3.2 supports context lengths of up to 128k tokens, allowing for complex interactions and richer content generation.
Integration with Major Platforms
Meta has partnered with several major cloud providers, including Microsoft Azure and Amazon Web Services (AWS), to ensure that Llama 3.2 is easily accessible for developers. The models are available through managed compute services on these platforms, enabling seamless integration into existing workflows. Developers can utilize tools like Azure AI Content Safety and prompt flow to enhance their applications while adhering to ethical AI practices.
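For example, on AWS the models can be called through Amazon Bedrock's Converse API. The following is a minimal sketch using boto3, where the model ID shown is illustrative and may differ depending on region and account access:

```python
# Minimal sketch: calling Llama 3.2 through Amazon Bedrock's Converse API with boto3.
# The model ID below is an assumption for illustration; check what is enabled in your account.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-west-2")

response = client.converse(
    modelId="us.meta.llama3-2-11b-instruct-v1:0",  # illustrative model ID
    messages=[
        {"role": "user", "content": [{"text": "Give three use cases for an on-device 3B language model."}]}
    ],
    inferenceConfig={"maxTokens": 256, "temperature": 0.5},
)

print(response["output"]["message"]["content"][0]["text"])
```

The Azure AI Model Catalog exposes the same models through its own SDKs and managed endpoints.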
Community Engagement and Open Source Commitment
Meta continues its commitment to openness by making the Llama 3.2 models available for download on platforms like Hugging Face and its own llama.com site. This approach encourages collaboration within the developer community, fostering innovation while ensuring responsible use of AI technologies. The company has also introduced new tools and resources aimed at helping developers navigate the complexities of building with AI responsibly.
Energy Use for Training Llama 3.2 Model
Training was performed on custom hardware, requiring 2.02 million GPU hours. Meta reports zero market-based greenhouse gas emissions due to its renewable energy usage.
Benchmarks of Llama 3.2 - Image Reasoning
The model's benchmarks demonstrate strong performance in visual and mathematical reasoning tasks, with the larger 90B version offering higher accuracy across tasks compared to the 11B version.
- For Base Pretrained Llama 3.2 Models
Category | Benchmark | # Shots | Metric | Llama 3.2 11B | Llama 3.2 90B |
---|---|---|---|---|---|
Image Understanding | VQAv2 (val) | 0 | Accuracy | 66.8 | 73.6 |
Image Understanding | Text VQA (val) | 0 | Relaxed accuracy | 73.1 | 73.5 |
Image Understanding | DocVQA (val, unseen) | 0 | ANLS | 62.3 | 70.7 |
Visual Reasoning | MMMU (val, 0-shot) | 0 | Micro average accuracy | 41.7 | 49.3 |
Visual Reasoning | ChartQA (test) | 0 | Accuracy | 39.4 | 54.2 |
Visual Reasoning | InfographicsQA (val, unseen) | 0 | ANLS | 43.2 | 56.8 |
Visual Reasoning | AI2 Diagram (test) | 0 | Accuracy | 62.4 | 75.3 |
- For Instruction Tuned Llama 3.2 Models
Modality | Capability | Benchmark | # Shots | Metric | Llama 3.2 11B | Llama 3.2 90B |
---|---|---|---|---|---|---|
Image | College-level Problems and Mathematical Reasoning | MMMU (val, CoT) | 0 | Micro average accuracy | 50.7 | 60.3 |
Image | College-level Problems and Mathematical Reasoning | MMMU-Pro, Standard (10 opts, test) | 0 | Accuracy | 33.0 | 45.2 |
Image | College-level Problems and Mathematical Reasoning | MMMU-Pro, Vision (test) | 0 | Accuracy | 23.7 | 33.8 |
Image | College-level Problems and Mathematical Reasoning | MathVista (testmini) | 0 | Accuracy | 51.5 | 57.3 |
Image | Charts and Diagram Understanding | ChartQA (test, CoT) | 0 | Relaxed accuracy | 83.4 | 85.5 |
Image | Charts and Diagram Understanding | AI2 Diagram (test) | 0 | Accuracy | 91.1 | 92.3 |
Image | Charts and Diagram Understanding | DocVQA (test) | 0 | ANLS | 88.4 | 90.1 |
Image | General Visual Question Answering | VQAv2 (test) | 0 | Accuracy | 75.2 | 78.1 |
Text | General | MMLU (CoT) | 0 | Macro_avg/acc | 73.0 | 86.0 |
Text | Math | MATH (CoT) | 0 | Final_em | 51.9 | 68.0 |
Text | Reasoning | GPQA | 0 | Accuracy | 32.8 | 46.7 |
Text | Multilingual | MGSM (CoT) | 0 | em | 68.9 | 86.9 |
How to Use the Llama 3.2 Models
Developers can get started right away on all of the major cloud and model-hosting platforms (a minimal download sketch follows this list):
- Amazon AWS - Llama 3.2 models from Meta are now available in Amazon SageMaker JumpStart and Amazon Bedrock.
- Microsoft Azure - Meta's new Llama 3.2 SLMs and image reasoning models are now available in the Azure AI Model Catalog.
- Hugging Face - Llama 3.2 - a Meta Llama Collection.
- GitHub - The Meta Llama 3.2 collection of multilingual large language models (LLMs) is available on GitHub.
- Llama - Llama 3.2 models are available for download on Meta's official llama.com website.
- Google Cloud - Meta's Llama 3.2 is now available on Google Cloud.
Individuals and explorers can start using it for free on Creatosaurus.
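If you prefer to work with the weights locally, here is a minimal sketch of downloading a checkpoint from Hugging Face. It assumes you have accepted the Llama 3.2 Community License on the model page and authenticated with an access token; the target directory is just an example:

```python
# Minimal sketch: pull Llama 3.2 1B Instruct weights from Hugging Face for local use.
# Assumes license acceptance on the model page and prior authentication
# (e.g. `huggingface-cli login`); the local directory is an example path.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="meta-llama/Llama-3.2-1B-Instruct",
    local_dir="./llama-3.2-1b-instruct",
)
print("Model files downloaded to:", path)
```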
Llama 3.2 Model Evaluations Against Other AI Models
Meta's evaluation indicates that the Llama 3.2 vision models perform competitively with leading models like Claude 3 Haiku and GPT-4o-mini, particularly in image recognition and various visual understanding tasks. The Llama 3.2 3B model surpasses models like Gemma 2 (2.6B) and Phi 3.5-mini in tasks such as instruction following, summarization, prompt rewriting, and tool use, while the 1B model holds its own against Gemma.
Meta assessed performance across over 150 benchmark datasets covering multiple languages, with a specific focus on image understanding and visual reasoning benchmarks for the vision LLMs.
Meta also allows developers to fine-tune Llama 3.2 for specific applications, provided they comply with the Llama 3.2 Community License; a parameter-efficient fine-tuning sketch is shown below.
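As one way to approach such customization, here is a minimal sketch of LoRA fine-tuning the 1B Instruct model with the peft, datasets, and transformers libraries. The dataset, target modules, and hyperparameters are illustrative assumptions rather than a recommended recipe:

```python
# Minimal sketch: LoRA fine-tuning of Llama 3.2 1B Instruct. Dataset and
# hyperparameters are illustrative; use must comply with the Llama 3.2 Community License.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Attach low-rank adapters to the attention projections instead of updating all weights.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
))

# Example instruction dataset with "instruction" and "output" columns (swap in your own).
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")

def tokenize(example):
    text = example["instruction"] + "\n" + example["output"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama32-1b-lora", per_device_train_batch_size=2,
                           num_train_epochs=1, learning_rate=2e-4, bf16=True, logging_steps=50),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("llama32-1b-lora-adapter")  # save only the small adapter weights
```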
Conclusion
The release of Meta's Llama 3.2 models represents a significant leap forward in the realm of AI, particularly in terms of multimodal capabilities and accessibility for edge computing. By combining advanced image reasoning with robust language processing, these models open up new possibilities for developers looking to create innovative applications across various industries. As Meta continues to support an open-source ethos, the future looks promising for those eager to harness the power of Llama 3.2 in their projects.