Issue 19: Apple Joins the AI Race, Mistral Releases 12B LLM, and Synthetic Data Surpasses Teachers - July 21, 2024
This week's AI landscape is buzzing with groundbreaking releases from tech giants, innovative research in synthetic data, and practical insights for LLM applications.
News
Apple Unveils Open-Source 7B LLM
Apple has entered the AI arena with a bang, releasing a 7B open-source LLM complete with weights, training code, and dataset. Trained on 2.5T tokens from open datasets, this English-focused model has a 2,048-token context window and outperforms Mistral 7B on the MMLU benchmark. The model's open license and impressive performance make it a compelling option for researchers and developers alike.
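If you want to kick the tires, here is a minimal sketch of loading the released weights with transformers. The repo id apple/DCLM-7B is an assumption based on the announcement, and depending on how the architecture is packaged you may need trust_remote_code.

```python
# Minimal sketch: loading Apple's released weights with transformers.
# The repo id "apple/DCLM-7B" is an assumption; check the actual release page.
# Depending on how the architecture is packaged, trust_remote_code=True may be required.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "apple/DCLM-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Open training data matters because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```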
Mistral AI Launches 12B Open LLM Powerhouse
Mistral is turning heads with their latest release: a 12B open LLM that's multilingual and packs a 128k context window. Dubbed Mistral Nemo, this model comes in base and instruct flavors, supporting nine languages and featuring quantization-aware training for FP8 inference. This powerhouse, trained on 3,072 H100 80GB GPUs, sets a new standard for open-source language models.
Five New AI Models Released in One Day
The AI world saw a flurry of activity with five new model releases in a single day. DeepSeek and Mistral updated their LLMs, while Deepset, Mixedbread, and Snowflake introduced new embedding models. From DeepSeek's V2-Chat-0628 to Snowflake's arctic-embed-m-v1.5, these releases offer improved performance and new capabilities across various domains.
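As a quick illustration of the embedding side of this batch, here is a hedged sketch using sentence-transformers. The repo id Snowflake/snowflake-arctic-embed-m-v1.5 is inferred from the release name, so verify it on the Hub before relying on it.

```python
# Minimal sketch: computing embeddings with one of the newly released embedding models.
# The repo id "Snowflake/snowflake-arctic-embed-m-v1.5" is inferred from the release name; verify on the Hub.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v1.5")
docs = [
    "Mistral Nemo ships with a 128k context window.",
    "Docmatix targets semantic OCR for PDFs and documents.",
]
embeddings = model.encode(docs, normalize_embeddings=True)
print(embeddings.shape)  # (2, embedding_dim)
```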
Hugging Face Introduces Docmatix for Semantic OCR
The Hugging Face science team has unveiled Docmatix, a massive dataset aimed at revolutionizing semantic OCR for PDFs and documents. With 9,500,000 Q/A pairs across 2,444,750 images, this dataset promises to enhance document processing using Vision Language Models (VLMs).
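To peek at the data without downloading all of it, a streaming sketch with the datasets library is below. The dataset id HuggingFaceM4/Docmatix is an assumption, so confirm it on the Hub.

```python
# Minimal sketch: streaming a couple of Docmatix examples to inspect the Q/A structure.
# The dataset id "HuggingFaceM4/Docmatix" is an assumption; confirm it on the Hub.
from datasets import load_dataset

ds = load_dataset("HuggingFaceM4/Docmatix", split="train", streaming=True)
for example in ds.take(2):
    # Expect document image(s) plus question/answer pairs per record.
    print(example.keys())
```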
SmolLM: Bringing AI to Your Pocket
On-device AI takes a leap forward with the SmolLM series of small language models. Ranging from 135M to 1.7B parameters, these models are designed to run locally on devices like smartphones. The largest model outperforms Phi-1.5 and Qwen2 1.5B across benchmarks, showcasing the potential for powerful AI in compact packages.
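For a quick local test, a hedged sketch with the transformers pipeline follows. The repo id HuggingFaceTB/SmolLM-360M-Instruct is an assumption; pick whichever size fits your device.

```python
# Minimal sketch: running a small language model locally via the transformers pipeline.
# The repo id "HuggingFaceTB/SmolLM-360M-Instruct" is an assumption; pick the size that fits your device.
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM-360M-Instruct")
print(generator("Explain on-device AI in one sentence:", max_new_tokens=40)[0]["generated_text"])
```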
Mistral's Double Release: Mamba Model and Math Specialist
Mistral AI continues to innovate with two new releases. Codestral Mamba 7B, their first Mamba-based code LLM, achieves 75% on HumanEval for Python coding. Additionally, they've introduced Mathstral 7B, a math-focused fine-tune of Mistral 7B with strong accuracy on the MATH and MMLU benchmarks.
Research
Synthetic Data Surpasses Human Teachers in AI-MO Challenge
The AI-MO team has shattered expectations with their latest synthetic dataset and fine-tuned Qwen2 model. Using Tool Integrated Reasoning (TIR), where the model writes and executes code mid-solution, it matches or surpasses OpenAI's GPT-4o and Anthropic's Claude 3.5 on math competition problems. This result challenges the notion that fine-tuned models can't outperform their teachers.
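To make the TIR idea concrete, here is a rough sketch of such a loop. The generate callable is a hypothetical stand-in for whatever inference backend you use, and the prompt format is illustrative rather than the AI-MO team's actual recipe.

```python
# Rough sketch of a Tool Integrated Reasoning (TIR) loop: the model interleaves reasoning
# with ```python``` blocks, each block is executed, and its output is fed back until the
# model answers without code. `generate` is a hypothetical inference callable, and the
# prompt format is illustrative, not the AI-MO team's actual recipe.
import re
import subprocess
import sys

def run_python(code: str) -> str:
    """Execute a generated code block in a subprocess and return stdout (or the error text)."""
    result = subprocess.run([sys.executable, "-c", code], capture_output=True, text=True, timeout=30)
    return result.stdout if result.returncode == 0 else result.stderr

def tir_solve(problem: str, generate, max_turns: int = 4) -> str:
    prompt = f"Problem: {problem}\nSolve step by step, writing Python code when it helps.\n"
    for _ in range(max_turns):
        completion = generate(prompt)  # model emits reasoning, possibly with a code block
        prompt += completion
        match = re.search(r"```python\n(.*?)```", completion, re.DOTALL)
        if match is None:  # no more code: treat the completion as the final answer
            return completion
        prompt += f"\nOutput:\n{run_python(match.group(1))}\n"
    return prompt  # fall back to the full trace if the model never stops emitting code
```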
Auto Evol-Instruct: Automating Synthetic Data Evolution
Microsoft and WizardLM have introduced Auto Evol-Instruct, a method for automatically evolving synthetic data without human expertise. An Evol LLM rewrites instructions into more complex variants, while an Optimizer LLM critiques the evolution trajectories and refines the evolving prompt, improving data quality, diversity, and complexity. The research paper reports notable gains on MT-Bench and AlpacaEval.
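A rough sketch of that loop, assuming two text-in/text-out callables evol_llm and optimizer_llm (hypothetical stand-ins, not the authors' actual implementation), could look like this:

```python
# Rough sketch of the Auto Evol-Instruct loop described above: an Evol LLM rewrites seed
# instructions into harder variants, and an Optimizer LLM critiques the trajectories and
# refines the evolving prompt for the next round. `evol_llm` and `optimizer_llm` are
# hypothetical text-in/text-out callables, not the authors' actual implementation.
def auto_evol_instruct(seed_instructions, evol_llm, optimizer_llm, rounds=3):
    evolving_prompt = "Rewrite the instruction to be more complex while keeping it answerable."
    for _ in range(rounds):
        trajectories = [(inst, evol_llm(f"{evolving_prompt}\n\nInstruction: {inst}"))
                        for inst in seed_instructions]
        # The optimizer inspects failure modes (e.g. trivial or unanswerable rewrites)
        # and proposes an improved evolving prompt for the next round.
        trace = "\n".join(f"- {before} -> {after}" for before, after in trajectories)
        evolving_prompt = optimizer_llm(
            "Here are instruction evolution trajectories:\n"
            f"{trace}\n"
            "Identify failure modes and return an improved evolving prompt."
        )
        seed_instructions = [after for _, after in trajectories]
    return seed_instructions
```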
General
Building a Continuous Data Flywheel for LLM Applications
Shreya Shankar outlines a strategy for maintaining peak performance in LLM applications through a continuous data flywheel. Her blog post details a comprehensive approach combining rigorous evaluation, automated monitoring, and continual improvement. From using LLMs as judges to regularly sampling production data, this methodology helps applications stay sharp and relevant.
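As a concrete sketch of the monitoring piece, the snippet below samples production traces and grades them with an LLM judge. Here fetch_recent_traces and judge_llm are hypothetical stand-ins for your logging store and judge model, and the pass/fail rubric is illustrative rather than Shankar's exact criteria.

```python
# Minimal sketch of one turn of the flywheel: sample production traffic, grade it with an
# LLM judge, and collect failures for the next round of evals or fine-tuning data.
# `fetch_recent_traces` and `judge_llm` are hypothetical stand-ins; the rubric is illustrative.
import random

def collect_failures(fetch_recent_traces, judge_llm, sample_size=50):
    traces = random.sample(fetch_recent_traces(), sample_size)  # slice of recent production traffic
    failures = []
    for trace in traces:
        verdict = judge_llm(
            "Rate this response PASS or FAIL for faithfulness and relevance.\n"
            f"User input: {trace['input']}\nModel response: {trace['output']}"
        )
        if "FAIL" in verdict.upper():
            failures.append(trace)
    return failures  # feed these back into eval sets, prompt tweaks, or fine-tuning data
```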
I hope you enjoyed this newsletter. 🤗 If you have any questions or are interested in collaborating, feel free to contact me on Twitter or LinkedIn.
See you next week 👋🏻👋🏻