Issue 26: Qwen 2.5 Rivals GPT-4 Performance, Salesforce Boosts RAG and Moshi Revolutionizes Speech Tech - September 23, 2024

Disclaimer: This content is generated by AI using my social media posts. Make sure to follow.

This week's AI Insights dives into the latest advancements in Large Language Models (LLMs), featuring the Qwen 2.5 rivaling GPT-4, Salesforce's boost to RAG with SFR-RAG, Google DeepMind's self-correcting SCoRe, and much more.

News

Qwen 2.5: A Multilingual LLM Powerhouse

Qwen 2.5, the next iteration of Qwen 2, introduces nine new models with significant performance improvements across various benchmarks (https://qwenlm.github.io/blog/qwen2.5-llm/). These multilingual models support over 29 languages, excel in instruction following and structured data understanding, and offer enhanced role-playing capabilities. Notably, Qwen 2.5 72B outperforms Llama 3.1 70B and matches the performance of the massive 405B Llama model, demonstrating the rapid advancements in open-source LLMs. OpenAI's GPT-4 has long been a leader in coding benchmarks, but Qwen 2.5 7B Coder, an open LLM under Apache 2.0 license, is challenging that dominance (https://qwenlm.github.io/blog/qwen2.5-llm/). It achieves comparable performance to GPT-4 0613 across various benchmarks, offering a cost-effective alternative to GPT-4's pricing. Running Qwen 2.5 7B Coder on your own machine makes it virtually free, opening up exciting possibilities for developers seeking a powerful and affordable coding solution.

Kyutai's Moshi: Real-Time Speech-to-Speech-Text

Kyutai has unveiled Moshi, a real-time speech-to-speech-text foundation model achieving impressive low latency (https://huggingface.co/collections/kyutai/moshi-v01-release-66eaeaf3302bef6bd9ad7acd). This 7B parameter model runs on-device, making it ideal for real-time applications. Moshi utilizes Mimi, a cutting-edge streaming audio codec, to achieve 160-200ms end-to-end latency. With open weights, code, and technical reports available under a permissive license, Moshi paves the way for innovative speech-based applications.

Hugging Face Hub Integrates with Google Cloud's Vertex AI

The Hugging Face Hub now boasts tighter integration with Google Cloud's Vertex AI Model Garden (https://console.cloud.google.com/vertex-ai/model-garden/featured-partners/hugging-face). This enhanced integration allows users to seamlessly browse and deploy thousands of open generative AI models from Hugging Face directly to Vertex AI or Google Kubernetes Engine (GKE). This streamlined deployment process empowers developers to leverage the vast Hugging Face ecosystem within Google Cloud's robust infrastructure.

Research

Salesforce Boosts RAG Performance with SFR-RAG

Reasoning remains a crucial frontier in AI, with Retrieval Augmented Generation (RAG) serving as a key stepping stone. Salesforce has introduced ContextualBench, a leaderboard and framework for evaluating RAG performance across benchmarks like HotpotQA (https://huggingface.co/spaces/Salesforce/ContextualBench-Leaderboard). Alongside this, they unveiled SFR-RAG 9B, a fine-tuned LLM for RAG that rivals Cohere Command-R+ and OpenAI GPT-4o in accuracy. This highlights the power of task-specific models, particularly for search and RAG applications, where smaller, fine-tuned models can outperform larger generic models.

Five Must-Read Papers on LLM Reasoning

Stay at the forefront of LLM advancements with these five recent research papers exploring how to enhance reasoning and long-term generation capabilities (https://huggingface.co/collections/philschmid/llm-reasoning-papers-66e6abbdf5579b829f214de8). These papers delve into topics such as self-correction through reinforcement learning, scaling test-time compute, and the impact of self-reflection on problem-solving, providing valuable insights for researchers and developers alike.

Google DeepMind's SCoRe: Self-Correction through RL

Google DeepMind has developed SCoRe, a multi-turn chain of thought online reinforcement learning approach for improving self-correction in LLMs (https://huggingface.co/papers/2409.12917). This innovative technique uses self-generated data to achieve state-of-the-art self-correction, boosting performance on benchmarks like MATH and HumanEval by 15.6% and 9.1%, respectively. SCoRe trains a single model to produce and correct responses without relying on external feedback, showcasing the potential of self-learning in LLMs.

OpenAI Improves Reasoning with Process Supervision

LLMs can sometimes arrive at the correct answer through faulty reasoning. Research from OpenAI demonstrates that focusing on the correctness of each reasoning step, rather than just the final answer, significantly improves performance (https://huggingface.co/papers/2305.20050). Their process-supervised reward models (PRMs) train LLMs to evaluate intermediate reasoning steps, resulting in a substantial boost in accuracy on complex tasks like the MATH dataset. This emphasizes the importance of refining the reasoning process itself to build more reliable and trustworthy LLMs.

V-STaR: Leveraging Incorrect Results for LLM Improvement

Can we leverage incorrect results to enhance LLMs? V-STaR introduces a novel approach that utilizes preference pairs generated during self-reflection to train a verifier (https://huggingface.co/papers/2402.06457). This verifier judges the correctness of model-generated solutions during inference, leading to performance improvements of 4% to 17% on code generation and math reasoning benchmarks. V-STaR highlights that learning from mistakes can be a valuable strategy for improving LLM accuracy and robustness.

RefAug: Augmenting Data for Reflective Thinking

OpenAI's models exhibit impressive "thinking" capabilities. RefAug introduces a method for augmenting existing training data to embed problem reflection into LLMs, particularly for math problems (https://huggingface.co/papers/2406.12050). This approach involves generating alternative reasonings and analogies related to the original problem, improving accuracy on math tasks by 6.8 points and code performance by +3.5 points. Scaling RefAug to larger datasets and other domains holds great promise for developing open models with capabilities rivaling those of OpenAI's advanced models.

General

Evaluating LLMs with Hugging Face TGI and vLLM

Evaluating LLMs on benchmarks like IFEval and GSM8K can be complex. A new guide simplifies this process by using Hugging Face's Text Generation Inference (TGI) or vLLM with the Evaluation Harness (https://www.philschmid.de/evaluate-llms-with-lm-eval-and-tgi-vllm). You'll learn how to evaluate Llama 3.1 8B Instruct using chain-of-thought reasoning, reproduce Meta's reported results, and leverage cloud instances for faster and more efficient evaluation.

Deploy Llama 3.1 on Serverless GPUs with Google Cloud Run

Google Cloud Run's new Serverless GPU preview unlocks new possibilities for deploying LLMs efficiently. A recent guide demonstrates how to deploy Meta Llama 3.1 8B, quantized to INT4 using AWQ, on a single NVIDIA L4 GPU using Cloud Run (https://github.com/huggingface/Google-Cloud-Containers/tree/main/examples/cloud-run/tgi-deployment). This setup leverages the Hugging Face Deep Learning Container for Google Cloud, providing automatic scaling and cost savings, making it an attractive option for deploying LLMs in a serverless environment.

Streamlining Self-Reflection in LLMs

Self-reflection is a powerful technique for improving LLM reasoning. A clean code snippet has been released to facilitate the use of self-reflection with the <thinking> tag in OpenAI-compatible endpoints (https://github.com/codelion/optillm/blob/main/cot_reflection.py). This snippet simplifies the implementation of self-reflection for models like Llama 3.1 70B and Gemma 2 27B, enabling developers to easily experiment with this technique and explore its benefits for enhancing LLM reasoning capabilities.

I hope you enjoyed this newsletter. 🤗 If you have any questions or are interested in collaborating, feel free to contact me on Twitter or LinkedIn.

See you next week 👋🏻👋🏻