Issue 13: WEBINSTRUCT and Qwen2 Revolutionize Instruction Data and Multilingual Models - June 9, 2024

Disclaimer: This content is generated by AI using my social media posts. Make sure to follow.

This week, we dive into WEBINSTRUCT's massive dataset, Qwen2's multilingual capabilities, and more cutting-edge AI research and news.

News

WEBINSTRUCT: A New Frontier in Instruction Data

WEBINSTRUCT is a new approach to extracting instruction data from web content. It yields a dataset of 10 million high-quality instruction pairs without human annotation or GPT-4. A custom-trained FastText model first recalls relevant documents, and pattern matching then extracts question-answer pairs from them. Open LLMs refine these pairs, adding intermediate steps and correcting formatting. Models trained on the data, such as MAmmoTH2-8x7B-Plus, have shown strong results, underscoring the importance of diverse, high-quality instruction data. Dive into the details in the paper.
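To make the recall, extract, refine flow a bit more concrete, here is a minimal Python sketch. It is not the authors' pipeline: the keyword scorer stands in for the trained FastText model, the "Q: ... A: ..." regex is an assumed page layout, and `refine` is a placeholder for the LLM refinement step. All names and the threshold are hypothetical.

```python
import re

# Hypothetical keyword list; a stand-in for the trained FastText recall model.
EDU_KEYWORDS = {"solve", "prove", "calculate", "explain", "question", "answer"}

# Assumed layout: "Q: ... A: ..." or "Question: ... Answer: ..." blocks.
QA_PATTERN = re.compile(
    r"\b(?:Q|Question)\s*:\s*(?P<q>.+?)\s*\b(?:A|Answer)\s*:\s*(?P<a>.+?)"
    r"(?=\s*\b(?:Q|Question)\s*:|\Z)",
    re.DOTALL,
)


def recall_score(document: str) -> float:
    """Fraction of keywords found in the document (stand-in for a classifier score)."""
    words = set(re.findall(r"[a-z]+", document.lower()))
    return len(words & EDU_KEYWORDS) / len(EDU_KEYWORDS)


def extract_pairs(document: str) -> list:
    """Pull (question, answer) tuples out of a document via pattern matching."""
    return [
        (m.group("q").strip(), m.group("a").strip())
        for m in QA_PATTERN.finditer(document)
    ]


def refine(pair: tuple) -> tuple:
    """Placeholder for LLM refinement; here it only normalizes whitespace."""
    question, answer = pair
    return (" ".join(question.split()), " ".join(answer.split()))


def build_dataset(documents: list, threshold: float = 0.3) -> list:
    """Recall relevant documents, extract Q-A pairs, then refine each pair."""
    dataset = []
    for doc in documents:
        if recall_score(doc) < threshold:
            continue  # document judged irrelevant by the recall step
        dataset.extend(refine(p) for p in extract_pairs(doc))
    return dataset


docs = [
    "Question: What is 2 + 2?  Answer: 4. Q: Solve x + 1 = 3. A: x = 2.",
    "Buy cheap shoes today.",
]
pairs = build_dataset(docs)
```

In a real system the recall step would be a learned classifier and the refine step a call to an open LLM; the sketch only shows how the three stages chain together.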

Qwen2: Multilingual LLM Excellence

Qwen2, a new multilingual model family, outperforms Llama 3, supports 29 languages, and delivers state-of-the-art performance across various benchmarks. The models range from 0.5B to 72B parameters, with the larger models achieving strong scores on MMLU and HumanEval. Check out the demo.

Simplifying Embedding Creation with Hugging Face on AWS

The Hugging Face Embedding Container for Amazon SageMaker is now available, making it easier to create embeddings for RAG applications. It supports architectures like BERT and RoBERTa and provides fast inference on both CPU and NVIDIA GPU instances. Deploying embedding models from providers like Snowflake (Arctic) and Jina AI is now straightforward. Explore the notebook example.

NVIDIA NIM: Accelerating AI Deployment

NVIDIA NIM, launched at COMPUTEX, offers streamlined inference services for deploying generative AI models. It supports 1-click deployment for models like Llama 3 on AWS and GCP, with high throughput and low latency.

Phi-3 Models on the LMSYS Leaderboard

Phi-3 Medium and Small models have joined the LMSYS leaderboard, demonstrating competitive performance against larger models. This highlights the need for diverse evaluation methods beyond academic benchmarks.

Skywork's New MoE Model

Skywork's latest 146B MoE model was upcycled from a 13B dense model, showing that upcycling can be a viable route to high performance. The model uses two techniques of note: Gating Logit Normalization and adaptive auxiliary loss coefficients.
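For readers unfamiliar with the term, one plausible reading of Gating Logit Normalization is sketched below: the router's logits are standardized (zero mean, unit variance) and multiplied by a scale factor before the softmax, so the scale controls how sharply tokens are routed to experts. The exact formulation, the `scale` parameter name, and the `eps` value are assumptions for illustration, not details taken from this newsletter.

```python
import math


def normalized_gate(logits, scale=1.0, eps=1e-6):
    """Softmax over standardized gate logits (assumed form of the technique).

    `scale` is a hypothetical hyperparameter: larger values give a sharper
    distribution over experts.
    """
    n = len(logits)
    mean = sum(logits) / n
    var = sum((z - mean) ** 2 for z in logits) / n
    std = math.sqrt(var + eps)

    # Standardize, then rescale.
    scaled = [scale * (z - mean) / std for z in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


raw_logits = [0.1, 0.2, 0.3, 0.4]       # toy logits for four experts
soft = normalized_gate(raw_logits, scale=1.0)
sharp = normalized_gate(raw_logits, scale=4.0)
```

Because the logits are standardized first, the gate output no longer depends on the raw magnitude of the router's logits, only on their relative ordering and the chosen scale.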

Research

DITTO: Fast-Tracking LLM Training with Minimal Samples

Stanford's DITTO shows that LLMs can be tuned with fewer than 10 samples. The method collects a small number of expert demonstrations and generates new negative samples to compare against them, significantly improving performance. DITTO has shown a 22.34% improvement and outperforms few-shot prompting methods. Discover the method on GitHub.
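The comparison step can be pictured as building preference pairs in which the expert demonstration is always preferred over a model-generated sample. The sketch below assumes that reading; `PreferencePair`, `build_pairs`, and the toy `fake_sampler` are hypothetical names, and in practice the sampler would be the model being tuned, with the pairs fed to a preference-optimization trainer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # expert demonstration
    rejected: str  # model-generated negative sample


def build_pairs(
    demos: Dict[str, str],
    sample_fn: Callable[[str, int], List[str]],
    n_samples: int = 2,
) -> List[PreferencePair]:
    """Pair each expert demonstration with freshly sampled model outputs."""
    pairs = []
    for prompt, demo in demos.items():
        for candidate in sample_fn(prompt, n_samples):
            if candidate != demo:  # skip samples identical to the demo
                pairs.append(PreferencePair(prompt, demo, candidate))
    return pairs


def fake_sampler(prompt: str, n: int) -> List[str]:
    """Deterministic stand-in for sampling from the current model."""
    return [f"draft {i} for: {prompt}" for i in range(n)]


demos = {
    "Write a short email declining a meeting.": "Hi Sam, I can't make it Tuesday. Could we try Thursday?",
    "Summarize the report in one line.": "Revenue rose while costs stayed flat.",
}
pairs = build_pairs(demos, fake_sampler, n_samples=2)
```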

MixEval: Bridging the Gap Between Academic and Real-World AI

MixEval and its challenging subset MixEval-Hard combine existing benchmarks with real-world queries, achieving a 96% correlation with human preferences at minimal cost. The benchmark helps differentiate strong models and is available on Hugging Face. Explore the leaderboard.
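A correlation figure like this is typically computed by comparing each model's benchmark score with its human-preference rating. The sketch below uses Spearman rank correlation as one reasonable choice; the newsletter does not say which correlation measure was used, and the scores and ratings are made-up illustrative values.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model numbers, for illustration only.
benchmark_scores = [72.1, 65.4, 80.3, 58.9, 77.0]
human_ratings = [1180, 1120, 1250, 1090, 1260]

rho = spearman(benchmark_scores, human_ratings)
```

A value of `rho` close to 1 means the benchmark orders models almost exactly as human raters do.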

General

Fine-Tuning Embedding Models for Financial Applications

Fine-tuning embedding models for domain-specific tasks like financial RAG applications can significantly boost performance. Using NVIDIA's SEC Filing dataset, the new approach shows performance gains between 7.4% and 22.55%. The process is efficient, with training times as low as five minutes on consumer-grade GPUs. Read more in the blog.

I hope you enjoyed this newsletter. 🤗 If you have any questions or are interested in collaborating, feel free to contact me on Twitter or LinkedIn.

See you next week 👋🏻👋🏻