Issue 23: Cohere's Command-R Upgrade and the Rise of Structured Prompting - September 2, 2024

Disclaimer: This content is generated by AI from my social media posts. Make sure to follow me there.

This week's highlights include Cohere's significant model updates, insights into structured prompting's impact on LLM performance, and new multimodal capabilities from Qwen2.

News

Cohere Supercharges Command-R Models

Cohere has rolled out major enhancements to both Command-R and Command-R+, featuring 128k-token context support, coverage of 23 languages, and improved structured data analysis. These updates boost the models' capabilities in reasoning, summarization, and question answering, with commercial licensing available by contacting Cohere directly.
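
For readers who want to try the refreshed models, here is a minimal sketch using Cohere's Python SDK; the model ID follows Cohere's dated-release naming (command-r-plus-08-2024), so double-check the current identifier in their documentation.

```python
import cohere

# Model ID assumes Cohere's dated-release naming for the August 2024 refresh;
# verify the exact identifier in the Cohere docs.
co = cohere.Client("YOUR_API_KEY")

response = co.chat(
    model="command-r-plus-08-2024",
    message="Summarize the main risks in this quarterly report in three bullet points: ...",
)
print(response.text)
```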

Qwen2 Launches Multimodal Models

Qwen2-VL enters the multimodal arena with two new models: a 2B version for on-device use and a 7B version under Apache 2.0 license. The larger model competes with GPT-4o mini across various benchmarks, offering impressive video understanding, improved OCR, and multilingual support, as detailed in their technical blog post.
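
Below is a minimal sketch of running the 7B model through the Transformers integration, assuming a recent transformers release with Qwen2-VL support and the qwen-vl-utils helper package; the image URL is a placeholder, and the official model card has the canonical snippet.

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # helper package shipped alongside Qwen2-VL

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Placeholder image URL; swap in your own image or video frames.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/chart.png"},
        {"type": "text", "text": "What does this chart show?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```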

Nous Research Unveils Hermes Function-Calling Dataset

Nous Research has released the Hermes function-calling dataset under the Apache 2.0 license, providing approximately 12,000 samples for training LLMs on agentic tool use and structured outputs. The resource covers several formats, including function calls and JSON-mode outputs, and is accompanied by a GitHub repository with usage guides and example scripts.
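
A quick way to inspect the data is via the datasets library; the repository ID matches the Nous Research release, but the configuration and field names below are assumptions, so consult the dataset card if loading fails.

```python
from datasets import load_dataset

# Repo ID per the Nous Research release; the config name and the
# "conversations" field layout are assumptions -- check the dataset card.
ds = load_dataset("NousResearch/hermes-function-calling-v1", "func_calling", split="train")

sample = ds[0]
for turn in sample["conversations"]:
    print(turn["from"], ":", turn["value"][:200])
```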

Research

Structured Prompting's Impact on LLM Performance

A new study, "Let Me Speak Freely?", investigates how format-restricted (structured) prompting affects LLM performance and reasoning across various tasks and models. The research finds that while some models, such as Gemini 1.5 Flash, perform consistently regardless of output format, others, such as Claude 3 Haiku, vary significantly depending on the format used.
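
To make the comparison concrete, here is a small sketch contrasting a free-form prompt with a JSON-constrained one, in the spirit of the paper's setup; the OpenAI client and model name are illustrative assumptions, not the study's exact configuration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; model choice is illustrative
question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than "
    "the ball. How much does the ball cost?"
)

# Free-form: the model may reason in natural language before answering.
free_form = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": question + " Think step by step, then state the answer."}],
)

# Format-restricted: the model must answer directly in JSON, the kind of
# constraint the paper finds can hurt reasoning-heavy tasks.
constrained = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[{"role": "user", "content": question + ' Respond only with JSON: {"answer": <dollar amount>}.'}],
)

print(free_form.choices[0].message.content)
print(constrained.choices[0].message.content)
```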

Google DeepMind Introduces GenRM for Reward Modeling

Google DeepMind's latest paper presents GenRM, a method using fine-tuned task-specific LLMs as Reward Models. This approach, similar to OpenAI's CriticGPT, demonstrates that smaller, fine-tuned models can outperform larger LLMs in judging tasks, with performance improving as more training data is added.
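
The core idea is that the verifier is trained to answer a yes/no question about a candidate solution, and the probability of the "Yes" token serves as the reward. Here is a minimal sketch under that reading, using a hypothetical fine-tuned checkpoint name.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "my-org/genrm-style-verifier" is a hypothetical checkpoint; DeepMind has not
# released GenRM weights. Any causal LM fine-tuned on (problem, answer, Yes/No)
# examples could fill this role.
tokenizer = AutoTokenizer.from_pretrained("my-org/genrm-style-verifier")
model = AutoModelForCausalLM.from_pretrained("my-org/genrm-style-verifier")

prompt = "Problem: 12 * 7 = ?\nProposed answer: 84\nIs the answer correct? Answer Yes or No:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

probs = next_token_logits.softmax(dim=-1)
yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
reward = probs[yes_id].item()  # probability of "Yes" used as the reward score
print(f"reward = {reward:.3f}")
```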

Enhancing LLM Evaluator Reliability with PAIRS

"The Role of Pairwise Preference in Large Language Model Evaluators" introduces Pairwise-preference Search (PAIRS), a method improving evaluation robustness and efficiency. PAIRS outperforms existing methods like G-Eval and ELO rating on Spearman correlations, with implementation code available on GitHub.

General

AI Progress Beyond Model Size

Recent analysis shows that GPT-4-level models have become 240 times cheaper in just two years, highlighting AI progress beyond mere increases in model size. This pattern, in which models first grow larger and are then shrunk while retaining most of their capability, suggests today's quality-to-cost ratio may be the best it has ever been, potentially leading to more widespread AI adoption and development.

Comprehensive Overview of LLM-as-Judge Approaches

A thorough analysis of LLM-as-Judge approaches summarizes findings from numerous papers, offering insights into effective evaluation techniques. Key takeaways include the benefits of direct scoring for objective evaluations, pairwise comparisons for subjective tasks, and the use of tools like EvalLM for refining prompts and constraints.
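
As a companion to the pairwise sketch above, here is what a direct-scoring judge can look like in practice; the rubric, model, and scale are illustrative assumptions rather than a recommendation from the overview.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; model and rubric are illustrative

def direct_score(instruction: str, response: str) -> int:
    """Direct-scoring judge: rate a single response on a 1-5 scale."""
    prompt = (
        "Rate the response to the instruction on a scale of 1 (poor) to 5 (excellent) "
        "for factual accuracy and completeness. Reply with the number only.\n\n"
        f"Instruction: {instruction}\n\nResponse: {response}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=2,
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
```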

Hugging Face Releases Google Cloud Deep Learning Containers

Hugging Face has launched Deep Learning Containers (DLCs) for Google Cloud, optimized for training and deploying AI models on Google Cloud Vertex AI and Google Kubernetes Engine. These containers offer pre-built environments for common AI tasks, with dedicated images for CPU and GPU, and include numerous examples for model training and deployment.
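
Here is a rough sketch of deploying a model to Vertex AI with one of these containers via the google-cloud-aiplatform SDK; the project, machine configuration, and especially the container image URI are placeholders you would replace with values from Google Cloud's DLC documentation.

```python
from google.cloud import aiplatform

# Project, region, model, machine shape, and the container image URI are all
# placeholders -- look up the exact Hugging Face DLC image path in the
# Google Cloud documentation before running this.
aiplatform.init(project="my-gcp-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="qwen2-7b-instruct-tgi",
    serving_container_image_uri="us-docker.pkg.dev/<dlc-repository>/huggingface-text-generation-inference:<tag>",
    serving_container_environment_variables={"MODEL_ID": "Qwen/Qwen2-7B-Instruct"},
)

endpoint = model.deploy(
    machine_type="g2-standard-4",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)
print(endpoint.resource_name)
```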


I hope you enjoyed this newsletter. 🤗 If you have any questions or are interested in collaborating, feel free to contact me on Twitter or LinkedIn.

See you next week 👋🏻👋🏻