Issue 15: NVIDIA's Nemotron 340B Matches GPT-4, BigCodeBench Redefines Code Evaluation - June 23, 2024

Disclaimer: This content is AI-generated from my social media posts. Make sure to follow me there.

This week's tech roundup covers NVIDIA's Nemotron 340B matching GPT-4, a new coding benchmark from BigCode, a closer look at GRPO, and the latest multimodal releases from the major players.

News

Hermes 2 Theta 70B: The Ultimate Llama 3
Hermes 2 Theta, a fine-tuned and merged Llama 3 model from Nous Research and Arcee AI, matches GPT-4 on MT-Bench, surpasses Llama 3 70B Instruct across benchmarks, and supports structured outputs via JSON mode.
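
To make the JSON-mode claim concrete, here is a minimal sketch with transformers. The repo id and the schema-in-system-prompt format are my assumptions; the model card documents the official JSON-mode prompt, which may differ:

```python
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id assumed -- verify on the Hugging Face Hub.
model_id = "NousResearch/Hermes-2-Theta-Llama-3-70B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system",
     "content": 'Respond only with JSON of the form {"city": string, "population": number}.'},
    {"role": "user", "content": "What is the largest city in Germany?"},
]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=128)
reply = tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)
print(json.loads(reply))  # raises if the model did not emit valid JSON
```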

GenQA: A New Instruction Dataset
The GenQA dataset provides over 10M cleaned and deduplicated instruction samples generated without human oversight; models fine-tuned on it score strongly on AlpacaEval 2.0 and MT-Bench. See the research paper for details.

Infinity Instruct: A New Massive Instruction Dataset
Infinity Instruct offers 3M deduplicated samples curated from various datasets, with plans to scale to 10M by the end of June, providing a large, clean pool for SFT experiments.
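
As a rough sketch of working with a dataset at this scale (the hub id, config name, and the streaming exact-match dedup are my assumptions, not the authors' actual pipeline):

```python
from datasets import load_dataset

# Hub id and config name assumed -- check the dataset card.
ds = load_dataset("BAAI/Infinity-Instruct", "3M", split="train", streaming=True)

# Toy exact-match dedup over a small slice, mirroring the kind of
# cleaning such datasets apply at scale.
seen, unique = set(), []
for sample in ds.take(5_000):
    key = hash(str(sample))
    if key not in seen:
        seen.add(key)
        unique.append(sample)
print(f"{len(unique)} unique out of 5000 samples")
```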

Character.AI's Impressive Query Handling
Character.AI serves more than 20,000 queries per second, roughly 20% of Google Search's request volume. Innovations such as Multi-Query Attention and context caching cut their serving costs dramatically.
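
A back-of-envelope calculation shows why Multi-Query Attention matters for serving: sharing a single K/V head across all query heads shrinks the KV cache by roughly the number of attention heads. The dimensions below are illustrative, not Character.AI's actual configuration:

```python
# KV-cache bytes per batch = 2 (K and V) * n_layers * n_kv_heads
#                            * head_dim * seq_len * batch * bytes_per_value
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value / 1e9

# Illustrative 70B-class config (assumed numbers).
mha = kv_cache_gb(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=8192, batch=32)
mqa = kv_cache_gb(n_layers=80, n_kv_heads=1, head_dim=128, seq_len=8192, batch=32)
print(f"MHA: {mha:.0f} GB vs. MQA: {mqa:.1f} GB -> {mha / mqa:.0f}x smaller cache")
```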

Deploying LLMs on AWS Inferentia2
This step-by-step guide shows how to deploy the Mixtral 8x7B model on AWS Inferentia2 using Hugging Face Text Generation Inference (TGI), with impressive throughput at low cost.
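
Once a TGI endpoint is up on Inferentia2, querying it works like any other TGI deployment. A minimal client sketch (the endpoint URL is a placeholder):

```python
from huggingface_hub import InferenceClient

# Placeholder URL -- point it at wherever your TGI container is serving.
client = InferenceClient("http://<your-inf2-host>:8080")

output = client.text_generation(
    "Explain Mixture-of-Experts in one paragraph.",
    max_new_tokens=200,
    temperature=0.7,
)
print(output)
```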

DeepSeek-Coder-V2: MoE Code LLM
DeepSeek-Coder-V2 is a Mixture-of-Experts code model that approaches GPT-4-Turbo on coding tasks. Available in 16B and 236B parameter versions, it posts new top open-model scores on HumanEval and LiveCodeBench.
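
For intuition on why MoE makes a 236B model affordable to run, here is a toy top-2 routing layer. It is a didactic sketch, not DeepSeek's actual architecture:

```python
import torch
import torch.nn as nn

# Toy top-2 MoE layer: a router picks 2 of 8 experts per token, so only a
# fraction of total parameters is active per forward pass.
class ToyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        top_w, top_i = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        top_w = top_w / top_w.sum(-1, keepdim=True)  # renormalize the top-k
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, k] == e
                if mask.any():
                    out[mask] += top_w[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

print(ToyMoE()(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```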

Research

AutoIF: Enhancing Instruction-Following Models
AutoIF improves instruction following by having an LLM generate instructions together with verification code, then keeping only responses that pass execution, boosting model performance by up to 15% on IFEval. The full process and implementation are on GitHub.
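
Stripped to its essence, the mechanism looks roughly like this; the prompts and pipeline in the paper are more involved, and `evaluate`/`passes` are illustrative names of mine:

```python
# Simplified AutoIF loop: the LLM writes an instruction plus a verifier
# function; a response only becomes training data if the verifier, run as
# real code, accepts it.
verifier_code = '''
def evaluate(response: str) -> bool:
    # e.g. for the instruction "answer in exactly three words"
    return len(response.split()) == 3
'''

def passes(code: str, response: str) -> bool:
    namespace = {}
    exec(code, namespace)  # in practice this runs in a sandbox
    return bool(namespace["evaluate"](response))

print(passes(verifier_code, "Paris is lovely"))  # True -> keep as SFT data
print(passes(verifier_code, "Paris"))            # False -> discard
```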

Group Relative Policy Optimization (GRPO) Explained
The DeepSeek-Coder-V2 technical report highlights Group Relative Policy Optimization (GRPO), a PPO variant first introduced in the DeepSeekMath paper. GRPO improves mathematical reasoning while reducing memory consumption by replacing the learned value model with a group-relative reward baseline, with reward signals building on the Math-Shepherd process-supervision approach.
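
GRPO's key memory saving is that advantages come from the statistics of a group of completions sampled for the same prompt, rather than from a separate critic model. A simplified sketch:

```python
import statistics

def grpo_advantages(rewards):
    """Advantages from group statistics: each completion's reward is
    normalized by the mean/std of the completions sampled for the same
    prompt, so no learned value model is needed (PPO's critic goes away)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]

# Rewards for four completions sampled for one prompt:
print(grpo_advantages([0.1, 0.9, 0.4, 0.6]))
```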

BigCodeBench: The New Coding Benchmark
BigCode introduces BigCodeBench, a coding benchmark focused on realistic tasks and diverse library use. With 1,140 tasks and comprehensive test-based evaluation, it goes well beyond HumanEval. Check out the leaderboard and code for more details.
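
For a feel of the task style, here is a hypothetical example in the BigCodeBench spirit, not an actual benchmark item:

```python
import io

import pandas as pd
import requests

# Hypothetical task: "Download a CSV from `url` and return the mean of its
# 'price' column." Solving it requires composing several libraries, and
# grading executes unit tests against the generated function.
def task_func(url: str) -> float:
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return float(pd.read_csv(io.StringIO(resp.text))["price"].mean())
```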

General

Multimodal Advances from Major Players
Recent releases from Anthropic, Google DeepMind, and OpenAI highlight the growing importance of multimodality. Claude 3.5 Sonnet ranks #1 on MixEval Hard, ahead of GPT-4o, and MixEval Hard itself shows a 96% correlation with LMSYS' Chatbot Arena.

NVIDIA's Nemotron 340B
NVIDIA's Nemotron 340B matches GPT-4 (0314) thanks to advanced training techniques and a heavy reliance on synthetic data generation: over 98% of its alignment data is synthetic. The technical report covers their methods and insights.
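
A conceptual sketch of reward-filtered synthetic data generation in the spirit of the report; `generate` and `score` are hypothetical stubs standing in for calls to the Nemotron-4-340B Instruct and Reward models:

```python
def generate(prompt: str) -> str:
    return f"Synthetic answer to: {prompt}"  # stands in for the instruct model

def score(prompt: str, response: str) -> float:
    return 0.9  # a real reward model scores helpfulness, correctness, etc.

THRESHOLD = 0.8
prompts = ["Summarize the water cycle.", "Write an example SQL JOIN."]
dataset = [
    {"prompt": p, "response": r}
    for p in prompts
    if (r := generate(p)) and score(p, r) >= THRESHOLD
]
print(len(dataset), "samples kept")
```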


I hope you enjoyed this newsletter. 🤗 If you have any questions or are interested in collaborating, feel free to contact me on Twitter or LinkedIn.

See you next week 👋🏻👋🏻