Issue 12: FineWeb and FineWeb-Edu Reports, MAP-Neo Insights, and More! - June 2, 2024

Disclaimer: This content is generated by AI from my social media posts. Make sure to follow me there.

This week's highlights include the FineWeb and FineWeb-Edu reports, innovative dataset creation methods from Lightblue, and new breakthroughs in LLM technology.

News

FineWeb Technical Report: An Inside Look

The FineWeb Technical Report delves into the creation of one of the best open datasets derived from CommonCrawl. Built from 96 CommonCrawl snapshots, with URL filtering, text extraction via trafilatura, and MinHash deduplication, FineWeb achieves top-notch quality. Rigorous filtering and ablation studies set it apart from other datasets.
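
For intuition, here is a minimal sketch of the kind of pipeline the report describes, using trafilatura for extraction and datasketch for MinHash deduplication. It is an illustrative approximation, not the team's actual datatrove pipeline, and the shingling and threshold settings are assumptions.

```python
# Illustrative FineWeb-style pipeline sketch: extract main text with trafilatura,
# then drop near-duplicates with MinHash LSH. Parameters here are assumptions,
# not the values used for the actual dataset.
import trafilatura
from datasketch import MinHash, MinHashLSH

raw_documents = []  # iterable of raw HTML strings, e.g. read from a CommonCrawl WARC

def extract_text(html: str):
    # trafilatura strips boilerplate (navigation, ads) and returns the main content
    return trafilatura.extract(html)

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    tokens = text.split()
    for i in range(len(tokens) - 4):            # 5-gram shingles (assumed setting)
        m.update(" ".join(tokens[i:i + 5]).encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.7, num_perm=128)   # groups docs with similar MinHash signatures
deduped = []
for doc_id, html in enumerate(raw_documents):
    text = extract_text(html)
    if not text:
        continue
    m = minhash(text)
    if lsh.query(m):                            # near-duplicate of a kept document -> skip
        continue
    lsh.insert(str(doc_id), m)
    deduped.append(text)
```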

FineWeb-Edu: A New Benchmark in Educational Data

The FineWeb-Edu Report details how FineWeb-Edu, a subset of FineWeb, was created by annotating 500k samples for educational quality using Llama-3-70B-Instruct. A classifier trained on these synthetic annotations was then used to filter FineWeb, producing a high-quality educational dataset.
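
As a sketch of how such a filter can be applied, the snippet below scores documents with the released educational-quality classifier and keeps those above a threshold. The checkpoint id and the score-of-3 cutoff follow my reading of the report, but verify them against the model card before relying on them.

```python
# Score web documents for educational quality and keep the high-scoring ones.
# Checkpoint id and score scale are assumptions based on the FineWeb-Edu report.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "HuggingFaceFW/fineweb-edu-classifier"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

def edu_score(text: str) -> float:
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze().item()  # regression score, roughly 0 (low) to 5 (high)

docs = [
    "Photosynthesis converts light energy into chemical energy stored in glucose...",
    "BUY NOW!!! limited offer, click here",
]
# The report filters FineWeb by keeping documents scoring at or above 3
kept = [d for d in docs if edu_score(d) >= 3]
```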

Yuan2-M32: An Innovative Mixture of Experts Model

IEIT-Yuan's Yuan2-M32, detailed in the Yuan2-M32 paper, is a 40B Mixture of Experts model with a new Attention Router mechanism. Achieving 72.2% on MMLU and 74.4% on HumanEval with only 3.7B active parameters, it competes with far larger models while using significantly less compute.
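
The snippet below is a simplified PyTorch interpretation of routing via attention over learned expert embeddings instead of a single linear gate; the paper's exact Attention Router formulation differs, so treat this only as an illustration of the idea.

```python
# Simplified attention-style MoE router, loosely inspired by Yuan2-M32's Attention
# Router. Dimensions, scaling, and top-k handling are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionRouter(nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.query = nn.Linear(hidden_dim, hidden_dim)
        self.expert_keys = nn.Parameter(torch.randn(num_experts, hidden_dim))
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        # x: (batch, hidden_dim) token representations
        q = self.query(x)                                        # (batch, hidden_dim)
        scores = q @ self.expert_keys.t() / x.shape[-1] ** 0.5   # (batch, num_experts)
        weights = F.softmax(scores, dim=-1)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)        # route to top-k experts
        return top_w / top_w.sum(dim=-1, keepdim=True), top_idx

router = AttentionRouter(hidden_dim=64, num_experts=32, top_k=2)
weights, experts = router(torch.randn(4, 64))
```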

NVIDIA L4s Now Available on AWS Inference Endpoints

New NVIDIA L4 instances on AWS offer up to 8x L4s per user or organization at a 20% cost saving compared to on-demand AWS EC2, perfect for models like Llama 3 8B or Mistral 7B.
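
If you deploy through the huggingface_hub client, creating an endpoint on one of these instances looks roughly like the sketch below; the instance_type/instance_size identifiers and the model id are assumptions, so check the Inference Endpoints UI for the exact values available to your account.

```python
# Sketch of spinning up an Inference Endpoint on an NVIDIA L4 via huggingface_hub.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "mistral-7b-l4",
    repository="mistralai/Mistral-7B-Instruct-v0.2",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    instance_type="nvidia-l4",   # assumed identifier for the new L4 instances
    instance_size="x1",
)
endpoint.wait()                  # block until the endpoint is running
print(endpoint.client.text_generation("What is MinHash deduplication?"))
```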

Research

Lightblue's Enhanced Synthetic Dataset Creation

Lightblue's new dataset creation method, outlined in this paper, uses repeated ranking to improve data quality. By collecting diverse prompts, generating multiple responses, and having evaluators like GPT-4 rank them repeatedly, the method significantly improves the consistency and quality of the resulting data.
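
Here is a hedged sketch of the repeated-ranking idea: rank the same responses several times with an LLM judge and keep only prompts whose rankings agree across runs. `judge_rank` is a hypothetical stand-in for a GPT-4 ranking call, and mean pairwise Kendall's tau is one reasonable consistency metric, not necessarily the paper's exact criterion.

```python
# Keep only prompts whose repeated LLM-judge rankings are consistent.
from itertools import combinations
from scipy.stats import kendalltau

def ranking_consistency(rankings: list[list[int]]) -> float:
    # average Kendall's tau over all pairs of repeated rankings (1.0 = identical)
    taus = []
    for a, b in combinations(rankings, 2):
        tau, _ = kendalltau(a, b)
        taus.append(tau)
    return sum(taus) / len(taus)

def filter_consistent(prompts, responses_per_prompt, judge_rank, repeats=3, min_tau=0.8):
    # judge_rank(prompt, responses) -> list of ranks aligned with responses, 0 = best
    kept = []
    for prompt, responses in zip(prompts, responses_per_prompt):
        rankings = [judge_rank(prompt, responses) for _ in range(repeats)]
        if ranking_consistency(rankings) >= min_tau:
            best = responses[rankings[0].index(0)]   # top-ranked response from the first run
            kept.append({"prompt": prompt, "chosen": best})
    return kept
```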

MAP-Neo: A Comprehensive Open-Source LLM

The MAP-Neo paper documents every part of the open-source LLM's pipeline, including the tokenizer, data preprocessing, model architecture, and training. The 7B model, trained on 4.5T tokens, showcases impressive results across various benchmarks.
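
If you want to try the model, loading it with transformers should look like the sketch below; the Hub id is my assumption based on the m-a-p organization, so double-check it before use.

```python
# Load and sample from the MAP-Neo 7B checkpoint (Hub id is an assumption).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m-a-p/neo_7b"  # assumed Hub id for the 7B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("The key ingredients of a transparent LLM release are", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```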

CodeAct: A New Framework for LLM Agents

The CodeAct paper proposes using executable Python code for LLM agents, enhancing performance and flexibility over traditional JSON-based methods. This approach consolidates actions into a unified "action space," improving success rates and reducing the number of actions needed.
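
A toy illustration of the idea, not the authors' implementation: the agent emits Python that composes several tools and control flow in a single action, where a JSON-based agent would need multiple round trips. The `TOOLS` dictionary and the unsandboxed `exec` below are purely for demonstration.

```python
# The agent's "action" is executable Python rather than a JSON tool call.
import contextlib
import io

TOOLS = {
    "search_flights": lambda origin, dest: [{"dest": dest, "price": 420}, {"dest": dest, "price": 385}],
    "book": lambda flight: f"booked {flight['dest']} at ${flight['price']}",
}

def run_action(code: str) -> str:
    # Execute model-emitted code with tools exposed as plain functions and
    # capture whatever it prints as the observation returned to the model.
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, dict(TOOLS))  # NOTE: sandbox this properly in real use
    return buffer.getvalue()

# Equivalent of several JSON tool calls, collapsed into one code action:
action = """
flights = search_flights("SFO", "NYC")
cheapest = min(flights, key=lambda f: f["price"])
print(book(cheapest))
"""
print(run_action(action))  # -> booked NYC at $385
```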

General

Caution on Codestral-22B-v0.1 License

Mistral AI's new code LLM, Codestral-22B-v0.1, is released under a restrictive Non-Production License. It prohibits use in commercial activities or business operations, which limits its suitability for coding assistant tools.

Sentence Transformers 3.0: Custom Embedding Models

The new release of Sentence Transformers 3.0 introduces multi-GPU training, bf16 support, and enhanced monitoring, making it easier to train custom embedding models and boosting Retrieval-Augmented Generation applications.
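
Training a custom model with the new v3 trainer API looks roughly like the sketch below; the base model, dataset, and hyperparameters are placeholder choices. With `bf16=True` and launching via accelerate or torchrun, the same script uses mixed precision and multiple GPUs.

```python
# Fine-tune an embedding model with the Sentence Transformers 3.0 trainer API.
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("microsoft/mpnet-base")            # placeholder base model
train_dataset = load_dataset("sentence-transformers/all-nli", "pair", split="train")

loss = MultipleNegativesRankingLoss(model)                     # works on (anchor, positive) pairs
args = SentenceTransformerTrainingArguments(
    output_dir="models/mpnet-nli",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    bf16=True,                                                 # new bf16 support
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss
)
trainer.train()
```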

Understanding the True Cost of Deploying Generative AI

Deploying Generative AI models involves more than just raw compute costs. Read my blog for a comprehensive understanding of the total cost of ownership (TCO), including hidden factors crucial for successful implementation and maintenance.
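
As a toy illustration of why the invoice understates the real cost, here is a back-of-the-envelope calculation with entirely hypothetical numbers, covering the kinds of cost buckets the post discusses.

```python
# Back-of-the-envelope TCO sketch; every number below is hypothetical.
gpu_hourly = 1.2            # $/hour for the instance
replicas = 2                # for availability and traffic peaks
hours_per_month = 730
requests_per_month = 1_500_000

compute = gpu_hourly * replicas * hours_per_month      # the line item most people budget for
engineering = 160 * 90                                 # monthly MLOps hours * loaded hourly rate
monitoring_and_eval = 400                              # dashboards, regression evals, guardrails

tco = compute + engineering + monitoring_and_eval
cost_per_1k_requests = tco / (requests_per_month / 1000)
print(f"compute only: ${compute:,.0f}/mo, full TCO: ${tco:,.0f}/mo, "
      f"${cost_per_1k_requests:.2f} per 1k requests")
```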


I hope you enjoyed this newsletter. πŸ€— If you have any questions or are interested in collaborating, feel free to contact me on Twitter or LinkedIn.

See you next week πŸ‘‹πŸ»πŸ‘‹πŸ»