Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

10:02 AM · May 7, 2025

Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

The paper introduces two openly licensed datasets:
1. SwallowCode (≈16.1 billion tokens) refines Python snippets from The-Stack-v2
2. SwallowMath (≈2.3 billion tokens) enhances Finemath-4+ by removing boilerplate, restoring context, and reformatting solutions into concise, step-by-step explanations

abs: https://arxiv.org/abs/2505.02881
datasets: https://huggingface.co/datasets/tokyotech-llm/swallow-code
https://huggingface.co/datasets/tokyotech-llm/swallow-math
