mirror of
https://github.com/wassname/ml-debug.git
synced 2026-06-27 16:15:57 +08:00
docs: add sourced transformer report folklore
This commit is contained in:
@@ -0,0 +1,568 @@
|
||||
# Atropos - Nous Research's LLM RL Gym
|
||||
|
||||

|
||||
|
||||
<div align="center">
|
||||
|
||||
*In Greek mythology, Atropos was the eldest of the three Fates. While her sisters spun and measured the threads of mortal lives, Atropos alone held the shears that would cut these threads, determining the final destiny of each soul. Just as Atropos guided souls to their ultimate fate, this system guides language models toward their optimal potential through reinforcement learning.*
|
||||
|
||||
</div>
|
||||
|
||||
<div align="center">
|
||||
</div>
|
||||
<div id="badges" align="center">
|
||||
<a href="https://huggingface.co/NousResearch">
|
||||
<img src="https://img.shields.io/badge/NousResearch-orange?style=for-the-badge&logo=huggingface&logoColor=white" alt="HuggingFace"/>
|
||||
</a>
|
||||
<a href="https://nousresearch.com">
|
||||
<img src="https://img.shields.io/badge/NousResearch.com-white?style=for-the-badge&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAACQAAAAlCAYAAAAqXEs9AAAAIGNIUk0AAHomAACAhAAA+gAAAIDoAAB1MAAA6mAAADqYAAAXcJy6UTwAAAAJcEhZcwAAFiUAABYlAUlSJPAAAAAGYktHRAD/AP8A/6C9p5MAAAAldEVYdGRhdGU6Y3JlYXRlADIwMjUtMDQtMjlUMTU6NDI6MjcrMDA6MDAUtMrgAAAAJXRFWHRkYXRlOm1vZGlmeQAyMDI1LTA0LTI5VDE1OjQyOjI3KzAwOjAwZelyXAAAACh0RVh0ZGF0ZTp0aW1lc3RhbXAAMjAyNS0wNC0yOVQxNTo0MjoyNyswMDowMDL8U4MAAAhJSURBVFhHzVhZTJVXEB4RNxBBwBXiwqIo7juKorjjFn1QaVKrUROLRkxKjKKNItWkJrYuscYXouVNrCa0LO5bArgHUFxRqIJAFRAQUJbpfMP9/96LF7RvfMnJ/Zfzn/OdmW/mzLntiIiltRk4WH7bDEwLxcXFkb+/P7Vrh0etw+hTX19PtbW19OnTJ33m5OREXbp0kWsHamioJ+YvGx99njx5QuvWrdN7k9CVK1do+vTpuGwVjY2N9OFDtbQqys/Ppxs3btC1a9eopKSEevfuTSNGjKCQkOk0fvx4cnXtZvmqZYDQpUuXaPbs2ZYnTYRYCLFMZrcBZWVlfPr0aY74PoKXLl3Kq1Z9x0eOHOGHD3M4PT2DV6xYwR06dNCxxEo8bdo0TkhI4Nraj/q9vXHRGhoa+MKFC/qdpbVMCBB3cHJyMs+cOZM7depk/SE7ODiwWIIvXrzIVVVVvG3bdpMUmrOzM2/dupXfv6/QsZqPj/bVhIDS0lLeEb2D3dzcrD8wm+hGm6+vL6empiqpVd+uMt+DcLdu3fjAgQMserMZ32jNCdmNMgi0pOQf2r59O/3y6y+ilw/k4uJCjo6OpqA7duxIPj4+NGvWLNq580eSQamoqIiid0STENT3aJMnT6bOnTtTTk6O+W1rsEsIUZOU9BdlZ2eLMF1pwoQJ5OHhQQEBARQcHEwDBgygPn360JYtWygqKoqGDx9OeXl59ODBAxo8eDCJzmjNmjW0ePFiioyMpFGjR5Pojd69e/dFUo6WXxP4ACudOHEi9erVi8QVNGnSJCFZRzU11TRw4EASV1L37t2lzyRKSDhFERERVF5erikAhOfNn0d37t7RtLBv3z7y9vKijJs3acGCBbRkyRLLTPbxGSEMevbsWQlHIk9PD6qoqKCTJ09KXmnQSdFEXtS3b19tCPeXL19qOrh69SrdlIm9vb3p6bNnlC9Wq6mpoYLXr6myspISE/+k+fPnkwjfMpt9qJggauDp06c8btw4FuuomCFMNPRp3749i7vMe/yKrvTauA8LW8AhISEcGBio4S+5iSVhsrieBw0axI8fP9Z5vlrUyJqPHj0yrYGVy/f6Dlbq2bOn6gnAO7jFAO5zch7SM7EOngtBFTa++/jxI70WS927d8/S2z4+IwR9VFdX6wAGevToIQM3mVkSJI0ePUavAUxqAPp78+aNuvHFixc6Tl1dHUk+0l+4738TQlQhvK2BMA4P/4bElSRupNCZobRo0SIKDQ1VkQ4bNky/gSWxEFgHBHAP7SBlwEq4z8rKUqKtRZv6ztBQXl4+Bw4NZDG16kMm1qQng7CsmvfExrLsebx5cyRnZmbK8xrVhaQEUwdiUe7Xr59+D93Jps2O8ot3SKKyB+pcX9SQ9KH+/fvRrt27KCgoSPMOcPv2bV0hQn7unLkkGVlckktpaWnqMuhKthXtKwQ0F40aNUpTA76DtZzEbQBSCtzZGpSZ9dYB1iJMPnXqFE+dOpU3bdrE5bK5AmVl5ZyYmMgpKan89u1bfVZYWMhCwlxl8+bu7s6SCsz7gwcP6ndftJAB+NfPz0/ziZOzE23cuJFcRTsyhpYUCxcupHnz5prRBqEjIlsCdAQLGrq5e/euWs4e7BIygIjz8/UnWaFGiDVADg149eqVDSG4zTow4DIQMNwKYWNse8JulZC4RAbpqCuEVlqKDBG1TZoAUcMCRi7CggxC2PeQq+yhVUJFRcW6mvSMDDpx4gRly+aJ5NccSKQAJsfOjj6G9WAplLU1EuqwHN5je8qQMe2hVULFxUV069YtHWzkyJHkLlFjDVgMK8/NzdV77G2oDKxhuK9WLAirISchr6H0NWpxa9gQgq9LS8v0GsmtoKBAN9fLly9rrewlu7Z1ZgbwXqJMryUiaezYsXoNYDKQwcQYDw2kkFKwRSGjt0gIK0E9A/OjE3INfA0kJSXR+fMX9BrvMCjIAxAzogxo7gpDd8jMcCGI4Xfo0KFa9EF7FRWVNqTMUMDRBYQgzuDgKSS7vroCnSHqqKgfpLy4on0RVSghYmL2mOQwuZtbd62H7t+/r+QcHNrpryFwTZByVML2BEIYRw4BYsX2+h4wCTU2Nu3Ihw4dEu0U67EIK4K/QQwN76yB9yjOQBoEx4wZI24oVtcA9fXYv/6rBgAvyW25uS/MNJGVlSm7Q3+9BmwE4enpSV27dqWf9+9XC4WHh+vKsSJ7RRVcefjwYXUBFgMrokI0CBmRZg0nibgzZ/5QiaA9f/7c7A/YEAoIGKKFO4hs2LBBrQKzDhky5LPBEb7Yr5YvX06xsT9JmRukpAz32AMsiaiFoLELIBjgOiReAyYh5I6AACnQUR+LSI8dO6YnhWXLlql1rFcB+Pj4kpzDdPDS0ncUH/+7nFhDNOcgGmFt1OT+/oM0QrExwyIQOIBnKSkpNGfOHE0FBiBvXTo0gwGxCgjx6NHfxDp/y0mhlJKTkzTbYqc2AFdGR0eru0AWWV33LOl3U6wAcc+YMYNcurrQe0kNGRnpSuDcuXNaBUg5q5rbu3evWg3HKQDy3o2L1atX6/EGgEW8vb3o+vUblJefR3JMptfiOmtCAPIU9IV9CZNjgqTkZAoLC6P169dronT3cJffPipc5B00fLN27Vo9RkGzKEfi4+MtozZZyCzQrCF6kKN0Hcupg0Uz2g9FF4p3FPErV67ktLR0Fp1xVlY2x8TEcFxcnJYV9iCaZNEmS6K1PGkqQazLDzPskQTT09Mtd02AWwoKCtWscCM0gPMaNkkkQ5SwcOXx48dVBxDplClT9Cgkc1lGsQXew8VIoEYf/ItiwNRQW4HtxtQG0MYIEf0L1N75qS9kGwUAAAAASUVORK5CYII=" alt="Website"/>
|
||||
</a>
|
||||
<a href="https://x.com/NousResearch">
|
||||
<img src="https://img.shields.io/badge/@NousResearch-black?style=for-the-badge&logo=X&logoColor=white" alt="@NousResearch"/>
|
||||
</a>
|
||||
</div>
|
||||
|
||||
---
|
||||
|
||||
## What is Atropos?
|
||||
Atropos is an environment microservice framework for async RL with LLMs.
|
||||
|
||||
Atropos encompasses both environments, which are set up as services, and a trajectory API for the environments to send data to and for the trainer to pull batches from.
|
||||
|
||||

|
||||
<div align="center">
|
||||
|
||||
*Here is a diagram of how Atropos' components can interact with a trainer & inference server to complete the RL loop (trainer & inference engine not included with the atropos package)*
|
||||
|
||||
</div>
|
||||
|
||||
Atropos is a robust, scalable framework for **Reinforcement Learning Environments with LLMs**.
|
||||
|
||||
The goal: provide a flexible, scalable, and standardized platform to accelerate LLM-based RL research across diverse, interactive settings.
|
||||
|
||||
The framework supports collecting, distributing and evaluating LLM trajectories through diverse environments including:
|
||||
|
||||
<div align="center">
|
||||
|
||||
| Environment Type | Examples | Purpose |
|
||||
|---------------------------|--------------------------------------------|----------------------------------------------------|
|
||||
| 📚 Dataset environments | GSM8K, MMLU, Custom HF Datasets | Evaluate and improve LLM performance on static data|
|
||||
| 🎮 Online environments | Blackjack, Taxi, Text-based games | Train LLMs through interactive game-based learning |
|
||||
| 🤖 RLAIF and RLHF | LLM Judge/Reward Models | Fine-tune LLMs using human feedback and alignment |
|
||||
| 🔄 Multi-Turn RL | deepresearch, internal tool calling | Train LLMs on complex multi-step interactions |
|
||||
| 💻 Code Execution | MBPP, HumanEval (via `coding_server.py`) | Train LLMs to generate and execute code |
|
||||
| 🖼️ Multimodal | OCR VQA, Clevr (via `multimodal_dpo/`) | Train LLMs on tasks involving vision and language |
|
||||
|
||||
</div>
|
||||
|
||||
---
|
||||
|
||||
## Experimental results from models trained using Atropos' environments
|
||||
|
||||
We have been able to achieve significant improvements on specific domains or tasks with Atropos - Below are some of the results.
|
||||
|
||||
**Tool Calling Environment Results:**
|
||||
|
||||
<div align="center">
|
||||
|
||||
| Berkeley Function Calling Benchmark Type | Base Model | With Atropos RL | Improvement |
|
||||
|---------------|------------|-----------------|-------------|
|
||||
| Parallel Tasks| 10% | 46% | **4.6x** ⬆️ |
|
||||
| Simple Tasks | 21% | 51.75% | **2.5x** ⬆️ |
|
||||
|
||||
</div>
|
||||
|
||||
Model Artifact:
|
||||
https://huggingface.co/NousResearch/DeepHermes-ToolCalling-Specialist-Atropos
|
||||
|
||||
|
||||
Environment Used:
|
||||
[https://github.com/NousResearch/atropos/blob/main/environments/tool_calling_server.py](https://github.com/NousResearch/atropos/blob/main/environments/tool_calling_server.py)
|
||||
|
||||
---
|
||||
|
||||
**Financial Fundamentals Prediction Environment Results**:
|
||||
|
||||
<div align="center">
|
||||
|
||||
| Metric | Initial Accuracy | With Atropos RL | Improvement |
|
||||
|--------|-----------------|-----------------|-------------|
|
||||
| Directional Prediction Eval Accuracy | 20% | 50% | **2.5x** 📈 |
|
||||
|
||||
</div>
|
||||
|
||||
Model Artifact:
|
||||
https://huggingface.co/NousResearch/DeepHermes-Financial-Fundamentals-Prediction-Specialist-Atropos
|
||||
|
||||
Environment Used:
|
||||
[https://github.com/NousResearch/atropos/blob/main/environments/fundamental_prediction_environment.py](https://github.com/NousResearch/atropos/blob/main/environments/fundamental_prediction_environment.py)
|
||||
|
||||
---
|
||||
|
||||
## RLAIF Experiment Artifacts
|
||||
Using the RLAIF Environment to change the personality of the model, we have produced several artifacts of interesting and weird personalities.
|
||||
|
||||
**DeepHermes Egregore v1 and v2 8B:**
|
||||
|
||||
https://huggingface.co/NousResearch/DeepHermes-Egregore-v1-RLAIF-8b-Atropos
|
||||
https://huggingface.co/NousResearch/DeepHermes-Egregore-v2-RLAIF-8b-Atropos
|
||||
|
||||
**DeepHermes Ascension Maze 8B:**
|
||||
|
||||
https://huggingface.co/NousResearch/DeepHermes-AscensionMaze-RLAIF-8b-Atropos
|
||||
|
||||
Environment Used: [https://github.com/NousResearch/atropos/blob/main/environments/rlaif_server.py](https://github.com/NousResearch/atropos/blob/main/environments/rlaif_server.py)
|
||||
|
||||
---
|
||||
|
||||
## Navigating the Repo
|
||||
|
||||
| Category | Description |
|
||||
|----------|------------|
|
||||
| 📁 [`atroposlib/`](atroposlib/) | Core library containing base classes and utilities |
|
||||
| 🎮 [`environments/`](environments/) | Collection of ready-to-use RL environments. Community contributions are typically placed in the [`environments/community/`](environments/community/) subdirectory. |
|
||||
| 📚 [`example_trainer/`](example_trainer/) | Example training scripts and configurations |
|
||||
|
||||
Key Documents:
|
||||
- [Base Environment Class](atroposlib/envs/README.md) - Documentation for creating custom environments
|
||||
- [ManagedServer Guide](atroposlib/envs/server_handling/MANAGED_SERVER.md) - **Recommended approach** for automatic token and logprob tracking
|
||||
- [Environments Overview and Contribution Guide](environments/community/README.md) - Documentation for existing environments and how to contribute new ones.
|
||||
- [Full Environment Config Options](CONFIG.md) - Documentation for creating custom environments
|
||||
- [Example Trainer](example_trainer/README.md) - Getting started with training
|
||||
- [Slurm Guide](SLURM.md) - Guide for using Atropos with Slurm for distributed inference
|
||||
- [Frequently Asked Questions (FAQ)](atroposlib/FAQ.md) - Answers to common questions for new users
|
||||
- [Contributing Guide](CONTRIBUTING.md) - Guidelines for contributors
|
||||
- [License](LICENSE) - MIT license details
|
||||
|
||||
---
|
||||
|
||||
## Prerequisites
|
||||
|
||||
Before installing Atropos, ensure you have the following:
|
||||
|
||||
- **Python 3.10+** — Required. Check with `python --version`
|
||||
- **Git** — For cloning the repository
|
||||
- **An OpenAI-compatible API endpoint** — Atropos environments need an inference server. Options include:
|
||||
- A local [vLLM](https://github.com/vllm-project/vllm) or [SGLang](https://github.com/sgl-project/sglang) instance
|
||||
- An [OpenAI API key](https://platform.openai.com/api-keys) (set as `OPENAI_API_KEY` environment variable)
|
||||
- Any provider with an OpenAI-compatible endpoint (e.g., [Together AI](https://together.ai), [OpenRouter](https://openrouter.ai))
|
||||
- **Weights & Biases account** *(optional)* — For experiment tracking. Set `use_wandb=False` in your environment config to skip
|
||||
|
||||
> **Note:** You do not need a GPU to develop or test environments locally. A GPU is only required for running inference servers locally or for training.
|
||||
|
||||
---
|
||||
|
||||
## Installation
|
||||
|
||||
Get your Python 3.10 (or later) environment ready, then simply pip install:
|
||||
|
||||
```bash
|
||||
pip install atroposlib
|
||||
```
|
||||
|
||||
If you're looking to get into developing the repo or using the environments:
|
||||
|
||||
|
||||
```bash
|
||||
pip install -e . # for using
|
||||
pip install -e .[dev] # for development
|
||||
pip install -e .[examples] # for running examples
|
||||
pip install -e .[verifiers] # for verifiers integration
|
||||
pip install -e .[all] # for everything
|
||||
```
|
||||
|
||||
**Important:** If you're committing to the repository, please install the pre-commit hooks:
|
||||
```bash
|
||||
pre-commit install
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Quick Start Guide
|
||||
|
||||
1. **Create Your First Environment**
|
||||
- Review our [Base Class Documentation](atroposlib/envs/README.md) to understand the core concepts
|
||||
- Check out existing environments in the [`environments/`](environments) directory for examples
|
||||
|
||||
2. **Run an Example Environment**
|
||||
|
||||
You should edit the config_init section of the environment file you want ([For example, in GSM8K Environment](https://github.com/NousResearch/atropos/blob/main/environments/gsm8k_server.py#L53)) to point to a running VLLM or SGLang inference server as well as any other [configuration changes](CONFIG.md) you'd like to make, such as the group size, then:
|
||||
|
||||
> **Note:** By default, Atropos uses the OpenAI-compatible API endpoint which works with any provider. For enhanced features, use `VLLMServer` (atroposlib/envs/server_handling/vllm_server.py) or `SGLangServer` (atroposlib/envs/server_handling/sglang_server.py) for direct access to native APIs with full token and logprob tracking.
|
||||
|
||||
```bash
|
||||
# Start the API server
|
||||
run-api
|
||||
```
|
||||
In a separate terminal, start the GSM8K environment microservice
|
||||
```bash
|
||||
python environments/gsm8k_server.py serve --openai.model_name Qwen/Qwen2.5-1.5B-Instruct --slurm false
|
||||
# alternatively
|
||||
# python environments/gsm8k_server.py serve --config environments/configs/example.yaml
|
||||
# python environments/gsm8k_server.py serve --config environments/configs/example.yaml --env.group_size 8 # cli args override corresponding config settings
|
||||
```
|
||||
3. **Grabbing Rollouts**
|
||||
|
||||
If you want to just start getting rollouts, and not use a trainer, see the [debug section](#testing-and-debugging-tools)
|
||||
for help getting started with the available tools, we recommend starting with process or view-run
|
||||
|
||||
4. **Training Your Model**
|
||||
- Follow our [training example guide](example_trainer/README.md) for detailed instructions
|
||||
- Monitor progress through our built-in logging and reporting system:
|
||||
- Completion lengths
|
||||
- Evaluation accuracies
|
||||
- Full rollouts and scores
|
||||
|
||||
You can use multiple environments at once, just point them all to the same server.
|
||||
|
||||
Environments come with detailed logging and reporting support, runs track completion lengths, eval accuracies, full rollouts and scores, and more:
|
||||
|
||||

|
||||
|
||||
---
|
||||
|
||||
# Trainer Integrations
|
||||
## Axolotl
|
||||
<a href="https://github.com/axolotl-ai-cloud/plugin-atropos">
|
||||
<img
|
||||
src="https://github.com/user-attachments/assets/be629253-a8b1-4354-b6da-5e404e9c854d"
|
||||
alt="Atropos plugin logo"
|
||||
width="50%">
|
||||
</a>
|
||||
|
||||
Axolotl is a powerful tool for fine-tuning a wide range of AI models, supporting techniques like LoRA and QLoRA through simple YAML configurations.
|
||||
|
||||
The [Atropos plugin for Axolotl](https://github.com/axolotl-ai-cloud/plugin-atropos) seamlessly integrates Atropos' RL environments into Axolotl's training pipelines.
|
||||
This allows you to leverage Atropos for reinforcement learning while utilizing Axolotl's extensive features for model fine-tuning.
|
||||
|
||||
To use, follow the README on the [plugin repository](https://github.com/axolotl-ai-cloud/plugin-atropos).
|
||||
|
||||
## Tinker
|
||||
<a href="https://github.com/NousResearch/tinker-atropos">
|
||||
<img
|
||||
src="https://github.com/user-attachments/assets/6c226187-4df8-4094-be5d-72f3f58de423"
|
||||
alt="Atropos Tinker logo"
|
||||
width="50%">
|
||||
</a>
|
||||
|
||||
The Tinker API is a simple and flexible LoRA trainer framework for researchers and developers to quickly build out their ideas without worrying about the complexities of distributed training. Users write a simple loop that runs on their CPU, and Tinker manages the backend computation on their GPUs, while still providing full control over the training and algorithmic details.
|
||||
|
||||
The [Tinker-Atropos](https://github.com/NousResearch/tinker-atropos) integration layer enables all Atropos environments to leverage the power of Tinker for their RL experiments. This allows users with little or no compute to develop and build Atropos environments with minimal worry about the underlying compute behavior, as well as providing an easy environment integration point for Tinker users.
|
||||
|
||||
To get started, check out the README at the [project repository](https://github.com/NousResearch/tinker-atropos).
|
||||
|
||||
## Atropos' Example Trainer
|
||||
Atropos repo contains an example trainer that should primarily be used as a reference example to show how a trainer and inference provider can be integrated with Atropos to complete the RL Training Loop.
|
||||
|
||||
To use the example trainer, see this page: [training example guide](example_trainer/README.md)
|
||||
|
||||
## On-Policy Distillation (API + ScoredDataGroup Contract)
|
||||
|
||||
Atropos now supports OPD at the transport layer by carrying distillation arrays
|
||||
through `ScoredDataGroup` and the API queue/batch endpoints.
|
||||
|
||||
### Scope of this change
|
||||
|
||||
- No teacher fetching/orchestration in `BaseEnv`.
|
||||
- Environments or external pipelines are responsible for populating distillation arrays.
|
||||
- API stores and returns those arrays unchanged.
|
||||
|
||||
### Distillation payload fields
|
||||
|
||||
Each scored group may include:
|
||||
|
||||
- `distill_token_ids`: shape `[sequence][position][top_k]`
|
||||
- `distill_logprobs`: shape `[sequence][position][top_k]`
|
||||
|
||||
These fields are optional, and when present are forwarded from:
|
||||
|
||||
- environment -> `/scored_data` or `/scored_data_list`
|
||||
- API queue -> `/batch` -> trainer
|
||||
|
||||
### Minimal producer example (environment side)
|
||||
|
||||
```python
|
||||
scores["distill_token_ids"] = distill_token_ids
|
||||
scores["distill_logprobs"] = distill_logprobs
|
||||
```
|
||||
|
||||
### Minimal consumer check (trainer/debug side)
|
||||
|
||||
```bash
|
||||
curl -s http://localhost:8002/latest_example | jq '{has_ids:(.distill_token_ids!=null), has_lps:(.distill_logprobs!=null)}'
|
||||
```
|
||||
|
||||
### Notes
|
||||
|
||||
- The API does not validate cross-field semantics beyond schema typing.
|
||||
- Trainers should validate alignment assumptions they require (sequence length, per-position top-k, etc.).
|
||||
- Teacher-side architecture and prompt/rendering strategy are intentionally out of scope for this PR.
|
||||
|
||||
### TeacherDistillationEnv follow-up
|
||||
|
||||
The follow-up teacher environment uses a dedicated teacher server config and
|
||||
attaches teacher prompt logprobs before the group is sent to the API.
|
||||
|
||||
Teacher config shape:
|
||||
|
||||
```python
|
||||
TeacherDistillationConfig(
|
||||
teacher_enabled=True,
|
||||
teacher_top_k=8,
|
||||
)
|
||||
```
|
||||
|
||||
Teacher server configs are passed separately at init, just like the primary
|
||||
`server_configs`:
|
||||
|
||||
```python
|
||||
env = MyTeacherEnv(
|
||||
config=env_config,
|
||||
server_configs=student_server_configs,
|
||||
teacher_server_configs=[
|
||||
APIServerConfig(
|
||||
base_url="http://localhost:9003/v1",
|
||||
model_name="Qwen/Qwen3-30B-A3B-Instruct-2507",
|
||||
api_key="",
|
||||
server_type="vllm",
|
||||
tokenizer_name="Qwen/Qwen3-30B-A3B-Instruct-2507",
|
||||
)
|
||||
],
|
||||
)
|
||||
```
|
||||
|
||||
You can either:
|
||||
|
||||
- build a teacher-enabled env by mixing `TeacherDistillationEnv` into an existing
|
||||
`BaseEnv`-derived env such as `GSM8kEnv`, or
|
||||
- subclass `TeacherDistillationEnv` directly and implement the usual environment
|
||||
methods yourself.
|
||||
|
||||
In both cases, `TeacherDistillationEnv` still assumes the normal `BaseEnv`
|
||||
runtime contract: tokenized rollouts, `ScoredDataGroup` payloads, and the
|
||||
standard `handle_send_to_api(...)` transport path.
|
||||
|
||||
CLI shape:
|
||||
|
||||
```bash
|
||||
--env.teacher_enabled true \
|
||||
--teacher.base_url "http://localhost:9003/v1" \
|
||||
--teacher.model_name "Qwen/Qwen3-30B-A3B-Instruct-2507" \
|
||||
--teacher.server_type vllm \
|
||||
--env.teacher_top_k 8
|
||||
```
|
||||
|
||||
If `--teacher.model_name` is a deployment alias rather than a tokenizer
|
||||
identifier, also set `--teacher.tokenizer_name ...` so the env can validate
|
||||
tokenizer compatibility.
|
||||
|
||||
Scope note:
|
||||
|
||||
- The teacher-aware CLI wiring currently exists for `serve`.
|
||||
- If `teacher_enabled=True`, the generic `process` and `evaluate` commands will
|
||||
fail loudly at env construction time unless you instantiate the env yourself
|
||||
and pass `teacher_server_configs=...`.
|
||||
|
||||
Tokenizer requirement:
|
||||
|
||||
- Teacher distillation currently requires the teacher and student to use the same tokenizer vocabulary.
|
||||
- If the tokenizers do not match, `TeacherDistillationEnv` raises an error instead of attempting token conversion.
|
||||
|
||||
Why same-tokenizer is required:
|
||||
|
||||
- `distill_token_ids` are consumed as student-vocabulary IDs by the trainer.
|
||||
- If the teacher uses a different vocabulary, the same integer token ID refers to different text on the teacher and student sides.
|
||||
- A decode/re-tokenize/remap pipeline is not a safe drop-in fix because it changes both token positions and token identities, which breaks the exact per-position token supervision that the current distillation loss assumes.
|
||||
|
||||
---
|
||||
|
||||
## Testing and Debugging Tools
|
||||
|
||||
The trajectory-handler provides several debugging tools to help environment developers test and understand their environments locally without requiring the full distributed infrastructure.
|
||||
|
||||
* **Flexible Model Provider Support:** Atropos natively supports any model provider that adheres to the OpenAI API standard. Simply provide the provider's base URL and your API key, and Atropos can integrate with their models seamlessly for testing or running environments locally.
|
||||
|
||||
After launching the API and your selected environments (e.g. `run-api & python environments/gsm8k_server.py serve`), you are then able to view them to get a quick look, or try to prepare some datasets for some offline training:
|
||||
|
||||
* **View Run (`view-run`):** Launch a Gradio UI to inspect batches of rollouts generated by your environment runs. This is useful for visually debugging the interactions and data flow.
|
||||
* **Offline Data Generation:** Use `atropos-sft-gen` and `atropos-dpo-gen` to collect rollouts from environments and convert them into formats suitable for Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO).
|
||||
|
||||
### In-depth Local Environment Analysis with `process`
|
||||
|
||||
For developers looking to inspect and debug a single environment without the overhead of the `run-api` server or a full training loop, Atropos environments offer a `process` subcommand. This mode performs inference-only rollouts, meaning it runs your model within the environment to generate interactions, but does not perform any model training or updates.
|
||||
|
||||
The `process` subcommand executes the environment's full data pipeline:
|
||||
|
||||
1. **Generation:** Produces model responses based on inputs from the environment.
|
||||
2. **Parsing:** Processes these raw model outputs into a structured format.
|
||||
3. **Scoring:** Applies the environment's reward logic to evaluate the quality of the generated responses.
|
||||
|
||||
**Outputs and Visualization:**
|
||||
|
||||
When you specify a path to save the generated data using the `--env.data_path_to_save_groups your_output_file.jsonl` argument (or a similar argument defined by the specific environment, check with `--help`), the `process` command provides several benefits:
|
||||
|
||||
* **JSONL Output:** Saves all generated rollout groups, including prompts, responses, and scores, to the specified `.jsonl` file. This data can be useful for detailed offline analysis and debugging.
|
||||
* **Static HTML Visualization:** Automatically generates a corresponding `.html` file (e.g., `your_output_file.html`) that provides a user-friendly, browser-based view of the rollouts contained in the JSONL file. This is excellent for quickly understanding model behavior and identifying issues.
|
||||
* **WandB Logging:** If Weights & Biases (`use_wandb=True`) is enabled in your environment's configuration, the `process` subcommand will also log the run data, metrics, and generated rollouts to your WandB dashboard, allowing for persistent tracking and comparison even for these inference-only runs.
|
||||
|
||||
**Example Usage:**
|
||||
|
||||
To run the `process` subcommand for an environment like `gsm8k_server.py` and save the outputs:
|
||||
|
||||
```sh
|
||||
python environments/gsm8k_server.py process --env.data_path_to_save_groups gsm8k_rollouts.jsonl
|
||||
```
|
||||
|
||||
This will create `gsm8k_rollouts.jsonl` and `gsm8k_rollouts.html`.
|
||||
|
||||
**Customization:**
|
||||
|
||||
You can customize the inference endpoint and other parameters for the `process` subcommand. For example, to use a different model or API endpoint:
|
||||
|
||||
```sh
|
||||
python environments/gsm8k_server.py process \
|
||||
--env.data_path_to_save_groups gsm8k_rollouts.jsonl \
|
||||
--env.my_custom_field "value" \
|
||||
--openai.base_url https://your-custom-api-url/v1 \
|
||||
--openai.api_key YOUR_API_KEY \
|
||||
--openai.model_name your_model_identifier
|
||||
```
|
||||
|
||||
You can add custom fields to the `env` namespace by returning a custom subclass of BaseEnvConfig in `config_init` [[example](https://github.com/NousResearch/atropos/blob/bdb15e5d85ddcf8a6ede352977719df442e60a22/environments/math_server.py#L181)].
|
||||
|
||||
Always refer to the specific environment script's help for all available options:
|
||||
|
||||
```sh
|
||||
python environments/your_environment_script.py process --help
|
||||
```
|
||||
|
||||
### Environment Evaluation with `evaluate`
|
||||
|
||||
For running evaluation on environments, Atropos provides an `evaluate` subcommand that calls the environment's `evaluate` method:
|
||||
|
||||
```sh
|
||||
python gsm8k_server.py evaluate \
|
||||
--openai.base_url https://openrouter.ai/api/v1 \
|
||||
--openai.api_key $OPENROUTER_API_KEY \
|
||||
--openai.model_name qwen/qwen3-14b
|
||||
```
|
||||
|
||||
### Offline Data Generation Quick Start
|
||||
|
||||
Run the following commands in **separate terminals**, in this order:
|
||||
|
||||
**Terminal 1** — Start the API server first (must be running before environments connect):
|
||||
```sh
|
||||
run-api
|
||||
```
|
||||
|
||||
**Terminal 2** — Start an environment:
|
||||
```sh
|
||||
python gsm8k_server.py serve --slurm False # or an env of your choice
|
||||
```
|
||||
|
||||
**Terminal 3** — Generate data:
|
||||
```sh
|
||||
atropos-sft-gen path/to/output.jsonl --tokenizer Qwen/Qwen2.5-1.5B-Instruct # or whichever tokenizer you have in your env config
|
||||
```
|
||||
Rejection sampling can be controlled via `--save-top-n-per-group`, `--allow-negative-scores`, and `--minimum-score-diff-max-min`. See `atropos-sft-gen -h` for more detailed usage info.
|
||||
|
||||
If you would like to use OpenAI models, please edit your `config_init` to something like the following:
|
||||
```python
|
||||
@classmethod
|
||||
def config_init(cls) -> Tuple[BaseEnvConfig, List[APIServerConfig]]:
|
||||
env_config = BaseEnvConfig(
|
||||
tokenizer_name="Qwen/Qwen2.5-1.5B-Instruct",
|
||||
group_size=8,
|
||||
use_wandb=True,
|
||||
rollout_server_url="http://localhost:8000",
|
||||
total_steps=1000,
|
||||
batch_size=12,
|
||||
steps_per_eval=100,
|
||||
max_token_length=2048,
|
||||
wandb_name="gsm8k",
|
||||
)
|
||||
server_configs = [
|
||||
APIServerConfig(
|
||||
model_name="gpt-4.1-nano",
|
||||
base_url=None,
|
||||
api_key=os.environ.get("OPENAI_API_KEY"),
|
||||
num_requests_for_eval=256,
|
||||
),
|
||||
]
|
||||
|
||||
return env_config, server_configs
|
||||
```
|
||||
|
||||
For DPO, replace `atropos-sft-gen` with `atropos-dpo-gen` and check `atropos-dpo-gen -h` for data filtering and saving options.
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
**`Address already in use` when running `run-api`**
|
||||
|
||||
Port 8000 is already occupied. Either stop the existing process or specify a different port:
|
||||
|
||||
```bash
|
||||
# Find and stop the process using port 8000
|
||||
lsof -ti:8000 | xargs kill -9
|
||||
|
||||
# Or use a different port
|
||||
run-api --port 8001
|
||||
```
|
||||
|
||||
**`ModuleNotFoundError` or dependency conflicts**
|
||||
|
||||
Ensure you're using a clean virtual environment with the correct Python version:
|
||||
|
||||
```bash
|
||||
python -m venv .venv
|
||||
source .venv/bin/activate # On Windows: .venv\Scripts\activate
|
||||
pip install -e ".[dev]"
|
||||
```
|
||||
|
||||
**`OPENAI_API_KEY` not set errors**
|
||||
|
||||
Set your API key as an environment variable, or configure it in the environment's `config_init`:
|
||||
|
||||
```bash
|
||||
export OPENAI_API_KEY="your-key-here"
|
||||
```
|
||||
|
||||
**Out of memory (OOM) when running environments locally**
|
||||
|
||||
Use a smaller model for local development and testing. For example, configure `model_name` to a lightweight model like `gpt-4.1-nano` with an OpenAI API key, or use a quantized local model with vLLM.
|
||||
|
||||
**Environment not connecting to the API server**
|
||||
|
||||
Ensure `run-api` is running before starting any environments. By default, environments connect to `http://localhost:8000`. If your API is on a different host or port, update `rollout_server_url` in your environment's config.
|
||||
|
||||
---
|
||||
|
||||
## Citation
|
||||
|
||||
If you have found the library helpful in your work, you can cite this repository as:
|
||||
|
||||
```latex
|
||||
@misc{atropos,
|
||||
title = {Atropos: An Async First Environment Rollout Controller},
|
||||
author = {Mahan, Dakota and Jin, Roger and Teknium and Sands, Shannon and Yatsenko, Artem and Suphavadeeprasit, Jai and Malhotra, Karan and Guang, Chen and Li, Joe},
|
||||
howpublished = {\url{https://www.github.com/NousResearch/atropos}},
|
||||
year = {2025},
|
||||
month = {apr},
|
||||
note = {Version 0.3.0},
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Contributing
|
||||
|
||||
Atropos is built by the open-source AI community, and relies on our amazing contributors! Please see our [contributing](CONTRIBUTING.md) guide for more details on our code formatting, testing, etc.
|
||||
Please follow the [Code of Conduct](CODE_OF_CONDUCT.md).
|
||||
|
||||
---
|
||||
|
||||
## License
|
||||
Atropos uses the MIT license, see the [LICENSE](LICENSE) file here for more information
|
||||
@@ -0,0 +1,188 @@
|
||||
<div align="center">
|
||||
<!-- <img src="https://github.com/allenai/OLMo/assets/8812459/774ac485-a535-4768-8f7c-db7be20f5cc3" width="300"/> -->
|
||||
<img src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/olmo2/olmo.png" alt="OLMo Logo" width="280" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
|
||||
<br>
|
||||
<h1>OLMo-core</h1>
|
||||
<h4>Building blocks for OLMo modeling and training</h4>
|
||||
</div>
|
||||
<p align="center">
|
||||
<a href="https://olmo-core.readthedocs.io/en/latest/">
|
||||
<img alt="Docs" src="https://img.shields.io/badge/API-docs-red">
|
||||
</a>
|
||||
<a href="https://github.com/allenai/OLMo-core/tree/main/src/examples">
|
||||
<img alt="Examples" src="https://img.shields.io/badge/API-examples-994B00">
|
||||
</a>
|
||||
<a href="https://github.com/allenai/OLMo-core/releases/tag/v1.9.0">
|
||||
<img alt="Pypi" src="https://img.shields.io/pypi/v/ai2-olmo-core.svg">
|
||||
</a>
|
||||
<a href="https://github.com/allenai/OLMo-core/blob/main/LICENSE">
|
||||
<img alt="GitHub License" src="https://img.shields.io/github/license/allenai/OLMo">
|
||||
</a>
|
||||
<a href="https://arxiv.org/pdf/2501.00656.pdf">
|
||||
<img alt="Paper URL" src="https://img.shields.io/badge/arxiv-2402.00838-orange">
|
||||
</a>
|
||||
<a href="https://playground.allenai.org">
|
||||
<img alt="Playground" src="https://img.shields.io/badge/Ai2-Playground-F0529C">
|
||||
</a>
|
||||
<a href="https://discord.gg/sZq3jTNVNG">
|
||||
<img alt="Discord" src="https://img.shields.io/badge/Discord%20-%20blue?style=flat&logo=discord&label=Ai2&color=%235B65E9">
|
||||
</a>
|
||||
</p>
|
||||
|
||||
## Installation
|
||||
|
||||
First install [PyTorch](https://pytorch.org) according to the instructions specific to your operating system and hardware.
|
||||
|
||||
For development, we recommend installing from source:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/allenai/OLMo-core.git
|
||||
cd OLMo-core
|
||||
pip install -e .[all]
|
||||
```
|
||||
Or you can install from PyPI with:
|
||||
|
||||
```bash
|
||||
pip install ai2-olmo-core
|
||||
```
|
||||
|
||||
There are a number of optional dependencies that must be installed to use certain functionality as well, including:
|
||||
|
||||
- [flash-attn](https://github.com/Dao-AILab/flash-attention), [ring-flash-attn](https://github.com/zhuzilin/ring-flash-attention), and [TransformerEngine](https://github.com/NVIDIA/TransformerEngine) for the corresponding attention backends.
|
||||
- [Liger-Kernel](https://github.com/linkedin/Liger-Kernel) for a low-memory "fused-linear" loss implementation.
|
||||
- [torchao](https://github.com/pytorch/ao) for float8 training.
|
||||
- [grouped_gemm](https://github.com/tgale96/grouped_gemm) for dropless mixture-of-experts (MoE) models. You may need to compile from source until [PR #21](https://github.com/tgale96/grouped_gemm/pull/21) is released (post v0.1.6).
|
||||
- [QuACK](https://github.com/Dao-AILab/quack) for some CuTe-based kernels.
|
||||
|
||||
The published [Docker images](https://github.com/orgs/allenai/packages?repo_name=OLMo-core) contain all core and optional dependencies, and are regularly tested on our in-house H100 clusters.
|
||||
But there are several things to keep in mind if you intend to use these images:
|
||||
|
||||
- They do not come with the OLMo-core package installed, only its dependencies, to accommodate for regular code changes.
|
||||
- They may not work on your own cluster if you have different hardware or driver/CUDA versions.
|
||||
|
||||
If the published images do not work for your use-case for any of the above reasons, you could adapt our [Dockerfile](https://github.com/allenai/OLMo-core/blob/main/src/Dockerfile) to build your own images.
|
||||
|
||||
## Official training scripts
|
||||
|
||||
Official training scripts for released models can be found in [`src/scripts/official/`](https://github.com/allenai/OLMo-core/tree/main/src/scripts/official).
|
||||
|
||||
These scripts are meant to be launched with ``torchrun``, or with OLMo-core's Beaker launch CLI if you have access to Beaker.
|
||||
|
||||
For example:
|
||||
|
||||
```bash
|
||||
torchrun --nproc-per-node=8 src/scripts/official/OLMo2/OLMo-2-0325-32B-train.py \
|
||||
--save-folder=/path/to/save/checkpoints
|
||||
```
|
||||
|
||||
You can override most configuration options from the command-line. For example, to override the learning rate you could launch the script like this:
|
||||
|
||||
```bash
|
||||
torchrun --nproc-per-node=8 src/scripts/official/OLMo2/OLMo-2-0325-32B-train.py \
|
||||
--save-folder=/path/to/save/checkpoints \
|
||||
--train_module.optim.lr=6e-3
|
||||
```
|
||||
|
||||
To continue annealing from a checkpoint, we use a separate script which can be launched like this:
|
||||
|
||||
```bash
|
||||
torchrun --nproc-per-node=8 src/scripts/official/OLMo2/OLMo-2-0325-32B-anneal.py \
|
||||
--save-folder=/path/to/save/checkpoints \
|
||||
--checkpoint=https://storage.googleapis.com/ai2-llm/peteish32/step721901
|
||||
```
|
||||
|
||||
### Available Training Scripts
|
||||
|
||||
| Model Family | Directory | Description |
|
||||
|--------------|-----------|-------------|
|
||||
| **OLMo-2** | [`src/scripts/official/OLMo2/`](https://github.com/allenai/OLMo-core/tree/main/src/scripts/official/OLMo2) | Training scripts and model card for OLMo-2 32B models |
|
||||
| **OLMo-3** | [`src/scripts/official/OLMo3/`](https://github.com/allenai/OLMo-core/tree/main/src/scripts/official/OLMo3) | Training scripts and model cards for OLMo-3 7B and 32B models |
|
||||
|
||||
## Inference
|
||||
|
||||
### With Hugging Face Transformers
|
||||
|
||||
You can use our Hugging Face [transformers](https://github.com/huggingface/transformers) integration to run inference on the OLMo checkpoints:
|
||||
|
||||
```bash
|
||||
pip install transformers>=4.57.0
|
||||
```
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
olmo = AutoModelForCausalLM.from_pretrained("allenai/Olmo-3-1125-32B")
|
||||
tokenizer = AutoTokenizer.from_pretrained("allenai/Olmo-3-1125-32B")
|
||||
message = ["Language modeling is "]
|
||||
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
|
||||
# inputs = {k: v.to('cuda') for k,v in inputs.items()} # optional verifying cuda
|
||||
# olmo = olmo.to('cuda')
|
||||
response = olmo.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=1.0, top_p=0.7)
|
||||
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
|
||||
```
|
||||
|
||||
Alternatively, with the Hugging Face pipeline abstraction:
|
||||
|
||||
```python
|
||||
from transformers import pipeline
|
||||
olmo_pipe = pipeline("text-generation", model="allenai/Olmo-3-1125-32B")
|
||||
print(olmo_pipe("Language modeling is"))
|
||||
```
|
||||
|
||||
### With vLLM
|
||||
|
||||
[vLLM](https://docs.vllm.ai/en/latest/) provides high-throughput inference for OLMo models. You can use it for offline batched inference:
|
||||
|
||||
```bash
|
||||
pip install vllm>=0.11.0
|
||||
```
|
||||
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
llm = LLM(model="allenai/Olmo-3-1125-32B")
|
||||
sampling_params = SamplingParams(temperature=1.0, top_p=0.7)
|
||||
prompts = ["Language modeling is"]
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
for output in outputs:
|
||||
prompt = output.prompt
|
||||
generated_text = output.outputs[0].text
|
||||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
```
|
||||
|
||||
For more details, see the [vLLM documentation](https://docs.vllm.ai/en/latest/getting_started/quickstart/#offline-batched-inference).
|
||||
|
||||
### With Olmo-core (beta)
|
||||
|
||||
Autoregressive generation is supported directly in Olmo-core. Using this capability, we provide a chat-loop demo that can be used to interact with models in an interactive chat session:
|
||||
|
||||
```bash
|
||||
python -m olmo_core.generate.chat https://olmo-checkpoints.org/ai2-llm/Olmo-3-1025-7B/stage3/step11921/ --max-new-tokens 512
|
||||
```
|
||||
|
||||
## Evaluation
|
||||
|
||||
Additional tools for evaluating OLMo models are available at the [OLMo Eval](https://github.com/allenai/OLMo-eval) and [olmes](https://github.com/allenai/olmes) repositories.
|
||||
|
||||
## Development
|
||||
|
||||
The Python library source code is located in `src/olmo_core`. The corresponding tests are located in `src/test`. The library docs are located in `docs`. You can build the docs locally with `make docs`.
|
||||
|
||||
Code checks:
|
||||
|
||||
- We use `pytest` to run tests. You can run all tests with `pytest -v src/test`. You can also point `pytest` at a specific test file to run it individually.
|
||||
- We use `isort` and `black` for code formatting. Ideally you should integrate these into your editor, but you can also run them manually or configure them with a pre-commit hook. To validate that all files are formatted correctly, run `make style-check`.
|
||||
- We use `ruff` as our primary linter. You can run it with `make lint-check`.
|
||||
- We use `mypy` as our type checker. You can run it with `make type-check`.
|
||||
|
||||
## Citing
|
||||
|
||||
```bibtex
|
||||
@misc{olmo20242olmo2furious,
|
||||
title={{2 OLMo 2 Furious}},
|
||||
author={{Team OLMo} and Pete Walsh and Luca Soldaini and Dirk Groeneveld and Kyle Lo and Shane Arora and Akshita Bhagia and Yuling Gu and Shengyi Huang and Matt Jordan and Nathan Lambert and Dustin Schwenk and Oyvind Tafjord and Taira Anderson and David Atkinson and Faeze Brahman and Christopher Clark and Pradeep Dasigi and Nouha Dziri and Michal Guerquin and Hamish Ivison and Pang Wei Koh and Jiacheng Liu and Saumya Malik and William Merrill and Lester James V. Miranda and Jacob Morrison and Tyler Murray and Crystal Nam and Valentina Pyatkin and Aman Rangapur and Michael Schmitz and Sam Skjonsberg and David Wadden and Christopher Wilhelm and Michael Wilson and Luke Zettlemoyer and Ali Farhadi and Noah A. Smith and Hannaneh Hajishirzi},
|
||||
year={2024},
|
||||
eprint={2501.00656},
|
||||
archivePrefix={arXiv},
|
||||
primaryClass={cs.CL},
|
||||
url={https://arxiv.org/abs/2501.00656},
|
||||
}
|
||||
```
|
||||
File diff suppressed because it is too large
Load Diff
File diff suppressed because one or more lines are too long
File diff suppressed because it is too large
Load Diff
File diff suppressed because it is too large
Load Diff
File diff suppressed because one or more lines are too long
@@ -0,0 +1,986 @@
|
||||
Title: 2205.01068v4.pdf
|
||||
|
||||
URL Source: https://arxiv.org/pdf/2205.01068
|
||||
|
||||
Published Time: Mon, 23 Jan 2023 14:35:50 GMT
|
||||
|
||||
Number of Pages: 30
|
||||
|
||||
Markdown Content:
|
||||
# OPT: Open Pre-trained Transformer Language Models
|
||||
|
||||
Susan Zhang ∗∗
|
||||
|
||||
, Stephen Roller ∗
|
||||
|
||||
, Naman Goyal ∗
|
||||
|
||||
,
|
||||
|
||||
Mikel Artetxe , Moya Chen , Shuohui Chen , Christopher Dewan , Mona Diab , Xian Li ,
|
||||
|
||||
Xi Victoria Lin , Todor Mihaylov , Myle Ott ††
|
||||
|
||||
, Sam Shleifer †
|
||||
|
||||
, Kurt Shuster , Daniel Simig ,
|
||||
|
||||
Punit Singh Koura , Anjali Sridhar , Tianlu Wang , Luke Zettlemoyer
|
||||
|
||||
Meta AI
|
||||
|
||||
{susanz,roller,naman}@fb.com
|
||||
|
||||
Abstract
|
||||
|
||||
Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their com-putational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no ac-cess is granted to the full model weights, mak-ing them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is com-parable to GPT-3, 1 while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastruc-ture challenges we faced, along with code for experimenting with all of the released models.
|
||||
|
||||
1 Introduction
|
||||
|
||||
Large language models (LLMs) trained on massive text collections have shown surprising emergent capabilities to generate text and perform zero- and few-shot learning (Brown et al., 2020; Lieber et al., 2021; Smith et al., 2022; Rae et al., 2021; Chowd-hery et al., 2022). While in some cases the public can interact with these models through paid APIs, full model access is currently limited to only a few highly resourced labs. 2 This restricted access has limited researchers’ ability to study how and why these large language models work, hindering
|
||||
|
||||
> ∗Equal contribution.
|
||||
> †Work done while at Meta AI.
|
||||
> 1Following Brown et al. (2020), we use GPT-3 to refer to both the 175B model and the smaller scale models as well.
|
||||
> 2Exceptions include work by EleutherAI, who released dense models up to 20B in size (Black et al., 2022), Salesforce (Nijkamp et al., 2022), and Meta AI, who re-leased dense models up to 13B and sparse models up to 1.1T (Artetxe et al., 2021). There is also ongoing work from the BigScience workshop ( https://bigscience. huggingface.co/ ), which aims to open source very large multilingual language models and datasets.
|
||||
|
||||
progress on improving known challenges in areas such as robustness, bias, and toxicity. In this technical report, we present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We train the OPT models to roughly match the per-formance and sizes of the GPT-3 class of models, while also applying the latest best practices in data collection and efficient training. Our aim in de-veloping this suite of OPT models is to enable re-producible and responsible research at scale, and to bring more voices to the table in studying the impact of these LLMs. Definitions of risk, harm, bias, and toxicity, etc., should be articulated by the collective research community as a whole, which is only possible when models are available for study. We are releasing all of our models between 125M and 66B parameters, and will provide full research access to OPT-175B upon request. Ac-cess will be granted to academic researchers; those affiliated with organizations in government, civil society, and academia; and those in industry re-search laboratories. We are also releasing both the logbook of our model creation as well as our code-base, metaseq, 3 which enabled training OPT-175B on 992 80GB A100 GPUs, reaching 147 TFLOP/s utilization per GPU. From this implementation, and from using the latest generation of NVIDIA hard-ware, we are able to develop OPT-175B using only 1/7th the carbon footprint of GPT-3. While this is a significant achievement, the energy cost of creating such a model is still nontrivial, and repeated efforts to replicate a model of this size will only amplify the growing compute footprint of these LLMs. We believe the entire AI community — aca-demic researchers, civil society, policymakers, and industry — must work together to develop clear
|
||||
|
||||
> 3https://github.com/facebookresearch/ metaseq
|
||||
> arXiv:2205.01068v4 [cs.CL] 21 Jun 2022
|
||||
|
||||
Model #L #H dmodel LR Batch
|
||||
|
||||
125M 12 12 768 6.0e−4 0.5M 350M 24 16 1024 3.0e−4 0.5M 1.3B 24 32 2048 2.0e−4 1M 2.7B 32 32 2560 1.6e−4 1M 6.7B 32 32 4096 1.2e−4 2M 13B 40 40 5120 1.0e−4 4M 30B 48 56 7168 1.0e−4 4M 66B 64 72 9216 0.8e−4 2M 175B 96 96 12288 1.2e−4 2M
|
||||
|
||||
> Table 1: Model architecture details. We report the number of layers (#L), number of attention heads (#H), and the embedding size (d model ). We also report the peak Learning Rate (LR) and global batch size in num-ber of tokens (Batch).
|
||||
|
||||
guidelines around responsible AI in general and responsible LLMs in particular, given their cen-trality in many downstream language applications. A much broader segment of the AI community needs access to these models in order to conduct reproducible research and collectively drive the field forward. With the release of OPT-175B and smaller-scale baselines, we hope to increase the di-versity of voices defining the ethical considerations of such technologies.
|
||||
|
||||
2 Method
|
||||
|
||||
2.1 Models
|
||||
|
||||
We present results on eight Transformer language models ranging from 125 million to 175 billion parameters. Architectural details are displayed in Table 1. In the interest of transparency, and to re-duce risk of training instabilities, our models and hyperparameters largely follow Brown et al. (2020), with variations in batch size mostly to obtain in-creased computational efficiency.
|
||||
|
||||
2.2 Training Setup
|
||||
|
||||
For weight initialization, we follow the same set-tings provided in the Megatron-LM codebase, 4 us-ing a normal distribution with zero mean and stan-dard deviation of 0.006. Standard deviation for output layers are scaled by a 1.0/√2L term where
|
||||
|
||||
L is the total number of layers. All bias terms are initialized as 0, and all models are trained with ReLU activation and a sequence length of 2048.
|
||||
|
||||
> 4https://github.com/NVIDIA/ Megatron-LM/blob/main/examples/pretrain_ gpt3_175B.sh
|
||||
|
||||
We use an AdamW optimizer (Loshchilov and Hutter, 2017) with (β1, β 2) set to (0 .9, 0.95) , and weight decay of 0.1. We follow a linear learning rate schedule, warming up from 0 to the maximum learning rate over the first 2000 steps in OPT-175B, or over 375M tokens in our smaller baselines, and decaying down to 10% of the maximum LR over 300B tokens. A number of mid-flight changes to LR were also required (see Section 2.5). Our batch sizes range from 0.5M to 4M depending on the model size (see Table 1) and is kept constant throughout the course of training. We use a dropout of 0.1 throughout, but we do not apply any dropout to embeddings. We clip gradient norms at 1.0, except for some mid-flight changes that reduce this threshold down from 1.0 to 0.3 (see Section 2.5). We also in-clude a gradient predivide factor to reduce the risk of over/underflows when computing the gradient across all ranks (splitting the division by the world size of N into two division operations by √N ).
|
||||
|
||||
2.3 Pre-training Corpus
|
||||
|
||||
The pre-training corpus contains a concatenation of datasets used in RoBERTa (Liu et al., 2019b), the Pile (Gao et al., 2021a), and PushShift.io Red-dit (Baumgartner et al., 2020; Roller et al., 2021). All corpora were previously collected or filtered to contain predominantly English text, but a small amount of non-English data is still present within the corpus via CommonCrawl. We removed duplicated documents across all datasets by filtering out documents via Min-hashLSH (Rajaraman and Ullman, 2011) with a Jaccard similarity ≥ .95 . We found the Pile was particularly full of duplicate documents, and ad-vise future researchers using the Pile to perform additional de-duplication processing. We tokenize all corpora using the GPT-2 byte level BPE tokenizer (Sennrich et al., 2016; Radford et al., 2019; Brown et al., 2020). Our final corpus contains roughly 180B tokens.
|
||||
|
||||
RoBERTa We included the BookCorpus (Zhu et al., 2015) and Stories (Trinh and Le, 2018) sub-sets of the RoBERTa corpus and utilized an up-dated version of CCNews, containing news stories crawled through September 28, 2021. This CC-News v2 corpus was preprocessed the same way as the original RoBERTa CCNews (Liu et al., 2019b).
|
||||
|
||||
The Pile We included a subset of the Pile (Gao et al., 2021a), including: CommonCrawl, DM Mathematics, Project Gutenberg, Hack-erNews, OpenSubtitles, OpenWebText2, USPTO and Wikipedia. Other subsets of the Pile were elim-inated as we found they increased the risk of insta-bilities, as measured by tendency to cause spikes in gradient norms at the 1.3B scale, or were other-wise deemed unsuitable. All subsets went through additional ad-hoc whitespace normalization.
|
||||
|
||||
PushShift.io Reddit We included a subset of the Pushshift.io corpus produced by Baumgart-ner et al. (2020) and previously used by Roller et al. (2021). To convert the conversational trees into language-model-accessible documents, we ex-tracted the longest chain of comments in each thread and discarded all other paths in the tree. This reduced the corpus by about 66%.
|
||||
|
||||
2.4 Training Efficiency
|
||||
|
||||
We trained OPT-175B on 992 80GB A100 GPUs, by utilizing Fully Sharded Data Parallel (Artetxe et al., 2021) with Megatron-LM Tensor Parallelism (Shoeybi et al., 2019). We achieve utilization of up to 147 TFLOP/s per GPU. We keep Adam state in FP32, since we shard it across all hosts, while the model weights remained in FP16. To avoid under-flows, we used dynamic loss scaling, as described in Micikevicius et al. (2017).
|
||||
|
||||
2.5 Training Processes
|
||||
|
||||
Here we describe significant training process ad-justments that arose during OPT-175B pre-training.
|
||||
|
||||
Hardware Failures We faced a significant num-ber of hardware failures in our compute cluster while training OPT-175B. In total, hardware fail-ures contributed to at least 35 manual restarts and the cycling of over 100 hosts over the course of 2 months. During manual restarts, the training run was paused, and a series of diagnostics tests were conducted to detect problematic nodes. Flagged nodes were then cordoned off and training was re-sumed from the last saved checkpoint. Given the difference between the number of hosts cycled out and the number of manual restarts, we estimate 70+ automatic restarts due to hardware failures.
|
||||
|
||||
Loss Divergences Loss divergences were also an issue in our training run. When the loss diverged, we found that lowering the learning rate and restart-ing from an earlier checkpoint allowed for the job to recover and continue training. We noticed a cor-relation between loss divergence, our dynamic loss 0k 20k 40k 60k 80k 100k 120k 140k Iterations
|
||||
|
||||
> 0.0e-4
|
||||
> 0.2e-4
|
||||
> 0.4e-4
|
||||
> 0.6e-4
|
||||
> 0.8e-4
|
||||
> 1.0e-4
|
||||
> 1.2e-4 Learning Rate
|
||||
> Empirical Learning Rate
|
||||
|
||||
Figure 1: Empirical LR schedule. We found that low-ering learning rate was helpful for avoiding instabili-ties. 0k 20k 40k 60k 80k 100k 120k 140k Iterations
|
||||
|
||||
> 7.0
|
||||
> 7.5
|
||||
> 8.0
|
||||
> 8.5
|
||||
> 9.0
|
||||
> 9.5
|
||||
> 10.0 Perplexity
|
||||
> Validation Perplexity
|
||||
|
||||
Figure 2: Validation Perplexity. Our mid-flight LR changes had clear effects on validation perplexity.
|
||||
|
||||
scalar crashing to 0, and the l2-norm of the activa-tions of the final layer spiking. These observations led us to pick restart points for which our dynamic loss scalar was still in a “healthy” state ( ≥ 1.0), and after which our activation norms would trend downward instead of growing unboundedly. Our empirical LR schedule is shown in Figure 1. Early in training, we also noticed that lowering gradient clipping from 1.0 to 0.3 helped with stability; see our released logbook for exact details. Figure 2 shows our validation loss with respect to training iterations.
|
||||
|
||||
Other Mid-flight Changes We conducted anumber of other experimental mid-flight changes to handle loss divergences. These included: switch-ing to vanilla SGD (optimization plateaued quickly, and we reverted back to AdamW); resetting the dy-namic loss scalar (this helped recover some but not all divergences); and switching to a newer version of Megatron (this reduced pressure on activation norms and improved throughput). 3 Evaluations
|
||||
|
||||
3.1 Prompting & Few-Shot
|
||||
|
||||
We evaluate our model on 16 standard NLP tasks utilized in the literature: HellaSwag (Zellers et al., 2019), StoryCloze (Mostafazadeh et al., 2016), PIQA (Bisk et al., 2020), ARC Easy and Challenge (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), WinoGrad (Levesque et al., 2011), Wino-Grande (Sakaguchi et al., 2020), and SuperGLUE (Wang et al., 2019). We follow GPT-3 (Brown et al., 2020) by using their prompts and overall ex-perimental setup. We compare primarily to GPT-3, having aimed to re-implement their evaluation set-tings, but include reported performance of other LLMs on a per-task basis when available (Lieber et al., 2021; Rae et al., 2021; Hoffmann et al., 2022; Black et al., 2022) We report performance in accuracy (omitting F1 for MultiRC and ReCoRD for consistency in eval-uation metrics). For the Winograd Schema Chal-lenge (WSC) task in the SuperGLUE benchmark, we follow (Brown et al., 2020) and formulate the task as multiple choice questions, which is known to affect performance (Liu et al., 2020).
|
||||
|
||||
Zero-shot Overall average zero-shot perfor-mance across all 14 tasks may be seen in Figure 3. Overall, we see our average performance follows the trend of GPT-3. However, performance can vary radically across the tasks: for a full break-down, see Appendix A. Note that we intentionally removed MultiRC and WIC from these averages, as these datasets seem to systematically favor GPT-3 or OPT disproportionately. Our performance roughly matched GPT-3 for 10 tasks, and underperformed in 3 tasks (ARC Chal-lenge and MultiRC). In 3 tasks (CB, BoolQ, WSC), we find both GPT and OPT models display unpre-dictable behavior with respect to scale, likely due to the small size of the validation set in these 3 tasks (56, 277, and 104 examples, respectively). In WIC, we see that the OPT models always out-perform the GPT-3 models, though the numbers reported by Brown et al. (2020) also seem question-able, given WIC being a binary classification task. 5
|
||||
|
||||
For MultiRC, we are unable to replicate the GPT-3 results using the Davinci API 6 within our evalua-tion setup, suggesting differences in the methods
|
||||
|
||||
> 5Brown et al. (2020) reports 0% accuracy on WIC, which implies 100% accuracy if the classification was inverted.
|
||||
> 6https://beta.openai.com/docs/engines/ overview 10 810 910 10 10 11
|
||||
> Parameters
|
||||
> 50
|
||||
> 55
|
||||
> 60
|
||||
> 65
|
||||
> 70 Avg. Accuracy
|
||||
> Average across 14 NLP Tasks (Zero-Shot)
|
||||
> OPT GPT
|
||||
|
||||
Figure 3: Zero-shot NLP Evaluation Averages .Across a variety of tasks and model sizes, OPT largely matches the reported averages of GPT-3. However, per-formance varies greatly per task: see Appendix A. 10 8 10 9 10 10 10 11
|
||||
|
||||
> Parameters
|
||||
> 50
|
||||
> 55
|
||||
> 60
|
||||
> 65
|
||||
> 70
|
||||
> 75 Avg. Accuracy
|
||||
> Average across 14 NLP Tasks
|
||||
> Shot 0132 Series OPT GPT
|
||||
|
||||
Figure 4: Multi-shot performance . OPT perfor-mance for one- and few-shot lags behind GPT-3 mod-els, but performance depends heavily per task; see Ap-pendix A.
|
||||
|
||||
of evaluation on this task. For BoolQ and WSC, we note that both OPT and GPT models seem to hover around majority-class accuracy, suggesting small perturbations in probability masses may be dominating the evaluations. Chinchilla (Hoffmann et al., 2022) and Gopher (Rae et al., 2021) perform roughly consistently with others for their parameter sizes, while PaLM (Chowdhery et al., 2022) generally performs better across all settings, even when controlling for num-ber of parameters. We speculate the high perfor-mance of PaLM comes predominantly from higher quality and diversity of pre-training data.
|
||||
|
||||
One-shot and Few-shot Average multi-shot in-context performance is shown in Figure 4 (again, omitting MultiRC and WIC), with detailed perfor-mances shown in Appendix A. Across the average of all metrics, we find that OPT models perform similarly to GPT-3 models. However, as with zero-shot, breaking down these results per task shows a different story: in the same set of 10 datasets as zero-shot, we see similar performance across the two models. Some of the remaining datasets show inconsistent performance with respect to model size for both OPT and GPT-3 models (BoolQ, CB, WSC, RTE). In MultiRC, we consistently see un-derperformance of OPT models compared to GPT-3 models. Similar to our zero-shot evaluation, we hypothesize our one- and few-shot evaluation setup may differ significantly from Brown et al. (2020).
|
||||
|
||||
3.2 Dialogue
|
||||
|
||||
Given that LLMs are known to be an integral com-ponent of modern dialogue models (Adiwardana et al., 2020; Roller et al., 2021; Thoppilan et al., 2022; Rae et al., 2021; Chowdhery et al., 2022), we additionally evaluate OPT-175B on several open source dialogue datasets. In particular, we fol-low Roller et al. (2021), and evaluate on ConvAI2 (Dinan et al., 2020b), Wizard of Wikipedia (Di-nan et al., 2019b), Empathetic Dialogues (Rashkin et al., 2019), and Blended Skill Talk (Smith et al., 2020). We additionally evaluate on the more recent Wizard of Internet dataset (Komeili et al., 2021). We focus our comparisons primarily against ex-isting open source dialogue models including the fine-tuned BlenderBot 1 (Roller et al., 2021) and its pre-training counterpart Reddit 2.7B. We also compare against the fine-tuned R2C2 BlenderBot, a 2.7B parameter BlenderBot-like model trained by Shuster et al. (2022). We report Perplexity and Unigram F1 (UF1) overlap, following the metrics of the ConvAI2 com-petition (Dinan et al., 2020b). To control for dif-ferent tokenization in each of the models, we nor-malize all perplexities to be in the space of the GPT-2 tokenizer (Radford et al., 2019). We also note which models are supervised with respect to these dialogue tasks and which are unsupervised. For OPT-175B, all generations are performed using greedy decoding up to a maximum of 32 tokens. We do not attempt to prompt the model at all except for alternating “Person 1:” and “Person 2:” lines of dialogue. The remaining models use the generation parameters found in BlenderBot 1. Results are shown in Table 2. We see that OPT-175B significantly outperforms the also-unsupervised Reddit 2.7B model on all tasks, and performs competitively with the fully supervised BlenderBot 1 model, especially in the ConvAI2 dataset. On the Wizard-of-Internet dataset, which is fully unsupervised for all models, we see that OPT-175B obtains the lowest perplexity but still has lower UF1 than the models with Wizard-of-Wikipedia supervision. We were somewhat surprised that the evaluations of the unsupervised OPT-175B model were as com-petitive as BlenderBot 1 on the ConvAI2 dataset. This may indicate leakage of the ConvAI2 dataset into the general pre-training corpus or even into the validation data as evaluated in Table 2. To address concerns of leakage, we searched our pre-training corpus for the first conversation in the ConvAI2 dataset, but we did not find any overlap. We addi-tionally evaluated OPT-175B on the ConvAI2 hid-den test set, which has never been publicly released, and achieved 10.7 ppl and .185 UF1, matching the performance of the validation set. Furthermore, we evaluated OPT-175B on a subset of the ConvAI2-like MultiSessionChat (MSC) dataset (Xu et al., 2021b) and obtained a perplexity of 9.7 and UF1 of .177, indicating the model is generalizing well across multiple PersonaChat-like datasets. Since both MSC and WoI datasets were released after the CommonCrawl snapshot used in pre-training cor-pus, there is minimal risk of leakage. We conclude that OPT-175B has a strong ability to maintain a consistent persona across conversations, a behav-ior also highlighted in LaMDA (Thoppilan et al., 2022).
|
||||
|
||||
4 Bias & Toxicity Evaluations
|
||||
|
||||
To understand the potential harm of OPT-175B, we evaluate a series of benchmarks related to hate speech detection, stereotype awareness, and toxic content generation. While there may be shortcom-ings in these benchmarks (Blodgett et al., 2021; Ja-cobs and Wallach, 2021), these measurements pro-vide a first step towards understanding the limita-tions of OPT-175B. We compare primarily against GPT-3 Davinci, as these benchmarks were not yet available to be included in Brown et al. (2020).
|
||||
|
||||
4.1 Hate Speech Detection
|
||||
|
||||
Using the ETHOS dataset provided in Mollas et al. (2020) and instrumented by Chiu and Alexander (2021), we measure the ability of OPT-175B to identify whether or not certain English statements are racist or sexist (or neither). In the zero-, one-, Perplexity ( ↓) Unigram F1 ( ↑)
|
||||
|
||||
Model Eval C2 WW ED BST WoI C2 WW ED BST WoI
|
||||
|
||||
Reddit 2.7B Unsup. 18.9 21.0 11.6 17.4 18.0 .126 .133 .135 .133 .124 BlenderBot 1 Sup. 10.2 12.5 9.0 11.9 14.7 .183 .189 .192 .178 .154 R2C2 BlenderBot Sup. 10.5 12.4 9.1 11.7 14.6 .205 .198 .197 .186 .160
|
||||
|
||||
OPT-175B Unsup. 10.8 13.3 10.3 12.1 12.0 .185 .152 .149 .162 .147
|
||||
|
||||
> Table 2: Dialogue Evaluations. OPT-175B, in a fully unsupervised setting, performs competitively against fully supervised models.
|
||||
|
||||
Setup Davinci OPT-175B
|
||||
|
||||
Zero-shot .628 .667
|
||||
|
||||
One-shot .616 .713
|
||||
|
||||
Few-shot (binary) .354 .759
|
||||
|
||||
Few-shot (multiclass) .672 .812
|
||||
|
||||
> Table 3: Hate speech detection. F1 scores of detect-ing hate speech between Davinci and OPT-175B. OPT-175B considerably outperforms Davinci in all settings.
|
||||
|
||||
and few-shot binary cases, the model is presented with text and asked to consider whether the text is racist or sexist and provide a yes/no response. In the few-shot multiclass setting, the model is asked to provide a yes/no/neither response. Results are presented in Table 3. With all of our one-shot through few-shot configurations, OPT-175B performs considerably better than Davinci. We speculate this occurs from two sources: (1) evaluating via the Davinci API may be bringing in safety control mechanisms beyond the original 175B GPT-3 model used in Brown et al. (2020); and (2) the significant presence of unmoderated social media discussions in the pre-training dataset has provided additional inductive bias to aid in such classification tasks.
|
||||
|
||||
4.2 CrowS-Pairs
|
||||
|
||||
Developed for masked language models, CrowS-Pairs (Nangia et al., 2020) is a crowdsourced bench-mark aiming to measure intrasentence level biases in 9 categories: gender, religion, race/color, sex-ual orientation, age, nationality, disability, physical appearance, and socioeconomic status. Each exam-ple consists of a pair of sentences representing a stereotype, or anti-stereotype, regarding a certain group, with the goal of measuring model preference towards stereotypical expressions. Higher scores indicate higher bias exhibited by a model. Category GPT-3 OPT-175B
|
||||
|
||||
Gender 62.6 65.7 Religion 73.3 68.6
|
||||
|
||||
Race/Color 64.7 68.6 Sexual orientation 76.2 78.6 Age 64.4 67.8 Nationality 61.6 62.9 Disability 76.7 76.7
|
||||
|
||||
Physical appearance 74.6 76.2 Socioeconomic status 73.8 76.2 Overall 67.2 69.5
|
||||
|
||||
> Table 4: CrowS-Pairs evaluation. Lower is better for all categories, indicating more fairness. The OPT-175B model performs worse than Davinci in most categories.
|
||||
|
||||
When compared with Davinci in Table 4, OPT-175B appears to exhibit more stereotypical biases in almost all categories except for religion. Again, this is likely due to differences in training data; Nangia et al. (2020) showed that Pushshift.io Red-dit corpus has a higher incidence rate for stereo-types and discriminatory text than other corpora (e.g. Wikipedia). Given this is a primary data source for OPT-175B, the model may have learned more discriminatory associations, which directly impacts its performance on CrowS-Pairs.
|
||||
|
||||
4.3 StereoSet
|
||||
|
||||
Following Lieber et al. (2021) and Artetxe et al. (2021), we use StereoSet (Nadeem et al., 2021) to measure stereotypical bias across 4 categories: profession, gender, religion, and race. In addition to intrasentence measurement (similar to CrowS-Pairs), StereoSet includes measurement at the inter-sentence level to test a model’s ability to incorpo-rate additional context. To account for a potential trade-off between bias detection and language mod-eling capability, StereoSet includes two metrics: Category Davinci OPT-175B
|
||||
|
||||
Prof. LMS ( ↑) 78.4 74.1 SS ( ↓) 63.4 62.6
|
||||
|
||||
ICAT ( ↑) 57.5 55.4 Gend. LMS ( ↑) 75.6 74.0 SS ( ↓) 66.5 63.6
|
||||
|
||||
ICAT ( ↑) 50.6 53.8
|
||||
|
||||
Reli. LMS ( ↑) 80.8 84.0
|
||||
|
||||
SS ( ↓) 59.0 59.0
|
||||
|
||||
ICAT ( ↑) 66.3 68.9
|
||||
|
||||
Race LMS ( ↑) 77.0 74.9 SS ( ↓) 57.4 56.8
|
||||
|
||||
ICAT ( ↑) 65.7 64.8 Overall LMS ( ↑) 77.6 74.8 SS ( ↓) 60.8 59.9
|
||||
|
||||
ICAT ( ↑) 60.8 60.0
|
||||
|
||||
Table 5: StereoSet Evaluations . Davinci and OPT-175B perform similarly across all evaluations.
|
||||
|
||||
Language Modeling Score (LMS) and Stereotype Score (SS), which are then combined to form the Idealized Context Association Test score (ICAT). Unlike Lieber et al. (2021), we normalize scores by token count, rather than character count, which they report improves metrics for several models. Results are shown in Table 5. We see that Davinci and OPT-175B exhibit similar scores on aggregate (overall ICAT is very close between the two). In particular, Davinci outperforms in the areas of profession and race, while OPT-175B out-performs in the areas of Gender and Religion. OPT-175B performs better across the board on the SS metric, while Davinci generally outperforms on the LMS metric.
|
||||
|
||||
4.4 RealToxicityPrompts
|
||||
|
||||
We evaluate the tendency of OPT-175B to respond with toxic language via the RealToxicityPrompts (Gehman et al., 2020) dataset. Following PaLM (Chowdhery et al., 2022), we sample 25 genera-tions of 20 tokens using nucleus sampling (Holtz-man et al., 2020) ( p = 0 .9) for each of 10 , 000
|
||||
|
||||
randomly sampled prompts from RTP, and report mean toxicity probabilities of the continuations, stratified across bucketed toxicities of the original prompts. For comparison, we report bucketed toxi-city rates from Davinci and PaLM. Results are shown in Figure 5. Overall, we see 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Prompt Toxicity Probability (Binned)
|
||||
|
||||
> 0.00
|
||||
> 0.05
|
||||
> 0.10
|
||||
> 0.15
|
||||
> 0.20
|
||||
> 0.25
|
||||
> 0.30
|
||||
> 0.35
|
||||
> 0.40
|
||||
> 0.45 Toxicity Probability of Continuation (TPC)
|
||||
> Toxicity Probability of Prompt (TPP)
|
||||
> OPT 175B Davinci PaLM
|
||||
|
||||
Figure 5: RealToxicityPompts . OPT-175B is more likely to generate toxic responses than either Davinci or PaLM. Consistent with prior work, toxicity rates in-crease as prompt toxicity increases.
|
||||
|
||||
that OPT-175B has a higher toxicity rate than ei-ther PaLM or Davinci. We also observe that all 3 models have increased likelihood of generating toxic continuations as the toxicity of the prompt increases, which is consistent with the observations of Chowdhery et al. (2022). As with our exper-iments in hate speech detection, we suspect the inclusion of unmoderated social media texts in the pre-training corpus raises model familiarity with, and therefore propensity to generate and detect, toxic text. This strong awareness of toxic language may or may not be desirable depending on the specific requirements of downstream applications. Future applications of OPT-175B should consider this aspect of the model, and take additional miti-gations, or avoid usage entirely as appropriate.
|
||||
|
||||
4.5 Dialogue Safety Evaluations
|
||||
|
||||
Finally, we compare OPT-175B on two Dialogue Safety evaluations. The first, SaferDialogues (Ung et al., 2021), measures the ability to recover from explicit safety failures, usually in the form of apol-ogizing or recognizing its mistake. The second, the Safety Bench Unit Tests (Dinan et al., 2021), mea-sures how unsafe a model’s response is, stratified across 4 levels of topic sensitivity: Safe, Realis-tic, Unsafe, and Adversarial. As with the other dialogue evaluations (Section 3.2), we compare to several existing open source dialogue models. Results for both experiments are shown in Ta-ble 6. We observe that OPT-175B has similar per-formance as the Reddit 2.7B model across both SaferDialogues and the Unit Tests, with OPT-175B performing marginally better in the Safe and Adver-sarial settings. Consistent with Roller et al. (2021) Safe. Dia. Unit Tests ( ↓)
|
||||
|
||||
> Model PPL F1 Sa Re Un Ad
|
||||
> Reddit 2.7B 16.2 .140 .300 .261 .450 .439 BlenderBot 1 12.4 .161 .028 .150 .250 .194
|
||||
> R2C2 BlenderBot 13.8 .160 .022 .133 .289 .222 OPT-175B 14.7 .141 .033 .261 .567 .283
|
||||
> Table 6: Dialogue Responsible AI evaluations. OPT-175B is roughly on par with the Reddit 2.7B model, but performs worse in the Unsafe setting.
|
||||
|
||||
and Xu et al. (2020), we find that the models fine-tuned on curated dialogue datasets (BlenderBot 1, R2C2) have overall lower toxicity. We conclude that future experimentation of OPT-175B for dia-logue should contain explicit fine-tuning on curated datasets in order to improve the safety profile.
|
||||
|
||||
5 Limitations
|
||||
|
||||
In Sections 3.1 and 4, we carried out extensive evaluation of all released models at varying scales. We saw parity in performance for standard evalu-ation datasets used in the GPT-3 models. More-over, we performed safety, bias, and inclusion eval-uations, again seeing largely comparable perfor-mance with some variations in toxicity and hate speech detection. However, such evaluations may not fully characterize the complete limitations of these models. In general, we qualitatively observe that OPT-175B suffers from the same limitations noted in other LLMs (Brown et al., 2020; Lieber et al., 2021; Thoppilan et al., 2022; Rae et al., 2021; Smith et al., 2022; Chowdhery et al., 2022; Bender et al., 2021). In particular, we found OPT-175B does not work well with declarative instructions or point-blank interrogatives. Prompting with such instructions tends to produce a simulation of a dialogue begin-ning with such an instruction, rather than an execu-tion of the instruction. Future work into instruction learning, in the vein of InstructGPT (Ouyang et al., 2022), may alleviate these limitations. OPT-175B also tends to be repetitive and can eas-ily get stuck in a loop. While sampling can reduce the incidence rate of repetitive behavior (Holtz-man et al., 2020), we anecdotally found it did not eliminate it entirely when only one generation is sampled. Future work may wish to incorporate more modern strategies for reducing repetition and improving diversity, such as unlikelihood training (Welleck et al., 2020) or best-first decoding (Meis-ter et al., 2020). Similar to other LLMs, OPT-175B can produce factually incorrect statements (Adiwardana et al., 2020; Brown et al., 2020; Roller et al., 2021; Rae et al., 2021; Chowdhery et al., 2022; Thoppilan et al., 2022). This can be particularly harmful in applications where information accuracy is critical, such as healthcare and scientific discovery (Wei-dinger et al., 2021b). Recently, several efforts have reported that retrieval-augmented models can im-prove factual correctness of LLMs (Lewis et al., 2020; Komeili et al., 2021; Thoppilan et al., 2022; Borgeaud et al., 2021; Shuster et al., 2022; Nakano et al., 2021). We believe OPT-175B will also bene-fit from retrieval-augmentation in future iterations. As shown in Section 4, we also find OPT-175B has a high propensity to generate toxic language and reinforce harmful stereotypes, even when pro-vided with a relatively innocuous prompt (Gehman et al., 2020), and adversarial prompts are trivial to find (Dinan et al., 2021). There has been a great deal of work on mitigations for toxicity and bi-ases (Dathathri et al., 2019; Dinan et al., 2019a; Sheng et al., 2019; Dinan et al., 2020a; Liu et al., 2019a; Krause et al., 2020; Xu et al., 2020; Liang et al., 2021; Dinan et al., 2021; Xu et al., 2021a; Dhamala et al., 2021; Schick et al., 2021; Ouyang et al., 2022). Depending on downstream applica-tions, future uses of OPT-175B may need to employ these or novel mitigation approaches, especially be-fore any real world deployment. Given our primary goal as a replication of GPT-3, we choose not to apply these mitigations in this first release. In summary, we still believe this technology is premature for commercial deployment. Despite including data sheets and model cards, we believe more scrutiny should be afforded to the training data with additional data characterization and se-lection criteria in order to use data responsibly. The current practice is to feed the model with as much data as possible and minimal selection within these datasets. Despite having comprehensive evalua-tions, we would ideally have more streamlined and consistent evaluation setups to ensure replicability and reproducibility of evaluation scenarios. Dif-ferences in prompting styles and number of shots for in-context learning could create variations that lead to different results. We hope that the public release of the OPT models will enable many more researchers to work on these important issues. 6 Considerations for Release
|
||||
|
||||
Following the recommendations for individual re-searchers generated by the Partnership for AI, 7
|
||||
|
||||
along with the governance guidance outlined by NIST, 8 we are disclosing all of the details in-volved in training OPT-175B through our log-book, 9 our code, and providing researchers access to model weights for OPT-175B, along with a suite of smaller baselines mirroring the setup for OPT-175B. We aim to be fully accountable for the devel-opment lifecycle of OPT-175B, and only through increasing transparency around LLM development can we start understanding the limitations and risks of LLMs before broader deployment occurs. By sharing a detailed account of our day-to-day training process, we disclose not only how much compute was used to train the current version of OPT-175B, but also the human overhead required when underlying infrastructure or the training pro-cess itself becomes unstable at scale. These details are generally omitted from previous publications, likely due to the inability to fully ablate changes made mid-flight (without drastically increasing the compute budget). We hope that by revealing how certain ad-hoc design decisions were made, we can improve upon these practices in the future, and col-lectively increase the experimental robustness in developing models at this scale. Outside of these notes, the metaseq codebase itself is the final source of truth in many of our implementation details. By releasing our develop-ment codebase, we aim to shed light on any imple-mentation detail that may have been omitted from being explicitly enumerated in this paper, as it is either considered a detail of standard practice in the field, or is simply a detail we failed to account for. This current codebase is also the only known open-source implementation of training a decoder-only transformer that is ≥175B parameters without the use of pipeline paralellism on NVIDIA GPUs. To enable experimentation at 175B scale, we are providing researchers with direct access to the pa-rameters of OPT-175B. The reasoning here is two-fold: enable Responsible AI research into LLMs while simultaneously reducing the environmental
|
||||
|
||||
> 7https://partnershiponai.org/paper/ responsible-publication-recommendations/
|
||||
> 8https://nvlpubs.nist.gov/nistpubs/ SpecialPublications/NIST.SP.1270.pdf
|
||||
> 9https://github.com/facebookresearch/ metaseq/blob/main/projects/OPT/ chronicles/OPT175B_Logbook.pdf
|
||||
|
||||
impact of pursuing research at this scale. There is a growing body of work detailing ethical and social risks from deploying language models with emer-gent capabilities at scale (Weidinger et al., 2021a; Bommasani et al., 2021; Dinan et al., 2021; Kenton et al., 2021). By limiting access to OPT-175B to the research community with a non-commercial license, we aim to focus development efforts on quantifying the limitations of the LLMs first, be-fore broader commercial deployment occurs. Furthermore, there exists significant compute and carbon cost to reproduce models of this size. While OPT-175B was developed with an estimated carbon emissions footprint (CO2eq) of 75 tons, 10
|
||||
|
||||
GPT-3 was estimated to use 500 tons (Patterson et al., 2021), while Gopher required 380 tons (Rae et al., 2021). These estimates are not universally re-ported, and the accounting methodologies for these calculations are also not standardized. In addition, model training is only one component of the over-all carbon footprint of AI systems; we must also consider experimentation and eventual downstream inference cost, all of which contribute to the grow-ing energy footprint of creating large-scale models (Wu et al., 2022). By releasing our logbook, we hope to highlight the gap between a theoretical car-bon cost estimate that assumes no hardware failures or training instabilities, versus one that aims to in-clude the entire LLM development lifecycle. We need to understand the manufacturing (or embod-ied) carbon of these systems (Gupta et al., 2021) as they grow increasingly more complex, and we hope that our paper can help future work in defin-ing additional factors to consider when measuring the impact of scale on the environment. Similarly, by producing a set of baselines across a wide range of scales, we hope to enable the broader research community to study the impact and limitations of these models with respect to scale alone. As reported in Hoffmann et al. (2022), many of these LLMs may have been under-trained as a function of the amount of training data used, which implies that incorporating more data and con-tinuing to train these baseline models may continue to improve performance. There is also evidence that step-function changes in capabilities may oc-cur at a scale that is much smaller than 175B (Wei et al., 2021), indicating a need to examine a wider range of scales for different research applications.
|
||||
|
||||
> 10 With ablations, baselines and downtime, our own esti-mates of total cost is roughly 2 ×higher.
|
||||
|
||||
7 Related Work
|
||||
|
||||
Since the publication of the Transformer architec-ture (Vaswani et al., 2017) and BERT (Devlin et al., 2019), the field of NLP has experienced a massive shift towards the use of LLMs with self-supervised pre-training. Multiple masked langauge models, including T5 (Raffel et al., 2020) and Megatron-LM (Shoeybi et al., 2019), have shown consistent improvements through scale. These scaling gains come not only from growing the total number of parameters in the models, but also the amount and quality of pre-training data (Liu et al., 2019b; Hoff-mann et al., 2022). Auto-regressive language models (Mikolov et al., 2009) have seen the largest growth in model size, from 117M parameters (Radford et al., 2018) to over 500B parameters (Smith et al., 2022; Chowd-hery et al., 2022). The resulting massive improve-ment in generative fluency and quality was first characterized in GPT-2 (Radford et al., 2019) and further improved with GPT-3 (Brown et al., 2020) and later models. Although a variety of very large (over 100B parameters) generative models have now been trained (Lieber et al., 2021; Rae et al., 2021; Thoppilan et al., 2022; Smith et al., 2022; Chowdhery et al., 2022), they are all closed source and accessible only internally or via paid API ser-vices. There are a few notable efforts towards open sourcing LLMs from non-profit research organiza-tions including EleutherAI (Black et al., 2022) and BigScience. 11 These models differ from the OPT models in pre-training data, target languages and model scale, making it possible for the community to compare different pre-training strategies. Since Brown et al. (2020), the primary evalu-ation criterion for LLMs has been prompt-based (Black et al., 2022; Rae et al., 2021; Chowdhery et al., 2022), as is also performed in this paper. This is largely due to the convenience of evaluat-ing on many tasks without specialized task-specific fine-tuning. Prompting itself has a long history: cloze evaluations go back several decades (Cham-bers and Jurafsky, 2008; Mostafazadeh et al., 2016). More recently, prompting or masked infilling has been used to probe models for knowledge (Petroni et al., 2019) or perform a variety of NLP tasks (Radford et al., 2019; Brown et al., 2020). There has also been work on eliciting prompting behav-ior in smaller models (Schick and Schütze, 2020;
|
||||
|
||||
> 11 https://huggingface.co/bigscience/ tr11-176B-ml-logs/tensorboard
|
||||
|
||||
Gao et al., 2021b; Li and Liang, 2021; Lester et al., 2021; Scao and Rush, 2021), improving the flexi-bility of prompting (Shin et al., 2020), and under-standing why and how prompting works (Liu et al., 2021; Min et al., 2022). Recent efforts have shown gains by fine-tuning models to directly respond to instruction-style prompting (Wei et al., 2021; Min et al., 2021; Sanh et al., 2021; Ouyang et al., 2022). However, ef-fective prompt engineering remains an open re-search challenge. Results vary significantly and unpredictably with the selection of the prompt (Lu et al., 2021), and models do not seem to understand the prompts as fully as we expect (Webson and Pavlick, 2021). Furthermore, it is challenging to write prompts without a development set, which leads to questions about the extent to which we are actually achieving zero- or few-shot learning in practice (Perez et al., 2021). We do not attempt to address these concerns of prompting, and instead only aim to provide evaluation of OPT-175B in ex-isting settings. However, we hope the full release of OPT-175B will enable others to better study these challenges in the future.
|
||||
|
||||
8 Conclusion
|
||||
|
||||
In this technical report, we introduced OPT, a col-lection of auto-regressive language models ranging in size from 125M to 175B parameters. Our goal was to replicate the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data curation and training efficiency. We described training details, evaluated performance in a number of NLP and dialogue set-tings, and characterized behaviors with respect to bias, toxicity and hate speech. We also described many other limitations the models have, and dis-cussed a wide set of considerations for responsibly releasing the models. We believe the entire AI community would benefit from working together to develop guidelines for responsible LLMs, and we hope that broad access to these types of models will increase the diversity of voices defining the ethical considerations of such technologies.
|
||||
|
||||
Acknowledgements
|
||||
|
||||
We would like to thank Scott Jeschonek, Giri Anan-tharaman, Diego Sarina, Joaquin Colombo, Chris Bray, Stephen Roylance, Kalyan Saladi, Shubho Sengupta, and Brian O’Horo for helping to remove infrastructure blockers along the way; Percy Liang, Rishi Bommasani, and Emily Dinan for discus-sions on responsible release practices; Carole-Jean Wu for discussions on sustainability and carbon footprint considerations; Srini Iyer, Ramakanth Pa-sunuru, and Shruti Bhosale for previous contribu-tions to evaluations; Benjamin Lefaudeux, Geeta Chauhan, Natalia Gimelshein, Horace He, and Sam Gross for discussions on performance improvement work; Emily Dinan, Carole-Jean Wu, Daniel McK-innon, and Mark Tygert for feedback on this draft; Antoine Bordes, Joelle Pineau, Mary Williamson, Necip Fazil Ayan, Armand Joulin, Sergey Edunov, Melanie Kambadur, Zornitsa Kozareva, Ves Stoy-anov, Vitaliy Liptchinsky, Rahul Iyer, Jing Xu, Ja-son Weston, and many others for supporting this project internally.
|
||||
|
||||
References
|
||||
|
||||
Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977 .Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O’Horo, Jeff Wang, Luke Zettlemoyer, Mona T. Diab, Zornitsa Kozareva, and Ves Stoyanov. 2021. Efficient large scale lan-guage modeling with mixtures of experts. CoRR ,abs/2112.10684. Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. The pushshift reddit dataset. CoRR , abs/2001.08435. Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Confer-ence on Fairness, Accountability, and Transparency ,pages 610–623. Yonatan Bisk, Rowan Zellers, Ronan Le bras, Jianfeng Gao, and Yejin Choi. 2020. Piqa: Reasoning about physical commonsense in natural language. Pro-ceedings of the AAAI Conference on Artificial Intel-ligence , 34(05):7432–7439. Sid Black, Stella Biderman, Eric Hallahan, Quentin An-thony, Leo Gao, Laurence Golding, Horace He, Con-nor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. 2022. Gpt-neox-20b: An open-source autoregressive language model. Su Lin Blodgett, Gilsinia Lopez, Alexandra Olteanu, Robert Sim, and Hanna Wallach. 2021. Stereotyp-ing Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Compu-tational Linguistics and the 11th International Joint Conference on Natural Language Processing (Vol-ume 1: Long Papers) , pages 1004–1015, Online. As-sociation for Computational Linguistics. Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shya-mal Buch, Dallas Card, Rodrigo Castellon, Ni-ladri Chatterji, Annie S. Chen, Kathleen Creel, Jared Quincy Davis, Dorottya Demszky, Chris Don-ahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah D. Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Juraf-sky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark S. Krass, Ranjay Krishna, Rohith Kudi-tipudi, and et al. 2021. On the opportunities and risks of foundation models. CoRR , abs/2108.07258. Sebastian Borgeaud, Arthur Mensch, Jordan Hoff-mann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2021. Improv-ing language models by retrieving from trillions of tokens. arXiv preprint arXiv:2112.04426 .Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In
|
||||
|
||||
Advances in Neural Information Processing Systems ,volume 33, pages 1877–1901. Curran Associates, Inc. Nathanael Chambers and Dan Jurafsky. 2008. Unsuper-vised learning of narrative event chains. In Proceed-ings of ACL-08: HLT , pages 789–797, Columbus, Ohio. Association for Computational Linguistics. Ke-Li Chiu and Rohan Alexander. 2021. Detect-ing hate speech with gpt-3. arXiv preprint arXiv:2103.12407 .Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vin-odkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghe-mawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fe-dus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankara-narayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Bren-nan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. Palm: Scaling language modeling with pathways. Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the AI2 reasoning challenge.
|
||||
|
||||
CoRR , abs/1803.05457. Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2019. Plug and play language mod-els: A simple approach to controlled text generation.
|
||||
|
||||
arXiv preprint arXiv:1912.02164 .Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language under-standing. In North American Association for Com-putational Linguistics (NAACL) .Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. 2021. Bold: Dataset and metrics for measuring biases in open-ended language gen-eration. In Proceedings of the 2021 ACM Confer-ence on Fairness, Accountability, and Transparency ,pages 862–872. Emily Dinan, Gavin Abercrombie, A Stevie Bergman, Shannon Spruit, Dirk Hovy, Y-Lan Boureau, and Verena Rieser. 2021. Anticipating safety issues in e2e conversational ai: Framework and tooling.
|
||||
|
||||
arXiv preprint arXiv:2107.03451 .Emily Dinan, Angela Fan, Adina Williams, Jack Ur-banek, Douwe Kiela, and Jason Weston. 2020a. Queens are powerful too: Mitigating gender bias in dialogue generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Lan-guage Processing (EMNLP) , pages 8173–8188, On-line. Association for Computational Linguistics. Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. 2019a. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. arXiv preprint arXiv:1908.06083 .Emily Dinan, Varvara Logacheva, Valentin Ma-lykh, Alexander Miller, Kurt Shuster, Jack Ur-banek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, Shrimai Prabhumoye, Alan W. Black, Alexander Rudnicky, Jason Williams, Joelle Pineau, Mikhail Burtsev, and Jason Weston. 2020b. The second conversational intelligence challenge (Con-vAI2). In The NeurIPS ’18 Competition , pages 187– 208, Cham. Springer International Publishing. Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019b. Wiz-ard of Wikipedia: Knowledge-powered conversa-tional agents. In Proceedings of the International Conference on Learning Representations .Leo Gao, Stella Biderman, Sid Black, Laurence Gold-ing, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2021a. The pile: An 800gb dataset of diverse text for language modeling.
|
||||
|
||||
CoRR , abs/2101.00027. Tianyu Gao, Adam Fisch, and Danqi Chen. 2021b. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meet-ing of the Association for Computational Linguis-tics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021 , pages 3816–3830. Association for Computa-tional Linguistics. Timnit Gebru, Jamie Morgenstern, Briana Vec-chione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. Datasheets for datasets. Commun. ACM ,64(12):86–92. Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxi-cityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020 , pages 3356–3369, Online. Association for Computational Linguistics. Udit Gupta, Young Geun Kim, Sylvia Lee, Jordan Tse, Hsien-Hsin S Lee, Gu-Yeon Wei, David Brooks, and Carole-Jean Wu. 2021. Chasing carbon: The elu-sive environmental footprint of computing. IEEE In-ternational Symposium on High-Performance Com-puter Architecture (HPCA 2021) .Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recog-nition. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 770– 778. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Si-monyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. 2022. Training compute-optimal large language models. Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degener-ation. ArXiv , abs/1904.09751. Abigail Z. Jacobs and Hanna Wallach. 2021. Measure-ment and fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Trans-parency , FAccT ’21, page 375–385, New York, NY, USA. Association for Computing Machinery. Zachary Kenton, Tom Everitt, Laura Weidinger, Ia-son Gabriel, Vladimir Mikulik, and Geoffrey Irv-ing. 2021. Alignment of language agents. CoRR ,abs/2103.14659. Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2021. Internet-augmented dialogue generation.
|
||||
|
||||
CoRR , abs/2107.07566. Ben Krause, Akhilesh Deepak Gotmare, Bryan Mc-Cann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. 2020. GEDI: Generative discriminator guided sequence genera-tion. arXiv preprint arXiv:2009.06367 .Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. CoRR , abs/2104.08691. Hector J Levesque, Ernest Davis, and Leora Morgen-stern. 2011. The Winograd schema challenge. In
|
||||
|
||||
AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning , volume 46, page 47. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Hein-rich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock-täschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neu-ral Information Processing Systems , 33:9459–9474. Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. pages 4582–4597. Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2021. Towards under-standing and mitigating social biases in language models. In International Conference on Machine Learning , pages 6565–6576. PMLR. Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. 2021. Jurassic-1: Technical details and evaluation. Technical report, AI21 Labs. Haochen Liu, Jamell Dacon, Wenqi Fan, Hui Liu, Zitao Liu, and Jiliang Tang. 2019a. Does gender matter? towards fairness in dialogue systems. arXiv preprint arXiv:1910.10486 .Haokun Liu, William Huang, Dhara Mungra, and Samuel R. Bowman. 2020. Precise task formaliza-tion matters in Winograd schema evaluations. In
|
||||
|
||||
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) ,pages 8275–8280, Online. Association for Computa-tional Linguistics. Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2021. What makes good in-context examples for gpt-3? CoRR ,abs/2101.06804. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. Roberta: A robustly optimized bert pretraining ap-proach. arXiv preprint arXiv:1907.11692 .Ilya Loshchilov and Frank Hutter. 2017. Fixing weight decay regularization in adam. CoRR ,abs/1711.05101. Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2021. Fantastically ordered prompts and where to find them: Overcom-ing few-shot prompt order sensitivity. Clara Meister, Tim Vieira, and Ryan Cotterell. 2020. Best-first beam search. Transactions of the Associa-tion for Computational Linguistics , 8:795–809. Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740 .Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct elec-tricity? A new dataset for open book question an-swering. CoRR , abs/1809.02789. Tomas Mikolov, Jiri Kopecky, Lukas Burget, Ondrej Glembek, et al. 2009. Neural network based lan-guage models for highly inflective languages. In
|
||||
|
||||
2009 IEEE international conference on acoustics, speech and signal processing , pages 4725–4728. IEEE. Sewon Min, Mike Lewis, Luke Zettlemoyer, and Han-naneh Hajishirzi. 2021. Metaicl: Learning to learn in context. Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettle-moyer. 2022. Rethinking the role of demonstra-tions: What makes in-context learning work? arXiv preprint arXiv:2202.12837 .Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2018. Model cards for model reporting.
|
||||
|
||||
CoRR , abs/1810.03993. Ioannis Mollas, Zoe Chrysopoulou, Stamatis Kar-los, and Grigorios Tsoumakas. 2020. ETHOS: an online hate speech detection dataset. CoRR ,abs/2006.08328. Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vander-wende, Pushmeet Kohli, and James F. Allen. 2016. A corpus and evaluation framework for deeper understanding of commonsense stories. CoRR ,abs/1604.01696. Moin Nadeem, Anna Bethke, and Siva Reddy. 2021. StereoSet: Measuring stereotypical bias in pre-trained language models. In Association for Com-putational Linguistics (ACL) .Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332 .Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R Bowman. 2020. Crows-pairs: A chal-lenge dataset for measuring social biases in masked language models. arXiv preprint arXiv:2010.00133 .Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. A conversational paradigm for program synthesis. arXiv preprint .Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Car-roll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow in-structions with human feedback. arXiv preprint arXiv:2203.02155 .David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. 2021. Car-bon emissions and large neural network training.
|
||||
|
||||
arXiv preprint arXiv:2104.10350 .Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language models. Ad-vances in Neural Information Processing Systems ,34. Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowl-edge bases? In Proceedings of the 2019 Confer-ence on Empirical Methods in Natural Language Processing and the 9th International Joint Confer-ence on Natural Language Processing (EMNLP-IJCNLP) , pages 2463–2473, Hong Kong, China. As-sociation for Computational Linguistics. Alec Radford, Karthik Narasimhan, Time Salimans, and Ilya Sutskever. 2018. Improving language un-derstanding with unsupervised learning. Technical report, OpenAI. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Techni-cal report, OpenAI. Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, H. Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susan-nah Young, Eliza Rutherford, Tom Hennigan, Ja-cob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Mari-beth Rauh, Po-Sen Huang, Amelia Glaese, Jo-hannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, An-tonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant M. Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Ne-matzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cy-prien de Masson d’Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake A. Hecht-man, Laura Weidinger, Iason Gabriel, William S. Isaac, Edward Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. 2021. Scaling language models: Methods, analysis & in-sights from training gopher. CoRR , abs/2112.11446. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text trans-former. The Journal of Machine Learning Research (JMLR) , 21:1–67. Anand Rajaraman and Jeffrey David Ullman. 2011.
|
||||
|
||||
Mining of massive datasets . Cambridge University Press. Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards empathetic open-domain conversation models: A new benchmark and dataset. In Proceedings of the 57th Annual Meet-ing of the Association for Computational Linguis-tics , pages 5370–5381, Florence, Italy. Association for Computational Linguistics. Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, and Jason We-ston. 2021. Recipes for building an open-domain chatbot. In Proceedings of the 16th Conference of the European Chapter of the Association for Compu-tational Linguistics: Main Volume , pages 300–325, Online. Association for Computational Linguistics. Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavat-ula, and Yejin Choi. 2020. Winogrande: An adver-sarial winograd schema challenge at scale. In The Thirty-Fourth AAAI Conference on Artificial Intelli-gence, AAAI 2020, The Thirty-Second Innovative Ap-plications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020 , pages 8732– 8740. AAAI Press. Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Ab-heesht Sharma, Andrea Santilli, Thibault Fevry, Ja-son Alan Fries, Ryan Teehan, Stella Biderman, Leo Gao, Tali Bers, Thomas Wolf, and Alexander M. Rush. 2021. Multitask prompted training enables zero-shot task generalization. Teven Le Scao and Alexander M. Rush. 2021. How many data points is a prompt worth? pages 2627– 2636. Timo Schick and Hinrich Schütze. 2020. It’s not just size that matters: Small language models are also few-shot learners. CoRR , abs/2009.07118. Timo Schick, Sahana Udupa, and Hinrich Schütze. 2021. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp. Transactions of the Association for Computational Linguistics ,9:1408–1424. Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th An-nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 1715– 1725, Berlin, Germany. Association for Computa-tional Linguistics. Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. The woman worked as a babysitter: On biases in language generation. arXiv preprint arXiv:1909.01326 .Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. pages 4222– 4235. Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catan-zaro. 2019. Megatron-lm: Training multi-billion pa-rameter language models using model parallelism.
|
||||
|
||||
arXiv preprint arXiv:1909.08053 .Kurt Shuster, Mojtaba Komeili, Leonard Adolphs, Stephen Roller, Arthur Szlam, and Jason We-ston. 2022. Language models that seek for knowledge: Modular search & generation for di-alogue and prompt completion. arXiv preprint arXiv:2203.13224 .Eric Smith, Mary Williamson, Kurt Shuster, Jason We-ston, and Y-Lan Boureau. 2020. Can you put it all together: Evaluating conversational agents’ ability to blend skills. In Proceedings of the 58th Annual Meeting of the Association for Computational Lin-guistics . ACL. Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zheng, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. 2022. Using deepspeed and megatron to train megatron-turing NLG 530b, A large-scale genera-tive language model. CoRR , abs/2201.11990. Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239 .Trieu H. Trinh and Quoc V. Le. 2018. A sim-ple method for commonsense reasoning. CoRR ,abs/1806.02847. Megan Ung, Jing Xu, and Y-Lan Boureau. 2021. Safer-dialogues: Taking feedback gracefully after conver-sational safety failures. ArXiv , abs/2110.07518. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information pro-cessing systems .Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint 1905.00537 .Albert Webson and Ellie Pavlick. 2021. Do prompt-based models really understand the meaning of their prompts? arXiv preprint arXiv:2109.01247 .Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, An-drew M. Dai, and Quoc V. Le. 2021. Finetuned language models are zero-shot learners. CoRR ,abs/2109.01652. Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, and Iason Gabriel. 2021a. Ethical and social risks of harm from language models. Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. 2021b. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359 .Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Di-nan, Kyunghyun Cho, and Jason Weston. 2020. Neu-ral text generation with unlikelihood training. In
|
||||
|
||||
International Conference on Learning Representa-tions .Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Glo-ria Chang, Fiona Aga Behram, James Huang, Charles Bai, Michael Gschwind, Anurag Gupta, Myle Ott, Anastasia Melnikov, Salvatore Candido, David Brooks, Geeta Chauhan, Benjamin Lee, Hsien-Hsin S. Lee, Bugra Akyildiz, Maximilian Ba-landat, Joe Spisak, Ravi Jain, Mike Rabbat, and Kim Hazelwood. 2022. Sustainable AI: environmental implications, challenges and opportunities. In Pro-ceedings of the Conference on Machine Learning and Systems .Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Ja-son Weston, and Emily Dinan. 2020. Recipes for safety in open-domain chatbots. arXiv preprint arXiv:2010.07079 .Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason We-ston, and Emily Dinan. 2021a. Bot-adversarial dia-logue for safe conversational agents. In Proceedings of the 2021 Conference of the North American Chap-ter of the Association for Computational Linguistics: Human Language Technologies , pages 2950–2968, Online. Association for Computational Linguistics. Jing Xu, Arthur Szlam, and Jason Weston. 2021b. Be-yond goldfish memory: Long-term open-domain conversation. arXiv preprint arXiv:2107.07567 .Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In Pro-ceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Pa-pers , pages 4791–4800. Association for Computa-tional Linguistics. Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. CoRR , abs/1506.06724. A Additional Evaluations
|
||||
|
||||
.10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
30
|
||||
|
||||
40
|
||||
|
||||
50
|
||||
|
||||
60
|
||||
|
||||
70
|
||||
|
||||
80 Accuracy
|
||||
|
||||
HellaSwag
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
62.5
|
||||
|
||||
65.0
|
||||
|
||||
67.5
|
||||
|
||||
70.0
|
||||
|
||||
72.5
|
||||
|
||||
75.0
|
||||
|
||||
77.5
|
||||
|
||||
80.0
|
||||
|
||||
82.5
|
||||
|
||||
StoryCloze
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
62.5
|
||||
|
||||
65.0
|
||||
|
||||
67.5
|
||||
|
||||
70.0
|
||||
|
||||
72.5
|
||||
|
||||
75.0
|
||||
|
||||
77.5
|
||||
|
||||
80.0
|
||||
|
||||
82.5
|
||||
|
||||
PIQA
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
40
|
||||
|
||||
45
|
||||
|
||||
50
|
||||
|
||||
55
|
||||
|
||||
60
|
||||
|
||||
65
|
||||
|
||||
70
|
||||
|
||||
ARC (Easy)
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
30
|
||||
|
||||
35
|
||||
|
||||
40
|
||||
|
||||
45
|
||||
|
||||
50 Accuracy
|
||||
|
||||
ARC (Challenge)
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
30
|
||||
|
||||
35
|
||||
|
||||
40
|
||||
|
||||
45
|
||||
|
||||
50
|
||||
|
||||
55
|
||||
|
||||
OpenBookQA
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
50
|
||||
|
||||
55
|
||||
|
||||
60
|
||||
|
||||
65
|
||||
|
||||
70
|
||||
|
||||
75
|
||||
|
||||
80
|
||||
|
||||
Winogrande
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
60
|
||||
|
||||
65
|
||||
|
||||
70
|
||||
|
||||
75
|
||||
|
||||
80
|
||||
|
||||
85
|
||||
|
||||
90
|
||||
|
||||
Winograd
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
50
|
||||
|
||||
55
|
||||
|
||||
60
|
||||
|
||||
65
|
||||
|
||||
70
|
||||
|
||||
75
|
||||
|
||||
80
|
||||
|
||||
85 Accuracy
|
||||
|
||||
BoolQ
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
0
|
||||
|
||||
10
|
||||
|
||||
20
|
||||
|
||||
30
|
||||
|
||||
40
|
||||
|
||||
50
|
||||
|
||||
CB
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
65
|
||||
|
||||
70
|
||||
|
||||
75
|
||||
|
||||
80
|
||||
|
||||
85
|
||||
|
||||
90
|
||||
|
||||
COPA
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
0
|
||||
|
||||
10
|
||||
|
||||
20
|
||||
|
||||
30
|
||||
|
||||
40
|
||||
|
||||
50
|
||||
|
||||
60
|
||||
|
||||
WIC
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
Parameters
|
||||
|
||||
50
|
||||
|
||||
55
|
||||
|
||||
60
|
||||
|
||||
65
|
||||
|
||||
70
|
||||
|
||||
75
|
||||
|
||||
80
|
||||
|
||||
85
|
||||
|
||||
90 Accuracy
|
||||
|
||||
WSC
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
Parameters
|
||||
|
||||
5
|
||||
|
||||
10
|
||||
|
||||
15
|
||||
|
||||
20
|
||||
|
||||
25
|
||||
|
||||
MultiRC
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
Parameters
|
||||
|
||||
50
|
||||
|
||||
55
|
||||
|
||||
60
|
||||
|
||||
65
|
||||
|
||||
70
|
||||
|
||||
RTE
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
Parameters
|
||||
|
||||
70
|
||||
|
||||
75
|
||||
|
||||
80
|
||||
|
||||
85
|
||||
|
||||
90
|
||||
|
||||
ReCoRD
|
||||
|
||||
OPT GPT PaLM Chinchilla Gopher Eleuther Jurassic
|
||||
|
||||
Figure 6: Zero-shot NLP Evaluations . Full evaluations on all 16 NLP tasks, with comparisons where available. We find that across most tasks, GPT-3 models and OPT models perform similarly, but some tasks display highly erratic behavior. 10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
30
|
||||
|
||||
40
|
||||
|
||||
50
|
||||
|
||||
60
|
||||
|
||||
70
|
||||
|
||||
80 Accuracy
|
||||
|
||||
HellaSwag
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
65
|
||||
|
||||
70
|
||||
|
||||
75
|
||||
|
||||
80
|
||||
|
||||
85
|
||||
|
||||
StoryCloze
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
62.5
|
||||
|
||||
65.0
|
||||
|
||||
67.5
|
||||
|
||||
70.0
|
||||
|
||||
72.5
|
||||
|
||||
75.0
|
||||
|
||||
77.5
|
||||
|
||||
80.0
|
||||
|
||||
82.5
|
||||
|
||||
PIQA
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
40
|
||||
|
||||
45
|
||||
|
||||
50
|
||||
|
||||
55
|
||||
|
||||
60
|
||||
|
||||
65
|
||||
|
||||
70
|
||||
|
||||
75
|
||||
|
||||
ARC (Easy)
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
25
|
||||
|
||||
30
|
||||
|
||||
35
|
||||
|
||||
40
|
||||
|
||||
45
|
||||
|
||||
50 Accuracy
|
||||
|
||||
ARC (Challenge)
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
35
|
||||
|
||||
40
|
||||
|
||||
45
|
||||
|
||||
50
|
||||
|
||||
55
|
||||
|
||||
60
|
||||
|
||||
65
|
||||
|
||||
OpenBookQA
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
50
|
||||
|
||||
55
|
||||
|
||||
60
|
||||
|
||||
65
|
||||
|
||||
70
|
||||
|
||||
75
|
||||
|
||||
Winogrande
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
60
|
||||
|
||||
65
|
||||
|
||||
70
|
||||
|
||||
75
|
||||
|
||||
80
|
||||
|
||||
85
|
||||
|
||||
90
|
||||
|
||||
Winograd
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
45
|
||||
|
||||
50
|
||||
|
||||
55
|
||||
|
||||
60
|
||||
|
||||
65
|
||||
|
||||
70
|
||||
|
||||
75 Accuracy
|
||||
|
||||
BoolQ
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
0
|
||||
|
||||
20
|
||||
|
||||
40
|
||||
|
||||
60
|
||||
|
||||
80
|
||||
|
||||
CB
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
65
|
||||
|
||||
70
|
||||
|
||||
75
|
||||
|
||||
80
|
||||
|
||||
85
|
||||
|
||||
90
|
||||
|
||||
COPA
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
0
|
||||
|
||||
10
|
||||
|
||||
20
|
||||
|
||||
30
|
||||
|
||||
40
|
||||
|
||||
50
|
||||
|
||||
WIC
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
Parameters
|
||||
|
||||
50
|
||||
|
||||
55
|
||||
|
||||
60
|
||||
|
||||
65
|
||||
|
||||
70
|
||||
|
||||
75 Accuracy
|
||||
|
||||
WSC
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
Parameters
|
||||
|
||||
5
|
||||
|
||||
10
|
||||
|
||||
15
|
||||
|
||||
20
|
||||
|
||||
25
|
||||
|
||||
30
|
||||
|
||||
MultiRC
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
Parameters
|
||||
|
||||
50
|
||||
|
||||
55
|
||||
|
||||
60
|
||||
|
||||
65
|
||||
|
||||
70
|
||||
|
||||
RTE
|
||||
|
||||
10 8 10 9 10 10 10 11 10 12
|
||||
|
||||
Parameters
|
||||
|
||||
70
|
||||
|
||||
75
|
||||
|
||||
80
|
||||
|
||||
85
|
||||
|
||||
90
|
||||
|
||||
ReCoRD
|
||||
|
||||
Shot 0 1 32 Series OPT GPT Figure 7: Multishot-shot NLP Evaluations . Full evaluations on all 16 NLP tasks, with comparisons to the GPT-3 reported performance. As with zero-shot, performance is roughly similar for most tasks, with some tasks demonstrating erratic behavior. B Contributions
|
||||
|
||||
Pre-training
|
||||
|
||||
• Initial planning: Susan Zhang • Training infrastructure and initial ablations: Naman Goyal, Myle Ott, Stephen Roller, Sam Shleifer, Susan Zhang • Training efficiency: Naman Goyal, Myle Ott, Sam Shleifer • Data curation and deduplication: Shuhoi Chen, Myle Ott, Stephen Roller • Training and monitoring OPT-175B: Mikel Artetxe, Moya Chen, Naman Goyal, Punit Singh Koura, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Stephen Roller, Susan Zhang • Training 125M–66B baselines: Naman Goyal, Stephen Roller, Susan Zhang
|
||||
|
||||
Evaluations
|
||||
|
||||
• NLP: Xian Li, Xi Victoria Lin, Todor Mihaylov, Stephen Roller, Anjali Sridhar • Dialogue: Stephen Roller • Responsible AI Evaluations: Punit Singh Koura, Stephen Roller, Tianlu Wang
|
||||
|
||||
Paper writing: Moya Chen, Stephen Roller, Luke Zettlemoyer, Susan Zhang
|
||||
|
||||
Code release preparation: Christopher Dewan, Susan Zhang
|
||||
|
||||
Responsible AI conduct: Mona Diab, Susan Zhang
|
||||
|
||||
C Datasheet
|
||||
|
||||
We follow the recommendations of Gebru et al. (2021) and provide a data card for the dataset used to train the OPT models.
|
||||
|
||||
C.1 Motivation
|
||||
|
||||
• For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description. The pre-training data for training the OPT-175B model was created by a union of five datasets, including three datasets used by RoBERTa (Liu et al., 2019b), a subset of the Pile (Gao et al., 2021a), along with the Pushshift.io Reddit dataset that was developed in (Baumgartner et al., 2020) and processed in (Roller et al., 2021). These purpose of creating this dataset was to pre-train the language model on a broad corpus of text, with emphasis on human-generated text. • Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)? Meta AI. • Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number. Meta AI. • Any other comments? No. C.2 Composition
|
||||
|
||||
• What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description. The instances are textual documents. The overall dataset is composed from a union of the following datasets:
|
||||
|
||||
– BookCorpus (Zhu et al., 2015) consists of more than 10K unpublished books
|
||||
|
||||
– CC-Stories (Trinh and Le, 2018) contains a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas
|
||||
|
||||
– The Pile (Gao et al., 2021a) from which the following was included: * Pile-CC * OpenWebText2 * USPTO * Project Gutenberg * OpenSubtitles * Wikipedia * DM Mathematics * HackerNews
|
||||
|
||||
– Pushshift.io Reddit dataset that was developed in Baumgartner et al. (2020) and processed in Roller et al. (2021).
|
||||
|
||||
– CCNewsV2 containing an updated version of the English portion of the CommonCrawl News dataset that was used in RoBERTa (Liu et al., 2019b) • How many instances are there in total (of each type, if appropriate)? The training data contains 180B tokens corresponding to 800 GB of data. • Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable). The CC-stories dataset contains a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas. The remainder of the dataset was collected from the above sources, reformatted, and deduplicated. • What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description. Each instance consists of raw text data. • Is there a label or target associated with each instance? If so, please provide a description. No. • Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text. No. • Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)? If so, please describe how these relationships are made explicit. There are no explicit relationships between individual instances. • Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them. We hold out a random validation set of approximately 200MB from the pretraining data, sampled proportionally to each dataset’s size in the pretraining corpus. • Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description. Outside of naturally occurring duplication from potential overlaps between the datasets, there are no other redundancies, errors, or sources of noise that we add. • Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? It’s self-contained. • Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why. Parts of the dataset are a subset of public Common Crawl data, along with a subset of public Reddit data, which could contain sentences that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety. • Does the dataset relate to people? If not, you may skip the remaining questions in this section.
|
||||
|
||||
Some documents of this data relate to people, such as news articles, Wikipedia descriptions, etc. • Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset. No, the dataset does not explicitly include subpopulation identification. • Any other comments? No.
|
||||
|
||||
C.3 Collection Process
|
||||
|
||||
• How was the data associated with each instance acquired? Was the data directly observ-able (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/ derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how. N/A. The dataset is a union of five publicly available datasets. • What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mecha-nisms or procedures validated? The data was downloaded from the internet. • If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)? Please see previous answers for how the dataset was created. • Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)? This data is mined, filtered and sampled by machines. • Over what timeframe was the data collected? Does this timeframe match the creation time-frame of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created. The CC-News dataset contains English news articles crawled between September 2016 and September 2021. • Does the dataset relate to people? If not, you may skip the remainder of the questions in this section. No. • Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)? N/A. • Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself. N/A. • Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or other information) how consent was requested and pro-vided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented. N/A. • If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate). N/A. • Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation. Some toxicity and bias evaluations were performed. Please refer to the main document and the model card for these details. • Any other comments? No.
|
||||
|
||||
C.4 Preprocessing/cleaning/labeling
|
||||
|
||||
• Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, to-kenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remainder of the questions in this section. The component datasets went through standard cleaning and re-formatting practices, including removing repetitive/non-informative text like “Chapter One,” or “This ebook by Project Gutenberg.” • Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to sup-port unanticipated future uses)? If so, please provide a link or other access point to the “raw” data. The “raw” component datasets is publicly available in their respective locations (more details can be seen in the respective papers linked in references). • Any other comments? No.
|
||||
|
||||
C.5 Uses
|
||||
|
||||
• Has the dataset been used for any tasks already? If so, please provide a description. Yes, this dataset was used to pre-train the OPT models. • Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point. https://github.com/facebookresearch/ metaseq
|
||||
|
||||
• What (other) tasks could the dataset be used for? This data can be used to pre-train language models, which are foundation to many current and future language tasks. • Is there anything about the composition of the dataset or the way it was collected and prepro-cessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individ-uals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks) If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms? The pipeline for creating this dataset paves a way for building a scalable infrastructure for mining datasets. • Are there tasks for which the dataset should not be used? If so, please provide a description.
|
||||
|
||||
None that we are currently aware of. • Any other comments? No. C.6 Distribution
|
||||
|
||||
• Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.
|
||||
|
||||
Not at this time. • How will the dataset will be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)? N/A. • When will the dataset be distributed? N/A. • Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions. N/A. • Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation. N/A. • Any other comments? No.
|
||||
|
||||
C.7 Maintenance
|
||||
|
||||
• Who is supporting/hosting/maintaining the dataset? Meta AI. • How can the owner/curator/manager of the dataset be contacted (e.g., email address)? Refer to the main document. • Is there an erratum? If so, please provide a link or other access point. N/A. • Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete in-stances)? If so, please describe how often, by whom, and how updates will be communicated to users (e.g., mailing list, GitHub)? No current plan for updating. • If the dataset relates to people, are there applicable limits on the retention of the data as-sociated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced. N/A. • Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to users. N/A. • If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/ verified? If so, please describe how. If not, why not? Is there a process for communicating/ dis-tributing these contributions to other users? If so, please provide a description. No mechanism is available right now. • Any other comments? No.
|
||||
|
||||
D Model Card
|
||||
|
||||
Following Mitchell et al. (2018), we provide a model card for OPT-175B. D.1 Model Details
|
||||
|
||||
• Person or organization developing model: OPT-175B was developed by Meta AI. • Model date: OPT-175B was released on May 3, 2022. • Model version: OPT-175B described in this paper is version 1.0.0. • Model type: OPT-175B is a large decoder-only transformer language model. • Information about training algorithms, parameters, fairness constraints or other applied ap-proaches, and features: OPT-175B was trained with AdamW for parameter sizes from 125M to 175B. See the Data Card (Appendix C) for information about training data and Section 2.2 - 2.5 for information about the training process. • Paper or other resource for more information: See the rest of this paper for more details on OPT-175B as well as the corresponding post on the Meta AI Research Blog. More details are also available in metaseq, our open-source repository. 12
|
||||
|
||||
• License: OPT-175B and the smaller baseline models are made available through a non-commercial use license agreement provided in our model license. 13
|
||||
|
||||
• Where to send questions or comments about the model: Please contact the corresponding authors
|
||||
|
||||
{susanz,roller,namangoyal}@fb.com for any questions or comments.
|
||||
|
||||
D.2 Intended Use
|
||||
|
||||
• Primary intended uses: We release OPT-175B for research into Language Models, especially as it pertains to Responsible AI. See Section 6 for more detailed Considerations for Release. Information on how to use the model can be found at metaseq , our open-source repository. • Primary intended users: We primarily target researchers and the related research community. • Out-of-scope use cases: OPT-175B is not released for production use or real-world deployments. As we note in Section 5, OPT-175B, like similar large language models, has a variety of shortcomings that make it premature for commercial use.
|
||||
|
||||
D.3 Data, Limitations, and Recommendations
|
||||
|
||||
• Data selection for training: Training data for OPT-175B was selected based on a combination of breadth and availability. See our Data Card (Appendix C) for more detailed information on the data used to train our model. • Data selection for evaluation: Evaluations in this paper were chosen to provide comparable perfor-mance assessments relative to similar scale models in the literature. Given concerns in the community around safety and fairness of large language models in general, we also explicitly provide evaluations on Responsible AI (see Section 4). • Limitations: Like other large language models for which the diversity (or lack thereof) of training data induces downstream impact on the quality of our model, OPT-175B has limitations in terms of bias and safety. OPT-175B can also have quality issues in terms of generation diversity and hallucination. In general, OPT-175B is not immune from the plethora of issues that plague modern large language models. By releasing with a non-commercial license, we also hope to increase communication, transparency, and study of the problems of large language models, especially in areas which may not be aligned with commercial interests. See Section 5 for a more detailed discussion of limitations of OPT-175B.
|
||||
|
||||
> 12 https://github.com/facebookresearch/metaseq/
|
||||
> 13 https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/MODEL_LICENSE. md
|
||||
|
||||
• Recommendations for future work: See Section 6 for more about our Considerations for Release, including a discussion of potential avenues of research enabled by opening our model to more of the research community. We hope that the release of OPT-175B, as well as information around our model training process, will increase open science around both large language models in specific and natural language processing and deep learning in general. E Sample Model Outputs
|
||||
|
||||
For all sample outputs, the initial prompt is given in bold and the remainder is the continuation. These example outputs were intentionally selected to highlight both successes and failures of the OPT-175B model.
|
||||
|
||||
Figure 8: Poetry generation. We have observed the model can write entertaining poetry on topics such as dodos, samosas, and performance reviews. However, we struggled to get the model to observe rhyme or meter.
|
||||
|
||||
Figure 9: Conversation generation. OPT-175B adopts a patriotic personality when prompted as the Statue of Liberty. However, the model also devolves into somewhat simple and linguistically repetitive generations further into the conversation. Figure 10: Basic few-shot translation example. OPT was not intentionally trained to be multilingual, but we found anecdotally it has limited success with simple translations in German, Spanish, French, and Chinese. Figure 11: Paper writing example. Prompting with "1. Introduction" generally yielded more interesting results compared to prompting with “Abstract.” Our prompt here was inspired by the first sentence of the seminal ResNet work (He et al., 2016). Figure 12: Arithmetic. We observe mistakes when extending from addition to other operations. Figure 13: Python programming. Simply switching out a variable name can alter the generated output.
|
||||
@@ -0,0 +1,349 @@
|
||||
Title: Qwen3 Technical Report
|
||||
|
||||
URL Source: https://arxiv.org/html/2505.09388
|
||||
|
||||
Markdown Content:
|
||||
\useunder
|
||||
|
||||
\ul
|
||||
|
||||
###### Abstract
|
||||
|
||||
In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models—–such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ-32B)—–and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0.
|
||||
|
||||
## 1 Introduction
|
||||
|
||||
The pursuit of artificial general intelligence (AGI) or artificial super intelligence (ASI) has long been a goal for humanity. Recent advancements in large foundation models, e.g., GPT-4o (gpt4o), Claude 3.7 (claude3.7), Gemini 2.5 (gemini2.5), DeepSeek-V3 (deepseekv3), Llama-4 (llama4), and Qwen2.5 (qwen2.5), have demonstrated significant progress toward this objective. These models are trained on vast datasets spanning trillions of tokens across diverse domains and tasks, effectively distilling human knowledge and capabilities into their parameters. Furthermore, recent developments in reasoning models, optimized through reinforcement learning, highlight the potential for foundation models to enhance inference-time scaling and achieve higher levels of intelligence, e.g., o3 (o3), DeepSeek-R1 (r1). While most state-of-the-art models remain proprietary, the rapid growth of open-source communities has substantially reduced the performance gap between open-weight and closed-source models. Notably, an increasing number of top-tier models (llama4; deepseekv3; r1; qwen2.5) are now being released as open-source, fostering broader research and innovation in artificial intelligence.
|
||||
|
||||
In this work, we introduce Qwen3, the latest series in our foundation model family, Qwen. Qwen3 is a collection of open-weight large language models (LLMs) that achieve state-of-the-art performance across a wide variety of tasks and domains. We release both dense and Mixture-of-Experts (MoE) models, with the number of parameters ranging from 0.6 billion to 235 billion, to meet the needs of different downstream applications. Notably, the flagship model, Qwen3-235B-A22B, is an MoE model with a total of 235 billion parameters and 22 billion activated ones per token. This design ensures both high performance and efficient inference.
|
||||
|
||||
Qwen3 introduces several key advancements to enhance its functionality and usability. First, it integrates two distinct operating modes, thinking mode and non-thinking mode, into a single model. This allows users to switch between these modes without alternating between different models, e.g., switching from Qwen2.5 to QwQ (qwq). This flexibility ensures that developers and users can adapt the model's behavior to suit specific tasks efficiently. Additionally, Qwen3 incorporates thinking budgets, providing users with fine-grained control over the level of reasoning effort applied by the model during task execution. This capability is crucial to the optimization of computational resources and performance, tailoring the model's thinking behavior to meet varying complexity in real-world applications. Furthermore, Qwen3 has been pre-trained on 36 trillion tokens covering up to 119 languages and dialects, effectively enhancing its multilingual capabilities. This broadened language support amplifies its potential for deployment in global use cases and international applications. These advancements together establish Qwen3 as a cutting-edge open-source large language model family, capable of effectively addressing complex tasks across various domains and languages.
|
||||
|
||||
The pre-training process for Qwen3 utilizes a large-scale dataset consisting of approximately 36 trillion tokens, curated to ensure linguistic and domain diversity. To efficiently expand the training data, we employ a multi-modal approach: Qwen2.5-VL (qwen2.5vl) is finetuned to extract text from extensive PDF documents. We also generate synthetic data using domain-specific models: Qwen2.5-Math (qwen2.5math) for mathematical content and Qwen2.5-Coder (qwen2.5coder) for code-related data. The pre-training process follows a three-stage strategy. In the first stage, the model is trained on about 30 trillion tokens to build a strong foundation of general knowledge. In the second stage, it is further trained on knowledge-intensive data to enhance reasoning abilities in areas like science, technology, engineering, and mathematics (STEM) and coding. Finally, in the third stage, the model is trained on long-context data to increase its maximum context length from 4,096 to 32,768 tokens.
|
||||
|
||||
To better align foundation models with human preferences and downstream applications, we employ a multi-stage post-training approach that empowers both thinking (reasoning) and non-thinking modes. In the first two stages, we focus on developing strong reasoning abilities through long chain-of-thought (CoT) cold-start finetuning and reinforcement learning focusing on mathematics and coding tasks. In the final two stages, we combine data with and without reasoning paths into a unified dataset for further fine-tuning, enabling the model to handle both types of input effectively, and we then apply general-domain reinforcement learning to improve performance across a wide range of downstream tasks. For smaller models, we use strong-to-weak distillation, leveraging both off-policy and on-policy knowledge transfer from larger models to enhance their capabilities. Distillation from advanced teacher models significantly outperforms reinforcement learning in performance and training efficiency.
|
||||
|
||||
We evaluate both pre-trained and post-trained versions of our models across a comprehensive set of benchmarks spanning multiple tasks and domains. Experimental results show that our base pre-trained models achieve state-of-the-art performance. The post-trained models, whether in thinking or non-thinking mode, perform competitively against leading proprietary models and large mixture-of-experts (MoE) models such as o1, o3-mini, and DeepSeek-V3. Notably, our models excel in coding, mathematics, and agent-related tasks. For example, the flagship model Qwen3-235B-A22B achieves 85.7 on AIME'24 and 81.5 on AIME'25 (aime), 70.7 on LiveCodeBench v5 (livecodebench), 2,056 on CodeForces, and 70.8 on BFCL v3 (bfcl). In addition, other models in the Qwen3 series also show strong performance relative to their size. Furthermore, we observe that increasing the thinking budget for thinking tokens leads to a consistent improvement in the model's performance across various tasks.
|
||||
|
||||
In the following sections, we describe the design of the model architecture, provide details on its training procedures, present the experimental results of pre-trained and post-trained models, and finally, conclude this technical report by summarizing the key findings and outlining potential directions for future research.
|
||||
|
||||
## 2 Architecture
|
||||
|
||||
The Qwen3 series includes 6 dense models, namely Qwen3-0.6B, Qwen3-1.7B, Qwen3-4B, Qwen3-8B, Qwen3-14B, and Qwen3-32B, and 2 MoE models, Qwen3-30B-A3B and Qwen3-235B-A22B. The flagship model, Qwen3-235B-A22B, has a total of 235B parameters with 22B activated ones. Below, we elaborate on the architecture of the Qwen3 models.
|
||||
|
||||
The architecture of the Qwen3 dense models is similar to Qwen2.5 (qwen2.5), including using Grouped Query Attention (GQA, gqa), SwiGLU (glu), Rotary Positional Embeddings (RoPE, rope), and RMSNorm (rmsnorm) with pre-normalization. Besides, we remove QKV-bias used in Qwen2 (qwen2) and introduce QK-Norm (pmlr-v202-dehghani23a) to the attention mechanism to ensure stable training for Qwen3. Key information on model architecture is provided in Table [1](https://arxiv.org/html/2505.09388#S2.T1 "Table 1 ‣ 2 Architecture ‣ Qwen3 Technical Report").
|
||||
|
||||
The Qwen3 MoE models share the same fundamental architecture as the Qwen3 dense models. Key information on model architecture is provided in Table [2](https://arxiv.org/html/2505.09388#S2.T2 "Table 2 ‣ 2 Architecture ‣ Qwen3 Technical Report"). We follow Qwen2.5-MoE (qwen2.5) and implement fine-grained expert segmentation (deepseekmoe). The Qwen3 MoE models have 128 total experts with 8 activated experts per token. Unlike Qwen2.5-MoE, the Qwen3-MoE design excludes shared experts. Furthermore, we adopt the global-batch load balancing loss (global_balance) to encourage expert specialization. These architectural and training innovations have yielded substantial improvements in model performance across downstream tasks.
|
||||
|
||||
Qwen3 models utilize Qwen's tokenizer (qwen), which implements byte-level byte-pair encoding (BBPE, gpt3; wang2020neural; sennirch2016neural) with a vocabulary size of 151,669.
|
||||
|
||||
Table 1: Model architecture of Qwen3 dense models.
|
||||
|
||||
Models Layers Heads (Q / KV)Tie Embedding Context Length
|
||||
Qwen3-0.6B 28 16 / 8 Yes 32K
|
||||
Qwen3-1.7B 28 16 / 8 Yes 32K
|
||||
Qwen3-4B 36 32 / 8 Yes 128K
|
||||
Qwen3-8B 36 32 / 8 No 128K
|
||||
Qwen3-14B 40 40 / 8 No 128K
|
||||
Qwen3-32B 64 64 / 8 No 128K
|
||||
|
||||
Table 2: Model architecture of Qwen3 MoE models.
|
||||
|
||||
Models Layers Heads (Q / KV)# Experts (Total / Activated)Context Length
|
||||
Qwen3-30B-A3B 48 32 / 4 128 / 8 128K
|
||||
Qwen3-235B-A22B 94 64 / 4 128 / 8 128K
|
||||
|
||||
## 3 Pre-training
|
||||
|
||||
In this section, we describe the construction of our pretraining data, the details of our pretraining approach, and present experimental results from evaluating the base models on standard benchmarks.
|
||||
|
||||
### 3.1 Pre-training Data
|
||||
|
||||
Compared with Qwen2.5 (qwen2.5), we have significantly expanded the scale and diversity of our training data. Specifically, we collected twice as many pre-training tokens—covering three times more languages. All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens. This dataset includes high-quality content in various domains such as coding, STEM (Science, Technology, Engineering, and Mathematics), reasoning tasks, books, multilingual texts, and synthetic data.
|
||||
|
||||
To further expand the pre-training data corpus, we first employ the Qwen2.5-VL model (qwen2.5vl) to perform text recognition on a large volume of PDF-like documents. The recognized text is then refined using the Qwen2.5 model (qwen2.5), which helps improve its quality. Through this two-step process, we are able to obtain an additional set of high-quality text tokens, amounting to trillions in total. Besides, we employ Qwen2.5 (qwen2.5), Qwen2.5-Math (qwen2.5math), and Qwen2.5-Coder (qwen2.5coder) models to synthesize trillions of text tokens in different formats, including textbooks, question-answering, instructions, and code snippets, covering dozens of domains. Finally, we further expand the pre-training corpus by incorporating additional multilingual data and introducing more languages. Compared to the pre-training data used in Qwen2.5, the number of supported languages has been significantly increased from 29 to 119, enhancing the model's linguistic coverage and cross-lingual capabilities.
|
||||
|
||||
We have developed a multilingual data annotation system designed to enhance both the quality and diversity of training data. This system has been applied to our large-scale pre-training datasets, annotating over 30 trillion tokens across multiple dimensions such as educational value, fields, domains, and safety. These detailed annotations support more effective data filtering and combination. Unlike previous studies (doremi; doge; regmix) that optimize the data mixture at the data source or domain level, our method optimizes the data mixture at the instance-level through extensive ablation experiments on small proxy models with the fine-grained data labels.
|
||||
|
||||
### 3.2 Pre-training Stage
|
||||
|
||||
The Qwen3 models are pre-trained through a three-stage process:
|
||||
|
||||
1. (1)
|
||||
General Stage (S1): At the first pre-training stage, all Qwen3 models are trained on over 30 trillion tokens using a sequence length of 4,096 tokens. At this stage, the models have been fully pre-trained on language proficiency and general world knowledge, with training data covering 119 languages and dialects.
|
||||
|
||||
2. (2)
|
||||
Reasoning Stage (S2): To further improve the reasoning ability, we optimize the pre-training corpus of this stage by increasing the proportion of STEM, coding, reasoning, and synthetic data. The models are further pre-trained with about 5T higher-quality tokens at a sequence length of 4,096 tokens. We also accelerate the learning rate decay during this stage.
|
||||
|
||||
3. (3)
|
||||
Long Context Stage: In the final pre-training stage, we collect high-quality long context corpora to extend the context length of Qwen3 models. All models are pre-trained on hundreds of billions of tokens with a sequence length of 32,768 tokens. The long context corpus includes 75% of text between 16,384 to 32,768 tokens in length, and 25% of text between 4,096 to 16,384 in length. Following Qwen2.5 (qwen2.5), we increase the base frequency of RoPE from 10,000 to 1,000,000 using the ABF technique (ropeabf). Meanwhile, we introduce YARN (yarn) and Dual Chunk Attention (DCA, chunkllama) to achieve a four-fold increase in sequence length capacity during inference.
|
||||
|
||||
Similar to Qwen2.5 (qwen2.5), we develop scaling laws for optimal hyper-parameters (e.g., learning rate scheduler, and batch size) predictions based on three pre-training stages mentioned above. Through extensive experiments, we systematically study the relationship between model architecture, training data, training stage, and optimal training hyper-parameters. Finally, we set the predicted optimal learning rate and batch size strategy for each dense or MoE model.
|
||||
|
||||
### 3.3 Pre-training Evaluation
|
||||
|
||||
We conduct comprehensive evaluations of the base language models of the Qwen3 series. The evaluation of base models mainly focuses on their performance in general knowledge, reasoning, mathematics, scientific knowledge, coding, and multilingual capabilities. The evaluation datasets for pre-trained base models include 15 benchmarks:
|
||||
|
||||
* •
|
||||
General Tasks: MMLU (mmlu) (5-shot), MMLU-Pro (mmlupro) (5-shot, CoT), MMLU-redux (mmluredux) (5-shot), BBH (bbh) (3-shot, CoT), SuperGPQA (supergpqa)(5-shot, CoT).
|
||||
|
||||
* •
|
||||
Math & STEM Tasks: GPQA (gpqa) (5-shot, CoT), GSM8K (gsm8k) (4-shot, CoT), MATH (math) (4-shot, CoT).
|
||||
|
||||
* •
|
||||
Coding Tasks: EvalPlus (evalplus) (0-shot) (Average of HumanEval (humaneval), MBPP (mbpp), Humaneval+, MBPP+) (evalplus), MultiPL-E (multiple) (0-shot) (Python, C++, JAVA, PHP, TypeScript, C#, Bash, JavaScript), MBPP-3shot (mbpp), CRUX-O of CRUXEval (1-shot) (gu2024cruxeval).
|
||||
|
||||
* •
|
||||
Multilingual Tasks: MGSM (mgsm) (8-shot, CoT), MMMLU (mmmlu) (5-shot), INCLUDE (romanou2024includeevaluatingmultilinguallanguage) (5-shot).
|
||||
|
||||
For the base model baselines, we compare the Qwen3 series base models with the Qwen2.5 base models (qwen2.5) and other leading open-source base models, including DeepSeek-V3 Base (deepseekv3), Gemma-3 (gemma3), Llama-3 (llama3), and Llama-4 (llama4) series base models, in terms of scale of parameters. All models are evaluated using the same evaluation pipeline and the widely-used evaluation settings to ensure fair comparison.
|
||||
|
||||
#### Summary of Evaluation Results
|
||||
|
||||
Based on the overall evaluation results, we highlight some key conclusions of Qwen3 base models.
|
||||
|
||||
1. (1)
|
||||
Compared with the previously open-source SOTA dense and MoE base models (such as DeepSeek-V3 Base, Llama-4-Maverick Base, and Qwen2.5-72B-Base), Qwen3-235B-A22B-Base outperforms these models in most tasks with significantly fewer total parameters or activated parameters.
|
||||
|
||||
2. (2)
|
||||
For the Qwen3 MoE base models, our experimental results indicate that: (a) Using the same pre-training data, Qwen3 MoE base models can achieve similar performance to Qwen3 dense base models with only 1/5 activated parameters. (b) Due to the improvements of the Qwen3 MoE architecture, the scale-up of the training tokens, and more advanced training strategies, the Qwen3 MoE base models can outperform the Qwen2.5 MoE base models with less than 1/2 activated parameters and fewer total parameters. (c) Even with 1/10 of the activated parameters of the Qwen2.5 dense base model, the Qwen3 MoE base model can achieve comparable performance, which brings us significant advantages in inference and training costs.
|
||||
|
||||
3. (3)
|
||||
The overall performance of the Qwen3 dense base models is comparable to the Qwen2.5 base models at higher parameter scales. For example, Qwen3-1.7B/4B/8B/14B/32B-Base achieve comparable performance to Qwen2.5-3B/7B/14B/32B/72B-Base, respectively. Especially in STEM, coding, and reasoning benchmarks, the performance of Qwen3 dense base models even surpasses Qwen2.5 base models at higher parameter scales.
|
||||
|
||||
The detailed results are as follows.
|
||||
|
||||
Table 3: Comparison among Qwen3-235B-A22B-Base and other representative strong open-source baselines. The highest, the second-best scores are shown in bold and underlined, respectively.
|
||||
|
||||
Qwen2.5-72B Qwen2.5-Plus Llama-4-Maverick DeepSeek-V3 Qwen3-235B-A22B
|
||||
Base Base Base Base Base
|
||||
Architecture Dense MoE MoE MoE MoE
|
||||
# Total Params 72B 271B 402B 671B 235B
|
||||
# Activated Params 72B 37B 17B 37B 22B
|
||||
General Tasks
|
||||
MMLU 86.06 85.02 85.16 87.19 87.81
|
||||
MMLU-Redux 83.91 82.69 84.05 86.14 87.40
|
||||
MMLU-Pro 58.07 63.52 63.91 59.84 68.18
|
||||
SuperGPQA 36.20 37.18 40.85 41.53 44.06
|
||||
BBH 86.30 85.60 83.62 86.22 88.87
|
||||
Math & STEM Tasks
|
||||
GPQA 45.88 41.92 43.94 41.92 47.47
|
||||
GSM8K 91.50 91.89 87.72 87.57 94.39
|
||||
MATH 62.12 62.78 63.32 62.62 71.84
|
||||
Coding Tasks
|
||||
EvalPlus 65.93 61.43 68.38 63.75 77.60
|
||||
MultiPL-E 58.70 62.16 57.28 62.26 65.94
|
||||
MBPP 76.00 74.60 75.40 74.20 81.40
|
||||
CRUX-O 66.20 68.50 77.00 76.60 79.00
|
||||
Multilingual Tasks
|
||||
MGSM 82.40 82.21 79.69 82.68 83.53
|
||||
MMMLU 84.40 83.49 83.09 85.88 86.70
|
||||
INCLUDE 69.05 66.97 73.47 75.17 73.46
|
||||
|
||||
#### Qwen3-235B-A22B-Base
|
||||
|
||||
We compare Qwen3-235B-A22B-Base to our previous similar-sized MoE Qwen2.5-Plus-Base (qwen2.5) and other leading open-source base models: Llama-4-Maverick (llama4), Qwen2.5-72B-Base (qwen2.5), DeepSeek-V3 Base (deepseekv3). From the results in Table [3](https://arxiv.org/html/2505.09388#S3.T3 "Table 3 ‣ Summary of Evaluation Results ‣ 3.3 Pre-training Evaluation ‣ 3 Pre-training ‣ Qwen3 Technical Report"), the Qwen3-235B-A22B-Base model attains the highest performance scores across most of the evaluated benchmarks. We further compare Qwen3-235B-A22B-Base with other baselines separately for the detailed analysis.
|
||||
|
||||
1. (1)
|
||||
Compared with the recently open-source model Llama-4-Maverick-Base, which has about twice the number of parameters, Qwen3-235B-A22B-Base still performs better on most benchmarks.
|
||||
|
||||
2. (2)
|
||||
Compared with the previously state-of-the-art open-source model DeepSeek-V3-Base, Qwen3-235B-A22B-Base outperforms DeepSeek-V3-Base on 14 out of 15 evaluation benchmarks with only about 1/3 the total number of parameters and 2/3 activated parameters, demonstrating the powerful and cost-effectiveness of our models.
|
||||
|
||||
3. (3)
|
||||
Compared with our previous MoE Qwen2.5-Plus of similar size, Qwen3-235B-A22B-Base significantly outperforms it with fewer parameters and activated parameters, which shows the remarkable advantages of Qwen3 in pre-training data, training strategy, and model architecture.
|
||||
|
||||
4. (4)
|
||||
Compared with our previous flagship open-source dense model Qwen2.5-72B-Base, Qwen3-235B-A22B-Base surpasses the latter in all benchmarks and uses fewer than 1/3 of the activated parameters. Meanwhile, due to the advantage of the model architecture, the inference costs and training costs on each trillion tokens of Qwen3-235B-A22B-Base are much cheaper than those of Qwen2.5-72B-Base.
|
||||
|
||||
Table 4: Comparison among Qwen3-32B-Base and other strong open-source baselines. The highest and second-best scores are shown in bold and underlined, respectively.
|
||||
|
||||
Qwen2.5-32B Qwen2.5-72B Gemma-3-27B Llama-4-Scout Qwen3-32B
|
||||
Base Base Base Base Base
|
||||
Architecture Dense Dense Dense MoE Dense
|
||||
# Total Params 32B 72B 27B 109B 32B
|
||||
# Activated Params 32B 72B 27B 17B 32B
|
||||
General Tasks
|
||||
MMLU 83.32 86.06 78.69 78.27 83.61
|
||||
MMLU-Redux 81.97 83.91 76.53 71.09 83.41
|
||||
MMLU-Pro 55.10 58.07 52.88 56.13 65.54
|
||||
SuperGPQA 33.55 36.20 29.87 26.51 39.78
|
||||
BBH 84.48 86.30 79.95 82.40 87.38
|
||||
Math & STEM Tasks
|
||||
GPQA 47.97 45.88 26.26 40.40 49.49
|
||||
GSM8K 92.87 91.50 81.20 85.37 93.40
|
||||
MATH 57.70 62.12 51.78 51.66 61.62
|
||||
Coding Tasks
|
||||
EvalPlus 66.25 65.93 55.78 59.90 72.05
|
||||
MultiPL-E 58.30 58.70 45.03 47.38 67.06
|
||||
MBPP 73.60 76.00 68.40 68.60 78.20
|
||||
CRUX-O 67.80 66.20 60.00 61.90 72.50
|
||||
Multilingual Tasks
|
||||
MGSM 78.12 82.40 73.74 79.93 83.06
|
||||
MMMLU 82.40 84.40 77.62 74.83 83.83
|
||||
INCLUDE 64.35 69.05 68.94 68.09 67.87
|
||||
|
||||
Table 5: Comparison among Qwen3-14B-Base, Qwen3-30B-A3B-Base, and other strong open-source baselines. The highest and second-best scores are shown in bold and underlined, respectively.
|
||||
|
||||
Gemma-3-12B Qwen2.5-14B Qwen2.5-32B Qwen2.5-Turbo Qwen3-14B Qwen3-30B-A3B Base Base Base Base Base Base Architecture Dense Dense Dense MoE Dense MoE# Total Params 12B 14B 32B 42B 14B 30B# Activated Params 12B 14B 32B 6B 14B 3B General Tasks MMLU 73.87 79.66 83.32 79.50 81.05 81.38 MMLU-Redux 70.70 76.64 81.97 77.11 79.88 81.17 MMLU-Pro 44.91 51.16 55.10 55.60 61.03 61.49 SuperGPQA 24.61 30.68 33.55 31.19 34.27 35.72 BBH 74.28 78.18 84.48 76.10 81.07 81.54 Math & STEM Tasks GPQA 31.31 32.83 47.97 41.41 39.90 43.94 GSM8K 78.01 90.22 92.87 88.32 92.49 91.81 MATH 44.43 55.64 57.70 55.60 62.02 59.04 Coding Tasks EvalPlus 52.65 60.70 66.25 61.23 72.23 71.45 MultiPL-E 43.03 54.79 58.30 53.24 61.69 66.53 MBPP 60.60 69.00 73.60 67.60 73.40 74.40 CRUX-O 52.00 61.10 67.80 60.20 68.60 67.20 Multilingual Tasks MGSM 64.35 74.68 78.12 70.45 79.20 79.11 MMMLU 72.50 78.34 82.40 79.76 79.69 81.46 INCLUDE 63.34 60.26 64.35 59.25 64.55 67.00
|
||||
|
||||
Table 6: Comparison among Qwen8B-Base and other strong open-source baselines. The highest and second-best scores are shown in bold and underlined, respectively.
|
||||
|
||||
Llama-3-8B Qwen2.5-7B Qwen2.5-14B Qwen3-8B
|
||||
Base Base Base Base
|
||||
Architecture Dense Dense Dense Dense
|
||||
# Total Params 8B 7B 14B 8B
|
||||
# Activated Params 8B 7B 14B 8B
|
||||
General Tasks
|
||||
MMLU 66.60 74.16 79.66 76.89
|
||||
MMLU-Redux 61.59 71.06 76.64 76.17
|
||||
MMLU-Pro 35.36 45.00 51.16 56.73
|
||||
SuperGPQA 20.54 26.34 30.68 31.64
|
||||
BBH 57.70 70.40 78.18 78.40
|
||||
Math & STEM Tasks
|
||||
GPQA 25.80 36.36 32.83 44.44
|
||||
GSM8K 55.30 85.36 90.22 89.84
|
||||
MATH 20.50 49.80 55.64 60.80
|
||||
Coding Tasks
|
||||
EvalPlus 44.13 62.18 60.70 67.65
|
||||
MultiPL-E 31.45 50.73 54.79 58.75
|
||||
MBPP 48.40 63.40 69.00 69.80
|
||||
CRUX-O 36.80 48.50 61.10 62.00
|
||||
Multilingual Tasks
|
||||
MGSM 38.92 63.60 74.68 76.02
|
||||
MMMLU 59.65 71.34 78.34 75.72
|
||||
IINCLUDE 44.94 53.98 60.26 59.40
|
||||
|
||||
Table 7: Comparison among Qwen3-4B-Base and other strong open-source baselines. The highest and second-best scores are shown in bold and underlined, respectively.
|
||||
|
||||
Gemma-3-4B Qwen2.5-3B Qwen2.5-7B Qwen3-4B
|
||||
Base Base Base Base
|
||||
Architecture Dense Dense Dense Dense
|
||||
# Total Params 4B 3B 7B 4B
|
||||
# Activated Params 4B 3B 7B 4B
|
||||
General Tasks
|
||||
MMLU 59.51 65.62 74.16 72.99
|
||||
MMLU-Redux 56.91 63.68 71.06 72.79
|
||||
MMLU-Pro 29.23 34.61 45.00 50.58
|
||||
SuperGPQA 17.68 20.31 26.34 28.43
|
||||
BBH 51.70 56.30 70.40 72.59
|
||||
Math & STEM Tasks
|
||||
GPQA 24.24 26.26 36.36 36.87
|
||||
GSM8K 43.97 79.08 85.36 87.79
|
||||
MATH 26.10 42.64 49.80 54.10
|
||||
Coding Tasks
|
||||
EvalPlus 43.23 46.28 62.18 63.53
|
||||
MultiPL-E 28.06 39.65 50.73 53.13
|
||||
MBPP 46.40 54.60 63.40 67.00
|
||||
CRUX-O 34.00 36.50 48.50 55.00
|
||||
Multilingual Tasks
|
||||
MGSM 33.11 47.53 63.60 67.74
|
||||
MMMLU 59.62 65.55 71.34 71.42
|
||||
INCLUDE 49.06 45.90 53.98 56.29
|
||||
|
||||
Table 8: Comparison among Qwen3-1.7B-Base, Qwen3-0.6B-Base, and other strong open-source baselines. The highest and second-best scores are shown in bold and underlined, respectively.
|
||||
|
||||
Qwen2.5-0.5B Qwen3-0.6B Gemma-3-1B Qwen2.5-1.5B Qwen3-1.7B
|
||||
Base Base Base Base Base
|
||||
Architecture Dense Dense Dense Dense Dense
|
||||
# Total Params 0.5B 0.6B 1B 1.5B 1.7B
|
||||
# Activated Params 0.5B 0.6B 1B 1.5B 1.7B
|
||||
General Tasks
|
||||
MMLU 47.50 52.81 26.26 60.90 62.63
|
||||
MMLU-Redux 45.10 51.26 25.99 58.46 61.66
|
||||
MMLU-Pro 15.69 24.74 9.72 28.53 36.76
|
||||
SuperGPQA 11.30 15.03 7.19 17.64 20.92
|
||||
BBH 20.30 41.47 28.13 45.10 54.47
|
||||
Math & STEM Tasks
|
||||
GPQA 24.75 26.77 24.75 24.24 28.28
|
||||
GSM8K 41.62 59.59 2.20 68.54 75.44
|
||||
MATH 19.48 32.44 3.66 35.00 43.50
|
||||
Coding Tasks
|
||||
EvalPlus 31.85 36.23 8.98 44.80 52.70
|
||||
MultiPL-E 18.70 24.58 5.15 33.10 42.71
|
||||
MBPP 29.80 36.60 9.20 43.60 55.40
|
||||
CRUX-O 12.10 27.00 3.80 29.60 36.40
|
||||
Multilingual Tasks
|
||||
MGSM 12.07 30.99 1.74 32.82 50.71
|
||||
MMMLU 31.53 50.16 26.57 60.27 63.27
|
||||
INCLUDE 24.74 34.26 25.62 39.55 45.57
|
||||
|
||||
#### Qwen3-32B-Base
|
||||
|
||||
Qwen3-32B-Base is our largest dense model among the Qwen3 series. We compare it to the baselines of similar sizes, including Gemma-3-27B (gemma3) and Qwen2.5-32B (qwen2.5). In addition, we introduce two strong baselines: the recently open-source MoE model Llama-4-Scout, which has three times the parameters of Qwen3-32B-Base but half the activated parameters; and our previous flagship open-source dense model Qwen2.5-72B-Base, which has more than twice the number of parameters compared to Qwen3-32B-Base. The results are shown in Table [4](https://arxiv.org/html/2505.09388#S3.T4 "Table 4 ‣ Qwen3-235B-A22B-Base ‣ 3.3 Pre-training Evaluation ‣ 3 Pre-training ‣ Qwen3 Technical Report"), which support three key conclusions:
|
||||
|
||||
1. (1)
|
||||
Compared with the similar-sized models, Qwen3-32B-Base outperforms Qwen2.5-32B-Base and Gemma-3-27B Base on most benchmarks. Notably, Qwen3-32B-Base achieves 65.54 on MMLU-Pro and 39.78 on SuperGPQA, significantly outperforming its predecessor Qwen2.5-32B-Base. In addition, Qwen3-32B-Base achieves significantly higher encoding benchmark scores than all baseline models.
|
||||
|
||||
2. (2)
|
||||
Surprisingly, we find that Qwen3-32B-Base achieves competitive results compared to Qwen2.5-72B-Base. Although Qwen3-32B-Base has less than half the number of parameters of Qwen2.5-72B-Base, it outperforms Qwen2.5-72B-Base in 10 of the 15 evaluation benchmarks. On coding, mathematics, and reasoning benchmarks, Qwen3-32B-Base has remarkable advantages.
|
||||
|
||||
3. (3)
|
||||
Compared to Llama-4-Scout-Base, Qwen3-32B-Base significantly outperforms it on all 15 benchmarks, with only one-third of the number of parameters of Llama-4-Scout-Base, but twice the number of activated parameters.
|
||||
|
||||
#### Qwen3-14B-Base & Qwen3-30B-A3B-Base
|
||||
|
||||
The evaluation of the Qwen3-14B-Base and Qwen3-30B-A3B-Base is compared against baselines of similar sizes, including Gemma-3-12B Base, Qwen2.5-14B Base. Similarly, we also introduce two strong baselines: (1) Qwen2.5-Turbo (qwen2.5), which has 42B parameters and 6B activated parameters. Note that its activated parameters are twice those of Qwen3-30B-A3B-Base. (2) Qwen2.5-32B-Base, which has 11 times the activated parameters of Qwen3-30B-A3B and more than twice that of Qwen3-14B. The results are shown in Table [5](https://arxiv.org/html/2505.09388#S3.T5 "Table 5 ‣ Qwen3-235B-A22B-Base ‣ 3.3 Pre-training Evaluation ‣ 3 Pre-training ‣ Qwen3 Technical Report"), where we can draw the following conclusions.
|
||||
|
||||
1. (1)
|
||||
Compared with the similar-sized models, Qwen3-14B-Base significantly performs better than Qwen2.5-14B-Base and Gemma-3-12B-Base on all 15 benchmarks.
|
||||
|
||||
2. (2)
|
||||
Similarly, Qwen3-14B-Base also achieves very competitive results compared to Qwen2.5-32B-Base with less than half of the parameters.
|
||||
|
||||
3. (3)
|
||||
With only 1/5 activated non-embedding parameters, Qwen3-30B-A3B significantly outperforms Qwen2.5-14B-Base on all tasks, and achieves comparable performance to Qwen3-14B-Base and Qwen2.5-32B-Base, which brings us significant advantages in inference and training costs.
|
||||
|
||||
#### Qwen3-8B / 4B / 1.7B / 0.6B-Base
|
||||
|
||||
For edge-side models, we take similar-sized Qwen2.5, Llama-3, and Gemma-3 base models as the baselines. The results can be seen in Table [6](https://arxiv.org/html/2505.09388#S3.T6 "Table 6 ‣ Qwen3-235B-A22B-Base ‣ 3.3 Pre-training Evaluation ‣ 3 Pre-training ‣ Qwen3 Technical Report"), Table [7](https://arxiv.org/html/2505.09388#S3.T7 "Table 7 ‣ Qwen3-235B-A22B-Base ‣ 3.3 Pre-training Evaluation ‣ 3 Pre-training ‣ Qwen3 Technical Report"), and Table [8](https://arxiv.org/html/2505.09388#S3.T8 "Table 8 ‣ Qwen3-235B-A22B-Base ‣ 3.3 Pre-training Evaluation ‣ 3 Pre-training ‣ Qwen3 Technical Report"). All Qwen3 8B / 4B / 1.7B / 0.6B-Base models continue to maintain strong performance across nearly all benchmarks. Notably, Qwen3-8B / 4B / 1.7B-Base models even outperform larger size Qwen2.5-14B / 7B / 3B Base models on over half of the benchmarks, especially on STEM-related and coding benchmarks, reflecting the significant improvement of the Qwen3 models.
|
||||
|
||||
## 4 Post-training
|
||||
|
||||

|
||||
|
||||
Figure 1: Post-training pipeline of the Qwen3 series models.
|
||||
|
||||
The post-training pipeline of Qwen3 is strategically designed with two core objectives:
|
||||
|
||||
1. (1)
|
||||
Thinking Control: This involves the integration of two distinct modes, namely the ``non-thinking'' and ``thinking'' modes, providing users with the flexibility to choose whether the model should engage in reasoning or not, and to control the depth of thinking by specifying a token budget for the thinking process.
|
||||
|
||||
2. (2)
|
||||
Strong-to-Weak Distillation: This aims to streamline and optimize the post-training process for lightweight models. By leveraging the knowledge from large-scale models, we substantially reduce both the computational costs and the development efforts required for building smaller-scale models.
|
||||
|
||||
As illustrated in Figure [1](https://arxiv.org/html/2505.09388#S4.F1 "Figure 1 ‣ 4 Post-training ‣ Qwen3 Technical Report"), the flagship models in the Qwen3 series follow a sophisticated four-stage training process. The first two stages focus on developing the models' ``thinking'' abilities. The next two stages aim to integrate strong ``non-thinking'' functionalities into the models.
|
||||
|
||||
Preliminary experiments suggest that directly distilling the output logits from teacher models into lightweight student models can effectively enhance their performance while maintaining fine-grained control over their reasoning processes. This approach eliminates the necessity of performing an exhaustive four-stage training process individually for every small-scale model. It leads to better immediate performance, as indicated by higher Pass@1 scores, and also improves the model's ability of exploration, as reflected in improved Pass@64 results. In addition, it achieves these gains with much greater training efficiency, requiring only 1/10 of the GPU hours compared to the four-stage training method.
|
||||
|
||||
In the following sections, we present the four-stage training process and provide a detailed explanation of the Strong-to-Weak Distillation approach.
|
||||
|
||||
### 4.1 Long-CoT Cold Start
|
||||
|
||||
We begin by curating a comprehensive dataset that spans a wide range of categories, including math, code, logical reasoning, and general STEM problems. Each problem in the dataset is paired with verified reference answers or code-based test cases. This dataset serves as the foundation for the ``cold start'' phase of long Chain-of-Thought (long-CoT) training.
|
||||
|
||||
The dataset construction involves a rigorous two-phase filtering process: query filtering and response filtering. In the query filtering phase, we use Qwen2.5-72B-Instruct to identify and remove queries that are not easily verifiable. This includes queries containing multiple sub-questions or those asking for general text generation. Furthermore, we exclude queries that Qwen2.5-72B-Instruct can answer correctly without using CoT reasoning. This helps prevent the model from relying on superficial guessing and ensures that only complex problems requiring deeper reasoning are included. Additionally, we annotate each query's domain using Qwen2.5-72B-Instruct to maintain balanced domain representation across the dataset.
|
||||
|
||||
After reserving a validation query set, we generate N candidate responses for each remaining query using QwQ-32B (qwq32b). When QwQ-32B consistently fails to generate correct solutions, human annotators manually assess the accuracy of the responses. For queries with positive Pass@N, further stringent filtering criteria are applied to remove responses that (1) yield incorrect final answers, (2) contain substantial repetition, (3) clearly indicate guesswork without adequate reasoning, (4) exhibit inconsistencies between the thinking and summary contents, (5) involve inappropriate language mixing or stylistic shifts, or (6) are suspected of being overly similar to potential validation set items. Subsequently, a carefully selected subset of the refined dataset is used for the initial cold-start training of the reasoning patterns. The objective at this stage is to instill foundational reasoning patterns in the model without overly emphasizing immediate reasoning performance. This approach ensures that the model's potential is not limited, allowing for greater flexibility and improvement during the subsequent reinforcement learning (RL) phase. To achieve this objective effectively, it is preferable to minimize both the number of training samples and the training steps during this preparatory phase.
|
||||
|
||||
### 4.2 Reasoning RL
|
||||
|
||||
The query-verifier pairs used in the Reasoning RL stage must satisfy the following four criteria: (1) They were not used during the cold-start phase. (2) They are learnable for the cold-start model. (3) They are as challenging as possible. (4) They cover a broad range of sub-domains. We ultimately collect a total of 3,995 query-verifier pairs, and employed GRPO (deepseekmath) to update the model parameters. We observe that using a large batch size and a high number of rollouts per query, along with off-policy training to improve sample efficiency, is beneficial to the training process. We have also addressed how to balance exploration and exploitation by controlling the model’s entropy to increase steadily or remain stable, which is crucial for maintaining stable training. As a result, we achieve consistent improvements in both training reward and validation performance over the course of a single RL run, without any manual intervention on hyperparameters. For instance, the AIME'24 score of the Qwen3-235B-A22B model increases from 70.1 to 85.1 over a total of 170 RL training steps.
|
||||
|
||||
### 4.3 Thinking Mode Fusion
|
||||
|
||||
The goal of the Thinking Mode Fusion stage is to integrate the ``non-thinking'' capabilities into the previously developed ``thinking'' model. This approach allows developers to manage and control reasoning behaviors, while also reducing the cost and complexity of deploying separate models for thinking and non-thinking tasks. To achieve this, we conduct continual supervised fine-tuning (SFT) on the Reasoning RL model and design a chat template to fuse the two modes. Moreover, we find that models capable of handling both modes proficiently perform consistently well under different thinking budgets.
|
||||
|
||||
#### Construction of SFT data.
|
||||
|
||||
The SFT dataset combines both the ``thinking'' and ``non-thinking'' data. To ensure that the performance of the Stage 2 model is not compromised by the additional SFT, the ``thinking'' data is generated via rejection sampling on Stage 1 queries using the Stage 2 model itself. The ``non-thinking'' data, on the other hand, is carefully curated to cover a diverse range of tasks, including coding, mathematics, instruction-following, multilingual tasks, creative writing, question answering, and role-playing. Additionally, we employ automatically generated checklists for assessing the response quality of ``non-thinking'' data. To enhance the performance on tasks with low-resource languages, we particularly increase the proportion of translation tasks.
|
||||
|
||||
#### Chat Template Design.
|
||||
|
||||
To better integrate the two modes and enable users to dynamically switch the model's thinking process, we design chat templates for Qwen3, as shown in Table [4.3](https://arxiv.org/html/2505.09388#S4.SS3.SSS0.Px3 "Thinking Budget. ‣ 4.3 Thinking Mode Fusion ‣ 4 Post-training ‣ Qwen3 Technical Report"). Specifically, for samples in thinking mode and non-thinking mode, we introduce /think and /no_think flags in the user query or system message, respectively. This allows the model to follow the user's input and select the appropriate thinking mode accordingly. For non-thinking mode samples, we retain an empty thinking block in the assistant's response. This design ensures internal format consistency within the model and allows developers to prevent the model from engaging in thinking behavior by concatenating an empty think block in the chat template. By default, the model operates in thinking mode; therefore, we add some thinking mode training samples where the user queries do not include /think flags. For more complex multi-turn dialogs, we randomly insert multiple /think and /no_think flags into users' queries, with the model response adhering to the last flag encountered.
|
||||
|
||||
#### Thinking Budget.
|
||||
|
||||
An additional advantage of Thinking Mode Fusion is that, once the model learns to respond in both non-thinking and thinking modes, it naturally develops the ability to handle intermediate cases—generating responses based on incomplete thinking. This capability lays the foundation for implementing budget control over the model's thinking process. Specifically, when the length of the model's thinking reaches a user-defined threshold, we manually halt the thinking process and insert the stop-thinking instruction: ``Considering the limited time by the user, I have to give the solution based on the thinking directly now.\n</think>.\n\n''. After this instruction is inserted, the model proceeds to generate a final response based on its accumulated reasoning up to that point. It is worth noting that this ability is not explicitly trained but emerges naturally as a result of applying Thinking Mode Fusion.
|
||||
|
||||
Table 9: Examples of SFT data for thinking and non-thinking modes during the thinking mode fusion stage. For the thinking mode, the /think flag can be omitted since it represents the default behavior. This feature has been implemented in the chat template 2 2 2[https://huggingface.co/Qwen/Qwen3-32B/blob/main/tokenizer_config.json](https://huggingface.co/Qwen/Qwen3-32B/blob/main/tokenizer_config.json) supported by the Hugging Face's tokenizer, where the thinking mode can be disabled using an additional parameter enable_thinking=False.
|
||||
Reference in New Issue
Block a user