Compare commits

...

6 Commits

Author SHA1 Message Date
wassname 5fca5ad2b2 Refresh Schulman cache anchors after transcript rewrite
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-25 10:31:39 +08:00
wassname f8f512f603 Cite Irpan in research taste (signs-of-life, seed canary)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-25 10:31:39 +08:00
wassname 3fe6cb9ad9 Replace OCR-garbled Schulman cache with clean slide transcript
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-25 10:31:39 +08:00
wassname 67d4dc90bb Document quote-first evidence style 2026-06-25 10:31:39 +08:00
wassname 20f03f20b8 Expand research taste appendix with expert quotes 2026-06-25 10:31:39 +08:00
wassname 8fc2c0bbd0 Add research taste evidence appendix 2026-06-25 10:31:39 +08:00
14 changed files with 840 additions and 169 deletions
+21
@@ -0,0 +1,21 @@
# Local Instructions
## Quote-first evidence
When adding evidence or appendix material for this skill, prefer the expert's
own words over assistant synthesis. These sources are high-level SME material,
often above current frontier LLM taste and more diverse than model priors.
Quoting them is grounding data; rewording them injects assistant bias.
- Use generous quote blocks, usually 2-3 sentences or a full paragraph.
- Preserve the author's paragraph flow. Do not insert blank blockquote lines
between sentences from the same paragraph; it makes expert prose harder to
read and adds fake structure.
- Keep editorial text sparse and mostly for routing: why this source is here,
when to read it, and what local file it supports.
- Do not atomize useful source material into tiny one-line quotes when a longer
block preserves the author's reasoning.
- Put lower-relevance sources in "See also" rather than forcing a synthetic
narrative around them.
- In `SKILL.md`, link to reference docs like `refs/research_taste.md` instead
of copying a long assistant-written summary.
+1 -1
@@ -216,7 +216,7 @@ Folklore sources (the quotes above trace to these):
 [^jones]: Andy Jones, "Debugging RL, Without the Agonizing Pain" — https://andyljones.com/posts/rl-debugging.html ([cache](docs/evidence/andyljones_rl_debugging.md): anomalies L103-109, write-from-scratch L155, assume-bug L176-180, raise-threshold L182, loss-curve L186-188)
 [^rahtz]: Matthew Rahtz (Amid Fish), "Lessons Learned Reproducing a Deep RL Paper" — http://amid.fish/reproducing-deep-rl ([cache](docs/evidence/amid_fish_reproducing_deep_rl.md): frame-diff confusion L85-87, investigate-confusion L100-102, think-more L145-153, don't-implement-RL-yourself L497-501)
-[^schulman]: John Schulman, "Nuts and Bolts of Deep RL Research" slides — http://joschu.net/docs/nuts-and-bolts.pdf ([cache](docs/evidence/joschu_nuts_and_bolts.md): Always-Be-Ablating L98-101, standardize-observations L118-125; rendered as bullets because the PDF source is slide fragments)
+[^schulman]: John Schulman, "Nuts and Bolts of Deep RL Research" slides — http://joschu.net/docs/nuts-and-bolts.pdf ([cache](docs/evidence/joschu_nuts_and_bolts.md): Always-Be-Ablating L71-75, standardize-observations L84-88; clean slide transcript)
 [^henderson]: Henderson et al., "Deep Reinforcement Learning that Matters" (AAAI 2018) — https://arxiv.org/abs/1709.06560 ([cache](docs/evidence/henderson_2018_deep_rl_matters.md): seeds-create-different-distributions L235, implementation-differences L251)
 [^cs231n]: Stanford CS231n, "Neural Networks Part 3" — https://cs231n.github.io/neural-networks-3/ ([cache](docs/evidence/cs231n_neural_networks_3.md): overfit-tiny-subset L89)
 [^slavv]: Slav Ivanov, "37 Reasons why your Neural Network is not working" (2017) — https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607 ([cache](docs/evidence/slavv_37_reasons_nn.md): opening anecdote L19, emergency checklist L45-51)
+3 -2
@@ -163,7 +163,7 @@ Henderson confirmed it quantitatively: splitting 10 same-config runs (differing
 ### Normalize and scale everything
-From the slides[^schulman] (bullet points, de-artifacted from the PDF):
+From the slides[^schulman]:
 > - If observations have unknown range, standardize
 > - Compute running estimate of mean and standard deviation
 > - x' = clip((x - mu)/sigma, -10, 10)
@@ -287,6 +287,7 @@ Open the relevant one when the task calls for it. These are synthesized checklis
 - [refs/metric_stuck.md](refs/metric_stuck.md) — "why won't this metric move?" plus the structural-ceiling check.
 - [refs/sweeps.md](refs/sweeps.md) — same-seed paired comparison and cross-seed t-stat reliability, for before you claim method A beats method B.
 - [refs/llm_judges.md](refs/llm_judges.md) — LLM-as-a-judge biases (position, verbosity, self-preference) and the mitigation checklist, for when an LLM-judged eval looks too good.
+- [refs/research_taste.md](refs/research_taste.md) — quote-first research taste appendix: Nanda/Olah/Steinhardt/Spinning Up on patience, choosing what to try, information gain, de-risking, and distillation.
 - [refs/transformers.md](refs/transformers.md) — transformer-specific folklore: full traces, warmup/LR, optimizer evidence, train-deploy parity, scale priors, steering, and disclosed-training reports.
 - [rl/SKILL.md](rl/SKILL.md) — RL-specific: probe environments, reward engineering, HP defaults, reference implementations.
 - [pinn/SKILL.md](pinn/SKILL.md) — physics-informed networks: nondimensionalization, gradient pathologies, curriculum.
@@ -307,7 +308,7 @@ Folklore sources (the quotes above trace to these):
 [^karpathy-recipe]: Andrej Karpathy, "A Recipe for Training Neural Networks" (2019) — https://karpathy.github.io/2019/04/25/recipe/ ([cache](docs/evidence/karpathy_recipe_training_nn_2019.md): inspect-data L26+L32, fixed-seed L39, overfit-one-batch L51, Adam-3e-4 L73; note: this is an abridged note with its own "..." elisions)
 [^karpathy-mistakes]: Andrej Karpathy, "most common neural net mistakes" tweet thread, 1 Jul 2018 — https://x.com/karpathy/status/1013244313327681536 ([cache](docs/evidence/karpathy_common_mistakes_tweet_2018.md): tweets 1-3 verbatim, cross-checked against threadreaderapp; x.com itself blocks fetching)
 [^sculley]: Sculley et al., "Hidden Technical Debt in Machine Learning Systems" (NIPS 2015) — https://papers.nips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf ([cache](docs/evidence/sculley_2015_hidden_technical_debt.md): abstract, CACE/entanglement, ensemble caveat)
-[^schulman]: John Schulman, "Nuts and Bolts of Deep RL Research" slides — http://joschu.net/docs/nuts-and-bolts.pdf ([cache](docs/evidence/joschu_nuts_and_bolts.md): Always-Be-Ablating L98-101, standardize-observations L118-125; rendered as bullets because the PDF source is slide fragments)
+[^schulman]: John Schulman, "Nuts and Bolts of Deep RL Research" slides — http://joschu.net/docs/nuts-and-bolts.pdf ([cache](docs/evidence/joschu_nuts_and_bolts.md): Always-Be-Ablating L71-75, standardize-observations L84-88; clean slide transcript)
 [^henderson]: Henderson et al., "Deep Reinforcement Learning that Matters" (AAAI 2018) — https://arxiv.org/abs/1709.06560 ([cache](docs/evidence/henderson_2018_deep_rl_matters.md): seeds-create-different-distributions L235, implementation-differences L251)
 [^irpan]: Alex Irpan, "Deep Reinforcement Learning Doesn't Work Yet" (2018) — https://www.alexirpan.com/2018/02/14/rl-hard.html ([cache](docs/evidence/alexirpan_rl_hard.md): variance-bug-or-unlucky L674-678, seed-canary L705-707)
 [^cs231n]: Stanford CS231n, "Neural Networks Part 3" — https://cs231n.github.io/neural-networks-3/ ([cache](docs/evidence/cs231n_neural_networks_3.md): overfit-tiny-subset L89)
+110 -166
@@ -1,199 +1,143 @@
Source: http://joschu.net/docs/nuts-and-bolts.pdf Source: http://joschu.net/docs/nuts-and-bolts.pdf
Title: Nuts and Bolts of Deep RL Research - John Schulman (2016) Title: Nuts and Bolts of Deep RL Research - John Schulman (2016)
Fetched-via: bash -c 'uvx "markitdown[pdf]" http://joschu.net/docs/nuts-and-bolts.pdf' Fetched-via: clean transcript of the slide deck (prior markitdown PDF extract was OCR-garbled with `(cid:73)` bullet glyphs; replaced with hand-pasted text)
Fetch-status: verbatim Fetch-status: verbatim (clean paste)
| The Nuts | and Bolts | of Deep | RL Research | # The Nuts and Bolts of Deep RL Research
| -------- | --------- | --------- | ----------- |
| | John | Schulman | |
| | December | 9th, 2016 | |
Outline John Schulman, December 9th, 2016
| Approaching | New Problems | |
| --------------------- | ------------ | ---------- |
| Ongoing Development | | and Tuning |
| General Tuning | Strategies | for RL |
| Policy Gradient | Strategies | |
| Q-Learning Strategies | | |
| Miscellaneous | Advice | |
Approaching New Problems ## Outline
| New Algorithm? | Use Small | Test Problems | - Approaching New Problems
| -------------------------- | --------- | ------------- | - Ongoing Development and Tuning
| (cid:73) Run experiments | quickly | | - General Tuning Strategies for RL
| (cid:73) Do hyperparameter | search | | - Policy Gradient Strategies
(cid:73) Interpret and visualize learning process: state visitation, value function, etc. - Q-Learning Strategies
(cid:73) Counterpoint: dont overfit algorithm to contrived problem - Miscellaneous Advice
(cid:73) Useful to have medium-sized problems that youre intimately familiar with
(Hopper, Atari Pong)
| New Task? | Make | It Easier Until | Signs | of Life | ## Approaching New Problems
| ---------------- | --------------- | --------------- | ----- | ------- |
| (cid:73) Provide | good input | features | | | New Algorithm? Use Small Test Problems
| (cid:73) Shape | reward function | | | | - Run experiments quickly
- Do hyperparameter search
- Interpret and visualize learning process: state visitation, value function, etc.
- Counterpoint: don't overfit algorithm to contrived problem
- Useful to have medium-sized problems that you're intimately familiar with (Hopper, Atari Pong)
New Task? Make It Easier Until Signs of Life
- Provide good input features
- Shape reward function
POMDP Design POMDP Design
(cid:73) Visualize random policy: does it sometimes exhibit desired behavior? - Visualize random policy: does it sometimes exhibit desired behavior?
| (cid:73) Human | control | | | | - Human control
| -------------- | ------- | --- | --- | --- | - Atari: can you see game features in downsampled image?
(cid:73) Atari: can you see game features in downsampled image? - Plot time series for observations and rewards. Are they on a reasonable scale?
(cid:73) Plot time series for observations and rewards. Are they on a reasonable - hopper.py in gym: reward = 1.0 - 1e-3 * np.square(a).sum() + delta x / delta t
scale? - Histogram observations and rewards
| (cid:73) hopper.py | in gym: | | | |
| ------------------ | ------------ | --------------------------- | ------- | ----------- |
| reward | = 1.0 | - 1e-3 * np.square(a).sum() | + delta | x / delta t |
| (cid:73) Histogram | observations | and rewards | | |
Run Your Baselines Run Your Baselines
| (cid:73) Dont expect | them to | work with default | parameters | - Don't expect them to work with default parameters
| --------------------- | ------- | ----------------- | ---------- | - Recommended:
(cid:73) Recommended: - Cross-entropy method[^1]
| Cross-entropy | method1 | | | - Well-tuned policy gradient method[^2]
| ------------- | ------- | --- | --- | - Well-tuned Q-learning + SARSA method
(cid:73)
| (cid:73) Well-tuned | policy gradient | method2 | |
| ------------------- | --------------- | -------------- | --- |
| (cid:73) Well-tuned | Q-learning | + SARSA method | |
1Istv´anSzitaandAndr´asL¨orincz(2006).“LearningTetrisusingthenoisycross-entropymethod”. In:Neuralcomputation.
2https://github.com/openai/rllab
| Run with | More Samples | Than | Expected | | Run with More Samples Than Expected
| -------- | ------------ | ---- | -------- | --- | - Early in tuning process, may need huge number of samples
(cid:73) Early in tuning process, may need huge number of samples - Don't be deterred by published work
| | Dont be deterred | by published | work | | - Examples:
| --- | ----------------- | ------------ | ---- | --- | - TRPO on Atari: 100K timesteps per batch for KL= 0.01
(cid:73) - DQN on Atari: update freq=10K, replay buffer size=1M
| (cid:73) Examples: | | | | |
| ------------------ | --- | --- | --- | --- |
(cid:73) TRPO on Atari: 100K timesteps per batch for KL= 0.01
| | DQN on Atari: | update freq=10K, | replay buffer | size=1M |
| --- | ------------- | ---------------- | ------------- | ------- |
(cid:73)
| Ongoing | Development | and Tuning | ## Ongoing Development and Tuning
| ------- | ----------- | ---------- |
| It | Works! | But | Dont | Be Satisfied | | | It Works! But Don't Be Satisfied
| --- | ---------------- | ----------- | ----- | ----------------- | --- | --- | - Explore sensitivity to each parameter
| | (cid:73) Explore | sensitivity | | to each parameter | | | - If too sensitive, it doesn't really work, you just got lucky
(cid:73) If too sensitive, it doesnt really work, you just got lucky - Look for health indicators
| | (cid:73) Look | for health | indicators | | | | - VF fit quality
| --- | ------------- | --------------- | ---------- | --- | --- | --- | - Policy entropy
| | | (cid:73) VF fit | quality | | | | - Update size in output space and parameter space
| | | Policy | entropy | | | | - Standard diagnostics for deep networks
(cid:73)
| | | (cid:73) Update | size in | output space | and parameter | space |
| --- | --- | ----------------- | ----------- | ------------ | ------------- | ----- |
| | | (cid:73) Standard | diagnostics | for | deep networks | |
| Continually | Benchmark | | Your Code | Continually Benchmark Your Code
| ------------------- | --------- | ------------- | ------------ | - If reusing code, regressions occur
| (cid:73) If reusing | code, | regressions | occur | - Run a battery of benchmarks occasionally
| (cid:73) Run | a battery | of benchmarks | occasionally |
| Always | Use Multiple | Random | Seeds | Always Use Multiple Random Seeds
| ------ | ------------ | ------ | ----- |
| Always Be | Ablating | | Always Be Ablating
| ------------------ | ---------- | ---------- | - Different tricks may substitute
| (cid:73) Different | tricks may | substitute | - Especially whitening
| Especially | whitening | | - "Regularize" to favor simplicity in algorithm design space
(cid:73) - As usual, simplicity → generalization
(cid:73) “Regularize” to favor simplicity in algorithm design space
| (cid:73) As | usual, simplicity | → generalization |
| ----------- | ----------------- | ---------------- |
| Automate Your | Experiments | | | Automate Your Experiments
| ------------- | ---------------- | --------- | ----------------- | - Don't spend all day watching your code print out numbers
| Dont spend | all day watching | your code | print out numbers | - Consider using a cloud computing platform (Microsoft Azure, Amazon EC2, Google Compute Engine)
(cid:73)
(cid:73) Consider using a cloud computing platform (Microsoft Azure, Amazon EC2,
| Google Compute | Engine) | | |
| -------------- | ------- | --- | --- |
| General | Tuning | Strategies | for RL | ## General Tuning Strategies for RL
| ------- | ------ | ---------- | ------ |
| Whitening | / Standardizing | Data | Whitening / Standardizing Data
| ------------------------ | --------------- | ------------------ | - If observations have unknown range, standardize
| (cid:73) If observations | have unknown | range, standardize | - Compute running estimate of mean and standard deviation
(cid:73) Compute running estimate of mean and standard deviation - x' = clip((x - μ)/σ, -10, 10)
x(cid:48) - Rescale the rewards, but don't shift mean, as that affects agent's will to live
(cid:73) = clip((x −µ)/σ,10,10) - Standardize prediction targets (e.g., value functions) the same way
(cid:73) Rescale the rewards, but dont shift mean, as that affects agents will to live
(cid:73) Standardize prediction targets (e.g., value functions) the same way
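A minimal sketch of the running standardization described in the Whitening / Standardizing Data slide, written in plain Python for a scalar observation. The class name `RunningStandardizer`, the Welford-style online update, and the `eps` guard are assumptions added for illustration; the slide itself only specifies a running mean and standard deviation and the clip to [-10, 10].

```python
import math


class RunningStandardizer:
    """Hypothetical helper: x' = clip((x - mu) / sigma, -clip, clip)."""

    def __init__(self, clip=10.0, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.clip = clip
        self.eps = eps  # assumption: avoids division by zero before data arrives

    def update(self, x):
        """Fold one observation into the running mean and variance (Welford)."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def std(self):
        """Population standard deviation of everything seen so far."""
        return math.sqrt(self.m2 / self.count) if self.count > 0 else 0.0

    def __call__(self, x):
        """Standardize x with the current estimates, then clip."""
        z = (x - self.mean) / (self.std() + self.eps)
        return max(-self.clip, min(self.clip, z))


standardizer = RunningStandardizer()
for obs in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    standardizer.update(obs)
```

Following the slide, rewards would be rescaled by a running standard deviation without subtracting the mean; this sketch covers observations only.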
| Generally | Important | Parameters | | | | Generally Important Parameters
| --------- | --------------- | ------------- | ---- | ------- | --------- | - Discount
| (cid:73) | Discount | | | | | - Return_t = r_t + γr_{t+1} + γ²r_{t+2} + . . .
| | (cid:73) Return | = r +γr | +γ2r | +... | | - Effective time horizon: 1 + γ + γ² + · · · = 1/(1 - γ)
| | | t t | t+1 | t+2 | | - I.e., γ = 0.99 ⇒ ignore rewards delayed by more than 100 timesteps
| | Effective | time horizon: | 1+γ | +γ2+··· | = 1/(1−γ) | - Low γ works well for well-shaped reward
(cid:73) - In TD(λ) methods, can get away with high γ when λ < 1
(cid:73) I.e., γ =0.99⇒ ignore rewards delayed by more than 100 timesteps - Action frequency
| | Low | γ works well | for well-shaped | reward | | - Solvable with human control (if possible)
| --- | --- | ------------ | --------------- | ------ | --- | - View random exploration
(cid:73)
(cid:73) In TD(λ) methods, can get away with high γ when λ < 1
| (cid:73) | Action frequency | | | | |
| -------- | ---------------- | ---------- | ------- | ------------- | --- |
| | Solvable | with human | control | (if possible) | |
(cid:73)
| | (cid:73) View | random exploration | | | |
| --- | ------------- | ------------------ | --- | --- | --- |
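The discount arithmetic on the Generally Important Parameters slide can be checked with a short sketch. The function names are hypothetical; the formulas are the ones on the slide (discounted return and the geometric-series horizon 1/(1 - γ)).

```python
def effective_horizon(gamma):
    """Sum of the series 1 + gamma + gamma^2 + ... for 0 <= gamma < 1."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return 1.0 / (1.0 - gamma)


def discounted_return(rewards, gamma):
    """Return_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    total = 0.0
    # Accumulate from the last reward backwards so each step is one multiply-add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


horizon = effective_horizon(0.99)
long_constant_return = discounted_return([1.0] * 5000, 0.99)
```

With a constant reward of 1 per step, the discounted return approaches the effective horizon, which is where the "about 100 timesteps" reading of γ = 0.99 comes from.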
General RL Diagnostics General RL Diagnostics
(cid:73) Look at min/max/stdev of episode returns, along with mean - Look at min/max/stdev of episode returns, along with mean
(cid:73) Look at episode lengths: sometimes provides additional information - Look at episode lengths: sometimes provides additional information
| (cid:73) Solving problem | faster, losing | game slower | - Solving problem faster, losing game slower
| ------------------------ | -------------- | ----------- |
Policy Gradient Strategies ## Policy Gradient Strategies
| Entropy as | Diagnostic | | | Entropy as Diagnostic
| ------------------ | ---------------- | ------- | ------------- | - Premature drop in policy entropy ⇒ no learning
| (cid:73) Premature | drop in policy | entropy | ⇒ no learning | - Alleviate by using entropy bonus or KL penalty
| (cid:73) Alleviate | by using entropy | bonus | or KL penalty |
KL as Diagnostic KL as Diagnostic
(cid:2) (cid:3) - Compute KL [π_old(· | s), π(· | s)]
| (cid:73) Compute | KL π | (·|s),π(·|s) | | - KL spike ⇒ drastic loss of performance
| ---------------- | ---- | ------------ | --- | - No learning progress might mean steps are too large
old - batchsize=100K converges to different result than batchsize=20K.
| (cid:73) KL spike | ⇒ drastic | loss of performance | |
| -------------------- | --------- | ------------------- | ------------- |
| (cid:73) No learning | progress | might mean steps | are too large |
(cid:73) batchsize=100K converges to different result than batchsize=20K.
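For a discrete (categorical) policy, the entropy and KL diagnostics from the two slides above can be sketched as follows. Function names are hypothetical and the `eps` guard against log(0) is an assumption; the slides do not prescribe an implementation or any alert thresholds.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def categorical_kl(p_old, p_new, eps=1e-12):
    """KL[pi_old(.|s), pi(.|s)] for one state; eps is an assumed guard."""
    return sum(
        po * (math.log(po + eps) - math.log(pn + eps))
        for po, pn in zip(p_old, p_new)
        if po > 0.0
    )


def mean_kl(old_batch, new_batch):
    """Average the per-state KL over a batch of visited states."""
    kls = [categorical_kl(po, pn) for po, pn in zip(old_batch, new_batch)]
    return sum(kls) / len(kls)


old_policy = [[0.5, 0.5], [0.5, 0.5]]
new_policy = [[0.5, 0.5], [0.9, 0.1]]
batch_kl = mean_kl(old_policy, new_policy)
```

Logging both quantities per update makes a premature entropy drop or a KL spike visible next to the return curve.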
| Baseline | Explained | Variance | Baseline Explained Variance
| -------- | --------- | -------- | - explained variance = 1 - Var[empirical return - predicted value] / Var[empirical return]
1Var[empiricalreturnpredictedvalue]
| (cid:73) | explained variance | = |
| -------- | ------------------ | --- |
Var[empiricalreturn]
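A minimal sketch of the baseline explained-variance diagnostic, using only the standard library. The function name and the NaN return for constant returns are assumptions; the formula is the one on the slide, 1 - Var[empirical return - predicted value] / Var[empirical return].

```python
from statistics import pvariance


def explained_variance(returns, predicted):
    """1 - Var[returns - predicted] / Var[returns], using population variance."""
    returns = [float(r) for r in returns]
    predicted = [float(v) for v in predicted]
    var_returns = pvariance(returns)
    if var_returns == 0.0:
        # Assumption: the ratio is undefined when every return is identical.
        return float("nan")
    residuals = [r - v for r, v in zip(returns, predicted)]
    return 1.0 - pvariance(residuals) / var_returns


perfect = explained_variance([1, 2, 3, 4], [1, 2, 3, 4])
constant_baseline = explained_variance([1, 2, 3, 4], [0, 0, 0, 0])
```

A value near 1 means the value function tracks returns closely, a value near 0 means it does no better than a constant, and negative values mean it is worse than a constant. Because the residual variance ignores a constant offset, this diagnostic does not detect a uniformly biased baseline.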
Policy Initialization Policy Initialization
(cid:73) More important than in supervised learning: determines initial state - More important than in supervised learning: determines initial state visitation
visitation - Zero or tiny final layer, to maximize entropy
| (cid:73) Zero | or tiny final layer, | to maximize | entropy |
| ------------- | -------------------- | ----------- | ------- |
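A small sketch of why a zero final layer maximizes initial entropy for a softmax policy: with zero weights and biases every logit is equal, so the action distribution is uniform. The helper names and the four-action, three-feature example are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def final_layer(weights, bias, features):
    """Linear output layer: one logit per action."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def entropy(probs):
    """Shannon entropy (in nats) of an action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


num_actions, num_features = 4, 3
zero_weights = [[0.0] * num_features for _ in range(num_actions)]
zero_bias = [0.0] * num_actions

# Any input features give identical logits, hence a uniform policy.
probs = softmax(final_layer(zero_weights, zero_bias, [0.3, -1.2, 5.0]))
initial_entropy = entropy(probs)
```

The resulting entropy equals log(4), the maximum for four actions; a "tiny" rather than exactly zero final layer would give a value slightly below that.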
| Q-Learning Strategies | | | ## Q-Learning Strategies
| --------------------- | --- | --- |
(cid:73) Optimize memory usage carefully: youll need it for replay buffer
| (cid:73) Learning | rate schedules | |
| -------------------- | -------------- | ------ |
| (cid:73) Exploration | schedules | |
| (cid:73) Be patient. | DQN converges | slowly |
(cid:73) On Atari, often 10-40M frames to get policy much better than random
ThankstoSzymonSidorforsuggestions
Miscellaneous Advice - Optimize memory usage carefully: you'll need it for replay buffer
(cid:73) Read older textbooks and theses, not just conference papers - Learning rate schedules
(cid:73) Dont get stuck on problems—cant solve everything at once - Exploration schedules
| (cid:73) Exploration | problems | like cart-pole swing-up | - Be patient. DQN converges slowly
| -------------------- | ----------------- | ----------------------- | - On Atari, often 10-40M frames to get policy much better than random
| (cid:73) DQN on | Atari vs CartPole | |
Thanks! Thanks to Szymon Sidor for suggestions
## Miscellaneous Advice
- Read older textbooks and theses, not just conference papers
- Don't get stuck on problems—can't solve everything at once
- Exploration problems like cart-pole swing-up
- DQN on Atari vs CartPole
[^1]: István Szita and András Lőrincz (2006). "Learning Tetris using the noisy cross-entropy method". In: Neural computation.
[^2]: https://github.com/openai/rllab
@@ -0,0 +1,67 @@
# Highly Opinionated Advice on How to Write ML Papers - Neel Nanda (2025-05-12)
Source: https://www.lesswrong.com/posts/eJGptPbbFPZGLpjsp/highly-opinionated-advice-on-how-to-write-ml-papers
Author: Neel Nanda
Date: 12th May 2025
Fetch-status: excerpted from LessWrong HTML via browser.
Use: distillation and paper-writing evidence. This is adjacent to the research-process sequence, and directly useful when turning messy findings into a public artifact.
## Why this matters for agents
This post is the operational version of the distillation stage: compress the research into a few claims, red-team the evidence, write to inform rather than persuade, and spend disproportionate care on the abstract, intro, figures, and limitations.
## Quotes
> The essence of an ideal paper is the narrative: a short, rigorous and evidence-based technical story you tell, with a takeaway the readers care about.
> The first step is to compress your research into these claims.
> Experimental Evidence: This is absolutely crucial to get right and aggressively red-team, it's how you resist the temptation of elegant but false narratives.
> Inform, not persuade: Avoid the trap of overclaiming or ignoring limitations.
> Your research only matters if people read, understand, and build upon it.
> At its core, a paper should present a narrative of one to three specific concrete claims that you believe to be true, that build to some useful takeaway(s).
> Readers will rarely take away more than a few sentences of content. Choose those sentences carefully.
> Generally, stronger statements make for more interesting papers, but require higher standards of evidence - resist the temptation to overclaim for clicks!
> Warning: Before moving into paper-writing mode, it's crucial to verify that your evidence is actually correct.
> Novelty means it expands our knowledge.
> Rigorous, at-scale replications of shaky results, negative results of seemingly promising hypotheses, and high-quality failed replications of popular papers are all very valuable contributions.
> A particularly important thing to get right is extensive red-teaming: you should spend a good amount of your time, both during the original research and now, red teaming your narrative.
> Good experiments distinguish between hypotheses.
> This skepticism and sanity checking is especially key for particularly surprising or novel bits of evidence.
> Ablation studies: When a paper introduces a complex new method, there are often several moving parts.
> Track pre/post-hoc analysis.
> Quality Over Quantity: Try to prioritise having at least one really compelling and hard to deny experiment, over a bunch of mediocre ones.
> Baselines are Crucial.
> The subtlety of baselines: It's not enough to just have them; you must strive to have the strongest possible baselines.
> The Guiding Question for Evidence: Ultimately, the question to ask about your evidence is: "Should this update a reader's beliefs about my claims?"
> Reproducibility & Publishing code: Rigour can be in the eye of the beholder: if readers cannot understand or verify it for themselves, it's far harder to consider it rigorous.
> A key challenge in paper writing is the illusion of transparency - you have spent months steeped in the context of this research project.
## Source graph
Links visible in this post worth follow-up:
- Research process sequence: https://www.lesswrong.com/s/5GT3yoYM9gRmMEKqL
- Jakob Foerster writing advice: https://www.jakobfoerster.com/
- Jacob Steinhardt writing/research advice: https://cs.stanford.edu/~jsteinhardt/
- Refusal is mediated by a single direction: https://arxiv.org/abs/2406.11717
- Nanda grokking work: https://arxiv.org/abs/2301.05217
- Paper writing checklist: Google Docs link visible in post, not cached.
@@ -0,0 +1,51 @@
# How I Think About My Research Process: Explore, Understand, Distill - Neel Nanda (2025-04-26)
Source: https://www.lesswrong.com/posts/hjMy4ZxS5ogA9cTYK/how-i-think-about-my-research-process-explore-understand
Mirror/sequence URL visible on page: https://www.lesswrong.com/s/5GT3yoYM9gRmMEKqL/p/hjMy4ZxS5ogA9cTYK
Author: Neel Nanda
Date: 26th Apr 2025
Fetch-status: excerpted from LessWrong HTML via browser plus cross-checked against local shared draft.
Use: research-process / research-taste evidence, especially for agents deciding what mode of work they are in.
## Why this matters for agents
Nanda frames empirical research as stage-dependent. A model should not demand a crisp hypothesis when the right stage is exploration; it should not accept weak, cherry-picked evidence when the task has moved into understanding or distillation.
## Quotes
> This guide focuses more on the strategic (high-level direction, when to give up or pivot, etc) and tactical (what to do next, how to prioritise, etc) aspects of research - the "how to think about it" rather than just the "how to do it." Some of skills (coding, reading papers, understanding ML/mech interp concepts) are vital for how to do it, but not in scope here.
> How to get started? Strategic and tactical thinking are hard skills, and it is rare to be any good at them when starting out at research (or ever tbh). The best way to learn them is by trying things, making predictions, seeing what you get right or wrong (i.e., getting feedback from reality), and iterating.
> I see research as breaking down into a few stages:
>
> 1. Ideation - Choose a problem/domain to focus on
> 2. Exploration - Gain Surface area
> 1. North star: Gain information
> 3. Understanding - Test Hypotheses
> 1. North star: Convince yourself of a key hypothesis
> 4. Distillation - Compress, Refine, Communicate
> 1. North star: Compress your research findings into concise, rigorous truth that you can communicate to the world
> At the start, your understanding of the problem is often vague. Naively, it's easy to think of research as being about testing specific hypotheses, but in practice you often start out not even knowing the right questions to ask, or the most promising directions. The exploration stage is about moving past this.
> Not having a clear goal/next step doesn't mean that you don't need to prioritise! Prioritise for information gain.
> Frequently ask yourself “am I getting enough information per unit time?” If you haven't learned anything recently, shake it up.
> The mark of a good researcher is a deep commitment to skepticism of your results.
> A great experiment elegantly, and conclusively distinguishes between several plausible hypotheses, validates non-trivial predictions made by one hypothesis, and is tractable to implement in practice.
> The north star here is to distill your research findings into concise, rigorous truth that you can communicate to the world.
> Write to inform, not persuade - if you are clear (a high bar), and your results are interesting, people will likely appreciate your work.
## Source graph
High-value links inside or adjacent to this post:
- ARENA curriculum: https://arena-chapter1-transformer-interp.streamlit.app/
- Nanda paper reading list: https://www.alignmentforum.org/posts/NfFST5Mio7BCAQHPA/an-extremely-opinionated-annotated-list-of-my-favourite
- Chris Olah, research taste: https://colah.github.io/notes/taste/
- Nanda Othello research process write-up: https://www.alignmentforum.org/s/nhGNHyJHbrofpPbRG/p/TAz44Lb9n9yf52pv8
- Nanda standards post: https://www.neelnanda.io/blog/35-standards
@@ -0,0 +1,49 @@
# My Research Process: Key Mindsets - Truth-Seeking, Prioritisation, Moving Fast - Neel Nanda (2025-04-27)
Source: https://www.lesswrong.com/s/5GT3yoYM9gRmMEKqL/p/cbBwwm4jW6AZctymL
Author: Neel Nanda
Date: 27th Apr 2025
Fetch-status: excerpted from LessWrong HTML via browser plus cross-checked against local shared draft.
Use: research-process evidence for truth-seeking, prioritization, speed, and action under uncertainty.
## Why this matters for agents
This is the most directly agent-steering post in the sequence. It says the research process needs active skepticism, explicit prioritization, fast feedback loops, and the ability to act under uncertainty without waiting for a perfect next step.
## Quotes
> I think the most important mindsets are:
> * Truth-seeking: By default, many research insights will be false - finding truth is hard. It's not enough to just know this, you must put in active effort to be skeptical and resist bias, lest you risk your research being worthless.
> * Prioritisation: You have finite time, and a lot of possible actions. Your project will live or die according to whether you pick good ones.
> * Moving fast: You have finite time and a lot to do. This doesn't just mean “push yourself to go faster” - there's a lot of ways to eliminate inefficiency without sacrificing quality.
> Insufficient skepticism doesn't feel like insufficient skepticism from the inside. It just feels like doing research.
> This means that you must be putting in constant active effort into ensuring your results are robust. This must be integrated into part of your research process - if you're not, then there's a good chance your results are BS.
> The standard hypothesis testing framework can be misleading here, because it has an implicit frame of being able to list all the hypotheses. But actually, most of your probability mass should normally be on “something I haven't thought of yet”.
> Here the Bayesian frame is often helpful. It's generally overkill to put explicit numbers on everything, but it reminds me to ask the question “was this observation more likely under hypothesis A or B”, not just whether it was predicted by my favourite hypothesis.
> Fundamentally, good prioritisation is about having a clear goal (north star) in mind.
> The first step is just making time to stop and ask yourself “do I endorse what I'm doing, and could I be doing something better?”
> Prioritising and executing are different mental modes and should not be done simultaneously. Keep them separate, and make time to regularly reflect, and time to lock-in and execute on a plan without stressing about if it's the best plan.
> Tight feedback loops are crucial: A key thing to track when doing research is your feedback loops.
> A corollary of this is that you should (often) do fast experiments first. It is far better to do a quick and dirty experiment to get some preliminary signs of life than an extremely long and expensive experiment that will produce conclusive data but only after weeks of work.
> Fail fast. One of the largest time sinks possible is investing weeks to months of effort into a failed research direction. Thus, a key question to ask yourself is: if this direction is doomed, how could I discover this as fast as humanly possible?
> A crucial mindset is being able to do something anyway, despite being so uncertain.
## Source graph
High-value links inside this post:
- Stop pressing the try-harder button: https://www.neelnanda.io/blog/mini-blog-post-6-stop-pressing-the-try-harder-button
- Negative results for SAEs on downstream tasks: https://www.alignmentforum.org/posts/4uXCAJNuPKtKBsi28/negative-results-for-saes-on-downstream-tasks
- Five-minute timers: https://www.neelnanda.io/blog/post-28-on-creativity-the-joys-of-5-minute-timers
- Weekly review / reflection: https://www.neelnanda.io/blog/39-reflection
- Jacob Steinhardt, Research as a Stochastic Decision Process: https://cs.stanford.edu/~jsteinhardt/ResearchasaStochasticDecisionProcess.html
@@ -0,0 +1,51 @@
# My Research Process: Understanding and Cultivating Research Taste - Neel Nanda (2025-05-01)
Source: https://www.lesswrong.com/posts/Ldrss6o3tiKT6NdMm/my-research-process-understanding-and-cultivating-research
Author: Neel Nanda
Date: 1st May 2025
Fetch-status: excerpted from LessWrong HTML via browser plus cross-checked against local shared draft.
Use: core research-taste evidence, especially for deciding whether this should become a separate skill.
## Why this matters for agents
This post gives the boundary: research taste is not just picking ideas. It is judgment under long feedback loops across problem choice, exploration, experiment design, and distillation. It also explains why taste is learnable but slow: the feedback data is sparse.
## Quotes
> What is research taste? As I define it, research taste is far broader than just picking the right problem at the outset. Research is full of key decisions that will affect the future of the project, without an obvious way to find the right answer: from choosing the research problem itself, to identifying which anomalies are and are not worth exploring, distinguishing an experiment that will be compelling from one that'll have inconclusive results, etc.
> I think of taste as the set of intuitions and good judgment that guide a researcher's decisions throughout the research process, any time an ambiguous or open-ended decision like this arises.
> The core problem is you just don't get that much data. Generally the shorter a feedback loop is the more data you will get. By definition research taste is about things that are not immediately obvious.
> I think the main way to speed it up is by getting more data, and by being more sample efficient about the data that you have.
> When you have made a research decision and you eventually get feedback, do a post-mortem analyzing what did and did not work and why and what general themes you could look at in future.
> As discussed, I define research taste broadly: it's the collection of intuitions and judgments that guide good decision-making throughout a research project, especially where feedback loops are long, and the search space is large and open-ended.
> Exploration: A tactical sense for which experiments yield the most insight, recognizing interesting anomalies versus noise, knowing when to dig deeper or move on from a thread.
> Understanding: Designing creative, elegant experiments that cleanly distinguish hypotheses, judging the plausibility and explanatory power of different theories, identifying crucial assumptions or potential confounds.
> Communication & Distillation: Identifying the core, communicable claims within messy findings, structuring a compelling and true narrative, anticipating audience confusion, knowing what makes a result impactful to others.
> The ideal is strategic conviction: the ability to adopt a confident mindset to maintain momentum, while regularly zooming out to reflect and maintaining the capacity for zoomed-out skepticism and the willingness to update or abandon course based on evidence.
> Keep a research log. Ask why things worked or failed. Was it luck, execution, or a fundamental judgment call (taste)?
> Papers are a biased dataset (publication bias!), but still useful.
> Research taste isn't magic. It's a complex set of intuitions and frameworks built incrementally through experience, reflection, and learning from others. It governs the crucial, often implicit, decisions that shape a research project's success.
> Because the feedback loops for high-level strategic taste are long and noisy, don't expect to master it quickly. It's perfectly normal, and indeed expected, to rely heavily on external guidance (like mentors or established research directions) early in your career. Focus first on mastering the skills with shorter feedback loops - coding, running experiments, analyzing data, clearly communicating simple results.
> By actively engaging in research, deliberately reflecting on your decisions and their outcomes, and strategically leveraging the experiences of others, you can accelerate the development of your own research taste. Be patient with the process, especially the long-game aspects like problem selection. Trust that by doing the work and learning effectively from it, your intuition will improve over time.
## Source graph
High-value links inside this post:
- Chris Olah, research taste / supervised data framing: https://colah.github.io/notes/taste/
- Weekly reviews: https://www.neelnanda.io/blog/39-reflection
- Activation patching paper: https://arxiv.org/abs/2309.16042
- Gears-level model reference: https://www.lesswrong.com/posts/nEBbw2Bc2CnN2RMxy/gears-level-models-are-capital-investments
@@ -0,0 +1,75 @@
# Shared Publicly: My Model of the Research Process - Neel Nanda draft/local copy
Source file: /home/wassname/Downloads/[Shared Publicly] My Model of the Research Process_ Explore, Understand, Distill.md
Author shown in content: Neel Nanda
Date: not stated in local file; contains published posts dated 2025-04-26, 2025-04-27, and 2025-05-01 plus expanded stage-guide material.
Fetch-status: local user-provided/downloaded markdown. Treat as a shared draft/local copy, not identical to the public LessWrong pages.
Use: practical stage guide for agents. This is the most operational source for ideation, exploration, understanding, distillation, failure modes, and mentor role.
## Why this matters for agents
The published posts establish the frame. This local draft contains the useful agent checklist: when to ideate, when to explore, what counts as surface area, how to test hypotheses, how to refine evidence, and when to go back a stage.
## Quotes
> You can't do research without a question or a domain. Ideation is about finding fertile ground. It might be quick, eg deferring to a mentor, or it might involve significant exploration itself, with explorations of many unpromising domains before you settle on one.
> While research taste is important, there are many other crucial skills, and research taste itself comprises several distinct abilities that shouldn't be naively conflated. Rather than focusing solely on research taste, I've tried to break down the research process into concrete and specific skills.
> Chris Olah has an excellent short post on what research taste is and exercises to learn it. In this spirit, for each of the aspects of the below, I highly recommend predicting a mentor's answer before asking.
> Ideation ends when you have a clear enough question or domain that you can start generating concrete experiments to run.
> Leverage Mentors: Especially early on, it's fine to let someone else do the work here, i.e. have a mentor recommend a problem.
> This is basically borrowing someone else's research taste, and IMO is one of the most valuable things I do for my mentees.
> Goal: Gain understanding of the problem/domain, start to identify and crystallise interesting hypotheses.
> Your north star is information gained per unit time/effort.
> Crucially, Exploration is not about testing a specific hypothesis. Exploration is about gaining enough of an understanding of a domain that you know what the interesting hypotheses even are.
> It's OK to be confused: It's totally normal to spend a large fraction of this stage feeling pretty confused about what's going on. This is fine and does not mean that you're failing! The key question is whether you feel like you are learning things and becoming less confused.
> Reach for a tool that might show you something interesting, and can be employed fast. Don't hold yourself to the standard of tools that you're confident are good.
> Notice Weirdness: This is critical. Pay close attention to results that are surprising, counter-intuitive, inconsistent, or just feel off. Ask "Why?" relentlessly.
> Research Log: Keep a detailed log (daily or per session). Note down: goals for the session, what you tried, observations (especially weird ones!), links to code/plots (eg to notebooks or git commits or saved plots), brief thoughts/interpretations, ideas for next steps.
> Mentorship Role: Suggesting initial explorations & relevant resources, distinguishing genuinely weird results from known artifacts, providing sanity checks, helping prioritize which weirdness to pursue first.
> Goal: Rigorously testing specific, plausible hypotheses.
> Design High Information Experiments: Design experiments specifically to differentiate between your main hypothesis and the most plausible alternatives. Ask: "What prediction does H1 make that H2 contradicts?" Think like a Bayesian: what evidence is most likely under H1 relative to H2?
> Avoid the mistake of looking for evidence predicted by H1 that's also predicted by a bunch of other things!
> Use appropriate baselines - e.g. it's not enough to show that your technique helps to lower a model's performance on harmful tasks. Does a random vector do worse?
> Actively Seek Alternatives: Explicitly brainstorm other ways your observations could be explained. What are the simplest explanations? What known circuits or phenomena could be involved? What would a strong skeptic argue?
> Mentorship Role: Aggressively red teaming hypotheses and experimental designs. Suggesting crucial alternative hypotheses or experiments. Helping interpret confusing results. Conveying conceptual frameworks to make sense of findings. Pushing for higher standards of rigor and clarity.
> Goal: Distill all the messy insights from your research into concise, rigorous truth to communicate it to the world.
> Select Strongest Evidence: To start, choose the clearest, most convincing experiments, visualizations, and analyses that directly support your main claims.
> Acknowledge limitations: Inevitably, your results will have some limitations - edge cases, ways your evidence could be wrong, etc. I strongly encourage you to discuss these clearly and prominently in a write-up, even if you don't have good counters to it.
> Your goal is to inform not persuade.
> The truth is what it is, and you should strive to understand it, even if it is inconvenient.
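
The random-vector baseline in the quotes above can be made concrete. The following is a minimal sketch, not Nanda's method: `effect_fn` is a hypothetical callable that scores how strongly a vector produces the effect of interest, and `random_like` and `fraction_of_random_at_least_as_good` are illustrative names. It asks how often a random direction of the same norm does at least as well as the candidate.

```python
import math
import random


def random_like(vector, rng):
    """Draw a random direction and rescale it to the norm of `vector`."""
    norm = math.sqrt(sum(x * x for x in vector))
    raw = [rng.gauss(0.0, 1.0) for _ in vector]
    raw_norm = math.sqrt(sum(x * x for x in raw))
    return [x * norm / raw_norm for x in raw]


def fraction_of_random_at_least_as_good(effect_fn, vector, n_random=100, seed=0):
    """Share of norm-matched random vectors whose effect matches or beats `vector`.

    `effect_fn` is a hypothetical scoring function (higher = stronger effect).
    A large share suggests the candidate vector is not doing anything special.
    """
    rng = random.Random(seed)
    target = effect_fn(vector)
    hits = sum(
        effect_fn(random_like(vector, rng)) >= target for _ in range(n_random)
    )
    return hits / n_random


# Toy usage: the "effect" is just the first coordinate, so the candidate
# below sits at the maximum any unit-norm vector can reach.
candidate = [1.0, 0.0, 0.0, 0.0]
share = fraction_of_random_at_least_as_good(lambda v: v[0], candidate)
```

A share near zero says the candidate beats chance directions. A share near one half says a random vector would have supported the same claim.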
## Source graph
Links visible in this local draft worth follow-up:
- Chris Olah, research taste: https://colah.github.io/notes/taste/
- Jacob Steinhardt, Research as a Stochastic Decision Process: https://cs.stanford.edu/~jsteinhardt/ResearchasaStochasticDecisionProcess.html
- Nanda paper reading list: https://www.alignmentforum.org/posts/NfFST5Mio7BCAQHPA/an-extremely-opinionated-annotated-list-of-my-favourite
- Nanda Othello research process: https://www.alignmentforum.org/s/nhGNHyJHbrofpPbRG/p/TAz44Lb9n9yf52pv8
- Nanda five-minute timers: https://www.neelnanda.io/blog/post-28-on-creativity-the-joys-of-5-minute-timers
- Nanda weekly reflection: https://www.neelnanda.io/blog/39-reflection
- Negative results for SAEs: https://www.alignmentforum.org/posts/4uXCAJNuPKtKBsi28/negative-results-for-saes-on-downstream-tasks
- Research Debt: referenced by name in local draft; URL not included in visible excerpt.
@@ -0,0 +1,42 @@
# Research Taste Exercises - Chris Olah (2021-01-09)
Source: https://colah.github.io/notes/taste/
Author: Chris Olah
Date: Posted Jan 9, 2021
Fetch-status: excerpted from HTML via browser.
Use: direct source for research-taste training exercises; cited by Nanda's shared draft and public taste post.
## Why this matters for agents
Olah gives concrete exercises for getting more feedback on taste without spending months executing every idea. This is a good reference for an agent asked to help a researcher improve project selection or calibrate idea quality.
## Quotes
> One of the most important aspects of growing as a researcher is developing research taste -- roughly, the ability to chose good problems to work on.
> I think the fundamental issue is that actually testing whether a research idea you come up with is good is very expensive. Often it takes months, so you only really get a few pieces of feedback on your taste every year.
> Many of the following exercises are really strategies for getting (proxy) feedback on more research ideas faster.
> Write down a list of research ideas. Have a mentor you respect rate each idea 1-10. Discuss ideas where you disagree with them after reflection.
> Pay attention when other people try ideas you've had. How did the results compare with your expectations?
> Interview researchers around you on their taste. Why do they work on the problems they do? How do they pick problems? What's their “big picture” of research?
> Critically consider your research taste, and the community taste around you. Your taste is likely very influenced by your research cluster (your collaborators, advisor, etc).
> Failure Mode 1: Getting overly attached to one research direction / falling into sunk costs.
> Failure mode 2: Lack of research knowledge / intimacy.
> Theoretical knowledge is table stakes for research taste. You can't have research taste in a vacuum.
> Failure mode 3: Environment not aligned with your interests.
## Source graph
Links visible in the post worth follow-up:
- Hamming, You and Your Research: linked via YouTube.
- Michael Nielsen, Principles of Effective Research: https://michaelnielsen.org/blog/principles-of-effective-research/
- Andy Matuschak taste-related thread: linked as Twitter, may need archival route.
@@ -0,0 +1,35 @@
# ML Engineering for AI Safety & Robustness - Catherine Olsson and 80,000 Hours (2018-11)
Source: https://80000hours.org/articles/ml-engineering-career-transition-guide/
Authors: Catherine Olsson and the 80,000 Hours team
Date: Published November 2018; update note visible Feb 2022
Fetch-status: excerpted from HTML via browser.
Use: source-graph evidence from Spinning Up's "Other Resources" section; useful for research-engineer skill acquisition, less central to research taste.
## Why this matters for agents
This source is more about becoming useful on ML research teams than choosing research ideas. Its most relevant claim is that implementing and debugging foundational algorithms is a high-value learning path, with easy environments, metrics, and reference-code scrutiny.
## Quotes
> Technical AI safety is a multifaceted area of research, with many sub-questions in areas such as reward learning, robustness, and interpretability.
> Not all of these questions are best tackled with abstract mathematics research; some can be approached with concrete coding experiments and machine learning (ML) prototypes.
> Once you know the 101-level basics of ML, the next thing to learn is how to implement and debug ML algorithms.
> Breadth of experience is not important here: you don't need to read all the latest papers, or master an extensive reading list. You also don't need to do novel research or come up with new algorithms.
> What you do need is to get your hands dirty implementing and debugging ML algorithms, and to build evidence for job interviews that you have some experience doing this.
> The most straightforward way to gain this experience is to choose a subfield of ML relevant to a lab you're interested in. Then read a few dozen of the subfield's key papers, and reimplement a few of the foundational algorithms that the papers are based on or reference most frequently.
> For each algorithm, they would first test on very easy environments, and then move to more difficult environments.
> Once the algorithm was partially working, they would attain higher performance by looking for remaining bugs, both by reviewing the code carefully, and by collecting metrics such as average policy entropy to perform sanity-checks, rather than just tune hyperparameters.
> Most importantly, he was able to implement and debug ML algorithms, going from math in a paper to running code.
## Source graph
This page was linked from Spinning Up's "Other Resources" section. It points to Josh Achiam's Key Papers in Deep RL list and a Daniel Ziegler self-study path. It is useful background for training agents to value implementation and debugging practice, but probably secondary for a dedicated research-taste skill.
@@ -0,0 +1,69 @@
# Spinning Up as a Deep RL Researcher - source graph and research-taste excerpts
Primary source: https://spinningup.openai.com/en/latest/spinningup/spinningup.html
Author: Joshua Achiam, OpenAI
Date: October 13th, 2018
Related local cache: docs/evidence/spinningup_researcher.md
Fetch-status: excerpted from Spinning Up HTML via browser; source graph cross-checked against existing local evidence files where present.
Use: RL research-process evidence, especially for source graph, fair comparisons, seeds, preregistration, and ablations.
## Why this matters for agents
Spinning Up is not just an RL textbook page. Its researcher page is a compact research apprenticeship guide. It points to the same battle-tested debugging and reproducibility references already cached in this repo, then adds project selection and rigorous comparison advice.
## Quotes
> If you're an aspiring deep RL researcher, you've probably heard all kinds of things about deep RL by this point. You know that it's hard and it doesn't always work. That even when you're following a recipe, reproducibility is a challenge. And that if you're starting from scratch, the learning curve is incredibly steep.
> In particular, this will outline a useful curriculum for increasing raw knowledge, while interleaving it with the odds and ends that lead to better research.
> Write your own implementations. You should implement as many of the core deep RL algorithms from scratch as you can, with the aim of writing the shortest correct implementation of each.
> Simplicity is critical. You should organize your efforts so that you implement the simplest algorithms first, and only gradually introduce complexity.
> Don't overfit to existing implementations either. Study existing implementations for inspiration, but be careful not to overfit to the engineering details of those implementations.
> Iterate fast in simple environments. To debug your implementations, try them with simple environments where learning should happen quickly.
> Your ideal experiment turnaround-time at the debug stage is <5 minutes (on your local machine) or slightly longer but not much.
> Start by exploring the literature to become aware of topics in the field.
> Use the related work section and citations to find closely-related papers and do a deep dive in the literature. You'll start to figure out where the unsolved problems are and where you can make an impact.
> There are a many different ways to start thinking about ideas for projects, and the frame you choose influences how the project might evolve and what risks it will face.
> Avoid reinventing the wheel. When you come up with a good idea that you want to start testing, that's great! But while you're still in the early stages with it, do the most thorough check you can to make sure it hasn't already been done.
> Under no circumstances handicap the baseline!
> Beware of random seeds making things look stronger or weaker than they really are, so run everything for many random seeds (at least 3, but if you want to be thorough, do 10 or more).
> This is to enforce a weak form of preregistration: you use the tuning stage to come up with your hypotheses, and you use the final runs to come up with your conclusions.
> Check each claim separately. Another critical aspect of doing research is to run an ablation analysis.
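
The seed advice above is easy to enforce mechanically. This is a minimal sketch under stated assumptions: `summarize` and `gap_in_standard_errors` are illustrative names, the scores are one final metric per seed, and dividing the gap by a combined standard error is a rough heuristic chosen for this sketch rather than anything Spinning Up prescribes.

```python
import math
import statistics

MIN_SEEDS = 3  # floor quoted above; the same passage suggests 10 or more


def summarize(scores):
    """Return (mean, standard error) for one method's per-seed scores."""
    if len(scores) < MIN_SEEDS:
        raise ValueError(f"need at least {MIN_SEEDS} seeds, got {len(scores)}")
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


def gap_in_standard_errors(baseline, treatment):
    """Express the treatment-minus-baseline gap in units of seed noise."""
    b_mean, b_sem = summarize(baseline)
    t_mean, t_sem = summarize(treatment)
    gap = t_mean - b_mean
    combined = math.sqrt(b_sem**2 + t_sem**2)
    if combined == 0:
        # No seed-to-seed variation at all: any nonzero gap is unbounded.
        return 0.0 if gap == 0 else math.copysign(math.inf, gap)
    return gap / combined


# Made-up scores: a one-point gap that is barely larger than seed noise.
ratio = gap_in_standard_errors([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

A ratio close to 1 means the apparent improvement is about the size of the run-to-run noise, which is the situation the quote warns about.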
## Source graph
Spinning Up intro references, with local status:
- Alex Irpan, Deep Reinforcement Learning Doesn't Work Yet: https://www.alexirpan.com/2018/02/14/rl-hard.html. Local cache: docs/evidence/alexirpan_rl_hard.md.
- Islam et al., Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control: https://arxiv.org/abs/1708.04133. Not separately cached; discussed/cited inside Henderson local cache.
- Henderson et al., Deep Reinforcement Learning that Matters: https://arxiv.org/abs/1709.06560. Local cache: docs/evidence/henderson_2018_deep_rl_matters.md.
- Matthew Rahtz, Lessons Learned Reproducing a Deep RL Paper: http://amid.fish/reproducing-deep-rl. Local cache: docs/evidence/amid_fish_reproducing_deep_rl.md.
- David Silver UCL RL course: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html. Not cached.
- Berkeley Deep RL course: http://rll.berkeley.edu/deeprlcourse/. Not cached.
- Deep RL Bootcamp lectures: https://sites.google.com/view/deep-rl-bootcamp/lectures. Reddit index cache: docs/evidence/reddit_deeprl_bootcamp_2017_75m5vd.md.
- John Schulman, Nuts and Bolts of Deep RL: http://joschu.net/docs/nuts-and-bolts.pdf. Local cache: docs/evidence/joschu_nuts_and_bolts.md.
- Tim Rocktäschel et al., Advice for Short-term Machine Learning Research Projects: https://rockt.github.io/2018/08/29/msc-advice.html. Not cached.
- Catherine Olsson / 80,000 Hours, ML Engineering for AI Safety & Robustness: https://80000hours.org/articles/ml-engineering-career-transition-guide/. Not cached.
## Likely follow-up cache candidates
Priority 1:
- Chris Olah, research taste: short and directly named by Nanda.
- Jacob Steinhardt, Research as a Stochastic Decision Process: directly named by Nanda for prioritization.
- Tim Rocktäschel et al., short-term ML research projects: directly named by Spinning Up for research growth.
Priority 2:
- David Silver/UCL, Berkeley Deep RL, Deep RL Bootcamp: curriculum material, less directly research-taste except via RL mastery.
- Catherine Olsson/80k: career/field-entry framing; useful if the skill expands beyond project-level research taste.
@@ -0,0 +1,43 @@
# Research as a Stochastic Decision Process - Jacob Steinhardt
Source: https://cs.stanford.edu/~jsteinhardt/ResearchasaStochasticDecisionProcess.html
Author: Jacob Steinhardt
Date: not visible in fetched HTML
Fetch-status: excerpted from HTML via browser.
Use: research-prioritization evidence; cited by Nanda's Key Mindsets post.
## Why this matters for agents
Steinhardt gives a crisp formal-ish rule for research prioritization: reduce uncertainty as fast as possible. This is useful for agents deciding which experiment, baseline, prototype, or sanity check to run first.
## Quotes
> Below I analyze how to approach a project that has many somewhat independent sources of uncertainty (we can often think of these as multiple "steps" or "parts" that each have some probability of success).
> We will eventually see that a good principle is to "reduce uncertainty at the fastest possible rate".
> This reveals that harder tasks should not necessarily be prioritized. Rather, we should prioritize tasks that are more likely to fail (so that we remove the risk of them failing) but also tasks that take less time.
> Do the components in order from most informative per unit time to least informative per unit time.
> De-risk all components (to the extent feasible), then execute.
> Specifically, for each task we want a cheap way to obtain high confidence about whether that task will be feasible. This is called "de-risking".
> We are often either in "de-risking mode" (determining if the problem is infeasible as quickly as possible) or "execution mode" (assuming the problem is feasible and trying to solve it quickly).
> The counterpart to ceilings are baselines--simple or off-the-shelf methods that give a quick lower bound on achievable accuracy.
> Together with ceilings, they delineate a range of possible performance, which helps us interpret our core results.
> I often think about possible approaches to a problem as an exponentially branching search tree.
> Whenever something doesn't work, I ask why it didn't work. My goal is to avoid trying similar things that will fail for the same reason.
> Compared to other people I know, I try harder and earlier to show that my ideas can't work to solve a problem.
> We often try easier tasks first, when instead we should try the most informative tasks first.
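
The ordering rule in the quotes above can be sketched in a few lines. This assumes the simplest version of the setting: tasks are independent, the project is abandoned at the first failed task, and each task has a known success probability and duration. The `Task` fields, the function names, and the numbers are all illustrative assumptions, not taken from the post.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Task:
    name: str
    p_success: float  # assumed probability the task works out
    hours: float      # assumed time to find out (must be > 0)


def expected_hours(order):
    """Expected time spent when the project stops at the first failure."""
    total, p_reach = 0.0, 1.0
    for task in order:
        total += p_reach * task.hours  # only paid if every earlier task succeeded
        p_reach *= task.p_success
    return total


def derisk_order(tasks):
    """Sort by failure probability per hour, highest first."""
    return sorted(tasks, key=lambda t: (1.0 - t.p_success) / t.hours, reverse=True)


tasks = [
    Task("easy_long", p_success=0.9, hours=10),
    Task("risky_short", p_success=0.5, hours=1),
    Task("risky_long", p_success=0.5, hours=8),
]
plan = derisk_order(tasks)
best_by_brute_force = min(permutations(tasks), key=expected_hours)
```

Under these assumptions the greedy order matches the brute-force optimum: the cheap, likely-to-fail task goes first, and the safe but slow task goes last.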
## Source graph
This is a standalone blog post. It links to concepts like Poisson arrival processes, but the skill-relevant content is the prioritization/de-risking frame above.
+223
View File
@@ -0,0 +1,223 @@
# Research taste and research-process folklore
Appendix to the [ML Debugging skill](../SKILL.md).
Use this when the question is closer to "what should we try next?" than "why did this crash?" The quotes do most of the work here. The editorial is just routing.
## Patience and process
Research taste is learned under long, noisy feedback loops. This is the quote I would put nearest the main skill.
> Research taste isn't magic. It's a complex set of intuitions and frameworks built incrementally through experience, reflection, and learning from others. It governs the crucial, often implicit, decisions that shape a research project's success. Because the feedback loops for high-level strategic taste are long and noisy, don't expect to master it quickly. It's perfectly normal, and indeed expected, to rely heavily on external guidance (like mentors or established research directions) early in your career. Focus first on mastering the skills with shorter feedback loops - coding, running experiments, analyzing data, clearly communicating simple results. By actively engaging in research, deliberately reflecting on your decisions and their outcomes, and strategically leveraging the experiences of others, you can accelerate the development of your own research taste. Be patient with the process, especially the long-game aspects like problem selection. Trust that by doing the work and learning effectively from it, your intuition will improve over time.[^nanda-taste]
Olah gives the matching training-data frame:
> One of the most important aspects of growing as a researcher is developing research taste -- roughly, the ability to chose good problems to work on. I think the fundamental issue is that actually testing whether a research idea you come up with is good is very expensive. Often it takes months, so you only really get a few pieces of feedback on your taste every year. Many of the following exercises are really strategies for getting (proxy) feedback on more research ideas faster.[^olah-taste]
## What taste covers
The useful move is not "research taste = picking good projects". Nanda uses it for the hard judgment calls throughout a project.
> What is research taste? As I define it, research taste is far broader than just picking the right problem at the outset. Research is full of key decisions that will affect the future of the project, without an obvious way to find the right answer: from choosing the research problem itself, to identifying which anomalies are and are not worth exploring, distinguishing an experiment that will be compelling from one that'll have inconclusive results, etc. I think of taste as the set of intuitions and good judgment that guide a researcher's decisions throughout the research process, any time an ambiguous or open-ended decision like this arises. This can just be gut feeling, but also having conceptual frameworks you reason through, having novel ideas spark in your mind, etc.[^nanda-taste]
And the stage model:
> I see research as breaking down into a few stages:
> 1. Ideation - Choose a problem/domain to focus on
> 2. Exploration - Gain Surface area
>    1. North star: Gain information
> 3. Understanding - Test Hypotheses
>    1. North star: Convince yourself of a key hypothesis
> 4. Distillation - Compress, Refine, Communicate
>    1. North star: Compress your research findings into concise, rigorous truth that you can communicate to the world[^nanda-explore]
## Key mindsets
This is agent-steering material: truth-seeking, prioritization, moving fast, and acting under uncertainty.
> I think the most important mindsets are:
> * Truth-seeking: By default, many research insights will be false - finding truth is hard. It's not enough to just know this, you must put in active effort to be skeptical and resist bias, lest you risk your research being worthless.
> * Prioritisation: You have finite time, and a lot of possible actions. Your project will live or die according to whether you pick good ones.
> * Moving fast: You have finite time and a lot to do. This doesn't just mean “push yourself to go faster” - there's a lot of ways to eliminate inefficiency without sacrificing quality.[^nanda-key]
> This means that you must be putting in constant active effort into ensuring your results are robust. This must be integrated into part of your research process - if you're not, then there's a good chance your results are BS. The standard hypothesis testing framework can be misleading here, because it has an implicit frame of being able to list all the hypotheses. But actually, most of your probability mass should normally be on “something I haven't thought of yet”. Here the Bayesian frame is often helpful. It's generally overkill to put explicit numbers on everything, but it reminds me to ask the question “was this observation more likely under hypothesis A or B”, not just whether it was predicted by my favourite hypothesis.[^nanda-key]
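
A minimal sketch of the question in the last two sentences of that quote, assuming made-up priors and likelihoods. The hypothesis names, the `unknown` bucket, every number, and the helper `posterior` are illustrative assumptions, not values or code from the source.

```python
def posterior(priors, likelihoods):
    """Bayes update over named hypotheses (dicts keyed by hypothesis name)."""
    unnormalised = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalised.values())
    return {h: v / total for h, v in unnormalised.items()}


# Hypothetical numbers: half the prior mass sits on "something I haven't thought of yet".
priors = {"A": 0.3, "B": 0.2, "unknown": 0.5}
# P(observation | hypothesis): A predicts the observation, but so does B.
likelihoods = {"A": 0.8, "B": 0.7, "unknown": 0.4}

post = posterior(priors, likelihoods)
ratio_a_vs_b = likelihoods["A"] / likelihoods["B"]

for name, p in post.items():
    print(f"{name}: prior {priors[name]:.2f} -> posterior {p:.2f}")
print(f"likelihood ratio A vs B: {ratio_a_vs_b:.2f}")
```

With these numbers the observation is well predicted by A, but almost as well by B, so the likelihood ratio stays close to 1 and the observation barely separates the two even though A's posterior rises.
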
## Prioritisation and speed
Nanda's prioritisation advice is close to the GSD/UAT habit: write the goal, check whether the work is buying that goal, and separate choosing from executing.
> Ultimately, time is scarce. The space of possible actions you can take when doing research is wide and open ended, and some are far more valuable than others. The difference between a failed and a great research project is often prioritisation skill. Improved prioritisation is one of the key sources of value I add as a mentor. Fundamentally, good prioritisation is about having a clear goal (north star) in mind. You need good judgement about how well different actions achieve this goal. You need to actually make the time to think about how well actions achieve this goal![^nanda-draft]
> Being great at prioritisation is pretty difficult, and requires good research taste, which will take a lot of time to develop. But there's often basic mistakes and low-hanging fruit to improve, if you just try. The first step is just making time to stop and ask yourself “do I endorse what I'm doing, and could I be doing something better?” This advice may seem obvious, but is deceptively hard to put into practice! You need regular prompts. Often it's very easy to think of a better idea, but by default nothing prompts you to think. I like to explicitly write goals down and regularly check in that they're being achieved - it sounds obvious, but you would be shocked at how effective it is to ask people if they're doing the best thing for the project goals.[^nanda-draft]
> I recommend actually writing a plan, and estimate how long each step will take, at least for the current research stage you're in. You don't need to take it very seriously, and you'll totally deviate a ton. But it forces you to think through the project, notice uncertainties you could ask someone about, question if parts are really necessary to achieve your goals.[^nanda-draft]
> Prioritising and executing are different mental modes and should not be done simultaneously. Keep them separate, and make time to regularly reflect, and time to lock-in and execute on a plan without stressing about if it's the best plan. Concrete advice: Work to a schedule where you regularly (ideally at least once a day, and with extended reflection at least once a week), zoom out and check that what you're doing is your highest priority. E.g. work in pomodoros. Having a weekly review can be incredibly useful - where you zoom out and check in on what's going on, any current issues, etc.[^nanda-draft]
> Tight feedback loops are crucial: A key thing to track when doing research is your feedback loops. Definition: A feedback loop is the process from having an experiment idea and to results. Tight feedback loops are when the time taken is short. It will make an enormous difference to your research velocity if you can get your feedback loops as tight as possible, and this is a big priority.[^nanda-draft]
> A corollary of this is that you should (often) do fast experiments first. It is far better to do a quick and dirty experiment to get some preliminary signs of life than an extremely long and expensive experiment that will produce conclusive data but only after weeks of work. Realistically you should be prioritising by information gain per unit time. This is especially important in exploration where it's hard to have a clear sense of which experiments are the most useful while estimating their tractability is pretty easy.[^nanda-draft]
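
One way to make "information gain per unit time" concrete is a rough ranking like the sketch below. The `Experiment` fields, the subjective `expected_info` score, and the example entries are all hypothetical; the quote gives the heuristic, not a formula.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    name: str
    expected_info: float  # subjective 0-10 guess at how much this would teach you
    hours: float          # rough estimate of wall-clock time to a result


def rank_by_info_rate(experiments):
    """Sort experiments by estimated information per hour, best first."""
    return sorted(experiments, key=lambda e: e.expected_info / e.hours, reverse=True)


# Made-up candidates for illustration.
candidates = [
    Experiment("full sweep with conclusive stats", expected_info=8.0, hours=40.0),
    Experiment("quick and dirty probe on 5 prompts", expected_info=2.0, hours=0.5),
    Experiment("medium ablation on a small model", expected_info=4.0, hours=4.0),
]

for e in rank_by_info_rate(candidates):
    print(f"{e.expected_info / e.hours:5.2f} info/hour  {e.name}")
```

The scores are guesses, so the ranking is only as good as the estimates; the point is that the slow, conclusive experiment can rank last even though it has the highest raw information score.
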
> Fail fast. One of the largest time sinks possible is investing weeks to months of effort into a failed research direction. Thus, a key question to ask yourself is: if this direction is doomed, how could I discover this as fast as humanly possible? I often try to think through what kind of confident predictions a hypothesis I care about makes in the understanding stage, or what fundamental assumptions make me think my domain is interesting at all in the exploration stage, and then think of the quickest and dirtiest experiments I can to test these. It's often much better to have several quick and dirty experiments to attack different angles where you could fail fast than to put a lot of effort into one.[^nanda-draft]
Irpan's "signs of life" is the positive read on the same cheap experiment - the early signal that tells you the direction is worth more time:
> Not all hyperparameters perform well, but with all the empirical tricks discovered over the years, many hyperparams will show signs of life during training. These signs of life are super important, because they tell you that you're on the right track, you're doing something reasonable, and it's worth investing more time.[^irpan]
> Ultimately, you just need to accept on an emotional level that you don't get to know the “right” answer for what to do next - in practice, there's no such thing as the right answer. The ideal is to strive to carefully evaluate the extremely noisy evidence, make a best guess for what to do next, and act on it, while also being self-aware enough to notice if it no longer seems the best action. This is a hard balance to achieve, but super useful if you can do it. Especially when you're starting out, this can be very low stakes: the value of anything you do is dominated by the learning value![^nanda-draft]
## Ideation
This is the most mentor-dependent stage. The quote is useful because it gives permission to borrow taste without pretending that borrowed taste is yours.
> You can't do research without a question or a domain. Ideation is about finding fertile ground. It might be quick, eg deferring to a mentor, or it might involve significant exploration itself, with explorations of many unpromising domains before you settle on one. Find a Domain: You need something concrete to study. This could be a specific model (Pythia 2.8B), a specific phenomenon (grokking, factual recall), a specific capability (how models do addition), or a specific technique (improving SAEs). Ideation ends when you have a clear enough question or domain that you can start generating concrete experiments to run[^nanda-draft]
> Make or break: Ideation is very important - if you choose a problem that's not an interesting question or doomed then it doesn't matter what else you do, the project is sunk. One of the most common reasons I don't read an interpretability paper is that I think it's answering the wrong question. High-level research taste: One facet of the general notion of research taste is noticing which problems are promising and interesting.[^nanda-draft]
> Leverage Mentors: Especially early on, it's fine to let someone else do the work here, i.e. have a mentor recommend a problem. If you don't have a mentor, try a natural extension of an existing paper you like, or pick a problem from a vetted open problems list. This is basically borrowing someone else's research taste, and IMO is one of the most valuable things I do for my mentees.[^nanda-draft]
## Exploration
Exploration should feel different from proof. It is for gaining surface area.
> Goal: Gain understanding of the problem/domain, start to identify and crystallise interesting hypotheses. Your north star is information gained per unit time/effort. Crucially, Exploration is not about testing a specific hypothesis. Exploration is about gaining enough of an understanding of a domain that you know what the interesting hypotheses even are.[^nanda-draft]
> It's OK to be confused: It's totally normal to spend a large fraction of this stage feeling pretty confused about what's going on. This is fine and does not mean that you're failing! The key question is whether you feel like you are learning things and becoming less confused. Reach for a tool that might show you something interesting, and can be employed fast. Don't hold yourself to the standard of tools that you're confident are good. Notice Weirdness: This is critical. Pay close attention to results that are surprising, counter-intuitive, inconsistent, or just feel off. Ask "Why?" relentlessly.[^nanda-draft]
This matches the older debugging folklore about confusion and anomalies:
> It was only by following that confusion and realising that taking the difference between frames zeroed out the background that gave the hint of a problem with normalization. I'm not entirely sure how to make one's mind do more of this, but my best guesses at the moment are:
> * Learn to recognise what confusion feels like.[^rahtz]
More exploration mechanics:
> Gaining surface area: A key concept here is surface area: knowledge and intuition about the domain/problem. Most of the way I prioritise is by asking myself what decisions would maximise my surface area on a problem/domain. I want to put myself in a position where I can notice cool patterns and phenomena and spark hypotheses about what's going on. This is a different mindset from what gains me rigorous evidence. Qualitative experiments, cherry-picked case studies, low sample size quick and dirty experiments, etc can all be high value for gaining surface area. While often the best way to test a specific hypothesis is with a narrow quantitative test with a large sample size, which teaches me little if I was asking the wrong questions.[^nanda-draft]
> Productive flailing: Use simple mech interp techniques wherever they seem applicable and look for patterns - you don't need to have a plan in mind, just try lots of stuff quickly and see what sticks. Get your hands dirty with the model and data, so you build a mental bank of interesting phenomena, so you can notice connections. Reach for a tool that might show you something interesting, and can be employed fast. Don't hold yourself to the standard of tools that you're confident are good. Notice Weirdness: This is critical. Pay close attention to results that are surprising, counter-intuitive, inconsistent, or just feel off. Ask "Why?" relentlessly. These anomalies often point towards deeper insights.[^nanda-draft]
> Micro-Hypotheses: Generate small, speculative hypotheses ("Maybe head L5H6 is detecting syntax?") and devise quick ways to test them. Don't get attached; the goal is quick learning, not proof. The process of investigating this will often teach you something interesting. The important thing is to generate ideas at all, not to find the perfect ones. If you can test them fast, then it's much better to come up with 10 ideas of which 1 is true, rather than 1 idea with a 50% chance of being true. The Understanding phase is where we start being more discriminating.[^nanda-draft]
> Research Log: Keep a detailed log (daily or per session). Note down: goals for the session, what you tried, observations (especially weird ones!), links to code/plots (eg to notebooks or git commits or saved plots), brief thoughts/interpretations, ideas for next steps. This fights confusion and helps track progress. Highlights Doc: Separately, keep a running document of your most interesting findings, key graphs, and solidified insights. This helps distill progress and is useful for sharing/communicating. A decent metric of progress is “did I add anything to my highlights doc recently”[^nanda-draft]
> Create Fast Feedback Loops! This is a major benefit of mech interp - in some fields you can't get any data for weeks or months, in mech interp it can be seconds or minutes. Optimize for quick iterations. If you have slow feedback loops fixing this is high priority. Use the smallest model that can do your task. Favour cheap, partially-trusted metrics.[^nanda-draft]
> Analysis Paralysis: Getting stuck trying to understand everything perfectly before running code. Solution: Bias towards action, then reflect. Keep experiments simple. It can help to set a rule for yourself like, if I've spent more than 4 hours without running any code, I should just do a quick experiment.[^nanda-draft]
> When to go back to problem selection? Sometimes this just isn't very promising and you should go back to choosing a problem. When to do this is a complex question, but a good heuristic is when things seem to be messy and you've tried a bunch of things to gain surface area but not found interesting structure or hypotheses to investigate further. When to move on to understanding? Once you have enough understanding of the problem to have identified one/a few hypotheses that seem plausible and interesting, you can move on to understanding them in more detail. Note that, often, most of the work of the research project is identifying what the correct hypotheses are! This typically isn't written up in papers, which is a shame, and gives quite a mistaken impression IMO[^nanda-draft]
## Think more, experiment less
This is from Rahtz and belongs in the main skill too.
> Switching from experimenting a lot and thinking a little to experimenting a little and thinking a lot was a key turnaround in productivity. When debugging with long iteration times, you really need to pour time into the hypothesis-forming step - thinking about what all the possibilities are, how likely they seem on their own, and how likely they seem in light of everything you've seen so far. Spend as much time as you need, even if it takes 30 minutes, or an hour. Reserve experiments for once you've fleshed out the hypothesis space as thoroughly as possible and know which pieces of evidence would allow you to best distinguish between the different possibilities. It's especially important to be deliberate about this if you're working on something as a side project.[^rahtz]
## Understanding
Understanding is where hypotheses become objects to test.
> Design High Information Experiments: Design experiments specifically to differentiate between your main hypothesis and the most plausible alternatives. Ask: "What prediction does H1 make that H2 contradicts?" Think like a Bayesian: what evidence is most likely under H1 relative to H2? Avoid the mistake of looking for evidence predicted by H1 that's also predicted by a bunch of other things! Use appropriate baselines - e.g. it's not enough to show that your technique helps to lower a model's performance on harmful tasks. Does a random vector do worse?[^nanda-draft]
> Actively Seek Alternatives: Explicitly brainstorm other ways your observations could be explained. What are the simplest explanations? What known circuits or phenomena could be involved? What would a strong skeptic argue? Mentorship Role: Aggressively red teaming hypotheses and experimental designs. Suggesting crucial alternative hypotheses or experiments. Helping interpret confusing results. Conveying conceptual frameworks to make sense of findings. Pushing for higher standards of rigor and clarity.[^nanda-draft]
Steinhardt's de-risking frame is the same habit in a different language:
> This reveals that harder tasks should not necessarily be prioritized. Rather, we should prioritize tasks that are more likely to fail (so that we remove the risk of them failing) but also tasks that take less time. Do the components in order from most informative per unit time to least informative per unit time. De-risk all components (to the extent feasible), then execute.[^steinhardt]
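
A small sketch of why "likely to fail" and "takes less time" both matter for ordering, under a simplifying assumption that is mine rather than a quote from the essay: the project needs every component to succeed, components are independent, and you stop at the first failure. The component list and the ratio used for sorting (failure probability divided by hours) are illustrative assumptions; the brute-force check confirms the sorted order for this example.

```python
from itertools import permutations


def expected_time(order):
    """Expected hours spent when components run in sequence and work stops at the first failure."""
    total, p_reach = 0.0, 1.0
    for hours, p_success in order:
        total += p_reach * hours   # this component is only attempted if all earlier ones succeeded
        p_reach *= p_success
    return total


# Hypothetical components as (hours, probability of success).
components = [(10.0, 0.9), (2.0, 0.5), (5.0, 0.6)]

# Greedy rule: highest failure probability per hour first.
greedy = sorted(components, key=lambda c: (1 - c[1]) / c[0], reverse=True)
best = min(permutations(components), key=expected_time)

print("as listed:", expected_time(components))
print("greedy:   ", expected_time(greedy))
print("best:     ", expected_time(best))
```

Here the long, safe component goes last: if the project is doomed, the cheap, risky components reveal that before the expensive work is done.
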
More understanding mechanics:
> Execute Carefully & Rigorously: Now is the time for more careful experiments. Consider controls, potential confounds, statistical significance (if applicable), and robustness checks. Increase sample sizes from Exploration (though even N=5 case studies can be much better than N=1). Document methods clearly. Try harder to avoid cherry-picking here - sample random data points rather than just picking the most convenient ones. Use appropriate baselines - e.g. it's not enough to show that your technique helps to lower a model's performance on harmful tasks. Does a random vector do worse?[^nanda-draft]
> Types of evidence: I think of experiments as falling into four categories, it's worth tracking which one: Strong evidence: This will give a strong update for or against the hypothesis (the best kind!) Big if true: Experiments that probably fail, but are a big deal for our hypothesis if they work.[^nanda-draft]
> Sanity checks: Experiments that probably work but are a big deal against our hypothesis if they fail. Weak evidence: This will give a weak update for or against the hypothesis (or maybe just be inconclusive). Poor Baselines/Controls: Comparing results against a weak or irrelevant null hypothesis, or failing to isolate the variable of interest.[^nanda-draft]
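
The four categories can be read as a statement about how far a pass or a fail would move you. The sketch below is one possible formalisation, not Nanda's: it assumes you can guess the chance the experiment "works" if the hypothesis is true and if it is false, measures each outcome's update as a log likelihood ratio, and uses an arbitrary threshold of log 3 for "big". The function name and the threshold are hypothetical.

```python
import math


def classify_experiment(p_pass_if_true, p_pass_if_false, big=math.log(3)):
    """Label an experiment by how large the update is on a pass and on a fail.

    Both probabilities must be strictly between 0 and 1.
    """
    update_on_pass = abs(math.log(p_pass_if_true / p_pass_if_false))
    update_on_fail = abs(math.log((1 - p_pass_if_true) / (1 - p_pass_if_false)))
    if update_on_pass >= big and update_on_fail >= big:
        return "strong evidence"
    if update_on_pass >= big:
        return "big if true"    # a pass is very informative, a fail tells you little
    if update_on_fail >= big:
        return "sanity check"   # a fail is very informative, a pass tells you little
    return "weak evidence"


# Made-up probabilities for illustration.
print(classify_experiment(0.90, 0.10))
print(classify_experiment(0.30, 0.02))
print(classify_experiment(0.98, 0.70))
print(classify_experiment(0.60, 0.50))
```

The labels only roughly track the quoted definitions, since "probably fail" and "probably work" are implied by the probabilities rather than checked directly.
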
> Insufficient Skepticism: Missing simple alternative explanations, methodological flaws, or bugs. Explicitly list alternatives. Get others (especially mentors) to red team your plans before you run them. Actively try to break your hypothesis. Ask "What observation would make me abandon this?"[^nanda-draft]
> Be Able to Discard False Hypotheses: Sometimes you'll have a hypothesis that you're really excited about, and it turns out to be false. This is OK! This is all just part of science. Move on and try new hypotheses, or write up your negative results if they're interesting enough! Be exploratory: You should still be partially in explore mode in this stage - often your conception of the hypothesis, or the right kinds of experiment, will shift. This is an important part of the research process, not a sign that you screwed anything up! When to move on to distillation? When you are fairly convinced of some hypotheses, and think they're interesting enough to be worth communicating.[^nanda-draft]
## Rigorous comparisons
Spinning Up is RL-framed but generally useful for research agents doing method comparisons.
> Set up fair comparisons. If you implement your baseline from scratch [...] it's important to spend as much time tuning your baseline as you spend tuning your own algorithm. This will make sure that comparisons are fair. Also, do your best to hold "all else equal" [...]. Under no circumstances handicap the baseline! Remove stochasticity as a confounder. Beware of random seeds making things look stronger or weaker than they really are, so run everything for many random seeds (at least 3, but if you want to be thorough, do 10 or more). Run high-integrity experiments. Don't just take the results from the best or most interesting runs to use in your paper.[^spinningup]
Schulman and Henderson are the harder-edged RL versions:
> Always Be Ablating
> - Different tricks may substitute
> - Especially whitening
> - "Regularize" to favor simplicity in algorithm design space
> - As usual, simplicity → generalization[^schulman]
Irpan gives the reason seeds matter at all: if pure randomness alone produces this much run-to-run variance, a real code difference could plausibly swing your result at least as much. Irpan cites the Henderson study quoted below.
> Instability to random seed is like a canary in a coal mine. If pure randomness is enough to lead to this much variance between runs, imagine how much an actual difference in the code could make.[^irpan]
> Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful. In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible.[^henderson]
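
A minimal sketch of reporting across seeds instead of picking the best run. The percentile bootstrap on the difference of means is my choice of technique for illustration, not something the quotes above prescribe, and the scores, function names, and defaults are all made up.

```python
import random
import statistics


def summarize(scores):
    """Mean and sample standard deviation over seeds (needs at least 2 seeds)."""
    return statistics.mean(scores), statistics.stdev(scores)


def bootstrap_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(a) - mean(b), resampling seeds with replacement."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resampled_a = [rng.choice(a) for _ in a]
        resampled_b = [rng.choice(b) for _ in b]
        diffs.append(statistics.mean(resampled_a) - statistics.mean(resampled_b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical final returns, one per random seed.
method = [10.0, 11.0, 12.0, 10.5, 11.5]
baseline = [1.0, 2.0, 3.0, 1.5, 2.5]

print("method   mean/std:", summarize(method))
print("baseline mean/std:", summarize(baseline))
print("95% interval for the difference:", bootstrap_diff_ci(method, baseline))
```

If the interval for the difference comfortably excludes zero across all seeds, the comparison is less likely to be a seed artefact; with only a handful of seeds the interval is itself rough.
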
## Distillation and paper writing
This belongs in the appendix more than the main skill, but it is the right source for "when do I write this up?"
> The essence of an ideal paper is the narrative: a short, rigorous and evidence-based technical story you tell, with a takeaway the readers care about. The first step is to compress your research into these claims. Experimental Evidence: This is absolutely crucial to get right and aggressively red-team, it's how you resist the temptation of elegant but false narratives.[^nanda-paper]
> At its core, a paper should present a narrative of one to three specific concrete claims that you believe to be true, that build to some useful takeaway(s). Readers will rarely take away more than a few sentences of content. Choose those sentences carefully. Generally, stronger statements make for more interesting papers, but require higher standards of evidence - resist the temptation to overclaim for clicks![^nanda-paper]
> The Guiding Question for Evidence: Ultimately, the question to ask about your evidence is: "Should this update a reader's beliefs about my claims?" Reproducibility & Publishing code: Rigour can be in the eye of the beholder: if readers cannot understand or verify it for themselves, it's far harder to consider it rigorous. A key challenge in paper writing is the illusion of transparency - you have spent months steeped in the context of this research project.[^nanda-paper]
From the shared draft:
> Goal: Distill all the messy insights from your research into concise, rigorous truth to communicate it to the world. Compress what you've learned into some key claims, something you can convey via a short series of bullet points. Refine the evidence that convinced you into clear, rigorous, legible experiments that provide strong evidence for the key claims[^nanda-draft]
> Compress the Core Narrative: What are the most important takeaways? What's the simplest, truest story that explains your key findings and answers your initial research question? What have you learned? A useful framing: “how would you explain your research to a friend?” or “how would you compress your findings into 150 words or less?” or “how would you give a lightning talk on this?”. You want something that's a short series of bullet points. It often helps to discuss your research with a range of people at this point - what are they interested in? What confuses them? What points do you keep emphasising and coming back to?[^nanda-draft]
> Refine your evidence. North star: How can I build an evidence base that makes my key claims obviously correct? Research is messy, so “obviously correct” is a high bar, but useful to aspire to IMO. Select Strongest Evidence: To start, choose the clearest, most convincing experiments, visualizations, and analyses that directly support your main claims. Ask: "What evidence best distinguishes my claims from alternatives? What would convince a knowledgeable skeptic?"[^nanda-draft]
> Red team your existing evidence: Then, red team this strongest evidence - if you were wrong, what's the flaw in your case? What objections would an intelligent external researcher raise? If you presented this to a specific mentor what feedback do you think they'd give? This is typically a mix of conceptual flaws, e.g. there are multiple hypotheses equally consistent with the data, and methodological laziness - poor baselines, low sample size, poor randomisation/cherry-picking, etc. Check Robustness: How general are the findings? Do they hold across different models/datasets/prompts (where applicable and feasible)? Sanity-check against known results.[^nanda-draft]
> Acknowledge limitations: Inevitably, your results will have some limitations - edge cases, ways your evidence could be wrong, etc. I strongly encourage you to discuss these clearly and prominently in a write-up, even if you don't have good counters to it. This is a key part of doing good science. Pragmatically, when I read a paper, I'll generally notice at least some limitations anyway, and judge a paper if it ignores them and respect one that discusses them clearly even if it weakens the narrative - so if you're optimising for experienced researchers liking your work, acknowledging limitations is generally in your interests. Your goal is to inform not persuade[^nanda-draft]
> When to go back to Understanding? If you discover that your narrative no longer seems true/well supported, you should go back to Understanding. This is fine: It's totally natural that in the course of trying to refine your evidence and case, you discover you were wrong about something. Sometimes results from a few cherry-picked prompts don't generalize. This is the point of refining. Switch mode: If you discover that you no longer think your list of key claims is true, then you should return to understanding or possibly even exploration.[^nanda-draft]
## Agent habit
Minimal loop:
1. Name the stage.
2. Quote the north star for that stage.
3. Pick the action with the best information per unit time.
4. Say what would change your mind.
5. Preserve proof in a log, plot, table, commit, or source quote.
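
The five steps can be kept honest with a tiny log record, sketched below. The `LoopEntry` class, its field names, and the example values are hypothetical; nothing in the sources prescribes this structure.

```python
from dataclasses import dataclass

# Stage names follow the stage model quoted earlier in this reference.
STAGES = ("ideation", "exploration", "understanding", "distillation")


@dataclass
class LoopEntry:
    stage: str              # 1. name the stage
    north_star: str         # 2. quoted north star for that stage
    action: str             # 3. chosen action (best information per unit time)
    would_change_mind: str  # 4. observation that would change your mind
    proof: str              # 5. pointer to a log, plot, table, commit, or source quote

    def problems(self):
        """Return a list of missing or invalid fields; empty means the entry is complete."""
        issues = []
        if self.stage not in STAGES:
            issues.append(f"unknown stage: {self.stage!r}")
        for name in ("north_star", "action", "would_change_mind", "proof"):
            if not getattr(self, name).strip():
                issues.append(f"missing {name}")
        return issues


entry = LoopEntry(
    stage="exploration",
    north_star="Gain information",
    action="run the 30-minute probe before the full sweep",
    would_change_mind="no signs of life on any of 5 prompts",
    proof="",  # left empty on purpose to show the check firing
)
print(entry.problems())
```
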
## See also / source graph
Most relevant sources cached for this reference:
- Neel Nanda, research-process sequence: [explore/understand/distill](../docs/evidence/nanda_research_process_explore_understand_distill.md), [key mindsets](../docs/evidence/nanda_research_process_key_mindsets.md), [research taste](../docs/evidence/nanda_research_process_research_taste.md), [shared draft](../docs/evidence/nanda_research_process_shared_draft.md), [paper writing](../docs/evidence/nanda_highly_opinionated_ml_paper_writing.md).
- Chris Olah, [Research Taste Exercises](../docs/evidence/olah_research_taste_exercises.md): proxy feedback, mentor ratings, research intimacy.
- Jacob Steinhardt, [Research as a Stochastic Decision Process](../docs/evidence/steinhardt_research_stochastic_decision_process.md): information rate, de-risking, ceilings, baselines.
- Joshua Achiam / OpenAI Spinning Up, [research source graph](../docs/evidence/spinningup_research_source_graph.md) and [original cache](../docs/evidence/spinningup_researcher.md): RL apprenticeship, fair comparisons, seeds, preregistration, ablations.
- Matthew Rahtz, [Lessons Learned Reproducing a Deep RL Paper](../docs/evidence/amid_fish_reproducing_deep_rl.md): confusion, long iteration times, think more before expensive runs.
- Henderson et al., [Deep Reinforcement Learning that Matters](../docs/evidence/henderson_2018_deep_rl_matters.md): seed variance, implementation differences, reproducibility reporting.
- John Schulman, [Nuts and Bolts of Deep RL Research](../docs/evidence/joschu_nuts_and_bolts.md): small test problems, health indicators, multiple seeds, ablations.
- Alex Irpan, [Deep Reinforcement Learning Doesn't Work Yet](../docs/evidence/alexirpan_rl_hard.md): realistic expectations, sample inefficiency, seed variance.
Less central but useful:
- Catherine Olsson / 80,000 Hours, [ML Engineering for AI Safety & Robustness](../docs/evidence/olsson_80000hours_ml_engineering_ai_safety.md): implementation/debugging as research-engineer apprenticeship.
- Tim Rocktäschel et al., Advice for Short-term Machine Learning Research Projects: linked by Spinning Up but not cached yet.
- Islam et al., Reproducibility of Benchmarked Deep RL Tasks: linked by Spinning Up; not separately cached, but discussed in Henderson.
- David Silver UCL RL course, Berkeley Deep RL course, and Deep RL Bootcamp: curriculum links from Spinning Up; useful for background, less directly research-taste.
[^nanda-explore]: Neel Nanda, "How I Think About My Research Process: Explore, Understand, Distill" (2025-04-26) - https://www.lesswrong.com/posts/hjMy4ZxS5ogA9cTYK/how-i-think-about-my-research-process-explore-understand ([cache](../docs/evidence/nanda_research_process_explore_understand_distill.md)).
[^nanda-key]: Neel Nanda, "My Research Process: Key Mindsets - Truth-Seeking, Prioritisation, Moving Fast" (2025-04-27) - https://www.lesswrong.com/s/5GT3yoYM9gRmMEKqL/p/cbBwwm4jW6AZctymL ([cache](../docs/evidence/nanda_research_process_key_mindsets.md)).
[^nanda-taste]: Neel Nanda, "My Research Process: Understanding and Cultivating Research Taste" (2025-05-01) - https://www.lesswrong.com/posts/Ldrss6o3tiKT6NdMm/my-research-process-understanding-and-cultivating-research ([cache](../docs/evidence/nanda_research_process_research_taste.md)).
[^nanda-draft]: Neel Nanda, shared/local draft, "My Model of the Research Process" - source file `/home/wassname/Downloads/[Shared Publicly] My Model of the Research Process_ Explore, Understand, Distill.md` ([cache](../docs/evidence/nanda_research_process_shared_draft.md)).
[^nanda-paper]: Neel Nanda, "Highly Opinionated Advice on How to Write ML Papers" (2025-05-12) - https://www.lesswrong.com/posts/eJGptPbbFPZGLpjsp/highly-opinionated-advice-on-how-to-write-ml-papers ([cache](../docs/evidence/nanda_highly_opinionated_ml_paper_writing.md)).
[^olah-taste]: Chris Olah, "Research Taste Exercises" (2021-01-09) - https://colah.github.io/notes/taste/ ([cache](../docs/evidence/olah_research_taste_exercises.md)).
[^steinhardt]: Jacob Steinhardt, "Research as a Stochastic Decision Process" - https://cs.stanford.edu/~jsteinhardt/ResearchasaStochasticDecisionProcess.html ([cache](../docs/evidence/steinhardt_research_stochastic_decision_process.md)).
[^spinningup]: Joshua Achiam, "Spinning Up as a Deep RL Researcher" (OpenAI, 2018-10-13) - https://spinningup.openai.com/en/latest/spinningup/spinningup.html ([research cache](../docs/evidence/spinningup_research_source_graph.md), [debugging cache](../docs/evidence/spinningup_researcher.md)).
[^rahtz]: Matthew Rahtz, "Lessons Learned Reproducing a Deep Reinforcement Learning Paper" (2018) - http://amid.fish/reproducing-deep-rl ([cache](../docs/evidence/amid_fish_reproducing_deep_rl.md)).
[^henderson]: Henderson et al., "Deep Reinforcement Learning that Matters" (AAAI 2018) - https://arxiv.org/abs/1709.06560 ([cache](../docs/evidence/henderson_2018_deep_rl_matters.md)).
[^schulman]: John Schulman, "Nuts and Bolts of Deep RL Research" (2016) - http://joschu.net/docs/nuts-and-bolts.pdf ([cache](../docs/evidence/joschu_nuts_and_bolts.md)).
[^irpan]: Alex Irpan, "Deep Reinforcement Learning Doesn't Work Yet" (2018) - https://www.alexirpan.com/2018/02/14/rl-hard.html ([cache](../docs/evidence/alexirpan_rl_hard.md)).