ml-debug/docs/evidence/nanda_how_to_mech_interp.md
wassname 8509ec3c30 folklore: promote Spinning Up to main; add a Research-taste section
- Promote the general (non-RL-specific) Spinning Up lessons up to the main
  folklore: "broken code fails silently", "you can't tell it's broken if you
  can't see that it's breaking", and test on more than one setup.
- Add gwern's "Unseeing" to the data theme: you can't read what you actually
  wrote, hence fresh eyes / a fresh-eyes subagent.
- New "Research taste (adjacent to debugging)" section with verbatim quotes,
  each cached: Neel Nanda (your research is false by default; excitement is
  evidence of bullshit; read your data), Ulisse Mini (understand the system to
  shrink the search space), John Wentworth (gears-level models are capital
  investments vs cheap black boxes).

All quotes verbatim from cached sources; 25/25 footnotes resolve.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 21:08:49 +08:00

How to Become a Mechanistic Interpretability Researcher — Neel Nanda

Source: https://www.alignmentforum.org/posts/jP9KDyMkchuv6tHwm/how-to-become-a-mechanistic-interpretability-researcher (also on LessWrong, same post id). Verbatim excerpts cached for the research-taste section.


Skepticism/Truth-seeking: The default state of the world is that your research is false, because doing research is hard. Your north star should always be to find true insights

Excitement is evidence of bullshit: Generally, most true results are not exciting, but a fair amount of false results are. So from a Bayesian perspective, if a result is exciting and cool, it's even more likely to be false than normal!
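
Editorial illustration, not part of the quoted post: the Bayesian step in the excerpt above can be checked with a few lines of arithmetic. The sketch below applies Bayes' rule to made-up probabilities. The function name `posterior_false_given_exciting` and the numbers 0.7, 0.3 and 0.1 are assumptions chosen only to make the effect visible, not figures from the source.

```python
def posterior_false_given_exciting(
    prior_false: float,
    p_exciting_if_false: float,
    p_exciting_if_true: float,
) -> float:
    """Return P(false | exciting) via Bayes' rule."""
    prior_true = 1.0 - prior_false
    # Joint probability of "false and exciting".
    false_and_exciting = p_exciting_if_false * prior_false
    # Joint probability of "true and exciting".
    true_and_exciting = p_exciting_if_true * prior_true
    return false_and_exciting / (false_and_exciting + true_and_exciting)


# Hypothetical inputs: 70% of results are false by default, and false
# results are three times as likely to look exciting as true ones.
PRIOR_FALSE = 0.7
posterior = posterior_false_given_exciting(
    prior_false=PRIOR_FALSE,
    p_exciting_if_false=0.3,
    p_exciting_if_true=0.1,
)
print(f"P(false) before seeing excitement: {PRIOR_FALSE:.3f}")
print(f"P(false | exciting):               {posterior:.3f}")
```

With these assumed inputs the posterior is about 0.875, up from the prior of 0.7. The direction of the update depends only on exciting results being more common among false findings than among true ones. The size of the update depends entirely on the numbers chosen.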

Read your data: A fantastic use of time, especially during the exploration phase, is just actually reading the data you're working with, or model chains of thought and responses. [...] Often, the quality of the data is a crucial driver of the results of your experiments. Often, it is quite bad.

A useful exercise is imagining you're talking to a really obnoxious skeptic who keeps complaining that they don't believe you and coming up with arguments for why your thing is wrong. What could you do such that they don't have a leg to stand on?

Do ablations on your fancy method: It's easy for people to have a fancy method with lots of moving parts, when many actually are unnecessary. You should always try removing one part and see if the method breaks. Do this for each part.
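
Editorial illustration, not part of the quoted post: "remove one part, see if the method breaks, repeat for each part" is a leave-one-out loop. The sketch below is one minimal way to write it, assuming the method can be scored as a function of which components are switched on. The names `leave_one_out` and `toy_score`, the component labels and the integer scores are all hypothetical stand-ins for a real pipeline and metric.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple


def leave_one_out(
    components: Iterable[str],
    score: Callable[[FrozenSet[str]], int],
) -> Tuple[int, Dict[str, int]]:
    """Score the full method, then re-score it with each part removed.

    Returns the full-method score and, per component, how much the score
    drops when that component alone is removed.
    """
    full = frozenset(components)
    baseline = score(full)
    drops = {}
    for part in sorted(full):
        drops[part] = baseline - score(full - {part})
    return baseline, drops


def toy_score(active: FrozenSet[str]) -> int:
    """Hypothetical metric in percentage points, standing in for a real eval."""
    points = 50
    if "probe" in active:
        points += 30
    if "normalise" in active:
        points += 10
    # "fancy_reweighting" is deliberately a no-op in this toy example.
    return points


baseline, drops = leave_one_out(
    ["probe", "normalise", "fancy_reweighting"], toy_score
)
for part, drop in drops.items():
    verdict = "needed" if drop > 0 else "no measurable effect"
    print(f"{part}: drop of {drop} points ({verdict})")
```

In this toy setup the `fancy_reweighting` component shows a drop of zero, which is the signal the quote is asking for. In a real experiment the score is noisy, so a small drop should be compared against run-to-run variance before a part is declared unnecessary.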