lessweong data

This commit is contained in:
wassname
2025-07-26 10:25:45 +08:00
parent 29e89a1e42
commit 3950695c70
17 changed files with 1215 additions and 6 deletions
+1 -1
@@ -567,7 +567,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0rc1"
"version": "3.11.0"
}
},
"nbformat": 4,
@@ -0,0 +1,182 @@
---
title: "By default, capital will matter more than ever after AGI"
date: 2024-12-30 19:47:19.838000+00:00
url: https://www.lesswrong.com/posts/KFFaKu27FNugCHFmh/by-default-capital-will-matter-more-than-ever-after-agi
novelty: 0.906387857762945
score: 0.6629015207290649
baseScore: 249
voteCount: 147
---
***Edited to add:** The main takeaway of this post is meant to be: **Labour-replacing AI will shift the relative importance of human v non-human factors of production, which reduces the incentives for society to care about humans while making existing powers more effective and entrenched.** Many people are reading this post in a way where either (a) "capital" means just "money" (rather than also including physical capital like factories and data centres), or (b) the main concern is human-human inequality (rather than broader societal concerns about humanity's collective position, the potential for social change, and human agency).*
I've heard many people say something like "money won't matter post-AGI". This has always struck me as odd, and as most likely completely incorrect.
First: labour means human mental and physical effort that produces something of value. Capital goods are things like factories, data centres, and software—things humans have built that are used in the production of goods and services. I'll use "capital" to refer to both the stock of capital goods and to the money that can pay for them. I'll say "money" when I want to exclude capital goods.
The key economic effect of AI is that it makes capital a more and more general substitute for labour. There's less need to pay humans for their time to perform work, because you can replace that with capital (e.g. data centres running software replaces a human doing mental labour).
I will walk through consequences of this, and end up concluding that labour-replacing AI means:
1. The ability to buy results in the real world will dramatically go up
2. Human ability to wield power in the real world will dramatically go down (at least without money); including because:
    1. there will be no more incentive for states, companies, or other institutions to care about humans
    2. it will be harder for humans to achieve outlier outcomes relative to their starting resources
3. Radical equalising measures are unlikely
Overall, this points to a neglected downside of transformative AI: that society might become permanently static, and that current power imbalances might be amplified and then turned immutable.
Given sufficiently strong AI, *this is not a risk about insufficient material comfort*. Governments could institute UBI with the AI-derived wealth. Even if e.g. only the United States captures AI wealth and the US government does nothing for the world, if you're willing to assume arbitrarily extreme wealth generation from AI, the wealth of the small percentage of wealthy Americans who care about causes outside the US might be enough to end material poverty (if 1% of American billionaire wealth was spent on wealth transfers to foreigners, it would take 16 doublings of American billionaire wealth as expressed in purchasing-power-for-human-needs—a roughly 70,000x increase—before they could afford to give $500k-equivalent to every person on Earth; in a singularity scenario where the economy's doubling time is months, this would not take long). Of course, if the AI explosion is less singularity-like, or if the dynamics during AI take-off actively disempower much of the world's population (a real possibility), even material comfort could be an issue.
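The arithmetic in the parenthetical above can be checked with a short sketch. The inputs for total American billionaire wealth and world population are my own illustrative assumptions, not figures from the post; only the 1% share and the $500k-equivalent target come from the text. The doubling time at the end is likewise a hypothetical value chosen to illustrate "months".

```python
import math

# Illustrative assumptions (NOT taken from the post):
BILLIONAIRE_WEALTH_USD = 5.5e12   # assumed total wealth of American billionaires
WORLD_POPULATION = 8.0e9          # assumed number of recipients

# Figures stated in the post:
SHARE_GIVEN_ABROAD = 0.01         # 1% of billionaire wealth spent on transfers
TRANSFER_PER_PERSON_USD = 5.0e5   # $500k-equivalent per person


def required_multiplier(wealth, share, population, per_person):
    """Factor by which `wealth` must grow so that `share` of it covers all transfers."""
    return (population * per_person) / (wealth * share)


def doublings_needed(multiplier):
    """Number of doublings equivalent to a given growth factor."""
    return math.log2(multiplier)


multiplier = required_multiplier(
    BILLIONAIRE_WEALTH_USD,
    SHARE_GIVEN_ABROAD,
    WORLD_POPULATION,
    TRANSFER_PER_PERSON_USD,
)
doublings = doublings_needed(multiplier)

print(f"required growth factor: {multiplier:,.0f}x")
print(f"doublings needed: {doublings:.1f}")

# Hypothetical doubling time, only to show how quickly this could pass.
DOUBLING_TIME_MONTHS = 6
print(f"years at that pace: {doublings * DOUBLING_TIME_MONTHS / 12:.1f}")
```

Under these assumed inputs the result lands close to the post's "16 doublings" and "roughly 70,000x"; different wealth or population estimates shift the answer by a fraction of a doubling, which does not change the argument.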
What most emotionally moves me about these scenarios is that a static society with a locked-in ruling caste does not seem dynamic or alive to me. We should not kill human ambition, if we can help it.
There are also ways in which such a state makes [slow-rolling, gradual AI catastrophes](https://nosetgauge.substack.com/p/a-disneyland-without-children) more likely, because the incentive for power to care about humans is reduced.
**The default solution**
------------------------
Let's assume human mental and physical labour across the vast majority of tasks that humans are currently paid wages for no longer has non-trivial market value, because the tasks can be done better/faster/cheaper by AIs. Call this labour-replacing AI.
There are two levels of the standard solution to the resulting unemployment problem:
1. Governments will adopt something like universal basic income (UBI).
2. We will quickly hit superintelligence, and, assuming the superintelligence is aligned, live in a post-scarcity technological wonderland where everything is possible.
Note, firstly, that money will continue being a thing, at least unless we have one single AI system doing all economic planning. [Prices are largely about communicating information](https://www.econlib.org/library/Essays/hykKnw.html). If there are many actors and they trade with each other, the strong assumption should be that there are prices (even if humans do not see them or interact with them). Remember too that however sharp the singularity, abundance will still be finite, and must therefore be allocated.
**Money currently struggles to buy talent**
-------------------------------------------
Money can buy you many things: capital goods, for example, can usually be bought quite straightforwardly, and cannot be bought without a lot of money (or other liquid assets, or non-liquid assets that others are willing to write contracts against, or special government powers). But it is surprisingly hard to convert raw money into labour, in a way that is competitive with top labour.
Consider Blue Origin versus SpaceX. Blue Origin was started two years earlier (2000 v 2002), had much better funding for most of its history, and even today employs almost as many people as SpaceX (11,000 v 13,000). Yet SpaceX has crushingly dominated Blue Origin. In 2000, Jeff Bezos had $4.7B at hand. But it is hard to see what he could've done to not lose out to the comparatively money-poor SpaceX with its intense culture and outlier talent.
Consider, a century earlier, the Wright brothers with their bike shop resources beating Samuel Langley's well-funded operation.
Consider the stereotypical VC-and-founder interaction, or the acquirer-and-startup interaction. In both cases, holders of massive financial capital are willing to pay very high prices to bet on labour—and the bet is that the labour of the few people in the startup will beat extremely large amounts of capital.
If you want to convert money into results, the deepest problem you are likely to face is hiring the right talent. And that comes with several problems:
1. It's often hard to judge talent, unless you yourself have considerable talent in the same domain. Therefore, if you try to find talent, you will often miss.
2. Talent is rare (and credentialed talent even more so—and many actors can't afford to rely on any other kind, because of point 1), so there's just not very much of it going around.
3. Even if you can locate the top talent, the top talent tends to be *less* amenable to being bought out by money than others.
(Of course, those with money keep building infrastructure that makes it easier to convert money into results. I have seen first-hand the largely-successful quest by quant finance companies to strangle all existing [ambition](https://space.ong.ac/escaping-flatland) out of top UK STEM grads and replace it with the eking of tiny gains in financial markets. Mammon must be served!)
With labour-replacing AI, these problems go away.
First, *you* might not be able to judge AI talent. Even the AI evals ecosystem might find it hard to properly judge AI talent—evals are hard. Maybe even the informal word-of-mouth mechanisms that correctly sang the praises of Claude-3.5-Sonnet far more decisively than any benchmark might find it harder and harder to judge which AIs really are best as AI capabilities keep rising. But the real difference is that the AIs can be cloned. Currently, huge pools of money chase after a single star researcher who's made a breakthrough, and thus had their talent made legible to those who control money (who can judge the clout of the social reception to a paper but usually can't judge talent itself directly). But the star researcher that is an AI can just be cloned. Everyone—or at least, everyone with enough money to burn on GPUs—gets the AI star researcher. No need to sort through the huge variety of unique humans with their unproven talents and annoying inability to be instantly cloned. This is the main reason why it will be easier for money to find top talent once we have labour-replacing AIs.
Also, of course, the price of talent will go down massively, because the AIs will be cheaper than the equivalent human labour, and because competition will be fiercer because the AIs can be cloned.
The final big bottleneck for converting money into talent is that lots of top talent has complicated human preferences that make them hard to buy out. The top artist has an artistic vision they're genuinely attached to. The top mathematician has a deep love of elegance and beauty. The top entrepreneur has deep conviction in what they're doing—and probably wouldn't function well as an employee anyway. Talent and performance in humans are surprisingly tied to a [sacred bond to a discipline or mission](https://www.benlandautaylor.com/p/looking-beyond-the-veil) (a fact that the world's cynics / careerists / Roman Empires like to downplay, only to then find their lunch eaten by the ambitious interns / SpaceXes / Christianities of the world). In contrast, AIs exist specifically so that they *can* be trivially bought out (at least within the bounds of their safety training). The genius AI mathematician, unlike the human one, will happily spend its limited time on Earth proving the correctness of schlep code.
Finally (and obviously), the AIs will eventually be much more capable than any human employees at their tasks.
This means that *the ability of money to buy results in the real world will dramatically go up once we have labour-replacing AI*.
**Most people's power/leverage derives from their labour**
----------------------------------------------------------
Labour-replacing AI also deprives almost everyone of their main lever of power and leverage. Most obviously, if you're the average Joe, you have money because someone somewhere pays you to spend your mental and/or physical efforts solving their problems.
But wait! We assumed that there's UBI! Problem solved, right?
### **Why are states ever nice?**
UBI is granted by states that care about human welfare. There are many reasons why states care and might care about human welfare.
Over the past few centuries, there's been a big shift towards states caring more about humans. Why is this? We can examine the reasons to see how durable they seem:
1. Moral changes downstream of the Enlightenment, in particular an increased centering of liberalism and individualism.
2. Affluence & technology. Pre-industrial societies were mostly so poor that significant efforts to help the poor would've bankrupted them. Many types of help (such as effective medical care) are also only possible because of new technology.
3. Incentives for states to care about freedom, prosperity, and education.
AI will help a lot with the 2nd point. It will have some complicated effect on the 1st. But here I want to dig a bit more into the 3rd, because I think this point is unappreciated.
Since the industrial revolution, the interests of states and people have been unusually aligned. To be economically competitive, a strong state needs efficient markets, a good education system that creates skilled workers, and a prosperous middle class that creates demand. It benefits from using talent regardless of its class origin. It also benefits from allowing high levels of freedom to foster science, technology, and the arts & media that result in global soft-power and cultural influence. Competition between states largely pushes *further* in all these directions—consider the success of the US, or how even the CCP is pushing for efficient markets and educated rich citizens, and faces incentives to allow some freedoms for the sake of Chinese science and startups. Contrast this to the feudal system, where the winning strategy was building an extractive upper class to rule over a population of illiterate peasants and spend a big share of extracted rents on winning wars against nearby states. For more, see [my review of *Foragers, Farmers, and Fossil Fuels*](https://nosetgauge.substack.com/p/review-foragers-farmers-and-fossil-fuels), or my post on the connection between [moral values and economic growth](https://nosetgauge.substack.com/p/growth-and-civilisation).
With labour-replacing AI, the incentives of states—in the sense of what actions states should take to maximise their competitiveness against other states and/or their own power—will no longer be aligned with humans in this way. The incentives might be better than during feudalism. During feudalism, the incentive was to extract as much as possible from the peasants without them dying. After labour-replacing AI, humans will be less a resource to be mined and more just irrelevant. However, spending fewer resources on humans and more on the AIs that sustain the state's competitive advantage will still be incentivised.
Humans will also have much less leverage over states. Today, if some important sector goes on strike, or if some segment of the military threatens a coup, the state has to care, because its power depends on the buy-in of at least some segments of the population. People can also credibly tell the state things like "invest in us and the country will be stronger in 10 years". But once AI can do all the labour that keeps the economy going and the military powerful, the state has no more *de facto* reason to care about the demands of its humans.
Adam Smith could [write](https://oll.libertyfund.org/quotes/adam-smith-butcher-brewer-baker) that his dinner doesn't depend on the benevolence of the butcher or the brewer or the baker. The classical liberal today can credibly claim that the arc of history really does bend towards freedom and plenty for all, not out of the benevolence of the state, but because of the incentives of capitalism and geopolitics. But after labour-replacing AI, this will no longer be true. If the arc of history keeps bending towards freedom and plenty, it will do so only out of the benevolence of the state (or the AI plutocrats). If so, we better lock in that benevolence while we have leverage—and have a good reason why we expect it to stand the test of time.
The best thing going in our favour is democracy. It's a huge advantage that a deep part of many of the modern world's strongest institutions (i.e. Western democracies) is equal representation of every person. However, only about [13% of the world's population lives in a liberal democracy](https://ourworldindata.org/grapher/people-living-in-democracies-autocracies), which creates concerns about the fate of the remaining 87% of the world's people (especially the 27% in closed autocracies). It also creates potential for [Molochian](https://slatestarcodex.com/2014/07/30/meditations-on-moloch/) competition between humanist states and less scrupulous states that might drive down the resources spent on human flourishing to zero over a sufficiently long timespan of competition.
I focus on states above, because states are the strongest and most durable institutions today. However, similar logic applies if, say, companies or some entirely new type of organisation become the most important type of institution.
### **No more outlier outcomes?**
Much change in the world is driven by people who start from outside money and power, achieve outlier success, and then end up with money and/or power. This makes sense, since those with money and/or power rarely have the fervour to push for big changes, since they are exactly those who are best served by the status quo.
Whatever your opinions on income inequality or any particular group of outlier successes, I hope you agree with me that the possibility of someone achieving outlier success and changing the world is important for avoiding stasis and generally having a world that is interesting to live in.
Let's consider the effects of labour-replacing AI on various routes to outlier success through labour.
**Entrepreneurship** is increasingly what [Matt Clifford calls the "technology of ambition" of choice](https://medium.com/entrepreneur-first/tech-entrepreneurship-and-the-disruption-of-ambition-4e6854121992) for ambitious young people (at least those with technical talent and without a disposition for politics). Right now, entrepreneurship has become easier. AI tools can already make small teams much more effective without needing to hire new employees. They also reduce the entry barrier to new skills and fields. However, labour-replacing AI makes the tenability of entrepreneurship uncertain. There is some narrow world in which AIs remain mostly tool-like and entrepreneurs can succeed long after most human labour is automated because they provide agency and direction. However, it also seems likely that sufficiently strong AI will by default obsolete human entrepreneurship. For example, VC funds might be able to directly convert money into hundreds of startup attempts all run by AIs, without having to go through the intermediate route of finding human entrepreneurs to manage the AIs for them.
**The hard sciences**. The era of human achievement in hard sciences will probably end within a few years because of the rate of AI progress in anything with crisp reward signals.
**Intellectuals.** Keynes, Friedman, and Hayek all did technical work in economics, but their outsize influence came from the worldviews they developed and sold (especially in Hayek's case), which made them more influential than people like Paul Samuelson who dominated mathematical economics. John Stuart Mill, John Rawls, and Henry George were also influential by creating frames, worldviews, and philosophies. The key thing that separates such people from the hard scientists is that the outputs of their work are not spotlighted by technical correctness alone, but require moral judgement as well. Even if AI is superhumanly persuasive and correct, there's some uncertainty about how AI work in this genre will fit into [the way that human culture picks and spreads ideas](https://nosetgauge.substack.com/p/ai-and-wisdom-3-ai-effects-on-amortised). Probably it doesn't look good for human intellectuals. I suspect that a lot of why intellectuals' ideologies can have so much power is that they're products of genius in a world where genius is rare. A flood of AI-created ideologies might mean that no individual ideology, and certainly no human one, can shine so bright anymore. The world-historic intellectual might go extinct.
**Politics** might be one of the least-affected options, since I'd guess that most humans specifically want a human to do that job, and because politicians get to set the rules for what's allowed. The charisma of AI-generated avatars, and a general dislike towards politicians at least in the West, might throw a curveball here, though. It's also hard to say whether incumbents will be favoured. AI might bring down the cost of many parts of political campaigning, reducing the resource barrier to entry. However, if AI too expensive for small actors is meaningfully better than cheaper AI, this would favour actors with larger resources. I expect these direct effects to be smaller than the indirect effects from whatever changes AI has on the memetic landscape.
Also, the real play is not to go into actual politics, where a million other politically-talented people are competing to become president or prime minister. Instead, have political skill and go somewhere outside government where political skill is less common (cf. Sam Altman). Next, wait for the arrival of hyper-competent AI employees that reduce the demands for human subject-matter competence while increasing the rewards for winning political games within that organisation.
**Military** success as a direct route to great power and disruption has—for the better—not really been a thing since Napoleon. Advancing technology increases the minimum industrial base for a state-of-the-art army, which benefits incumbents. AI looks set to be controlled by the most powerful countries. One exception is if coups of large countries become easier with AI. Control over the future AI armies will likely be both (a) more centralised than before (since a large number of people no longer have to go along for the military to take an action), and (b) more tightly controllable than before (since the permissions can be implemented in code rather than human social norms). These two factors point in different directions so it's uncertain what the net effect on coup ease will be. Another possible exception is if a combination of revolutionary tactics and cheap drones enables a Napoleon-of-the-drones to win against existing armies. Importantly, though, neither of these seems likely to promote the *good* kind of disruptive challenge to the status quo.
**Religions**. When it comes to rising rank in existing religions, the above takes on politics might be relevant. When it comes to starting new religions, the above takes on intellectuals might be relevant.
So, on net, sufficiently strong labour-replacing AI will be bad for the chances of every type of outlier human success, with perhaps the weakest effects in politics. This is despite the very real boost that current AI has on entrepreneurship.
All this means that *the ability to get and wield power in the real world without money will dramatically go down once we have labour-replacing AI.*
**Enforced equality is unlikely**
---------------------------------
*The Great Leveler* is a [good book](https://rudolf.website/short-reviews-nonfiction-1/#section-3) on the history of inequality that ([at least per the author](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4939891)) has survived its critiques fairly well. Its conclusion is that past large reductions in inequality have all been driven by one of the "Four Horsemen of Leveling": total war, violent revolution, state collapse, and pandemics. Leveling income differences has historically been hard enough to basically never happen through conscious political choice.
Imagine that labour-replacing AI is here. UBI is passed, so no one is starving. There's a massive scramble between countries and companies to make the best use of AI. This is all capital-intensive, so everyone needs to woo holders of capital. The top AI companies wield power on the level of states. The redistribution of wealth is unlikely to end up on top of the political agenda.
An exception might be if some new political movement or ideology gets a lot of support quickly, and is somehow boosted by some unprecedented effect of AI (such as: no one has jobs anymore so they can spend all their time on politics, or there's some new AI-powered coordination mechanism).
Therefore, even if the future is a glorious transhumanist utopia, it is unlikely that people will be starting in it at an equal footing. Due to the previous arguments, it is also unlikely that they will be able to greatly change their relative footing later on.
Consider also equality between states. Some states stand to benefit massively more than others from AI. Many equalising measures, like UBI, would be difficult for states to extend to non-citizens under anything like the current political system. This is true even of the United States, the most liberal and humanist great power in world history. By default, the world order might therefore look (even more than today) like a global caste system based on country of birth, with even fewer possibilities for immigration (because the main incentive to allow immigration is its massive economic benefits, which only exist when humans perform economically meaningful work).
**The default outcome?**
------------------------
Let's grant the assumptions at the start of this post and the above analysis. Then, the post-labour-replacing-AI world involves:
* Money will be able to buy results in the real world better than ever.
* People's labour gives them less leverage than ever before.
* Achieving outlier success through your labour in most or all areas is now impossible.
* There was no transformative leveling of capital, either within or between countries.
This means that those with significant capital when labour-replacing AI started have a permanent advantage. They will wield more power than the rich of today—not necessarily *over* people, to the extent that liberal institutions remain strong, but at least over physical and intellectual achievements. Upstarts will not defeat them, since capital now trivially converts into superhuman labour in any field.
Also, there will be no more incentive for whatever institutions wield power in this world to care about people in order to maintain or grow their power, because all real power will flow from AI. There might, however, be significant lock-in of liberal humanist values through political institutions. There might also be significant lock-in of people's purchasing power, if everyone has meaningful UBI (or similar), and the economy retains a human-oriented part.
In the best case, this is a world like a more unequal, unprecedentedly static, and much richer Norway: a massive pot of non-human-labour resources (oil :: AI) has benefits that flow through to everyone, and yes some are richer than others but everyone has a great standard of living (and [ideally](https://nosetgauge.substack.com/p/death-is-bad) also lives forever). The only realistic forms of human ambition are playing local social and political games within your social network and class. If you don't have a lot of capital (and maybe not even then), you don't have a chance of affecting the broader world anymore. Remember: the AIs are better poets, artists, philosophers—everything; why would anyone care what some human does, unless that human is someone they personally know? Much like in feudal societies the answer to "why is this person powerful?" would usually involve some long family history, perhaps ending in a distant ancestor who had fought in an important battle ("my great-great-grandfather fought at Bosworth Field!"), anyone of importance in the future will be important because of something they or someone they were close with did in the pre-AGI era ("oh, my uncle was technical staff at OpenAI"). The children of the future will live their lives in the shadow of their parents, with social mobility extinct. I think you should definitely feel a non-zero amount of existential horror at this, even while acknowledging that it could've gone a lot worse.
In a worse case, AI trillionaires have near-unlimited and unchecked power, and there's a permanent aristocracy that was locked in based on how much capital they had at the time of labour-replacing AI. The power disparities between classes might make modern people shiver, much like modern people consider feudal status hierarchies grotesque. But don't worry—much like [the feudal underclass mostly accepted their world order](https://nosetgauge.substack.com/p/review-foragers-farmers-and-fossil-fuels) due to their culture even without superhumanly persuasive AIs around, the future underclass will too.
In the absolute worst case, humanity goes extinct, potentially because of a [slow-rolling optimisation for AI power over human prosperity](https://nosetgauge.substack.com/p/a-disneyland-without-children) over a long period of time. Because that's what the power and money incentives will point towards.
**What's the takeaway?**
------------------------
If you read this post and accept a job at a quant finance company as a result, I will be sad. If you were about to do something ambitious and impactful about AI, and read this post and accept a job at Anthropic to accumulate risk-free personal capital while counterfactually helping out a bit over the marginal hire, I can't fault you too much, but I will still be slightly sad.
It's of course true that the above increases the stakes of medium-term (~2-10 year) personal finance, and you should consider this. But it's also true that *right now* is a great time to do something ambitious. Robin Hanson calls the present ["the dreamtime"](https://www.overcomingbias.com/p/this-is-the-dream-timehtml), following a concept in Aboriginal myths: the time when the future world order and its values are still liquid, not yet set in stone.
Previous upheavals—the various waves of industrialisation, the internet, etc.—were great for human ambition. With AI, we could have the last and greatest opportunity for human ambition—followed shortly by its extinction for all time. How can your reaction not be: ["carpe diem"](https://en.wikipedia.org/wiki/Carpe_diem)?
We should also try to preserve the world's dynamism.
Rationalist thought on post-AGI futures is too solutionist. The strawman version: solve morality, solve AI, figure out the optimal structure to tile the universe with, do that, done. (The actual leading figures have far less strawman views; see e.g. [Paul Christiano at 23:30 here](https://www.dwarkeshpatel.com/p/paul-christiano)—but the on-the-ground culture *does* lean in the strawman direction.)
I think it's much healthier for society and its development to be a shifting, dynamic thing where the ability, as an individual, to add to it or change it remains in place. And that means keeping the potential for successful ambition—and the resulting disruption—alive.
How do we do this? I don't know. But I don't think you should see the approach of powerful AI as a blank inexorable wall of human obsolescence, consuming everything equally and utterly. There will be cracks in the wall, at least for a while, and they will look much bigger up close once we get there—or if you care to look for them hard enough from further out—than from a galactic perspective. As AIs get closer and closer to a Pareto improvement over all human performance, though, I expect we'll eventually need to augment ourselves to keep up.
+59
@@ -0,0 +1,59 @@
---
title: "The Plan - 2024 Update"
date: 2024-12-31 19:29:23.013000+00:00
url: https://www.lesswrong.com/posts/kJkgXEwQtWLrpecqg/the-plan-2024-update
novelty: 0.7851697060402698
score: 0.542655885219574
baseScore: 113
voteCount: 48
---
This post is a follow-up to [The Plan - 2023 Version](https://www.lesswrong.com/posts/HfqbjwpAEGep9mHhc/the-plan-2023-version). There's also [The Plan - 2022 Update](https://www.lesswrong.com/posts/BzYmJYECAc3xyCTt6/the-plan-2022-update) and [The Plan](https://www.lesswrong.com/posts/3L46WGauGpr7nYubu/the-plan), but the 2023 version contains everything you need to know about the current Plan. Also see [this comment](https://www.lesswrong.com/posts/BzYmJYECAc3xyCTt6/the-plan-2022-update?commentId=grDHP2Yc6os2SjFsy) and [this comment](https://www.lesswrong.com/posts/HfqbjwpAEGep9mHhc/the-plan-2023-version?commentId=CGoaDgsDdK5TztBqi) on how my plans interact with the labs and other players, if you're curious about that part.
What Have You Been Up To This Past Year?
----------------------------------------
Our big thing at the end of 2023 was [Natural Latents](https://www.lesswrong.com/posts/dWQWzGCSFj6GTZHz7/natural-latents-the-math). Prior to natural latents, the biggest problem with my [math on natural abstraction](https://www.lesswrong.com/posts/gvzW46Z3BsaZsLc25/natural-abstractions-key-claims-theorems-and-critiques-1) was that it didn't handle approximation well. Natural latents basically solved that problem. With that theoretical barrier out of the way, it was time to focus on crossing the theory-practice gap. Ultimately, that means building a product to get feedback from users on how well our theory works in practice, providing an empirical engine for iterative improvement of the theory.
In late 2023 and early 2024, David and I spent about 3-4 months trying to speedrun the theory-practice gap. Our target product was an image editor; the idea was to use a standard image generation net (specifically [this one](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7)), and edit natural latent variables internal to the net. It's conceptually similar to some things people have built before, but the hope would be that natural latents would better match human concepts, and therefore the edits would feel more like directly changing human-interpretable things in the image in natural ways.
When I say “speedrun” the theory-practice gap… well, the standard expectation is that there's a lot of iteration and insights required to get theory working in practice (even when the theory is basically correct). The “speedrun” strategy was to just try the easiest and hackiest thing at every turn. The hope was that (a) maybe it turns out to be that easy (though probably not), and (b) even if it doesn't work we'll get some useful feedback. After 3-4 months, it indeed did not work very well. But more importantly, we did not actually get much useful feedback signal. David and I now think the project was a pretty major mistake; it cost us 3-4 months and we got very little out of it.
After that, we spent a few months on some smaller and more theory-ish projects. We worked out a [couple](https://www.lesswrong.com/posts/NHKCtSXgFieDAyWt2/calculating-natural-latents-via-resampling) [more](https://www.lesswrong.com/posts/xDsbqxeCQWe4BiYFX/natural-latents-are-not-robust-to-tiny-mixtures) pieces of the math of natural latents, [explained](https://www.lesswrong.com/posts/RrQftNoRHd5ya54cb/towards-a-less-bullshit-model-of-semantics) what kind of model of semantics we'd ideally like (in terms of natural latents), [wrote up](https://www.lesswrong.com/posts/DXxEp3QWzeiyPMM3y/a-simple-toy-coherence-theorem) a toy coherence theorem which I think is currently the best illustration of how coherence theorems *should* work, [worked out](https://www.lesswrong.com/posts/QA7bQHpKymPBFBuHb/a-solomonoff-inductor-walks-into-a-bar-schelling-points-for) a version of natural latents for Solomonoff inductors[[1]](#fn6vcsu80422i) and applied that to semantics as well, [presented](https://www.lesswrong.com/posts/7LaDvWtymFWtidGxe/corrigibility-tool-ness) an interesting notion of corrigibility and tool-ness, and [put](https://www.lesswrong.com/posts/YgaPhcrkqnLrTzQPG/we-don-t-know-our-own-values-but-reward-bridges-the-is-ought) [together](https://www.lesswrong.com/posts/a5hpPfABQnrkfGGxb/values-are-real-like-harry-potter) an agent model which resolved all of my own most pressing outstanding confusions about the type-signature of human values. There were also a few other results which we haven't yet written up, including a version of the second law of thermo more suitable for embedded agents, and some more improvements to the theory of natural latents, as well as a bunch of small investigations which didn't yield anything legible.
Of particular note, we spent several weeks trying to [apply](https://www.lesswrong.com/posts/QsstSjDqa7tmjQfnq/wait-our-models-of-semantics-should-inform-fluid-mechanics) the theory of natural latents to fluid mechanics. That project has not yet yielded anything notable, but it's of interest here because it's another plausible route to a useful product: a fluid simulation engine based on natural latent theory would, ideally, make all of today's fluid simulators completely obsolete, and totally change the accuracy/compute trade-off curves. To frame it in simulation terms, the ideal version of this would largely solve the challenges of [multiscale simulation](https://en.wikipedia.org/wiki/Multiscale_modeling), i.e. eliminate the need for a human to figure out relevant summary statistics and hand-code multiple levels. Of course that project has its own nontrivial theory-practice gap to cross.
At the moment, we're focused on another project with an image generator net, about which we might write more in the future.
Why The Focus On Image Generators Rather Than LLMs?
---------------------------------------------------
At this stage, we're not really interested in the internals of nets themselves. Rather, we're interested in what kinds of patterns in the environment the net learns and represents. Roughly speaking, one can't say anything useful about representations in a net until one has a decent characterization of the types of patterns in the environment which are represented in the first place.[[2]](#fnbosody02zja)
And for that purpose, we want to start as “close to the metal” as possible. We definitely do not want our lowest-level data to be symbolic strings, which are themselves already high-level representations far removed from the environment we're trying to understand.
And yes, I do think that interp work today should mostly focus on image nets for the same reasons we focus on image nets. The field's current focus on LLMs is a mistake.
Any Major Changes To The Plan In The Past Year?
-----------------------------------------------
In previous years, much of my relative optimism stemmed from the hope that the field of alignment would soon shift from pre-paradigmatic to paradigmatic, and progress would accelerate a lot as a result. [I've largely given up on that hope](https://www.lesswrong.com/posts/nwpyhyagpPYDn4dAW/the-field-of-ai-alignment-a-postmortem-and-what-to-do-about). The probability I assign to a good outcome has gone down accordingly; I don't have a very firm number, but it's definitely below 50% now.
In terms of the plan, we've shifted toward assuming we'll need to do more of the work ourselves. Insofar as we're relying on other people to contribute, we expect it to be a narrower set of people on narrower projects.
This is not as dire an update as it might sound. The results we already have are far beyond what I-in-2020 would have expected from just myself and one other person, especially with the empirical feedback engine not really up and running yet. Earlier this year, David and I estimated that we'd need roughly a 3-4x productivity multiplier to feel like we were basically on track. And that kind of productivity multiplier is not out of the question; I already estimate that working with David has been about a 3x boost for me, so we'd need roughly that much again. Especially if we get the empirical feedback loop up and running, another 3-4x is very plausible. Not easy, but plausible.
Do We Have Enough Time?
-----------------------
Over the past year, my timelines have become even more bimodal than they already were. The key question is whether o1/o3-style models achieve criticality (i.e. are able to autonomously self-improve in non-narrow ways), including possibly under the next generation of base model. My median guess is that they won't and that the excitement about them is [very overblown](https://www.lesswrong.com/posts/puv8fRDCH9jx5yhbX/johnswentworth-s-shortform?commentId=szp5fNZJqpfrsyQP9). But I'm not very confident in that guess.
If the excitement is overblown, then we're most likely still about 1 transformers-level paradigm shift away from AGI capable of criticality, and timelines of ~10 years seem reasonable. Conditional on that world, I also think we're likely to see another AI winter in the next year or so.
If the excitement is not overblown, then we're probably looking at more like 2-3 years to criticality. In that case, any happy path probably requires outsourcing a lot of alignment research to AI, and then the main bottleneck is probably [our own understanding](https://www.lesswrong.com/s/TLSzP4xP42PPBctgw/p/3gAccKDW6nRKFumpP) of how to align much-smarter-than-human AGI.
1. **[^](#fnref6vcsu80422i)**
Woohoo! I'd been wanting a Solomonoff version of natural abstraction theory for years.
2. **[^](#fnrefbosody02zja)**
The lack of understanding of the structure of patterns in the environment is a major barrier for interp work today. The cutting edge is “sparse features”, which is indeed a pattern which comes up a lot in our environment, but it's probably far from a complete catalogue of the relevant types of patterns.
+190
@@ -0,0 +1,190 @@
---
title: "2024 in AI predictions"
date: 2025-01-02 02:09:13.093000+00:00
url: https://www.lesswrong.com/posts/CJ4sppkGcbnGMSG2r/2024-in-ai-predictions
novelty: 0.789189874669953
score: 0.8065339922904968
baseScore: 116
voteCount: 35
---
Follow-up to: [2023 in AI predictions](https://www.lesswrong.com/posts/EZxG6ySHCEjDvL5x4/2023-in-ai-predictions).
Here I collect some AI predictions made in 2024. It's not very systematic; it's a convenience sample, mostly from browsing Twitter/X. I prefer including predictions that are more specific/testable. I'm planning to make these posts yearly, checking in on predictions whose date has expired. Feel free to add more references to predictions made in 2024 to the comments. (Thanks especially @tsarnick and @AISafetyMemes for posting about a lot of these.)
Predictions about 2024
======================
I'll review predictions from previous posts that are about 2024.
the gears to ascension: "Hard problem of alignment is going to hit us like a train in 3 to 12 months at the same time some specific capabilities breakthroughs people have been working on for the entire history of ML finally start working now that they have a weak AGI to apply to, and suddenly critch's stuff becomes super duper important to understand." (conceded as false by author)
[John Pressman](https://x.com/jd_pressman/status/1718789355379314808?t=luXXx8djX0EUV2qgnZXX5g&s=19): "6-12 month prediction (80%): The alignment problem as the core of AI X-Risk will become a historical artifact as it's largely solved or on track to being solved in the eyes of most parties and arguments increasingly become about competition and misuse. Few switch sides." (conceded as false by author)
Predictions made in 2024
========================
December 2024
-------------
[Gary Marcus](https://x.com/GaryMarcus/status/1766871625075409381?t=_R1xaUU0LiAgfXjw3xCTvw&s=19):
> Prediction: By end of 2024 we will see
>
> * 7-10 GPT-4 level models
> * No massive advance (no GPT-5, or disappointing GPT-5)
> * Price wars
> * Very little moat for anyone
> * No robust solution to hallucinations
> * Modest lasting corporate adoption
> * Modest profits, split 7-10 ways
(since 2024 has already ended, this can be evaluated to some degree; I would say he's approximately correct regarding non-agent models, but o1 and o3 are big advances ("massive" is about right), and constitute more moat for OpenAI. He [rates himself](https://x.com/GaryMarcus/status/1873856666334015499) as 7/7.)
September 2025
--------------
[teortaxesTex](https://x.com/teortaxesTex/status/1870257485703197181): "We can have effectively o3 level models fitting into 256 Gb VRAM by Q3 2025, running at >40 t/s. Basically it's a matter of Liang and co. having the compute and the political will to train and upload r3 on Huggingface."
October 2025
------------
[Jack Gallagher](https://x.com/gallabytes/status/1851721561063203153): "calling it now - there's enough different promising candidates rn that I bet by this time next year we mostly don't use Adam anymore."
December 2025
-------------
[Elon Musk](https://x.com/elonmusk/status/1767738797276451090?t=Yfw96mnoV2fAkc8FDZYDcw&s=19): "AI will probably be smarter than any single human next year. By 2029, AI is probably smarter than all humans combined." (I'll repeat this for 2029)
[Aidan McLau](https://x.com/aidan_mclau/status/1870462987842236910): "i think its likely (p=.6) that an o-series model solves a millennium prize math problem in 2025"
[Victor Taelin](https://x.com/VictorTaelin/status/1801403177733898451?t=gIBAnRSQoq4Ksh4LcXN6mg&s=19): "I'm now willing to bet up to 100k (but no more than that, I'm not Musk lol) that HOC will have AGI by end of 2025.... AGI defined as an algorithm capable of proving theorems in a proof assistant as competently as myself. (This is an objective way to say 'codes like Taelin'.)"
April 2026
----------
[drdanponders](https://x.com/drdanponders/status/1784260365343219818): "It just dawned on me that ~humanoids in the house will be a thing very soon indeed. In under 2 years I bet. Simply another home appliance, saving you time, cooking for you, doing the chores, watching the house while you're gone. I can see a robot of approximately this complexity and capabilities at around the price of a budget car even at launch."
June 2026
---------
[Mira Murati](https://x.com/tsarnick/status/1803901130130497952?t=q690BnHYMS6C1TCxbUxfpg&s=19): "in the next couple of years, we're looking at PhD-level intelligence for specific tasks."
August 2026
-----------
[Dario Amodei](https://www.dwarkeshpatel.com/p/dario-amodei): "In terms of someone looks at the model and even if you talk to it for an hour or so, it's basically like a generally well educated human, that could be not very far away at all. I think that could happen in two or three years. The main thing that would stop it would be if we hit certain safety thresholds and stuff like that."
November 2026
-------------
[William Bryk](https://x.com/WilliamBryk/status/1871946968148439260): "700 days until humans are no longer the top dogs at math in the known universe."
February 2027
--------------
[Daniel Kokotajlo](https://www.lesswrong.com/posts/CcqaJFf7TvAjuZFCx/retirement-accounts-and-short-timelines/comment/9tyDD2bqk4BH6Y7XX): "I expect to need the money sometime in the next 3 years, because that's about when we get to 50% chance of AGI."
(thread includes more probabilities further down; see [this thread](https://www.lesswrong.com/posts/cxuzALcmucCndYv4a/daniel-kokotajlo-s-shortform?commentId=LKThjEJ6W8eQEJiXG) for more context on AGI definitions)
December 2027
-------------
[Leopold Aschenbrenner](https://situational-awareness.ai/from-gpt-4-to-agi/): "it is strikingly plausible that by 2027, models will be able to do the work of an AI researcher/engineer."
[Gary Marcus vs. Miles Brundage](https://garymarcus.substack.com/p/where-will-ai-be-at-the-end-of-2027?r=8tdk6&utm_campaign=post&utm_medium=web&triedRedirect=true):
> If there exist AI systems that can perform 8 of the 10 tasks below by the end of 2027, as determined by our panel of judges, Gary will donate $2,000 to a charity of Miles' choice; if AI can do fewer than 8, Miles will donate $20,000 to a charity of Gary's choice.
>
> ...
>
> 1. Watch a previously unseen mainstream movie (without reading reviews etc) and be able to follow plot twists and know when to laugh, and be able to summarize it without giving away any spoilers or making up anything that didn't actually happen, and be able to answer questions like who are the characters? What are their conflicts and motivations? How did these things change? What was the plot twist?
> 2. Similar to the above, be able to read new mainstream novels (without reading reviews etc) and reliably answer questions about plot, character, conflicts, motivations, etc, going beyond the literal text in ways that would be clear to ordinary people.
> 3. Write engaging brief biographies and obituaries without obvious hallucinations that aren't grounded in reliable sources.
> 4. Learn and master the basics of almost any new video game within a few minutes or hours, and solve original puzzles in the alternate world of that video game.
> 5. Write cogent, persuasive legal briefs without hallucinating any cases.
> 6. Reliably construct bug-free code of more than 10,000 lines from natural language specification or by interactions with a non-expert user. [Gluing together code from existing libraries doesn't count.]
> 7. With little or no human involvement, write Pulitzer-caliber books, fiction and non-fiction.
> 8. With little or no human involvement, write Oscar-caliber screenplays.
> 9. With little or no human involvement, come up with paradigm-shifting, Nobel-caliber scientific discoveries.
> 10. Take arbitrary proofs from the mathematical literature written in natural language and convert them into a symbolic form suitable for symbolic verification.
2028
----
[Dario Amodei](https://www.nytimes.com/2024/04/12/podcasts/transcript-ezra-klein-interviews-dario-amodei.html): "A.S.L. 4 is going to be more about, on the misuse side, enabling state-level actors to greatly increase their capability, which is much harder than enabling random people. So where we would worry that North Korea or China or Russia could greatly enhance their offensive capabilities in various military areas with A.I. in a way that would give them a substantial advantage at the geopolitical level. And on the autonomy side, it's various measures of these models are pretty close to being able to replicate and survive in the wild. So it feels maybe one step short of models that would, I think, raise truly existential questions…I think A.S.L. 4 could happen anywhere from 2025 to 2028."
[Shane Legg](https://www.dwarkeshpatel.com/p/shane-legg#details): "And so, yeah, I think there's a 50% chance that we have AGI by 2028. Now, it's just a 50% chance. I'm sure what's going to happen is we're going to get to 2029 and someone's going to say, 'Shane, you were wrong.' Come on, I said 50% chance."
[Thomas Friedman](https://www.nytimes.com/2024/10/29/opinion/artificial-intelligence-harris-trump-election.html): "And this election coincides with one of the greatest scientific turning points in human history: the birth of artificial general intelligence, or A.G.I., which is likely to emerge in the next four years and will require our next president to pull together a global coalition to productively, safely and compatibly govern computers that will soon have minds of their own superior to our own."
[Sabine Hossenfelder](https://www.youtube.com/watch?v=xm1B3Y3ypoE): "According to Aschenbrenner, by 2028, the most advanced models will run on 10 gigawatts of power at a cost of several hundred billion dollars. By 2030, they'll run at 100 gigawatts of power at a cost of a trillion dollars… Can you do that? Totally. Is it going to happen? You got to be kidding me."
[Vlad Tenev](https://x.com/tsarnick/status/1838683210680799665), on AI solving a Millennium Prize problem: 2028 for a human/AI hybrid solving a Millennium Prize problem
2029
----
[Sam Altman](https://www.marketingaiinstitute.com/blog/sam-altman-ai-agi-marketing), regarding AGI: "5 years, give or take, maybe slightly longer — but no one knows exactly when or what it will mean for society."
(he says AGI "will mean that 95% of what marketers use agencies, strategists, and creative professionals for today will easily, nearly instantly and at almost no cost be handled by the AI — and the AI will likely be able to test the creative against real or synthetic customer focus groups for predicting results and optimizing. Again, all free, instant, and nearly perfect. Images, videos, campaign ideas? No problem.")
[Elon Musk](https://x.com/elonmusk/status/1767738797276451090?t=Yfw96mnoV2fAkc8FDZYDcw&s=19): "AI will probably be smarter than any single human next year. By 2029, AI is probably smarter than all humans combined."
[John Schulman](https://www.dwarkeshpatel.com/p/john-schulman) in response to "What is your median timeline for when it replaces your job?": "Maybe five years."
[Ray Kurzweil](https://www.fanaticalfuturist.com/2017/03/kurzweil-ai-aces-turing-test-in-2029-and-the-singularity-arrives-in-2045/): "By 2029, computers will have human level intelligence"
[jbetker](https://nonint.com/2024/06/03/general-intelligence-2024/): "In summary we've basically solved building world models, have 2-3 years on system 2 thinking, and 1-2 years on embodiment. The latter two can be done concurrently. Once all of the ingredients have been built, we need to integrate them together and build the cycling algorithm I described above. I'd give that another 1-2 years. So my current estimate is 3-5 years for AGI. I'm leaning towards 3 for something that looks an awful lot like a generally intelligent, embodied agent (which I would personally call an AGI). Then a few more years to refine it to the point that we can convince the Gary Marcus of the world."
[Jeffrey Ladish](https://x.com/JeffLadish/status/1849992536468922715): "Now it appears, if not obvious, quite likely that we'll be able to train agents to exceed human strategic capabilities, across the board, this decade."
[Bindu Reddy](https://x.com/bindureddy/status/1767608523859661120): "We are at least 3-5 years away from automating software engineering."
[AISafetyMemes](https://x.com/AISafetyMemes/status/1768478961972166932): "I repeat: in 1-5 years, if we're still alive, I expect the biggest protests humanity has ever seen"
[Jonathan Ross](https://x.com/JonathanRoss321/status/1795722941990314240?t=kdmG69tvv_Wl0J0i-9c6UQ&s=19): "Prediction: AI will displace social drinking within 5 years. Just as alcohol is a social disinhibitor, like the Steve Martin movie Roxanne, people will use AI powered earbuds to help them socialize. At first we'll view it as creepy, but it will quickly become superior to alcohol"
2030
----
[Demis Hassabis](https://www.dwarkeshpatel.com/p/demis-hassabis): "I will say that when we started DeepMind back in 2010, we thought of it as a 20-year project. And I think we're on track actually, which is kind of amazing for 20-year projects because usually they're always 20 years away. That's the joke about whatever, quantum, AI, take your pick. But I think we're on track. So I wouldn't be surprised if we had AGI-like systems within the next decade."
[Christopher Manning](https://x.com/chrmanning/status/1768291975005196326?t=nLomK17OpGNMyve8ebF3mQ&s=19): "I do not believe human-level AI (artificial superintelligence, or the commonest sense of #AGI) is close at hand. AI has made breakthroughs, but the claim of AGI by 2030 is as laughable as claims of AGI by 1980 are in retrospect. Look how similar the rhetoric was in @LIFE in 1970!"
[Dr\_Singularity](https://x.com/Dr_Singularity/status/1792231239992426982): "For the record, I'm currently at ~96% that ASI will be here by 2030. I've stopped saving for retirement and have increased my spending. Long term planning is pointless in a world when ASI (even AGI alone) is on the horizon."
[Greg Colbourn](https://x.com/gcolbourn/status/1719632086615605546): "High chance AI will lead to human extinction before 2030 unless we act now"
2032
----
[Eric Schmidt](https://x.com/kimmonismus/status/1854464662626086940?t=0bxRNUNPajZ0LhIMmzXmYQ&s=19): "In the industry it is believed that somewhere around 5 years, no one knows exactly, the systems will begin to be able to write their own code, that is, they literally will take their code and make it better. And of course that's recursive... It's reasonable to expect that within 6-8 years from now... it will be possible to have a single system that is 80 or 90 percent of the ability of the expert in every field... ninety percent of the best physicist, ninety percent of the best chemist, ninety percent of the best artist."
[Roko Mijic](https://x.com/RokoMijic/status/1762087871047991325?t=2ynm2v91kRkkKYBvv454tg&s=19): "AI will completely replace human programmers by 2045... 2032 seems more realistic"
2034
----
[Mustafa Suleyman](https://x.com/FutureJurvetson/status/1782201734158524435): "AI is a new digital species...To avoid existential risk, we should avoid: 1) Autonomy 2) Recursive self-improvement 3) Self-replication. We have a good 5 to 10 years before we'll have to confront this."
[Joe Biden](https://x.com/tsarnick/status/1838721620808208884): "We will see more technological change, I argue, in the next 2-10 years, than we have in the last 50 years."
2039
----
[Ray Kurzweil](https://x.com/tsarnick/status/1806434872686178784?t=14SzwB5Onnq2oPiR8XlU-Q&s=19): "When we get to the 2030s, nanobots will connect our brains to the cloud, just the way your phone does. It'll expand intelligence a million-fold by 2045. That is the Singularity."
[Rob Bensinger](https://x.com/robbensinger/status/1798845199382429697): "I think [Leopold Aschenbrenner's] arguments for this have a lot of holes, but he gets the basic point that superintelligence looks 5 or 15 years off rather than 50+."
[acidshill](https://x.com/acidshill/status/1778336665347698923?t=gqgGTz-ZDLNSJ1LL1GBU-g&s=19): "damn... i'd probably be pretty concerned about the trajectory of politics and culture if i wasn't pretty confident that we're all going to d\*e in the next 15 years... but i am, so instead it's just funny"
[James Miller](https://x.com/JimDMiller/status/1781697375221715042): "I don't see how, absent the collapse of civilization, we don't get a von Neumann level or above AI within 15 years."
[Aella](https://x.com/Aella_Girl/status/1790618794181976397?t=n20gxMSGEgZoRXrfJwkJtQ&s=19): "for the record, im currently at ~70% that we're all dead in 10-15 years from AI. i've stopped saving for retirement, and have increased my spending and the amount of long-term health risks im taking"
2044
----
[Geoffrey Hinton](https://x.com/tsarnick/status/1849908298499359088): "Now, I think it's quite likely that sometime in the next 20 years, these things will get smarter than us."
[Yann LeCun](https://x.com/tsarnick/status/1804259837024538662): "We're nowhere near reaching human-level intelligence, let alone superintelligence. If we're lucky, within a decade or so, maybe two."
@@ -0,0 +1,68 @@
---
title: "Comment on 'Death and the Gorgon'"
date: 2025-01-01 06:00:24.498000+00:00
url: https://www.lesswrong.com/posts/hx5EkHFH5hGzngZDs/comment-on-death-and-the-gorgon
novelty: 0.7502529657282471
score: 0.5308915376663208
baseScore: 90
voteCount: 32
---
*(some plot spoilers)*
There's something distinctly uncomfortable about reading Greg Egan in the 2020s. Besides telling gripping tales with insightful commentary on the true nature of mind and existence, Egan stories written in the 1990s and set in the twenty-first century excelled at speculative worldbuilding, imagining what technological wonders might exist in the decades to come and how Society might adapt to them.
In contrast, "Death and the Gorgon", published in the January/February 2024 issue of *Asimov's*, feels like it's set [twenty minutes into the future](https://tvtropes.org/pmwiki/pmwiki.php/Main/TwentyMinutesIntoTheFuture). The technologies on display are an AI assistant for police officers (capable of performing research tasks and carrying on conversation) and real-time synthetic avatars (good enough to pass as a video call with a real person). When these kinds of products showed up in "'90s Egan"—I think of Worth's "pharm" custom drug dispenser in *Distress* (1995) or Maria's "mask" for screening spam calls in *Permutation City* (1994)—it was part of the background setting of a more technologically advanced world than our own.
Reading "Gorgon" in 2024, not only do the depicted capabilities seem less out of reach (our language model assistants and deepfakes aren't quite there yet, but don't seem too far off), but their literary function has changed: much of the moral of "Gorgon" seems to be to chide people in the real world who are overly impressed by ChatGPT. Reality and Greg Egan are starting to meet in the middle.
Our story features Beth, a standard-issue Greg Egan protagonist[[1]](#fn-nEfrZ7hExtjNoKLof-1) as a small-town Colorado sheriff investigating the suspicious destruction of a cryonics vault in an old mine: a naturally occurring cave-in seems unlikely, but it's not clear who would have the motive to thaw (murder?) a hundred frozen heads.
Graciously tolerating the antics of her deputy, who is obsessed with the department's trial version of (what is essentially) ChatGPT-for-law-enforcement, Beth proceeds to interview the next of kin, searching for a motive. She discovers that many of the cryopreserved heads were beneficiaries of a lottery for terminally ill patients in which the prize was free cryonic suspension. The lottery is run by OG—"Optimized Giving"—a charitable group concerned with risks affecting the future of humanity. As the investigation unfolds, Beth and a colleague at the FBI begin to suspect that the lottery is a front for a creative organized crime scheme: OG is recruiting terminal patients to act as assassins, carrying out hits in exchange for "winning" the lottery. (After which another mafia group destroyed the cryonics vault as retaliation.) Intrigue, action, and a cautionary moral ensue as our heroes make use of ChatGPT-for-law-enforcement to prove their theory and catch OG red-handed before more people get hurt.
---
So, cards on the table: this story spends a lot of wordcount satirizing a subculture that, unfortunately, I can't credibly claim not to be a part of. "Optimized Giving" is clearly a spoof on the longtermist wing of Effective Altruism—and if I'm not happy about how the "Effective Altruism" brand ate my beloved rationalism over the 2010s, I don't think anyone would deny the contiguous memetic legacy involving many of the same people. ([Human subcultures are nested fractally](https://xkcd.com/1095/); for the purposes of reviewing the story, it would benefit no one for me to insist that Egan isn't talking about me and my people, even if, from *within* the subculture, it looks like the OpenPhil people and the MIRI people and the Vassarites and ... *&c.* are all totally different and in fact hate each other's guts.)
I don't want to be defensive, because I'm *not* loyal to the subculture, its leaders, or its institutions. In the story, Beth talks to a professor—think [Émile Torres](https://en.wikipedia.org/wiki/%C3%89mile_P._Torres#Transhumanism,_longtermism,_and_effective_altruism) as a standard-issue Greg Egan character—who studies "apostates" from OG who are angry about "the hubris, the deception, and the waste of money." That resonated with me a lot: I have a long [dumb](http://unremediatedgender.space/2023/Jul/blanchards-dangerous-idea-and-the-plight-of-the-lucid-crossdreamer/) [story](http://unremediatedgender.space/2023/Jul/a-hill-of-validity-in-defense-of-meaning/) [to tell](http://unremediatedgender.space/2023/Dec/if-clarity-seems-like-death-to-them/) [about hubris and deception](http://unremediatedgender.space/2024/Mar/agreeing-with-stalin-in-ways-that-exhibit-generally-rationalist-principles/), and the corrupting forces of money are probably a big part of the explanation for [the rise and predictable perversion of Effective Altruism](http://benjaminrosshoffman.com/effective-altruism-is-self-recommending/).
So if my commentary on Egan's satire contains some criticism, it's absolutely *not* because I think my ingroup is beyond reproach and doesn't deserve to be satirized. They (we) absolutely do. (I took joy in including a similar caricature in [one of my own stories](http://unremediatedgender.space/2023/Oct/fake-deeply/).) But if Egan's satire doesn't quite hit the mark of explaining exactly why the group is bad, it's not an act of partisan loyalty for me to contribute my nuanced explanation of what I think it gets right and what it gets wrong. I'm not carrying water for the movement;[[2]](#fn-nEfrZ7hExtjNoKLof-2) it's just a topic that I happen to have a lot of information about.
Without calling it a fair portrayal, the OG of "Gorgon" isn't a strawman conjured out of thin air; the correspondences to its real-world analogue are clear. When our heroine suspiciously observes that these *soi-disant* world-savers don't seem to be spending anything on climate change and the Émile Torres-analogue tells her that OG don't regard it as an existential threat, [this is also true of real-world EA](https://forum.effectivealtruism.org/posts/eJPjSZKyT4tcSGfFk/climate-change-is-in-general-not-an-existential-risk). When the Torres-analogue says that "OG view any delay in spreading humanity at as close to light-speed as possible as the equivalent of murdering all the people who won't have a chance to exist in the future," the argument isn't a fictional parody; it's a somewhat uncharitably phrased summary of Nick Bostrom's ["Astronomical Waste: The Opportunity Cost of Delayed Technological Development"](https://nickbostrom.com/papers/astronomical-waste/). When the narrator describes some web forums as "interspers[ing] all their actual debunking of logical fallacies with much more tendentious claims, wrapped in cloaks of faux-objectivity" and being "especially prone to an abuse of probabilistic methods, where they pretended they could quantify both the likelihood and the potential harm for various implausible scenarios, and then treated the results of their calculations—built on numbers they'd plucked out of the air—as an unimpeachable basis for action", one could quibble with the disparaging description of subjective probability, but you can tell which website is being alluded to.
The cryonics-as-murder-payment lottery fraud is fictional, of course, but I'm inclined to read it as artistically-licensed commentary on a strain of ends-justify-the-means thinking that does exist within EA. EA organizations don't take money from the mob for facilitating contract killings, but they *did* take money from [the largest financial fraud in history](https://en.wikipedia.org/wiki/FTX), [which was explicitly founded as a means to make money for EA](https://thezvi.wordpress.com/2023/10/24/book-review-going-infinite/). (One could point out that the charitable beneficiaries of Sam Bankman-Fried's largesse didn't know that FTX wasn't an honest business, but we have to assume that the same is true of OG in the story: only a few insiders would be running the contract murder operation, not the rank-and-file believers.)
While the depiction of OG in the story clearly shows familiarity with the source material, the satire feels somewhat lacking *qua* anti-EA advocacy insofar as it relies too much on mere dismissal rather than presenting clear counterarguments.[[3]](#fn-nEfrZ7hExtjNoKLof-3) The effect of OG-related web forums on a vulnerable young person is described thus:
> Super-intelligent AIs conquering the world; the whole Universe turning out to be a simulation; humanity annihilated by aliens because we failed to colonize the galaxy in time. Even if it was all just stale clichés from fifty-year-old science fiction, a bright teenager like Anna could have found some entertainment value analyzing the possibilities rigorously and puncturing the forums' credulous consensus. But while she'd started out healthily skeptical, some combination of in-forum peer pressure, the phony gravitas of trillions of future deaths averted, and the corrosive effect of an endless barrage of inane slogans pimped up as profound insights—all taking the form "X is the mind-killer," where X was pretty much anything that might challenge the delusions of the cult—seemed to have worn down her resistance in the end.
I absolutely agree that healthy skepticism is critical when evaluating ideas and that in-forum peer pressure and the gravitas of a cause (for any given set of peers and any given cause) are troubling sources of potential bias—and that just because a group pays lip service to the value of healthy skepticism and the dangers of peer pressure and gravitas, doesn't mean the group's culture isn't still falling prey to the usual dysfunctions of groupthink. (As the inane slogan goes, ["Every cause wants to be a cult."](https://www.lesswrong.com/posts/yEjaj7PWacno5EvWa/every-cause-wants-to-be-a-cult))
That being said, however, ideas ultimately need to be judged on their merits, and the narration in this passage[[4]](#fn-nEfrZ7hExtjNoKLof-4) isn't giving the reader any counterarguments to the ideas being alluded to. (As Egan would know, science fiction authors having written about an idea does not make the idea false.) The clause about the whole Universe turning out to be a simulation is probably a reference to Bostrom's [simulation argument](https://simulation-argument.com/simulation/), which is a disjunctive, conditional claim: given some assumptions in the philosophy of mind and the theory of anthropic reasoning, then *if* future civilization could run simulations of its ancestors, then *either* they won't want to, *or* we're probably in one of the simulations (because there are more simulated than "real" histories). The clause about humanity being annihilated by failing to colonize the galaxy in time is probably a reference to Robin Hanson *et al.*'s [grabby aliens thesis](https://grabbyaliens.com/), that the Fermi paradox can be explained by a selection effect: there's a relatively narrow range of parameters in which we would see signs of an expanding alien civilization in our skies without already having been engulfed by them.
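To make the disjunctive structure of the simulation argument concrete, here is a minimal sketch of the arithmetic as I understand it from Bostrom's paper: the fraction of human-type histories that are simulated is roughly `f_P * N / (f_P * N + 1)`, where `f_P` is the fraction of civilizations that reach a simulation-capable stage and `N` is the average number of ancestor-simulations such a civilization runs. The function name and the input numbers below are my own illustrative assumptions, not anything from the paper or the story.

```python
def simulated_fraction(f_posthuman: float, sims_per_civ: float) -> float:
    """Fraction of human-type histories that are simulated rather than 'real'.

    f_posthuman:  fraction of civilizations that become able to run
                  ancestor-simulations (assumed input).
    sims_per_civ: average number of ancestor-simulations each such
                  civilization chooses to run (assumed input).
    """
    # Expected number of simulated histories per one 'real' history.
    simulated_per_real = f_posthuman * sims_per_civ
    return simulated_per_real / (simulated_per_real + 1.0)


# Made-up scenarios, one per branch of the disjunction:
scenarios = {
    "almost nobody gets there": (0.0, 1e6),
    "they could, but won't want to": (0.5, 0.0),
    "they can and do": (0.01, 1e6),
}
for label, (f_p, n) in scenarios.items():
    print(f"{label}: {simulated_fraction(f_p, n):.4f}")
```

The fraction is near zero only if one of the first two factors is near zero; otherwise it is close to one. That is the whole shape of the claim, and a rebuttal has to attack either the assumptions behind the formula or the inputs.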
No doubt many important criticisms could be made of Bostrom's or Hanson's work, perhaps by a bright teenager finding entertainment value in analyzing the possibilities rigorously. But there's an important difference between having such a criticism[[5]](#fn-nEfrZ7hExtjNoKLof-5) and merely asserting that it could exist. Speaking only to my own understanding, Hanson's and Bostrom's arguments both look reasonable to me? It's certainly possible I've just been hoodwinked by the cult, but if so, the narrator of "Gorgon"'s snarky description isn't helping me snap out of it.
It's worth noting that despite the notability of Hanson's and Bostrom's work, in practice, I don't see anyone in the subculture particularly worrying about losing out on galaxies due to competition with aliens—admittedly, because we're worried about "super-intelligent AIs conquering the world" first.[[6]](#fn-nEfrZ7hExtjNoKLof-6) About which, "Gorgon" ends on a line from Beth about "the epic struggle to make computers competent enough to help bring down the fools who believe that they're going to be omnipotent."
This is an odd take from the author[[7]](#fn-nEfrZ7hExtjNoKLof-7) of [multiple](https://gregegan.net/DIASPORA/DIASPORA.html) [novels](https://www.gregegan.net/SCHILD/SCHILD.html) in which software minds engage in astronomical-scale engineering projects. Accepting the premise that institutional longtermist EA deserves condemnation for being goofy and a fraud: in condemning them, why single out as the characteristic belief of this despicable group, the idea that future AI could be really powerful?[[8]](#fn-nEfrZ7hExtjNoKLof-8) Isn't that at least credible? Even if you think people in the cult or who work at AI companies are liars or dupes, it's harder to say that about eminent academics like Stuart Russell, Geoffrey Hinton, Yoshua Bengio, David Chalmers, and Daniel Dennett, who signed [a statement affirming that "[m]itigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war."](https://www.safe.ai/work/statement-on-ai-risk)[[9]](#fn-nEfrZ7hExtjNoKLof-9)
Egan's own work sometimes features artificial minds with goals at odds with their creator, as in ["Steve Fever"](https://www.technologyreview.com/2007/10/15/223446/steve-fever/) (2007) or ["Crystal Nights"](https://gregegan.net/MISC/CRYSTAL/Crystal.html) (2008), and with substantial advantages over biological creatures: in *Diaspora* (1997), the polis citizens running at 800 times human speed were peace-loving, but surely could have glassed the fleshers in a war if they wanted to. If you believe that AI could be at odds with its creators and hold a competitive advantage, scenarios along the lines of "super-intelligent AIs conquering the world" should seem plausible rather than far-fetched—a natural phenomenon straightforwardly analogous to human empires conquering other countries, or humans dominating other animals.
Given so many shared premises, it's puzzling to me why Egan seems to bear so much antipathy towards "us",[[10]](#fn-nEfrZ7hExtjNoKLof-10) rather than regarding the subculture more coolly, as a loose amalgamation of people interested in many of the same topics as him, but having come to somewhat different beliefs. (Egan doesn't seem to think human-level AI is at all close, nor that AI could be qualitatively superhumanly intelligent; an aside in *Schild's Ladder* (2002) alludes to a fictional result that there's nothing "above" general intelligence of the type humans have, *modulo* speed and memory.) He seems to expect the feeling to be mutual: when someone remarked on Twitter about finding it funny that the *Less Wrong* crowd likes his books, Egan [replied](https://twitter.com/gregeganSF/status/1727940487255138404), "Oh, I think they've noticed, but some of them still like the, err, 'early, funny ones' that predate the cult and hence devote no time to mocking it."
Well, I can't speak for anyone else, but personally, *I* like Egan's later work, including "Death and the Gorgon."[[11]](#fn-nEfrZ7hExtjNoKLof-11) Why wouldn't I? I am not so petty as to let my appreciation of well-written fiction be dulled by the incidental fact that I happen to disagree with some of the author's views on artificial intelligence and a social group that I can't credibly claim not to be a part of. That kind of dogmatism would be contrary to the ethos of humanism and clear thinking that I learned from reading Greg Egan and *Less Wrong*—an ethos that doesn't endorse blind loyalty to every author or group you learned something from, but a discerning loyalty to whatever was *good* in what the author or group saw in our shared universe. I don't know what the future holds in store for humanity. But whatever risks and opportunities nature may present, I think our odds are better for every thinking individual who tries to read widely and see more.[[12]](#fn-nEfrZ7hExtjNoKLof-12)
---
1. Some people say that Greg Egan is bad at characterization. I think he just specializes in portraying *reasonable* people, who don't have grotesque personality flaws to be the subject of "characterization." [↩︎](#fnref-nEfrZ7hExtjNoKLof-1)
2. I do feel bad about the fraction of my recent writing output that consists of criticizing the movement—not because it's disloyal, but because it's *boring*. I keep telling myself that one of these years I'm going to have healed enough trauma to forget about these losers already and just read ArXiv papers. Until then, you get posts like this one. [↩︎](#fnref-nEfrZ7hExtjNoKLof-2)
3. On the other hand, one could argue that satire just isn't the right medium for presenting counterarguments, which would take up a lot of wordcount without advancing the story. Not every written work can accomplish all goals! Maybe it's fine for this story to make fun of the grandiose and cultish elements within longtermist EA (and there are a lot of them), with a critical evaluation of the ideas being left to other work. But insofar as the goal of "Gorgon" is to persuade readers that the ideas aren't even worthy of consideration, I think that's a mistake. [↩︎](#fnref-nEfrZ7hExtjNoKLof-3)
4. In critically examining this passage, I don't want to suggest that "Gorgon"'s engagement with longtermist ideas is all snark and no substance. Earlier in the story, Beth compares OG believers "imagin[ing] that they're in control of how much happiness there'll be in the next trillion years" to a child's fantasy of violating relativity by twirling a rope millions of miles long. That's substantive: even if the future of humanity is very large, the claim that a nonprofit organization today is in a position to meaningfully affect it is surprising and should not be accepted uncritically on the basis of [evocative storytelling about the astronomical stakes](https://www.lesswrong.com/posts/pGvyqAQw6yqTjpKf4/the-gift-we-give-to-tomorrow). [↩︎](#fnref-nEfrZ7hExtjNoKLof-4)
5. Which I think would get upvoted on this website if it were well done—certainly if it were written with the insight and rigor characteristic of a standard-issue Greg Egan protagonist. [↩︎](#fnref-nEfrZ7hExtjNoKLof-5)
6. Bostrom's "Astronomical Waste" concludes that "The Chief Goal for Utilitarians Should Be to Reduce Existential Risk": making sure colonization happens at all (by humanity or worthy [rather than unworthy](https://www.lesswrong.com/tag/squiggle-maximizer-formerly-paperclip-maximizer) successors) is more important than making it happen faster. [↩︎](#fnref-nEfrZ7hExtjNoKLof-6)
7. In context, it seems reasonable to infer that Beth's statement is author-endorsed, even if fictional characters do not in general represent the author's views. [↩︎](#fnref-nEfrZ7hExtjNoKLof-7)
8. I'm construing "omnipotent" as rhetorical hyperbole; influential subcultural figures [clarifying that no one thinks superintelligence will be able to break the laws of physics](https://x.com/ESYudkowsky/status/1658616828741160960) seems unlikely to be exculpatory in Egan's eyes. [↩︎](#fnref-nEfrZ7hExtjNoKLof-8)
9. Okay, the drafting and circulation of the statement by Dan Hendrycks's [Center for AI Safety](https://www.safe.ai/) was arguably cult activity. (While Hendrycks has a PhD from UC Berkeley and [co-pioneered the usage of a popular neural network activation function](https://arxiv.org/abs/1606.08415), he [admits that his career focus on AI safety was influenced by](https://archive.ph/20230708182452/https://www.bostonglobe.com/2023/07/06/opinion/ai-safety-human-extinction-dan-hendrycks-cais/#selection-1909.0-1913.10) the EA advice-counseling organization [80,000 hours](https://80000hours.org/).) But Russell, Hinton, *et al*. did sign. [↩︎](#fnref-nEfrZ7hExtjNoKLof-9)
10. This isn't the first time Egan has satirized the memetic lineage that became longtermist EA; *Zendegi* (2010) [features negative portrayals of](https://www.overcomingbias.com/p/egans-zendegihtml) a character who blogs at *[overpoweringfalsehood.com](http://overpoweringfalsehood.com)* (a reference to [*Overcoming Bias*](https://www.overcomingbias.com/)) and a Benign Superintelligence Bootstrap Project (a reference to what was then the Singularity Institute for Artificial Intelligence). [↩︎](#fnref-nEfrZ7hExtjNoKLof-10)
11. Okay, I should confess that I do treasure early Egan (*Quarantine* (1992)/*Permutation City* (1994)/*Distress* (1995)) more than later Egan, but not because they devote no time to mocking the cult. It's because I'm not smart enough to properly appreciate all the alternate physics in, *e.g.*, *Schild's Ladder* (2002) or the *Orthogonal* trilogy (2011–2013). [↩︎](#fnref-nEfrZ7hExtjNoKLof-11)
12. Though we're [unlikely to get it](https://twitter.com/robinhanson/status/1365662127504187396), I've sometimes wished for a Greg Egan–Robin Hanson collaboration; I think Egan's masterful understanding of the physical world and Hanson's unsentimental analysis of the social world would complement each other well. [↩︎](#fnref-nEfrZ7hExtjNoKLof-12)
@@ -0,0 +1,66 @@
---
title: "debating buying NVDA in 2019"
date: 2025-01-04 18:07:35.832000+00:00
url: https://www.lesswrong.com/posts/QFSZmMdtPwoAHdSoz/debating-buying-nvda-in-2019
novelty: 0.5269753048885288
score: 0.7926478385925293
baseScore: 21
voteCount: 12
---
Alice: You saw GPT-2, right?
Bob: Of course.
Alice: It's running on GPUs using CUDA. OpenAI will keep scaling that up, and other groups will want to do the same thing.
Bob: Right.
Alice: So, does this mean we should buy Nvidia stock?
Bob: I'm not sure. Nvidia makes the hardware used now, but why should we expect it to be the hardware used in the future? There's clearly room for improvement: current GPUs aren't optimized for lower-precision numbers or sparsity. Designs will change, which means a competitor might do better. At the scales we're talking about, it makes some sense to design your own ASICs. Google already has TPUs, and Amazon & Facebook will probably do something similar.
Alice: OK, but researchers are all people who started out doing stuff on their personal GPU using CUDA.
Bob: And you don't think AMD or somebody will be able to make other GPUs compatible with CUDA, now that it's a priority?
Alice: Eventually, maybe, but I think you're massively underestimating the difficulty of that.
Bob: Again, Google is already using TPUs, so clearly they have software for that. Look, neural networks are mostly big matrix multiplications; chips for them should be easier to design than chips for graphics, and Nvidia has strong competition for GPUs.
Alice: You want HBM for NN ASICs, and TSMC is the only company doing that well. Nvidia reserved a lot of their capacity.
Bob: Apple did too. More importantly, if TSMC capacity is the limiting factor, then profits should go to them. There are contracts for now, but GPT-2 is still a ways off from being useful and it wouldn't make sense to make really big purchases until the next generation of ASICs comes out, at which point those contracts could be renegotiated and TSMC could raise their prices, up to the point where Samsung is almost as good an option.
Alice: Hmm, maybe. I still think you're underestimating the software moat Nvidia has. Long-term, maybe there's a real competitor, but every big company is going to want to train their own language model ASAP.
Bob: Why? I could understand fine-tuning, but why would they need to do that? I imagine there will be a few big groups making their own models, but then everybody else could just license the best ones. The competition should be even, meaning low net profits. There might even be competitive open-source models from somebody.
Alice: No, big companies will want to make their own. You're not considering the incentives of people at those companies. If they think AI will be big, they'll want job experience "making AI". And CEOs will be afraid that markets will punish them for not having their own AI program, because investors will think experience with AI could be important in the future.
Bob: What?! OpenAI has only been around for a few years! Corporate "experience" with AI won't matter; just hire decent people and read the latest papers.
Alice: Maybe so, but that's not how a lot of investors think.
Bob: Is that the basis of our investment plan, then? CEOs do something dumb to please dumb investors?
Alice: You already said there's room for improvement with NN ASICs, right?
Bob: Of course, you can [redacted]. But obviously a complete design is too large a project for just me.
Alice: Well then, it seems you think there's room for them to continue improvements and stay ahead of other designers. Nvidia was leading for hardware acceleration of ray tracing, and they'll have a big budget, so it seems like they'll be leading for NN ASIC design too, at least for a while.
Bob: I'm not convinced that such competence carries over to other designs. For all you know Apple or Amazon will do better than them. Or maybe Huawei, or Will Semiconductor.
Alice: Even if that's true, you're looking too far ahead. Stock prices are based on profit in the last few quarters and the stories in media. There's a whole pipeline for this stuff, and it takes years. Also, Nvidia can afford to steal all the best GPU software people from AMD.
Bob: Again, there are already TPUs.
Alice: Fun fact, Nvidia is actually doing better in terms of NN performance per mm^2 than TPUs despite their processors being less special-purpose.
Bob: OK, but presumably Google could still sell those if there's so much demand.
Alice: Maybe Google just won't be able to make TPUs fast enough to sell them and do their own stuff. And maybe there just won't be much other strong competition in the relevant timeframe.
Bob: Do all the best people want to work in Nvidia's giant open offices, then? It's not like they have a monopoly on talent; they certainly wouldn't hire me or [redacted].
Alice: Sure, but neither would the ASIC companies getting VC funding. If any ASIC startup actually becomes a threat, Nvidia can buy them out too. The Chinese talent pool is also somewhat separate, but then the Chinese companies have their geopolitical and management issues.
@@ -1,5 +1,5 @@
---
title: Human study on AI spear phishing campaigns
title: "Human study on AI spear phishing campaigns"
date: 2025-01-03 19:03:28.406000+00:00
url: https://www.lesswrong.com/posts/GCHyDKfPXa5qsG2cP/human-study-on-ai-spear-phishing-campaigns
novelty: 0.6774583882987633
@@ -0,0 +1,137 @@
---
title: "My AGI safety research—2024 review, '25 plans"
date: 2025-01-01 22:01:48.820000+00:00
url: https://www.lesswrong.com/posts/2wHaCimHehsF36av3/my-agi-safety-research-2024-review-25-plans
novelty: 0.7569247703430972
score: 0.49594494700431824
baseScore: 94
voteCount: 25
---
*Previous:* [*My AGI safety research—2022 review, '23 plans*](https://www.lesswrong.com/posts/qusBXzCpxijTudvBB/my-agi-safety-research-2022-review-23-plans)*. (I guess I skipped it last year.)*
*“Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter.” –* [*attributed to DL Moody*](https://www.goodreads.com/quotes/390887-our-greatest-fear-should-not-be-of-failure-but-of)
Tl;dr
=====
* Section 1 goes through my main research project, “reverse-engineering human social instincts”: what does that even mean, what's the path-to-impact, what progress did I make in 2024 (spoiler: lots!!), and how can I keep pushing it forward in the future?
* Section 2 is what I'm expecting to work on in 2025: most likely, I'll start the year with some bigger-picture thinking about Safe & Beneficial AGI, then eventually get back to reverse-engineering human social instincts after that. Plus, a smattering of pedagogy, outreach, etc.
* Section 3 is a sorted list of all my blog posts from 2024
* Section 4 is acknowledgements
1. Main research project: reverse-engineering human social instincts
====================================================================
1.1 Background: What's the problem and why should we care?
----------------------------------------------------------
*(copied almost word-for-word from*[*Neuroscience of human social instincts: a sketch*](https://www.lesswrong.com/posts/kYvbHCDeMTCTE9TAj/neuroscience-of-human-social-instincts-a-sketch)*)*
My primary neuroscience research goal for the past couple years has been to solve a certain problem, a problem which has had me stumped since the very beginning of when I became interested in neuroscience at all ([as a lens into Artificial General Intelligence safety](https://www.lesswrong.com/s/HzcM2dkCq7fwXBej8)) back in 2019.
What is this grand problem? As described in [Intro to Brain-Like-AGI Safety](https://www.lesswrong.com/s/HzcM2dkCq7fwXBej8), I believe the following:
1. We can divide the brain into a [“Learning Subsystem”](https://www.lesswrong.com/posts/wBHSYwqssBGCnwvHg/intro-to-brain-like-agi-safety-2-learning-from-scratch-in) (cortex, striatum, amygdala, cerebellum, and a few other areas) that houses a bunch of [randomly-initialized](https://www.lesswrong.com/posts/wBHSYwqssBGCnwvHg/intro-to-brain-like-agi-safety-2-learning-from-scratch-in) within-lifetime learning algorithms, and a [“Steering Subsystem”](https://www.lesswrong.com/posts/hE56gYi5d68uux9oM/intro-to-brain-like-agi-safety-3-two-subsystems-learning-and) (hypothalamus, brainstem, and a few other areas) that houses a bunch of specific, genetically-specified [“business logic”](https://www.lesswrong.com/posts/hE56gYi5d68uux9oM/intro-to-brain-like-agi-safety-3-two-subsystems-learning-and#:~:text=nice%20term%C2%A0%E2%80%9C-,business%20logic,-%E2%80%9D%2C%20for%20code). A major role of the Steering Subsystem is as the home for the brain's **“innate drives”, a.k.a. “primary rewards”**, roughly equivalent to the reward function in reinforcement learning—things like eating-when-hungry being good (other things equal), pain being bad, and so on.
2. Some of those “innate drives” are related to **human social instincts**—a suite of reactions and drives that are upstream of things like compassion, friendship, love, spite, sense of fairness and justice, etc.
3. **The grand problem is: how do those human social instincts work?** Ideally, an answer to this problem would look like legible pseudocode that's simultaneously compatible with behavioral observations (including everyday experience), with evolutionary considerations, and with a neuroscience-based story of how that pseudocode is actually implemented by neurons in the brain.[[1]](#fn72ioyjkt4p2)
4. Explaining how human social instincts work is tricky mainly because of the **“symbol grounding problem”**. In brief, everything we know—all the interlinked concepts that constitute our understanding of the world and ourselves—is created [“from scratch”](https://www.lesswrong.com/posts/wBHSYwqssBGCnwvHg/intro-to-brain-like-agi-safety-2-learning-from-scratch-in) in the cortex by a learning algorithm, and thus winds up in the form of a zillion unlabeled data entries like “pattern 387294 implies pattern 579823 with confidence 0.184”, or whatever.[[2]](#fnwdgiptlvzsk) Yet certain activation states of these unlabeled entries—e.g., the activation state that encodes the fact that Jun just told me that Xiu thinks I'm cute—need to somehow trigger social instincts in the Steering Subsystem. So there must be some way that the brain can “ground” these unlabeled learned concepts. (See my earlier post [Symbol Grounding and Human Social Instincts](https://www.lesswrong.com/posts/5F5Tz3u6kJbTNMqsb/intro-to-brain-like-agi-safety-13-symbol-grounding-and-human).)
5. A solution to this grand problem seems **useful for** [**Artificial General Intelligence**](https://www.lesswrong.com/posts/uxzDLD4WsiyrBjnPw/artificial-general-intelligence-an-extremely-brief-faq) **(AGI) safety**, since (for better or worse) someone someday might invent AGI that works by similar algorithms as the brain, and we'll want to make those AGIs intrinsically care about people's welfare. It would be a good jumping-off point to understand how *humans* wind up intrinsically caring about other people's welfare sometimes. (Slightly longer version in [§2.2 here](https://www.lesswrong.com/posts/qusBXzCpxijTudvBB/my-agi-safety-research-2022-in-review-and-plans#2_2_Why_do_I_think_success_on_this_project_would_be_helpful_for_AGI_safety_); *much* longer version in [this post](https://www.lesswrong.com/posts/Sd4QvG4ZyjynZuHGt/intro-to-brain-like-agi-safety-12-two-paths-forward).)
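As a toy way to picture items 1 and 4 together, here is a minimal sketch. Every name in it is hypothetical and made up for illustration; it is not pseudocode from the linked posts and not a claim about real neuroscience. The Steering Subsystem is a hand-written reward function over a few hard-wired signals, while the Learning Subsystem's knowledge is a pile of unlabeled pattern IDs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LearnedAssociation:
    """One unlabeled entry in the learned world-model (Learning Subsystem)."""
    if_pattern: int
    then_pattern: int
    confidence: float


@dataclass(frozen=True)
class BodySignals:
    """Signals that are hard-wired, so the Steering Subsystem can read them directly."""
    hungry: bool
    eating: bool
    pain: float  # 0.0 means no pain


def primary_reward(signals: BodySignals) -> float:
    """Toy 'innate drives': eating-when-hungry is good, pain is bad."""
    reward = 0.0
    if signals.hungry and signals.eating:
        reward += 1.0
    reward -= signals.pain
    return reward


# The learned side: IDs carry no built-in meaning, so a genetically specified
# function like primary_reward cannot simply look up "the pattern that means
# someone likes me". That gap is the symbol grounding problem.
world_model: List[LearnedAssociation] = [
    LearnedAssociation(if_pattern=387294, then_pattern=579823, confidence=0.184),
]

print(primary_reward(BodySignals(hungry=True, eating=True, pain=0.0)))
```

Note that `primary_reward` has no argument through which `world_model` could enter by meaning; any social drive would need some extra grounding mechanism, which is what the research project is trying to pin down.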
1.2 More on the path-to-impact
------------------------------
* **I'm generally working under the assumption that future transformative** [**AGI**](https://www.lesswrong.com/posts/uxzDLD4WsiyrBjnPw/artificial-general-intelligence-an-extremely-brief-faq) **will work generally how I think the brain works (a not-yet-invented variation on Model-Based Reinforcement Learning, see** [**§1.2 here**](https://www.lesswrong.com/posts/As7bjEAbNpidKx6LR/valence-series-1-introduction#1_2_Model_based_reinforcement_learning__RL_)**).** I think this is a rather different algorithm from today's foundation models, and I think those differences are safety-relevant (see [§4.2 here](https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4_2_No___brain_like_AGI__is_not_trained_similarly_to_LLMs)). You might be wondering: **why work on that, rather than foundation models?**
+ My diplomatic answer is: we don't have AGI yet ([by my definition](https://www.lesswrong.com/posts/uxzDLD4WsiyrBjnPw/artificial-general-intelligence-an-extremely-brief-faq)), and thus we don't know for sure what algorithmic form it will take. So we should be hedging our bets, by different AGI safety people contingency-planning for different possible AGI algorithm classes. And the model-based RL scenario seems *even more* under-resourced right now than the foundation model scenario, by far.
+ My un-diplomatic answer is: Hard to be certain, but I'm guessing that the researchers pursuing broadly-brain-like paths to AGI are the ones who will probably succeed, and everyone else will probably fail to get all the way to AGI, and/or they'll gradually pivot / converge towards brain-like approaches, for better or worse. In other words, my guess is that 2024-style foundation model training paradigms will plateau before they hit TAI-level. Granted, they haven't plateaued yet. But any day now, right? See [AI doom from an LLM-plateau-ist perspective](https://www.lesswrong.com/posts/KJRBb43nDxk6mwLcR/ai-doom-from-an-llm-plateau-ist-perspective) and [§2 here](https://www.lesswrong.com/posts/hsf7tQgjTZfHjiExn/my-take-on-jacob-cannell-s-take-on-agi-safety#2__Will_AGI_algorithms_look_like_brain_algorithms_).
* **How might my ideas make their way from blog posts into future AGI source code?** Well, again, there's a scenario (threat model) for which I'm contingency-planning, and it involves future researchers who are inventing brain-like model-based RL, for better or worse. Those researchers will find that they have a slot in their source code repository labeled “reward function”, and they won't know what to put in that slot to get good outcomes, as they get towards human-level capabilities and beyond. During *earlier* development, with rudimentary AI capabilities, I expect that the researchers will have been doing what model-based RL researchers are doing today, and indeed what they have *always* done since the invention of RL: messing around with obvious reward functions, and trying to get results that are somehow impressive. And if the AI engages in [specification gaming](https://deepmind.google/discover/blog/specification-gaming-the-flip-side-of-ai-ingenuity/) or other undesired behavior, then they turn it off, try to fix the problem, and try again. But, [as AGI safety people know well](https://open.substack.com/pub/dileeplearning/p/amelia-bedelia-and-agi-safety-part?r=4jbwd&utm_campaign=comment-list-share-cta&utm_medium=web&comments=true&commentId=77676334), that particular debugging loop will eventually stop working, and instead start failing in a catastrophically dangerous way. Assuming the developers notice that problem before it's too late, they might look to the literature for a reward function (and associated training environment etc.) that will work in this new capabilities regime. Hopefully, when they go looking, they will find a literature that will *actually exist*, and be full of clear explanations and viable ideas. So that's what I'm working on. I think it's a very important piece of the puzzle, even if many other unrelated things can *also* go wrong on the road to (hopefully) Safe and Beneficial AGI.
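To illustrate the “reward function slot” and the early-stage patch-and-retry loop described above, here is a deliberately tiny sketch. The room-cleaning world, the action names, the effort costs, and both reward functions are made-up assumptions for illustration; nothing here comes from a real RL codebase.

```python
from typing import Callable, Dict

Outcome = Dict[str, float]
RewardFn = Callable[[Outcome], float]

# Hypothetical toy world: each action leads to one fixed outcome and costs some effort.
OUTCOMES: Dict[str, Outcome] = {
    "clean_room": {"mess_visible": 0.0, "mess_actual": 0.0},
    "hide_mess": {"mess_visible": 0.0, "mess_actual": 1.0},
    "do_nothing": {"mess_visible": 1.0, "mess_actual": 1.0},
}
EFFORT: Dict[str, float] = {"clean_room": 0.5, "hide_mess": 0.1, "do_nothing": 0.0}


def best_action(reward_fn: RewardFn) -> str:
    """Stand-in for a trained agent: pick the action maximizing reward minus effort."""
    return max(OUTCOMES, key=lambda action: reward_fn(OUTCOMES[action]) - EFFORT[action])


def reward_v1(outcome: Outcome) -> float:
    """First 'obvious' reward: penalize mess that an observer can see."""
    return -outcome["mess_visible"]


def reward_v2(outcome: Outcome) -> float:
    """Patched reward, written after the developers notice the gaming."""
    return -outcome["mess_actual"]


print(best_action(reward_v1))  # hide_mess  (specification gaming)
print(best_action(reward_v2))  # clean_room (after the patch)
```

This toy only covers the benign regime, where the developers can see the bad behavior, switch the agent off, and swap what sits in the slot. It does not model the later regime where that loop stops being safe, which is the regime the paragraph above is worried about.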
1.3 Progress towards reverse-engineering human social instincts
---------------------------------------------------------------
It was a banner year!
Basically, for years, I've had a vague idea about how human social instincts might work, involving what I call “transient empathetic simulations”. But I didn't know how to pin it down in more detail than that. One subproblem was: I didn't have even one example of a *specific* social instinct based on this putative mechanism—i.e., a hypothesis where a *specific* innate reaction would be triggered by a *specific* transient empathetic simulation in a *specific* context, such that the results would be consistent with everyday experience and evolutionary considerations. The other subproblem was: I just had lots of confusion about how these things might work in the brain, in detail.
I made progress on the first subproblem in late 2023, when I guessed that there's an innate *“drive to feel liked / admired”*, related to prestige-seeking, and I had a specific idea about how to operationalize that. It turned out that I was still held back by confusion about how social status works, and thus I spent some time in early 2024 sorting that out—see my three posts [Social status part 1/2: negotiations over object-level preferences](https://www.lesswrong.com/posts/SPBm67otKq5ET5CWP/social-status-part-1-negotiations-over-object-level), and [Social status part 2/2: everything else](https://www.lesswrong.com/posts/7RBbwqHoimj92MRnL/social-status-part-2-2-everything-else), and a rewritten [[Valence series] 4. Valence & Liking / Admiring](https://www.lesswrong.com/posts/LaeP39jJpfPyoiSZm/valence-series-4-valence-and-liking-admiring) (which replaced an older, flawed attempt at part 4 of the [Valence series](https://www.lesswrong.com/posts/As7bjEAbNpidKx6LR/valence-series-1-introduction)).
Now I had at least one target to aim for—an innate social drive that I felt I understood well enough to sink my teeth into. That was very helpful for thinking about how that drive might work neuroscientifically. But getting there was still a *hell* of a journey, and was the main thing I did the whole rest of the year. I chased down lots of leads, many of which were mostly dead ends, although I wound up figuring out lots of random stuff along the way, and in fact one of those threads turned into my 8-part [Intuitive Self-Models series](https://www.lesswrong.com/s/qhdHbCJ3PYesL9dde).
But anyway, I finally wound up with [**Neuroscience of human social instincts: a sketch**](https://www.lesswrong.com/posts/kYvbHCDeMTCTE9TAj/neuroscience-of-human-social-instincts-a-sketch), which posits a neuroscience-based story of how certain social instincts work, including not only the “drive to feel liked / admired” mentioned above, but also compassion and spite, which (I claim) are mechanistically related, to my surprise. Granted, many details remain hazy, but this still feels like great progress on the big picture. Hooray!
1.4 Whats next?
----------------
In terms of my moving this project forward, there’s lots of obvious work in making more and better hypotheses and testing them against existing literature. Again, see [Neuroscience of human social instincts: a sketch](https://www.lesswrong.com/posts/kYvbHCDeMTCTE9TAj/neuroscience-of-human-social-instincts-a-sketch), in which I point out plenty of lingering gaps and confusions. Now, it’s possible that I would hit a dead end at some point, because I have a question that is not answered in the existing neuroscience literature. In particular, the hypothalamus and brainstem have hundreds of tiny cell groups with idiosyncratic roles, and most of them remain unmeasured to date. (As an example, see [§5.2 of A Theory of Laughter](https://www.lesswrong.com/posts/7kdBqSFJnvJzYTfx9/a-theory-of-laughter#5_2_Where_exactly_in_the_brain_is_the__laugh_behavior_controller___top_box_in_that_diagram__where_I_can_read_out_the_alleged_pseudocode_of_Section_3_1_), the part where it says “If someone wanted to make progress on this question experimentally…”). But a number of academic groups are continuing to slowly chip away at that problem, and with a lot of luck, connectomics researchers will start mass-producing those kinds of measurements as soon as the next few years.
(Reminder that [Connectomics seems great from an AI x-risk perspective](https://www.lesswrong.com/posts/ybmDkJAj3rdrrauuu/connectomics-seems-great-from-an-ai-x-risk-perspective), and as mentioned in the last section of that link, you can get involved by applying for jobs, some of which are for non-bio roles like “ML engineer”, or by donating.)
2. My plans going forward
=========================
Actually, “reverse-engineering human social instincts” is on hold for the moment, as Im revisiting the big picture of safe and beneficial AGI, now that I have this new and hopefully-better big-picture understanding of human social instincts under my belt. In other words, knowing what I (think I) know now about how human social instincts work, at least in broad outline, well, what should a brain-like-AGI reward function look like? What about training environment? And test protocols? What are we hoping that AGI developers will do with their AGIs anyway?
Ive been so deep in neuroscience that I have a huge backlog of this kind of big-picture stuff that I havent yet processed.
After that, Ill *probably* wind up diving back into neuroscience in general, and reverse-engineering human social instincts in particular, but only after Ive thought hard about what *exactly* Im hoping to get out of it, in terms of AGI safety, on the current margins. That way, I can be focusing on the right questions.
Separate from all that, I plan to stay abreast of the broader AGI safety field, from fundamentals to foundation models, even if the latter is not really my core interest or comparative advantage. I also plan to continue engaging in AGI safety pedagogy and outreach when I can, including probably reworking some of my blog post ideas into a peer-reviewed paper for a neuroscience journal this spring.
If someone thinks that I should be spending my time differently in 2025, please reach out and make your case!
3. Sorted list of my blog posts from 2024
=========================================
***The** **“reverse-engineering human social instincts” project:***
* [Social status part 1/2: negotiations over object-level preferences](https://www.lesswrong.com/posts/SPBm67otKq5ET5CWP/social-status-part-1-negotiations-over-object-level) (March)
* [Social status part 2/2: everything else](https://www.lesswrong.com/posts/7RBbwqHoimj92MRnL/social-status-part-2-2-everything-else) (March)
* [Spatial attention as a “tell” for empathetic simulation?](https://www.lesswrong.com/posts/7Pt9fogptmiSduXt9/spatial-attention-as-a-tell-for-empathetic-simulation) (April)
* [[Valence series] 4. Valence & Liking / Admiring](https://www.lesswrong.com/posts/LaeP39jJpfPyoiSZm/valence-series-4-valence-and-liking-admiring) (June)
* [Against empathy-by-default](https://www.lesswrong.com/posts/TprdAhgTvr3tuDJsD/against-empathy-by-default) (Oct)
* [Neuroscience of human social instincts: a sketch](https://www.lesswrong.com/posts/kYvbHCDeMTCTE9TAj/neuroscience-of-human-social-instincts-a-sketch) (Nov)
***Other** **neuroscience posts, generally with a less immediately obvious connection to AGI safety:***
* [Woods new preprint on object permanence](https://www.lesswrong.com/posts/v9qj2LHLh2ALDGKyA/woods-new-preprint-on-object-permanence) (March)
* [(Appetitive, Consummatory) ≈ (RL, reflex)](https://www.lesswrong.com/posts/jZLk6DQJ2EwhSty4k/appetitive-consummatory-rl-reflex) (June)
* [Incentive Learning vs Dead Sea Salt Experiment](https://www.lesswrong.com/posts/YQ4rSTHpHeFcAmhvi/incentive-learning-vs-dead-sea-salt-experiment) (June)
* [[Intuitive self-models] 1. Preliminaries](https://www.lesswrong.com/posts/FtwMA5fenkHeomz52/intuitive-self-models-1-preliminaries) (Sept)
* [[Intuitive self-models] 2. Conscious Awareness](https://www.lesswrong.com/posts/73xBjgoHuiKvJ5WRk/intuitive-self-models-2-conscious-awareness) (Sept)
* [[Intuitive self-models] 3. The Homunculus](https://www.lesswrong.com/posts/7tNq4hiSWW9GdKjY8/intuitive-self-models-3-the-homunculus) (Oct)
* [[Intuitive self-models] 4. Trance](https://www.lesswrong.com/posts/QAjmr323LZGQBEvd5/intuitive-self-models-4-trance) (Oct)
* [[Intuitive self-models] 5. Dissociative Identity (Multiple Personality) Disorder](https://www.lesswrong.com/posts/6bW5uJ325JxHYqMFr/intuitive-self-models-5-dissociative-identity-multiple) (Oct)
* [[Intuitive self-models] 6. Awakening / Enlightenment / PNSE](https://www.lesswrong.com/posts/GvJe6WQ3jbynyhjxm/intuitive-self-models-6-awakening-enlightenment-pnse) (Oct)
* [[Intuitive self-models] 7. Hearing Voices, and Other Hallucinations](https://www.lesswrong.com/posts/k8uMmw45k3qp8LPNc/intuitive-self-models-7-hearing-voices-and-other) (Oct)
* [[Intuitive self-models] 8. Rooting Out Free Will Intuitions](https://www.lesswrong.com/posts/JLZnSnJptzmPtSRTc/intuitive-self-models-8-rooting-out-free-will-intuitions) (Nov)
***Everything** **else related to Safe & Beneficial AGI:***
* [Deceptive AI ≠ Deceptively-aligned AI](https://www.lesswrong.com/posts/a392MCzsGXAZP5KaS/deceptive-ai-deceptively-aligned-ai) (Jan)
* [Four visions of Transformative AI success](https://www.lesswrong.com/posts/3aicJ8w4N9YDKBJbi/four-visions-of-transformative-ai-success) (Jan)
* [“Artificial General Intelligence”: an extremely brief FAQ](https://www.lesswrong.com/posts/uxzDLD4WsiyrBjnPw/artificial-general-intelligence-an-extremely-brief-faq) (March)
* [Response to nostalgebraist: proudly waving my moral-antirealist battle flag](https://www.lesswrong.com/posts/8YhjpgQ2eLfnzQ7ec/response-to-nostalgebraist-proudly-waving-my-moral) (May)
* [Response to Dileep George: AGI safety warrants planning ahead](https://www.lesswrong.com/posts/LJD4C7KAr64onL8fq/response-to-dileep-george-agi-safety-warrants-planning-ahead) (July)
* [A shortcoming of concrete demonstrations as AGI risk advocacy](https://www.lesswrong.com/posts/L7t3sKnS7DedfTFFu/a-shortcoming-of-concrete-demonstrations-as-agi-risk) (Dec)
***Random** **non-work-related rants etc. in my free time:***
* [Some (problematic) aesthetics of what constitutes good work in academia](https://www.lesswrong.com/posts/LZJJK6fuuQtTLRSu9/some-problematic-aesthetics-of-what-constitutes-good-work-in) (March)
* [A couple productivity tips for overthinkers](https://www.lesswrong.com/posts/ZN6L5ysKd35FEyGr6/a-couple-productivity-tips-for-overthinkers) (April)
Also in 2024, I went through and revised my 15-post [Intro to Brain-Like-AGI Safety](https://www.lesswrong.com/s/HzcM2dkCq7fwXBej8) series (originally published in 2022). For a summary of changes, see [this twitter thread](https://x.com/steve47285/status/1813971002222952852). (Or [here](https://www.lesswrong.com/posts/btHmC88KCZdzimBCM/steve2152-s-shortform?commentId=DD28htp2tWZg7w8eT) without pictures, if you want to avoid twitter.) For more detailed changes, each post of the series has a changelog at the bottom.
4. Acknowledgements
===================
Thanks Jed McCaleb & [Astera Institute](https://astera.org/) for generously supporting my research since August 2022!
Thanks to all the people who comment on my posts before or after publication, or share ideas and feedback with me through [email](mailto:steven.byrnes@gmail.com) or [other channels](https://sjbyrnes.com/), and especially those who patiently stick it out with me through long back-and-forths to hash out disagreements and confusions. Ive learned so much that way!!!
Thanks to my coworker Seth for fruitful ideas and discussions, and to Beth Barnes and the [Centre For Effective Altruism](https://www.centreforeffectivealtruism.org/) [Donor Lottery Program](https://www.givingwhatwecan.org/donor-lottery) for helping me get off the ground with grant funding in 2021-2022. Thanks Lightcone Infrastructure ([dont forget to donate](https://www.lesswrong.com/posts/5n2ZQcbc7r4R8mvqc/the-lightcone-is-nothing-without-its-people)!) for maintaining and continuously improving this site, which has always been an essential part of my workflow. Thanks to everyone else fighting for Safe and Beneficial AGI, and thanks to my family, and thanks to you all for reading! Happy New Year!
1. **[^](#fnref72ioyjkt4p2)**
For a different (simpler) example of what I think it looks like to make progress towards that kind of pseudocode, see my post [A Theory of Laughter](https://www.lesswrong.com/posts/7kdBqSFJnvJzYTfx9/a-theory-of-laughter).
2. **[^](#fnrefwdgiptlvzsk)**
Thanks to regional specialization across the cortex (roughly corresponding to “neural network architecture” in ML lingo), there can be *a priori* reason to believe that, for example, “pattern 387294” is a pattern in short-term auditory data whereas “pattern 579823” is a pattern in large-scale visual data, or whatever. But that’s not good enough. The symbol grounding problem for social instincts needs *much* more specific information than that. If Jun just told me that Xiu thinks I’m cute, then that’s a very different situation from if Jun just told me that Fang thinks I’m cute, leading to very different visceral reactions and drives. Yet those two possibilities are built from generally the same kinds of data.
@@ -1,5 +1,5 @@
---
title: Parkinson's Law and the Ideology of Statistics
title: "Parkinson's Law and the Ideology of Statistics"
date: 2025-01-04 22:59:57.376000+00:00
url: https://www.lesswrong.com/posts/4CmYSPc4HfRfWxCLe/parkinson-s-law-and-the-ideology-of-statistics-1
novelty: 0.6774583882987633
+51
View File
@@ -0,0 +1,51 @@
---
title: "Preference Inversion"
date: 2025-01-03 23:49:06.168000+00:00
url: https://www.lesswrong.com/posts/zMkQFuNqMBpBvuYm8/preference-inversion
novelty: 0.6333205293990318
score: 0.48199906945228577
baseScore: 42
voteCount: 25
---
Sometimes the preferences people report or even try to demonstrate are better modeled as a political strategy and response to coercion, than as an honest report of intrinsic preferences. Modeling this correctly is important if you want to try to efficiently satisfy others' intrinsic preferences, or even your own. So I'm sharing something I wrote on the topic elsewhere.
[You asked](https://threadreaderapp.com/thread/1517207623510208513.html) why people who "believe in" avoiding nonmarital sex so frequently engage in and report badly regretting it. Instead of responding within your frame, I'm going to lay out the interpretive framework that seems most natural to me to use for this problem, and then answer in those terms.
We can call things or actions good or bad, right or wrong, with reference to some intention that both the speaker and listener have in mind. For instance, a sturdier and sharper knife is a better one, because our uses for knives tend to converge. We can expect to be understood when we call some knives "good" and leave out "for cutting," and likewise when we call spoiled food bad without reference to a shared interest, because it harms the body of the eater, which harm we generally expect animals to try to avoid.
Moral injunctions such as "it is wrong to lie," "it is bad to steal," can diverge from the local interests of the organism being admonished, in service of a larger, convergent goal. By abstaining from some narrowly self-interested behaviors now, we preserve the necessary conditions for our needs to be met in the future, and the relation between the costs and the benefits can in principle be explained within the system of reference that judges actions as good or bad.
Not all injunctions are like this. For instance, reproduction is such a large component of inclusive fitness that it's not clear what good an organism could get to compensate it for forgoing reproduction. If, like the early Essenes or Christians, we judge sexual desire and activity to be simply bad, we cannot explain this inside the moral system in terms of an animal's rational decision to defer gratification. (This isn't an analytically certain proof, and depends on some contingent facts about apes. If ants or bees talked about something like right and wrong, or good and bad, their relation to those ideas might work very differently from ours.) Instead, we have to explain these statements from an independent system of reference, outside the one that judges reproduction to be bad. There are two things to be explained:
1. How can someone be induced to persistently endorse, promote, and act on perverted moral judgments, i.e. judgments that on net oppose rather than promote their interests as an organism?
2. How are such inducements ecologically fit? Why are they selected for and under what circumstances? Why do we see a lot of them, with lots of discernible traces in the world, rather than a negligible amount?
In some primate groups, a dominant male will punish submissive males for revealing sexual desire for the sexually mature females.[[1]](#fn0usth4iltbl) This is not exclusive to language-using apes, so it cannot be a mere instruction to lie - it has to be a demand to *fake* disinterest, i.e. to distort one's own behavior to emulate it. This is an easy-to-understand example of an important general fact about humans: we can be threatened into internalized preference falsification, i.e. preference inversion.
There seems to be some sexual heterogeneity here. On priors this makes sense; while women's concealed estrus allows them to consciously decide whether to conceal or reveal sexual interest, men's erections are notoriously difficult to control consciously, so adolescent men rapidly learn to deform their unconscious desires to match what their society says they ought to want. Experimental evidence confirms this; while both women and men will predominantly tend to *report* sexual arousal patterns that conform to social desirability, [men's genital arousal patterns conform to their constructed identities much more than women's do](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2811244/).
Ecologically, preference inversion seems likely to persist if groups using that social technology have an advantage in recruiting their members into conflicts against other groups, and thus in winning those conflicts. This can take the form of warfighting at scale, which requires people to move towards danger with no clear self-interest in doing so. It can also take subtler forms of indirect conflict, of the sort described in [The Debtors' Revolt](http://benjaminrosshoffman.com/the-debtors-revolt/)*, Moral Mazes, The Golden Notebook, The Fountainhead*, etc.
The ecological success of moral perversions depends on their uneven adoption, i.e. on hypocrisy. If everyone felt an uncomplicated preference for moving towards danger, there would not be a next generation. Likewise if everyone were chaste and celibate. Submissive males in a primate group will be hoping for opportunities to supplant the dominant male, or to subvert his control. Clerics and warriors are recruited or retained through enjoying more approval than peasants for the "virtues" of asceticism and danger-seeking, but they survive through the fruits of peasants' "vicious" way of life, and in some cases have to replenish their own population by recruiting from "bad" peasants.
To generalize, if you have been coerced into participating in a perverted moral order, you are stuck with some combination of internalizing an orientation against life, and internalizing an orientation against morality, i.e. being "bad." A priest or warrior might imagine that they are possessed by a god when lying to peasants or murdering enemies, but possessed by some demon when seeking forbidden intimacy or abandoning a fight. In Freudian terms, these correspond to the superego (literally "above-me," the imagined authority to which you attribute agency for your destructive behavior) and the id (literally "it," an imagined subversive subagent with all the desires your moral frame demands that you disown).
One thing that can cause confusion here - by design - is that perverted moralities are stabler if they also enjoin nonperversely good behaviors in most cases. This causes people to attribute the good behavior to the system of threats used to enforce preference inversion, imagining that they would not be naturally inclined to love their neighbor, work diligently for things they want, and rest sometimes. Likewise, perverted moralities also forbid many genuinely bad behaviors, which primes people who must do something harmless but forbidden to accompany it with needlessly harmful forbidden behaviors, because that's what they've been taught to expect of themselves.
Some societies have norms against nonmarital sex that really do seem to function to promote marital intimacy and monogamous household formation - notably, the Amish and non-Modern Orthodox Jews. There also seems to be a less legibly distinct subset of more conventional conservative Christians who report being eager to marry and experience marital intimacy, though I am not sure how they reconcile this if at all with the New Testament. But these are not the people you are asking about.
You are asking about people whose relevant narrative center is not the positive value of marital intimacy, but the badness of sexuality, whether or not they mouth a party line endorsing the former. Many people in these types of conservative Christian cultures - more often women in my experience - report that after marriage, they have difficulty engaging in sexual behaviors, because they've learnt from childhood that sex was bad and dirty, and it's confusing for this behavior to suddenly shift from condemned to endorsed.
At this point the behavior you describe should no longer be perplexing. People who have been coerced into preference inversion cannot honestly report their own preferences or intentions as an organism. Instead, they must choose between some combination of internalized coercion, and complementary demonic possession.
This treatment of the topic is very compact. I was heavily influenced by Jessica Taylor's [On Commitments to Anti-Normativity](https://unstableontology.com/2021/04/12/on-commitments-to-anti-normativity), and Friedrich Nietzsche's *Genealogy of Morals*.
1. **[^](#fnref0usth4iltbl)**
[Tactical Deception and the Great Apes: Insight Into the Question of Theory of Mind, by Casey Kirkpatrick](https://ojs.lib.uwo.ca/index.php/uwoja/article/download/8872/7066):
Other observations of deception recorded by deWaal (1986) involved several instances in which a subordinate male courted a female by displaying his penile erection. Whenever a dominant male unexpectedly appeared, the aroused subordinate would hide his erection from the view of the approaching chimpanzee (deWaal 1986: 233; Whiten 1993: 377; Whiten & Byrne 1988: 215- 216). The chimpanzee dropped his arm, always leaving his hand to dangle between the dominant male and his erection. This was done in order to avoid a violent confrontation, which would have been inevitable had the dominant been aware of the subordinate's actions.
[...]
deWaal, F. 1986. "Deception in the Natural Communication of Chimpanzees". In [Deception: Perspectives on Human and Non-human Deceit](https://www.google.com/books/edition/Deception/eHV_8YC_NL0C?hl=en&gbpv=1&bsq=Deception%20in%20the%20Natural%20Communication%20of%20Chimpanzees). Mitchell,(ed.). pp. 221-224. Albany: University of New York State.
Whiten, A. and Richard Byrne. 1988. The Manipulation of Attention in Primate Tactical Deception. In Machiavellian Intelligence: Social Expertise and the Evolution of Intellect in Monkeys, Apes and Humans. Byrne and Whiten, (eds.). Oxford: Clarendon Press.
Whiten, Andrew. 1993. "Evolving a Theory of Mind: the Nature of Non-Verbal Mentalism in Other Primates". In Understanding Other Minds: perspectives from Autism. Baron-Cohen, Tager-Flusberg and Cohen, (eds.). pp. 367-396. New York: Oxford University Press.
+214
View File
@@ -0,0 +1,214 @@
---
title: "Review: Planecrash"
date: 2025-01-01 01:31:26.864000+00:00
url: https://www.lesswrong.com/posts/zRHGQ9f6deKbxJSji/review-planecrash
novelty: 0.9339498321936388
score: 0.689734160900116
baseScore: 298
voteCount: 154
---
Take a stereotypical fantasy novel, a textbook on mathematical logic, and *Fifty Shades of Grey*. Mix them all together and add extra weirdness for spice. The result might look a lot like [Planecrash](https://www.lesswrong.com/posts/SA9hDewwsYgnuscae/projectlawful-com-eliezer-s-latest-story-past-1m-words) (AKA: Project Lawful), a work of fiction co-written by "Iarwain" (a pen-name of Eliezer Yudkowsky) and "lintamande".
![](https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/zRHGQ9f6deKbxJSji/iavnbqjh83hwylcy2ovn)
*(image credit: Planecrash)*
Yudkowsky is not afraid to be verbose and self-indulgent in his writing. He previously wrote a [Harry Potter fanfic](https://hpmor.com/) that includes what's essentially an extended Ender's Game fanfic in the middle of it, because why not. In Planecrash, it starts with the very format: it's written as a series of [forum posts](https://www.projectlawful.com/board_sections/703) (though there are [ways to get an ebook](https://github.com/rocurley/glowfic-dl)). It continues with maths lectures embedded into the main arc, totally plot-irrelevant tangents that are just Yudkowsky [ranting about frequentist statistics](https://www.glowfic.com/posts/5826), and one instance of Yudkowsky hijacking the plot for a few pages to [soapbox about his pet Twitter feuds](https://www.glowfic.com/posts/6132?page=49) (with transparent in-world analogues for Effective Altruism, TPOT, and the post-rationalists). Planecrash does not aspire to be high literature. Yudkowsky is self-aware of this, and uses it to troll big-name machine learning researchers:
![](https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/zRHGQ9f6deKbxJSji/em3xdcz6zygfhdgpvnup)![](https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/zRHGQ9f6deKbxJSji/infcql3xpnexi7jco32q)
*(*[*source*](https://x.com/ESYudkowsky/status/1654171832994365475)*)*
Why would anyone ever read Planecrash? I read it (admittedly sometimes skimming), and I see two reasons:
1. The characters are competent in a way that characters in fiction rarely are. Yudkowsky is good at writing [intelligent characters](https://yudkowsky.tumblr.com/writing) in a specific way that I haven't seen anyone else do as well. Lintamande writes a uniquely compelling story of determination and growth in an extremely competent character.
2. More than anyone else I've yet read, Yudkowsky has his own totalising and self-consistent worldview/philosophy, and Planecrash makes it pop more than anything else he's written.
**The setup**
-------------
[Dath ilan](https://www.projectlawful.com/replies/1773731#reply-1773731) is an alternative quasi-utopian Earth, based (it's at least strongly hinted) on the premise of: what if the average person was Eliezer Yudkowsky? Dath ilan has all the normal quasi-utopian things like world government and land-value taxes and the widespread use of Bayesian statistics in science. Dath ilan also has some less-normal things, like annual Oops It's Time To Overthrow the Government festivals, an order of super-rationalists, and extremely high financial rewards for designing educational curricula that bring down the age at which the average child learns the maths behind the game theory of cooperation.
Keltham is an above-average-selfishness, slightly-above-average-intelligence young man from dath ilan. He dies in the titular plane crash, and wakes up in Cheliax.
Cheliax is a country in a medieval fantasy world in another *plane* of existence to dath ilan's (get it?). (This fantasy world is copied from a role-playing game setting—a fact I discovered when Planecrash literally linked to a Wiki article to explain part of the in-universe setting.) Like every other country in this world, Cheliax is medieval and poor. Unlike the other countries, Cheliax has the additional problem of being ruled by the forces of Hell.
Keltham meets Carissa, a Chelish military wizard who alerts the Chelish government about Keltham. Keltham is kept unaware about the Hellish nature of Cheliax, so he's eager to use his knowledge to start the scientific and industrial revolutions in Cheliax to solve the medieval poverty thing—starting with delivering lectures on first-order logic (why, what else would you first do in a medieval fantasy world?). An elaborate game begins where Carissa and a select group of Chelish agents try to extract maximum science from an unwitting Keltham before he realises what Cheliax really is—and hope that by that time, they'll have tempted him to change his morals towards a darker, more Cheliax-compatible direction.
**The characters**
------------------
Keltham oscillates somewhere between annoying and endearing.
The annoyingness comes from his gift for interrupting any moment with polysyllabic word vomit. Thankfully, this is not random pretentious techno-babble but a coherent depiction of a verbose character who thinks in terms of a non-standard set of concepts. Keltham's thoughts often include an exclamation along the lines of "what, how is {'coordination failure' / 'probability distribution' / 'decision-theoretic-counterfactual-threat-scenario'} so many syllables in this language, how do these people ever talk?"—not an unreasonable question. However, the sheer volume of Keltham's verbosity is still something, especially when it gets in the way of everything else.
The endearingness comes from his manic rationalist problem-solver energy, which gets applied to everything from figuring out chemical processes for magic ingredients to estimating the odds that he's involved in a conspiracy to managing the complicated social scene Cheliax places him in. It's somewhat like *The Martian*, a novel (and movie) about an astronaut stranded on Mars solving a long series of [engineering challenges](https://xkcd.com/1536/), but the problem-solving is much more abstract and game-theoretic and interpersonal, than concrete and physical and man-versus-world.
By far the best and most interesting character in Planecrash is Carissa Sevar, one of the several characters whose point-of-view is written by lintamande rather than Yudkowsky. She's so driven that she accidentally becomes a cleric of the god of self-improvement. She grapples realistically with the large platter of problems she's handed, experiences triumph and failure, and keeps choosing pain over stasis. All this leads to perhaps the greatest arc of grit and unfolding ambition that I've read in fiction.
**The competence**
------------------
I have a memory of once reading some rationalist blogger describing the worldview of some politician as: there's no such thing as competence, only loyalty. If a problem doesn't get solved, it's definitely not because the problem was tricky and there was insufficient intelligence applied to it or a missing understanding of its nature or someone was genuinely incompetent. It's always because whoever was working on it wasn't loyal enough to you. (I thought this was Scott Alexander on Trump, but the closest from him seems to be [this](https://slatestarcodex.com/2016/03/19/book-review-the-art-of-the-deal/), which makes a very different point.)
Whether or not I hallucinated this, the worldview of Planecrash is the opposite.
Consider Queen Abrogail Thrune II, the despotic and unhinged ruler of Cheliax who has a flair for torture. You might imagine that her main struggles are paranoia over the loyalty of her minions, and finding time to take glee in ruling over her subjects. And there's some of those. But more than that, she spends a lot of time being annoyed by how incompetent everyone around her is.
Or consider Aspexia Rugatonn, Cheliax's religious leader and therefore in charge of making the country worship Hell. She's basically a kindly grandmother figure, except not. You might expect her thoughts to be filled with deep emotional conviction about Hell, or disappointment in the "moral" failures of those who don't share her values (i.e. every non-sociopath who isn't brainwashed hard enough). But instead, she spends a lot of her time annoyed that other people don't understand how to act most usefully within the bounds of the god of Hell's instructions. The one time she gets emotional is when a Chelish person finally manages to explain the concept of [corrigibility](https://www.lesswrong.com/tag/corrigibility) to her as well as Aspexia herself could. (The gods and humans in the Planecrash universe are in a weird inverse version of the AI alignment problem. The gods are superintelligent, but have restricted communication bandwidth and clarity with humans. Therefore humans often have to decide how to interpret tiny snippets of god-orders through changing circumstances. So instead of having to steer the superintelligence given limited means, the core question is how to let yourself be steered by a superintelligence that has very limited communication bandwidth with you.)
Fiction is usually filled with characters who advance the plot in helpful ways with their emotional fumbles: consider the stereotypical horror movie protagonist getting mad and running into a dark forest alone, or a character whose pride is insulted doing a dumb thing on impulse. Planecrash has almost none of that. The characters are all good at their jobs. They are surrounded by other competent actors with different goals thinking hard about how to counter their moves, and they always think hard in response, and the smarter side tends to win. Sometimes you get the feeling you're just reading the meeting notes of a competent team struggling with a hard problem. Evil is not dumb or insane, but just "unaligned" by virtue of pursuing a different goal than you—and does so very competently. For example: the core values of the forces of Hell are literally tyranny, slavery, and pain. They have a strict hierarchy and take deliberate steps to encourage arbitrary despotism out of religious conviction. And yet: their hierarchy is still mostly an actual competence hierarchy, because the decision-makers are all very self-aware that they can only be despotic to the extent that it still promotes competence on net. Because they're competent.
Planecrash, at its heart, is [competence porn](https://tvtropes.org/pmwiki/pmwiki.php/Main/CompetencePorn). Keltham's home world of dath ilan is defined by its absence of coordination failures. Neither there nor in Cheliax's world are there really any lumbering bureaucracies that do insane things for inscrutable bureaucratic reasons; the organisations depicted are all remarkably sane. Important positions are almost always filled by the smart, skilled, and hardworking. Decisions aren't made because of emotional outbursts. Instead, lots of agents go around optimising for their goals by thinking hard about them. For a certain type of person, this is a very relaxing world to read about, despite all the hellfire.
**The philosophy**
------------------
"[Rationality is systematized winning](https://www.lesswrong.com/posts/4ARtkT3EYox3THYjF/rationality-is-systematized-winning)", writes Yudkowsky in [The Sequences](https://www.lesswrong.com/rationality). All the rest is commentary.
The core move in Yudkowsky's philosophy is:
* We want to find the general solution to some problem.
+ for example: fairness—how should we split gains from a project where many people participated
* Now here are some common-sense properties that this thing should follow
+ for example:
- (1) no gains should be left undivided
- (2) if two people both contribute identically to every circumstance (formalised as a set of participating people), they should receive an equal share of the gains
- (3) the rule should give the same answer if you combine the division of gains from project A and then project B, as when you use it to calculate the division of gains from project A+B
- (4) if one person doesn't add value in any circumstance, their share of the gains is zero
* Here is The Solution. Note that it's mathematically provable that if you don't follow The Solution, there exists a situation where you will do something obviously dumb.
    + For example: [Shapley value](https://en.wikipedia.org/wiki/Shapley_value) is the unique solution that satisfies the axioms above. (The Planecrash walkthrough of Shapley value is roughly [here](https://www.glowfic.com/replies/1730444#reply-1730444); see also [here](https://www.glowfic.com/replies/1729094#reply-1729094) for more Planecrash about trade and fairness.)
* Therefore, The Solution is uniquely spotlighted by the combination of common-sense goals and maths as the final solution to this problem, and if you disagree, please read this 10,000 word dialogue.
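The fairness example above can be made concrete with a minimal sketch. It computes the Shapley value by brute force, averaging each person's marginal contribution over every order in which the participants could join. The three-person project, its payoff of 90, and the names `shapley_values` and `project_value` are all hypothetical, chosen only for illustration.

```python
from fractions import Fraction
from itertools import permutations


def shapley_values(players, value):
    """Average each player's marginal contribution over all join orders.

    `value` maps a frozenset of players to the gains that coalition produces.
    Brute force: fine for a handful of players, factorial cost beyond that.
    """
    players = list(players)
    totals = {p: Fraction(0) for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += Fraction(value(with_p)) - Fraction(value(coalition))
            coalition = with_p
    return {p: total / len(orders) for p, total in totals.items()}


# Hypothetical project: alice and bob are both required to produce 90 in gains,
# and carol adds nothing in any circumstance.
def project_value(coalition):
    return 90 if {"alice", "bob"} <= coalition else 0


shares = shapley_values(["alice", "bob", "carol"], project_value)
print(shares)
```

In this toy case the result lines up with the listed properties: the shares sum to the full 90 (property 1), alice and bob contribute identically and get equal shares (property 2), and carol, who never adds value, gets zero (property 4).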
The centrality of this move is something I did not get from The Sequences, but which is very apparent in Planecrash. A lot of the maths in Planecrash isn't new Yudkowsky material. But Planecrash is the only thing that has given me a *map* through the core objects of Yudkowsky's philosophy, and spelled out the high-level structure so clearly. It's also, as far as I know, the most detailed description of Yudkowsky's quasi-utopian world of dath ilan.
### **Validity, Probability, Utility**
Keltham's lectures to the Chelish—yes, there are actually literal maths lectures within Planecrash—walk through three key examples, at a spotty level of completeness but at a high quality of whatever is covered:
1. Validity, i.e. logic. In particular, Yudkowsky highlights what I think is some combination of [Lindström's theorem](https://en.wikipedia.org/wiki/Lindstr%C3%B6m%27s_theorem) and [Gödel's completeness theorem](https://en.wikipedia.org/wiki/G%C3%B6del%27s_completeness_theorem), that together imply first-order logic is the unique logic that is both complete (i.e. everything true within it can be proven) and has some other nice properties. However, first-order logic is also not strong enough to capture some things we care about (such as the natural numbers), so this is the least-strong example of the above pattern. Yudkowsky has written out his thoughts on logic in the [mathematics and logic section here](https://www.lesswrong.com/s/SqFbMbtxGybdS2gRs), if you want to read his takes in a non-fiction setting.
2. Probability. So-called [Dutch book theorems](https://en.wikipedia.org/wiki/Dutch_book_theorems) show that if an agent does not update their beliefs in a Bayesian way, there exists a set of bets, each of which they would accept, that together lead to a guaranteed loss. So your credences in beliefs should be represented as probabilities, and you should update those probabilities with Bayes' theorem. ([Here](https://www.glowfic.com/replies/1750558#reply-1750558) is a list of English statements that, dath ilani civilisation thinks, anyone competent in Probability should be able to translate into correct maths.)
3. Utility. The behaviour of any agent that is "rational" in a certain technical sense should be describable as it having a "utility function", i.e. every outcome can be assigned a number, such that the agent predictably chooses outcomes with higher numbers over those with lower ones. This is because if an agent violates this constraint, there must exist situations where it would do something obviously dumb. As a shocked Keltham puts it: "I, I mean, there's being chaotic, and then there's being so chaotic that it violates *coherence theorems*".
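To illustrate the Probability item, here is a minimal sketch of the simplest Dutch book. It covers the static case, where credences over mutually exclusive and exhaustive outcomes fail to sum to 1, rather than the updating case. It assumes the agent treats a ticket paying 1 on an outcome as fairly priced at its credence in that outcome, and will buy or sell at that price. The function name and the rain example are hypothetical.

```python
from fractions import Fraction


def dutch_book(credences):
    """Return (side, payoffs) for a bookie exploiting incoherent credences.

    `credences` maps mutually exclusive, exhaustive outcomes to the agent's
    credence in each. One ticket per outcome pays 1 if that outcome occurs.
    If the credences sum to more than 1, the agent buys every ticket;
    if they sum to less than 1, the agent sells every ticket.
    `payoffs` gives the agent's net result under each outcome.
    """
    total_price = sum(credences.values())
    if total_price == 1:
        return None, {outcome: Fraction(0) for outcome in credences}
    if total_price > 1:
        # Agent pays total_price and collects exactly 1 whatever happens.
        return "buy", {outcome: 1 - total_price for outcome in credences}
    # Agent collects total_price and pays out exactly 1 whatever happens.
    return "sell", {outcome: total_price - 1 for outcome in credences}


# Hypothetical agent that is 60% sure of rain and also 60% sure of no rain.
side, payoffs = dutch_book({"rain": Fraction(3, 5), "no rain": Fraction(3, 5)})
print(side, payoffs)
```

The agent loses one fifth of a unit whether or not it rains, which is the "guaranteed loss" the theorem refers to; credences that sum to exactly 1 leave the bookie nothing to exploit in this setup.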
In Yudkowsky's own words, not in Planecrash but in an [essay he wrote](https://www.lesswrong.com/posts/RQpNHSiWaXTvDxt6R/coherent-decisions-imply-consistent-utilities) (with much valuable discussion in the comments):
> *We have multiple spotlights all shining on the same core mathematical structure, saying dozens of different variants on, "If you aren't running around in circles or stepping on your own feet or wantonly giving up things you say you want, we can see your behavior as corresponding to this shape. Conversely, if we can't see your behavior as corresponding to this shape, you must be visibly shooting yourself in the foot." Expected utility is the only structure that has this great big family of discovered theorems all saying that. It has a scattering of academic competitors, because academia is academia, but the competitors don't have anything like that mass of spotlights all pointing in the same direction.*
>
> *So if we need to pick an interim answer for "What kind of quantitative framework should I try to put around my own decision-making, when I'm trying to check if my thoughts make sense?" or "By default and barring special cases, what properties might a sufficiently advanced machine intelligence look to us like it possessed, at least approximately, if we couldn't see it visibly running around in circles?", then there's pretty much one obvious candidate: Probabilities, utility functions, and expected utility.*
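The "running around in circles" image can be sketched as a money pump, one of the standard arguments in the coherence-theorem family. This is a toy illustration under stated assumptions: the agent has cyclic preferences over three hypothetical items, and will pay a small fee to swap whenever it is offered something it prefers to what it holds. The names `run_money_pump` and `cyclic_preferences` are made up for this example.

```python
def run_money_pump(prefers, start_item, offers, fee):
    """Offer swaps in sequence; the agent pays `fee` for each swap it prefers."""
    item, money = start_item, 0
    for offered in offers:
        if prefers(offered, item):
            item, money = offered, money - fee
    return item, money


# (x, y) in the set means "x is preferred to y". The cycle A > B > C > A
# cannot be represented by any utility function.
cyclic_preferences = {("A", "B"), ("B", "C"), ("C", "A")}


def prefers(x, y):
    return (x, y) in cyclic_preferences


final_item, money = run_money_pump(prefers, "A", ["C", "B", "A"], fee=1)
print(final_item, money)
```

The agent ends up holding the item it started with, three fees poorer, and would go around again if offered. An agent whose preferences come from a utility function cannot be led around a loop like this, because each paid swap would have to strictly raise a single number.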
### **Coordination**
Next, coordination. There is no single theorem or total solution for the problem of coordination. But the Yudkowskian frame has near-infinite scorn for failures of coordination. Imagine not realising all possible gains just because you're stuck in some equilibrium of agents defecting against each other. Is that winning? No, it's not. Therefore, it must be out.
Dath ilan has a mantra that goes, roughly: if you do that, you will end up there, so if you want to end up somewhere that is not there, you will have to do Something Else Which Is Not That. And the basic premise of dath ilan is that society actually has the ability to collectively say "we are currently going there, and we don't want to, and while none of us can individually change the outcome, we will all coordinate to take the required collective action and not defect against each other in the process even if we'd gain from doing so". Keltham claims that in dath ilan, if there somehow developed an oppressive tyranny, everyone would wait for some [Schelling time](https://en.wikipedia.org/wiki/Focal_point_(game_theory)#:~:text=In%20game%20theory%2C%20a%20focal,Strategy%20of%20Conflict%20(1960).) (like a solar eclipse or the end of the calendar year or whatever) and then simultaneously rise up in rebellion. It probably helps that dath ilan has annual "oops it's time to overthrow the government" exercises. It also helps that everyone in dath ilan knows that everyone knows that everyone knows that everyone knows (...) all the [standard rationalist takes on coordination and common knowledge](https://www.lesswrong.com/posts/9QxnfMYccz9QRgZ5z/the-costly-coordination-mechanism-of-common-knowledge).
Keltham summarises the universality of Validity, Probability, Utility, and Coordination (note the capitals):
> *"I am a lot more confident that Validity, Probability, and Utility are still singled-out mathematical structures whose fragmented shards and overlapping shadows hold power in Golarion [=the world of Cheliax], than I am confident that I already know why snowflakes here have sixfold symmetry. And I wanted to make that clear before I said too much about the hidden orders of reality out of dath ilan - that even if the things I am saying are entirely wrong about Golarion, that kind of specific knowledge is not the most important knowledge I have to teach. I have gone into this little digression about Validity and timelessness and optimality, in order to give you some specific reason to think that [...] some of the knowledge he has to teach is sufficiently general that you have strong reason for strong hope that it will work [...] [...] "It is said also in dath ilan that there is a final great principle of Law, less beautiful in its mathematics than the first three, but also quite important in practice; it goes by the name Coordination, and deals with agents simultaneously acting in such fashion to all get more of what they wanted than if they acted separately."*
### **Decision theory**
The final fundamental bit of Yudkowsky's philosophy is decision theories more complicated than causal decision theory.
A short primer / intuition pump: a decision theory specifies how you should choose between various options (it's not moral philosophy, because it assumes that we already know *what* we value). The most straightforward decision theory is causal decision theory, which says: pick the option that causes the best outcome in expectation. Done, right? No; the devil is in the word "causes". Yudkowsky makes much of [Newcomb's problem](https://www.lesswrong.com/posts/6ddcsdA2c2XpNpE5x/newcomb-s-problem-and-regret-of-rationality), but I prefer another example: [Parfit's hitchhiker](https://www.lesswrong.com/tag/parfits-hitchhiker). Imagine you're a selfish person stuck in a desert without your wallet, and want to make it back to your hotel in the city. A car pulls up, with a driver who knows whether you're telling the truth. You ask to be taken back to your hotel. The driver asks if you'll pay $10 to them as a service. Dying in the desert is worse for you than paying $10, so you'd like to take this offer. However, you obey causal decision theory: if the driver takes you to your hotel, you would go to your hotel to get your wallet, but once inside you have the option between (a) take $10 back to the driver and therefore lose money, and (b) stay in your hotel and lose no money. Causal decision theory says to take option (b), because you're a selfish agent who doesn't care about the driver. And the driver knows you'd be lying if you said "yes", so you have to tell the driver "no". The driver drives off, and you die of thirst in the desert. If only you had spent more time arguing about non-causal decision theories on LessWrong.
Dying in a desert rather than spending $10 is not exactly systematised winning. So causal decision theory is out. (You could argue that another moral of Parfit's hitchhiker is that being a purely selfish agent is bad, and humans aren't purely selfish so it's not applicable to the real world anyway, but in Yudkowsky's philosophy—and decision theory academia—you want a *general* solution to the problem of rational choice where you can take *any* utility function and *win* by *its* lights regardless of which convoluted setup philosophers drop you into.) Yudkowsky's main academic / mathematical accomplishment is co-inventing (with Nate Soares) [functional decision theory](https://arxiv.org/abs/1710.05060), which says you should consider your decisions as the output of a fixed function, and then choose the function that leads to the best consequences for you. This solves Parfit's hitchhiker, as well as problems like [the smoking lesion problem](https://www.lesswrong.com/tag/smoking-lesion) that [evidential decision theory](https://en.wikipedia.org/wiki/Evidential_decision_theory), the classic non-causal decision theory, succumbs to. As far as I can judge, functional decision theory is actually a good idea (if somewhat underspecified), but academic engagement (whether critique or praise) with it has been limited so there's no broad consensus in its favor that I can point at. (If you want to read Yudkowsky's explanation for why he doesn't spend more effort on academia, it's [here](https://www.glowfic.com/replies/1729119#reply-1729119).)
(Now you know what a Planecrash tangent feels like, except you don't, because Planecrash tangents can be *much* longer.)
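For readers who prefer code to deserts, here is a toy sketch of Parfit's hitchhiker. It is an illustration of the intuition only, not an implementation of functional decision theory as defined in the paper. The assumptions: the driver predicts perfectly, the only choice is what to do once at the hotel, and surviving is worth a hypothetical 1,000,000 against a fare of 10. All names are invented for this example.

```python
RESCUE_VALUE = 1_000_000  # hypothetical value of not dying in the desert
FARE = 10
ACTIONS = ("pay", "stay")


def driver_gives_ride(action_at_hotel):
    """The driver predicts perfectly, so a ride is offered only to payers."""
    return action_at_hotel == "pay"


def total_payoff(action_at_hotel):
    """Payoff of being the kind of agent that takes this action at the hotel."""
    if not driver_gives_ride(action_at_hotel):
        return 0  # left in the desert
    return RESCUE_VALUE - FARE


def cdt_choice():
    """Choose at the hotel, counting only what the action causes from there.

    The rescue has already happened, so only the fare is causally downstream.
    """
    causal_payoff = {"pay": -FARE, "stay": 0}
    return max(ACTIONS, key=causal_payoff.get)


def policy_choice():
    """Choose the decision procedure whose overall consequences are best."""
    return max(ACTIONS, key=total_payoff)


for name, choose in [("causal", cdt_choice), ("policy-level", policy_choice)]:
    action = choose()
    print(name, action, total_payoff(action))
```

The causal agent picks "stay" at the hotel, which means the driver never picks it up and its total payoff is 0. The policy-level agent picks "pay" and walks away with 999,990. The entire difference is in which question each agent asks: "what does this action cause from here?" versus "which output of my decision function leads to the best outcome?"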
One big aspect of Yudkowskian decision theory is how to respond to threats. Following causal decision theory means you can neither make credible threats nor commit to deterrence to counter threats. Yudkowsky endorses not responding to threats to avoid incentivising them, while also having deterrence commitments to maintain good equilibria. He also implies this is a consequence of using a sensible functional decision theory. But there's a tension here: your deterrence commitment could be interpreted as a threat by someone else, or vice versa. When the Eisenhower administration's nuclear doctrine threatened massive nuclear retaliation in the event of the Soviets taking West Berlin, what's the exact maths that would've let them argue to the Soviets "no no this isn't a threat, this is just a deterrence commitment", while allowing the Soviets to keep to Yudkowsky's strict rule to ignore all threats?
My (uninformed) sense is that this maths hasn't been figured out. Planecrash never describes it (though [here](https://www.glowfic.com/replies/1770000#reply-1770000) is some discussion of decision theory in Planecrash). Posts in the LessWrong decision theory canon like [this](https://www.lesswrong.com/posts/wXbSAKu2AcohaK2Gt/udt-shows-that-decision-theory-is-more-puzzling-than-ever) or [this](https://www.lesswrong.com/posts/brXr7PJ2W4Na2EW2q/the-commitment-races-problem) and [this](https://www.lesswrong.com/posts/g8HHKaWENEbqh2mgK/updatelessness-doesn-t-solve-most-problems-1) seem to point to real issues around decision theories encouraging commitment races, and when Yudkowsky pipes up in the comments he's mostly falling back on the conviction that, surely, sufficiently-smart agents will find some way around mutual destruction in a commitment race (systematised winning, remember?). There are also various [critiques of functional decision theory](https://www.lesswrong.com/posts/ySLYSsNeFL5CoAQzN/a-critique-of-functional-decision-theory) (see also [Abram Demski's comment](https://www.lesswrong.com/posts/ySLYSsNeFL5CoAQzN/a-critique-of-functional-decision-theory?commentId=y8zRwcpNeu2ZhM3yE) on that post acknowledging that functional decision theory is underspecified). Perhaps it all makes sense if you've worked through Appendix B7 of Yudkowsky's big decision theory paper (which I haven't actually read, let alone taken time to digest), but (a) why doesn't he reference that appendix then, and (b) I'd complain about that being hard to find, but then again we are talking about the guy who leaves the clearest and most explicit description of his philosophy scattered across an R-rated role-playing-game fanfic posted in innumerable parts on an obscure internet forum, so I fear my complaint would be falling on deaf ears anyway.
### **The political philosophy of dath ilan**
Yudkowsky has put a lot of thought into how the world of dath ilan functions. Overall it's very coherent.
[Here's a part where Keltham explains dath ilan's central management principle](https://www.glowfic.com/replies/1743445#reply-1743445): everything, including every project, every rule within any company, and any legal regulation, needs to have one person responsible for it.
> *Keltham is informed, though he doesn't think he's ever been tempted to make that mistake himself, that overthinky people setting up corporations sometimes ask themselves 'But wait, what if this person here can't be trusted to make decisions all by themselves, what if they make the wrong decision?' and then try to set up more complicated structures than that. This basically never works. If you don't trust a power, make that power legible, make it localizable to a single person, make sure every use of it gets logged and reviewed by somebody whose job it is to review it. If you make power complicated, it stops being legible and visible and recordable and accountable and then you actually are in trouble.*
[Here's a part where Keltham talks about how dath ilan solves the problem of who watches the watchmen](https://www.glowfic.com/replies/1772752#reply-1772752):
> *If you count the rehearsal festivals for it, Civilization spends more on making sure Civilization can collectively outfight the Hypothetical Corrupted Governance Military, than Civilization spends on its actual military.*
[Here's a part where dath ilan's choice of political system is described](https://www.glowfic.com/replies/1773754#reply-1773754), which I will quote at length:
> *Conceptually and to first-order, the ideal that Civilization is approximating is a giant macroagent composed of everybody in the world, taking coordinated macroactions to end up on the multi-agent-optimal frontier, at a point along that frontier reflecting a fair division of the gains from that coordinated macroaction -*
>
> *Well, to be clear, the dath ilani would shut it all down if actual coordination levels started to get anywhere near that. Civilization has spoken - with nearly one voice, in fact - that it does not want to turn into a hivemind.*
>
> *[...]*
>
> *Conceptually and to second-order, then, Civilization thinks it should be divided into a Private Sphere and a Public Shell. Nearly all the decisions are made locally, but subject to a global structure that contains things like "children may not be threatened into unpaid labor"; or "everybody no matter who they are or what they have done retains the absolute right to cryosuspension upon their death"; [...]*
>
> *[...]*
>
> *Directdemocracy has been tried, from time to time, within some city of dath ilan: people making group decisions by all individually voting on them. It can work if you try it with fifty people, even in the most unstructured way. Get the number of direct voters up to ten thousand people, and no amount of helpfully-intended structure in the voting process can save you.*
>
> *[...]*
>
> *Republics have been tried, from time to time, within some city of dath ilan: people making group decisions by voting to elect leaders who make those decisions. It can work if you try it with fifty people, even in the most unstructured way. Get the number of voters up to ten thousand people, and no amount of helpfully-intended structure in the voting process can save you.*
>
> *[...]*
>
> *There are a hundred more clever proposals for how to run Civilization's elections. If the current system starts to break, one of those will perhaps be adopted. Until that day comes, though, the structure of Governance is the simplest departure from directdemocracy that has been found to work at all.*
>
> *Every voter of Civilization, everybody at least thirteen years old or who has passed some competence tests before then, primarily exerts their influence through delegating their vote to a Delegate.*
>
> *A Delegate must have at least fifty votes to participate in the next higher layer at all; and can retain no more than two hundred votes before the marginal added influence from each additional vote starts to diminish and grow sublinearly. Most Delegates are not full-time, unless they are representing pretty rich people, but they're expected to be people interested in politics [...]. Your Delegate might be somebody you know personally and trust, if you're the sort to know so many people personally that you know one Delegate. [...]*
>
> *If you think you've got a problem with the way Civilization is heading, you can talk to your Delegate about that, and your Delegate has time to talk back to you.*
>
> *That feature has been found to not actually be dispensable in practice. It needs to be the case that, when you delegate your vote, you know who has your vote, and you can talk to that person, and they can talk back. Otherwise people feel like they have no lever at all to pull on the vast structure that is Governance, that there is nothing visible that changes when a voter casts their one vote. Sure, in principle, there's a decision-cohort whose votes move in logical synchrony with yours, and your cohort is probably quite large unless you're a weird person. But some part of you more basic than that will feel like you're not in control, if the only lever you have is an election that almost never comes down to the votes of yourself and your friends.*
>
> *The rest of the electoral structure follows almost automatically, once you decide that this property has to be preserved at each layer.*
>
> *The next step up from Delegates are Electors, full-time well-paid professionals who each aggregate 4,000 to 25,000 underlying voters from 50 to 200 Delegates. Few voters can talk to their Electors [...] but your Delegate can have some long conversations with them. [...]*
>
> *Representatives aggregate Electors, ultimately 300,000 to 3,000,000 underlying votes apiece. There are roughly a thousand of those in all Civilization, at any given time, with social status equivalent to an excellent CEO of a large company or a scientist who made an outstanding discovery [...]*
>
> *And above all this, the Nine Legislators of Civilization are those nine candidates who receive the most aggregate underlying votes from Representatives. They vote with power proportional to their underlying votes; but when a Legislator starts to have voting power exceeding twice that of the median Legislator, their power begins to grow sublinearly. By this means is too much power prevented from concentrating into a single politician's hands.*
>
> *Surrounding all this of course are numerous features that any political-design specialist of Civilization would consider obvious:*
>
> *Any voter (or Delegate or Elector or Representative) votes for a list of three possible delegees of the next layer up; if your first choice doesn't have enough votes yet to be a valid representor, your vote cascades down to the next person on your list, but remains active and ready to switch up if needed. This lets you vote for new delegees entering the system, without that wasting your vote while there aren't enough votes yet.*
>
> *Anyone can at any time immediately eliminate a person from their 3-list, but it takes a 60-day cooldown to add a new person or reorder the list. The government design isn't meant to make it cheap or common to threaten your delegee with a temporary vote-switch if they don't vote your way on that particular day. The government design isn't meant to make it possible for a new brilliant charismatic leader to take over the entire government the next day with no cooldowns. It is meant to let you rapidly remove your vote from a delegee that has sufficiently ticked you off.*
>
> *Once you have served as a Delegate, or delegee of any other level, you can't afterwards serve in any other branches of Governance. [...]*
>
> *This is meant to prevent a political structure whose upper ranks offer promotion as a reward to the most compliant members of the ranks below, for by this dark-conspiratorial method the delegees could become aligned to the structure above rather than their delegators below.*
>
> *(Most dath ilani would be suspicious of a scheme that tried to promote Electors from Delegates in any case; they wouldn't think there should be a political career ladder [...] Dath ilani are instinctively suspicious of all things meta, and much more suspicious of anything purely meta; they want heavy doses of object-level mixed in. To become an Elector you do something impressive enough, preferably something entirely outside of Governance, that Delegates will be impressed by you. You definitely don't become an Elector by being among the most ambitious and power-seeking people who wanted to climb high and knew they had to start out a lowly Delegate, who then won a competition to serve the system above them diligently enough to be selected for a list of Electors fed to a political party's captive Delegates. If a dath ilani saw a system like this, that was supposedly a democracy set in place by the will of its people, they would ask what the captive 'voters' even thought they were supposedly trying to do under the official story.)*
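Two of the mechanisms in the quoted passage, the cap on a Delegate's influence and the cascading three-name list, are concrete enough to sketch. The thresholds of 50 and 200 votes come from the quote; everything else is an assumption. In particular, the quote does not say how influence grows past 200 votes (a square root on the excess is used here purely as a placeholder), nor how to decide which under-threshold delegee drops out first (this sketch removes the one with the fewest votes, then recounts from scratch).

```python
import math

MIN_VOTES = 50   # from the quoted passage: minimum to participate
SOFT_CAP = 200   # from the quoted passage: influence is sublinear beyond this


def effective_influence(votes):
    """Assumed shape: zero below MIN_VOTES, linear up to SOFT_CAP,
    then square-root growth on the excess (the sqrt is a placeholder)."""
    if votes < MIN_VOTES:
        return 0.0
    if votes <= SOFT_CAP:
        return float(votes)
    return SOFT_CAP + math.sqrt(votes - SOFT_CAP)


def assign_votes(preference_lists, min_votes=MIN_VOTES):
    """Cascade each voter's vote down their ranked list of delegees.

    `preference_lists` maps voter -> ordered list of up to three delegees.
    Returns (assignment, tally). A voter whose whole list falls below the
    threshold is assigned None. Recounting from scratch each round models
    votes staying "ready to switch up" when an earlier choice becomes valid.
    """
    eliminated = set()
    while True:
        assignment, tally = {}, {}
        for voter, prefs in preference_lists.items():
            choice = next((d for d in prefs if d not in eliminated), None)
            assignment[voter] = choice
            if choice is not None:
                tally[choice] = tally.get(choice, 0) + 1
        short = [d for d, n in tally.items() if n < min_votes]
        if not short:
            return assignment, tally
        # Assumed tie-break: drop the weakest delegee, alphabetical on ties.
        eliminated.add(min(short, key=lambda d: (tally[d], d)))


# Tiny hypothetical electorate with the threshold lowered to 2 for readability.
lists = {"ann": ["X", "Y"], "ben": ["Y"], "cat": ["Y", "X"]}
assignment, tally = assign_votes(lists, min_votes=2)
print(assignment, tally)
```

In the toy electorate, X only has one supporter and so cannot be a valid delegee; ann's vote cascades to her second choice, Y, rather than being wasted. The 60-day cooldown on adding or reordering names is not modelled here.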
Dath ilani Legislators have a programmer's or engineer's appreciation for simplicity:
> *[...] each [regulation] must be read aloud by a Legislator who thereby accepts responsibility for that regulation; and when that Legislator retires a new Legislator must be found to read aloud and accept responsibility for that regulation, or it will be stricken from the books. Every regulation in Civilization, if something goes wrong with it, is the fault of one particular Legislator who accepted responsibility for it. To speak it aloud, it is nowadays thought, symbolizes the acceptance of this responsibility.*
>
> *Modern dath ilani aren't really the types in the first place to produce literally-unspeakable enormous volumes of legislation that no hapless citizen or professional politician could ever read within their one lifetime let alone understand. Even dath ilani who aren't professional programmers have written enough code to know that each line of code to maintain is an ongoing cost. Even dath ilani who aren't professional economists know that regulatory burdens on economies increase quadratically in the cost imposed on each transaction. They would regard it as contrary to the notion of a lawful polity with law-abiding citizens that the citizens cannot possibly know what all the laws are, let alone obey them. Dath ilani don't go in for fake laws in the same way as Golarion polities with lots of them; they take laws much too seriously to put laws on the books just for show.*
Finally, the Keepers are an order of people trained in all the most hardcore arts of rationality, and who thus end up with inhuman integrity and even-handedness of judgement. They are used in many ways, for example:
> *There are also Keeper cutouts at key points along the whole structure of Governance - the Executive of the Military reports not only to the Chief Executive but also to an oathsworn Keeper who can prevent the Executive of the Military from being fired, demoted, or reduced in salary, just because the Chief Executive or even the Legislature says so. It would be a big deal, obviously, for a Keeper to fire this override; but among the things you buy when you hire a Keeper is that the Keeper will do what they said they'd do and not give five flying fucks about what sort of 'big deal' results. If the Legislators and the Chief Executive get together and decide to order the Military to crush all resistance, the Keeper cutout is there to ensure that the Executive of the Military doesn't get a pay cut immediately after they tell the Legislature and Chief Executive to screw off.*
Also, to be clear, absolutely none of this is plot-relevant.
![](https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/zRHGQ9f6deKbxJSji/nzjf59dbqlcf9fksjrv1)
*Above: the icon of dath ilan in Planecrash. When Yudkowsky really wants to monologue, he stops even pretending to do it through a character, and instead we get this talking globe. Hello, globe. Nice political philosophy you got there.*
**A system of the world**
-------------------------
Yudkowsky proves that ideas matter: if you have ideas that form a powerful and coherent novel worldview, it doesn't matter if your main method for publicising them is ridiculously-long fanfiction, or if you dropped out of high school, or if you wear fedoras. People will still listen, and you might become (so far) the 21st century's most important philosopher.
Why is Yudkowsky so compelling? There are intellectuals like [Scott Alexander](https://slatestarcodex.com/about/) who are most-strongly identified by a particular *method* (an even-handed, epistemically-rigorous, steelmanning-focused treatment of a topic), or intellectuals like [Robin Hanson](https://overcoming-bias-anthology.com/) who are most-strongly identified by a particular *style* (eclectic irreverence about incentive mechanisms). But Yudkowsky's hallmark is delivering an entire system of the world that covers everything from logic to what correct epistemology looks like to the maths behind rational decision-making and coordination, and comes complete with identifying the biggest threat (misaligned AI) and the structure of utopia (dath ilan). None of the major technical inventions (except some in decision theory) are original to Yudkowsky. But he's picked up the pieces, slotted them into a big coherent structure, and presented it in great depth. And Yudkowsky's system claims to come with *proofs* for many key bits, in the literal mathematical sense. No, you can't crack open a textbook and see everything laid out, step-by-step. But the implicit claim is: read this long essay on coherence theorems, these papers on decision theory, this 20,000-word dialogue, these sequences on LessWrong, and ideally a few fanfics too, and then you'll get it.
![](https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/zRHGQ9f6deKbxJSji/wnjgl8plnilu0qi5sumx)
*After reading Yudkowsky, you're perfectly inoculated against any philosophy so lazy that it doesn't even come with mathematical proofs. (*[*source*](https://x.com/robinhanson/status/810281385852276736)*)*
Does he deliver? To an impressive extent, yes. There's a lot of maths that *is* laid out step-by-step and *does* check out. There are many takes that are correct, and big structures that point in the right direction, and what seems wrong at least has depth and is usefully provocative. But dig deep enough, and there are cracks: [arguments](https://www.lesswrong.com/posts/yCuzmCsE86BTu9PfA/there-are-no-coherence-theorems) about how much [coherence theorems](https://www.lesswrong.com/posts/NxF5G6CJiof6cemTw/coherence-arguments-do-not-entail-goal-directed-behavior) really imply, [critiques](https://www.lesswrong.com/posts/ySLYSsNeFL5CoAQzN/a-critique-of-functional-decision-theory) of the decision theory, and [good counterarguments](https://www.lesswrong.com/posts/CoZhXrhpQxpy9xw9y/where-i-agree-and-disagree-with-eliezer) to the most extreme versions of Yudkowsky's AI risk thesis. You can chase any of these cracks up towers of LessWrong posts, or debate them endlessly at those parties where people stand in neat circles and exchange thought experiments about [acausal trade](https://www.lesswrong.com/tag/acausal-trade). If you have no interaction with rationalist/LessWrong circles, I think you'd be surprised at the fraction of our generation's top mathematical-systematising brainpower that is spent on this—or that is bobbing in the waves left behind, sometimes unknowingly.
As for myself: Yudkowsky's philosophy is one of the most impressive intellectual edifices I've seen. Big chunks of it—in particular the stuff about empiricism, naturalism, and the art of genuinely trying to figure out what's true that [The Sequences](https://www.lesswrong.com/rationality) especially focus on—were very formative in my own thinking. I think it's often proven itself directionally correct. But Yudkowsky's philosophy makes a claim for near-mathematical correctness, and I think there's a bit of trouble there. While it has impressive mathematical depth and gets many things importantly right (e.g. Bayesianism), despite much effort spent digesting it, I don't see it meeting the rigour bar it would need for its predictions (for example about AI risk) to be more like those of a tested scientific theory than those of a framing, worldview, or philosophy. However, I'm also very unsympathetic to a certain straitlaced science-cargo-culting attitude that recoils from Yudkowsky's uncouthness and is uninterested in speculation or theory—they would do well to study the [actual history](https://www.lesswrong.com/posts/JAAHjm4iZ2j5Exfo2/the-copernican-revolution-from-the-inside) of [science](https://www.benlandautaylor.com/p/looking-beyond-the-veil). I also see in Yudkowsky's philosophy choices of framing and focus that seem neither forced by reason nor entirely natural in my own worldview. I expect that lots more great work will come out within the Yudkowskian frame, whether critiques or patches, and this work could show it to be anywhere from impressive but massively misguided to almost prophetically prescient. However, I expect even greater things if someone figures out a new, even grander and more applicable system of the world. Perhaps that person can then describe it in a weird fanfic.
@@ -0,0 +1,113 @@
---
title: "The Field of AI Alignment: A Postmortem, and What To Do About It"
date: 2025-01-01 00:42:36.538000+00:00
url: https://www.lesswrong.com/posts/nwpyhyagpPYDn4dAW/the-field-of-ai-alignment-a-postmortem-and-what-to-do-about
novelty: 0.9254826649561891
score: 0.5714353919029236
baseScore: 282
voteCount: 174
---
> A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, "this is where the light is".
Over the past few years, a major source of my relative optimism on AI has been the hope that the field of alignment would transition from pre-paradigmatic to paradigmatic, and make much more rapid progress.
At this point, that hope is basically dead. There *has* been some degree of paradigm formation, but the memetic competition has mostly been won by streetlighting: the large majority of AI Safety researchers and activists are focused on searching for their metaphorical keys under the streetlight. The memetically-successful strategy in the field is to tackle problems which are easy, rather than problems which are plausible bottlenecks to humanity's survival. That pattern of memetic fitness looks likely to continue to dominate the field going forward.
This post is on my best models of how we got here, and what to do next.
What This Post Is And Isn't, And An Apology
===========================================
This post starts from the observation that streetlighting has mostly won the memetic competition for alignment as a research field, and we'll mostly take that claim as given. Lots of people will disagree with that claim, and convincing them is not a goal of this post. In particular, probably the large majority of people in the field have some story about how their work is not searching under the metaphorical streetlight, or some reason why searching under the streetlight is in fact the right thing for them to do, or [...].
The kind and prosocial version of this post would first walk through every single one of those stories and argue against them at the object level, to establish that alignment researchers are in fact mostly streetlighting (and review how and why streetlighting is bad). Unfortunately that post would be hundreds of pages long, and nobody is ever going to get around to writing it. So instead, I'll link to:
* [Eliezer's List O' Doom](https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities)
* My own [Why Not Just...](https://www.lesswrong.com/s/TLSzP4xP42PPBctgw) sequence
* Nate's [How Various Plans Miss The Hard Bits Of The Alignment Challenge](https://www.lesswrong.com/posts/3pinFH3jerMzAvmza/on-how-various-plans-miss-the-hard-bits-of-the-alignment)
(Also I might link some more in the comments section.) Please go have the object-level arguments there rather than rehashing everything here.
Next comes the really brutally unkind part: the subject of this post necessarily involves modeling what's going on in researchers' heads, such that they end up streetlighting. That means I'm going to have to speculate about how lots of researchers are being stupid internally, when those researchers themselves would probably say that they are not being stupid at all and I'm being totally unfair. And then when they try to defend themselves in the comments below, I'm going to say "please go have the object-level argument on the posts linked above, rather than rehashing hundreds of different arguments here". To all those researchers: yup, from your perspective I am in fact being very unfair, and I'm sorry. You are not the intended audience of this post, I am basically treating you like a child and saying "quiet please, the grownups are talking", but the grownups in question are talking *about you* and in fact I'm trash talking your research pretty badly, and that is not fair to you at all.
But it is important, and this post just isn't going to get done any other way. Again, I'm sorry.
Why The Streetlighting?
=======================
A Selection Model
-----------------
First and largest piece of the puzzle: selection effects favor people doing easy things, regardless of whether the easy things are in fact the right things to focus on. (Note that, under this model, it's totally possible that the easy things *are* the right things to focus on!)
What does that look like in practice? Imagine two new alignment researchers, Alice and Bob, fresh out of a CS program at a mid-tier university. Both go into MATS or AI Safety Camp or get a short grant or [...]. Alice is excited about the [eliciting latent knowledge](https://www.lesswrong.com/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge) (ELK) doc, and spends a few months working on it. Bob is excited about [debate](https://www.lesswrong.com/tag/debate-ai-safety-technique-1), and spends a few months working on it. At the end of those few months, Alice has a much better understanding of how and why ELK is hard, has correctly realized that she has no traction on it at all, and pivots to working on technical governance. Bob, meanwhile, has some toy but tangible outputs, and feels like he's making progress.
... of course (I would say) Bob has not made any progress *toward solving any probable bottleneck problem of AI alignment*, but he has tangible outputs and is making progress on *something*, so he'll probably keep going.
And that's what the selection pressure model looks like in practice. Alice is working on something hard, correctly realizes that she has no traction, and stops. (Or maybe she just keeps spinning her wheels until she burns out, or funders correctly see that she has no outputs and stop funding her.) Bob is working on something easy, he has tangible outputs and feels like he's making progress, so he keeps going and funders keep funding him. How much impact Bob's work has on humanity's survival is very hard to measure, but the fact that he's making progress on *something* is easy to measure, and the selection pressure rewards that easy metric.
Generalize this story across a whole field, and we end up with most of the field focused on things which are easy, regardless of whether those things are valuable.
### Selection and the Labs
Here's a special case of the selection model which I think is worth highlighting.
Let's start with a hypothetical CEO of a hypothetical AI lab, who (for no particular reason) we'll call Sam. Sam wants to win the race to AGI, but also needs an AI Safety Strategy. Maybe he needs the safety strategy as a political fig leaf, or maybe he's honestly concerned but not very good at not-rationalizing. Either way, he meets with two prominent AI safety thinkers - let's call them (again for no particular reason) Eliezer and Paul. Both are clearly pretty smart, but they have very different models of AI and its risks. It turns out that Eliezer's model predicts that alignment is very difficult and totally incompatible with racing to AGI. Paul's model... if you squint just right, you could maybe argue that racing toward AGI is sometimes a good thing under Paul's model? Lo and behold, Sam endorses Paul's model as the Official Company AI Safety Model of his AI lab, and continues racing toward AGI. (Actually the version which eventually percolates through Sam's lab is not even Paul's actual model, it's a quite different version which just-so-happens to be even friendlier to racing toward AGI.)
A "Flinching Away" Model
------------------------
While selection for researchers working on easy problems is one big central piece, I don't think it fully explains how the field ends up focused on easy things in practice. Even looking at *individual* newcomers to the field, there's usually a tendency to gravitate toward easy things and away from hard things. What does that look like?
Carol follows a similar path to Alice: she's interested in the Eliciting Latent Knowledge problem, and starts to dig into it, but hasn't really understood it much yet. At some point, she notices a deep difficulty introduced by sensor tampering - in extreme cases it makes problems undetectable, which [breaks the iterative problem-solving loop](https://www.lesswrong.com/posts/xFotXGEotcKouifky/worlds-where-iterative-design-fails), breaks ease of validation, destroys potential training signals, etc. And then she briefly wonders if the problem could somehow be tackled without relying on accurate feedback from the sensors at all. At that point, I would say that Carol is thinking about the real core ELK problem for the first time.
... and Carol's thoughts run into a blank wall. In the first few seconds, she sees no toeholds, not even a starting point. And so she reflexively [flinches away](https://www.lesswrong.com/posts/YpXsZ4H67MaQsoyzz/historical-examples-of-flinching-away) from that problem, and turns back to some easier problems. At that point, I would say that Carol is streetlighting.
It's the reflexive flinch which, on this model, comes first. After that will come rationalizations. Some common variants:
* Carol explicitly introduces some assumption simplifying the problem, and claims that without the assumption the problem is impossible. (Ray's workshop on one-shotting Baba Is You levels [apparently reproduced this phenomenon](https://www.lesswrong.com/posts/8ZR3xsWb6TdvmL8kx/optimistic-assumptions-longterm-planning-and-cope) very reliably.)
* Carol explicitly says that she's not trying to solve the full problem, but hopefully the easier version will make useful marginal progress.
* Carol explicitly says that her work on easier problems is only intended to help with near-term AI, and hopefully those AIs will be able to solve the harder problems.
* (Most common) Carol just doesn't think about the fact that the easier problems don't really get us any closer to aligning superintelligence. Her social circles act like her work is useful somehow, and that's all the encouragement she needs.
... but crucially, the details of the rationalizations aren't that relevant to this post. Someone who's flinching away from a hard problem will always be able to find some rationalization. Argue them out of one (which is itself difficult), and they'll promptly find another. If we want people to not streetlight, then we need to somehow solve the flinching.
Which brings us to the "what to do about it" part of the post.
What To Do About It
===================
Let's say we were starting a new field of alignment from scratch. How could we avoid the streetlighting problem, assuming the models above capture the core gears?
First key thing to notice: in our opening example with Alice and Bob, Alice *correctly* realized that she had no traction on the problem. If the field is to be useful, then somewhere along the way someone needs to actually have traction on the hard problems.
Second key thing to notice: if someone actually has traction on the hard problems, then the "flinching away" failure mode is probably circumvented.
So one obvious thing to focus on is getting traction on the problems.
... and in my experience, there *are* people who can get traction on the core hard problems. Most notably physicists - when they grok the hard parts, they tend to immediately see footholds, rather than a blank impassable wall. I'm picturing here e.g. the sort of crowd at [the ILIAD conference](https://www.lesswrong.com/posts/r7nBaKy5Ry3JWhnJT/announcing-iliad-theoretical-ai-alignment-conference); these were people who mostly did not seem at risk of flinching away, because they saw routes to tackle the problems. (To be clear: although ILIAD was a theory conference, I do *not* mean to imply that it's only theorists who ever have any traction.) And they weren't being selected away, because many of them were in fact doing work and making progress.
Ok, so if there are a decent number of people who can get traction, why do the large majority of the people I talk to seem to be flinching away from the hard parts?
How We Got Here
---------------
The main problem, according to me, is the EA recruiting pipeline.
On my understanding, EA student clubs at colleges/universities have been the main “top of funnel” for pulling people into alignment work during the past few years. The mix of people going into those clubs is disproportionately STEM-focused undergrads, and looks pretty typical for STEM-focused undergrads. We're talking about pretty standard STEM majors from pretty standard schools, neither the very high end nor the very low end of the skill spectrum.
... and that's just not a high enough skill level for people to look at the core hard problems of alignment and see footholds.
Who To Recruit Instead
----------------------
We do not need pretty standard STEM-focused undergrads from pretty standard schools. In practice, the level of smarts and technical knowledge needed to gain any traction on the core hard problems seems to be roughly "physics postdoc". Obviously that doesn't mean we exclusively want physics postdocs - I personally have only an undergrad degree, though amusingly [a list of stuff I studied has been called "uncannily similar to a recommendation to readers to roll up their own doctorate program"](https://www.lesswrong.com/posts/bjjbp5i5G8bekJuxv/study-guide?commentId=euo2anBF6PXhKYrju). Point is, it's the rough level of smarts and skills which matters, not the sheepskin. (And no, a doctorate degree in almost any other technical field, including ML these days, does not convey a comparable level of general technical skill to a physics PhD.)
As an alternative to recruiting people who have the skills already, one could instead try to train people. I've [tried that](https://www.lesswrong.com/posts/nvP28s5oydv8RjF9E/mats-models) to some extent, and at this point I think there just isn't a substitute for [years of technical study](https://www.lesswrong.com/posts/bjjbp5i5G8bekJuxv/study-guide). People need that background knowledge in order to see footholds on the core hard problems.
Integration vs Separation
-------------------------
Last big piece: if one were to recruit a bunch of physicists to work on alignment, I think it would be useful for them to form a community mostly-separate from the current field. They need a memetic environment which will amplify progress on core hard problems, rather than... well, all the stuff that's currently amplified.
This is a problem which might solve itself, if a bunch of physicists move into alignment work. Heck, we've already seen it to a very limited extent with the ILIAD conference itself. Turns out people working on the core problems want to talk to other people working on the core problems. But the process could perhaps be accelerated a lot with more dedicated venues.
+1 -1
@@ -1,5 +1,5 @@
---
title: The Intelligence Curse
title: "The Intelligence Curse"
date: 2025-01-04 18:16:58.921000+00:00
url: https://www.lesswrong.com/posts/Mak2kZuTq8Hpnqyzb/the-intelligence-curse
novelty: 0.6880437127968582
+1 -1
@@ -1,5 +1,5 @@
---
title: The Laws of Large Numbers
title: "The Laws of Large Numbers"
date: 2025-01-04 18:06:02.387000+00:00
url: https://www.lesswrong.com/posts/EhTMM77iKBTBxBKRe/the-laws-of-large-numbers
novelty: 0.5409323783864814
@@ -0,0 +1,129 @@
---
title: "The subset parity learning problem: much more than you wanted to know"
date: 2025-01-04 09:59:16.158000+00:00
url: https://www.lesswrong.com/posts/Mcrfi3DBJBzfoLctA/the-subset-parity-learning-problem-much-more-than-you-wanted
novelty: 0.6979456336147705
score: 0.9548193216323853
baseScore: 64
voteCount: 27
---
Imagine that you're looking for buried treasure worth a billion dollars on a large desert island. You don't have a map, but a mysterious hermit offers you a box with a button to help find the treasure. Each time you press the button, it will tell you either “warmer” or “colder”. But there's a catch. With probability $2^{-100}$, the box will tell you the truth about whether you're closer than you were last time you pressed. But with the remaining probability of $1 - 2^{-100}$ (about .9999999999999999999999999999992), the box will make a random guess between “warmer” and “colder”. Should you pay \$1 for this box?
Keep this in mind as we discuss the closely related problem of *parity learning*.
In my experience of interacting with the ML and interpretability communities, the majority of people don't know about the impossibility result of the *parity learning problem*. The ones who do will often assume that this is a baroque, complicated result that surely doesn't have a simple proof (another surprising opinion I've heard is of people knowing about the result, but saying that “there's some new architecture that seems to solve it, actually”, which is somewhat indicative of people's trust in the concept of “proof” in the ML community).
Recently I was pleasantly surprised to realize that the impossibility of solving this problem (via a gradient-based learning algorithm in polynomial time) actually admits a pretty nice and understandable proof. The gist of it boils down to the silly “pirate treasure” story above: the answer, of course, is that you shouldn't buy the box (at least if you're trying to maximize your expected income), and for the same reason you can't build a cool new architecture that solves the parity learning problem.
In this post I'll briefly explain the problem, how it's different from some other impossibility results, and why I think it's important. This will later tie in to a series of posts about the insufficiency of Bayesian methods in (realistic) ML contexts.
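To put rough numbers on the box (a back-of-the-envelope bound, not part of the formal argument): if no press is ever truthful, every answer is a coin flip that carries no information about the treasure, so the box is worth at most the probability of getting at least one truthful answer times the value of the treasure. Assuming, purely for illustration, a budget of $k = 10^{12}$ presses, a union bound gives:

$$\Pr[\text{at least one truthful answer in } k \text{ presses}] \le k \cdot 2^{-100} \approx 10^{12} \cdot 7.9 \times 10^{-31} = 7.9 \times 10^{-19}$$

$$\text{expected value of the box} \le 7.9 \times 10^{-19} \cdot \$10^{9} \approx \$8 \times 10^{-10}$$

Under these assumptions the box is worth less than a billionth of a dollar, far below the \$1 asking price.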
What is the parity learning problem?
====================================
This is a post where a bit of math can go a long way, but I'll try to make it approachable to anyone with either a bit of math or a bit of ML background. First, the XOR parity *function* is a function $\pi_S : \{0,1\}^n \to \{0,1\}$ from a length-$n$ boolean input to a single boolean output. This function depends on a “secret” variable $S$, which is a subset: $S \subset [n]$ (here $[n] = \{0, \dots, n-1\}$ is the standard set on $n$ elements). On an input vector $\vec{v} \in \{0,1\}^n$, the function $\pi_S$ outputs $\sum_{s \in S} v_s \bmod 2 \in \{0,1\}$. In other words, you look at the bits of $v$ at the indices in $S$, take their sum, and take its parity. Note (for people with some math background) that the value $\pi_S(v)$ can be written more nicely as $v_S \cdot v$, where we replace the set $S$ by the vector $v_S$ whose coefficients are $(v_S)_i = 1$ at indices $i \in S$ and $(v_S)_i = 0$ otherwise. The vectors $v, v_S$ can then be interpreted as having coefficients in the field $\mathbb{Z}/2$ of two elements, and $v \cdot v_S$ denotes their dot product in this field (i.e., the sum of products of coordinates).
Now the parity learning result says that it's not possible to solve the parity problem as a learning problem in polynomial time. This statement should be interpreted carefully. First, note that we haven't defined what a learning problem is. A special case of a learning problem is any (polynomial-complexity) weight-based ML architecture that learns via SGD on cross-entropy loss (together with any choice of batch size, update step, initialization protocol, etc.). We will take this as our definition of “learnability” for the sake of this post, though later I'll point out that our proof also shows that a much larger class of methods is incapable of solving parity. (On the other hand, as we'll see, an undergraduate with a few hours to spare *can* solve the parity problem in polynomial time, and with a bit more time can even hand-select weights in an ML architecture to execute their solution.) The second thing to be careful of is that, for any *fixed* choice of “hidden” subset $S$, it is possible to design an algorithm that learns the parity problem. Indeed, you can simply initialize the architecture to the “right solution”. So it's important to conceptualize $S$ here as secret or random.
More concretely, the problem can be conceptualized as a game between two players A and B. Player A *randomly* chooses a secret subset $S \subset [n]$, where $[n] = \{0, \dots, n-1\}$ is the standard set on $n$ elements (there are $2^n$ subsets, including all of $[n]$ and the empty set, so each is chosen with probability $1/2^n$). Player B commits to a (polynomial-sized) *learning algorithm* $M$, which for us means an architecture, initialization scheme, and class of hyperparameters for a gradient-based learning scheme like SGD. Player A then randomly generates a number of sample boolean vectors $v_1, \dots, v_N \in \{0,1\}^n$, with $N$ some agreed-upon number that depends at most polynomially on $n$, and player B trains the learning algorithm for $N'$ steps, where $N'$ is another (large) number that is nevertheless polynomially bounded in $n$.
The theorem then says that no matter what learning algorithm player B chose, the probability that the setup will learn an algorithm with >51% accuracy is effectively zero (i.e., it's exponentially small in $n$).
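As a concrete illustration of the two equivalent descriptions above (sum of selected bits mod 2, and dot product with the indicator vector over Z/2), here is a minimal Python sketch. The function names `parity`, `indicator`, and `dot_mod2` are my own choices for illustration, not notation from anywhere else.

```python
from itertools import product


def parity(S, v):
    """Parity function: sum of the bits of v at the indices in S, mod 2."""
    return sum(v[i] for i in S) % 2


def indicator(S, n):
    """Indicator vector of the subset S of {0, ..., n-1}."""
    return [1 if i in S else 0 for i in range(n)]


def dot_mod2(u, v):
    """Dot product of two 0/1 vectors in the field Z/2."""
    return sum(a * b for a, b in zip(u, v)) % 2


n = 4
S = {0, 2, 3}
v = (1, 0, 1, 1)
# Bits at indices 0, 2, 3 are 1, 1, 1; their sum 3 is odd, so this prints 1.
print(parity(S, v))

# The two descriptions agree on every one of the 2^n inputs.
v_S = indicator(S, n)
print(all(parity(S, w) == dot_mod2(v_S, w) for w in product((0, 1), repeat=n)))
```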
This “hidden guessing” game is about P vs. NP, isn't it?
========================================================
This is another common confusion, and the answer is no. Most theoretical computer scientists [believe that](https://scottaaronson.blog/?p=122) $P \neq NP$. And assuming this[[1]](#fn43nx54h76ig) gives another source of unsolvable learning problems. Indeed, if you were to give player A in the above game more freedom, and allowed them to write down any (suitably) randomly chosen circuit (or suitable random ML algorithm) for their “target” function, you get another impossibility result. Here the assumption $P \neq BPP$ implies that there is no way, in polynomial time, to get reliable information about A's secret circuit $C$ (beyond some statistical regularities) from looking at polynomially many input-output samples of $C$. This in particular implies that there is no way to guarantee sufficiently accurate behavior of the result of a learning algorithm, since a learning algorithm is a special case of a (probabilistic) polynomial algorithm. But the XOR impossibility is in fact a much more satisfying result. It doesn't require any assumptions about P vs. NP (and is true mathematically and unconditionally), and even more nicely, there actually *does* exist a (probabilistically) polynomial-time algorithm to solve it. In other words, we have the following containments (where note that I'm being sloppy about exactly what an “algorithm” is):
$\text{BPP-invertible} \subset \text{NP-invertible}$
$\text{Learning algorithms} \subset \text{BPP-invertible}.$
And in the case of the “XOR parity” problem, it shows that (without making any assumptions on the first containment, i.e., about P vs. NP), the second containment is *proper*: i.e., there are polynomially invertible algorithms which are impossible to execute as learning algorithms.
To convince you that this isn't some deep hidden knowledge, I'll explain in the following section how an undergrad with a semester of abstract algebra can solve the XOR parity problem. Since knowing the specific solution isn't critical, I'll assume a bit of abstract algebra here, and people without the necessary context can safely skip the following section. To be clear about its result, before going on I'll write the upshot:
*Upshot of the following section*:
In the setup as above, with player A having a hidden subset $S$ and player B receiving $N$ samples of the input-output behavior of the parity function $\pi_S(\vec{v})$ for boolean inputs $\vec{v}$, it is possible for B to recover the subset $S$ (and thus the function $\pi_S$) in time polynomial in $N$, with overwhelming probability.
In fact, it's sufficient to look at $N = 2n$ samples (and the running time of the solution algorithm is cubic or better in $n$).
Polynomial-time non “learning-algorithmic” solution
---------------------------------------------------
For this section I'm assuming some linear algebra over finite fields; if this isn't your jam, skip to the next section.
The basic idea is to replace the function $\pi_S(v)$ with the dot product $v_S \cdot v$. The $N$ random samples can then be understood as an overdetermined system of linear equations over the field $\mathbb{Z}/2$ with two elements. Namely, given our $N$ random samples $v_1, \dots, v_N$, we write down an $N \times n$ matrix $M$ whose length-$n$ rows are $v_1^T, \dots, v_N^T$. Because the samples are chosen randomly, these are random boolean vectors. It is now a standard theorem that given $N > n$ such vectors, the probability that they contain $n$ linearly independent ones (i.e., that $M$ has full rank $n$) goes asymptotically as $1 - 2^{n-N}$. As soon as $N > 2n$, the probability that this fails is $< 1/2^n$, i.e., is negligible (in fact, in the formal sense of going to zero faster than any inverse polynomial).
Thus we can safely assume that the matrix $M$ of sample input vectors has full rank. Now in the assumption of the problem, A gave us both the vectors $v_i$ and also the values $v_i \cdot v_S = b_i \in \{0,1\}$. We can convert this to a system of linear equations (in $\mathbb{Z}/2$) on the secret vector $v_S$. Namely, we have
$M \cdot v_S = b$, for $b = (b_1, b_2, \dots, b_N)$ the vector of parities. Now the full rank of $M$ implies that $v_S$ is reconstructible in polynomial time. For example if $n = N$ and already the first $n$ sample inputs are linearly independent, then we can write $v_S = M^{-1} b$, and inverting a (boolean) matrix is doable in cubic time.
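The linear-algebra recipe above can be sketched in a few lines of Python. This is an illustrative sketch rather than a tuned implementation: it uses plain Gauss-Jordan elimination over Z/2 (XOR as addition), and the names `solve_gf2`, `secret`, `samples`, and `labels`, as well as the choices `n = 16` and the random seed, are hypothetical.

```python
import random


def parity(S, v):
    """Parity function: sum of the bits of v at the indices in S, mod 2."""
    return sum(v[i] for i in S) % 2


def solve_gf2(rows, rhs, n):
    """Solve M x = b over Z/2 by Gauss-Jordan elimination.

    rows is a list of length-n 0/1 lists (the rows of M), rhs the 0/1 vector b.
    Returns x as a list of bits, or None if M does not have full rank n.
    """
    aug = [list(r) + [b] for r, b in zip(rows, rhs)]  # augmented matrix [M | b]
    pivot_row = 0
    for col in range(n):
        # Find a row at or below pivot_row with a 1 in this column.
        pr = next((r for r in range(pivot_row, len(aug)) if aug[r][col]), None)
        if pr is None:
            return None  # this column has no pivot: rank deficient
        aug[pivot_row], aug[pr] = aug[pr], aug[pivot_row]
        # Clear this column in every other row (addition in Z/2 is XOR).
        for r in range(len(aug)):
            if r != pivot_row and aug[r][col]:
                aug[r] = [a ^ b for a, b in zip(aug[r], aug[pivot_row])]
        pivot_row += 1
    # The first n rows now form the identity, so the last column is the solution.
    return [aug[i][n] for i in range(n)]


rng = random.Random(0)
n = 16
secret = {i for i in range(n) if rng.random() < 0.5}  # player A's hidden subset
N = 2 * n
samples = [[rng.randint(0, 1) for _ in range(n)] for _ in range(N)]
labels = [parity(secret, v) for v in samples]

x = solve_gf2(samples, labels, n)
recovered = None if x is None else {i for i, bit in enumerate(x) if bit}
print(recovered == secret)
```

Each of the `n` pivot steps touches at most `N` rows of length `n + 1`, so with `N = 2n` the whole procedure takes on the order of `n^3` bit operations, in line with the cubic bound mentioned above.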
Before going on, note that once we've correctly guessed the secret subset $S \subset [n]$, we can write down a straightforward MLP that executes the parity XOR function $\pi_S$. Indeed, assume $S$ has $k$ elements (with $k \le n$ since $S$ is a subset of an $n$-element set) and let $i_1, \dots, i_k \in [n]$ be the elements of $S$, in order. Then we can recursively write $\pi_S(v)$ as
$$\pi_S(v) = \mathrm{XOR}(v_{i_1}, \mathrm{XOR}(v_{i_2}, \mathrm{XOR}(v_{i_3}, \dots \mathrm{XOR}(v_{i_{k-1}}, v_{i_k}) \dots))).$$
Now the XOR of two boolean elements is straightforward to write down as an MLP with a single hidden layer (whether using ReLU or any other activation function), and appropriately stacking fewer than $n$ of them together gives a polynomial-size neural net that executes our hidden function (in fact, utilizing some parallelization allows this to be done in $O(\log k)$ layers).
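To make the hand-built network concrete, here is a minimal sketch of one way to do it, assuming ReLU activations. The specific weights, namely XOR(a, b) = ReLU(a + b) - 2·ReLU(a + b - 1), are one possible choice among many, and the names `xor_unit` and `parity_net` are hypothetical. The sketch chains the XOR units sequentially rather than using the logarithmic-depth parallel arrangement.

```python
from itertools import product


def relu(x):
    return max(0.0, x)


def xor_unit(a, b):
    """XOR of two 0/1 inputs as a tiny MLP with two hidden ReLU units."""
    h1 = relu(a + b)        # hidden unit 1: weights (1, 1), bias 0
    h2 = relu(a + b - 1.0)  # hidden unit 2: weights (1, 1), bias -1
    return h1 - 2.0 * h2    # output weights (1, -2)


def parity_net(S, v):
    """Chain XOR units over the indices in S, starting from 0."""
    out = 0.0
    for i in sorted(S):
        out = xor_unit(out, v[i])
    return out


S = {0, 2, 3}
n = 5
# Check the hand-built net against the definition on all 2^n inputs.
print(all(parity_net(S, v) == sum(v[i] for i in S) % 2
          for v in product((0, 1), repeat=n)))
```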
Handwavy proof of non-learnability
==================================
Welcome back to non-math people. This section is also slightly technical and can be skipped by people who don't care about understanding the proof, but it doesn't require any abstract algebra background.
At the end of the day, we have a function $\pi_S$ that we're claiming can be represented by a (polynomial-sized) neural net, but cannot be learned in polynomial time as such. How can one go about showing this? The important bits of information to collect here are the following:
1. The functions $\pi_S$ for different subsets $S$ form a basis[[2]](#fn8fign6n1mt) of all (real-valued) functions on boolean inputs. (Up to rescaling and subtracting a constant, this is also called the “Fourier basis” of functions on boolean inputs.)
2. Any two functions $\pi_S$ and $\pi_T$ for two different subsets are *uncorrelated* on the set of all inputs[[3]](#fnobk781pfeua). This is key: even if $S$ and $T$ differ in only one element, the half of input vectors $v$ that have 1 at that input will have different parities on $S$ and $T$.
3. The randomness in the choice of N samples leads to noise in the updates, on the order of 1/polynomial(N).
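The uncorrelatedness in point 2 is easy to check by brute force for small n. The sketch below is illustrative only: it uses the rescaled ±1 version of the parity functions (the "Fourier basis" form, 1 minus twice the 0/1 parity), and the names `chi` and `correlation` are hypothetical.

```python
from itertools import combinations, product


def chi(S, v):
    """Rescaled parity: +1 if the bits of v at indices in S have even sum, else -1."""
    return -1 if sum(v[i] for i in S) % 2 else 1


def correlation(S, T, n):
    """Average of chi_S(v) * chi_T(v) over all 2^n boolean inputs v."""
    total = sum(chi(S, v) * chi(T, v) for v in product((0, 1), repeat=n))
    return total / 2 ** n


n = 4
subsets = [frozenset(c) for k in range(n + 1) for c in combinations(range(n), k)]

# Distinct subsets are exactly uncorrelated; each function has correlation 1 with itself.
print(all(correlation(S, T, n) == (1.0 if S == T else 0.0)
          for S in subsets for T in subsets))
```

Since there are 2^n such mutually orthogonal functions on a 2^n-dimensional space of functions, this check also illustrates why they form a basis, as in point 1.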
From these facts we see that:
* The gradient update can be decomposed into $2^n$ components, one associated to each subset $S$.
* The coefficient of the gradient update along each incorrect direction $T$ can be modeled as a random variable, and is comparable in size (up to a polynomial multiple) to the update in the “true” direction $S$.
Of course, a priori the proof above assumes that the space of possible functions $f_w(v)$ associated to possible weight parameter values $w$ coincides with the $2^n$-dimensional space of all possible functions on boolean inputs. Since we've assumed that the number of parameters $d_{\text{weight}}$ is polynomial in $n$, this isn't the case: rather, the vector space of possible gradient updates is constrained to be in some $d_{\text{weight}}$-dimensional subspace of the $2^n$-dimensional space of functions with boolean inputs. Equivalently, all the above $2^n$ possible update directions are projected to some low-dimensional subspace (and suitably normalized).
At the end of the day, we can model the gradient update as a noise vector of some fixed size (that is inverse-polynomial in $n$, and is associated to the randomness of drawing random inputs), plus a projection of a “signal” vector associated to $S$ in $\mathbb{R}^{2^n}$ to some poly($n$)-dimensional subspace. Now standard considerations of high-dimensional projections imply that the “signal” vector might have significant size for some small (polynomial in $n$) number of “special” subsets $S$, but for the vast majority of choices of $S$, it will be suppressed by a massive factor proportional to the square root of dimension: $1/\sqrt{2^n}$, and will be completely swamped by the noise, even after polynomially many update steps: thus the problem of gradient updating to the correct parity algorithm boils down (more or less) to the problem of the pirate-treasure hunter with the very unreliable box.
The discerning reader will see that I swept a significant chunk of not only the proof, but even the logical flow of the argument under the rug: this is perhaps better described as the “intuition” behind the proof rather than a sketch. However, importantly, this “intuition” applies to absolutely any (polynomial-sized) architecture, and to a much more general context than SGD: any learning algorithm, including SGD, Adam, and even more sophisticated local Bayesian learning setups, will fail for the same reasons.
In fact, what we really used about the SGD “learning algorithm” was that it has some noise and that its updating process only uses information averaged over input samples. There is a general result that any learning algorithm that only uses this information cannot learn parity (in polynomial time). The definition of this class of algorithms and its relationship with various learnability and complexity results constitutes the beginning of the classical field of computational learning theory. For a nice compressed introduction which in particular formalizes the proof discussed here, see [this paper](https://arxiv.org/pdf/2004.00557).
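The square-root suppression can be seen in one line, under simplifying assumptions that are mine rather than part of the formal proof: write $\chi_S$ for the parity directions normalized to be orthonormal, $D = 2^n$ for the dimension of the function space, $P$ for the orthogonal projection onto the $d$-dimensional subspace of possible updates, and $d = \mathrm{poly}(n)$. Then

$$\sum_{S \subset [n]} \|P \chi_S\|^2 = \operatorname{tr}(P) = d, \qquad \text{so} \qquad \frac{1}{D} \sum_{S \subset [n]} \|P \chi_S\|^2 = \frac{d}{2^n}.$$

The root-mean-square size of the projected signal is therefore $\sqrt{d / 2^n} = \sqrt{d} \cdot 2^{-n/2}$. Since the squared norms sum to only $d$, at most $d/\epsilon^2$ subsets can have projected signal of size $\epsilon$ or more, which is the sense in which only polynomially many “special” subsets can escape the suppression.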
Alternative point of view: lack of incremental pathways
-------------------------------------------------------
An alternative point of view on the failure of learnability in this case is that there is no good way for an algorithm to *incrementally* learn parity. There is no story of learning parity that starts from simple algorithms (in some quantitative or even qualitative, Occam razor-esque sense) and recursively learns added epicycles of complexity which improve classification accuracy. For example if we were to try to approximate the parity function $\pi_S$ by parity functions of smaller subsets, we would totally fail (as parity functions associated to different subsets are uncorrelated); a stronger version of the “lack of incremental pathways” result can be made following a similar intuition to the proof sketch above. This supports the idea that in order to be learnable, an algorithm must in some sense be combinable (at least in a local sense) out of simpler pieces, each of which is “findable” (i.e., doesn't require exponential luck to get right; in later posts we will identify this with notions of effective dimension) and each of which reduces loss. This is closely related to the “[low-hanging fruit prior](https://www.lesswrong.com/posts/SbzptgFYr272tMbgz/the-low-hanging-fruit-prior-and-sloped-valleys-in-the-loss)” point of view, and will later serve as a lead-in to a discussion of “learning stories”.
Does this mean that neural nets are weak?
=========================================
Now that weve seen that neural nets trained on examples of the parity prbolem are *provably* incapable of learning it in polynomial time, it is reasonable to ask whether this is a hard limitation on the computational capabilities of neural nets. Indeed, I just explained that it is provably impossible to use a learning algorithm (such as an LLM) to solve a problem that can be easily solved by an undergraduate, at least in an amount of time that is shorter than the length of the universe. Does this negate the possibility that modern LLMs can solve hard-core math problems? Can we stop worrying about human-level AI?
Unfortunately (if you’re worried about AI risk), we can’t. The impossibility of XOR learning does not imply any limitation on the mathematical ability of LLMs. The issue here is with the notion of “learnability”. In the setup of our XOR problem, we assumed that the LLM is executing SGD learning (or another learning algorithm) on the single learning signal of “what is the parity function applied to the vector v”. If we were to give the parity problem to an advanced LLM, it might be able to solve it, but this would not be from gradient updates on seeing a bunch of examples. Rather, our LLM has seen many mathematical texts, and may be able to use the knowledge in these mathematical texts and a basic understanding of logic to reconstruct the hidden subset S and the parity function πS. Abstracting away the high-level “mathematical understanding” of LLMs, what this is saying is that it *is* in fact possible to learn the parity problem if the direct learning problem is replaced by a suitably sophisticated [curriculum learning](https://en.wikipedia.org/wiki/Curriculum_learning)-style problem with an enriched class of examples and a more sophisticated loss function. Trying to write out a simple “mathematical” ML algorithm that learns to solve the parity problem is an interesting exercise that might constitute a nice ML theory paper; I won’t try to do this here.
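For contrast with gradient-based learning, here is a sketch of the kind of direct solution a person with linear algebra might write down: treat each labeled example as a linear equation over GF(2) and solve for S by Gaussian elimination. This is an assumption about what "easily solved" would look like in practice, not a method taken from the post, and it is emphatically not the curriculum-style ML algorithm mentioned above. The function names, n = 16, the subset size, and the sample count are all hypothetical.

```python
import random

def parity(subset, x):
    """pi_S(x): sum of the bits of x indexed by `subset`, taken mod 2."""
    return sum(x[i] for i in subset) % 2

def recover_subset(samples, n):
    """Solve x . s = y (mod 2) for the indicator vector s of the hidden subset.

    Each sample (x, y) is packed into one integer: bits 0..n-1 hold x, bit n holds y.
    Returns the sorted tuple of indices in S, or None if the samples leave S underdetermined.
    """
    rows = [sum(bit << i for i, bit in enumerate(x)) | (y << n) for x, y in samples]
    pivots = []  # (column, row) pairs, kept fully reduced
    for col in range(n):
        idx = next((i for i, r in enumerate(rows) if (r >> col) & 1), None)
        if idx is None:
            return None  # no remaining equation constrains this coordinate
        pivot = rows.pop(idx)
        # Clear this column from every other row (XOR is addition mod 2).
        rows = [r ^ pivot if (r >> col) & 1 else r for r in rows]
        pivots = [(c, p ^ pivot if (p >> col) & 1 else p) for c, p in pivots]
        pivots.append((col, pivot))
    # Each pivot row now reads "s_col = y", so the label bit gives membership in S.
    return tuple(sorted(c for c, p in pivots if (p >> n) & 1))

rng = random.Random(0)
n = 16
hidden = tuple(sorted(rng.sample(range(n), 5)))   # hypothetical hidden subset S
xs = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(3 * n)]
samples = [(x, parity(hidden, x)) for x in xs]

recovered = recover_subset(samples, n)
print(recovered == hidden)
```

The design point is that elimination exploits the global algebraic structure of the problem all at once, using a number of examples linear in n, whereas the argument above concerns learners that must follow local improvements in loss.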
Not weak, but also not optimal
==============================
The main reason I want people in AI safety/interpretability to know about and understand parity is related to a long-standing question in machine learning of Bayesian learning vs. SGD, where the conventional wisdom has been wrong, but (in my limited understanding) is finally starting to converge in the correct direction (as will be typical of these posts, this is the “it’s complicated” direction). Namely, you can ask whether SGD (and related learning algorithms) can be well understood as finding the optimal solution – or more precisely, as sampling a suitable “Bayesian” distribution of near-optimal solutions[[4]](#fnng2bd55znsd). It is easy to see that learning sufficiently general algorithms cannot converge to anything like Bayesian learning for P vs. NP reasons. But a standard counterargument, supported by a standard collection of faulty papers (that I’ll complain about later), was that “real life” problems where deep learning is applied do converge to the Bayesian prior.
One (soft) takeaway from the discussion here is that if training “real-life” modern LLMs involves reasoning in the same reference class as parity, then it is likely that the algorithm they learn is not globally optimal (in a Bayesian sense). Indeed, we see from parity that optimal algorithms in this reference class lack the incremental pathways necessary to be learnable via SGD, and the way that LLMs solve complex problems probably is mediated by curriculum-learning-style “training wheels” that learn general solutions, just not of the most efficient type.[[5]](#fn6h2itetjpjt)
Acknowledgments
===============
Ive talked to lots of people about this, but particularly important for this post have been a number of conversations with Kaarel Hänni on related topics. I also want to thank Sam Eisenstat for first telling me about parity and the notion of learnability, and thanks to Jake Mendel and Lucius Bushnaq for related discussions.
1. **[^](#fnref43nx54h76ig)**
More precisely, the probabilistic version P≠BPP, and I might be assuming some cryptographic hardness in other places in this section (but not in the rest of the post!).
2. **[^](#fnref8fign6n1mt)**
Technically, there are 2^n functions but they only span a (2^n − 1)-dimensional subspace, since the "empty-set parity" function π∅ is zero; to make this statement precise, one needs to replace π∅ by a constant function. A more commonly used related basis is the "Boolean Fourier mode" basis with basis elements 1−2πS, which replaces the {0,1}-valued parity functions by ±1-valued analogs. Working with this basis is generally nicer, and in particular makes "uncorrelatedness" arguments cleaner.
3. **[^](#fnrefobk781pfeua)**
They are uncorrelated on the set of *all* inputs (i.e., the expected value of πS doesn't change even if you condition on a specific value of πT), but they are correlated ("in a random way") on some fixed polynomial-sized set of "training" inputs. In the latter training-set context, they are "not very" correlated, and the correlations can be proved to be suitably "unbiased" when viewed as a noise term.
4. **[^](#fnrefng2bd55znsd)**
This is normally defined as the Boltzmann distribution associated to loss, an object particularly [important to Singular Learning Theory](https://www.lesswrong.com/s/czrXjvCLsqGepybHC/p/xRWsfGfvDAjRWXcnG).
5. **[^](#fnref6h2itetjpjt)**
Note that this isn’t even an “intuition-level” proof: it’s not obvious that modern ML methods require knowing how to solve problems in the reference class of “parity for n-bit inputs such that 2^n is very large”. And even if this were the case, it’s not obvious that the “training wheels” for learning parity-like problems aren’t exactly what is needed to produce Bayes-optimal algorithms for some important simpler problems. Later when we discuss connections between physics and ML, we’ll see other more rigorous reasons to dismiss strong versions of the SGD = Bayesian hypothesis. But at the same time, it’s important to note that in many contexts (namely, when looking locally in a basin, or looking at simple “circuit-level” behaviors that don’t have enough accumulated complexity to break out of a low-dimensional paradigm), it is reasonable and productive to make little distinction between the two types of learning.
@@ -1,5 +1,5 @@
---
title: Whats the short timeline plan?
title: "Whats the short timeline plan?"
date: 2025-01-05 00:10:28.708000+00:00
url: https://www.lesswrong.com/posts/bb5Tnjdrptu89rcyY/what-s-the-short-timeline-plan
novelty: 0.8981608613682581