2024-12-21 yapping to friend about o1/o3/search

this is a monologue




so the way i like thinking abt it is

models like 4o, claude sonnet, etc .. their pattern matching / interpolation / wtvr is all prettyyyy analogous to human type 1 thinking .. like imagine if u were asked a question and you gave zero conscious thought about it and just immediately spewed out words based on intuition

(the analogy is not perfect but i think its pretty good so ill continue using it here)

humans are actually pretty bad at type 1 thinking for things like math .. like imagine if you had to do an APMA1650 hw pset with absolutely zero conscious thinking like youre not allowed to consciously think at all .. youd prob get a 0% unless youre like insanelyyyyy cracked then maybe you solve like 1/100 questions like only the rly rly rly trivial ones where the answer is immediately obvious with no conscious thought


gpt 4o is actually insanely good at this, for what it is …. like . if there was a human that had that level of immediate intuition, theyd be probably the most cracked guy in history, like theres really just no human at that level


but this has only been possible with recent advances in compute (lots and lots of money to use lots and lots of GPUs so that we can use models with billions n billions of params) + recent advances in pretraining (idk much abt this lmao but i just know its important)

like how we got from gpt 1 all the way to now having gpt 4o


so the way i like thinking abt it is // models like 4o, [...]

wait going back to this — it’s crazy how far humans can reason when our type 1 thinking is actually genuinely so pitiful compared to gpt 4o

ykwim?

like . 4o is probably 100-10000x better than humans at this type 1 thinking, but humans are somehow able to sequence together many many steps of our (relatively) extremely bad type 1 thinking that then allows us to reason and do things like solve an apma1650 hw pset


humans are like 100-10000x worse at type 1 thinking but somehow we have this skill/process that some people practice that allows us to do things like math, physics, coding, etc .. or just anything that requires Reasoning


now my hypothesis (even before o3 was released) is that everything is just type 1 thinking. like i mainly believe type 2 thinking (i.e. conscious step by step reasoning) is still just lots of type 1 processes working together


i was journaling abt this a week or two ago but i think this tweet i saw yesterday articulates it succinctly with a cool CS analogy:
https://x.com/Rares82/status/1617962198546145281


so then, assuming my hypothesis about human reasoning is true (i.e. type 2 thinking is just built from type 1 things), i feel like this strongly implies that AI reasoning (type 2 thinking) can be built by just using LLMs (type 1 things) but in a sort of different direction than how we’ve been scaling/training up until now.

like up until now we’ve just been training the AI to have better n better “immediate intuition” or wtvr

but now we have to piece together a few LLMs and somehow train them as a whole system that can reason
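
like, a toy version of what i mean (pure sketch .. intuit() here is a made-up stand-in for one call to a small type-1 model, not any real API):

```python
# toy sketch: "type 2" reasoning as a loop of cheap "type 1" calls.
# intuit() is a hypothetical stand-in for one immediate, no-deliberation
# guess from a small LLM

def intuit(prompt: str) -> str:
    return "..."  # imagine a small LLM call here

def reason(question: str, max_steps: int = 16) -> str:
    scratchpad = question
    for _ in range(max_steps):
        step = intuit(scratchpad + "\nnext step:")  # one type-1 step
        scratchpad += "\n" + step                   # chain it onto the context
        if intuit(scratchpad + "\ndone? yes/no:").lower().startswith("yes"):
            break
    return intuit(scratchpad + "\nfinal answer:")   # read the answer off
```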


so the experiment i wanted to carry out was smth along the lines of . can i train a very very small LLM (mimicking human-level performance of the “immediate intuition”) to coordinate with another LLM (and perhaps a few more things) and then run RL on the reasoning process by having a verifier (like a teacher checking ur work) and another thing to critique the process more generally (just like Anthropic’s Constitutional AI approach, but instead of aiming for Helpful Honest Harmless we now aim for reasoning metrics — just like a teacher who gives feedback for how to even approach a problem)

and then also unsupervised RL (analogous to a student doing practice problems by themself)

etc
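
in code-ish terms, the training loop i was imagining looks smth like this (very hand-wavy sketch .. every function and weight here is a hypothetical stub, and update() is a stand-in for whatever policy-gradient-ish RL algo youd actually use):

```python
import random

# all of these are made-up stubs, just to show the shape of the loop

def sample_reasoning_trace(model, problem):
    """run the small LLM(s) step by step; return (trace, final answer)"""
    return ["step 1", "step 2"], "final answer"

def verify(problem, answer) -> float:
    """the teacher checking ur work: 1.0 if the answer checks out, else 0.0"""
    return float(random.random() > 0.5)  # dummy

def critique(trace) -> float:
    """constitutional-AI-style *process* feedback in [0, 1]: did the model
    approach the problem sanely, decompose it well, not go in circles, etc"""
    return random.random()  # dummy

def update(model, trace, reward):
    """stand-in for a policy-gradient / PPO-style update on the whole trace"""
    pass

def train(model, problems, epochs=10, w_answer=1.0, w_process=0.3):
    # weights are made up; the loop over problems is the "doing practice
    # problems by themself" part
    for _ in range(epochs):
        for problem in problems:
            trace, answer = sample_reasoning_trace(model, problem)
            reward = (w_answer * verify(problem, answer)   # got it right?
                      + w_process * critique(trace))       # reasoned well?
            update(model, trace, reward)

# usage: train(my_tiny_llm, load_math_problems())  # both hypothetical
```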


so that was my hypothesis and my experiment that i wanted to try


and basically i think that is what openai has done with o1/o3


https://x.com/__nmca__/status/1870170101091008860
from a researcher at openai


and this makes things incredibly data efficient

like eg, you dont need to do 99999 physics problems to be good at them .. just get good at thinking from first principles and derive things — the former is a salena/LLM approach to math, the latter is what we know to be a much better approach to mathy subjects

and it also makes things cheaper (you dont need a huge capable model to do the immediate intuition, as long as you can use it well — just like humans. we have much much worse type 1 thinking but we string them together very very well.) (o3 mini performs slightly better than o1, but o3 mini is cheaper)

but it also means you dont get immediate answers, the model has to think first and derive the answer instead of giving an immediate answer

etc many implications


at least this is my understanding of o3


https://www.interconnects.ai/p/openais-o1-using-search-was-a-psyop

https://x.com/__nmca__/status/1870170098989674833?s=46

https://github.com/hijkzzz/Awesome-LLM-Strawberry

some things i still wanna read


[ break ]


https://www.youtube.com/watch?v=eaAonE58sLU

using search/planning/thinking allows a model to perform at the level of another model with 100,000x params that uses no thinking

apparently they knew this since 2017

for poker

i bring this up bc i feel like it's rly similar to the earlier message of // wait going back to this — it’s crazy how far humans can reason [...]


also wait wtf

[graph from the talk: elo rating vs alphago versions]


around 19:56 ..... So you can see on the y-axis here, we have Elo rating, which is a way of measuring the performance of humans and different models. And on the x-axis, we have different versions of AlphaGo. So you can see AlphaGo Lee is the version that played against Lee Sedol, and it's just over the line of superhuman performance, around 3,600 Elo. AlphaGo Zero is the stronger version, which has an Elo rating of about 5,200. So clearly, superhuman.

But AlphaGo Zero is not just a raw neural net. It's a system that uses both a neural net and an algorithm called Monte Carlo Tree Search. And if you look at just the raw neural net of AlphaGo Zero, the performance is only around 3,000. It's below top human performance. So I want to really emphasize this point, that if you look at just the raw neural net of AlphaGo Zero, even though it was trained with Monte-Carlo Tree Search, if you just run it at test time without Monte-Carlo Tree Search, the performance is below top humans.

And in fact, this AlphaGo was in 2016. It's now 2024, eight years later. And still, nobody has trained a raw neural net that is superhuman in Go.

Now I noticed that when I mention this to people, they would say like, well, surely, you could just train a bigger model that would eventually be superhuman in Go. And the answer is, in principle, yes. But how much bigger would that model have to be to match the performance of AlphaGo Zero? Well, there's a rule of thumb that increasing Elo by about 120 points requires either 2x-ing the model size and training or 2x-ing the amount of test time compute that you use. So if you follow this rule of thumb, if the raw policy net is 3,000 Elo, then to get to 5,200 Elo, you would need to scale the amount, the model size and training by 100,000x.

Now, I want to caveat this. There's a big asterisk here, which is that I don't actually think AlphaGo Zero is 5,200 Elo. I think they measure that by playing against earlier versions, earlier versions of the same bot. And so there's a bias where it would do better against those because it's trained through self-play, so it's trained against those bots. So I think the number is probably more like 1,000 to 10,000. But in either case, you would have to scale the model by a huge amount, and the training by a huge amount to match the performance that you're getting by using Monte-Carlo Tree Search where the bot thinks for 30 seconds before acting. And by the way, this is still assuming that you're using Monte-Carlo Tree Search during training, during self-play. So if you were to take away the Monte-Carlo Tree Search from the training process, this number would be astronomical.

tldr: alphago zero is ~5200 elo with search, ~3000 elo without search (worse than top humans), and that was 8 years ago .. even now in 2024 no one has trained a no-search model thats better than top humans at go
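
btw you can sanity-check his 100,000x claim from the 120-elo rule of thumb he gives:

```python
# rule of thumb from the talk: +120 elo per 2x of model size + training
elo_gap = 5200 - 3000      # search vs raw-net elo on the alphago zero graph
doublings = elo_gap / 120  # ~18.3 doublings
print(2 ** doublings)      # ~330,000x .. same ballpark as his "100,000x"
```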

god he's making it seem like search is such an obvious answer and anyone who wasnt using search was just blind


wait wtf

hanabi

apparently he / his research group beat the state-of-the-art model for hanabi at the time, and all he did was add literally one of the simplest search strategies you can think of

[graph from the talk: test-time search in hanabi]
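
i havent read the hanabi paper, but my guess at what "one of the simplest search strategies" means is basically one-ply rollout search on top of the existing policy (ignoring all the partial-observability stuff that makes hanabi actually hard) .. toy sketch with made-up stubs:

```python
import random

# made-up stubs for a generic game interface, just to show the idea
def legal_actions(state): return [0, 1, 2]
def step(state, action): return state + 1             # pretend states are ints
def is_over(state): return state >= 10
def score(state): return random.random()              # pretend final score
def blueprint_action(state): return random.choice(legal_actions(state))

def rollout_value(state, n=100):
    """estimate a state's value by letting the no-search blueprint policy
    finish the game n times and averaging the final scores"""
    total = 0.0
    for _ in range(n):
        s = state
        while not is_over(s):
            s = step(s, blueprint_action(s))
        total += score(s)
    return total / n

def search_action(state):
    """one-ply search: try every legal action, roll out the blueprint after
    each one, and pick whichever action looked best"""
    return max(legal_actions(state),
               key=lambda a: rollout_value(step(state, a)))
```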



