The tricky part about reasoning is that if you ask me a question that requires reasoning and I give you an answer, on the face of it you can never tell whether I memorized the answer and gave it to you or I actually reasoned from first principles. My favorite example: in the old days, when Microsoft started doing these interviews, they apparently used to ask why manhole covers are round. The very first sucker who had to answer that question had to do it from first principles — they had to realize that pretty much any other shape can be manipulated so that the cover falls through, and round is the only one where that can't happen; if you could play with the covers and they fell through, all these holes would be coverless. That is a neat way of thinking, but only the first few people had to do it. Now, when anybody asks this question in an interview, the only thing you learn from the candidate's answer is whether or not they prepared for the interview by looking up the web to see what the usual Microsoft questions are.

One way to get better implicit reasoning with large language models, when asking them questions similar to the manhole one, is to use the Brave Search API with retrieval-augmented generation. It's an independent, affordable search solution that's making information retrieval far more efficient. Built from the ground up, Brave's index covers over 20 billion web pages, free from the biases often associated with major tech companies, and what really sets it apart is its unique approach to data collection. Brave offers this powerful tool at developer-friendly prices that scale with your business, making it an accessible option for projects of all sizes. So I encourage you to check out the benefits of the Brave Search API firsthand; you can start with 2,000 free queries monthly by visiting brave.com/api.

It's an absolute honor to have you on MLST — I've been a huge fan from the sidelines for a long time now. Can you introduce yourself?

Yeah, first of all, thank you for having me. I am Subbarao Kambhampati. I've been a professor at Arizona State University — I think this is my 33rd year. I started working in AI as an undergrad; I was doing speech recognition about 40 years back, and then worked in planning and decision-making for most of my research career. For the last seven to ten years I was looking at explainable human-AI interaction, and most recently, really the last two or three years, I've started looking at all these claims about the reasoning and planning capabilities of large language models. So that's sort of my journey.

Okay. So you've said that large language models are n-gram models on steroids, approximate retrieval systems, or maybe even databases, and that they can't reason or verify anything they do out of the box. Tell us more.

First of all, large language models are trained in this autoregressive fashion to complete — to guess — the next word, and these are essentially n-gram models. Those have been around since the time of Claude Shannon; the difference is that those were n-gram models for n equal to one, two or three, so we have a pretty good understanding of bigram and trigram models: if I say "left" you tend to think of "right", and that's a trigram-model sort of thing. What is actually happening here is that even the lowly GPT-3.5 happens to be, in effect, a 3,000-gram model.
That means: given the last 3,000 words, what is the most likely next word? It's the same idea, except on huge steroids. The tricky part is that the number of 3,000-word sequences, with a vocabulary of about 50,000 words, is 50,000 to the power of 3,000. So if you try to keep occurrence statistics, which is the normal way to do n-gram models, you run into two problems. One, of course, is that this is a humongous set of rows for which you would need a conditional probability table over the likely next word. And secondly, the probability that the same 3,000-word context occurs more than once is essentially zero. And 3,000 is GPT-3.5 — right now we are talking about million-word-context models.

So the surprising thing about LLMs is that we are able to train these really large models at all, and there is a huge amount of compression going on. I tell students: when you think of GPT-3.5, people are surprised that it has 176 billion parameters. No — you should actually be happy that it's only 176 billion, because if you really tried to do it from first principles you would wind up with 50,000 to the power of 3,000 rows, which is essentially infinity in some sense. Because of this huge compression — and any compression corresponds to some generalization, since rows that would have been zeros before can now become non-zero for some words — what we found empirically is that the completions from these trained models have very interesting properties. In particular, this is a form of generative AI, so they capture the distribution extremely well, and that means they capture the style of the data they have been trained on.
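As a purely illustrative aside (not from the conversation itself), here is a minimal Python sketch of the explicit count-based bookkeeping behind the n-gram view described above, and of why a 3,000-word context makes that bookkeeping impossible without the compression Rao mentions; the tiny corpus and numbers are invented.

```python
# Toy word-level trigram model kept as explicit co-occurrence counts --
# the "conditional probability table" way of doing n-gram models.
import math
from collections import defaultdict, Counter

corpus = "the cat sat on the mat the cat ate the rat".split()

# counts[(w1, w2)][w3] = how often w3 followed the two-word context (w1, w2)
counts = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    counts[(w1, w2)][w3] += 1

def next_word_distribution(w1, w2):
    """P(next word | last two words), estimated purely from counts."""
    ctx = counts[(w1, w2)]
    total = sum(ctx.values())
    return {w: c / total for w, c in ctx.items()}

print(next_word_distribution("the", "cat"))  # {'sat': 0.5, 'ate': 0.5}

# The blow-up: with a 50,000-word vocabulary and a 3,000-word context,
# the table would need 50,000**3000 rows, and any particular 3,000-word
# context essentially never repeats in the training data.
vocab_size, context_len = 50_000, 3_000
print(f"distinct contexts ~ 10^{int(context_len * math.log10(vocab_size))}")
```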
As an English-as-a-second-language speaker, I learned English with grammar rules, which almost makes no sense — for your first language you don't learn grammar, you just learn to speak; for a second language you follow rules and exceptions and exceptions to the rules, and that's a pretty hard way to learn a language. So I was continually impressed, when GPT-3 and GPT-3.5 came around, that they can't make a grammatically incorrect sentence. You almost have to prompt them — "imagine you are an Indian grad student just arriving" or something — and then they will produce one. The point is that grammar is essentially a distributional characteristic, and they capture it, which is very impressive. People were very surprised this was happening: for any arbitrary prompt they give very plausible, very good English completions, which we completely didn't expect. And when our intuitions go haywire, we tend to think everything is now possible.

For humans it's actually the other way around. When I was speaking English as a second language, I knew what I wanted to say — I knew the content — and it was the damn language, putting it into the right grammar, that was getting in the way. So I would tend to assume that if something can speak the language, maybe the content part is easy, because the content was always natural to me. That is where people start thinking that because LLMs can write good, grammatically correct English — good-form language of any kind, programming languages as well as English — maybe they have already got the content right too, since we actually find content easier and style harder. That is the place all these reasoning claims and factuality claims for LLMs come from, and they just don't hold for n-gram models.

This might be an interesting digression, but there is some debate about whether natural language is a formal language.

It never was, I think. One interesting way of thinking about it: formal languages can have interpreters. One of the things people get confused by is that when LLMs guess code, there is a way of checking what that code would produce, because there's a Python interpreter that can take anything that is grammatically correct Python and tell you what the output would be. There is no such thing for natural language, because natural language captures everything — not just Python, but all possible scenarios in the world — and the world is the interpreter for natural language. Because of that, it's a much more flexible, much more unconstrained form of expression. Even if you go from programming languages to something as simple as first-order logic, which is a formal language, the interpreters already become hugely unwieldy — it's computationally very costly to interpret even first-order logic — and we already know first-order logic is not expressive enough. So in many ways that is the appeal of natural language: most reasonable formal languages have interpreters; for natural language, the world is the interpreter, and you can pretty much say anything. I understand there is this tacit-versus-explicit difference, but in general the expressiveness limitations of natural language are much smaller than those of pretty much any formal language — and that is exactly why interpreters are possible for the formal ones.

Yes, and we will get to the verification gap in a minute, because wouldn't it be great if we could build one. But sticking with the reasoning thing just for a minute: people out there are convinced that language models reason. For example, I had Raphaël Millière on the show, and he cited the example of Othello — there was that paper about a board game, kind of similar to Go, where they said the model was learning an abstract world model and generalizing. Why do people believe so deeply that these things reason?

That's a very interesting question. The tricky part about reasoning is that if you ask me a question that requires reasoning and I give you an answer, on the face of it you can never tell whether I memorized the answer and gave it to you or I actually reasoned from first principles. My favorite example is that in the old days, when Microsoft started doing these interviews, they apparently used to ask why manhole covers are round.
And the very first sucker who had to answer this question had to do it from first principles: they had to realize that pretty much any other shape can be manipulated so that the cover falls through, and round is the only one where, however you play with the cover, it won't fall through — otherwise all these holes would end up coverless. A neat way of thinking, but only the first few people had to do it. Now, when anybody asks this in an interview, the only thing you learn from the candidate's answer is whether they prepared for the interview by looking up the usual Microsoft questions on the web. And of course, humans being humans, they know you want to see whether they can reason, so if they happen to know the answer they won't blurt it out right away — that would give it away — they'll take some time and then produce it. So you can't tell, just from the end product, whether I was reasoning or retrieving.

In general this is true, and the one reason we think humans cannot get by on full-on retrieval is that we have a life: even the craziest interview-prepping students spend a bunch of time looking up old questions, but they still depend on being able to do some first-principles reasoning if the question is not anywhere in the question bank. That is not a problem for LLMs. Many of these reasoning claims come from facts like LLMs being extremely good at standardized tests, and the thing people forget about standardized tests is that they have standardized question banks. I keep telling my students: you think the hard part is answering the exam questions; really, the hard part is coming up with interesting exam questions that haven't already been listed on Course Hero — one of those websites that now collect every question that has ever been asked on an exam or put in a question bank, together with the answer, so students can prepare for the test rather than prepare to understand the material. So in many of the cases where people think LLMs are reasoning, it winds up being approximate retrieval.

Now, how do you show that? The usual way is a diagonalization argument. I'll give you an example. I am interested in the planning capabilities of large language models — planning is a form of reasoning involving time and action. One very simple kind of planning problem that people in AI planning have looked at is block stacking: you have a bunch of blocks in some configuration in the initial state — they could be named blocks or colored blocks — you want a different configuration in the goal state, and there is a set of actions such as pick up, put down, stack and unstack. Given these actions and this blocks-world problem, can you come up with a sequence of actions that actually leads to the goal state? That's the planning problem.
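To make the setup concrete, here is a hedged, purely illustrative Python sketch of a tiny blocks-world instance and plan checker, plus the kind of consistent renaming used in the obfuscated "mystery" versions discussed just below. The representation is simplified and is not the actual PDDL encoding used in PlanBench.

```python
# Toy Blocksworld-style plan checker: a state maps each block to what it
# sits on ("table" or another block). A simplified sketch, not the PDDL
# domains used in PlanBench.

def clear(state, b):
    return all(under != b for under in state.values())

def do_stack(state, x, y):        # put x (currently on the table) onto y
    if clear(state, x) and clear(state, y) and state[x] == "table":
        return {**state, x: y}

def do_unstack(state, x, y):      # take x off y and put it on the table
    if clear(state, x) and state[x] == y:
        return {**state, x: "table"}

DOMAIN = {"stack": do_stack, "unstack": do_unstack}

def valid_plan(domain, initial, goal, plan):
    state = dict(initial)
    for name, x, y in plan:
        state = domain[name](state, x, y)
        if state is None:
            return False              # action was not applicable
    return all(state[b] == on for b, on in goal.items())

initial = {"A": "B", "B": "table", "C": "table"}    # A on B; B, C on table
goal    = {"A": "table", "C": "B"}                  # want A on table, C on B
plan    = [("unstack", "A", "B"), ("stack", "C", "B")]
print(valid_plan(DOMAIN, initial, goal, plan))      # True

# Obfuscation in the spirit of the "mystery" domains: rename the action
# labels consistently in both the domain and the plan. The dynamics are
# untouched, so any sound planner solves this exactly as easily.
rename = {"stack": "feist", "unstack": "slap"}
obfuscated_domain = {rename[k]: v for k, v in DOMAIN.items()}
obfuscated_plan   = [(rename[n], x, y) for n, x, y in plan]
print(valid_plan(obfuscated_domain, initial, goal, obfuscated_plan))  # True
```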
Interestingly enough, the original LLMs — GPT-3, and in fact GPT-3.5 — were very bad at this. I think in 2022, when everybody was making claims about planning capability, we wrote a paper saying no, they can't: they get close to 6% accuracy, they are just guessing, and the action sequences won't lead to the goal. Then GPT-4 came — you might remember Sébastien Bubeck and co-authors wrote the notorious Sparks paper — and we were interested in whether those sparks would help the planning capability. So we checked GPT-4, and interestingly the accuracy did increase: still nowhere near 100%, but it went from something like 6% to close to 30%. So you could argue: wow, if GPT-4 is at 30%, GPT-5 will maybe be at 70%, and by GPT-10 you can get 150% accuracy of some sort.

The question we asked ourselves is: can you explain this without it actually being reasoning? If you are doing reasoning, then I can take the blocks world and change the names — instead of "stack" I say "feist", instead of "unstack" I say "slap", something like that. If you have any background in logic, you realize that the predicate names don't change the dynamics of the system, so any planner that can solve the original domain can solve this one with exactly the same efficiency and accuracy. You give this to GPT-4 and it dies. We wrote the PlanBench paper for NeurIPS 2023, which covers both the Blocksworld and Logistics kinds of planning problems as well as these obfuscated versions of them, and it has become almost a fun sideline for us: Karthik Valmeekam, the lead author on that paper, keeps running every new LLM — GPT-4, Claude, Gemini — and on the obfuscated versions they all stay stuck close to zero. They were essentially guessing the plans for blocks world because the usual blocks-world words — stacking, unstacking, and so on — appear often enough in the web-scale corpora; change the words and they are completely lost. That is a great way of showing they are not doing reasoning, because if you were reasoning you would be able to solve the obfuscated version just as easily as the original one.

So in general I can understand why people fall for this: as I said, when someone answers a question that requires reasoning, you can't tell on the face of it whether they just happened to have heard the question and the answer before they came into the room and blurted it out, or whether they actually started from the question and reasoned their way to the answer.

It might be interesting to have a small digression on what reasoning is. I'll give you my definition: I think it is performing an effective computation to derive knowledge or achieve a goal using a world model — so, very loosely, deriving truths that we don't currently know, if that makes sense.

Yeah. I would go back to the fact that reasoning is pretty well defined from a logical perspective. That's one particular kind of reasoning, the kind we have understood from the Greeks on, and I always find it useful to go back to it; there are all sorts of reasoning, but logical reasoning is certainly one very important kind. As you said, in the case of logic, given a set of base facts, you are trying to see whether some new facts follow — whether they are present in the deductive closure of the knowledge that you have.
This is actually a very important distinction. Everybody in AI used to have a logic background; nowadays most grad students start with just deep learning and machine learning courses. That is very important technology, I'm not questioning it at all, but you need to understand that "in distribution versus out of distribution", the usual terminology people use in machine learning, is database thinking. For reasoning you need to think, at minimum, in terms of deductive databases: you are not just checking whether the query you are asking is in the same distribution as the queries presented previously; you are checking whether the system can compute things in the deductive closure. That is what winds up requiring logical reasoning, and there is no reason to believe LLMs can do that.

In fact there is more and more evidence showing that they definitely can't. We have been saying, as a corollary of the fact that they can't do planning, that they must be failing at computing deductive closure, but people have shown it even more directly: there was a recent paper on transitive closure, which is a poor cousin of deductive closure. If a P1 b and b P2 c, you should be able to conclude the chained relation from a to c. They can't even do that. You have to train them not just on the base facts but on additional facts from the deductive closure, and then they might generalize a little bit more — but even after that they cannot do the next step, chaining a P1 b, b P2 c, c P3 d into a conclusion about a and d, because that is again an extra step and they just don't get that there is a method to this madness: in a transitive closure, if a is connected to b, b to c, and c to d, then a is connected to d. If you can't do that, you certainly can't do deductive closure, because transitive closure is a really small part of deductive closure. So that is one way of realizing that LLMs cannot do reasoning.
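A minimal, illustrative sketch (my example, not code from the paper being cited) of the transitive-closure computation described above: given base facts a→b, b→c, c→d, the closure also contains a→c, b→d and a→d, and each extra hop is exactly the kind of multi-step chaining being tested.

```python
# Transitive closure over a set of base facts (directed edges), computed
# by chaining until a fixed point -- a "poor cousin" of deductive closure.

def transitive_closure(facts):
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))     # chain a->b and b->d into a->d
                    changed = True
    return closure

base = {("a", "b"), ("b", "c"), ("c", "d")}   # a P1 b, b P2 c, c P3 d
print(sorted(transitive_closure(base)))
# [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
# ('a', 'd') needs two chaining steps -- the case that, per the work Rao
# cites, models trained only on base facts (or base facts plus one-hop
# consequences) still fail to produce.
```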
Now people say: but there are things LLMs produce that are not base facts. That is really because we don't understand the training data. The training data is not something we put together — it is the entire web — and everybody thinks they know what is on the web, but nobody really does. I keep remembering that when PaLM, the old Google LLM, came out, one of its great claims to fame was that it could explain jokes. Why that is an AI task is beyond me, but it is surprising that it could, so you would think: wow, it is doing something beyond the data. But — I don't know whether you know this — there are humor-challenged people in the world, and there are websites that explain jokes, and those websites are part of the web crawl the system was trained on, so it is not that surprising. In normal logical terms, you would think that if I give base facts and the system says something that is not typically a base fact, it must have done some kind of reasoning. But the web corpora often contain both what we would consider base facts and what we would consider parts of the deductive closure, in an interesting mishmash, and when the model once in a while retrieves something from the deductive closure, you think it is reasoning, because you think it couldn't possibly have been on the web. That's because you don't actually know what is on the web — it is very hard to show what is and isn't there.

There is an experiment from Tom Griffiths' group that I repeat to a lot of people, which makes the same point. One of the "sparks" the original GPT-4 paper talks about is that GPT-4 can do Caesar-cipher decoding. Caesar-cipher decoding is just based on an offset: you take a, shift it by, say, four letters, and replace a with the result — for each offset there's a code. So they checked how well GPT-4 decodes for different offsets, 1, 2, 3, 4, all the way to 25, and surprisingly there is a huge peak at 13 and it is basically close to zero everywhere else. Why is that interesting? If you're old enough, you'll remember that in Unix there used to be — there still is — a command called rot13, rotate by 13. While the Caesar cipher is a general idea, 13 is the special case it is most famous for, and there are tons and tons of normal text and rot13'd text on the web. That is the part the LLM does well, and it does not do well for 2, 3, 4 or any other offset. A little kid to whom you explain the point of a Caesar cipher will have no trouble doing either 13 or 4 or 2 — in fact they might find 4 easier, because it's easier to count to 4 than to count to 13 and shift the letter. And yet this is one of the cases where you would be told to be surprised at so-called emergent reasoning capabilities.
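For reference, a small sketch of the Caesar-cipher family the Griffiths-group experiment probes: rot13 is just the shift-13 member, and decoding with any other shift is the same trivial procedure. (Illustrative code, not the code used in that study.)

```python
def caesar_shift(text, k):
    """Shift each letter k places forward in the alphabet (decode with -k)."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = "a" if ch.islower() else "A"
            out.append(chr((ord(ch) - ord(base) + k) % 26 + ord(base)))
        else:
            out.append(ch)
    return "".join(out)

plain = "attack at dawn"
for k in (2, 4, 13, 25):
    cipher = caesar_shift(plain, k)
    assert caesar_shift(cipher, -k) == plain    # same procedure for every k
    print(k, cipher)

# rot13 is its own inverse (13 + 13 = 26), which is part of why rot13'd
# text is plentiful on the web -- and, per the experiment described above,
# why shift 13 is the one offset GPT-4 decodes well while nearby shifts
# collapse to near zero.
print(caesar_shift(caesar_shift(plain, 13), 13) == plain)   # True
```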
If I might be so bold, what is somewhat wrong about this kind of empirical study of LLMs is that people tend to stop at the first sign of interesting results. You have to be very skeptical: you should start by assuming the system does not have a given capability and then try to poke holes, rather than write a paper right away. Currently we are going through a hype cycle, both in research and in commercial terms, where people are just happy to write "LLMs are zero-shot X", where X is reasoning, planning, mental modeling, and so on. And it turns out that for most of these you can explain the results from an approximate-retrieval point of view by diagonalizing. The planning work where you change the names is a diagonalization argument; checking whether the model can do cipher decoding for offsets other than 13 is another one — if you know the general principle you can do it, and if you are memorizing you can't. That is the classical way of showing it is not doing reasoning. But I agree that in general it's easier to be impressed, because coming up with these diagonalization arguments is hard. From a scientific perspective, though, we need to see whether or not these capabilities are actually there.

Yeah, a few thoughts on that. It's interesting that people are not more skeptical — my religion is skepticism.

Science is supposed to be skeptical in general, and I find this fascinating: in addition to the research implications, it says a lot about the sociology of research. In general, humans are not good at giving credit to anything; they would like to say it is their own idea. Strangely enough, with LLMs they are willing to give more credit to the LLM than to themselves. My cynical way of looking at it is that if you say the LLM did it, you get a NeurIPS paper; if you say you did it, who cares — that's not supposed to be impressive. To some extent people are suspending their disbelief too easily, and that is not a good thing for science. If I'm a startup, sure, the whole point is to hope the world suspends its disbelief while I make my millions — but science is a different thing. So one interesting question is what the half-life of many of these "LLMs are zero-shot X" papers is going to be. I have said, fully tongue in cheek, on Twitter that when some of these papers are shown to be wrong, they should lose their citations. Right now the bad papers, the ones whose claims were proven wrong, keep gaining citations, because people write "this paper said this, and it was wrong". One way of designing the mechanism would be that if I prove your idea wrong, I get all your citations — to the extent that citations are our currency. I'm obviously being facetious, but in general that would reduce the incentive for people to rush to print and hope that maybe they are right, and that even if they are not, they will get some amount of attention. That is exactly what you have to do as a startup — there is that thin book of marketing rules whose first rule is that it's better to be first than to be better — but research is supposed to be different. We have become much more like: it's better to claim something first, get it into NeurIPS or one of the big conferences, and wait for other people to figure out whether the claim is right or wrong, than to do your own due diligence beforehand. That is playing out right now in a larger sense. This is the sociology of science; it's a human endeavor after all, and these motivations matter.

Yeah, so science should be like that snake game where, when you kill another snake, your snake grows.

It should be, exactly.

A couple of things on that. When I speak with x-risk people, they also make a slightly hedged argument, similar to your marketing argument about it being better to be there first: they say it's better to have a big, inefficient AI, because you can approximate a smart AI with a dumb AI, and that's kind of what language models are doing.
They're doing this Mechanical Turk thing where, wherever the model does badly, we just throw more data at it, or augmented data, and that creates this kind of illusion — people anthropomorphize it, and it seems to do well because it has been trained on many, many specialized situations. But I wanted to come back to reasoning, because what you were saying was brilliant. I'm a big fan of Walid Saba — I don't know if you know him — and he talks about transitivity and deductive closure and all this kind of thing, and I want to bring in creativity. When we do deduction we are applying rules that could work in any situation, but we can compose these deductive rules into a big graph. It's similar to how, in physics, we talk about the physical world being causally closed — here we are talking about a deductive closure. What is the role of creativity when we traverse it?

That's an excellent question. Despite the tenor of our conversation until now, I actually think LLMs are brilliant — they're just brilliant at what they can do. I don't complain that they can't reason; use them for what they are good at, which is unconstrained idea generation. In pretty much every field of human endeavor there is this interesting juxtaposition between creative idea generation and laborious deductive idea checking. Even in math this is true: Fermat came up with his conjecture — and of course also claimed that he had proved it — and Andrew Wiles spent years of his life, more than three centuries later, to actually show that what Fermat said is provable with respect to the mathematical axioms, using math that was invented much later than Fermat. And the same with everything else. There's a beautiful book I keep telling people about, Grigori Perelman's life story, called Perfect Rigor. Perelman, as you might remember, proved the Poincaré conjecture and refused the Fields Medal, saying he did it for the mathematics, not for external validation. But the more interesting part is that the way he completed the proof involved somebody else — a mathematician in Berkeley, whose name I forget — who made a creative conjecture about combining two very different fields of mathematics: if this holds, then this will work. And then Perelman went and did the laborious, backbreaking work of showing that it does hold. Math requires both the creativity and the deductive part. Sometimes the same mathematician has both; sometimes some mathematicians are better at the actual proof and some are better at seeing the creative connection. Ramanujan, the guy from India, is the most famous of the latter kind, because he seemed to have a creative generator with a much higher density of being correct: he would just write these huge, crazy series and write down the answer with no proof, and G. H. Hardy famously said they must be true, because if they weren't, nobody would have the imagination to invent them. He was obviously being facetious — he was a great supporter of Ramanujan's work — but in general, the creativity part is something that LLMs are extremely good at.
In any kind of theorem proving, you can take shortcuts by coming up with interesting hypotheses and proving those, because if you try to do it from first principles, looking at all possible theorem-proving paths, it is essentially impossible. There are bigger reasons too: there is Gödel incompleteness — Hilbert wanted to basically automate mathematics, saying it's all nothing but logical theorem proving, and Gödel said even that's not true — but even for the part of math that really is just deductive closure, computing the deductive closure of a large enough system is horrendously inefficient, so you need to make inductive leaps. There is nothing wrong with having ideas; the only thing civilization needs is that when you get an idea, you check whether it works — that is the robustness aspect — and if it doesn't work, you give up on that idea and go for something else.

We typically find that verifying an idea is easier and having the idea is harder. In our lives we have friends who are full of ideas: you get stuck, you ask, "hey Tom, do you have an idea for how to do this?" Tom may not have a big stake in it, but he is good at generating ideas, he gives you one, and it's your responsibility to check whether it actually works. And when it does work, we tend to give much more credit to Tom, because Tom gave the idea — but Tom didn't prove the idea was right; you did. Verification is comparatively easy for us, so we think idea generation is the more important thing. LLMs are actually good at the idea generation, and they are bad at the thing that is, to some extent, easy for us — verification. So in general, the creativity part is about making the inductive leaps that cut down the search. If you just generate an idea and never check whether it works, it is just an idea; but if you have a generator of ideas with a higher density of good ones, it can reduce the search quite significantly. That is how creativity and reasoning come together: inductive leaps versus deductive closure. The inductive leaps are the imagination, the creativity part, and LLMs are quite good at some of that — it's no skin off their nose, you ask and they give you ideas — but you have to check whether the idea actually makes sense in the context of all the constraints of the problem.

People in more creative fields — engineering design, which straddles the creativity aspect and the realizability aspect, or architecture — tend to say that during the ideation phase you need to be unconstrained, and then, once you have good ideas, you have to make sure the building that looks like that can actually be supported with the strength of materials we currently have; otherwise it's just a beautiful-looking building that cannot be built. That second part is equally important.
And one of the hopes is that when people go through architecture courses, they learn to constrain their generator, their ideation phase, so that they are more likely to come up with ideas that are realizable. That happens slowly, but that's my way of looking at where the creativity and the reasoning parts come in, and it is very relevant for LLMs, because I would use LLMs for the creative part. Ideation requires knowledge — shallow knowledge, but of very wide scope — and most of us don't have that. Analogies are a great case where LLMs do better than the man on the street: to make an analogy you need a significant amount of knowledge about the world. If you want to analogize how things in Finland will work from how things in India work, you need to know something about both countries, and not everybody does, so they cannot construct those analogies; LLMs, on the other hand, have been trained on the entire world's data, so they can make these kinds of shallow, unguaranteed leaps, and then you have to pick them up and check whether they hold. That is one great place for them.

Yeah, let's just explore this a tiny bit more. I agree there is a beautiful symbiosis between the generation of ideas and the verification of them, and you've said that even though we can be pessimistic about LLMs, in many ways they are incredibly useful. But if you don't mind, I want to gently push on this creativity thing. You said that in order to generate ideas we need knowledge, but reasoning is about creating new knowledge. We could be completely open-ended when we generate ideas, which wouldn't be very good, or generation can be guided by intuition or existing knowledge. I think there's a difference between inventive creativity and combinatorial creativity, so there should be some form of reasoning even in the creativity itself, and that leads me to think that language models are surely limited in the types of things they can create, because they're bounded by the training data.

I agree, but my point is that, compared to you and me, they have been trained on a lot more data, so even if they are doing shallow, almost pattern-matched leaps across their vast knowledge, to you it looks very impressive, and it is a very useful ability. There is this very interesting question of whether the creativity a machine produces is, as you said, combinatorial versus something beyond that — but that becomes almost a second-order question if you don't even have a tool that generates this combinatorial creativity over vast knowledge. In that sense, again, I am not viewing LLMs in terms of AGI connections so much as incredibly useful tools — for computer-supported cooperative work in particular, whether cooperative with humans or cooperative with other reasoning systems. That is where they really shine, and from that perspective I am willing to use them for the creativity aspect. I would almost never use them for the reasoning aspect, because reasoning requires some kind of guarantee that what you said is actually true.
And I don't believe they are going to be able to give that, at least for the usual correctness considerations. They can do better on style: they can produce an essay that looks more like a well-written English essay, as opposed to making sure the content of that essay is factual, or making interesting deductive-closure claims — actually going from the premises toward something more interesting. That is the part I don't believe LLMs can do. But they can write, and also check, the style of another essay, because that's what they're good at. To some extent, autoregressive LLMs are a specific form of generative AI, and generative AI learns properties of a distribution. Style is a distributional property; correctness and factuality are instance-level properties. LLMs do very well on distributional properties because they are distribution learners, and they are not going to give you any guarantees about instance-level correctness. We do have other tools for that, obviously, and we can put them together.

Interesting. So it sounds like we're saying there is a verification-and-reasoning gap and potentially a creativity gap, but the reasoning gap is much bigger, so we should focus on that. The other thing, of course, is that we're talking about building this system in which humans are engaged: we're always doing creative things, we produce data, the data goes into the language model, creating this big collective intelligence. In a sense, if you do something creative tomorrow, it will be in GPT-4.

Exactly, and that's the way to think about it, because in essence they are tools, and we built those tools. I keep telling people — many don't realize, as you very well put it just now — that these systems are scaffolded on our collective knowledge. I ask people to do this thought experiment: imagine Sam Altman had already got Satya Nadella to pay for training GPT, but the web didn't exist, so he had to create it, and he comes and says, "guys, why don't you just put everything you ever wanted to talk about on the web so I can use it as data?" That would be dead on arrival — nobody would do it. We made the web for each other, to communicate with each other, and it then became fuel for these systems. So the fact that these things do well is very much based on them doing shallow pattern matching over this vast knowledge that we have put on the web. It would be a lot more impressive if a Martian who had no access to anything we created could still help us — that would be an AGI kind of thing. But I am very, very happy with tools. I think Google is a great tool. Would you say Google is AGI? It doesn't matter to me; it's extremely useful, because it tells me information that other people know and I don't. That is a very reasonable way of looking at the amplifying effects of these kinds of technologies. But it is very important to remember that they are not creating this knowledge.
In fact — and I have actually said this — my biggest worry is the following. In Norway there is this seed vault, somewhere high up in the north, where they keep the seeds of everything just in case there is some nuclear event and we need to restart civilization. I have always felt we need to take a snapshot of the current web and put it in that seed vault, because as LLMs generate more and more completions and those become part of the web, you now have a much noisier version than things like Wikipedia and the New York Times, which are much more curated sources, and when it is all combined it becomes much harder to tell which is which. That tells you how important it was that the data they were trained on was not generated by them at all. So when people say you don't need human data — not too many smart people say that, but a few not-well-informed people think that somehow LLMs can just create data and train themselves — that is essentially the blind leading the blind, and it actually gets much worse, because the completions only follow the distribution, they don't have any kind of accuracy, so the factuality goes down even further. When people talk about synthetic data, what they typically wind up doing is depending on some external solver that produces the synthetic data with guarantees. If I want better planning, I create a huge number of planning problems, use an existing outside planner to solve them, and then, having the problems and the plans, fine-tune my LLM so that it becomes slightly better at generating solutions for that distribution of planning problems. If it just generated plans itself and trained on them, that would be the blind leading the blind.

Just a quick digression on this: are you familiar with Chollet's ARC Challenge?

Yes, yes.

That's a great example, because I'm friends with Jack Cole and the winning team, and they've done dataset generation, test-time inference and fine-tuning and so on. They've made a whole bunch of transform rules — reflections, symmetries and things like that — so they're kind of building up the data distribution and then fine-tuning a language model on it. And then there's another approach, like Ryan Greenblatt's, where you use the LLM as an idea generator, just like you advocate.

Exactly. Ryan Greenblatt's approach is very much the sort of LLM-Modulo version, because they have a Python interpreter: they generate a huge number of code snippets and then check whether those snippets could have generated the right results on the ARC training data, and that gives them a huge advantage. It is a very reasonable thing to do, but the guarantee is coming from the interpreter. As we were saying earlier, formal languages have interpreters and natural languages don't, so by converting the whole thing into "I will guess the Python code for solving the ARC task", you can now depend on the Python interpreter to make sense of whatever you guessed, and you can at least compare the expected output with the output that a particular piece of code actually generates.
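A schematic sketch of the generate-and-check loop being described (my simplification, not Ryan Greenblatt's actual pipeline): candidate transformation programs are proposed as Python source — here the `sample_candidate_programs` stub stands in for the LLM call — and the Python interpreter, acting as the verifier, keeps only those that reproduce every training input/output pair.

```python
# Generate-and-test over candidate programs for an ARC-style task: the
# guesses come from a generator (a stub standing in for an LLM), and the
# guarantee comes from executing them with the Python interpreter against
# the known training grids.

def sample_candidate_programs(n):
    """Stub for 'ask the LLM for n candidate transform functions as code'."""
    return [
        "def transform(grid): return grid",                         # identity
        "def transform(grid): return [row[::-1] for row in grid]",  # mirror rows
        "def transform(grid): return grid[::-1]",                   # flip row order
    ][:n]

def passes_all(source, train_pairs):
    namespace = {}
    try:
        exec(source, namespace)                     # interpreter as verifier
        fn = namespace["transform"]
        return all(fn(inp) == out for inp, out in train_pairs)
    except Exception:
        return False                                # crashing guesses are rejected

train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),           # output is the mirrored input
    ([[4, 5, 6]],      [[6, 5, 4]]),
]

survivors = [src for src in sample_candidate_programs(3)
             if passes_all(src, train_pairs)]
print(len(survivors))   # 1 -- only the mirror program reproduces the pairs
```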
At one level it is still an extremely brute-force approach, because they are generating huge numbers of combinatorial possibilities, and you have to make sure the set is diverse enough. In general this is again similar to the LLM-Modulo stuff I talk about: these are generate-test approaches where the tester is an external thing that can give guarantees. The soundness is guaranteed by the tester; the completeness depends on the generator, and as impressive as LLMs are, there are no guarantees that they are complete. This goes back to the question: yes, they're creative, but can they create every possible thing? In general, that almost requires a prompt-diversification strategy, where you bring in extra outside knowledge to say: okay, stop generating this kind of code, generate this other kind of code. You are introducing an inductive bias, saying I believe you should also consider this other kind of code, because maybe one of them will pass the verifier's test — that is extra knowledge you are bringing in. This idea of tree of thoughts, which has become very popular, is actually best understood that way: it is really a prompt-diversification strategy that brings in external knowledge to push the LLM.

So, I interviewed Ryan Greenblatt, and unfortunately — I asserted that there's a System 1 and a System 2, that it's a neurosymbolic setup, that it's only working because the Python interpreter is doing the verifying — and he was adamant: no, no, just wait until the next GPT model comes along. But the beautiful segue is that he did a few things: he did chain of thought, so he kind of carved up the space; he did self-reflection, refinement of the solutions; and he believes deep in his bones that the LLM can autonomously verify itself and come up with the right solution. You've done some very interesting work in this area.

Yeah, I am completely in the other camp on this: for correctness verification, LLMs are just no good. In our own experiments — first of all, you should only talk about correctness and verification where there is actually a formal specification of what correctness is. I tend to differentiate between tacit and explicit knowledge tasks. For tacit knowledge tasks there is no formal specification — there is no formal way of saying what makes something a cat or a dog — and those are the tasks we share with all the animals: we can deal with them, they are messy, and there is no formal description. But the civilization we built, which the animals don't have, depended very much on the addition of explicit knowledge and verification. I don't want to get on a plane and hope it gets somewhere; I want to be able to trust that somebody did the detailed calculations to make sure there is enough fuel and all these other things, and that it will actually reach its destination. There are always exogenous events that can happen, but there should not be any egregious "let's just hope it might work" kind of thing.
So, to me, that is the kind of verification we should be interested in. Tacit knowledge tasks are typically about style; explicit knowledge tasks are about correctness. For the correctness side, in our work we looked at planning, at constraint-satisfaction problems — graph-coloring kinds of problems — and also at the Game of 24, which has somehow become popular in the LLM community. All of these have formal verification possibilities: given a solution, you can check with full guarantee whether or not it is a solution. So what we did was let the LLM critique its own solutions versus having them critiqued by an external verifier. When the LLM critiques its own solutions, its accuracy goes down — not up, down — and that is because when it critiques, it hallucinates false errors as well as missing actual errors. Because of that, even if it blundered onto what is actually a correct solution, instead of stopping it might keep changing it until it becomes worse. It is fascinating: when they do self-reflection, quote unquote, they actually get worse. If, on the other hand, you have an external verifier giving an external signal about whether or not the answer is correct — even something as simple as "this is not correct, try again" — that is enough to improve performance.

This is the beginning of what I've been calling LLM-Modulo systems: the LLM generates the guess, and an external verifier gives at least a binary signal. The very simplest version is: generate, say, 150 solutions and have the external verifier check whether any one of them is correct — the LLM saying "one of these must be right" is not enough; you need something that can actually say which of these solutions is correct, and that is what the external signal does. You can also do it more incrementally: say, here is what is wrong with this solution, here are the errors, give that as a back prompt, and then the LLM can potentially use that information in coming up with a new guess. I say "potentially" because we don't really understand what happens when you give a prompt to an LLM and how it is, quote unquote, using that prompt to produce the next completion, other than at the level of conditional probabilities over the next token. We are in this strange world where we talk about LLMs having a million-word context, and at the same time I can give a ten-word context and they will do the wrong completion — it is a shallow use of the prompt, rather than any deep "I understand the prompt and can use it in generating the next tokens". But with this back-prompting we can at least agree that it biases the model toward a different set of completions, and the hope is that the verifier will check those completions and see whether they are actually error-free. That is basically the way verification would work.
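A hedged sketch of the LLM-Modulo loop being outlined here, using graph coloring (one of the domains mentioned above) as the task with a sound external verifier. The `llm_propose` stub stands in for the LLM call, and the graph, colors and loop bound are invented; the point is that the verifier's critique, not the model's own opinion of its answer, decides when to stop.

```python
# LLM-Modulo style generate/verify/back-prompt loop for graph coloring.
# The external verifier is sound: it either accepts a coloring or returns
# the concrete violations, which are fed back as a critique ("back prompt").
import random

EDGES = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
COLORS = ["red", "green", "blue"]

def verify(coloring):
    """Sound external verifier: list every edge whose endpoints share a color."""
    return [(u, v) for u, v in EDGES if coloring[u] == coloring[v]]

def llm_propose(critique):
    """Stub standing in for the LLM: here it just guesses, treating the
    critique only as a possible bias -- which is all we can count on."""
    return {node: random.choice(COLORS) for node in {n for e in EDGES for n in e}}

critique, coloring = None, None
for attempt in range(1, 1000):
    coloring = llm_propose(critique)
    violations = verify(coloring)
    if not violations:
        break                                  # the verifier, not the LLM, says "done"
    critique = f"these edges are monochromatic: {violations}"

print(attempt, coloring, verify(coloring))     # [] means the verifier accepts it
```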
Most of the time, when people think LLMs are verifying, they are confusing style with correctness of content. Many papers have been written on "the LLM improved its essay", but improving an essay is a tacit-knowledge task: there is no simple formal system that can take an essay and say it is correct, because you are really looking at style characteristics. If you have a beautifully written essay that at some point has a fatal factual flaw, they are not going to find it. What these papers do instead is run human-subject studies — ask humans whether these are better or worse essays — and they might find that essay quality improves when the LLM critiques itself, but that is a style improvement, not a correctness improvement. In the LLM-Modulo architecture we say you can critique both style and correctness. Correctness is in some sense the boring part, but note that you could have a correct plan that doesn't have the right style. I give this example of a correct travel plan for getting to Vienna from India, where I started: walk one mile, run another mile, then bike another mile, and so on — after some indefinite number of these actions I will be in Vienna. That is correct, but it is terrible style; most people would not consider it a reasonable travel plan, because they expect something like an airline flight or some other standardized option. Style is something LLMs are actually better at critiquing; correctness is something they cannot do, and you want external verifiers for it. And it is complementary, because classical AI systems are actually much better at correctness than at style.

Yeah, I agree with you. I'm kind of a philosophical rationalist — I think we should be reasoning in the domain of certainty: it's either correct or it's not. So we agree on that. But, devil's advocate for a second: there are folks who think we can do end-to-end predictive models, do this active inference, keep leaning into the test instance and fine-tuning and fine-tuning, and then — we can't verify it, we don't know that it's correct, but we can build systems that work reasonably robustly and often produce the correct answers. Do you think that's reasonable?

The point there is, first of all, that for end-to-end correctness you can use the world itself as a verifier, and that is an idea that works only in ergodic domains, where the agent doesn't die while it is trying out its bad ideas. Even when you're doing end-to-end verification, there needs to be a signal about whether the output is actually correct — where is that signal coming from? That is the first question. The second question is how costly this is going to be. This is the whole Q*-Anon conspiracy, as I tend to call it: you may remember that for a while people were saying there is this Q* algorithm that OpenAI is working on and it will do wonders, and obviously nobody knows whether anything was actually done. But some of the ideas floating around were this kind of closed loop where you have some verifier with which you generate synthetic data and then fine-tune the system. There is no universal verifier — verifiers are problem-specific — and fine-tuning is a fine enough idea, but ultimately the question about fine-tuning is the amortization cost.
Even for simple fine-tuning: as I told you, these LLMs do badly on blocks world. I can always fine-tune with a gazillion blocks-world instances — say, three-to-four-block stacks — and it improves performance on three-to-four-block stacks; but then when I go to five or six blocks, performance falls off again. A variation of this is the experiments from Yejin Choi's group, where they did 4-digit-by-4-digit multiplication with LLMs: they fine-tuned on something like a trillion pairs of 4x4-digit multiplications and their answers, paid an arm and a leg to OpenAI — I think something like $150,000 — and got 4x4-digit multiplication up to 98% accuracy. Leave aside for one minute the silliness of wanting an LLM to do 98%-accurate 4x4-digit multiplication when I have cheapo calculators that give 100% accuracy; the more damning thing is that when you then give it 5x5-digit multiplication, it goes back to zero. That is the amortization argument.

The same problem occurs for chain of thought; fine-tuning and chain of thought are very closely connected. Chain of thought essentially gives advice about how to solve a couple of examples, and you are hoping the LLM learns something from those examples. As an old AI guy, I know that advice taking is an AI-complete problem. In fact John McCarthy — one of the founding fathers, with Marvin Minsky, Allen Newell and Herb Simon, one of the four people behind the Dartmouth conference — said that the holy grail of AI should be an advice-taker program: you give it advice and it follows that advice. In general that is an extremely hard thing to do, and what you would like is an advice taker to whom you can give very high-level advice and have it followed in general. Chain of thought makes it look like LLMs can do that, but what really happens is: if you tell them how to solve three- or four-block stacking problems, they improve on three- or four-block stacking problems, but you increase the number of blocks — the principle remains the same — and they fall apart. For example, one chain-of-thought idea in the blocks world is that you can always unstack all the blocks onto the table and then just build the goal stacks as you need them. If I tell this to a kid, they understand it and can do it for one block or 15 blocks or 200 blocks, until they get bored. LLMs don't get the procedure: they only do better for the three- or four-block stacks, and as you increase the number of blocks they die.

It turns out you don't even need planning to see this. There is a much simpler problem, last-letter concatenation: given a couple of words, say "dog" and "cat", you want the LLM to say "g" and "t", because "g" is the last letter of "dog" and "t" is the last letter of "cat". The original chain-of-thought paper said: off the shelf they can't do it, but if you give them this advice about how to do it, they seem to do much better. We did the experiment where, having given examples for two- and three-word problems, we increased the number of words. And that should be very reasonable: if you want an AGI, it should at least understand that last-letter concatenation is a simple problem where, whatever the number of words, you just repeat the same step. It dies.
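For concreteness, the underlying procedure in the last-letter-concatenation task is genuinely trivial — the same single step repeated once per word, regardless of how many words there are — which is what makes the failure to generalize across lengths telling. A two-line illustrative version:

```python
# The general procedure behind last-letter concatenation: take the last
# letter of each word and join them, however many words there are.
def last_letter_concat(words):
    return "".join(w[-1] for w in words)

print(last_letter_concat(["dog", "cat"]))                       # "gt"
print(last_letter_concat(["dog", "cat", "emu", "yak", "owl"]))  # "gtukl"
```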
So I keep saying this is what advice taking really looks like here. It reminds me of that old capitalist proverb: give a man a fish and you feed him for a day; teach a man to fish and you feed him for life. Chain of thought acts as if it's the second, but it's actually a weird version of the second where you have to tell the LLM how to fish one fish, how to fish two fish, how to fish three fish, and so on, at which point you lose patience, because it's never learning the actual underlying procedure. It's basically taking the examples of the length that you gave and doing more or less a pattern transfer, as against any sort of procedure learning. Interestingly, Dale Schuurmans, one of the original authors of the chain-of-thought paper, gave an invited talk at ICAPS, the automated planning conference, and he basically said that what we are saying is in fact true: they do have trouble following the procedure. The point is that the original results left it to people's imagination that maybe it will learn: they showed how to do three to four blocks, tested again on three to four blocks, and performance increased; they said nothing about what happens if you increase the number of blocks. A sign of a hype cycle is that people assume the best rather than the worst about what happens when you increase the problem size. The authors didn't claim it would do better, but they left it unsaid, and people assumed it was doing it. Once you point this out, you realize it is not a general advice taker at all. So chain of thought is not what people intuitively believe, because they tend to anthropomorphize: they think that if you tell a person "this is how you solve the three-block problems, put all the blocks on the table," that person will be able to solve any number of blocks, and the same for last-letter concatenation, and they think that is what LLMs are able to do. But no: they are only able to do it for the number of blocks or words in the examples you gave; increase the size and they don't get the general principle.
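For contrast, the "general advice" itself, i.e. unstack everything onto the table and rebuild the goal stacks, is a few lines of ordinary code, which is why it scales to any number of blocks. A sketch under a deliberately simple state encoding (a state is a list of stacks, bottom to top); the encoding and the action strings are my own simplification, not PlanBench's format.

```python
# The "general advice" written out: unstack everything onto the table, then build
# each goal stack bottom-up. A state is a list of stacks (bottom -> top).
def solve_blocksworld(initial, goal):
    plan = []
    # Phase 1: put every block on the table (top-down; the bottom block is already placed).
    for stack in initial:
        for block in reversed(stack[1:]):
            plan.append(f"unstack {block}; put {block} on table")
    # Phase 2: rebuild each goal stack bottom-up.
    for stack in goal:
        for below, above in zip(stack, stack[1:]):
            plan.append(f"pick up {above}; stack {above} on {below}")
    return plan

# Works unchanged for 3 blocks or 300, which is the generalization the models miss.
print(solve_blocksworld(initial=[["A", "B", "C"], ["D"]], goal=[["C", "B"], ["D", "A"]]))
```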
So a couple of things. First of all, many of these techniques brutalize the model, in the sense that they make it quite domain-specific even if we do external verification; things like RAG and tool use and chain of thought make the models quite domain-specific. But the bigger problem I wanted to point out is that we want to teach the models how to fish, and the models can't learn how to fish in principle, because they're not Turing machines. You gave the example in your paper of a semi-decidable problem, and maybe you should bring that in, but these things aren't solvable in principle in language models.

So there is an interesting point there: many of the bad ideas around LLMs come from the wrong connections people make between computational complexity and what LLMs do. For example, the very reason people thought LLMs could be better at verification is that generation is typically computationally harder than verification, so they assumed that even if LLMs can't do the harder problem, maybe they can do the easier one. The reality is that they are not computing the solution in any way. As many people have pointed out, consider the following: take five different prompts, where one of them, for those of us who understand English, is implicitly asking for the solution of a constant-time computation problem, another a linear-time one, another a polynomial-time one, and so on all the way up to an undecidable problem, and for all of them the answer is yes or no. If you want to give a guarantee, you need to actually do the computation. Now ask yourself: will the LLM take more time to come up with the token "yes" or "no" depending on the prompt? No; it takes essentially constant time to come up with the answer regardless.

So people then say: since it's going to answer in constant time anyway, let's make it put out more tokens so that altogether it takes more time. There are two versions of this. One is the really egregiously silly one: there is a paper, I forget the authors, proposing to put pause tokens into the LLM, hoping that it will then start reasoning, because in constant time it can't possibly be reasoning, so you make it pause and maybe somehow reasoning will occur. That is totally magical thinking. The other version is that you split the problem into sub-parts, do a whole bunch of work, and convert the prompt into multiple other sub-prompts; that is the kind of thing people have been looking at, but you have to remember that in that case you are doing the computation, not the machine, not the LLM. So the computational-complexity metaphors are mostly irrelevant for understanding how LLMs actually complete prompts. There is a very different question of whether transformer networks are Turing-complete; I tend to argue that is an orthogonal question. Remember, there were neural Turing machines before, and transformers with external memory can be made Turing-complete, and there are variations people have talked about, but I don't think that is actually what happens when LLMs generate the next token. It's not like they are doing Turing computation of any kind; they are just picking the next token in constant time. They are n-gram models.
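A back-of-the-envelope way to see the constant-time point: using the common rough estimate of about 2 x parameters floating-point operations per token processed, the work spent emitting a one-token yes/no answer depends on the model size and the token count, not on the computational complexity class of the question the prompt encodes. The numbers below are assumptions for illustration only.

```python
# Back-of-the-envelope: per-token decoding cost depends on model size and token
# count, not on how hard the question is. PARAMS and the ~2*params FLOPs-per-token
# figure are rough assumptions for illustration.
PARAMS = 175e9    # hypothetical parameter count

def forward_flops(n_prompt_tokens: int, n_output_tokens: int) -> float:
    return 2 * PARAMS * (n_prompt_tokens + n_output_tokens)

questions = {
    "constant-time (is 7 even?)": 12,
    "NP-hard (is this SAT instance satisfiable?)": 12,
    "undecidable (does this program halt?)": 12,
}
for label, n_tokens in questions.items():
    print(label, forward_flops(n_tokens, n_output_tokens=1))   # same cost for all three
```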
On the other point you made, about Greenblatt's thing: I actually read that blog. Surprisingly enough, I read a tweet about his work, not yours as it turns out, one of the other guys said it, because I was following François Chollet's ARC challenge, and then I went back and looked at Greenblatt's blog. First of all, I think it's good work, very nice work, and I think he's a smart kid; I say kid because most people are younger than me. But I do question this notion that later models will do everything. The real question is this: if you are saying that the later models will be of a different architecture than autoregressive LLMs, I have no reason to disagree with you, because the world is wide, there are so many different things you can do, and we never said AI cannot do reasoning. AI systems that do reasoning exist: AlphaGo does reasoning, RL systems do reasoning, planning systems do reasoning, and so on. But LLMs are a broad and shallow type of AI system; they're much better for creativity than for reasoning tasks. If you just keep increasing the parameters and the size of the system, I have no rational reason to believe they will start doing reasoning. They may be able to convert reasoning into retrieval for a larger subclass, and that may be practically enough in some cases, but if they were doing reasoning I shouldn't be able to come up with the diagonalization arguments of the kind I was talking about earlier, such as changing the names of the predicates or looking at other offsets for the Caesar cipher test. I don't see that changing just by increasing the size of the models, so I don't share that optimism, honestly.

In fact, somebody from OpenAI, who should know a lot more about reasoning in general than most other people, was at an open meeting where everybody was claiming that the bigger LLMs will solve the problem, and I asked why they thought so, and this guy says, "but you see that AlphaGo does reasoning." And I said AlphaGo is not an LLM; AlphaGo is an RL system, and nobody ever said RL systems can't do reasoning. We are talking about whether LLMs by themselves can do reasoning, and there is no reason to believe they can, or that the size increase will make any difference. As I said, we play this game where whenever a new, bigger LLM comes out we run the same PlanBench experiments on it; we haven't yet done Llama 3.1, which came out just yesterday, but pretty much everything dies on the obfuscated Blocksworld problem, for example. So I don't share that optimism. Do I share optimism that we can find some other kinds of models? Yeah, of course, that's always possible; I'm a big believer in the general promise of AI systems.
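The Caesar-cipher probe mentioned here is one of the easiest of these diagonalization-style tests to generate yourself: shift-13 (ROT13) examples are plentiful on the web, while other shifts exercise the identical procedure with different surface statistics. A small instance generator, with the prompt wording being my own:

```python
# Generating Caesar-cipher decoding probes at arbitrary shifts. Shift 13 (ROT13)
# is plentiful in web text; other shifts test the same procedure with different
# surface statistics.
import string

def caesar_encode(text: str, shift: int) -> str:
    table = str.maketrans(string.ascii_lowercase,
                          string.ascii_lowercase[shift:] + string.ascii_lowercase[:shift])
    return text.lower().translate(table)

def make_instance(plaintext: str, shift: int) -> tuple:
    prompt = (f"The following was encoded with a Caesar cipher of shift {shift}. "
              f"Decode it: {caesar_encode(plaintext, shift)}")
    return prompt, plaintext.lower()   # gold answer for scoring

for shift in (13, 3, 7, 11):           # same skill, different memorization profile
    print(make_instance("attack at dawn", shift))
```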
Currently, though, I think about what Sarah Hooker called the hardware lottery: once a piece of hardware wins the hardware lottery, that is what you keep, and instead of doing what you wanted to do, you do the things the current hardware can do; you change your thinking. The equivalent is a software lottery, and LLMs have currently won the software lottery, the architecture lottery. People, at least a small subset of people, have figured out lots of engineering principles for training bigger and bigger LLMs more and more efficiently, so they are more interested in doing more of that, and to some extent they are just hoping it will somehow solve the quote-unquote reasoning problem. There is no reason to believe that; you have to consider other kinds of architectures. Right now there are people like Yann LeCun who say they will do this JEPA-style architecture that might help, but nobody is really doing that, not even Facebook, because of the software lottery: currently all the resources are going into training bigger LLMs. So my way of looking at it is that LLMs are going to be here for a while, and they are actually incredibly useful tools; I'll use them in LLM-Modulo fashion, with this back-prompting from external verifiers and so on, and it will be a while before completely different models come along that might have better properties.

Well, that's a really good segue, and by the way I'm a fan of H-JEPA from LeCun. You've got a position paper out called "LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks"; I think if you had called it AlphaReasoning it would be catchier. (Well, DeepMind hasn't been giving me any money; you need their marketing. This is all marketing, man.) In short, though: I've been a big fan of neurosymbolic architectures for years, but this is a different type of neurosymbolic architecture. It's bidirectional, because we want to have the System 1 and System 2, and you've got this idea of critics and reformulators and prompt generators and so on, and it's a template architecture for doing really good reasoning. Can you introduce it, and can you also compare it to things like AlphaGeometry and FunSearch?

So in general this is a position paper, and I was trying to unify a whole bunch of sane ways of combining LLMs with other existing systems to do reasoning tasks with guarantees. With respect to the whole discussion we had earlier, I tend to think of LLMs as great idea generators: they can generate ideas about everything. In the context of planning, they can generate plan guesses, they can generate domain-model guesses, they can generate potential elaborations of an incompletely specified problem, all of this without any guarantees. But the ability to generate reasonable guesses is nothing to be sneezed at. So what LLM-Modulo tries to do is leverage this idea-generation ability of LLMs in a generate-test framework, where the testing is done by a bank of critics, a bank of verifiers, and the LLM generates the plan guesses. Some of these verifiers might be model-based verifiers: they need the planning domain model, and they compare a plan against that domain model to check whether it does what it is supposed to do, almost like having a Python interpreter for the plan. So where is this model coming from? It turns out you can also tease the model out of the LLM, by asking what the potential actions in this domain are, what their preconditions and effects are, and so on, and then run an internal LLM-Modulo loop where this guess of the domain model is improved with syntactic checks; and as a final check there is a human in the loop who, once per domain, says this is a reasonable domain model. In old AI this was the knowledge-engineering step, and that part is now made much simpler because the LLM generates pretty good domain models; there is a NeurIPS paper on that. So in some sense this is a generate-test framework.
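Read as pseudocode, the generate-test loop just described looks roughly like the sketch below; this is my reading of the framework rather than the authors' implementation, and `llm_propose` plus the critic functions are hypothetical plug-in points.

```python
# A minimal sketch of the LLM-Modulo generate-test loop as described in the
# conversation above; `llm_propose` and the critics are hypothetical plug-ins.
from typing import Callable, List, Optional

Critic = Callable[[str], List[str]]   # returns a list of complaints; empty means OK

def llm_modulo(task: str,
               llm_propose: Callable[[str], str],
               hard_critics: List[Critic],
               style_critics: List[Critic],
               budget: int = 10) -> Optional[str]:
    prompt = task
    for _ in range(budget):
        candidate = llm_propose(prompt)
        complaints = [c for critic in hard_critics for c in critic(candidate)]
        if not complaints:            # every correctness critic signed off
            return candidate
        hints = [h for critic in style_critics for h in critic(candidate)]
        # Back-prompt: feed critiques back to the generator and ask for another guess.
        prompt = (f"{task}\nPrevious attempt:\n{candidate}\n"
                  f"Problems: {complaints}\nSuggestions: {hints}\nTry again.")
    return None                       # no candidate survived the critics within budget
```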
Even the testers themselves can be constructed with partial help from LLMs, so LLMs can be used in multiple roles in this overall architecture. And once you have a bank of critics, the order in which the critics are called can essentially simulate hierarchical planning versus normal flat planning: if you call certain kinds of critics earlier, like the causal-chain critics, then in some sense you first get a plan that almost works at that level, and then you refine it so that it works for the other constraints. Ultimately the plan comes out of this loop only when all the correctness critics say they are fine. One of the other interesting things is that we also allow style critics, and for the style critics we actually use LLMs, because, as I said earlier, for style critiquing there is no other approach anyway; there is no formal verifier that can tell you whether something is good style. Honestly, that's the reason why, in human civilization, we always assumed style is somehow more interesting than content: anybody can have content, it is the style that is interesting. LLMs have just shifted the equation; everybody can have style, because they can call ChatGPT to get the style, and the content is what is going to be the big issue. So for style verification we have the style critics, and there is a paper coming out which shows how you can use VLMs to look at videos of behavior trajectories and make useful criticisms about the ways in which those behaviors are good or bad, which can then be used as a back-prompt to improve the next behavior that is synthesized.

So that's the way to look at LLM-Modulo. You mentioned things like FunSearch and AlphaGeometry; they pretty much become special cases of LLM-Modulo. Obviously LLM-Modulo is an architecture, and the specific problem you focus on matters: AlphaGeometry does math-olympiad problems, whereas we are much more interested in planning problems. But in general those approaches are very much in line with, and possibly subsumed by, this general vision of letting LLMs do what they are good at, which is guessing, and letting verifiers do what they are good at, which is, given a guess, checking it. Interestingly, this harks back to what you were saying way back earlier about combining creativity and deductive reasoning: the creativity of LLMs, at whatever level it is, is used in conjunction with the deductive-reasoning abilities of the verifiers, and that way you can actually make guarantees. One more thing: if you want synthetic data to fine-tune the LLMs, where is synthetic data that is guaranteed to be correct supposed to come from? With this kind of setup, whatever comes out is guaranteed correct with respect to the critics that gave it the badge, so if you do enough of these you can fine-tune the LLM, which improves the density of better guesses from the generator, and that will help.
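That last point, using only critic-approved outputs as fine-tuning data, is a small wrapper around any such loop. A sketch, assuming a `solve(task)` callable like the `llm_modulo()` function sketched earlier that returns a verified solution or None; the file name and record format are arbitrary choices.

```python
# Collecting critic-approved outputs as fine-tuning data, per the point above.
# `solve` is any generate-test loop returning a verified solution or None.
import json

def collect_finetuning_data(tasks, solve, out_path="verified_plans.jsonl") -> int:
    kept = 0
    with open(out_path, "w") as f:
        for task in tasks:
            solution = solve(task)
            if solution is not None:          # only keep outputs that earned the badge
                f.write(json.dumps({"prompt": task, "completion": solution}) + "\n")
                kept += 1
    return kept   # how many verified pairs made it into the dataset
```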
The only other thing I would mention is that, as you pointed out, this is a bidirectional interaction: the guess is given to the verifier, the verifier back-prompts, and the back-prompting can come in different varieties. You can give binary criticism ("this is not correct, try again"), or you can point out what is wrong with the current plan, or the critique can be constructive ("why don't you replace this action with this other action"). When a critic does that, it is using some solving ability underneath. And since I brought up solving abilities, I want to mention that I am a fan of using verifiers rather than solvers directly, because verifiers are composable: you can have multiple kinds of constraints, a verifier for each type of constraint, each type of correctness consideration, and they compose. Solvers tend to package everything together, and if you stick to a specific solver you are stuck with its expressiveness limitations. The verifier approach allows multiple verification experts, and in fact in real-world planning, the kind of thing people at NASA do in mission planning, in the end it is a bunch of human subject experts looking at the plan and saying "we are fine, we are fine, we are fine," and when all the experts are fine you send the plan out. It may still fail, because the real world is extremely complex; you can still have the O-ring disaster. But the whole point of planning is to stop the failures that you could have foreseen, because it reduces the number of failures. So that is the LLM-Modulo idea, and it combines what we already have on the verification side with the LLM.

One other point I want to make about verification: unit testing is a kind of verification. LLM-Modulo is connected to the way some of the saner methods of using LLMs to automatically generate code work: a Python interpreter takes a piece of code that the LLM generates and checks it against unit tests. If it gives a wrong answer on one of the unit tests, you know for sure you are not going to ship it; but even if it gives correct answers on all of them, it may still fail, because all that means is it does well on these particular tests. So that is partial verification, and it makes sense; in the automatic-programming community right now, people using LLMs wind up doing exactly this, and I would say that is again consistent with the LLM-Modulo architecture.
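The unit-test case is worth spelling out because it shows what "partial verification" means operationally: failing any test is a hard reject, while passing all of them only certifies consistency with those tests. A toy harness (exec-ing model output is unsafe outside a sandbox; this is illustrative only):

```python
# Unit tests as a partial verifier: a failure is a hard reject; passing everything
# only means "consistent with these tests", not "correct".
def passes_unit_tests(candidate_source: str, tests, func_name: str = "solution") -> bool:
    namespace = {}
    try:
        exec(candidate_source, namespace)     # load the generated function
        fn = namespace[func_name]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False                          # crashing counts as failing

generated = "def solution(x, y):\n    return x + y\n"   # pretend an LLM wrote this
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
print(passes_unit_tests(generated, tests))    # True, yet still not a proof of correctness
```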
Yeah, final question. We are seeing two worlds crashing together now. In the OpenAI camp, for example, they are talking about generalist foundation models, and we are talking about specialized hybrid models with neurosymbolic architectures and so on, and people in the middle are talking about building agentic systems. One way to implement this might be: we have a foundation model, we do RAG, and it calls into LLM-Modulo. Another way might be that we build a multi-agent system. How do you see this all panning out?

So the way I think about generalist systems is this: as long as you are sticking to LLMs, whether VLMs versus LLMs or LLMs with additional extra training, all of that is about improving the density of quality guesses that the LLM gives out. Irrespective of that, they will still never be able to guarantee that the solution they give actually has any particular properties; it is just that they are being constrained to generate better solutions. Going back to that old Ramanujan point: there were things he said that turned out to be wrong, but he was one of the people who seems to have had a very good generator for that particular kind of mathematics. That is the kind of hope people have for generalist systems, in terms of improving their accuracy.

As for agentic systems, first of all I am bewildered by the whole agentic hype, because people confuse acting with planning. My comeback is: if you leave a gun in a house with a toddler, the toddler will act, but they are not necessarily planning. (That is also why we shouldn't have guns; in the country I come from, unfortunately, too many people leave guns around toddlers.) Acting is about affordances, and acting and planning are almost orthogonal: you could have somebody who can come up with a plan but doesn't have the affordances to execute it. One of the funniest Far Side cartoons by Gary Larson that I remember has two cows sitting in the living room with the phone ringing on the wall, and one cow says to the other, "here we are sitting, and there goes the phone ringing, and we lack opposable thumbs, so we can't pick it up." They have the plan for answering the phone, but they can't execute it because they are missing the affordances. Conversely, the fact that you can pick something up, or press a little button, or make function calls, doesn't guarantee that those function calls will lead to desirable outcomes. Most people in agentic systems seem to think that if you can call the function, everything will be fine. That is only true in highly ergodic worlds, where pretty much no sequence of actions fails, which is exactly the kind of case where planning is almost not needed; that is when you can get by. Otherwise, agentic systems require, in addition to the ability to call functions, planning. And you want to prove to yourself up front, at some level, that the kinds of functions you are going to call are not going to overwrite databases or lead to loss of data that cannot be reverted. Currently nobody is talking about those issues; in some sense humans have to orchestrate the plans and take the blame, and all that is actually happening is function calling. So obviously people would like to have agentic systems, but since LLMs can't do planning, I don't expect anything to change unless you have LLM-Modulo agentic systems, where the plan is guaranteed first and then you do the function calls.
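One way to read that requirement is as a guardrail in the executor: irreversible tool calls are refused unless the plan containing them has been verified (or a human has signed off). The tool registry and flags below are hypothetical, just to show the shape of the check:

```python
# Hypothetical guardrail: refuse irreversible tool calls unless the plan carrying
# them was verified up front (or a human signed off). All names are illustrative.
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    reversible: bool

TOOLS = {"search_docs": Tool("search_docs", reversible=True),
         "drop_table":  Tool("drop_table",  reversible=False)}

def execute_plan(plan, plan_verified: bool, human_approved: bool) -> None:
    for call in plan:
        tool = TOOLS[call]
        if not tool.reversible and not (plan_verified or human_approved):
            raise PermissionError(f"refusing {tool.name}: irreversible action in an unverified plan")
        print(f"calling {tool.name}()")       # stand-in for the real function call

execute_plan(["search_docs"], plan_verified=False, human_approved=False)    # fine
# execute_plan(["drop_table"], plan_verified=False, human_approved=False)   # raises
```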
That is a possible direction, and that is how I look at this generalist-versus-agentic-systems question. The final thing, about approaches like LLM-Modulo: as I said, because LLMs won the software lottery they are going to be here for a while. I don't know what "a while" means; we are in an incredible time, as you were saying earlier, it is a great time to be alive in AI, and you don't know what is going to happen next week. But they will be around for a while for sure, so it makes sense to get them to do things in a sane way, and that is where these LLM-Modulo architectures come in, because that sort of AI already exists: RL systems, verifiers and so on exist already, and you can put them together with LLMs to give some modicum of guarantees about reasoning correctness and planning correctness.

So, Rao, it has been an honor and a pleasure, and I really hope you have inspired more young researchers to go into this field. Could you give a few final words on that: where can people find out more about you, and what kind of research should people be doing?

The usual advice I give to grad students is that you should have broad knowledge, not just the thing that is currently the most popular. What has happened, to some extent, is that many grad students haven't done anything with logic and reasoning and deductive closure; some of the words we are using here are not even things they recognize. They are very bright people, but they have focused on a very particular set of skills, partly because this stuff is working and logic is seen as old school. But you have to differentiate between normative and operational uses: logic still remains the normative way of judging reasoning correctness. You want to know this so that you can be more careful about the claims you are making, so that you can understand the difference between reasoning and retrieval, and between a database, an n-gram model and a deductive database. Once you have that, humans are incredibly smart, and smart grad students are even smarter than the rest of us, and they can do good work.

The other thing, which I think we were talking about beforehand, is to have skepticism, because I find that in the era of LLMs, AI has become an ersatz natural science. Instead of building an artifact to have certain guarantees, you build the artifact and then poke it to see what it is able to do; hence this whole notion of emergent abilities. Old engineers would be scared to death by this. If an old civil engineer found out that the bridge over the Danube, in addition to supporting traffic, tends to whistle on Fridays and fly on Saturdays, they would consider that a failure of bridge building, because you are supposed to build to specification. We are in this interesting world where we develop these huge models and poke them to see what they do, and then all these emergent-abilities and "Sparks" papers come out. What is needed in that sort of observational study is rigor: don't stop just because you got one positive result; try to see where else it is likely to fail.
A classic example is the chain-of-thought paper, which is hugely influential and hugely popular. Interestingly, in their paper they have multiple cases where they applied chain of thought, something like seven, and in about four of them it doesn't do well. Instead of trying to understand why it doesn't do well there, they focus the rest of the paper on the three cases where it works well, and the rest of the community just went with that positive message. But it is important to understand where it doesn't do well too; that is the kind of rigor observational studies need. Right now we are sort of zoologists: we have these incredibly complex organisms that we train without knowing what they are supposed to be doing, and we get surprised when they show a spark here and a spark there. You want to be able to make supportable claims, and you need to be more skeptical about your empirical claims in particular. That is the advice I would give. As for my own work, first of all thanks again for having me on this; I have been going around giving tutorials and talks, all of which are available on our websites, and the LLM-Modulo paper discusses a lot of other related work that people can look at if they are interested in that direction.

Amazing. And negation is a good example, by the way. Have you seen that paper where it is something like "Tom Cruise's mother is so-and-so," and then you reverse the question and it fails?

Yeah, the Reversal Curse sorts of things happen too. But the general issue in empirical studies, and I keep telling this to my students, is that just when you think the results are coming out in support of your hypothesis is exactly when you need to be extremely skeptical, because the human tendency is to celebrate the success and write the paper, and that leads to a situation where you wind up writing papers with a low half-life that collectively push the field in somewhat futile directions. Look, I don't think there is any simple fix; research is done by individuals, and they have all sorts of motivations, and I am completely aware of that. Since you asked what advice I would give, I give this advice knowing full well that nobody is obliged to follow it, and hopefully somebody will pay some attention and that might help. In general it is a decentralized process, and the one good thing about science is that it is self-correcting: there was the chain-of-thought paper, and there is also our "chain of thought doesn't work" paper. Ultimately, and I want to end with this story I was mentioning to somebody else: Albert Einstein was once asked by a journalist whether he knew about this book by the so-called Aryan scientists, a hundred Aryan scientists against relativity, and what he thought about it, and Einstein said it doesn't matter: if they are right, one is enough, and if they are wrong, ten thousand make no difference. Ultimately that is the self-correcting nature of science.
And in fact there was a beautiful paper yesterday in the position-papers track, the first oral at ICML, about embracing negative results. I understand that people tend to say "just be positive, not negative," but in science you need to understand both the limitations and the capabilities. People who just tuned in in the middle might think I am some kind of LLM Luddite; I actually think LLMs are incredibly useful. You just don't want to be delusional about what they can do; you should know what they can and cannot do so that you can use them correctly.

Rao, honestly, it has been an absolute honor. It was worth coming to Vienna and losing my luggage just for this interview. Thank you so much, I appreciate it.

I appreciate it, thank you.

[Music]