The Kaleidoscope hypothesis is the idea that the world in general, and any domain in particular, follows the same structure: it appears on the surface to be extremely rich and complex and infinitely novel with every passing moment, but in reality it is made from the repetition and composition of just a few atoms of meaning. A big part of intelligence is the process of mining your experience of the world to identify the bits that are repeated, and to extract these unique atoms of meaning. When we extract them, we call them abstractions, and as we build inner banks of such abstractions, we can reuse them to make sense of novel situations — situations that appear extremely unique and novel on the surface, but that can actually be interpreted by composing together these reusable abstractions.

Before we start building AGI, we need to ask ourselves the hard questions: what is intelligence, how can we measure it and benchmark progress, and what directions should we follow to build it? I want to give you my take on these questions, and I'm going to start by taking you back to peak AGI hype, which was early last year. Do you remember what February 2023 felt like? ChatGPT had been released just a couple of months prior, GPT-4 had just come out, Bing Chat had just come out — it was the Google killer; anyone remember Bing Chat? We were told that ChatGPT would make us 100x, even 1,000x more productive, that it could outright replace us, with the existential risk of AI becoming front-page news, and that AGI was just around the corner: no longer ten years away, not even five years away, just a couple of years away — you could start the countdown in months. That was one and a half years ago.

Clearly, back then, AI was coming for your job right away. It could do anything you could, but faster and cheaper. And how did we know this? Well, it could pass exams, and exams are the way we tell whether other humans are fit to perform a certain job. If AI passes the bar exam, then it can be a lawyer; if it can solve programming puzzles, then it can be a software engineer; and so on. So many people were saying that all lawyers, all software engineers, all doctors and so on were going to be out of a job, maybe even within the next year — which would have been today. In fact, most desk jobs were going to disappear and we faced mass unemployment. The funny thing is that today the employment rate in the US is actually higher than it was at the time. So was that really true? Was it really what the benchmarks were telling us back then?

If you go back to the real world, away from the headlines, away from the February 2023 hype, it seems that LLMs might fall a little bit short of general AI — I'm sure most of you in this room would agree with that. They suffer from some problems, and these limitations are inherent to curve fitting; they are inherent to the paradigm we are using to build these models, so they are not easy to patch. In fact, there has been basically no progress on these limitations since day one — and day one was not last year, it was when we started using these Transformer-based large language models, over five years ago. We have not really made any progress on these problems because the models we are using are still the same: parametric curves fitted to a dataset with gradient descent, still using the same Transformer architecture. So I'm going to cover these limitations.
I'm not actually going to cover hallucinations, because all of you are probably very familiar with them, but let's take a look at the other ones.

To start with, an interesting issue with LLMs is that, because they are autoregressive models, they will always output something that seems likely to follow your question, without necessarily looking at the contents of your question. For instance, for a few months after the original release of ChatGPT, if you asked "what's heavier, 10 kilos of steel or one kilo of feathers?", it would answer that they weigh the same. It would answer that because the trick question "what's heavier, one kilo of steel or one kilo of feathers?" is found all over the internet, and the answer to that one is of course that they weigh the same. The model would just pattern-match the question without actually looking at the numbers, without parsing the actual question you were asking. The same happens if you provide a variation of the Monty Hall problem, which is the screenshot right here: the LLM has perfectly memorized the canonical answer to the actual Monty Hall problem, so if you ask an absurd variation, it will just blow right through it and output the answer to the original problem.

To be clear, these two specific problems have already been patched via RLHF, but they were patched by special-casing them, and it is very easy to find new problems that still exhibit this failure mode. You may say, well, these examples are from last year, so surely today we are doing much better — and in fact, no, we are not. The issues have not changed since day one; we have not made any progress towards addressing them, and they still plague the latest state-of-the-art models, like Claude 3.5 for instance. There is a paper from just last month that investigates some of these examples in state-of-the-art models, including Claude 3.5.

A closely related issue is the extreme sensitivity of LLMs to phrasing. If you change names, places, or variable names in a text paragraph, it can break LLM performance; same if you change the numbers in a formula. There is an interesting paper that investigates this — you can check it out, it's called "Embers of Autoregression". People who are very optimistic will say that this brittleness is actually a great thing, because it means the models are more performant than you think: you just need to query them in the right way and you will see better performance — you just need prompt engineering. The counterpart to that statement is that for any query that seems to work, there is an equivalent rephrasing of the query — one that a human would readily understand — that will break it. And to what extent do LLMs actually understand something if you can break their understanding with very simple renamings and rephrasings? It looks a lot more like superficial pattern matching than robust understanding.

Besides that, there is a lot of talk about LLMs' ability to perform in-context learning, to adapt to new problems on the fly. What seems to actually happen is that LLMs are capable of fetching memorized programs — problem-solving templates — and mapping them to the current task. If they don't have a memorized program ready, if they are faced with something slightly unfamiliar, even something very simple, they will not be able to analyze it from first principles the way a human could. One example is Caesar ciphers: state-of-the-art LLMs can solve a Caesar cipher, which is very impressive, but as it turns out, they can only solve it for very specific values of the key size — values like 3 and 5 that you commonly find in online examples. If you show them a cipher with a key size of 13, for instance, they will fail.
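To be concrete about what "knowing the algorithm" would mean here: the general procedure is trivial and works for every key, which is exactly what makes the key-specific failures telling. A minimal sketch in Python (the function name and example are mine, not from the talk):

```python
def caesar_decode(ciphertext: str, key: int) -> str:
    """Decode a Caesar cipher by shifting each letter back by `key` positions.

    The same few lines work for key = 3, 5, 13, or any other value, which is
    what separates knowing the algorithm from having memorized the outputs
    for a handful of popular keys.
    """
    result = []
    for ch in ciphertext:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            result.append(chr((ord(ch) - base - key) % 26 + base))
        else:
            result.append(ch)  # leave spaces and punctuation untouched
    return "".join(result)


# Works identically for the "rare" key size 13 mentioned in the talk:
print(caesar_decode("Uryyb, jbeyq!", 13))  # -> "Hello, world!"
```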
So they have no actual understanding of the algorithm for solving the cipher; they have only memorized it for very specific values of the key size. My hypothesis is that LLM performance depends purely on task familiarity, and not at all on task complexity. There isn't really any complexity ceiling to what you can get LLMs to solve — to memorize — as long as you give them the opportunity to memorize the solution, or the problem-solving template, the program that you have to run to get the answer. Instead, LLM performance depends entirely on task familiarity, and so even very simple problems, if they are unfamiliar, will be out of reach.

Lastly, LLMs suffer from a generalization issue with the programs that they did in fact memorize. Some examples: LLMs have trouble with number multiplication, as you probably know, and with list sorting, and so on, even though they have seen millions of examples of these problems; they typically have to be aided by external symbolic systems in order to handle these things. There is an interesting paper that investigates how LLMs handle composition, titled "Limits of Transformers on Compositionality", and the main finding is that LLMs do not actually handle composition at all — what they are doing instead is linearized subgraph matching. There is another paper that is also very intriguing, on the Reversal Curse: the authors found that if you train an LLM on content like "A is B", it cannot actually infer the reverse, "B is A". That is a breakdown of generalization at a pretty deep level, and it's actually quite surprising — even though I'm typically very skeptical of LLMs, I was very surprised by this result.

One thing to keep in mind about these failure cases is that specific queries tend to get fixed relatively quickly, because the models are being constantly fine-tuned on new data collected from human contractors based on query history. So many of the examples I show in my slides are probably already working with some of the state-of-the-art models, precisely because they failed in the past and have been manually addressed since. But that is a very brittle way of making progress, because you are only addressing one query at a time, and even for a query that you patched, if you rephrase it or change the names and variables, it will start failing again. It's a constant game of whack-a-mole, and it is very heavily reliant on human labor: today there are probably between 10,000 and 30,000 humans working full-time on creating annotated data to train LLMs.

So on balance it seems a little bit contradictory: on one hand, LLMs are beating every human benchmark you throw at them, and on the other hand, they are not really demonstrating a robust understanding of the things they are doing. To resolve this paradox, you have to understand that skill and benchmarks are not the primary lens through which you should look at these systems. So let's zoom out, by a lot. There have historically been two currents of thought to define the goals of AI. First, there is the Minsky-style view, which echoes the current big-tech view, that AGI would be a system that
can perform most economically valuable tasks. Minsky said: "AI is the science of making machines capable of performing tasks that would require intelligence if done by humans." It is very task-centric: you care about whether the AI does well on a fixed set of tasks. Then there is the McCarthy view — he didn't exactly say what I'm quoting here, but he was a big proponent of the idea that generality in AI is not task-specific performance scaled up to many tasks; it is about getting machines to handle problems they have not been prepared for. That difference echoes the Locke view of intelligence versus the Darwin view of intelligence: intelligence as a general-purpose learning mechanism, versus intelligence as a collection of task-specific skills imparted to you by evolution.

My view is more like the Locke and McCarthy view: I see intelligence as a process, and task-specific skill as the output of that process. This is a really important point — if there is just one point you take away from this talk, it should be this: skill is not intelligence, and displaying skill at any number of tasks does not show intelligence. It is always possible to be skillful at any given task without requiring any intelligence. It's like the difference between having a road network and having a road-building company. If you have a road network, you can go from A to B for a very specific set of A's and B's that were defined in advance. If you have a road-building company, you can start connecting arbitrary A's and B's on the fly, as your needs evolve. Attributing intelligence to a crystallized behavior program is a category error: you are confusing the output of the process with the process itself. Intelligence is the ability to deal with new situations, the ability to blaze fresh trails and build new roads — it is not the road. Don't confuse the road with the process that created it.

All the issues we are facing today with LLMs are a direct result of this misguided conceptualization of intelligence. The way we define and measure intelligence is not a technical detail that you can leave to externally provided benchmarks — it reflects our understanding of cognition. It reflects the questions you are asking, and through that it limits the answers you can get. It is the way you measure progress; it is the feedback signal you use to get closer to your goals. If you have a bad feedback signal, you are not going to make progress towards actual generality.

So here are some key concepts you have to take into account if you want to define and measure intelligence. The first is the distinction between static skill and fluid intelligence: between having access to a large collection of static programs to solve known problems, like an LLM does, versus being able to synthesize brand-new programs on the fly to solve a problem you have never seen before. It's not a binary — either you have fluidity or you don't — it's more like a spectrum, but there is higher intelligence on the fluid side of the spectrum. The second concept is operational area: there is a big difference between being skilled only in situations very close to what you are familiar with, versus being skilled in any situation within a broad scope. For instance, if you know how to add numbers, then you should be able to add any two numbers, not just specific numbers you have seen before or numbers
close to them. If you know how to drive, then you should be able to drive in any city; you should even be able to learn to drive in the US, then move to London and drive there, on the other side of the road. If you know how to drive, but only in very specific geofenced areas, that is less intelligent. Again, there is a spectrum here — it's not a binary — but there is higher intelligence on the broader-generalization side of the spectrum. And the last concept is information efficiency: how much information, how much data, was required for your system to acquire a new skill program? If you are more information efficient, you are more intelligent.

All three of these concepts, these three qualities, are linked by the concept of generalization, and generalization is really the central question in AI — not skill. Forget about skill, forget about benchmarks. That is really the reason why using human exams to evaluate AI models is a terrible idea: exams were not designed with generalization in mind — or rather, they were designed with generalization assumptions that are appropriate for human beings but not for machines. Most exams assume that the test-taker hasn't read the questions and answers beforehand; they assume the questions will be at least somewhat unfamiliar to the test-taker — unless it's a pure memorization exam, in which case it would make sense that LLMs could ace it, since they have memorized the entire internet.

To get to the next level of capabilities, we have seen that we want AI to have the ability to adapt, to generalize to new situations it has not been prepared for. To get there, we need a better way to measure this ability, because it is by measuring it that we will be able to make progress — we need a feedback signal. And to get there, we need a clear understanding of what generalization means. Generalization is the relationship between the information you have — the priors you are born with, and the experience you have acquired over the course of your lifetime — and your operational area over the space of potential future situations you might encounter as an agent. These situations are going to feature uncertainty, they are going to feature novelty; they are not going to be like the past. Generalization is basically the efficiency with which you operationalize past information in order to deal with the future, so you can interpret it as a conversion ratio. If you enjoy math, you can in fact use algorithmic information theory to try to characterize and quantify this ratio precisely — I have a paper about it, so if that is interesting to you, you can check it out. One of the things I talk about in the paper is that to measure generalization power, to measure intelligence, you should control for priors and experience: since intelligence is a conversion ratio, you need to know what you are dividing by.
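As a rough schematic of that ratio — this is a simplification of my own, not the exact algorithmic-information-theoretic formalism from the paper — you can picture intelligence as skill acquired per unit of information spent:

```latex
% Schematic only: the paper defines these quantities with algorithmic
% information theory; here they are just informal placeholders.
\text{Intelligence} \;\propto\;
\frac{\text{skill attained} \,\times\, \text{scope of situations handled}}
     {\text{priors} \,+\, \text{experience}}
```

The point of writing it as a fraction is exactly the point made above: two systems with the same measured skill are not equally intelligent if one of them needed far more priors or far more experience to get there.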
And if you are interested specifically in comparing AI to human intelligence, then you should standardize on a shared set of cognitive priors — which should of course be human cognitive priors, what we call Core Knowledge.

As an attempt to fulfill these requirements for a good benchmark of intelligence, I put together a dataset called the Abstraction and Reasoning Corpus for Artificial General Intelligence, or ARC-AGI for short. You can think of it as a kind of IQ test, a kind of intelligence test that can be taken by humans — it is actually very easy for humans — or by AI agents, and you can also think of it as a program synthesis dataset. A key idea is that in ARC-AGI, every task you see is novel: it is different from every other task in the dataset, and it is also different from anything you may find online. So you cannot prepare in advance for ARC; you cannot solve ARC by memorizing solutions in advance — that just doesn't work. ARC tries to control for experience, because you are doing few-shot program learning: you see two or three examples of a thing, and from that you must infer the program that links the input to the output. And we are also controlling for priors, in the sense that ARC-AGI is grounded purely in Core Knowledge priors: it does not rely on any sort of acquired knowledge, like the English language for instance. It is built only on top of four Core Knowledge systems: there is objectness, there is basic geometry and topology, there are numbers, and there is agentness.
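To make the few-shot setup concrete, here is roughly what a single ARC-style task looks like as data: a handful of demonstration input/output grid pairs plus a test input whose output you must produce. The grids below are toy examples of mine, not actual ARC tasks; the real dataset ships as JSON in essentially this shape.

```python
# Each grid is a small 2D array of integers 0-9 (colors). A task gives you a
# few demonstration pairs, asks you to infer the transformation, and then
# asks you to apply it to the test input.
task = {
    "train": [
        {"input": [[0, 1], [0, 0]], "output": [[1, 0], [0, 0]]},
        {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": [
        {"input": [[0, 0], [0, 3]]},  # expected output: [[0, 0], [3, 0]]
    ],
}

# The "program" linking input to output here is simply "mirror the grid
# left-to-right" -- the solver has to discover that from two examples alone.
def mirror_left_right(grid):
    return [list(reversed(row)) for row in grid]

for pair in task["train"]:
    assert mirror_left_right(pair["input"]) == pair["output"]
```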
We first ran a Kaggle competition on this dataset in early 2020, and it produced several very interesting solutions, all based on program synthesis. Right now the state of the art is at about 40% of the test set solved, and that is very much baby steps. This is a dataset from before the age of LLMs, but it has actually become even more relevant in the age of LLMs, because most benchmarks based on exams and so on have already saturated in the age of LLMs — but not ARC-AGI. That is because ARC-AGI is designed to be resistant to memorization, and all the other benchmarks can be hacked by memorization alone.

So in June this year we launched a much more ambitious competition around ARC-AGI, which we call the ARC Prize. Together with Mike Knoop, the co-founder of Zapier, we are offering over a million dollars in prizes to get researchers to solve ARC-AGI and open-source the solution. The competition has two tracks. There is a private track that takes place on Kaggle — it is the largest Kaggle competition at the moment — where you get evaluated on 100 hidden tasks and your solution must be self-contained: it must be able to run on a GPU VM, with decent CPUs as well, within 12 hours. There is a grand prize if you get over 85%, there are prizes for the top scores as well, and there is even a best-paper prize of $45k — so even if you don't have a top result but you have good ideas, just write a paper and submit it; you can win some money. There is also a public track, which we added because people kept asking: okay, but how do the state-of-the-art LLMs like GPT-4, Claude, and so on do on the dataset? So we launched this sort of semi-private eval, where the tasks are not public, but they are also not quite private, because they are being queried through these remote APIs. And surprisingly, the state of the art on this track is pretty much the same as on the private track, which is actually quite interesting.

So what is LLM performance on ARC-AGI, exactly? It is not very good: most state-of-the-art LLMs are doing between 5% and 9%, and then there is one that is doing better — Claude 3.5, which is a big jump, at 21%. Meanwhile, basic program search should be able to get you to at least 50%. How do we know this? 50% is what you get if you assemble all of the submissions that were made in the 2020 competition, which were all brute-force program search; so basically, if you scale up brute-force program search with more compute, you should get at least 50%. And meanwhile, humans easily score over 90%: the private set was verified by two people, each of them scored 97 to 98%, and together they solve 100%.

So 5 to 21% is not great, but it is also not zero, and it implies that LLMs have non-zero intelligence according to the benchmark — which is intriguing. But one thing you have to keep in mind is that the benchmark is far from perfect: there is a chance you could achieve this score by purely memorizing patterns and reciting them. It's possible. So we have to investigate where this performance comes from, because if it comes from a kernel of reasoning, then you could scale up the approach to become more general over time and eventually get to general AI. But if that performance actually comes from memorization, then you will never reach generality: you will always have to keep applying these one-time, human-guided, pointwise fixes to acquire new skills. It is going to be a perpetual game of whack-a-mole, and it is not going to scale to generality.

To better understand what LLMs are doing, we have to talk about abstraction. Abstraction and generalization are closely tied, because abstraction is the engine through which you produce generalization. So let's take a look at abstraction in general, and then we will look at abstraction in LLMs. To understand abstraction, you have to start by looking around — zoom out, look at the universe. An interesting observation about the universe is that it is made of many different things that are all similar, all analogous to each other: one human is similar to other humans because they have a shared origin; electromagnetism is analogous to hydrodynamics, which is also analogous to gravity; and so on. Everything is similar to everything else — we are surrounded by isomorphisms. I call this the Kaleidoscope hypothesis. You know what a kaleidoscope is: a tube with a few bits of colored glass that are repeated and amplified by a set of mirrors, creating this remarkable richness of complex patterns out of just a few kernels of information. The universe is like that. And in this context, intelligence is the ability to mine the experience you have to identify bits that are reusable; you extract these bits and call them abstractions — they take the form of programs, patterns, representations — and then you recombine these bits to make sense of novel situations. So intelligence is sensitivity to abstract analogies, and in fact that is pretty much all there is to it: if you have a high sensitivity to analogies, you will extract powerful abstractions from little experience, and you will be able to use these abstractions to make sense of a maximally large area of future experience space.

One really important thing to understand about abstraction ability is that it is not a binary thing, where either you are capable of abstraction or you are not; it is a matter of degree. There is a spectrum, from factoids, to organized knowledge, to abstract models that can generalize broadly and accurately, to meta-models that enable you to generate new models on the fly given a new situation. Degree zero is when you purely memorize pointwise factoids: there is no abstraction involved, and it does not generalize at all beyond what you memorized. Here we are representing factoids as functions with no arguments — we will see why in a bit; the fact that they take no arguments means they are not abstract at all.
Once you have lots of related factoids, you can organize them into something more like an abstract function that encodes knowledge. Here this function has a variable x, so it is abstract for x. This type of organized knowledge, based on pointwise factoids or interpolations between pointwise factoids, does not generalize very well — kind of like the way LLMs add numbers. It looks like abstraction, but it is a relatively weak form of abstraction: it may be inaccurate, and it may not work on data points that are far from the data points you have seen before. The next degree of abstraction is to turn your organized knowledge into models that generalize strongly. A model is not an interpolation between factoids anymore; it is a concise and causal way of processing inputs to obtain the correct output. A model of addition using just binary operations would look like this: it returns the correct result 100% of the time, it is not approximate, and it will work with any inputs whatsoever, regardless of how large they might be. This is strong abstraction. LLMs, as we know, still fall short of that.
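A minimal sketch of those degrees in code — the specific functions are illustrative, not from the talk. Degree zero is a factoid (a function of no arguments); organized knowledge is abstract in one variable but still anchored to memorized points; and a genuine model of addition, built from binary (bitwise) operations, is exact for any inputs:

```python
# Degree 0: a pointwise factoid -- a function of no arguments; no abstraction,
# no generalization beyond the single fact it stores.
def three_plus_five():
    return 8

# Degree 1: organized knowledge -- abstract for x, but built on memorized points;
# it looks like abstraction, yet it breaks as soon as x is far from what was seen.
KNOWN_SUMS = {1: 6, 2: 7, 10: 15}  # memorized values of x + 5
def plus_five_memorized(x):
    if x in KNOWN_SUMS:
        return KNOWN_SUMS[x]
    nearest = min(KNOWN_SUMS, key=lambda k: abs(k - x))
    return KNOWN_SUMS[nearest]  # a guess based on the nearest memorized point

# Degree 2: an actual model of addition, expressed with binary operations only.
# It is exact for any non-negative integers, however large.
def add(a, b):
    while b:
        carry = a & b      # positions where both bits are 1 (these overflow)
        a = a ^ b          # bitwise sum, ignoring carries
        b = carry << 1     # re-inject the carries one position to the left
    return a

assert add(123456789, 987654321) == 123456789 + 987654321
```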
The next stage is the ability to generate abstractions autonomously: that is how you become able to handle novel problems, things you have not been prepared for, and that is what intelligence is. Everything up to this point was not actually intelligence — it was just crystallized skill. And the last stage is to be able to do so in a way that is maximally information efficient; that would be AGI. It means you should be able to master new tasks using very little experience, very little information about the task: not only do you display high skill at the task — meaning the model you apply generalizes strongly — but you only had to look at a few examples, a few situations, to produce that model. That is the holy grail of AI.

If you want to situate LLMs on this spectrum of abstraction, they are somewhere between organized knowledge and generalizable models. They are clearly not quite at the model stage, as per the limitations we discussed: if LLMs were at the model stage, they could actually add numbers or sort lists. But they have a lot of knowledge, and that knowledge is structured in such a way that it can generalize to some distance from previously seen situations; it is not just a collection of pointwise factoids. If you solved all the limitations of LLMs — hallucinations, brittleness, and so on — you would get to the next stage. But in order to get to actual intelligence, to on-the-fly model synthesis, there is still a massive jump: you cannot just scale the current approach and get there, you actually need brand new directions. And of course, AGI after that is still a pretty long way off.

So how do we build abstraction in machines? Let's take a look at how abstraction works. I said that intelligence is sensitivity to analogies, but there is actually more than one way to draw analogies. There are two key categories of analogies, from which arise two categories of abstraction: value-centric abstraction and program-centric abstraction. They are pretty similar to each other — they mirror each other. They are both about comparing things, and then merging individual instances into common abstractions by erasing certain details about the instances that don't matter. You take a bunch of things, you compare them with each other, you erase the stuff that doesn't matter, and what you're left with is an abstraction.

The key difference between the two is that the first operates in a continuous domain and the other operates in a discrete domain. Value-centric abstraction is about comparing things via a continuous distance function — like the dot product in an LLM, or the L2 distance — and this is basically what powers human perception, intuition, and pattern recognition. Meanwhile, program-centric abstraction is about comparing discrete programs, which are graphs, and instead of computing a distance between graphs, you are looking for exact subgraph isomorphisms, exact subgraph matching. If you ever hear a software engineer talk about abstraction, this is actually what they mean: when you refactor something to make it more abstract, that is what you are doing. Both of these forms of abstraction are driven by analogy-making — they are just different ways to make analogies. Analogy-making is the engine that produces abstraction. Value-centric analogy is grounded in geometry: you compare instances via a distance function. Program-centric analogy is grounded in topology: you are doing exact subgraph matching. All cognition arises from an interplay between these two forms of abstraction. You can also remember them via the left-brain versus right-brain analogy, or the Type 1 versus Type 2 thinking distinction from Kahneman. Of course, the left-brain versus right-brain framing is just an image — it is not how lateralization of cognitive function actually works in the brain — but it is a fun way to remember it.

Transformers are actually great at Type 1, at value-centric abstraction: they do everything that Type 1 is effective for, like perception, intuition, and pattern recognition. In that sense, Transformers represent a major breakthrough in AI. But they are not a good fit for Type 2 abstraction, and that is where all the limitations we listed come from. This is why they cannot add numbers, or why they cannot infer from "A is B" that "B is A" — even with a Transformer trained on all the data on the internet. So how do you go forward from here? How do you get to Type 2? How do you solve problems like ARC-AGI, or any reasoning or planning problem? The answer is that you have to leverage discrete program search, as opposed to purely manipulating continuous, interpolative embedding spaces learned with gradient descent. There is an entire separate branch of computer science that deals with this, and in order to get to AGI, we have to merge discrete program search with deep learning.

So, a quick intro: what is discrete program search, exactly? It is basically combinatorial search over graphs of operators taken from a domain-specific language, a DSL. There are many flavors of that idea — genetic programming, for instance. To better understand it, you can draw a side-by-side analogy between what machine learning does and what program synthesis does. In machine learning, your model is a differentiable parametric function; in program synthesis, it is a graph of operators taken from a DSL. In ML, the learning engine is gradient descent; in program synthesis, it is combinatorial search. In ML, you have a continuous loss function as a feedback signal; in program synthesis, you only have a binary correctness check. The big hurdle in ML is data density: your model is a curve, so to fit it you need a dense sampling of the problem space. Meanwhile, program synthesis is extremely data efficient — you can often find the program using just a couple of examples — but the key hurdle is combinatorial explosion: the size of the program space you have to search to find the correct program is immense, and it increases combinatorially with program length.
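Here is a minimal sketch of what that kind of combinatorial search looks like in practice: a tiny hypothetical DSL of list operators, brute-force enumeration of compositions up to a fixed depth, and a binary correctness check against a couple of input/output examples. The DSL and all names here are mine, purely for illustration.

```python
from itertools import product

# A tiny DSL: each primitive maps a list of ints to a list of ints.
DSL = {
    "reverse":    lambda xs: xs[::-1],
    "sort":       sorted,
    "drop_first": lambda xs: xs[1:],
    "double":     lambda xs: [2 * x for x in xs],
}

def run(program, xs):
    for op in program:        # a program is just a sequence of primitive names
        xs = DSL[op](xs)
    return xs

def brute_force_search(examples, max_depth=3):
    """Enumerate all operator sequences up to max_depth and return the first
    program consistent with every (input, output) example."""
    for depth in range(1, max_depth + 1):
        for program in product(DSL, repeat=depth):  # combinatorial explosion lives here
            if all(run(program, i) == o for i, o in examples):
                return program
    return None

# Two examples are enough to pin down "sort, then double" in this toy space.
examples = [([3, 1, 2], [2, 4, 6]), ([5, 4], [8, 10])]
print(brute_force_search(examples))   # -> ('sort', 'double')
```

Note how data efficient it is — two examples suffice — and how quickly the search space blows up: with k primitives and depth d there are k^d candidate programs to check.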
Program synthesis has been very successful on ARC-AGI so far, even though these are just baby steps. All program synthesis solutions on ARC follow the same template: basically, doing brute-force program search. And even though that is very primitive, it still outperforms the state-of-the-art LLMs — with much less compute, by the way. So now we know what the limitations of LLMs are; we know what they are good at, Type 1, and what they are not good at, Type 2. Where do we go next? The answer is that we have to merge machine learning, which provides Type 1 thinking, with the Type 2 thinking provided by program synthesis. I think that is really how intelligence works in humans; that is what human intelligence is really good at, what makes us special: we combine perception and intuition with explicit step-by-step reasoning — we combine both forms of abstraction. For instance, when you play chess, you use Type 2 when you calculate step by step, when you unfold specific interesting moves. But you are not doing this for every possible move, because there are far too many of them — that is combinatorial explosion. You only do it for a handful of options, and you use your intuition, which you built up by playing lots of games, to narrow down the discrete search you perform when you calculate. So you are merging Type 1 and Type 2, and that is why you can play chess using very small cognitive resources compared to what a computer needs. This blending of Type 1 and Type 2 is where we should take AI next, so that we can combine deep learning and discrete search into a unified approach.

So how does that work? The key System 2 technique is tree search over spaces of programs, and the key wall you run into is combinatorial explosion. Meanwhile, the key System 1 technique is curve fitting — generalization via interpolation: you embed lots of data into an interpolative manifold, and that manifold can make fast but approximate judgment calls about the target space. The big idea is to leverage those fast but approximate judgment calls to fight combinatorial explosion: you use them as a form of intuition about the structure of the program space you are searching over, that you are navigating, and you use that to make search tractable. A simple analogy, if that sounds a little too abstract, is drawing a map. You take a space of discrete objects with discrete relationships that would normally require combinatorial search — pathfinding in the Paris Metro is a good example; that is a combinatorial problem — and you embed these discrete objects and their relationships into a geometric manifold, where you can compare things via a continuous distance function and make fast but approximate inferences about relationships. You can pretty much draw a straight line on this map, look at what it intersects, and this gives you a candidate path that restricts the set of discrete paths you will have to examine one by one. This enables you to keep combinatorial explosion in check.
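A minimal sketch of that idea: a toy metro-style graph where each station also has a position in a 2D "map", and the straight-line distance to the goal is used as the fast, approximate judgment call that decides which discrete paths get expanded first. This is just classic A* with a geometric heuristic, written out as an illustration of the point; the stations and coordinates are invented.

```python
import heapq
import math

# Discrete relationships: which stations connect to which (the combinatorial part).
edges = {
    "A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C", "E"], "E": ["D"],
}
# Continuous embedding: a 2D position for each station (the "map").
coords = {"A": (0, 0), "B": (1, 1), "C": (1, -1), "D": (2, 0), "E": (3, 0)}

def straight_line(a, b):
    (x1, y1), (x2, y2) = coords[a], coords[b]
    return math.hypot(x2 - x1, y2 - y1)

def find_path(start, goal):
    """A*: the geometric distance-to-goal steers which discrete paths get
    expanded first, so most of the combinatorial space is never touched."""
    frontier = [(straight_line(start, goal), 0.0, start, [start])]
    seen = set()
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt in edges[node]:
            new_cost = cost + straight_line(node, nxt)
            priority = new_cost + straight_line(nxt, goal)  # the "intuition" term
            heapq.heappush(frontier, (priority, new_cost, nxt, path + [nxt]))
    return None

print(find_path("A", "E"))  # -> ['A', 'B', 'D', 'E']
```

The graph stays discrete and the final answer is still an exact discrete path; the continuous map only decides where to look first.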
But of course, you cannot draw maps of just any kind of space — program space, in particular, is extremely nonlinear. To use deep learning on a problem, you need two things: you need an interpolative problem, one that follows the manifold hypothesis, and you need lots of data. If you only have two examples, it doesn't work. So if you look at a single ARC task, it is not interpolative, and you only have two to four examples, so you cannot use deep learning directly — and you cannot solve it purely by reapplying a memorized pattern either, so you cannot just use an LLM. But if you take a step down, lower on the scale of abstraction, and look at Core Knowledge — the Core Knowledge systems that ARC is built upon — each Core Knowledge system is interpolative in nature and could be learned from data, and of course you can collect lots of data about them. So you could use deep learning at that level, to serve as a perception layer that parses the ARC world into discrete objects. Likewise, if you take a step higher up the scale of abstraction and look at the space of all possible ARC-AGI tasks and all possible programs that solve them, then again you will find continuous dimensions of variation. So you can leverage interpolation in that space to some extent: you can use deep learning there to produce intuition over the structure of the space of ARC tasks and the programs that solve them.

Based on that, I think there are two exciting research areas for combining deep learning and program synthesis. The first is leveraging discrete programs that incorporate deep learning components: for instance, using deep learning as a perception layer to parse the real world into discrete building blocks that you can feed into a program synthesis engine. You can also add symbolic add-ons to deep learning systems — something I have been talking about for a very long time, and it is actually starting to happen now, with things like external verifiers, tool use, and so on. The other direction is deep learning models used to inform discrete search and improve its efficiency: you use deep learning as a driver, as a guide, for program synthesis. For instance, it can give you intuitive program sketches to guide your program search, and it can reduce the space of possible branching decisions you need to consider at each node, and so on.

So what would that look like on ARC-AGI? I'm going to spell out for you how you could crack ARC-AGI and win a million dollars, maybe. There are two directions you can go. First, you can use deep learning to draw a map, in a sense, of grid state space — grid space. In the limit, this solves program synthesis: you take your initial input grid and embed it on your manifold, you take the output grid and embed it, and then you draw a line between the two points on your manifold and look at the grids it interpolates. This gives you, approximately, the series of transformations to go from input to output. You still have to do local search around them, of course, because this is fast but very approximate — it may not be correct — but it is a very good starting point; you are turning program synthesis into a pure interpolation problem. The other direction you can go is program embedding: you use deep learning to draw a map of program space this time, instead of grid space, and you use this map to generate discrete programs and make your search process more efficient.
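In its simplest form, that second direction reduces to a propose-and-verify loop: some learned model proposes candidate programs, and a cheap exact check against the demonstration pairs filters them. A sketch, where `propose_programs` stands in for whatever learned proposer you have — an LLM, a trained sketch generator, anything; it is a placeholder of mine, not a real API:

```python
from typing import Callable, Iterable

Grid = list[list[int]]
Program = Callable[[Grid], Grid]

def verify(program: Program, train_pairs: list[tuple[Grid, Grid]]) -> bool:
    """Binary correctness check: the program must reproduce every demonstration."""
    try:
        return all(program(inp) == out for inp, out in train_pairs)
    except Exception:
        return False  # a crashing candidate is just a wrong candidate

def solve_task(train_pairs, test_inputs,
               propose_programs: Callable[..., Iterable[Program]],
               budget: int = 1000):
    """Ask the learned proposer for candidate programs, keep the first one that
    passes verification, and apply it to the test inputs."""
    for i, candidate in enumerate(propose_programs(train_pairs)):
        if i >= budget:
            break
        if verify(candidate, train_pairs):
            return [candidate(inp) for inp in test_inputs]
    return None  # no verified program found within budget
```

The learned component only has to make the right program likely somewhere in its first few hundred guesses; the verifier supplies the exactness that the neural model lacks. This is essentially the shape of the pipelines described next.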
A very cool example of how you can combine LLMs with discrete program search is the paper "Hypothesis Search: Inductive Reasoning with Language Models". It uses an LLM to first generate a number of hypotheses about a task in natural language, and then it uses another LLM to implement candidate programs corresponding to each hypothesis in Python. By doing this, they get roughly a 2x improvement on ARC-AGI, so that is very promising. Another very concrete example that is also very promising is the submission from Ryan Greenblatt on the ARC-AGI public leaderboard. He uses a very sophisticated prompting pipeline based on GPT-4o, where GPT-4o is used to generate candidate Python programs to solve ARC tasks, and then there is an external verifier; he generates thousands of candidate programs per task, and he also has a way to refine programs that seem to be doing something close to what you want. This scores 42% on the public leaderboard, and that is the current state of the art there.

So again, here is where we are: we know that LLMs fall short of AGI. They are great at System 1 thinking, but they lack System 2, and meanwhile progress towards System 2 has stalled — progress towards AGI has stalled. The limitations we are dealing with in LLMs are still exactly the same ones we were dealing with five years ago, and we need new ideas, we need new breakthroughs. My bet is that the next breakthrough will likely come from outside, while all the big labs are busy training bigger LLMs. Maybe it could even be someone in this room — maybe you have the new ideas I want to see. So, see you on the ARC-AGI leaderboard. Thank you.