For Chollet, he views a difference between skill at some task and intelligence. Intelligence, for Chollet, is the efficiency with which you can take whatever information you're starting off with, so this can include prior knowledge or just your data, and turn that information into new skills that can perform well in some scope of tasks. In fact, I remember it was OpenAI Five, there was this AI system that could beat the world champions, and since I'd been playing for years I sort of knew the skill required to play at the highest level, so that was really mind-blowing. But interestingly enough, several years later I look back and I don't find the Dota system actually that impressive, just because it was trained on like 45,000 years' worth of data or something, so it's very inefficient in the data it's using to acquire those skills. Which interestingly links to exactly what I'm working on now, which is how we can get closer to this sort of human intelligence of efficiently being able to use data to acquire new skills.

My name is Alessandro Palmarini. I finished my undergraduate degree last year at the University of Edinburgh, and it's sort of an artificial intelligence and computer science degree. I knew towards third year, fourth year, and especially doing my undergraduate project, which I guess we'll talk about later, that I loved research, I loved working on these sorts of problems, but I didn't know exactly what I wanted to work on. I was sort of split between two different directions, so I didn't end up applying for anything. Following my undergraduate I had just finished, had nothing to do, and began exploring, reading a lot, trying to figure out which problems I wanted to work on most. That is when I stumbled across François Chollet's On the Measure of Intelligence, probably first through the podcast you did with him, and reading that paper was very inspiring. I began working on that problem for about a month, and it was during that time that I saw a lot of overlap between some of the ideas I was having and work that Melanie Mitchell at the Santa Fe Institute, who you also had on the podcast, was doing several decades ago. So I ended up reaching out to Melanie, basically saying that I have nothing planned, is there any chance I can come over and work with you as an intern or something? And amazingly enough, that all worked out and she was able to fund me at the Institute, where I am now, working as what's called a post-bac, so I've not done a PhD; it's in between undergraduate and PhD, doing research.

I'm incredibly jealous. You get to work with Melanie Mitchell; I think I look up to her more than anyone else in the entire space, so you're incredibly lucky. I'm extremely lucky, yes, Melanie is amazing. Yeah, she is absolutely... I mean, she did a book on complexity and then one on a history of AI, I think it's Artificial Intelligence: A Guide for Thinking Machines. Melanie is absolutely brilliant. So you're working at the Santa Fe Institute? Yes. Tell us about that. Sure, so the Santa Fe Institute is a small research institute in Santa Fe. There are not many people working on computer science and the stuff that Melanie does; they're actually very broad. I'd say the high-level theme is complexity science and complex systems, but you have people working in all different areas, like theoretical
physics, biology, origins of life; there are some historians, sociologists. There's sort of a small core of like 12 or 15 resident faculty members, a similar number of postdocs, a post-bac, and then lots of external faculty members who are constantly coming in and visiting the Institute, so you always have a very wide range of people there, which is great to interact with. A lot of mathematicians as well? Yes. How is the culture there different from any other type of research institution? One sort of unique thing is they don't have departments, and the resident faculty, I think they're not even allowed to have groups and their own labs, so the allegiance is to the whole Institute, maybe to stop the specialization, because solving some of the problems they're interested in at the Institute requires this sort of cross-disciplinary culture.

Yeah, so before my undergraduate, actually, I didn't know I had this interest in AI and computer science, because I was actually a professional gamer. In fact, I remember it was OpenAI Five, do you know that, the Dota 2 bot? I played a MOBA just like Dota, and so I remember being just mind-blown to see there was this AI system that could beat the world champions, because, since I'd been playing for years, I sort of knew the skill required to play at the highest level, so that was really mind-blowing. But interestingly enough, several years later I look back and I don't find the Dota 2 system actually that impressive, just because it was trained on like 45,000 years' worth of data or something, so it's very inefficient in the data it's using to acquire those skills. Which interestingly links to exactly what I'm working on now, which is how we can get closer to this sort of human intelligence of efficiently being able to use data to acquire new skills.

Yes, well, there's a really interesting thing here, because certainly my friend Dan, what he does as a commentator is analysis, and I'm sure what the best gamers do is this kind of analysis, which is to say they construct skill programs. So given a game, it's quite situational, but some of the skill programs generalize, and you're kind of creating these powerful heuristics that you can use to gain alpha, if you like, to gain advantage over everyone else. And I can imagine how that experience kind of seeded a fascination with the process of creating the skill program. I think absolutely, yes. I was always fascinated by the mind and how we can actually learn these things to do better than others, so I would say, yeah, it definitely probably influenced my interest, my change of interest into AI.

Yeah, so what's the difference between what you did as a human and something like the Dota algorithm? So the main difference, and this is mainly Chollet's ideas, which I agree with, is his view of intelligence as being skill-acquisition efficiency rather than skill. People have a tendency to confuse skill, the output at some task, versus intelligence, which is the process of acquiring the skills at that task. So the Dota 2 bot is insanely skilled: it has a lot of skill at some frozen subset of the game. But then as soon as you introduce new changes, that requires
basically creating new skills, and that's where it completely breaks down, because it's missing that intelligence part of being able to, as humans would, see a few changes, like a new hero gets introduced, and flexibly adapt to that and create the new skills to use that hero. Whereas the Dota 2 bot can't do that, because it was trained with massive amounts of data on a particular subset of the game, acquired that skill, and it's sort of frozen there. Yeah, so a big difference between skill and intelligence.

I mean, can you give the audience a refresher on Chollet's measure of intelligence? So for Chollet, he views a difference between skill at some task and intelligence. Intelligence, for Chollet, is the efficiency with which you can take whatever information you're starting off with, so this can include prior knowledge, or the experience, so training time, for example practicing at a game, or just your data, and turn that information into new skills that can perform well in some scope of tasks. So you can always arbitrarily buy more performance, buy more skill at some task, by just adding more prior knowledge or more experience. Like the world's best chess engines: they don't really leverage lots of experience, but they have loads of baked-in prior knowledge, and so they struggle to generalize if you were to change the rules or create new tests. On the other end you have systems like OpenAI Five that don't have much prior knowledge but require huge amounts of experience in order to learn that skill. And he gives a good example: if you think of a locality-sensitive hash function and a hash table, this is already fully equipped to reach maximum skill and maximum performance at any task where you can sample unlimited data, as in any video game. You just get a new experience, you map that experience to what you should do in that situation, and then by just sampling more and more and more experience you can basically increase your skill, but you're not increasing your intelligence, because for Chollet intelligence is how efficiently you can acquire that skill.

Yes, and I guess he also makes the tangential argument that LLMs are glorified n-gram models, you know, they're like a database, right? Yeah, and the problem is it's so difficult to distinguish memorizing an instance of a thing versus doing the reasoning. Right, and I'm pretty sure I've heard Chollet say this, humans are definitely doing memorization as well; I don't think the two are completely separated. Just from my own experience as a gamer, there is stuff that I would practice basically over and over and over again, so it just becomes: I see that situation and I can just execute. But the game is also full of situations that I've never seen before, and you sort of have to adapt on the fly based on your existing building blocks that you're assembling into new skill programs. Yeah, because there are so many psychological biases that seem to prove that out, you know, like social proof and authority. We use all of these heuristics where we just want to be economical, we don't want to do the reasoning, we just do what other people do, but we do still have the ability to be creative when we need to. Absolutely, yeah.
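As an aside, here is a minimal sketch of that hash-table point; it is my own illustration, not something from the conversation or from Chollet's paper. A memorization-only "agent" like this can reach arbitrary skill on situations it has already seen, given unlimited experience, while acquiring no ability at all on situations it has not.

```python
class LookupTableAgent:
    """Pure memorization: maximal skill on seen situations, zero generalization."""

    def __init__(self):
        self.table = {}  # situation -> (best known action, its reward)

    def learn(self, situation, action, reward):
        # With unlimited data this converges to the best recorded action per situation.
        best = self.table.get(situation)
        if best is None or reward > best[1]:
            self.table[situation] = (action, reward)

    def act(self, situation):
        if situation in self.table:
            return self.table[situation][0]
        # Nothing transfers to unseen situations: skill without intelligence
        # in Chollet's sense of skill-acquisition efficiency.
        raise KeyError(f"unseen situation {situation!r}: no skill acquired here")
```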
So Chollet didn't just write the measure of intelligence, he actually wanted to have an actionable benchmark for it, which is the ARC challenge. Can you tell the audience about that? Sure, so ARC is Chollet's answer to how we can go about measuring this aspect of intelligence rather than skill, as many AI benchmarks are set up to just measure skill and performance. For Chollet there's essentially no task where performing well at that task demonstrates you have intelligence, unless it is some sort of meta-task that is evaluating how efficiently you can acquire skills across a broad range of tasks within some scope, while also controlling for priors and experience. So ARC is made up of about a thousand tasks, and a single task in ARC contains a few input-output grid pairs, so you may have two or three examples. These are pixelated, colored grids demonstrating some underlying transformation. Each pair in the task is non-identical; they look very different at the surface level, but there's some common structure underlying them, and the job for the solver is essentially to induce what this underlying rule is, which is essentially the shared concept, and use that concept to transform a test input grid that has no output grid and generate the correct output grid. So that is a single ARC task. And Chollet assumes that the only knowledge needed to solve these ARC tasks is what Elizabeth Spelke would call core knowledge: basic things that are assumed to be either innate or learned very early on by humans. Things like objectness, that the world can be parsed into different objects with certain physical properties; basic numerosity, like counting and comparing, less than or greater than; basic topology and geometry; simple things like these. And then the experience is controlled, because you only get two or three examples to discover what this underlying rule is.

Very interesting. Yeah, so Chollet comes at this from a kind of nativist (in psychology) and rationalist perspective, which is that reasoning is the ability to create new truths or new knowledge about the world, but it doesn't come from nothing. Empiricists say that it's all derivable from experience; this school of thought is that we construct it compositionally from existing knowledge, and we do all of these series of deductions, if you like, to build new models compositionally, right? Now, you cited David Deutsch in your paper. Yes, tell us about that. Yeah, so David Deutsch might actually be a big reason that I am into research right now, because when I started my degree I definitely didn't have much of an interest in research; I was just more interested in AI systems and coding and programming and things like that. But reading his book The Beginning of Infinity, which I can highly recommend to everyone, got me much more interested in bigger, more fundamental questions, and especially theoretical computer science, and his view of epistemology, which draws from Karl Popper's epistemology, has definitely had a big influence on how I approach all these problems. So, for example, just looking at a single ARC task that has two or three examples, there's no way you can look at these examples and mechanistically derive the correct program.
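For readers who haven't seen ARC, here is a toy sketch of the task format just described. The grids are made up and far simpler than real ARC tasks; the real dataset uses the same train/test structure, with grids of color indices.

```python
# Illustrative ARC-style task (toy grids, not from the real dataset).
# The solver sees the "train" pairs, must induce the shared transformation,
# and then produce the output grid for the "test" input.
toy_task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 0], [0, 1]]},
        {"input": [[0, 2], [0, 0]], "output": [[0, 0], [2, 0]]},
    ],
    "test": [
        {"input": [[0, 0], [3, 0]]}  # the solver must predict the hidden output grid
    ],
}

def solve(grid):
    # One conjectured rule consistent with the train pairs: rotate the grid 180 degrees.
    return [row[::-1] for row in grid[::-1]]

assert all(solve(pair["input"]) == pair["output"] for pair in toy_task["train"])
```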
don’t think it would there there definitely will be parts of deduction but a lot of it is um induction it’s sort of like guessing like there’s no way a priority to know how to go from the examples to the program solution instead some form of conjecture or hypothesizing is needed just as in our scientific theories like a very useful slogan from Carl poer is that um like scient ific theories are not derived from observations they’re tested on observations so we will sort of conjecture these theories and then we will try and interpret the world and see if we can understand the world in terms of those theories and then there will be mistakes and those sort of mistakes um produce problems and we try and solve those problems by conjecturing more and then testing on those observations so I think it’s the exact same with these sort of Arc problems or any sort of computer science like program induction type problem um where translating it into computer science that form of like guessing uh is essentially search so basically this Bas this tells us that some form of search is needed um and the question is how do we go about doing that efficiently because we can somehow do that very efficiently yeah you you raised a very very important point so there’s this deductive closure so we can have all of this based knowledge and we can create these deductive traversals by composing them together but we would only be able to know for sure whether or not we can produce the test samples right we’re actually solving a much harder problem we’re trying to generalize you know into different instances of the problems we’re only given like three instances and that’s when it becomes an inductive problem and this is actually really really really difficult because we have these heuristics in our brain where we just kind of feel that this is probably the correct program maybe because it’s anthropocentric or because it’s very simple and we’re actually doing something which more resembles like Abduction or induction right yeah I I think so for sure it can be I mean one thing that David deuts will always say when people think that they are just looking at observations and the theory is sort of you know popping into their head based on the observation that it it’s actually just a misconception that really the idea is always there first and like without the idea first like we wouldn’t even know to interpret some observation on in in whichever way that we did um so yeah I I I think the idea is basically always being generated first and then we are testing it on the observations and the same applies to these program induction type problems very interesting well you you just said the magic word program induction so that there’s this whole space in computer science called inducted program synthesis and a lot of people in the ml space wouldn’t even have heard of this tell us about that sure so um Arc could be viewed as an example task uh program induction task but at a high level program induction is um you’re given some specification of a desirable program and this could come in many forms um and we want to like automatically generate a program that satisfies that criteria so the most common example is like programming by example where I give you a few input output pairs relating to a transformation and we want to automatically generate a program that can transform all the inputs into the respective outputs so essentially like we we give an example of the functionality or the behavior and we want to automatically generate a 
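As a minimal sketch of programming by example, here is a brute-force search over a tiny made-up DSL; the primitives and examples are illustrative, not DreamCoder's actual language. It simply enumerates compositions of primitives and returns the first one consistent with every given pair.

```python
from itertools import product

# Tiny hypothetical DSL: each primitive maps a list to a list.
PRIMITIVES = {
    "reverse": lambda xs: list(xs[::-1]),
    "sort": lambda xs: sorted(xs),
    "drop_first": lambda xs: list(xs[1:]),
}

def run(program, xs):
    # A "program" here is just a sequence of primitive names applied left to right.
    for name in program:
        xs = PRIMITIVES[name](xs)
    return xs

def induce(examples, max_depth=3):
    # Blind enumeration, shortest programs first.
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run(program, i) == o for i, o in examples):
                return program
    return None

examples = [([1, 2, 3], [3, 2, 1]), ([5, 9], [9, 5])]
print(induce(examples))  # ('reverse',)
```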
So outside of ARC, I could give you a list of numbers: [3, 2, 1] gets transformed to [1, 2, 3], [6, 8, 1] gets transformed to [1, 6, 8]; give me a program that can generalize beyond this, and you want to try and discover a sorting algorithm, for example.

Very cool. So where is... I mean, obviously I associate this with MIT, and Kevin Ellis and Josh Tenenbaum, so there's a big group of research there. Is that where the locus of the research is? A lot of the program-induction work, especially the Bayesian program learning, comes from Josh Tenenbaum's and Armando Solar-Lezama's groups at MIT, that's for sure, and the system that I've worked most closely with is called DreamCoder, and that was developed mainly by Kevin Ellis during his PhD in Josh and Armando's lab at MIT. Yes, amazing. So DreamCoder might be the coolest algorithm ever created. It is very cool. Can you, and as you say that was Kevin and Josh's group, can you give us the pitch for it?

So DreamCoder, as you said, is an inductive program synthesis system. As input you receive a large corpus of program-induction tasks in some domain, and DreamCoder wants to solve those tasks. All programs are built out of primitive operations and elements, and this can be a finite set of elements and operations, but the ways they can be combined to form new programs are infinite, and as you start looking at programs with more components, the number of programs with that many components grows exponentially. So we can think of it as a search tree, and the search tree is growing exponentially. Once we go beyond trivial cases, if we want to discover desirable program solutions, we need a way to simplify search. DreamCoder is an algorithm that, as it's learning and searching for these program solutions, is also going to learn how to simplify search for those program solutions, and it will do this in two stages. Broadly, there are two ways we can reduce or simplify the search problem in this big space of programs: one is by reducing the depth of search, and the other is by reducing the breadth of search.

DreamCoder will reduce the depth of search by creating new library components, and I will explain what that is. I should say, when I refer to a library, that's just our initial program operations and elements that we combine together to build new programs. For example, if we're coding in Python it would just be Python's basic grammar, essentially. So imagine we have a simple library that just contains the plus operator and the element one. Then we can have the program outputting one, we can have the program 1 + 1 outputting two, or 1 + 1 + 1 outputting three. This gives a program tree if we look at some program built out of this library. So, for example, we could take the program outputting three, 1 + 1 + 1, and collapse that into a single operation that we add back to our library. Now our library contains plus, one, and the program outputting three. If we then want to build the program outputting four, we can just do three + 1, which requires a depth of three, as opposed to our previous library, where if we wanted to create the program outputting four we'd have to do 1 + 1 + 1 + 1, a depth of seven.
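Here is a small, made-up illustration of that depth count (toy code, not DreamCoder's actual program representation): the same value needs a 7-node expression under the base library but only a 3-node expression once "three" has been chunked in.

```python
def size(expr):
    # Number of nodes in the expression tree, a rough proxy for search depth/cost.
    if not isinstance(expr, tuple):
        return 1
    return 1 + size(expr[1]) + size(expr[2])

def evaluate(expr, library):
    if isinstance(expr, str):       # a named (possibly chunked) component
        return library[expr]
    if isinstance(expr, int):       # a literal
        return expr
    _, left, right = expr           # the only operator in this toy library is "+"
    return evaluate(left, library) + evaluate(right, library)

four_before = ("+", ("+", ("+", 1, 1), 1), 1)            # 1 + 1 + 1 + 1
print(size(four_before), evaluate(four_before, {}))       # 7 4

library = {"three": 3}                                    # "three" chunked into the library
four_after = ("+", "three", 1)                            # three + 1
print(size(four_after), evaluate(four_after, library))    # 3 4
```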
So essentially, by collapsing program components and adding them to our library, we can reduce the depth needed to reach behaviorally equivalent programs. That's what's referred to as chunking: we're taking program components, chunking them, and adding them to our library. As DreamCoder starts solving problems, so let's say I give you a hundred tasks, initially you have no knowledge about the domain, so all you can do is search blindly. Let's say you happen to solve ten of the tasks by just randomly searching. Once you have those solutions, DreamCoder will look across the solutions for common functionality, sort of the most expressive and common functionality, and it will chunk that into program components and add them to the library. This is essentially compressing, allowing the algorithm to express its program solutions in fewer components, which of course reduces the depth needed if you were to search for those programs again.

It will then reduce the breadth of search by training a neural search policy, so it's basically going to learn a heuristic for how to navigate the search space. It will take its program solutions and train a neural network, which parameterizes a distribution, to infer, given a task, what are the likely programs to solve that task. By putting a distribution over library components you're essentially down-weighting certain trajectories in your search space, and this corresponds to reducing the breadth of search. So, as an overview: it will start searching for program solutions; as it discovers some program solutions, it's going to look for common functionality, chunk that functionality, and add it to the library, compressing those solutions; it will then train a neural search policy to use that library and infer, given a task, what are likely programs; and once it's trained that neural search policy it's going to start searching for more solutions. Hopefully this leads to finding more solutions, and then it can take that larger set of examples and do the same again: chunk more functionality, retrain the neural search policy, and iteratively solve more and more tasks.

So, two quick questions on that. You said that there's a procedure for looking for common functions, and also there's this program-search guiding thing which basically says, I think this program is good at solving this task, and that must be done heuristically. So how can you statistically tell whether a program or a function does the thing you think it's going to do? If we have a single program and we want to determine whether it solves our task, all we need to do is run the program on each of our given inputs and see if we get the desirable output. Of course, if we assume that there is an underlying program, then there is a corresponding output grid that we can't see for a test-input-only grid, and we may have a program that transforms all our training input-output pairs correctly but doesn't transform the test input grid into the actual output grid. If we never receive that output grid, there's no way to tell; if we have some magical tester who's keeping it hidden, then they'll be able to tell us whether or not we have the right program, and if we don't, then we would just have to keep searching for a new one.
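The verification step being described is essentially just this check; the code below is my paraphrase, with `run` standing in for whatever interpreter executes a candidate program.

```python
def verified(program, examples, run):
    """True iff the candidate program reproduces every given input/output pair."""
    return all(run(program, x) == y for x, y in examples)

# Toy usage: here a "program" is just a Python callable and `run` applies it.
# Any hidden test output is not in `examples`, so passing this check never
# guarantees that the program generalizes to it.
examples = [([3, 2, 1], [1, 2, 3]), ([6, 8, 1], [1, 6, 8])]
print(verified(sorted, examples, run=lambda prog, xs: list(prog(xs))))  # True
```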
Does that answer your question? That makes sense. So there's one stage where we're actually doing verification: we're actually running the program and checking whether it works on the examples we're given. Yes. And in this neural-guided search, isn't there a statistical component, which is that we think we should search in this direction more than that direction, and that's done with a neural network? How does that work, and is it brittle? So the neural network is essentially parameterizing a distribution, a posterior distribution over programs given a task, and this works by putting a probability distribution over library components, and you can start composing programs that way. DreamCoder will actually search by enumerating programs in decreasing likelihood under this distribution, and it's going to keep searching through programs until it hits one that meets the criteria we just talked about. But yeah, it's a useful point to say that if we have a probability distribution over programs given tasks, then we essentially need a new distribution for every single task: every single task has a separate distribution. By using the neural network to map tasks to distributions over programs, which is what's called amortized inference, we're using one function, one set of parameters, to specify essentially an infinite number of distributions: given any task, it's going to spit out a distribution over programs for that task. And this will be useful for what we'll talk about later.

Yes, I guess it's really interesting, because neural networks are really good at generating ideas, and there's this interesting symbiotic relationship between generating ideas, which of course has to be sensitive enough to include the correct thing, and then doing verification as well, right? Yes, so we're generating ideas and then we're selecting with the verification. Right, and the verification is very important here, because we're not just randomly sampling, and this is a big difference between DreamCoder's wake-sleep algorithm, as they call it, and traditional wake-sleep, which will also use this posterior distribution but is just going to sample the distribution once. In our case the latent variables are these programs that we can't see, and in traditional wake-sleep they'll just sample from the posterior, get a latent sample, and use that to train their model, whereas in DreamCoder we're not just sampling, we're actually going to enumerate and verify that we have a correct solution.

Very, very interesting. So let's talk about chunking, because I'm thinking that there are different approaches to doing this. You could chunk quite greedily, and you might end up with lots of chunks that hardly ever get used, or you could be quite stingy about it and only chunk if you think it's going to get used a lot. What are the trade-offs there? Yeah, this is a very good question, because there's a big trade-off. If we just want to best compress our program solutions, then we should just chunk all of those programs and add them to our library, because now we can express each of those programs in just a single component rather than multiple. But if we think of it in terms of a search view, where we want to simplify search for discovering more program solutions to solve unseen tasks, then every time we chunk a component we have a new component in our
library, and so we are increasing the breadth of our search. So there's a strong interplay between successfully reducing the depth of search and successfully reducing the breadth of search; they're highly reliant on each other. If we go back to our example of the library containing plus and one, and we chunk the program outputting three and add it to our library, but our neural search policy doesn't care about using that program outputting three, it still just cares about plus and one, then essentially we've not done anything in terms of reducing the depth of search, because it's just going to ignore the three component all the time. So in order to successfully reduce the depth of search, ideally chunking complements how we are going about reducing the breadth of search. And you're just kind of trading off those two things, because there's a kind of aspect ratio, you know, there's the breadth and the depth. Do you think a system is more intelligent if it has more depth and it's composing lots of things together, or do you think another approach would be just to have a lot of breadth and kind of memorize many different heuristics? What are the trade-offs? Right, that's a very good question. This is something that I've not thought much about, but it seems like if you have a lot of breadth, so you have these certain building blocks, these certain program components, and some of these may be chunked from existing solutions that we know work, then if we have a lot of breadth but we're not going very far, it means we're making small variations on the programs. We're adding a small number of things, which might allow us to generalize to similar cases that are maybe novel, that we've never seen before, but still not requiring much generalization at all. Whereas it's really the depth, going far away from what you currently know, that will allow for handling novel situations that require these very novel and creative ideas. So, yeah, that's interesting. Yeah, I mean, it's a similar thing in neural networks, I guess. You can think of it both ways, because in some sense, as you go through the layers of a neural network, it's learning more complex features, you know, like high-frequency features and so on, but you could also view it in terms of reuse, kind of like composition. If you look at the trajectory of the activations in the neural network, it's kind of like it's using compositions of representations, which make it generalize more, right? Yeah, that's a good point.

So you've written this paper called Bayesian Program Learning by Decompiling Amortized Knowledge. You started with DreamCoder and you've added a very important modification to it. Yes, tell us about that. We are using the exact same DreamCoder system that I've just described, so it's going to operate in these three sorts of stages: you're searching for program solutions, you're going to chunk program components, these modules, and add them to your library, and then you're going to train a neural search policy to use that library and put a distribution over how you want to guide search. These three stages bootstrap off each other and you can iteratively learn to solve more problems. What we do differently is, rather than chunk program components based on what best compresses our current solutions, we're going to take a
different approach. Intuitively, at a high level, the approach is: given that we have this neural network that's guiding search, so essentially it's the one deciding how to go about composing program solutions, why not leverage the knowledge it has learned, the patterns about which program solutions are likely to solve a given task, to decide what to chunk? So we're going to look at the functionality that it wants to use most often for creating program solutions, and we're hoping that it has learned some generalizable patterns that might be helpful for unsolved tasks, and asking whether we can somehow extract the useful program components it wants to use and add them to the library, so that it can be retrained with the new primitive chunks, allowing it to get to the program solutions it wants faster. This ties back into the interplay between chunking to reduce depth in a way that complements how we are going about using those chunks to reduce the breadth of search. And we call the method decompiling because a decompiler will take low-level machine code and try to reverse-engineer what the high-level source code might have been that got compiled into that machine code. Here, when we're training a neural search policy, we are essentially compiling useful information for composing program solutions into the weights of the neural network, and we want to extract from those weights useful high-level program components that we can then chunk and add to the library.

Very cool. So I'm going to show a graphic on the screen now, which is the figure from your paper. You're showing this positive feedback loop; can you explain this figure? Sure, so that figure describes the three main stages of DreamCoder, and it contains two parts. The first part is just an overview of pure DreamCoder, the black arrows, and it starts with searching for program solutions. As you discover more program solutions, you can then infer what would be useful functionality across those program components that you could chunk and add to your library. Once you do this, you can retrain a neural search policy to use those program components. And another interesting part of DreamCoder that we haven't described yet is that when they're training the neural search policy, they won't just use program solutions that were discovered while searching, in what they call the wake phase, for actual solutions to the tasks they were provided; they will also use the current library, and the prior distribution over that library, to generate fantasy examples. So they'll randomly sample these fantasy programs, run them on certain inputs, and this creates a new task. Now you get program-task pairs that were not observed in the real world but were created by the system itself, and they'll train the neural search policy on those fantasy samples as well as the real samples. So when we chunk and update our library, we're essentially updating what they call the generative model, and this allows us to then train a better neural search policy. When we train the search policy, we're training it to infer the hidden programs given the task, and then we will use this library and search policy to go back to that first stage and search for more program solutions.
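To tie the pieces together, here is a runnable caricature of that loop, extending the toy list DSL from the earlier sketches. Everything here is made up for illustration: the "recognition model" is just a unigram weighting over library components rather than a task-conditioned neural network, and fantasy sampling is omitted.

```python
import math
from collections import Counter
from itertools import product

BASE_LIBRARY = {"reverse": lambda xs: list(xs[::-1]),
                "sort": lambda xs: sorted(xs),
                "drop_first": lambda xs: list(xs[1:])}

def run(program, xs, library):
    for name in program:
        xs = library[name](xs)
    return xs

def guided_search(examples, library, weights, max_depth=2):
    # Enumerate programs in decreasing likelihood under the unigram weights and
    # return the first one that reproduces every input/output example.
    scored = sorted(
        (-sum(math.log(weights[name]) for name in prog), prog)
        for depth in range(1, max_depth + 1)
        for prog in product(library, repeat=depth)
    )
    for _, prog in scored:
        if all(run(prog, i, library) == o for i, o in examples):
            return prog
    return None

def iterate(tasks, library, rounds=2):
    weights = {name: 1.0 / len(library) for name in library}
    solutions = {}
    for _ in range(rounds):
        for idx, examples in enumerate(tasks):                  # wake: guided search
            program = guided_search(examples, library, weights)
            if program is not None:
                solutions[idx] = program
        pair_counts = Counter(p for prog in solutions.values()  # abstraction: chunk the
                              for p in zip(prog, prog[1:]))     # most common sub-sequence
        if pair_counts:
            (a, b), _ = pair_counts.most_common(1)[0]
            library[f"{a}+{b}"] = lambda xs, a=a, b=b: library[b](library[a](xs))
        counts = Counter(name for prog in solutions.values() for name in prog)
        total = sum(counts.values()) + len(library)              # re-weight the guide
        weights = {name: (counts[name] + 1) / total for name in library}
    return solutions, library

tasks = [[([3, 1, 2], [3, 2, 1]), ([5, 4, 9], [9, 5, 4])],       # "sort then reverse"
         [([2, 7, 1], [7, 2, 1]), ([8, 3, 6], [8, 6, 3])]]
solutions, library = iterate(tasks, dict(BASE_LIBRARY))
print(solutions)      # e.g. {0: ('sort', 'reverse'), 1: ('sort', 'reverse')}
print(list(library))  # base primitives plus the chunked 'sort+reverse' component
```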
So this will help us find more program solutions; more program solutions help us make better chunks; the better chunks allow us to generate better fantasy examples, which allows us to train a better neural search policy; and we can iteratively repeat this procedure. The other part of that diagram is the orange arrow, which says that in the decompiling work we're not just going to take what was given by the wake phase, when we were searching for program solutions; we're actually going to use the recognition model that was guiding search to help us decide what to chunk in order to reduce the depth of search.

Yes, and to do that there's a ranking function, right? And that's a really important thing in your paper, because you approached it in a couple of ways. Can you tell us about that? So we have this high-level idea that we want to somehow extract the useful functionality that the recognition model wants to use. A very simple approach is to just look at which functions, which are just subprograms, the recognition model wants to generate with the highest probability, averaged across all tasks. That's one simple way to do it. A problem with this approach is, one, it only provides a ranking, so it can tell us that function A is preferable to function B, but it doesn't tell us how many functions would be useful to chunk into our library to help the recognition model. The second problem is that larger functions are naturally harder to generate than smaller functions. As an example, if we had one function f, and one sub-function that is part of f, the larger function will always have lower probability than the smaller function, because we have to generate everything in the smaller function plus more components. But if the recognition model was only ever intending to generate the larger function, then it would be more useful to chunk the larger function than the sub-function, because then we collapse more components, and the sub-function has no use elsewhere. So these are two problems with that simple approach, which we try to address in our other approach to decompiling the recognition model.

Tell us about that. Sure, this is the one with the Bernoulli distribution. Yes, this is the one with the Bernoulli distribution. What is a Bernoulli distribution? So a Bernoulli distribution is like a coin flip: we have a binary random variable, it can appear in one state or another, and we have a split distribution over these two states that adds up to one. Our goal is to decide: if I give you some function, and we chunk this function, will the overall benefit for the recognition model, the neural search policy, in generating program solutions across all tasks increase or not? So will the net benefit be positive or not? Because of course we can chunk something, and if we allocate some probability to this functionality and it's not actually useful for solving our programs, then we're actually going to make it harder to generate our program solutions. So our Bernoulli distribution is over this variable we're constructing, of whether the net benefit will be positive or not. And then it's a question of, well, how do we calculate this, and there's no way to calculate it exactly without actually going ahead and chunking a component, retraining the neural search policy, and checking whether it is helpful or not. So
essentially we want to express this uncertainty with a probabilistic model, systematically saying whether or not chunking some function could be helpful. That's essentially our criterion, and it solves the first problem we had, of only giving a ranking, because once you have this distribution you can start using decision theory, assuming you have some utility function, and basically only chunk if the expected value is positive. Although in the paper we just use the simplest criterion, it's enough to show the strength of the approach; we don't actually provide a utility function, but in principle future work could solve this problem. The second problem was that we don't want to chunk based only on generation preference, because we had that example of larger functions being harder to generate; we actually want to capture the notion of what would be useful for the recognition model in generating all programs, and how well we solve that depends on how well the probabilistic model approximates that true distribution.

Yeah, and tell us about the caching scheme. If we take this distribution of the probability of a function being useful to chunk for the recognition model, we can simplify the problem by introducing tasks and programs and marginalizing them out, and then, just applying basic laws of probability, we can simplify it to needing a model of how useful it would be to chunk some function for generating one particular program on one particular task. This is much easier than what we had before, but we still can't calculate it exactly without chunking and retraining on all tasks, so we approximate this distribution with what we call a caching measure. We use "caching" to be distinct from "chunking": for us, caching means we have a recognition model, some probability distribution over programs, and we're asking how useful it would be for this distribution to generate a particular program given that it didn't have to generate a particular sub-function at all. So we're saying, let's remove the need to sample a particular sub-function: how much benefit was gained for generating that full program? And we do this with the introduced caching measure, which intuitively is just capturing the relative proportion of sample difficulty removed from generating the full program when you cache that function. In information theory you have this notion of surprise, which is the negative log probability of a random variable: the lower the probability of some variable being generated, the higher the negative log probability, so the higher the surprise. We could also use the word difficulty: if we assume this distribution is generating these variables, then the higher the negative log probability, the higher the difficulty. And so the caching measure is actually very simple: it's just the difficulty, or surprise, of the function over the difficulty of the full program. So we're just trying to capture the ratio of difficulty of generating the function as part of the full program.
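Written out, the measure just described is a ratio of surprisals. In my notation, which may not match the paper's exactly, with q the recognition model's distribution conditioned on a task t, f the candidate sub-function, and ρ the full program solution:

\[ \operatorname{cache}(f \mid \rho, t) \;=\; \frac{-\log q(f \mid t)}{-\log q(\rho \mid t)} \]

i.e. the proportion of the full program's generation difficulty that is removed when f no longer has to be generated.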
You want to maintain some surprise in the system, presumably. Because I'm looking at your posterior marginal over the programs, and, you know, maybe we won't use this bit, but obviously you've done all of this Bayesian analysis, you've computed this symbolic representation of this marginal, and this must be intractable to compute. What's the relationship between the Bayesian analysis, this lovely Bayesian representation of the problem that you write in the paper, versus how it's actually implemented? The actual distribution that I described earlier, how beneficial it would be to chunk some function, once you introduce all the programs and tasks and marginalize over them, is definitely intractable, so there are several approximations that have to be made to calculate it. One of them is using what's called a particle-based approximation, which is simply restricting the set of programs that you're marginalizing over; DreamCoder also uses this when approximating its objective. The other is just using Monte Carlo estimates, where we're using our empirical data as a sample estimate of the real distribution.

Very cool, very cool. Tell me about your results. So when we chunk according to this decompiling scheme, what we see is that we can increase the bootstrapping effects, so we can learn to solve more programs faster than DreamCoder. However, DreamCoder, in many of the domains, is able to catch up eventually. What this basically says is that by leveraging the knowledge learned by the neural recognition model, which is being trained on these fantasy samples as well as the real data, we can learn to make more useful chunks when we have very little and constrained experience, and this feeds back into the system, which allows it to solve more tasks. Then we start building a lead, and once we have more tasks we can learn to make even better chunks, solve more tasks, and generalize more. So performance increases faster and then eventually begins to plateau, and if you look at DreamCoder's performance, it won't increase as fast but will eventually catch up, and there are some domains where they both do very similarly. Yeah, Chollet is smiling, because Chollet is saying your knowledge-acquisition efficiency is better than DreamCoder's. Yes, in some of these domains, yes. But then the question is, why does it plateau? Well, eventually it gets too hard to solve the tasks, and it seems like no additional chunks we add, and no further training of the neural recognition model, is helping us get to the parts of the search space where these missing program solutions are. I don't know. I mean, there's also the question of, and I don't think this is the case in DreamCoder, but when we chunk and add program components to our library, we are not changing the possible program behaviors that we can construct: whatever program we create with our new library, we could always take any of our chunked pieces and explode them into the smaller pieces that were initially used to build them, so the set of possible programs is the same. What this means is, if you start with a library which, in the DreamCoder case and ours as well, is fixed, in that humans are deciding what the initial library is, then if any program solutions are outside of this program space, no matter how much chunking you do, you're never going to create a program space that contains those program solutions. So that could also always be a problem in terms of reaching a plateau where you can just never find program solutions, although, again, I don't think that's the case here.
That makes sense, that makes sense. Well, the question on everyone's mind, and you know I'm really excited about the ARC challenge, yes, and at the moment all of the people who are winning are using language models. And I mean, obviously I'm a big fan of Jack Cole and Ryan Greenblatt, you know, Jack's a great guy, I talk to them a lot, but I still feel that Jack's solution is against the spirit of the measure of intelligence, right? It's using these big language models, doing all this fine-tuning and so on, so I would like to see a solution like yours on the ARC challenge. Are you putting an entry in? I would hope to if I can, but the problem is that I'm very bad at estimating how long things will take me; it always takes me way longer than I think. So I've been working on an approach now since I finished my undergraduate, like one year ago, and that's what I've been doing mainly at the Santa Fe Institute, and things are starting to look very promising, but a lot of it is mainly in theory. I have implementations of earlier versions of the project, but then there were lots of problems, and I had to go off and try to solve them, so I'm still not actually at the stage of being able to implement these ideas; there are still a few big problems that I want to solve before I can get to it. But if I get to that, I'm definitely going to try to implement and release a submission when that comes, though whether that will be in time for this deadline, which I think is early November... Please, please, because I think you're going to be famous, man. If it works. If, I don't know. Are your questions aimed at an approach to ARC based directly on this? Because my approach, that's definitely where I started, but over the year it's evolved; it's not really this at all. Oh, interesting. So you've got a completely different approach, but it's still similar in principle, right? It's very similar in the sense that it's still doing this sort of combinatorial search, but it's no longer in program space. Chollet actually gives the example of deep learning and neural networks: the learning mechanism there, stochastic gradient descent, is very computationally efficient, you can get a big learning signal and move closer to the correct solution, but it's very data-inefficient, you require a huge sampling of the operating space to make it work. On the other hand you have program search, which is doing this combinatorial search, which is very data-efficient: you only need two or three examples and you can create a generalizable solution by searching the space and finding something that matches those examples and automatically generalizes. However, the search at the moment is very computationally inefficient; for non-trivial cases you have to search through loads of things. So I'm doing a similar sort of combinatorial search that will result in a transformation rule, but it's not searching directly in program space to get there, and it has these nice properties that allow the search mechanism to be more efficient, at least compared with program search and program induction, to essentially adapt to the feedback you're getting, prune large parts of the search space, and jump to corresponding places more easily. Amazing. Alessandro, we've run out of time, but this has been amazing. Thank you so much for coming on.
having me it’s been a pleasure talking good luck with the with the arc entry if everything goes well I can maybe come back to share more about it definitely definitely thanks so much man thank you [Music]