There's been a lot of attention on François Chollet's ARC challenge recently. The ARC challenge is around a thousand different tasks set in a 2D grid world, and they resemble intelligence tests, which is to say you get just a few examples and the goal is to extrapolate, or generalize, from those few examples. The reason Chollet invented this, back around 2017, was his genius recognition that there is a knowledge gap in deep learning models. Deep learning models are a little bit like an interpolative database, which means they generalize really well when they've seen lots of data about a particular problem. But when there is an explicit knowledge gap, you need to do reasoning, because reasoning is knowledge acquisition. So if the knowledge is not in the language model, then the language model isn't going to help you very much.

If you could collaborate with any AI researcher, who would it be and why?

Chollet. That's an easy one. I love Chollet. I'd love to hire him; if you're for hire, email me, although I can't pay nearly as much.

I'm really worried I pissed Chollet off. I hope I haven't, because I love him; he's still my favourite person. We did a thing on the Chinese room, and I got him to record a little fifteen-minute spiel about it. He was coming at it from the whole functionalism angle, you know, that consciousness kind of emerges when you have functional dynamics, and I said, "I'm going to give Chollet the benefit of the doubt." I should have edited that out, because it sounded like I was mansplaining to Chollet, which wasn't my intention. Probably about ten people messaged me, quoting, "I'm going to give Chollet the benefit of the doubt."

Oh okay. Yeah, I really should have edited that out. I love Chollet, certainly.

Anyway, so who would it be?

I have no idea. I don't really know what anyone is like to work with.

Would you work with LeCun?

Sure, yeah.

What would you work on?

Whatever they want; he probably has a team that does stuff, so I'm not too set on particular directions.

Okay. So the ARC tasks are discrete, they are combinatorial, and they're designed to be minimally interpolative. Now, personally I think they can still be interpolative; I mean, language is discrete. For me it's more about the fact that they are low-frequency data, so they haven't been sampled during training. The assumption behind a measure of intelligence is that if you can broadly expand your horizon from a few examples, normalized by the amount of priors and experience you have, then that's intelligence. François Chollet's measure of intelligence is technically non-computable. We won't go through the formulas now; by the way, we're interviewing Chollet very soon, so we'll save the foreplay for that episode and talk all about the measure of intelligence and the ARC challenge in more detail. But the basic formalism is non-computable, and what it intends to measure is generalization efficiency, or knowledge-acquisition efficiency. Something like ARC serves as a good proxy for intelligence because if we assume there was a knowledge gap, and we have an algorithm which fills in the gap, and we can make certain assumptions that the algorithm isn't cheating, because it hasn't seen the private set, so nothing is leaking from the private set, and that it isn't taking statistical or perceptual shortcuts, then we might make the assumption that the algorithm is intelligent. Now, ARC is a really difficult problem, because we only have a handful of examples per challenge. If you think about it, there are something like ten (the number of colours) to the power of 900 possible outputs for a given challenge, so this is an absolutely huge search space that we are traversing. Out of all of those possible answers, exactly one gives you the credit, which makes it a very sparse problem.
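To put a number on that sparsity, here is the arithmetic: ARC grids go up to 30x30, and each cell takes one of ten colours, so the output space for a maximal grid has 10^900 members, of which exactly one earns credit:

```python
# Size of the output space for an ARC grid: each cell takes one of 10
# colours, so an h-by-w grid has colours**(h*w) possible colourings.
def output_space_size(height: int, width: int, colours: int = 10) -> int:
    return colours ** (height * width)

# A 30x30 grid (the maximum ARC allows) already gives 10**900 candidate
# outputs, a number with 901 digits.
n = output_space_size(30, 30)
assert n == 10 ** 900
assert len(str(n)) == 901
```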
There's a really cool paper that introduces ARC, actually; it's called "Neural networks for abstraction and reasoning: Towards broad generalization in machines", by Mikel Bober-Irizar, and it talks about these Bongard problems, which were originally conceived in 1967. Humans can look at these things and recognize analogies: we do some kind of perceptual inference, and then we transfer the problem into System 2, which is the space of analogies. So this is really interesting. It's asking: can you identify the difference between the two sets, where each problem encapsulates a different concept? If you look at the concepts here, on the left and right hand sides of the vertical line, you can see the first one is playing on the concept of connectedness: the six figures on the left hand side are connected line drawings, and on the right hand side there's a gap, so they are disconnected. We look at that and say, ah, that's connectedness. And on the far right hand side there's the concept of "inside": the little square is inside the containing structure on the right hand side, and on the left side it's outside the containing structure, so we recognize this as the abstract concept of insideness.

Now, for me the ARC challenge is all about efficiency. What we need to do, essentially, is search a massive space of possibilities, but use intuition like we do, because when we look at an intelligence test we very quickly home in on the crux of the problem; we use these mentalistic shortcuts. We're doing the equivalent of searching a huge combinatorial space, where we compose together all of these different skill programs in a creative way, and we very quickly happen upon the answer. That is what intelligence is. Brute-force searching all the possible combinations of the skill programs we have is not intelligent. So a huge part of the ARC challenge is to have some notion of this huge library of subprograms and to search the space of combinations as efficiently as possible.

Now, an abstraction can be thought of as a commonly recurring subprogram. The thing we seem to do is start from a basis set of fundamental subprograms and compose them together, and when particular patterns or sets of compositions work really well, we consolidate them into a single program that gets embedded into our library, and over time our knowledge grows that way.

You can see here some examples of typical ARC challenges. It's a 2D grid world with ten possible colours. All of the ARC problems work on the basis of core knowledge. This is a really popular idea in psychology: we have this core knowledge, and the way we do reasoning is by recombining that core knowledge in a new situation. It's almost like conjecturing in mathematics, where we have all of these deductive rules about how the world works, which is like a generative model, so we can produce a huge tree of combinations of those rules, but we have this creative ability to heuristically combine them in interesting ways. The core knowledge used to generate the ARC challenges includes objectness priors, which is handling objects and their interactions, and a goal-directedness prior, since many of the tasks involve a general notion of intentionality: finding the simplest solution to a problem, for example drawing the shortest path through a maze rather than a longer one.

And here are some example ARC challenges. In the top one, you can see we have some kind of object with a hole in the middle which is inscribed by green squares. It's quite obvious, looking at this, that the goal is to insert a yellow square wherever green squares surround a hole. You can see there are one, two, three, four, five examples here, and then we've got the question mark, which is the test instance.
Clearly, what we would do here is fill in a yellow dot in all of the places where black cells are inscribed by green cells. Let's look at the next one. What you should notice, by the way, is that the size of the input and output grids can differ: on this one we have an 8x8 input and a 4x4 output. This next one is very challenging, so you do need to be quite smart to do these things. It transforms an 8x8 into a 4x4 by superposing the quadrants on top of each other in a certain order, and you can see it happens in a clockwise fashion starting from yellow, so the top-right quadrant always wins, which is why the top-right quadrant is perfectly expressed in the test example. So this is a pretty difficult one.

Let's look at the next one; I'm actually looking at this for the first time. In this one, when we have a red dot, we inscribe its four diagonal corners with yellow; when we have blue, we inscribe it vertically with orange; and we can see from the third instance that when we have a pink one, we don't inscribe anything. So on the test instance we would inscribe the red and the blue, but not the cyan or the pink.

The next example is quite straightforward: we have a mosaic pattern, and this is a denoising prior, so wherever we have black, kind of missing, cells, we inscribe the pattern of the mosaic into those cells. That one's quite simple.

Now, this is pretty cool. This part talks about the first winner of the original ARC challenge, which was hosted on Kaggle in 2020, this guy Johan Sokrates Wind. He implemented a domain-specific language, a DSL, with 142 hand-crafted functions on grids. These are functions that perform various transformations representing all of the things we were just talking about: symmetry, movement, denoising and so on. What he did was, from a particular instance, construct a DAG, a directed acyclic graph, basically think of it as a tree, and he greedily sampled all of the different transformations, I think doing a breadth-first search. Then he needed an evaluation function: for every sampled trajectory on this DAG, look at the Hamming distance or something similar to see how close it gets to the target, and then he presumably selected, say, the top few sets of transformations, and those became the programs used for the final prediction. I think that got around 20% on the hidden test set, something like that.

The paper also goes on to talk about a neurosymbolic approach with DreamCoder. DreamCoder has a waking phase, where it generates programs to solve tasks. An interesting thing here is that DreamCoder is bootstrapped with a domain-specific language known as a library, and the library contains a set of functions and values defined with a functional type system. This is really important: when you generate raw syntax, which is code, the space of valid syntax is much larger than the space of valid typed syntax, so having some kind of compiler or type checker allows you to dramatically cut down the space of programs you generate. You can see some examples here: there are types for integers and strings, various discrete sets of values, and different function mappings. In this particular case, the waking phase generates programs constrained by a minimum description length. This is very similar to Occam's razor: we should find programs that are as simple as possible, because in the natural world simple things seem to generalize and work much better than complex things.

DreamCoder then has the abstraction sleep, where you essentially look for recurring motifs in the generated programs.
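That search-plus-evaluation loop can be sketched minimally. Everything here is illustrative, not icecuber's actual code: a toy four-function DSL stands in for the real 142-function library, and the enumeration is exhaustive rather than his greedy DAG traversal:

```python
import numpy as np
from itertools import product

# Toy DSL of grid transformations (stand-ins for the 142 hand-crafted ones).
DSL = {
    "identity": lambda g: g,
    "flip_h":   lambda g: np.fliplr(g),
    "flip_v":   lambda g: np.flipud(g),
    "rot90":    lambda g: np.rot90(g),
}

def hamming(a, b):
    """Fraction of differing cells; a shape mismatch counts as maximally far."""
    if a.shape != b.shape:
        return 1.0
    return float(np.mean(a != b))

def search(train_pairs, depth=2):
    """Enumerate compositions of DSL functions up to `depth`, score each
    program by its mean Hamming distance over the training pairs, and
    return the best program (a tuple of function names) and its score."""
    best, best_score = None, float("inf")
    for d in range(1, depth + 1):
        for names in product(DSL, repeat=d):
            def run(g, names=names):
                for name in names:
                    g = DSL[name](g)
                return g
            score = sum(hamming(run(x), y) for x, y in train_pairs) / len(train_pairs)
            if score < best_score:
                best, best_score = names, score
    return best, best_score
```

On a pair whose output is the input rotated 180 degrees, this finds a zero-cost two-step program after enumerating only 4 + 16 candidates; the real search has to be greedy precisely because 142 functions at depth four is already around 400 million compositions.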
Those motifs then get condensed into primitive functions that can be composed out of the DSL in the waking phase to produce new programs. This is very similar to what we do when we sleep: we consolidate our knowledge, and that makes us better at reasoning the next day. There's also a dreaming sleep, where the program generation is guided by a neural network; we fine-tune that neural network, which means that when we generate new trajectories in the waking phase, the search is guided by what was previously successful. So we iterate through this waking phase and these two sleep phases to produce a neurally guided program search. DreamCoder is a really, really cool idea; it was by Kevin Ellis out of Josh Tenenbaum's team at MIT. We've never done an episode on it, but maybe we should, because I think it's a really interesting approach to program synthesis.

So what does a good solution to the ARC challenge look like? Well, to me it's all about the efficiency of reasoning: when we have a test instance, we need to search this big space, and we have to do it as efficiently as possible. The original winner was basically a Python program searching this space of combinations of all of these knowledge priors and selecting one that worked well, based on an evaluation function. So that's the first approach. But what about using language models? You can just get GPT-4 to generate a Python program, or some kind of intermediate DSL, which is what the DreamCoder approach did. When you use GPT-4 to generate the Python code directly, you get around 10%; that was a previous state of the art.

What the guys we're talking with today did, Jack Cole's team, which was the previous winning solution at 34%, was to fine-tune a language model. Before you do anything, you fine-tune it on augmented ARC task data, which has the effect of moving ARC-type problems into the language domain if you encode them in a certain way. Then you do test-time augmentation, or active inference: when you have a test instance, you do a bunch of augmentation around it, fine-tune the language model again, and then get the language model to generate the answer. They got about 34% with that. It's worth noting that they were limited in compute and were using an older, unimodal language model, so it's entirely possible their results will continue to improve.

Then this guy at Redwood Research, Ryan Greenblatt, put out a Substack article claiming 50% on the public test set of ARC, and he said he achieved this in six days. He writes: "I recently got 50% accuracy on the public test set for ARC-AGI by having GPT-4o generate a huge number of Python implementations of the transformation rule (around 8,000 per problem)." So essentially the idea is: we have the base GPT-4o, we send up the image of the problem, although he did say the vision model wasn't particularly good, though it is interesting to have a multimodal variant. He used a grid ASCII encoding to represent the ARC problem itself, and he did a bunch of prompt engineering, but of course you can't fine-tune GPT-4o. The interesting thing about his solution is that there's actually surprisingly little domain-specific tuning other than prompt engineering. He did say he did some feature extraction on the ARC problems, which is to say finding connected regions and describing in text how those connected regions relate to each other, but it's certainly much broader than the other solutions I've seen so far, which are very psychology-orientated and very hand-design-orientated. He said he uses few-shot prompts which perform meticulous step-by-step reasoning, and he used GPT-4o to try to revise some of the implementations after seeing what they actually output on the provided examples.
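The connected-region feature extraction he describes might look something like this. The exact format in the write-up differs; this just shows the idea of rendering a grid as raw rows plus a list of same-colour components:

```python
def connected_components(grid):
    """Find 4-connected regions of same-coloured, non-black cells and
    return them as (colour, sorted list of (row, col)) pairs."""
    h, w = len(grid), len(grid[0])
    seen, regions = set(), []
    for r in range(h):
        for c in range(w):
            if grid[r][c] == 0 or (r, c) in seen:
                continue
            colour, stack, cells = grid[r][c], [(r, c)], []
            seen.add((r, c))
            while stack:  # simple flood fill
                y, x = stack.pop()
                cells.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w \
                            and (ny, nx) not in seen and grid[ny][nx] == colour:
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            regions.append((colour, sorted(cells)))
    return regions

def describe(grid):
    """Render the grid as text the way a prompt might: raw digit rows plus
    a list of connected components with colour and cell coordinates."""
    lines = ["".join(str(v) for v in row) for row in grid]
    for colour, cells in connected_components(grid):
        lines.append(f"component colour={colour} cells={cells}")
    return "\n".join(lines)
```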
He did some feature engineering, providing the models with considerably better grid representations than the naive approach of just providing images. What's interesting about this is that it's a neurosymbolic approach: it's not just using language models, it's using the language model to generate a whole bunch of candidate programs. He said his approach is similar to AlphaCode, in which a model generates millions of completions attempting to solve a programming problem and then aggregates over them to determine what to submit. He's getting 50% with this main idea, and it took him six days of work.

The high-level method he used is: provide the ARC-AGI problem to GPT-4o, with an image representation and various text representations for each grid in the problem, the text representations showing which cells are occupied by different connected components of colours, and showing diffs between input and output in cases where the grid shapes are the same; instruct GPT-4o to reason about what the transformation is, reason about how to implement the transformation as code, and then finally implement the transformation in code. He uses few-shot prompts with carefully selected, handwritten examples of step-by-step reasoning to get GPT-4o to do this reasoning somewhat effectively; the resulting prompt is usually around 30k tokens, including images. Then, out of those 8,000 completions, he takes the 12 most promising ones and tries to fix each by showing GPT-4o what the program actually outputs on the examples and getting it to revise the code; he does around 3,000 revision completions per problem across those candidates, on top of the 8,000 original completions. Finally, three submissions are selected, based on a majority vote over the programs which get the examples correct.

Then he asks: what are the returns to more sampling? This is quite interesting. The more samples you take, in terms of original completions and refinements, there seems to be a scaling law where the top-3 accuracy just goes up and up. He's got a graph showing this scaling law, essentially plotting the relationship between top-3 accuracy and the number of completions. It's not entirely clear to me how he computed it; I think he's added up the completions for refinement and for original generation, but I guess it doesn't make a huge difference. He's got fit lines extrapolating that scaling outwards, and based on that extrapolation he suggests it would reach about 70% performance, which is roughly human level, at around 2^21 samples, so about two million programs, which is kind of interesting. And as you can see, the accuracy is significantly higher when you do refinements: with refinements you get significantly more accuracy even with significantly fewer completions.

He said the vision model was terrible on grids: it often fails to see the input correctly and states wrong facts about what colours are in some location and what shapes are present. He goes on to say that better visual recognition in future multimodal models might be a significant improvement to this approach; so he's saying that right now the multimodal aspect isn't actually a big determining factor in why this works. He said GPT-4o is not good at coding, especially not for the sort of geometric manipulation problems we're talking about here, and he also said he doesn't do multi-round debugging, because it's probably cheaper and more effective just to draw more samples in the current regime.
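That final selection step, keeping candidate programs that reproduce the training pairs and majority-voting over their test outputs, can be sketched like this (a simplification of what the post describes, with illustrative names):

```python
from collections import Counter

def select_submissions(programs, train_pairs, test_input, k=3):
    """From many candidate programs, keep those that get every training
    pair exactly right, run each survivor on the test input, and return
    the k most common outputs (a majority vote over program outputs)."""
    votes = Counter()
    for prog in programs:
        try:
            if all(prog(x) == y for x, y in train_pairs):
                votes[str(prog(test_input))] += 1  # hash outputs by repr
        except Exception:
            continue  # generated code often just crashes; skip it
    return [out for out, _ in votes.most_common(k)]
```

The try/except matters in practice: most sampled programs either crash or fail the training pairs, and the vote is taken only over the survivors.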
He also says that GPT-4o gets worse with long context, degrading significantly after 32k tokens, and this affects the utility and generalization of the method: for problems that can't be encapsulated in a 32k context, presumably it wouldn't work very well. He was quite candid in saying that he used about a thousand times more runtime compute per problem than prior work on this benchmark, so maybe Jack Cole's approach could have scaled better with access to that amount of compute. He also said the submission is ineligible for the ARC-AGI prize and the main leaderboard, as it uses a closed-source model and too much runtime compute. And I think that, in principle, it does go against the spirit of the ARC challenge, because all he's done is use the memorization power of the underlying language model to generate a whole bunch of programs, and then done a brute-force search, with an evaluation function, over the generated programs. So you could argue it's not really reasoning efficiency in the spirit of Chollet's metric.

He goes on to predict a 70% probability that a team of three top research ML engineers, with fine-tuning access to GPT-4o and a year of time, could use that model to surpass typical human performance on ARC-AGI. He also reckons that human performance is actually around 70%, not 85%.

Another big thing he talks about is the definition of in-context learning. Chollet, of course, says there is no in-context learning: language models are databases, and what you're doing is retrieving a program based on the prompt. Ryan is making the argument that these models are doing in-context learning. Chollet has said that if it were right that LLMs can do in-context learning, LLMs would do really well on ARC puzzles, because ARC puzzles are not complex: each one requires very little knowledge and is very low on the complexity scale. You don't even need to think very hard about them; they're extremely obvious for humans, even children can do them. But LLMs cannot, even LLMs that have a hundred thousand times more knowledge than you. The only thing that makes ARC special is that it was designed with the intent to resist memorization. ARC challenges are designed to be representations of problems which the language model does not know about, so they are an exemplar of this knowledge gap.

So Ryan said that, based on his results, you have to reject one of the following claims. One: getting modest performance on ARC-AGI, say 50% plus, requires at least a little bit of runtime learning on each problem. Two: program selection over only a moderate number of programs, say 6,000, done the way he does it, doesn't count as learning in the way people typically think of learning. Three: current LLMs never learn at runtime, which is to say the in-context learning they can do isn't real learning. He concludes that claim three is false: he thinks LLMs can do some relevant learning when doing in-context learning, even though the overall performance of that learning is incredibly weak, which is to say he has to draw 8,000 program completions; but he still thinks there is some learning nonetheless.

Anyway, let's not forget that the whole reason Chollet invented the ARC challenge was that he was exasperated that many solutions to Kaggle challenges were so overfit to the particular instance of the problem that they would not generalize to other instances of similar problems. Now, that doesn't really seem to be the case with these recent ARC winners. Even with Ryan's solution, I can see how, even though they require a prodigious amount of computation, they could be used for other similar classes of problems on similar types of geometry and domain; but they are certainly task-specific. The whole point of the measure of intelligence is developer-aware generalization.
That means the algorithm, the meta-learning system, can generalize to problems which the developer of the system was unaware of. This is still, in my opinion, somewhere between a specific problem and a domain, a collection of problems. I do think this approach could be used for solving fairly similar domains of similar problems, but it's definitely quite domain-specific, I would say.

Okay, so to sum up the solutions to the ARC challenge. There's the approach of, let's say, a Python program which does test-time augmentation, applying domain-specific priors and an evaluation function to find some reasoning pathway of transformations of the problem that produces the answer; that's the program-search type of thing. There's a neurally guided version of that, which is DreamCoder, where you use a neural network to guide the generation of programs, and the program generator can itself be a neural network or a DSL; that's very interesting. There's plain program generation with language models. And there's something like what Ryan has done here, which is generating many, many completions, evaluating them with something like a Hamming distance, refining, say, the top dozen of those with a language model, and then picking the top three. There are also the interesting solutions from Jack Cole and his team, who are going to be on the show later.

Another thing we'll talk about in the conversation later today is that language models have a certain type of prior, an inductive prior. If you think about reasoning, you have to think about where the reasoning happens, because the reasoning always happens somewhere. The reasoning could happen in the data-generating process: in a way, we're doing reasoning as a collective intelligence, right? We're a whole bunch of humans trying many different combinations of things, evaluating them, rinsing and repeating, and the things that work stick: they get embedded in our culture as social programs and in our language, and then we use them again. So we're doing reasoning in our native data-generating process. When we build data-generation algorithms for machine learning models, that's also a form of reasoning: essentially we're augmenting many different combinations of things, producing a dataset, and then memorizing it in a language model. The inductive priors in a language model are themselves a form of implicit reasoning: a self-attention transformer has this permutation symmetry, and essentially it's generating many symmetry transformations of your data and embedding them into the neural network, so that's a form of reasoning. Another form is inference-time reasoning, where you have a base model and you traverse it in many different ways with an evaluation function, and you can do reasoning there. You see where I'm going with this: there are many, many places in the predictive architecture where you can do reasoning, and all of these different approaches to ARC play with doing reasoning at different points in that process.

Now, one of the things I'm interested in seeing: self-attention transformers have this permutation symmetry, which works really well in many situations, but what's really difficult in a neural network is to mix different types of symmetry transformation in a single model, and that's what the GDL, the geometric deep learning blueprint, is all about. So it would be interesting to have a hybrid approach where you actually have different models with different inductive priors.
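That permutation symmetry is easy to verify: a self-attention layer with no positional encodings is permutation-equivariant, meaning that shuffling the input tokens shuffles the outputs in exactly the same way. A minimal numpy check:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no positional encodings."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # 5 tokens, 8 dimensions
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

perm = rng.permutation(5)
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Permuting the tokens permutes the outputs identically: equivariance.
assert np.allclose(out[perm], out_perm)
```

Positional encodings are precisely what break this symmetry and let the model care about token order.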
Priors representing different symmetries, which you then use in some kind of ensemble or combination to guide the program search. I don't think we've seen that yet; most of the approaches I've seen so far use a single model that respects a single type of symmetry transformation.

Another big thing we're going to see in the future is multimodal models. Ryan said in his Substack article that the vision capabilities of GPT-4o are poor, and I can imagine a future where multimodal models with very good visual comprehension actually improve the situation much more, so I think that will be a big source of improvement going forward. But it's quite interesting, right now, that we have reasoning diffused across many different steps of the predictive architecture. As the models become better and more general, they'll do more implicit reasoning, because the models are basically databases, interpolative databases; and the more implicit reasoning they do, the less reasoning we need to do pre and post hoc, which means the model itself is more intelligent. That's really exciting, because right now we have to generate thousands of completions, refine them and evaluate them, and all of the reasoning is post hoc; but as the models get better and better over time, the post-hoc reasoning can be compacted, because the models themselves will do more implicit reasoning. I think it'll take a very long time for that to happen, though.

Another thing is that reasoning is just difficult. Machine learning models have a finite capacity, which means there is a limit to the depth of reasoning they can memorize, because the models aren't doing reasoning on the fly; they're just memorizing reasoning. All of the reasoning in a model is implicit reasoning that was embedded into its weights during training; the models don't do any reasoning at inference time, it's a database lookup. So because the models have finite capacity, they are finite state automata, and there is a limit to the complexity of the reasoning pathways that can be crystallized into the model. That's why I'm a fan of active inference: I think the real world is more complex than anything we could ever memorize in a crystallized model. Every single situation has novelty and complexity, and the way we deal with that is by having a predictive architecture that leans into novel complexity by doing active inference. What that simply means is: I'm in a novel situation and I have to figure out what works in this situation, but I can still use the base knowledge in my language model; I can do some reasoning by searching through a whole bunch of primitives, and then I can bake the result into the model, which means that when I'm in an analogous situation in the future, I don't need to perform that reasoning again, because it's crystallized in the model. This process of doing reasoning in the moment and baking it into the model, for me, that's how we do it, and I think that's how we should build our predictive architectures.

So, yesterday evening I was interviewing the, at the time, worldwide winners of the ARC challenge, with their 35, sorry, 34% entry on the private set. As of the announcement this morning by Redwood Research, I guess they've technically lost their number-one spot, but anyway. It's Jack Cole, Michael Hodel and Mohamed Osman. Jack Cole has a company called Mindware Consulting, Inc., and he's also created a few successful apps that he says have over 30 million downloads. He has a PhD in clinical psychology and maintains a part-time psychotherapy practice; he says he's got a background in cognitive testing and neuropsychology, and he was inspired by Yannic Kilcher's original exposition of the ARC challenge. I think he hangs out a lot in Yannic Kilcher's Discord.
Also joining us this evening is Michael Hodel, who's worked very closely with Jack Cole and Mohamed, and he's got a paper out called "Addressing the Abstraction and Reasoning Corpus via Procedural Example Generation". This dataset generation work is really, really important; the entire team used it to fine-tune the language models, both at pre-training time and at inference time. In the paper, Michael talks about task generalization: essentially a procedural algorithm that uses all of this core knowledge to extend the ARC dataset, because in deep learning it's really important to have a lot of data, to have data density; otherwise the data is just sparse and the model won't learn generalizable patterns from it. So one of their approaches was to generate a much larger ARC dataset and fine-tune the base model on it, as well as doing test-time augmentation, and it was Michael's work in particular that the team used to get their winning solution. So Michael will be joining us today.

Also joining us is Mohamed Osman. He's a senior AI researcher with four years of experience developing deep learning models, with multiple successful real-world machine learning deployments under his belt, using transformers to tackle challenging problems in computer vision, reasoning and natural language processing. He's a really, really smart guy; I was very impressed with Mohamed, and he's looking for a PhD position at the moment. So if there are any university professors watching this who want a very smart, motivated guy to do a PhD under their wing, please reach out to Mohamed; his email address is in the video description. Mohamed said that he and Jack are hugely informed by this psychology approach, which is really interesting: that means core knowledge and hand-crafted cognitive architecture design.
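Coming back to the dataset work for a second: both Michael's generation and the team's test-time augmentation lean on the fact that ARC tasks are (largely) invariant under rotations, reflections and colour relabellings, so one task can be multiplied many times over. A minimal sketch of that augmentation idea, not his actual procedural generator, which builds examples from scratch:

```python
import random

def augment(task, rng=random):
    """Produce one augmented copy of an ARC task (a list of (input, output)
    grid pairs) by applying the SAME random rotation, reflection and colour
    relabelling to every grid in the task."""
    k = rng.randrange(4)       # number of clockwise quarter-turns
    flip = rng.random() < 0.5  # mirror horizontally or not
    perm = list(range(10))
    rng.shuffle(perm)          # bijective colour relabelling
    # (A real pipeline would usually keep black (0) fixed; skipped here.)

    def transform(grid):
        g = [row[:] for row in grid]
        for _ in range(k):     # rotate 90 degrees clockwise
            g = [list(r) for r in zip(*g[::-1])]
        if flip:
            g = [row[::-1] for row in g]
        return [[perm[v] for v in row] for row in g]

    return [(transform(x), transform(y)) for x, y in task]
```

Applying the same transform to inputs and outputs is what preserves the underlying rule, so each augmented copy is a legitimate new training example.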
cognitive architecture design you know like the the psychology approach to AI is basically um based on this idea that that we have a base set of fundamental reasoning Primitives and we kind of compose those together as part of our reasoning process and that that is the the core of of symbolic and neuros symbolic AI is is this idea that you know we we compose core knowledge into novel programs to suit the situation so anyway I had a really good conversation with Jack Muhammad and Michael um we spoke about their um their submission what their thought process was um I clearly challenged them a little bit on whether it is in the spirit of the measure of intelligence because I think of you know all of these solutions to the arc challenge as being very inefficient frankly but um they gave me quite a bit of push back and and actually they gave me a lot to think about I definitely updated quite a bit based on this conversation I’m going to be filming with SH very very soon um that will be a special edition so we’re planning to release that show in August so um this is a very interesting development actually I’m glad that all of this has happened if you have any questions that you would like me to put to Chalet based on these recent developments let me know in the comments and I will be sure to ask Chalet um all of your best questions the other thing is um you know you should think of this as an office hour show so I’m not you our production cycle on mlst is running into the months now uh it’s it’s getting very very difficult just to get anything out there so this is just a very quick uh you know just think of it as a fairly off-the-cuff conversation um because given the the timely nature of this I think it would be better just to get it out as quickly as possible so you know take that hedging as you wish and enjoy the conversation I don’t know I don’t know where where we should start with this who who would like to describe in in Broad terms what the arc challenge is and 
what Chollet's measure of intelligence is? I think, in broad terms, the ARC challenge is a benchmark for, I guess you could say, artificial intelligence systems. It's not really for machine learning per se; it's a benchmark that tries to provide a really strong challenge that goes beyond what you would see with a typical benchmark, in that he tried to make it non-gameable, meaning you can't just memorize a bunch of facts and perform very well on it. Essentially that comes down to looking at it as a kind of measure of intelligence. I don't think he intended it to be a perfect measure; he knew from the beginning that it wasn't ideal, but it's a good starting place, and I think he's trying to get people to focus on meta-learning in a sense. Basically he wants people to develop solutions that can learn from a few examples and generalize from those few examples.

Just to add to that a little: other meta-learning datasets exist, and a couple of interesting things follow from that. There have been papers showing that even on vision-based meta-learning benchmarks, if you look at the models that do really well and at the meta-learning methods, you find they're actually shortcutting: they're effectively zero-shot methods. So there's a distinction you have to make with zero-shot models, where what they know is these shortcut statistics; they have really good knowledge about textures and objects and so on. It's genuinely difficult to build a very good meta-learning dataset with something like vision. Even with held-out classes, in the classical few-shot problem framing you get five completely new classes at test time and have to classify them, but you can still achieve very good performance as a zero-shot model just by having really good features.

Now, the thing that makes ARC so special and so interesting, and why all of us have been working on it for so long, is that it really focuses on acquired knowledge in a way that's different from other meta-learning datasets. One aspect of that is the reduced number of priors: everything is simplified in a nice way. Another aspect is that it's a transformation you have to learn, which I think is also really interesting. Compared to classification, a transformation is much more information-dense; there are more things to keep track of, and having the instance be a transformation means there's more to learn and more to figure out. So ARC has a much higher focus on learning rather than pre-existing knowledge, and I think that's how it achieves it: by focusing on complex transformations.

Right. Chollet was a genius: he knew back in 2017 that these models were essentially memorization machines, and that reasoning is knowledge-acquisition efficiency. So he reasoned that if the knowledge wasn't already in the system, then given a couple of examples the system couldn't learn it, unless we threw lots of knowledge at it at inference time or whatever. There's an interesting point you made about in-context learning in language models as well. I think that's a bit of a misnomer, because if you think of language models as a database, it's more like an in-context database query. But ARC is really interesting because it has, let's say, 800 tasks, and we know those tasks are not in the database, so it's
got nothing to retrieve. So Chollet's insight was knowing that even in principle this problem wouldn't work on a neural network. What do you think, Michael?

Well, I think that may be a bit premature. It's an open question to what degree you can, via whatever you want to call it, interpolation or memorization, game this benchmark, and I think that's actually a nice thing about ARC: the hypothesis that performing well on ARC implies something close to AGI, or otherwise super useful, is falsifiable. Or at least that's how I like to view it. We can see it as a means to an end, or not just as a means to an end but as the end itself, or we just pretend as if it were, and then we try to maximize that score and hopefully learn a lot from the endeavor. So I wouldn't agree with the take that in principle you can't learn to perform well on ARC-like tasks with something like a deep learning model.

Just to comment on what you said before: I agree it's a proxy for intelligence. Chollet's measure of intelligence is not computable, so we have something like ARC, and we make the assumption that if we can solve something like ARC, which is to say if we can reason, meaning we can acquire knowledge efficiently on the fly to fill in a skill gap and do the thing, then we must be intelligent. With the caveat that ARC is leaky: there are lots of perceptual shortcuts, as Muhammad was saying, and presumably lots of ways to cheat it. But the base assumption is that a genuinely good solution to ARC would be more intelligent.

On the viewpoint that LLMs are just a database lookup table: is that known for a fact, or is it basically an analogy? To my mind, if you integrate some other recent research, like some of the interpretability work out of Anthropic, you see that these models are actually learning concepts and seem to be grouping things based on concepts. If it's true that these models can learn concepts, then it may be that they can guide the combination of the little programs they learn based on those concepts, and if you can do that, then essentially you have a form of generalization. Now, these models are prone to shortcuts and memorization, and with our approach we try to not let the model do that. That's one reason we use so many training examples: it's not necessarily to teach it so many concepts, it's to teach it a space around the concepts, and also to prevent it from using the shortcuts these models are prone to.

A quick comment on that. As an example, if you trained an LLM on chemical structures and then did a t-SNE plot, it would reconstruct the periodic table, especially if you trained the model on the language of scientists talking about chemicals, because we have discovered that knowledge socially and embedded it in a certain way in our language and in the corpus, and then that gets represented. I agree it's more than a database, because it's not a simulacrum; it is some kind of interpolative generalization around the corpus using this vector space. I saw that Anthropic paper and it was very interesting, but again, I think the semantic categories are a reflection of ones we've already found. Reasoning is really interesting because in ARC it's about traversing this space of combinations, and given that neural networks have finite capacity, there is a limit to how many combinations they can memorize during training.

Just to add, I think there's an interesting dichotomy here that we can establish and then move forward from, and I think we
can make some great points here. So, to speak a little more to what Jack is saying: can these models generalize? We know they can, starting from scratch in the training phase. They can classify the classes correctly, and they're able to form these concepts of classes and subconcepts and so on. Now, at test time, and I think this is common ground between us, these models see a novel example and, using the frame they've already obtained, they're able to do a little more processing, but it's all within distribution; we're not talking about out-of-distribution. And it's that type of processing we're speaking to. When you see a completely new example of a mug, and in the background there is maybe a different type of cup, and you're trying to classify mugs, you have to disambiguate, so there's a little bit of reasoning, a little bit of search that happens there; no argument about that. In the same way, when an NLP model is given a brand-new sentence, there is a little bit of figuring out what happens there.

The position Jack and I have taken, and why we've been teaming up for so long, we've been talking about this from the beginning, is that system one and system two are, in our view, basically the same thing, just happening at different scales. So we don't actually see a distinction between system one and system two. That type of inner processing, and Michael may agree with this too, is not so different from system-two processing; now it's time to scale it up. So that's what we were speaking to. It's not that LLMs are very good at generalizing out of distribution, or that they are not going to memorize. They will memorize certain things, but you can get them out of that. And if you look at the nature of the tasks they were already able to do really well, and take this perspective, you can see that they're really good at perceptual tasks, and perceptual tasks are tasks where you have to wrangle infinity, infinite ways of looking at the problem, and figure out a way of first identifying the objective and then implementing that to find a solution.

There's another thing here that's really interesting: the process of going from scratch to generalization that I described in training. That's also a very interesting process, one where you were able to learn that frame and produce these generalized classes, via SGD or Adam or whatever optimizer. That should also interest you when thinking about this: okay, maybe you can incorporate that into test time as well, such that you achieve more generalization.

Could we distinguish system one and system two? I kind of agree, well, let me just reason this out. A lot of people think the difference between system one and system two is discrete versus continuous, and Chollet of course made a distinction between the two, because he designed the ARC challenge as an example of system-two generalization, which presumably wouldn't work well with deep learning models. But to me, system two is more about the long tail: when you have discrete compositions of rules, you get into a very sparse space
and that's the reason why it doesn't work on deep learning models. Maybe, Jack, in your articulation, what is the difference between system-two knowledge and system one?

I think one way you could think about it is this. I don't know that you could make a great neuroscience argument for there actually being a system one and a system two, but I think it's a pretty good analogy, and useful for understanding the difference between deliberate, conscious reasoning in system two versus more perceptual, automatic, non-conscious processing in system one. From our view there might be more of a continuum between these things, rather than some kind of magical division between the two systems. To bring in some concepts from psychology, one way I look at it is in terms of the Cattell-Horn theory of intelligence, which is a hierarchical way of looking at it: at the top you have the g factor, and I think that's backed up by a lot of research, and beneath that you have fluid and crystallized intelligence. I think Chollet viewed ARC as a measure of fluid intelligence, and it seems like it's serving pretty well as that.

Now, to the earlier point about whether LLMs are a database lookup table: I would put it differently. I would say an LLM is a repository of crystallized intelligence, and, to a point I think Muhammad made earlier, the optimizer and the gradient-descent process are the fluid intelligence of the model, or a major component of it. So there's a little bit of fluid intelligence in these models that happens across the context when they're given an input, but primarily what you're accessing is the crystallized intelligence of the model. What we do in particular with our approach is bring in more of the fluid-intelligence aspects, to be able to alter things at test time so we get more of that effect, and some of the brittleness seen with LLMs is mitigated by the processes we use.

So your network is a crystallized database, but it's a little more than that, because you have to ask where the fluidity comes from in the generalization of a deep learning model. In a transformer, for example, it's the inductive prior; there's a kind of permutation expansion during training. At the end of the day every neural network is an MLP, there is an MLP that represents any neural network, and the inductive prior photocopies different symmetry transforms of the input data all over this representation space, and then at inference time we go and retrieve one of those representations. You're saying: let's flip this on its head and do the fluidity at inference time, which is when we actually combine different things together. By the way, Chollet used the term "active inference" for this. I'm not a fan of that usage, because I love active inference from the Karl Friston point of view, but it's still analogous in the sense that you're saying: I'm going to act, then get some sensory information back, then update my model. And what this does, I think, is a form of lazy evaluation: you're lazily building the equivalent of a thousand-times-bigger model, but because you're doing it efficiently, by sampling information from the environment, it's computationally cheaper. But my challenge to you guys is: isn't that against the spirit of the measure of intelligence, because you're still
effectively building a big database? Let's see. I mean, it's an empirical question. If the benchmark turns out to be gameable, such that you have something that scores really well but you haven't gained any insight from it, so be it. That would be a shame, of course, but it would still be a valuable insight. The space of all possible ARC tasks is extremely big, and it's certainly non-trivial to write a generator that randomly samples from that space, so I think we'll just have to see how well it holds up. It's a very minimalistic framework, with these ten symbols and at most 30x30 grids, but it obviously allows for an extremely great variety of possible tasks that you may encode in this ARC format. So I like to think of ARC not just as this specific set of a thousand tasks but really as a kind of style or format, and I would love, if those specific hundred or two hundred hidden test tasks turn out to have been gamed or whatnot, to see ARC continue living as a more open or dynamic benchmark.

From that I kind of have a question. How exactly do you mean "game the system"? Is it because we use the information acquisition, i.e. SGD, to create a database, in your words?

Okay, I see, yes: the augmentation. By augmenting at inference time, you're just filling in the database in the vicinity of where you need extra density.

Well, okay. ARC is a skill-acquisition task, right? And we really, truly see SGD as part of the model, rather than something we're using to update the model. This is the model: SGD plus the neural network. So I don't see it as gaming if we're able to solve the task the right way.
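To make the augmentation discussion concrete for readers: an ARC task is distributed as JSON with "train" and "test" lists of input/output pairs, where each grid is a list of rows of integers 0 to 9 (the ten symbols) and each side is at most 30 cells. A minimal sketch of test-time augmentation in that format might look like the following; this is an illustrative reconstruction of the general idea (one shared random rotation/mirror plus a color permutation applied across a whole task), not the team's actual pipeline:

```python
import random

def rotate90(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def flip_lr(grid):
    """Mirror a grid left-to-right."""
    return [row[::-1] for row in grid]

def recolor(grid, perm):
    """Apply a permutation of the ten colors to every cell."""
    return [[perm[c] for c in row] for row in grid]

def augment_task(task, rng):
    """Produce one randomly augmented copy of a whole task by applying
    the same rotation/mirror and color permutation to every grid.
    Hypothetical sketch: real pipelines differ in many details."""
    k = rng.randrange(4)            # number of 90-degree rotations
    mirror = rng.random() < 0.5     # whether to also mirror
    perm = list(range(10))          # random relabeling of the 10 colors
    rng.shuffle(perm)

    def transform(grid):
        for _ in range(k):
            grid = rotate90(grid)
        if mirror:
            grid = flip_lr(grid)
        return recolor(grid, perm)

    return {split: [{"input": transform(pair["input"]),
                     "output": transform(pair["output"])}
                    for pair in pairs]
            for split, pairs in task.items()}
```

Sampling many such augmented copies of a test task is one way to produce the extra "density in the vicinity" referred to above.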
If we're able to use its skill-acquisition efficiency, yeah, exactly, at test time, then it's all within the ARC framework that was laid out by Chollet.

The thing is, what's the best way to do that, and how? I think we ought to have no priors and preconceived notions about what the best way is to do search, to do reasoning. A lot of people have a lot of different ideas about LLMs, but we try to come at it from a blank slate. Like I said, we agree about the system-one/system-two thing, and we've seen that the best way to do skill acquisition on perceptual tasks is to learn. From the get-go we've been doing fine-tuning: Jack did fine-tuning over all the riddles and his approach worked, and I was trying a different type of fine-tuning from the start. Fine-tuning is a very common idea, but the reasoning behind it is that we agree with you 100% that reasoning is a search problem. That's one of the preconceived notions people have, that reasoning has to be search, and we agree. But the question is: what type of search, and how wide, how many paths in parallel, should you search for a perceptual problem? If you see this as a perceptual problem with infinite possibilities and infinite ways of mapping input to output, then how do you search in the best way possible?

One example I'll give: which object you look at in a riddle depends on the objective. What object is relevant to you depends on your objective, and that's a key insight from psychology; the invisible-gorilla experiment showed this. If I want to unlock my phone, then my unlock key is the object, but if I want to throw my phone at someone, then suddenly the whole phone is the object. The same applies to ARC: you don't even know which object to begin searching from unless you're looking at everything at once and then working out what's going on, what the objective is. So yes, you need to do a search, and reasoning is search, but this is how we're doing the search, and we think it's the most effective way to do it.

Yeah, I agree with you that reasoning is search, but I think there's more to it than that. You're describing the frame problem, the relevance problem, with respect to some objective. When we look at IQ tests and riddles, we have core knowledge, and you can get really good at riddles because you start to build a toolkit; you start to know that there are certain tricks, certain tools you use. I was speaking with a guy who's an expert in mathematical optimization earlier, and he said that after a while you just start to recognize: this is a bin-packing problem, this is a scheduling problem. You can also recognize analogies: this is actually analogous to that, and I already know how to solve that, so I can use that strategy here. I've written down some examples: the concept of connectedness, the concept of inside, the concept of monotonicity. These are all psychological priors that we have, and I think the spirit of the psychology approach, and what Chollet was going for, is that we do actually have this core knowledge. We have all of these base priors, maybe natively imprinted, maybe partly learned, and we do this search process, and over time we build a library: we start to internalize and compact combinations of these programs into bigger programs, and then over time we start to use the bigger programs as heuristics.
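Priors like connectedness and objectness are concrete enough to sketch in code. Below is a minimal, illustrative "objects" primitive that extracts 4-connected, same-colored regions from a grid via an iterative flood fill; treating color 0 as background is an assumption made here for simplicity, since what counts as background varies across ARC tasks:

```python
def objects(grid):
    """Group same-colored, 4-connected non-background cells into objects.
    Returns a list of (color, cells) pairs. Illustrative sketch of a
    connectedness/objectness primitive, not anyone's actual DSL."""
    h, w = len(grid), len(grid[0])
    seen = set()
    found = []
    for si in range(h):
        for sj in range(w):
            if (si, sj) in seen or grid[si][sj] == 0:
                continue
            color, cells, stack = grid[si][sj], [], [(si, sj)]
            seen.add((si, sj))
            while stack:  # iterative flood fill from the seed cell
                i, j = stack.pop()
                cells.append((i, j))
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if (0 <= ni < h and 0 <= nj < w
                            and (ni, nj) not in seen
                            and grid[ni][nj] == color):
                        seen.add((ni, nj))
                        stack.append((ni, nj))
            found.append((color, cells))
    return found
```

A solver can then rank or filter these regions depending on the inferred objective, which is exactly the relevance problem described above: which regions count as "the object" depends on what you are trying to do.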
And we have this amazing ability to look at a problem and very quickly compose these skill templates we have into a program that generalizes really well. It's really remarkable. Maybe, Michael, you want to comment on that?

Yeah, it's an interesting point. I definitely think this compositionality question should be more heavily addressed, by more people in general and specifically in terms of ARC. What I basically did in the beginning was follow exactly what François suggested in his paper: I built out the DSL, and then I tried to search the space of all possible programs I could write within this DSL. But I had two realizations there. One was that a DSL, if you think of it as a toolbox, a set of functions that can rotate an object or mirror it or whatnot, is really just that. What we bring as priors or knowledge is a lot more than knowing individual transformations or such functions; there was a whole lot missing in my DSL relating to how those components interact, how you compose them together, all that sort of stuff. And eventually I just felt like, come on, what I don't want to end up doing is have this handcrafted DSL and then write some search where I try to manually infuse some heuristics or hard-code some rules on how to search. That's just a static system that can't possibly be all that intelligent. So the main realization I had was: I'm interested in machine learning, can we view this as a learning problem? Obviously all those core-knowledge priors, or many of them, occur elsewhere, they occur in text, but they don't occur in this ARC format anywhere on the internet. So you can't directly learn from that and then transfer to the ARC domain. Obviously that's what I would love to see, a system that has never been trained on ARC-like data and still performs well on ARC. But my thought was mainly: we probably do want some kind of data-generation process, which is also something François suggested in his paper, not one that's static, but such that we can train within this domain.

That's actually a really interesting question you just raised, which gets to the core of what it means, in spirit, to solve ARC. The first level is the geometry of the domain: it's a 2D grid world. Then there are the priors, which are things like rotations, reflections, objectness and so on. Personally, assuming we are nativists and think there is core knowledge, I think the perfect solution would be relative to a domain and the priors for that domain: it should generalize to any other set of tasks on the same domain, because I don't think Chollet actually cares that much about having a purely blank-slate approach. Isn't that against the principle of psychology? Maybe, Jack, you can speak to that.

Yeah. The whole blank-slate idea is not really something that has been borne out by research into human beings, so we're not really a blank slate, and I don't think it's possible to have a blank-slate system with no knowledge that can enter a domain and just automatically understand what's happening there; understanding has to be composable from the knowledge you have. To my mind, some of the symbolic approaches can do deep generalization, but it's narrow, and a lot of ML systems have a broader kind of generalization, but it's shallow, and what we try to do is deepen it some, through the test-time procedures.

Just to play back what you said: you said that symbolic systems don't generalize very well but machine
learning systems do, but you also said the converse, that machine learning systems are narrow. So are you getting at this: in a continuous domain, and on certain types of problems like recognizing cats, machine learning models generalize very well, but my intuition is that symbolic systems generalize much better than machine learning systems in certain domains, because they are Turing machines; they can address a potentially infinite number of situations. They're completely different.

I basically outlined two dimensions, and I don't know if I'm completely right on this, it's just my intuition. I see symbolic systems as being able to do deep but narrow generalization. As Michael was saying earlier, when he developed the DSL for ARC he eventually realized that when you hard-code a system like that, it only works in that narrow domain, though it might work in a very deep way. And I see LLMs as having a wider range of generalization, but shallower, and what we try to do is deepen it. So with our particular model, what I've tried to do is go beyond ARC in the training and train it on a wide variety of tasks. If you take a crafted DSL system for ARC, in no way could it ever generalize beyond that, as far as I can think of, unless you design it very specifically to do that kind of thing. So that's my point on it.

Yeah. Maybe we should talk about your solutions. Jack and Muhammad, you worked on one solution, and Michael, you were the previous winner, if I understand correctly. Maybe we should start with yours, Michael, and then move on to Muhammad and Jack. Let's sketch out what your approach was.

Sounds good. Basically, I set out to work on ARC, and first I built a DSL, which was an iterative process: on the
one hand I added or modified DSL primitives, and on the other I used the DSL to construct solution programs for ARC tasks, as a kind of ongoing test of the DSL during development. So I do have reference solution programs for the training tasks. After that I thought, okay, now I have this DSL, how could I use it to try to reach some points on ARC? It was a semester project where the emphasis was on the DSL itself, so I did not do all that much on the actual search side. At the end of the day I would describe what I did as a glorified breadth-first search, where I applied all kinds of tricks so that it runs a bit faster and uses a bit less memory. Correspondingly, it only got about 6%, which happened to be enough to win the ARCathon, since there were not all that many teams, and people did not just reuse Kaggle notebooks from two years before. That's also when I asked myself how much further the score could be pushed by just scrambling together what's already out there, namely the Kaggle notebooks from the initial competition, which I then did, and which raised the score from about 28% to 30%. But I put that aside, because it's not super interesting; I want to work on something I believe is more promising, and that's also my own code. Ever since then my main focus has actually been on data generation, and that's also how Jack, Muhammad and I got to know each other better; we met on the Yannic Kilcher Discord server.

In an ideal world, what I wish to be doing eventually is write a piece of code that can sample from the distribution of all possible ARC tasks, because that should solve the benchmark to some degree, maybe, let's see. But I think this is extremely hard to do, because how do you make sure that whatever you generate is meaningful? If you sample a random DSL program and sample random inputs, you're almost guaranteed to get garbage as a result.
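The "glorified breadth-first search" described above, enumerating DSL programs and keeping one consistent with the demonstrations, can be sketched with a toy three-primitive DSL. This is illustrative only: Hodel's actual DSL is far larger and his search far more optimized, and these primitive names are invented for the example:

```python
from itertools import product

# Toy DSL: three grid-to-grid primitives (illustrative, not the real DSL).
def rot90(g):   return [list(r) for r in zip(*g[::-1])]   # rotate clockwise
def flip_ud(g): return g[::-1]                            # mirror top-to-bottom
def flip_lr(g): return [r[::-1] for r in g]               # mirror left-to-right
PRIMITIVES = {"rot90": rot90, "flip_ud": flip_ud, "flip_lr": flip_lr}

def search(train_pairs, max_depth=3):
    """Breadth-first search over primitive compositions: try every program
    of length 1, then length 2, and so on, returning the first program
    that maps every demonstration input to its demonstration output."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def run(g, names=names):
                for name in names:
                    g = PRIMITIVES[name](g)
                return g
            if all(run(p["input"]) == p["output"] for p in train_pairs):
                return list(names)
    return None  # no program of length <= max_depth fits the demonstrations
```

Because shorter programs are tried first, the first hit is also a shortest fitting program, which acts as a crude simplicity prior; the tricks mentioned above (speed, memory) would go into pruning and deduplicating this enumeration.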
So I have not really set out to do that as of yet. What I did instead, at first, was look at the ARC tasks and think of some kind of tagging of tasks via concepts, not quite a categorization, and then for a certain number of concepts, like object-level transformations or grid-level transformations, try to come up with some schema or generator that could produce tasks belonging to such a class. But obviously that falls far short, because you cannot just classify all the ARC tasks into finitely many classes. Then the thing I have mainly been working on is the RE-ARC dataset, where I said: okay, there are two reasons why ARC is hard. One is the great diversity between tasks; if they were all kind of the same, it would be easy. The other is the few-shot nature: we only get very few demonstration examples. Arguably, if either or both of those were not the case, it should be easier. So I figured, how well could we do at even something very simple, like learning a task in isolation, if we had an unlimited number of examples for that task? The idea with the RE-ARC dataset was really to make a dataset that would allow people to run more fundamental experiments, because I believe that's a bit lacking with ARC: a lot of people are trying a lot of things, but there are some very fundamental questions we haven't properly addressed. Can we solve a simpler version of ARC, basically?

And then, Muhammad and Jack, you guys have a very interesting solution which is completely different from that, using a large language model, if I understand correctly. Maybe, Muhammad, can you bring it in?

Yeah. Basically there are two parts to this. There is prior knowledge that we need to incorporate, and there are also general ARC skills.
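Backing up to RE-ARC for a moment: the "unlimited examples for one task" idea can be sketched as a per-task procedural generator. The task rule below (mirror the grid left-to-right) is invented purely for illustration; RE-ARC instead provides a hand-written generator per original ARC training task:

```python
import random

def sample_example(rng):
    """Sample one input/output pair for a single fixed task. The rule
    (mirror left-to-right) is a made-up stand-in for a real task's logic."""
    h, w = rng.randint(2, 30), rng.randint(2, 30)  # sides within ARC's 30-cell bound
    grid = [[rng.randint(0, 9) for _ in range(w)] for _ in range(h)]
    return {"input": grid, "output": [row[::-1] for row in grid]}

def sample_task(rng, n_train=3):
    """Bundle freshly sampled pairs in the usual ARC task layout, giving
    effectively unlimited data for this one task."""
    return {"train": [sample_example(rng) for _ in range(n_train)],
            "test": [sample_example(rng)]}
```

With such a generator, the few-shot constraint disappears, and one can ask the fundamental question raised above: can a model learn even a single task in isolation, given unlimited examples?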
like general ARC skills. That is another key thing: L-shaped boxes, or just grids in general, transforming them, moving them around, recognizing an object over here and an object over there and seeing that they are the same; just manipulating and playing with ARC. And then there is also the generalization piece, which is: okay, now solve a completely new task. We employ a bunch of techniques. Jack has been amazing at just trying a bunch of things out in terms of training. There are so many techniques, but the gist of it is that we do multitask training, which we think has its benefits. We train on generated ARC tasks, not as many new riddles as people would think, but we do pre-train on ARC tasks, and then at test time we fine-tune, and we augment, and we do whatever we can to get the model to generalize and do this kind of novel finding of transformations. But it is all a neural approach, basically. Okay, so just to play that back: you fine-tune on a whole bunch of ARC data which you have augmented, and then at test time you do what we might call "active inference", in quotation marks, which is where you augment around the test instances, you fine-tune the model, and then you do inference on that model, and that is your prediction. Very interesting, and that got you guys up to 34%, which was the winning score on the private set, which is amazing. One thing we didn't cover, and maybe Jack can explain this: a language model obviously works on tokens, so how did you set up the problem so that a language model could handle it? Did you have to try many different types of representation, or was it obvious which one was best? Yeah, that was actually a long process of empirical investigation.
What is the best way to represent ARC to a language model? I did probably twenty different mini-experiments in formatting the data in various ways. We started working with language models before they became super popular; this was back in the GPT-3 era, before everyone knew it was going to be such a hot thing. Anyway, I experimented with GPT-3 and even did some fine-tunes on GPT-3 with different ways of encoding the riddle, and went from that to fine-tuning various other models. So it is presented in text form to the model. The dataset itself is in a textual JSON format, and we reformat it in a way that fits within the context. That is one thing where I think Chollet really did a brilliant job in the construction of the benchmark: even the size of the riddles can be a really hard challenge. How do you get this into the model so that you can get a good response from it? Part of my approach was to think about the construction of a sentence, and to set up the construction of the riddle in as close a parallel as possible to what a sentence might be. So the whole thing is kind of like a paragraph. It tries to minimize the number of characters, but it also tries to put the riddle in the form most familiar to the model, the form the model can make the most use of. Could you speak to that? This is a self-attention transformer, and it is really good at this kind of permutation invariance. The thought occurs: on the one hand, would something like a CNN vision model be better for this kind of reasoning, or do you think there is something interesting about the language model in particular? Presumably that was the reason you encoded it this way, to pick up on the strengths of the language model?
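One plausible encoding in the spirit Jack describes, compact, row-per-line, with the ten colors as digits, can be sketched as follows. His actual winning format is not public at this level of detail, so treat this purely as an illustration of the design space:

```python
# Hypothetical ARC grid <-> text serialization for a language model.
# Each cell color (0-9) becomes one digit character; rows become lines,
# which keeps the encoding short and visually "sentence-like".

def grid_to_text(grid):
    """Render a list-of-lists grid as newline-separated digit strings."""
    return "\n".join("".join(str(c) for c in row) for row in grid)

def text_to_grid(text):
    """Inverse: parse the model's textual answer back into a grid."""
    return [[int(c) for c in line] for line in text.splitlines()]

example = [[0, 0, 3], [0, 3, 0], [3, 0, 0]]
encoded = grid_to_text(example)   # "003\n030\n300"
```

The round-trip property matters: whatever the model emits has to parse back into a grid to be scored.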
I think that would be a very interesting empirical question to answer. If you look at some of the research on universal transformers, there is even a version where they incorporated elements of RNNs and CNNs into those models, and they do find some pretty interesting things in terms of algorithmic learning with that kind of approach. So it may be that using something like a CNN would enhance the performance. On the other hand, I would probably want to focus more on just using data to get the model to learn as much as it possibly can without putting something like a CNN in front of it, because when you do that, I think you limit the generalization of the model. Now, you might be able to do it in such a way that it could be more multimodal, where you take both text and maybe have some kind of cross-attention, so I think it would be possible to preserve that and alter the architecture; it is just not something we have found necessary to do. That is one thing I try to do, too: see how far you can push the existing kinds of models without tacking something different onto the architecture. I think that has actually been pretty fruitful in terms of discovering the weaknesses of the models and then finding ways to mitigate them, and I think we have found several interesting things. So this gets to a really interesting point. You made the comment before that symbolic methods are incredibly deep within a narrow domain, and machine learning models are very shallow within a general domain. So you are basically saying that because you are augmenting a general model to make it deep in a specific sense, in this active-inference sense, you get the benefit of it being able to transfer to lots of different situations.
But the thing is, even if it is a multimodal model trained on lots of different types of data, the model still has the inductive prior of the self-attention transformer, which is quite limited. The beauty of writing programs and composing program templates together is that you can mix these priors in a very principled way: you can have a combination of permutations, translations, sorting, all of these different things mixed together. So there is always the question of what we do here. Do we only use programs? Do we take a hybrid approach with different models carrying different inductive priors? Or do we do what you do, which is have one foundation model and then fine-tune it with different knowledge priors on top, informed by the test instance? It feels like there are so many different approaches here. I overall agree with the sentiment, or the take, that there are those two classes of approaches, and arguably we would benefit from trying to combine them in a neat way. That could take many different forms: you could use a deep learning model to inform your search, or whatnot, and not all of that is completely new or has never been done. Sorry, what was the specific question again? The question was about the trade-off between, on the one hand, pure discrete program search, versus a single monolithic foundation model, versus a hybrid of different models with different inductive priors. Yeah, I am not completely opposed to eventually trying to reintroduce some of the symbolic aspects, to work with the DSL in some way or another, but as of now I am pretty happy with what we are doing, and in some sense going fully end-to-end is also really beautiful. But maybe Mohamed has something better to say here. I was just going to say that, at least to me, this
kind of problem of wide search is very difficult to do in many other ways, and the proven way to do it, in some sense, is to use SGD plus a model. The problem of finding the object in an ARC riddle is not so easy, because you need to know about the whole objective, so again this comes back into play, I think. Neural networks are really attractive to me in that sense. It is really interesting: there is a way to look at the whole ML field as having the goal of one algorithm, an optimizer plus a neural network, that is able to learn, I wouldn't say anything, but as many things as we can. That is one way to look at the whole field. You could instead look at it as making a lot of different, very skilled, narrow neural networks, but if you take a step back, they are all trying to figure out one algorithm. I think there has been a lot of progress here. If you agree that these are perceptual problems, then you have to see that there is a very flexible search going on, and it is very difficult to do that kind of flexible search over priors with programming. Compositionality is extremely important; it is something we think about and try to incorporate in our models, but we try to incorporate it within the neural network. There are some interesting specifics. We could also, like Jack said, just look at what improves these models, and treat this as a way to improve machine learning rather than finding a new algorithm. We do have a paper coming out that goes into these specifics and into how to go deeper, as Jack puts it, and there are some really interesting things there I
think we could really dive into, maybe when the paper comes out. What is the name of the paper, or is it secret squirrel at the moment? We don't have a name yet; we haven't published it yet, but it is in the works. Could we come back to something? You said earlier that you don't really see a distinction between type 1 and type 2, and there are concepts like "she's on top of the world", or "I'm inside the television", or "this blob is inside the container". It feels to me like we have this perceptual processing that allows us to transform a problem into an abstract representation, and then the magic happens: we do this compositionality. For me, that is really fundamental, and when you look at the state-of-the-art vision models, they are incredibly good at perception, but the worlds they generate don't make sense. Things are in the wrong place, because the models are not respecting these axioms about how the world works, and you could argue that there are universal rules about how the world works. To me, when mathematicians do deduction, well, a lot of people think that all of mathematics is deduction, but when you do conjecturing, we have all of these base axioms and we mix them together, and we creatively trace a path, and that path is the result of the conjecture. This magic thing happened: we figured out how to trace a path through this space. Anyway, that is a rambling way of saying that, to me, there is something very important about this kind of compositional type 2 space. You raised a very good point. I hope I am not talking too much, but you did raise a very good point about both of them. I guess the place where we disagree, or where we could dive in even deeper, is what is more important in this case. The point I was trying to make,
like you said, is that there is this perception involved, and then there are two types of thinking. There is logical inference: adding this and this mathematical equation, a mathematical proof, going through it, and obviously you place a very high importance on that. But there is also framing things as a perceptual problem. I need to first see the apple, then see that I have five apples, then abstract all of the infinity of perception into, okay, five objects, and then I can do math on the apples and give three to Michael, or whatever. So you need perception first. Now, the real question is: what is more important for ARC, the perception or the type 2 system? And also: can you do type 2 things with perceptual systems? That was the point I think you raised at the beginning, when you asked about the distinction between system 1 and system 2. Does perception involve this kind of ambiguity? Does perception involve putting importance on a certain transformation, and reasoning about which transformation is more important, and why, given the current objective? And how important is that in solving ARC? That is the question. First of all, if you look at language constructions, for example: language is also an infinite space, but colloquial language is very structured, and language models do a very good job of learning the manifold of language. Because, to put the pin in the middle of the dartboard, what we are talking about is meta-learning the patterns of composition in this kind of type 2 space. So we have the knowledge prior. If you think about it, if it were possible for us to just search the space of all possible Python programs, and we had some kind of
constraint on minimum description length and some kind of evaluation function, we would be able to solve ARC really well. But we can't do that, because the space is far too large. So we abstract it: we do library learning, and then we do meta search heuristics, and there is actually a lot of perceptual, interpretative structure in how those programs get composed together. I think that is what we do. The other thing I wanted to point out is: yes, deep learning models do apparently do colloquial reasoning, but it is not inventive reasoning; it is memorized, or crystallized, reasoning, and I think that was the point Jack was making at the beginning. The remarkable thing is that even though language models are finite in their capacity, they still have enough capacity to capture a hell of a lot of colloquial reasoning compositions. Jack, you're nodding your head; go for it. Yeah, I think you raise some good points there. A point you made earlier was about how these models end up with a lot of distortions, but you also see this pattern over time of the models getting better and better with respect to those things. There is something I would call cognitive resolution, meaning there is a way in which cognition can be more or less fine-grained. The smaller the model, or the less training it has, the coarser its cognition is, and the fewer the concepts that can come into play as part of the generation process. But the more training a model has, and the more data you train it on, the higher the level of cognitive resolution you get. You can even see it in a fairly literal sense with some of these vision-language models: if you explore what the model is able to perceive in an image,
the better the model gets, the more fine-grained the features it is able to perceive. To my mind, that is going to continue over time: these models will have finer and finer resolution. So if you take that back to thinking about mapping all these patterns over the colloquial spaces and so on, even though these models might have primarily crystallized intelligence, it seems you can go a long, long way with that, and we don't really know what its limits are. If you think about ten years from now, where people learn more and more about what kinds of data to train these models on, and that resolution gets higher and higher in a cognitive sense, I think you get some really interesting things. You could reduce these models, mathematically, to saying they just map a continuous distribution and therefore cannot deal with discrete kinds of reasoning, but I think there are a lot of counterexamples, even in language models themselves: the task is a discrete task, and language, in a sense, is discrete. Two things. First of all, yes, language models have a high level of cognitive resolution, and this is part of the reason why they overfit and learn these weird, alien surface statistics, which surprisingly generalize very well on natural data. For example, they can look just at the texture, the fur of an animal, and know that it is a cat. And I agree with you that it is not only surprising how far this has come; it is presumably going to go a lot further. But that means these models are working in a very alien world of cognition, which is nothing like the way we do cognition. And I don't think it is necessarily the case that they can't do discrete interpolation, because language models are a continuous representation
of a discrete space. I think what Chollet set out to build with the ARC challenge was simply a knowledge gap: it was designed to show that if the model doesn't know about something, it won't work. In the natural world, because there are these continuous, interpretative manifolds, the models kind of do know almost everything, because almost anything you could say colloquially that could be understood by someone else is on the manifold the model has already learned; therefore, apparently, it generalizes. But ARC is the existence case of something which was not in the dataset the model was trained on, because it is something completely new, and therefore the model has nothing to say about it. Yeah, I think ARC really is great for that. If you take a language model without anything extra and run it on ARC, it doesn't do well. When we started working on this, we went at least six months without getting a single point on the hidden test set. After lots of fine-tuning and work and nothing, we finally got one point, and that felt really good after all of that time. But it wasn't until we started using some additional techniques that we got the gains, and some of the gains are over a thousand percent above what the model can really do by itself. Could you guys comment, though, on whether this is a violation in principle of the measure of intelligence? The measure of intelligence basically says that the amount by which you go outside the bounds of your knowledge and experience, that generalization margin, and the efficiency of the expansion of that margin with respect to how much experience you have, is your intelligence. What you guys are doing, in my view, is adding in a lot of extra experience, which means the conversion ratio is lower. So would a common-sense interpretation be that your system isn't demonstrating intelligence?
Could you argue that, because you are doing this active inference in the neighborhood of the test examples, the expansion is still significantly more efficient than it would have been if you had trained a base model from scratch? I don't see a lot of difference between doing this with an LLM versus doing it with a DSL, where you have to actually engineer the solution to the problem. You could completely frame it a different way and say that the amount of human design that goes into making a DSL is cheating on ARC, because you had to design the system's behavior, in a sense. Also, what we have tried to do in our system, and have actually literally done, is to induce within the model the functions of a DSL. So if you take a DSL, think about its search space, use that to generate data, and then induce that space into the model, I don't really see how that would be cheating, in terms of the spirit of the thing. No, first of all, I agree with you in principle, and I didn't mean to imply that there is a difference in kind between purely symbolic methods and deep learning. For example, if you did a brute-force search of all possible Python programs, that would not be intelligent, for exactly the same reason: what you have done is generate loads of experience, and your conversion ratio is very poor. In your case, what you have done is start with something which is already very inefficient, because it is a language model, and then generate lots of DSL-augmented experience on top. One thing to remember is that you cannot really learn without some kind of prior knowledge, like we spoke about, right? You cannot learn from scratch, so you do need the prior knowledge embedded somewhere, somehow, and it is really
important. So, basically, you do need priors to learn. And it depends on how you define generalization. It seems like the ARC dataset has defined generalization as exactly this: learning a new transformation that you have not seen before. If that has been achieved, then in some sense that is, by definition, generalization; it is difficult to see the cheating argument, to be honest. But I will go back to the reasoning piece. We do think you need the prior knowledge, but there is that bit of the cognitive-resolution argument, and within the cognitive resolution there is flexibility. There has to be flexibility even to classify in-distribution images, and that, we think, is the core of our idea: that little piece of flexibility is actually very meaningful, it is composable, there is a way to get it to be deeper, and there is a way to apply it to ARC. And you do need ARC training to get there. With a completely unseen ARC riddle, you don't even know where the board starts or ends, you don't know what the objects are, you don't know what types of riddles you could come across from the human-generated distribution, or what board sizes to expect, at least the range of them; there are all sorts of things, maybe only slightly similar to what you have seen. Although, given what I know about the number of concepts we have trained on versus the number of possible concepts in ARC, I also doubt the memorization or cheating argument would hold. Maybe, if it's okay, Tim, Michael could talk a little bit about an experiment we did that was based on a smaller amount of data, specifically on the ReARC data. Jack has been training a model
on this ReARC dataset. To make clear what I did there: I took the 400 training tasks, and for each of them wrote a generator which can produce new examples of the task. In doing that, I made the tasks slightly more general; for instance, sometimes the background color is fixed in the originals, but it need not be fixed for the same transformation function to give the result. So basically that was a dataset of 10,000 examples for each of those 400 tasks. Jack has been training a model on that, and it has gotten quite nice results, actually. Jack, you can probably describe it better in detail. Yeah, when you run the model zero-shot on the hidden test set of ARC, it doesn't do well, maybe 1% or something like that, but when you run it through our whole system on the hidden test set, it can score up to 23%, which is basically a state-of-the-art solution when you compare it with the previous symbolic approaches, where 21% was the highest that a single DSL system achieved. So when we trained this model on a much smaller amount of data, limited to the set of concepts in the ARC training set, it achieved basically state-of-the-art performance when combined with the rest of our system. I think that shows, or is at least another argument, that these models are not just memorizing an entire domain; there is maybe something special going on here. I guess you would either say they do more than just memorization, or you say, okay, 22% of the test set is a duplication of the public training set, or something in between. You know, there is the measure of intelligence, which says the generalization must be normalized by priors and experience, so it is a case of: do you count the foundation model training as being that experience?
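As an aside, the per-task generator idea Michael describes can be sketched like this, for a single hypothetical task whose rule is "mirror each row". The rule and all names are illustrative; the real ReARC generators are hand-written per training task and far richer:

```python
# Sketch of a ReARC-style generator: rather than augmenting the two or
# three given demonstrations, sample unlimited fresh input/output pairs
# from a hand-specified distribution for one task.
import random

def generate_example(rng, max_size=10):
    """Sample a random input grid and compute its output under the
    (hypothetical) task rule of mirroring each row. Real generators also
    vary properties like background color to make the task more general."""
    h, w = rng.randint(2, max_size), rng.randint(2, max_size)
    grid = [[rng.randint(0, 9) for _ in range(w)] for _ in range(h)]
    return grid, [row[::-1] for row in grid]

# 10,000 sampled examples for this one task, mirroring the ReARC setup.
rng = random.Random(0)
dataset = [generate_example(rng) for _ in range(10_000)]
```

Seeding the generator makes the sampled dataset reproducible across runs, which matters for the kind of controlled experiments ReARC is meant to enable.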
I think you absolutely should. And then, just to bring in what Jack and Michael were saying: first of all, you have transformed the model to kind of augment the manifold of language, which means you are using the generalization power of the foundation model in your system, and then you are doing this ReARC thing, the augmentation of ARC. The other thing with these language models in general is that they are not very good with low-frequency data, so if you expand the dataset with augmentation, and you put in a whole distribution which is augmented around the language distribution, then these models actually start to stretch and generalize on this new problem. So you have designed this brilliant way to make language models generalize on ARC problems encoded and augmented in this particular way, but I would still say that, in principle, this is a model that has been trained on God knows how much data. Could you make the argument that it is intelligent? I am kind of agnostic about using that word anyway. Maybe one small correction first: you are talking about augmentation, but with ReARC I don't do any augmentation; I generate new examples from scratch. The generators specify distributions that I kind of reverse-engineered. But either way, this is actually something I disagree with a bit, in the sense that, at the end of the day, we are fundamentally interested in the behavior of some system, not necessarily in how it is implemented or how it would look if you opened it up and looked inside. That is also the only thing you can really measure: you have some system, it has a certain performance, and how you interpret that, or whether or not to call it intelligence, is kind of irrelevant.
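The normalization being argued about here comes from Chollet's "On the Measure of Intelligence". The full definition is algorithmic-information-theoretic and, as noted earlier in the conversation, non-computable; heavily simplified, the shape of the formalism is roughly:

```latex
% A deliberately simplified gloss, not Chollet's exact formula:
% intelligence as skill-acquisition efficiency over a scope of tasks,
% normalized by the priors built in and the experience consumed.
I \;\propto\; \operatorname{avg}_{T \in \text{scope}}
  \frac{\text{skill attained on } T \,\times\, \text{generalization difficulty of } T}
       {\text{priors} \;+\; \text{experience on } T}
```

This is why counting the foundation model's pretraining as "experience" in the denominator changes the verdict so drastically.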
At the end of the day, you care about how something behaves, not how it got there. I am not a huge fan of behaviorism; I think how we do the thinking, how we do the cognition, is very important. For me, the spirit of this measure of intelligence is that we think, and maybe we are wrong, but we think that humans have a bunch of base knowledge and then do reasoning efficiently, whereas language models have been given this vast amount of data. Now, I get that you could say, well, actually we humans have a vast amount of experience too, and maybe we are doing something not too dissimilar; you could make that argument. But it feels like you could think of our cognition as, in the moment, constructing skill programs very efficiently and coming up with an answer. Just one quick thing: I, at least personally, care more about getting something to work than about trying to reverse-engineer how we do it, and at the very least I expect any kind of more advanced system to be just very different from how we are. Well, I kind of agree and I don't, but let's do the agreement bit first. I think the reason Chollet came up with this was that on Kaggle, around 2017 or whenever, he noticed that all of the solutions didn't generalize beyond the instance of the problem, so they couldn't be applied to other types of the same problem. Now, that is not the case for what you guys have done: you could in principle take another similar type of problem, another ARC-type challenge, and do exactly the same thing, because you would do this dataset generation and this active inference and fine-tuning, and it would work. So you could say this is just a no-true-Scotsman argument, right? Your thing works, and I could do it again in another domain, so what's the problem? Maybe it is just a philosophical argument I am making when I say that is not really intelligence. I don't know. I mean, it is really
interesting. ARC is a behavioral test, right? It is a behavioral way to try to tease out that problem. Your point is that it might not be a very good behavioral test, whether LLMs solve it or not, and so on. But I do agree with you that it is unsatisfying, and we have to dig deeper; I think that is what you are saying. First of all, we have to say what we define as generalization. Do you count a completely novel, difficult ARC problem being solved, no matter what happened before, as generalization? Is that enough for you or not? I think that is an important question for us to ask ourselves, just so that we can talk to each other. But then also, it may not be super satisfying, and we want to dig deeper: what made the LLM solve this specific thing, exactly? What is the important part? I think that is a super interesting question that a lot of people want to answer, and I am excited to work on it: to figure out why our neural network solved it, what exactly it did, and how it got to that solution, trying to provide more than behaviorist responses, because I think that is also super useful. Perhaps it would answer the question one way or another. I agree with you that it is unsatisfying, basically, but at the end of the day, performance and behavior are also super useful. That is why these LLMs are useful in their specific, narrow domains; they do useful things. So behavioral thinking is useful, but I agree that it is unsatisfying, and I hope we can figure out why; I would like to work on that. It is interesting in the sense that if your dataset, and I mean generation, not augmentation, was just based on some fairly first-principles transformations and extensions of the existing examples, so something that could easily be
applied to something else, then I don't have a problem with coming up with new priors for another domain, for a different type of problem. It is all about the meta-learning and the reasoning. We have been thinking of that as a search process, but in this particular case you are saying: when we leverage the language model, we fine-tune it on all of this generated data, and now, when we sample from the language model, the language model itself is doing the meta-learning and the reasoning. That is interesting, because the whole hypothesis I have been going on is that that doesn't happen in a language model; a language model is a retrieval system, and it is not doing this kind of patterned meta-learning. But you are saying you think it can happen, if you can transform the problem domain onto the existing manifold of the language model. Yeah, 100%. There are very clear parallels to associative learning that you could see happening. There is no proof of it, but there are the tokens themselves, and the way self-attention works dynamically in the forward pass: there is a dynamic model being built up in the forward pass, just in the forward pass, within which you could see associations happening, or at least see how that might come to be. Now, whether that is actually happening or not is a very interesting research question. And that is kind of our pride; that is how Jack and I came together on this. We saw it as the same problem, and we were like: okay, these models are really good at that, and we can probably get them to be better. But yeah, I think we do have to dig deeper. Yeah, and to me this comes down to the notion of creativity in general, and it is also related to analogy-making. Part of creativity is being able to map
seemingly disparate things to things we already know. It's also related to this idea of core knowledge, because you could argue that everything we need to know is represented, let's say, in our language or in these existing models, and that there exists a mapping function, which we could trivially create, that takes something off the manifold and transforms it into a problem we already have, so that we can just use the existing language model. But to me it's always felt like there are so many examples where we do need ingenious creativity in the moment.

I think that is the case, and we're not saying that Transformers can do everything, we're not saying that about LLMs, far from it, but we do think there is a little bit more, just a tiny bit more, than the skeptics would have you think, and we're operating on that. As to whether they can do true creativity: I think there's something interesting to be said about learning and ARC, and perhaps even creativity. I think you do need some kind of learning to be able to form new concepts, because you have this pre-existing map, your pre-existing frame for looking at things, and the only way to change that frame is to adapt. There are many issues that prevent language models from doing that very much; we think they do it a little bit in the attention maps, but they have, for example, a frozen tokenizer: they cannot come up with new tokens. And it's not just a question of coming up with new tokens. You could easily go into the neural network and add another neuron, and there's your new token, but it has not abstracted over its inputs in a good way such that the new token means something. The way to do that is with feedback, with an optimizer and so on, and that's how you get a new abstraction, a new class.
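As a concrete illustration of the data-generation idea raised earlier (expanding a handful of demonstration pairs via first-principles transformations before fine-tuning), here is a minimal sketch in Python. This is a toy, not the actual pipeline the guests used; all function names are invented, and a real ARC augmentation pipeline would be far richer:

```python
import random

# A grid is a list of rows; each row is a list of colour ints 0-9.

def rotate90(g):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]

def flip_h(g):
    """Mirror a grid left-to-right."""
    return [row[::-1] for row in g]

def recolour(g, perm):
    """Relabel colours under a permutation of the 10-colour palette."""
    return [[perm[c] for c in row] for row in g]

def augment(pairs, n_colour_perms=3, seed=0):
    """Expand a few (input, output) demo pairs via rule-preserving transforms.

    Each transform is applied consistently to input AND output, so the
    underlying task rule survives; this is the 'first principles' part,
    since none of it is specific to any one task.
    """
    rng = random.Random(seed)
    out = []
    for inp, tgt in pairs:
        variants = [(inp, tgt)]
        # the four rotations of the square...
        g_i, g_t = inp, tgt
        for _ in range(3):
            g_i, g_t = rotate90(g_i), rotate90(g_t)
            variants.append((g_i, g_t))
        # ...doubled by a horizontal mirror gives the 8 symmetries
        variants += [(flip_h(a), flip_h(b)) for a, b in list(variants)]
        # plus a few random colour relabellings, shared within each pair
        for _ in range(n_colour_perms):
            p = list(range(10))
            rng.shuffle(p)
            perm = dict(enumerate(p))
            variants += [(recolour(a, perm), recolour(b, perm))
                         for a, b in variants[:8]]
        out += variants
    return out
```

With the 8 symmetries of the square and a handful of colour permutations, each demonstration pair yields a few dozen training examples, which is roughly the spirit of turning a few demos into a fine-tuning set.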
You do need learning for abstraction, yes.

That's really interesting, and I completely agree in principle that there's a kind of agential feedback system, so you can lean in to the specific instance; the problem itself is very important. But just to slowly wrap up: first of all, how much generation do you do at inference time? That's the first part of the question. The second part is: you've got up to 34% now; how far do you think it can go? There are people like Dileep George who have said that it's because of perceptual leakage, or that there might be some upper bound on how far you can go with this approach. So how far do you think it could go?

Well, it's a bit of an open question. We do have some projections that we think might be possible, and I wouldn't be surprised if, without a lot of innovation, it could eventually go to 50% or higher, and with additional innovations, well, I don't know if it could get to 85% or not, but I think it can definitely get somewhere over 50%. On the perceptual-leakage point: ultimately, nobody has seen the answers, so whatever kind of approach you're taking, symbolic or machine learning, you don't have access to the answers of the hidden test set, so there has to be some degree of generalization occurring. And in terms of how much data we use at test time, relative to the amount of training data, it's a tiny amount. One thing we see, too, is that there seems to be a scaling law relating how good a model has become to how much data it needs to fit a test-time distribution: the better the model, the less data it needs at test time to solve the problems. So I would say there probably is some degree of that perceptual leakage.
At the same time, I don't think it's all that way, just based on looking at the items. There are a number of items that are, I would say, more idiosyncratic rather than following a clear kind of pattern, and I think Chollet has said himself that eventually he would like them essentially all to be idiosyncratic, so that you couldn't predict the content of one item from the content of another. So yes, I think there's probably a little bit of that leakage, even to the extent that you can look at some of the existing symbolic solutions that are really crafted for a very particular kind of problem, like repairing a mosaic or something like that. So there is obviously at least some degree of that kind of leakage, but I don't think it accounts for the full 34%.

Interesting. Another thing I was thinking about: the ARC challenge is actually quite restricted, in a sense. The tasks have been designed around these core knowledge priors: objectness, goal-directedness, numbers and counting, and basic geometry and topology. Maybe, Michael, you'd be a great person to answer this, because you've done this dataset-generation work, but roughly speaking you can whittle it down to, I'm not sure how many transformations you had, Michael, but let's say on the order of 30, or maybe 50 or something, I'm not sure. So it's not task-specific, it's domain-specific, and this is an interesting thing, right? When we use ChatGPT, it's not domain-specific at all; I can ask it about almost anything. And when we create this active-inference thing that does domain-specific data generation, it's now a specialized model, isn't it? So how do you think about that? Do you think that in the
future we might have lots of specialized models for doing certain types of things?

I hope not. I hope it will just be one 50-line Python notebook or something that does it all. But maybe a little bit of input there: basically, the data generation and the model are not directly tied to each other. There's more than one way in which I generate ARC-like synthetic data, and then, on the other hand, there's the model. In terms of the concepts in the DSL, I view it as just a toolbox: not everything is a transformation of an object or a grid; there are also helper functions, and things like detecting a property of an object or a grid. Most likely it's a Turing-complete language, so it's very flexible. There are 160 primitives, but I know I could reduce it down to something like 30 if I tried, by starting to rewrite the DSL within itself.

What I was getting at is that for ARC there are about 100, I think, or did you say 150? But if we wanted it to work on any problem, wouldn't that blow up, and then we've got this exponential generation problem?

Well, I guess the question is somewhat ill-posed, because as I said, I can try to rewrite the DSL to use as few functions as possible but with the same semantic expressivity, so I would suddenly have fewer concepts but the same expressivity of the language, and I can obviously also do the reverse: for each composition or combination of a couple of primitives, I can add it as a new primitive. So I'm not quite sure about this question of the number of concepts or primitives; whether a combination of two concepts is a new concept is, I guess, up to the person to define.

OK, but in principle, we're generating a dataset to fine-tune a language model, and that dataset grows exponentially with the number of combinations of these primitives. We're not talking about infinities here; as we've described, it's highly structured and more parsimonious than we might have thought.
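To make the "toolbox" framing above concrete, here is a miniature sketch of what such a DSL might look like. This is purely illustrative: it is not the actual 160-primitive DSL being discussed, and all names here are invented:

```python
# A miniature ARC-style DSL: a few primitives plus composition.
# Illustrative toy only; the DSL discussed in the episode has ~160
# primitives and these names are invented for the sketch.

def rot90(g):
    """A grid-transformation primitive: rotate clockwise."""
    return [list(r) for r in zip(*g[::-1])]

def vmirror(g):
    """Another transformation primitive: mirror left-to-right."""
    return [r[::-1] for r in g]

def replace_colour(old, new):
    """A parameterised primitive: returns a grid -> grid function."""
    return lambda g: [[new if c == old else c for c in r] for r in g]

def height(g):
    """A 'property detector' primitive, not a transformation."""
    return len(g)

def compose(*fs):
    """Chain primitives into a new one. Composition (and the option of
    naming a composition as a fresh primitive) is why counting 'the
    number of concepts' in a DSL is somewhat arbitrary."""
    def run(g):
        for f in fs:
            g = f(g)
        return g
    return run

# A 'program' in this DSL is just a composition of primitives:
solve = compose(rot90, rot90, replace_colour(0, 5))  # 180-degree turn, then recolour 0 -> 5
```

A solver then searches over (or a fine-tuned model proposes) such compositions until one reproduces all of a task's demonstration pairs.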
But it's still huge, and that to me suggests a problem, especially if we're doing test-time fine-tuning with active inference. Just to give an example, there was that Tree of Thoughts paper, and OpenAI were talking about Q*, weren't they? They were talking about doing some kind of test-time trajectory optimization, and the reason we never saw it, presumably, is that it's too slow and they couldn't scale it, they couldn't make it work. So we're going to have this problem that in principle we could do all of this clever stuff, but the dataset is too big and it's just too much test-time inference. For ARC, because there are relatively few knowledge-informed priors, we can do that in tractable time, right? But if we start broadening that out to a very general model, it would become intractable to do at inference time.

I would definitely agree with the last point, but I would also say that even within the ARC domain, the number of possible problems that you could generate or come up with is already extremely big. So I don't know for certain about that, though to some extent I can see your point. I think one reason we need so many training examples is that the base models don't already have a great number of priors they can draw on, so if you imagine that we could apply some of this process to a very large model, a GPT-4-level model, that already has a great number of priors within it, the amount of data that you need at test time gets much smaller. We even see that with models trained to different extents: there's great variation in how much data they need at test time in order to give their best performance. There's also a scaling with model size: if you go to a larger model, you need a lot less data to get the same performance out of it. So if you keep scaling that up, and you have a model like, say, GPT-4 Vision, that has a lot of
visual concepts within it, then it wouldn't take nearly as much data to learn and perform well on ARC, because it would pull in some of those concepts that it's already learned, if that makes sense. So I don't know that the amount of data would explode at test time.

ARC is quite contrived in the sense that it's one-shot, but in the real agential world I think what you said makes a lot of sense, because the first time I generalize to a new class of problem I might have to do lots of simulations. This is what we do as humans: we run lots of simulations of the world, we think about what things might be in the future, and then it gets baked in, and the next time I see that situation I'm not dealing with novelty anymore. So it's not as if every single time we'd have to generate loads of data; the actual agent that we build would be durable, it would have a memory, it would learn, it would remember all the stuff we did before. That's another argument why it might not be that big of a deal, and your point is well taken that if the model is bigger and already has more knowledge, then you don't need as much generation to fill in the gaps.

I agree; I was just going to say it doesn't have to be exponential with respect to the concepts, as in e^n where n is the number of concepts. I think Jack's point, basically, was that it's probably more like O(n) if you already have priors, right? You could reuse them, say from GPT-4 Vision, and you'd need less training time. It doesn't have to be exponential; it could be O(n). It depends on how you see compositionality within the model working out, basically.

Guys, this has been amazing. What do you want to tell everyone before we go?

Have a look at the ARC tasks. It's a whole lot of fun, very addictive, and hopefully you
can produce some useful insights in the process of dabbling around with it.

Yeah, play around with ARC. It's very interesting; try out all the approaches, see why they fail, and enjoy.

I have to say that ARC is super fun and addictive, so be warned ahead of time: if you like hard problems and you go down this rabbit hole, it might be hard to escape. I think Chollet has done a brilliant job with this, and so many people have amazing creative ideas; when they apply even their own personal knowledge to some of these things, it just blows my mind sometimes what people are able to come up with, and I'm interested to see what all the creative minds out there are going to do too.

Well, anyway, guys, this has been amazing. Thank you so much, and honestly, great work on your respective submissions to the ARC challenge. It is a beautiful challenge, and there may well always be people like me and Gary Marcus who will move the goalposts after you've solved it and claim it's not really intelligence, but I think you'll accept I did grant that what you've done is very interesting, because it could address Chollet's original concern, in that it could be applied to a new type of problem, and I think that is a difference in kind. So, my congratulations.

Thank you very much, and thank you for inviting us to do this. To your point, I think we need skepticism; it helps reveal blind spots that we might have. So I think it's all good. Thanks a lot, and very nice to meet you.