I really think that computation, or compute, will be one of the fundamental currencies of the future. I mean, it's already that, but it will become even more so. Last week I was in Zurich to visit Tufa AI Labs. Every single month they have a physical meetup; the general idea is to talk about recent research and meet other cool people. This was Jonas Hübotter. He's got a paper out on test-time inference, or transductive active fine-tuning. Think about the Google Maps analogy: as you zoom in, you get more and more tiles, more and more resolution suited to the specific situation you're interested in. And that's what he's done for large language models: essentially, you get a test example, you do retrieval, and then you fine-tune the language model, and you get significantly better results for that situation.

You have some problems that you're interested in solving, and instead of trying to solve everything, every possible problem at once... of course, if you could do that, if you could solve every possible problem at once, you would also solve the problems you're interested in as a byproduct. But that's infeasible: we have limited time, limited memory, limited compute. So instead of doing all that, just focus on the problems you're actually trying to solve, the problems you're interested in, and most likely you can do that much more efficiently than if you were trying to solve all problems at once.

The DNA and our form of life gives us some very hard constraints, right? We have to consume water, we have to consume food to survive. But within that there are quite a lot of pathways to achieve it, and they are also very dependent on the society around us, and not just the society, our environment. And I think part of intelligence is coming up with these abstractions that, regardless of the environment, allow you to adapt to the environment and be able to fulfill your fundamental desires, and those will depend on...

Do you want to run Llama efficiently on smaller GPUs? What if you could run both training and inference on the same GPU? With CentML's breakthrough optimization technology you can maximize hardware utilization and slash AI computation costs. Running LLMs at scale shouldn't break the bank. CentML's intelligent optimization platform helps enterprises deploy AI models with maximum performance at minimum cost. Experience the difference.

I want to start with a brief introduction of what we've actually been working on. We've been working on the Pile language modeling benchmark. The Pile is a big dataset of natural language; it comprises a lot of different aspects, or a lot of different tasks, that you can frame in natural language. This includes code, this includes math, this includes scientific papers, but also more general content that you could get from the web. Now, this Pile language modeling benchmark is all about trying to learn language models that can predict the next token, or the next word if you will, in this big space of text. It has really been around since early 2019, and the first models that were evaluated on it were GPT-2 models; they were quite good at the time, quite bad if you look at them now. What you evaluate here, what you can see on the y-axis, is called the bits per byte; that's really the next-token prediction error.
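For reference, bits per byte is just the model's total next-token negative log-likelihood converted from nats to bits and normalized by the number of UTF-8 bytes of the text rather than by the number of tokens, which makes models with different tokenizers comparable. A minimal sketch, with hypothetical variable names:

```python
import math

def bits_per_byte(total_nll_nats: float, num_utf8_bytes: int) -> float:
    """Summed next-token NLL (in nats) -> bits per byte of the raw text."""
    return (total_nll_nats / math.log(2)) / num_utf8_bytes
```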
What we've been able to show recently is that you can do much better than the previous state of the art by learning at test time, and I'll talk more about what that means, and you can do so with a much, much smaller model than the big models that were the previous state of the art. The previous state of the art was a 130-billion-parameter model, and now we're at 3.8 billion, so more than 30 times smaller, and actually much better. So I'm going to talk about how you can get this, and there are roughly two aspects to it: one is this idea of learning at test time, which I'll talk about first, and then I'll talk a little bit about a key aspect of that, which relates to data selection and sequential decision making.

Cool. So what does learning at test time look like in this framework? I want to start with a very simple picture; it's just the standard machine learning pipeline, you all know it. You start with some training data, you use that to train a model, and then you use that model to make a prediction at some test instance. Now, one of the first things that virtually everyone is taught in their first course on machine learning is that you should separate these stages of training and testing: you should train your model once on all the data, you should freeze it, and then evaluate it. You should not conflate training and testing. But in fact, whenever you evaluate your model on some test instance, you know what you are evaluating it on, at the latest right before you do the forward pass. Learning at test time is really embracing that: it is about using that test instance to train, or learn, a specific model tailored to what you're trying to predict, and then using that model to make the prediction.

I want to motivate with a very simple picture why something like this could even make sense, and I want to talk about curve fitting. Let's say we want to fit this black curve, and it's just a linear curve, the simplest curve you can come up with. This is fairly easy: you can just use linear regression, fit it, and be done with it. But what if you now want to fit this slightly more complex curve that looks a bit weird? If we fit our linear regression, it will somehow capture the trend in that curve, but it will clearly not fit it so well. A lot of machine learning has been about figuring out what methods we can come up with to get a better fit of that curve, and one of the fundamental ways people go about this is parametric models: you can increase the model class that you search over, for example extending the feature space using polynomial features or periodic features to get a much closer fit, or using neural networks to learn these features from data. But what you do is increase the model class, which means you have to search a much larger model class, and you need a lot more data to do that. Another approach people have looked at is nonparametric models. These are approaches where your model itself is the data; you don't train anything on top. That would be kernel regression, kernel ridge regression, k-nearest neighbors. A problem with these is that if the dataset is large, they can be very expensive.

I want to talk about local models, and very simply they look like this: you train a separate local model, in this case still a linear model, for every prediction that you're trying to make. And you can see, just pictorially, that with this you can fit this much-more-complex-than-linear curve much better than with a single linear regression.
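Here is a minimal toy sketch of that picture (my own example, not from the talk): for each test input, fit a separate linear model on its k nearest training points and predict with that.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = np.linspace(0, 10, 200)                      # a curve that is clearly not linear
y_train = np.sin(X_train) + 0.3 * X_train + rng.normal(0, 0.1, size=X_train.shape)

def predict_local_linear(x_test: float, k: int = 20) -> float:
    """Fit a fresh linear model on the k nearest training points only."""
    idx = np.argsort(np.abs(X_train - x_test))[:k]     # the local neighborhood
    A = np.stack([X_train[idx], np.ones(k)], axis=1)   # design matrix [x, 1]
    coef, *_ = np.linalg.lstsq(A, y_train[idx], rcond=None)
    return coef[0] * x_test + coef[1]

print(predict_local_linear(3.7))                       # one prediction = one tiny model
```

Each prediction touches only k points and a two-parameter model, yet together the predictions trace a far richer function than any single linear fit.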
I want to argue that these local models have fundamentally two components. One, they have a parametric controller: that is the model we use to make these local predictions. In the simple story I've been telling it would be a linear model, but you could go beyond that. The other key component is this non-parametric memory, the data from which you select to build your model locally for every prediction. Traditionally, people have been using k-nearest neighbors in some representation space: they build a representation space, and then for every prediction they select the k nearest neighbors of the test instance and build their local model using those.

Now I want to take a step back and think about what these local models might mean. It seems interesting that, using a small model class, sticking with these linear models, we can predict this much-richer-than-linear function class. Another thing that seems interesting is that training one local model requires much less data than training the huge model that would fit this entire complex curve. That seems odd, and maybe too good to be true. There are a few things I didn't talk about yet, and maybe one of the key things is the importance of representations, or abstractions. This really comes up especially when you talk about how you select data from this non-parametric memory. To be able to cheat, to be able to just train a small model, to search a small model class using little data, what we are essentially leveraging is abstractions, representations, that tell us: of the entire body of information stored in your memory, only these pieces of information are important, and you can focus on them. And maybe these abstractions already tell us how to use them to make predictions. That's, very abstractly speaking, the theme; I'm going to go into more detail about how you could operationalize these ideas.

I want to motivate this problem of local learning one more time, in a pictorial fashion. Consider this space as the space of all token sequences, and this manifold in the space as the manifold that captures all of natural language. What most state-of-the-art models are doing is trying to learn, to fit, this entire manifold; you could call that an inductive type of learning. But then you quickly realize that what you care about is not really all of natural language. You don't want to reproduce Reddit; instead, you want to solve some coding problem. So you look at the interesting part of natural language, maybe that's coding, and you fine-tune your model. Local learning is really the extreme of that: you realize that actually you don't care about all of code, you care about only a sub-part of code, and maybe you just care about the specific coding problem you're trying to solve with your code base. You give the model your code base, you ask it your task, and it tries to solve that. So that's the story of local learning and what it's trying to solve.
This idea is not really new; it goes back to at least the 60s and 70s, and I think in the '80s the famous statistician Vladimir Vapnik put it in a very nice way. He said: when solving a problem of interest, do not solve a more general problem as an intermediate step; try to get the answer that you really need, but not a more general one. I really like that quote. Essentially what he's saying is that you have some problems you're interested in solving, and instead of trying to solve everything, every possible problem at once... of course, if you could do that, if you could solve every possible problem at once, you would also solve the problems you're interested in as a byproduct. But that's infeasible: we have limited time, limited memory, limited compute. So instead of doing all that, just focus on the problems you're actually trying to solve, the problems you're interested in, and most likely you can do that much more efficiently than if you were trying to solve all problems at once.

So now I want to go back to LLMs, stick with this simple picture and the curve-fitting example we talked about, and entertain a hypothesis. Let's say this more complex curve describes all of natural language, and this simpler curve, which somehow fits the trend but not perfectly, corresponds to current language models. Could language models which locally learn at test time fit this curve more accurately without increasing the representational capacity, without increasing the model class? Sticking within the model class we are currently using, but doing this local adaptation at test time, can we fit that curve more accurately?

So, does this work? I preempted that a bit in the beginning, and it does. Here you can see results with a bunch of different models at different scales, and I'll go into that a bit more. Essentially, what we do is a very mild form of local learning at test time: we select some data (I will talk about how we select it) from this big memory, and then do a few gradient steps, a few steps of backpropagation, 50, on that data, and then use that as our final model. That's this red bar, and you can compare it to this gray bar, which is the initial model, and you can see, it's quite obvious, it does much better, significantly better. Even, or especially, with the state-of-the-art model Phi-3: just doing this local adaptation, this local learning, gives you much more than going to the largest Phi-3 model, the 14B model, which is more than three times as large; that scaling buys you less than half the gain of this local learning, maybe even less.

Another thing that we looked at was comparing this to in-context learning. In-context learning is also a form of local learning: instead of doing backpropagation, you just put your data into the context window. I will show you some results; explanations for these results I don't really have, and we can entertain some theories later. Interestingly, with the GPT-2 models in-context learning doesn't work quite as well as backprop; with Phi-3 it's more on par. But a really cool thing we can do, because we work with this Pile dataset, this big meta-dataset of human language, is look more closely at where this divide actually happens.
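To make "a few gradient steps at test time" concrete, here is a hedged sketch using the Hugging Face API. The model name, number of steps, learning rate, and the provided retrieved texts are placeholders, not the exact configuration from the paper:

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the talk evaluates GPT-2 variants and Phi-3
tok = AutoTokenizer.from_pretrained(model_name)
base = AutoModelForCausalLM.from_pretrained(model_name)

def test_time_finetune(prompt: str, retrieved_texts: list[str],
                       steps: int = 50, lr: float = 5e-5) -> float:
    """Adapt a fresh copy of the base model on retrieved neighbors for a few
    gradient steps, then score the prompt (lower loss = better)."""
    local = copy.deepcopy(base)                 # per-test-instance model
    opt = torch.optim.AdamW(local.parameters(), lr=lr)
    local.train()
    for step in range(steps):
        text = retrieved_texts[step % len(retrieved_texts)]
        batch = tok(text, return_tensors="pt", truncation=True, max_length=512)
        loss = local(**batch, labels=batch["input_ids"]).loss   # next-token loss on the neighbor
        loss.backward()
        opt.step()
        opt.zero_grad()
    local.eval()
    with torch.no_grad():
        batch = tok(prompt, return_tensors="pt")
        return local(**batch, labels=batch["input_ids"]).loss.item()
```

In-context learning, by contrast, would leave the weights frozen and simply prepend the retrieved texts to the prompt.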
So these are GPT-2, GPT-2 Large, and Phi-3, and just the top four sub-datasets where you see the largest difference between in-context learning and fine-tuning. What seems odd is that across these very different models, and especially GPT-2 and Phi-3 are very different, it's very consistent where you see the difference between fine-tuning and putting data into the context. In particular, with math, which here is school-level math questions, with coding, or with FreeLaw, which is court opinions, so probably the more complex problems, you see a big advantage of using backprop instead of in-context learning, at least with these models.

Another interesting thing concerns the math dataset. By the way, I should have said this earlier: these absolute numbers are relative to the base model, the percentage of bits per byte that you have after this local adaptation relative to the base model. So 100 is exactly the performance of the base model, zero would be predicting everything perfectly, and 50% would be a 2x improvement over the base model. Interestingly, with math we found that in-context learning doesn't change the performance of the base model at all, whereas using the same data with backprop can drastically improve performance. And the same story, interestingly, with arXiv: it doesn't show up in the top four here, but it also stays at 100% for in-context learning, while with fine-tuning it gets much below that. That's kind of odd; I just wanted to highlight it, and maybe we can entertain some thoughts later.

Cool, so now I want to talk a bit about the other challenge, which is what data you should actually select to build these local models. That's really a key part, because if your data is not good, you'll end up with a bad local model, as I will show you. And you really want this to work if you want to deploy this approach in practice, because if it doesn't work, things can really fall apart when you deploy such a model. Very differently from training at train time with the standard train/test split, here you don't have control: you adapt at inference time, and you don't monitor every prediction your model makes, so you really don't know what is happening. With the standard train/test split, you can benefit from the fact that you train your model once, then you have a lot of chances to probe it, see how it behaves, and once you feel comfortable with it, you deploy it. In this case you're adapting your model as it is deployed, and maybe that makes you a bit hesitant. So this is really a crucial aspect.

I want to motivate this question of how the model should decide what data to select with a simple example. Let's say the language model is prompted with a question about the age of Michael Jordan and how many kids he has. Clearly these are two pieces of information, and if the model has not memorized them, they need to somehow be retrieved, if you will, from the data, from this memory. We ran this experiment with the setup that you don't have a single piece of information that contains these two facts in conjunction, so you somehow have to synthesize multiple pieces of information. The standard thing people do, as I mentioned, is k-nearest-neighbor retrieval in some dense embedding space, in this representation space.
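As a reference point, dense nearest-neighbor retrieval is roughly the following (a minimal sketch; the embedding model and the memory are placeholders):

```python
import numpy as np

def knn_retrieve(query_emb: np.ndarray, memory_embs: np.ndarray, k: int = 50) -> np.ndarray:
    """Return indices of the k memory entries closest to the query under
    cosine similarity -- the standard dense-retrieval recipe. Note that it
    scores each entry independently, so near-duplicates are happily retrieved twice."""
    q = query_emb / np.linalg.norm(query_emb)
    M = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    return np.argsort(-(M @ q))[:k]
```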
And what that does, as I can show you here, is retrieve two pieces of information that are both about the age. So that failed. Why did it fail? Because if you look in this latent space at what is closest, it will pick up the dominant frequency: if you have multiple pieces of information that carry that frequency with high amplitude, it will pick those repeatedly before going to different clusters that are slightly further away. And that's the problem here: it picks redundant data. The total information gain of these two data points is not so large; the first data point adds information, but adding the second doesn't really add anything relative to what we're trying to predict.

Now I want to talk about a different way of selecting data, which we call SIFT. In this case it tries to select non-redundant pieces of information, and it actually gives you the information necessary to answer the question. We see this effect not just qualitatively but also quantitatively. If you look at the next-token prediction error against the number of fine-tuning steps, you can see that selecting informative data makes a real difference. If you go by nearest-neighbor selection in something like a well-maintained dataset, like this academic Pile dataset, it is still significantly worse than SIFT, but not completely bad. But if you go to a more realistic setting where you have true duplication of information, which we simulate by simply duplicating data in the Pile, nearest neighbor will repeatedly select the same data point: it's not able to detect whether some information is actually redundant or not, and then the error can actually blow up. In that case you clearly don't want to do any local learning at test time, if your error gets that much worse. So that's interesting.

Now, how does SIFT work? SIFT follows a very simple principle: select the data that maximally reduces the uncertainty about what the model should respond to the prompt. I will talk about how you can operationalize this; essentially it requires two steps. First, we need to estimate this uncertainty, and second, we need to minimize it. Very simple; I'll talk about both.

First off, estimating uncertainty. That is a very difficult task, and a lot of people have looked at it in the context of large language models, but also many other models. The way we approached it was to try to make it tractable, so that in the end we can optimize over this quantity. To make it tractable, we used a simplifying assumption: we treated the language model as a linear model. Not linear in the values of the distribution, but a logit-linear model: the logits, before they are passed through the softmax to give next-token probabilities, are assumed to be linear in some known representation space. Of course, this representation space can be highly nonlinear in the actual inputs, but we assume it to be known and fixed. This may seem like a very strong assumption.
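In rough notation (mine, not necessarily the paper's), the assumption is that the logits are linear in a fixed, known embedding φ of the token sequence, and only the linear map θ is adapted at test time:

```latex
\operatorname{logits}_\theta(x) \;=\; \theta^{\top}\varphi(x),
\qquad
p_\theta(\,\cdot \mid x\,) \;=\; \operatorname{softmax}\bigl(\theta^{\top}\varphi(x)\bigr).
```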
But interestingly, since the beginning of language models, people have looked at how language models may encode concepts as linear representations in this highly nonlinear representation space. Some recent work at ICML showed this very nicely, and people have been calling it the linear representation hypothesis; a lot of work in interpretability, for example, also uses this idea. Similarly, if you look at data selection and compare, for example, to k-nearest neighbor: fundamentally, methods that rely on dense embeddings extracted from LLMs are also somehow relying on this assumption, they are also treating or looking at a fixed representation space.

So what does this give us? We can make sense of three key objects. The first is the ground truth, the true distribution of natural language. Then we assume we have some pre-trained model, the base model that we start with. And we denote the tuned model after taking n gradient steps. What we show is that you can relate the error of your model after taking these n gradient steps to a key object that I will talk about. How do we measure this error? The error here is the TV distance, the total variation distance, between the two distributions over next tokens: the true distribution and our model. If this error is small, we're very happy, we have a good language model; if this error is big, our language model screwed up somehow. Under the simplifying linearity assumption, we can bound this error by a scaling factor, which depends on the performance of the initial model, times this key object sigma_n, which depends on your prompt. This key object can be interpreted as a measure of how uncertain the language model is about what to produce as output to the prompt. The cool thing is that, under the simplifying assumption, this object is actually tractable: we can analyze it analytically, and we will also be able to minimize it. It depends on the representation space that you have and also on the data: if you have good representations you can make better sense of the data, and if you have good data you can make better predictions, you can be less uncertain; if you have bad data, you must be uncertain.

So how do we minimize this uncertainty? Again, the principle is to choose the data that minimizes the uncertainty of the model after having seen that data. How do we operationalize this? We minimize the uncertainty about the answer to the prompt given X_n, where X_n is just the data we have selected so far plus the new data point. A very natural object, and I'm going to talk more about it in just a moment; you can actually interpret it in various ways. What you can show for this type of selection rule is that the uncertainty decays, it essentially vanishes, down to some irreducible constant, and I'm going to walk you through this. What this shows is that as n grows, as you select more data, your uncertainty decays to this irreducible constant here.
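Under that assumption, the uncertainty about the response to a prompt behaves like the posterior variance of a (kernelized) linear model in the embedding space, and the selection rule becomes: greedily add the candidate that most reduces that variance. Below is a naive numpy sketch of this principle; it is my own simplification, the function names and the noise parameter are mine, and the brute-force greedy loop is far less efficient than the actual SIFT implementation.

```python
import numpy as np

def posterior_variance(prompt_emb, selected_embs, noise=1.0):
    """Posterior variance at the prompt under a Bayesian linear model in the
    embedding space, after conditioning on the selected data (linear kernel)."""
    k_star = float(prompt_emb @ prompt_emb)
    if not selected_embs:
        return k_star
    Phi = np.stack(selected_embs)                          # (n, d)
    K = Phi @ Phi.T + noise * np.eye(len(selected_embs))   # (n, n)
    k_vec = Phi @ prompt_emb                               # (n,)
    return k_star - float(k_vec @ np.linalg.solve(K, k_vec))

def sift_select(prompt_emb, candidate_embs, n_select=50, noise=1.0):
    """Greedily pick data that minimizes the remaining uncertainty about the
    prompt; a near-duplicate of something already selected barely reduces the
    variance further, so it gets skipped -- unlike with nearest-neighbor retrieval."""
    selected, chosen = [], []
    remaining = list(range(len(candidate_embs)))
    for _ in range(n_select):
        scores = [posterior_variance(prompt_emb, selected + [candidate_embs[i]], noise)
                  for i in remaining]
        best = remaining[int(np.argmin(scores))]
        chosen.append(best)
        selected.append(candidate_embs[best])
        remaining.remove(best)
    return chosen

# Demo with random embeddings standing in for the prompt and the memory.
rng = np.random.default_rng(0)
print(sift_select(rng.normal(size=16), list(rng.normal(size=(500, 16))), n_select=5))
```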
And what is this irreducible amount? It's quite intriguing, because it depends both on the abstractions and on the data that is available in your memory. What you can think of is this: if you have a memory that tells you about, let's say, sports, and then you get asked a question about politics, and your model does not have good knowledge of politics, then you won't be able to answer the question well. That is essentially what this says. It says that you need to have the data in your memory to be able to answer the question, and you also need to have the abstract representations to be able to make sense of that data; those two things are critical in conjunction. Essentially, it says that predictions can only be as good as the data and the learned abstractions.

Cool. I want to give you a slightly different interpretation of what SIFT does, through a probabilistic lens, because I think that way one can describe its properties very well and also contrast it with what nearest-neighbor retrieval does. Again, as we discussed, nearest-neighbor retrieval really doesn't take informativeness, and especially possible redundancy in the data, into account, and this probabilistic lens will let us emphasize why SIFT does. What I want you to think about is the following: think of the language model as a probabilistic model which has a belief over what the right model describing natural language is. This belief is an epistemic belief, a probability distribution. The controller has this epistemic belief, and now it interacts with the memory: it searches the memory for some data, receives a noisy response from that memory, updates its beliefs, and then searches again. What you can show, what is a fact, is that SIFT actually maximizes a mutual information, an information gain, in this probabilistic model. And what is this information gain? SIFT maximizes the information gain of the data, of the response that it receives from memory, relative to the prediction that it's trying to make, conditional on all of the data it has seen so far. That's really a very simple statement; you can express SIFT very succinctly.

What is nice about this interpretation is that you can write this information gain in a different way: you can split it into two terms. The first term is just the information gain of the response from your memory, of your data, relative to the prediction that you're trying to make. That is maximizing the relevance of the data you're trying to obtain, and you can think of it as the thing that nearest-neighbor search optimizes: it's just trying to find the most relevant data. In a needle-in-a-haystack setting that's maybe not the worst thing, if you're just searching for something; but if you're learning, you need to synthesize multiple pieces of information together, and that is conveyed by the second term. The second term is actually called redundancy in information theory; it's also called the multivariate information gain. What this term measures is the redundancy of your new response y of x and the previous responses y_1 through y_n, relative to making the new prediction. In that sense, it is large if your new information is redundant, already contained in the previous information, and small if your new information is novel, and we are trying to minimize it, so we're really trying to get novel information.
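In symbols (my notation, not necessarily the paper's), with $f_{x^\star}$ the response we want to predict at the prompt, $y_x$ the candidate data point, and $y_{1:n}$ the data selected so far, this is the standard identity relating conditional mutual information to interaction information (sign conventions for the three-way term vary across the literature):

```latex
\underbrace{I\bigl(f_{x^\star};\,y_x \mid y_{1:n}\bigr)}_{\text{what SIFT maximizes}}
\;=\;
\underbrace{I\bigl(f_{x^\star};\,y_x\bigr)}_{\text{relevance}}
\;-\;
\underbrace{I\bigl(f_{x^\star};\,y_x;\,y_{1:n}\bigr)}_{\text{redundancy}}.
```

Nearest-neighbor retrieval effectively optimizes only the first term; the second term is what penalizes selecting near-duplicates of data you already have.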
In that sense, this decomposition really tells us why SIFT works, why SIFT selects non-redundant data, as opposed to nearest-neighbor retrieval.

Awesome. So does it actually work, and does it work better than nearest-neighbor retrieval? It does, again at all scales. Here we also compare against something called uncertainty sampling, which is more purely about minimizing redundancy and not really so much about relevance. It turns out that balancing these two things in an intricate way really helps, and interestingly, it helps a lot more with stronger base models. Relative to the performance gain of nearest-neighbor retrieval, SIFT does add something to GPT-2, but not as much as it adds to Phi-3. That indicates, and it's intuitive, that for a stronger model it is more important to receive maximally informative data, and to make use of it, than for a weak model, where maybe some redundancy is not so bad for making a better prediction. Another thing we find, which I'm happy to talk more about, is that we see larger gains over nearest-neighbor retrieval as the size of the memory increases: it seems that if you have a larger memory, taking informativeness into account becomes more important. One other thing I want to point out, which is maybe also very important, is that this performance improvement over nearest-neighbor retrieval is robust across all of the Pile: regardless of what sub-domain of natural language you work on, it seems that taking informativeness into account always helps. In particular, if you look at the column that corresponds to the failure mode of nearest neighbor, where information is duplicated and your fine-tuned model is pretty much always worse than the base model, SIFT achieves a dramatic improvement.

Awesome. So I want to zoom out a bit before we end and talk about one more thing, in this context of learning at test time, that I find really exciting, a very interesting question: can we learn these representations over time? It seems very important, if we want to end up with open-ended systems, that these systems are not held back by the abstractions and representations they learned once, but that they continuously adapt and continuously learn from new interactions, new data; the compute that we spend should not go to waste. I said before that these representations and abstractions are really key to making good predictions, so I want to motivate this again with a very simple sketch. Let's say the black dot corresponds to the test instance that we're trying to predict, and these gray dots correspond to the data that is needed to make a good prediction, and this shape here is supposed to somehow demonstrate, in our simple Euclidean space, how the model thinks these gray dots are related to the black dot. Good abstractions would put these gray dots in very close proximity to the black dot, but these representations are not good. So what we want to end up with is a system that, over time, can move these gray dots, these bad representations, closer to the black dot: over time it improves its representations as we encounter bad representations and as the model learns.
And something that is still astonishing to me, and was astonishing to me back when I found it, is that bootstrapping these representations somehow works. This was actually one of the first experiments I did when I was working on this project, and it really motivated me to look into this more deeply and see why it works so well. So the setup: we're working with MNIST, a dataset of handwritten digits, the digits 0 through 9, and you're trying to classify the digit. In this setting we were not aiming to classify all the digits; we were just trying to classify, I think, the digits 3, 6, and 9. We started with a simple convolutional neural network, randomly initialized, very straightforward, and then we trained this network to make these predictions. We tried two ways of training it. The first way was to take the data from the MNIST training set IID, the normal way; that's the dashed line. The second approach was to take the data from this dataset in a non-IID fashion, to maximize its information relative to the predictions that we're trying to make, which is what I discussed earlier; that's the solid black line. And you see, it's quite obvious, it does quite a lot better.

But it seems odd: why does it do better? We don't have some magical information that tells us that certain data is more relevant, more informative, than other data, because we started with a randomly initialized model. Initially, the first data that this non-IID method selects should be roughly random, because there is almost no information encoded in the model. Interestingly, while the first piece of data we select is basically random, we then train the model on that first piece of data, and you can think of it like this: the model makes better predictions, but also its representations get slightly better. They're not perfect after seeing one data point, but they get slightly better. So the next time we select data, we select slightly non-random data, data that is slightly more informative than random data. And that repeats: we update the model on it, and we get slightly better representations still. I think that is maybe the main reason why we see this big difference between these two curves. That really encourages me to say: maybe we can scale this up, maybe we can end up with truly open-ended systems that learn representations over time, improve their representations to make better and smarter decisions, to learn more at test time, and to end up with better predictions. We did scale this up: we also did this with CIFAR-100, which is just more complex images and a larger data space. There we didn't start from random; we started from a model pre-trained on ImageNet but with a random head, and we saw the same effect. So I wanted to highlight this: I think a key aspect of making really open-ended systems work is to not train them once and then maybe do this local adaptation on top, but to actually let the learning mechanism itself learn over time.
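As a rough illustration of this bootstrapping loop, here is a self-contained toy sketch. Everything in it is my own simplification, not the talk's exact setup: sklearn's small digits dataset instead of MNIST, a tiny MLP instead of a CNN, and the same posterior-variance criterion from the earlier SIFT sketch used as the "informativeness" measure, computed in the network's current feature space.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
X = torch.tensor(X, dtype=torch.float32) / 16.0
y = torch.tensor(y)

target_idx = torch.nonzero(torch.isin(y, torch.tensor([3, 6, 9]))).flatten()[:50]
pool_mask = torch.ones(len(X), dtype=torch.bool)
pool_mask[target_idx] = False
X_test, y_test = X[target_idx], y[target_idx]     # the predictions we care about
X_pool, y_pool = X[pool_mask], y[pool_mask]       # the memory we select from

net = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))
features = lambda x: net[1](net[0](x))            # penultimate-layer features
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def uncertainty(target_f, sel_f, noise=1.0):
    """Average posterior variance at the target points under a linear model
    in the current feature space, after conditioning on the selected data."""
    prior = (target_f * target_f).sum(dim=1)
    if sel_f.shape[0] == 0:
        return prior.mean().item()
    K = sel_f @ sel_f.T + noise * torch.eye(len(sel_f))
    k = sel_f @ target_f.T
    return (prior - (k * torch.linalg.solve(K, k)).sum(dim=0)).mean().item()

selected = []
for rnd in range(20):
    with torch.no_grad():
        pf, tf = features(X_pool), features(X_test)
        sf = pf[selected] if selected else torch.empty(0, 32)
        scores = [uncertainty(tf, torch.cat([sf, pf[i:i + 1]])) for i in range(len(X_pool))]
        for j in selected:
            scores[j] = float("inf")              # don't re-select the same point
        selected.append(int(np.argmin(scores)))
    for _ in range(30):                           # train on everything selected so far;
        loss = nn.functional.cross_entropy(net(X_pool[selected]), y_pool[selected])
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                         # better weights -> better features
        acc = (net(X_test).argmax(1) == y_test).float().mean().item()
    print(f"round {rnd:2d}  selected={len(selected):2d}  target acc={acc:.2f}")
```

Swapping the argmin for a uniformly random draw from the pool gives the IID baseline that the dashed line in the talk corresponds to.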
Cool, so I want to wrap up, and I'm really looking forward to interacting with you afterwards. What I talked about today: I was trying to contrast these local models, which solve one problem at a time, with inductive models, which is what most state-of-the-art models currently are, and which try to solve all problems at once, at least all problems that you can specify in a certain domain. But if the domain is human language, you can specify a whole lot of problems in human language, and trying to solve all of these at once is bloody hard. It seems interesting to me, it's an interesting hypothesis, that local learning allows allocating compute to the places where it is interesting; as a compute allocation mechanism, I think it's quite a nice thing to think about. Cool, I think that's everything; happy to talk to you afterwards. [Applause]

Thanks. So, before food, we have some time for questions.

[Audience] With CIFAR-100, what was your split in terms of all the classes?

I think it's 10%. So CIFAR-100 is all 100 classes; we looked at 10 classes, so we tried to learn just 10 classes out of the 100.

[Audience] Did you try different splits? The big idea here is that all the data from the other classes is basically irrelevant.

Yeah, we didn't try other splits, actually, but I think you would probably see a similar effect. Of course, if you get closer to all 100 classes, you see much less of a boost; and if the information that you're trying to extract from the memory is more sparse, then you're going to do much better than random, because random is just going to take a lot of pieces of information that are actually irrelevant to what you're trying to predict.

[Audience] Have you tried using this in models like diffusion models, for image generation? What happens there?

Very good question, and we are actually working on something like that. That is something I also find very intriguing: this general framework has historically been around for a long time, but recently it entered people's minds through in-context learning. So now when I talk to colleagues about this project for the first time, they all ask: are you doing in-context learning? Because in the current mindscape, this local adaptation is very much linked to in-context learning; that's the way people have recently seen this local adaptation approach work with large models, where this emergent behavior comes up and models can learn from what you put into their context. But it seems interesting that with models that are not autoregressive, that don't have a context window, like a diffusion model, you could probably do something similar, and certainly not through in-context learning, because they don't have a context. So yeah, it's definitely a very good observation.

[Audience] Very early on you had the comparison between different model sizes and their perplexity. What's the chunk size that you used for the local learning there, the split or the stride? How local is the local learning?

Ah. Correct me if I misunderstood your question, but you're asking how local this is, what the size of this local neighborhood around the test instance is, in which we try to learn the data manifold? That's a very good question, and actually a very intricate one. Essentially, as a proof of concept, we are just taking a fixed number of neighbors for every
data point, so we're just taking 50 neighbors and then doing a single gradient step on each of these. But in practice you wouldn't want to do that. In practice you would want to train on a different amount depending on the question: if you get a prompt to a chatbot that just says "hello", you don't want to do any local learning; if you get a very complicated question but you don't have any data in your memory that can help you with it, you don't want to train either; but if you actually can improve your response, then you do want to train. So in practice you would want to do this adaptively, depending on your memory, your model, and the problem that you're faced with.

This also adapts to what I read out of your question a little bit: you may have some parts of your data manifold where your memory is very sparse and some parts where your memory is very dense. If you do this with a fixed amount of compute for everything, you will cover much larger neighborhoods of the data manifold where the memory is sparse and much smaller neighborhoods where your memory is dense. What we actually looked at, which I didn't talk about at all, was doing this adaptively. You can think of these uncertainty estimates as being useful for telling you: here we actually have information in our memory, here we don't. Even this irreducible term may be useful to tell us that for this response we're able to make a better prediction than for that other response. So what we were also looking into is whether we can reduce compute cost by terminating local learning early for those points where, let's say, your memory on the data manifold is sparse, or in other cases where you don't learn so much.
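Nothing like this was shown in the talk, but the adaptive allocation described here could look roughly like the following hypothetical stopping rule, where `estimate_sigma` is a placeholder for re-estimating the uncertainty after each test-time step and `sigma_floor` stands for the irreducible term:

```python
from typing import Callable

def adaptive_test_time_steps(estimate_sigma: Callable[[int], float],
                             sigma_floor: float,
                             max_steps: int = 50,
                             tol: float = 0.05) -> int:
    """Hypothetical rule: keep spending test-time compute only while the
    estimated uncertainty stays meaningfully above its irreducible floor.
    A 'hello' prompt, or a prompt the memory cannot help with, gets ~0 steps."""
    for n in range(max_steps):
        if estimate_sigma(n) - sigma_floor <= tol:
            return n          # stop early: more data/steps won't help much
    return max_steps
```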
[Audience] How does this scale to much larger models, much larger sizes? The biggest you showed was in the single-digit billions.

Yeah, it's really tough in an academic setting to go to multi-digit billions. I think right now the single-digit billions are quite good state-of-the-art models; actually, the best open-source state-of-the-art models are in the single-digit billions. There are these Llama 3.2 models, which just recently came out and are quite good, and so on. Of course, there is always some hypothesizing about whether this also works with larger models. I am actually quite stoked by the fact that it worked at the scale of GPT-2 and it worked pretty similarly at the scale of Phi-3, and Phi-3 is not a bad model; Phi-3 is really state-of-the-art, very close to a very good model. On this one plot I also put Gemma 2 27B, which we didn't fine-tune, but we evaluated the zero-shot version of it, and it was also doing much worse than the fine-tuned Phi-3. It seems like if you go to a larger model, clearly your error goes down, but it goes down at a certain speed, and what this local learning does is really open up a new way of spending compute that is orthogonal to the way of spending compute at train time. Of course it's hard to say, we didn't do the experiment yet, whether it would also improve the performance of, let's say, GPT-4o, but I'm relatively positive that to some degree it will.

[Audience] When you were explaining how the representations become more general, you mentioned that initially we start with selecting random points and then these points slowly become less random. I was thinking about balancing exploration and exploitation through the lens of mutual information. If that's a semi-accurate description of what you're doing, did you think about parallels between your work and similar work done in reinforcement learning, or could you draw ideas from there? Because it's a bit similar, philosophically at least.

I would say yes and no. In some way these pure search methods have also been entertained in the context of reinforcement learning, where you don't have a reward signal and you just try to, let's say, learn the dynamics, learn the environment. But usually then you don't have this exploration-exploitation trade-off, because that trade-off arises from the necessity to balance exploring and then exploiting, doubling down on what seems promising. So I think it doesn't really arise in the same way, mathematically, in the representations context. I do think that on some level there is a kind of exploration-exploitation trade-off there, so I would agree with you on that, and I don't think what we are currently doing accounts for it adequately; there's definitely an opportunity to look into that more.

[Audience] You separated this mutual information into two terms, where one was capturing redundancy and the other was capturing relevance. I thought non-redundancy could capture exploration and relevance could capture exploitation somehow.

Yeah, I think that is an interesting interpretation. I don't think people in the literature talk about it that way, but philosophically one can think about it that way: relevance being somehow exploitative, finding information that exploits, and this diversity term, this non-redundancy term, trying to make you explore. I think fundamentally, mathematically, deep down these things are still a bit different. But it is, on its own, an interesting dynamic that if you want informativeness in the data you select, you need both relevance and diversity; that's definitely very interesting.

[Audience] I have a question on the open-endedness part. In the last experiments that you did, on CIFAR-100, you said that you only trained on a subset of the classes. For open-endedness, did you also try, once the model had learned the classes you were targeting, to expand to the other classes as well? And regarding the LLM part, could you also expand the memory while you train?

Very good questions. For the first question: we didn't try that, and I think it would be interesting. Generally, I think about this local learning as happening in two stages: there's an inner loop and there's an outer loop, similar to the idea of fast and slow weights.
So you have fast weights that are changing at test time, and then you have slow weights that are changing in the outer loop, and you reset the fast weights whenever you're faced with a new task. In that way, I would probably try to just reset the model and do this local learning before you then focus on different classes, instead of leaving off where you had focused in on the earlier tasks. Philosophically, I like to think about it like this: you are working hard on solving this one problem, you're hyper-focusing your brain on that, and then you take a break, you try to delete all of that and have new space to put new stuff in. Of course there's some mixture of those things going on, but that's why I like to think about it at these two levels, an inner loop and an outer loop. And what was the second question again?

[Audience] Similarly, whether you could somehow increase the memory.

Yeah, exactly. We didn't try that either. What we did try was doing it with a smaller memory; of course we couldn't increase our memory, because it was bounded by the dataset, the Pile. And we didn't do this dynamically, we didn't adjust it dynamically for every question, but I don't think that really makes an impact, because essentially you can think of every test instance we're testing here as coming IID from this Pile distribution. And what we see here — note that I didn't plot the absolute performance improvement over the base model; what I plotted is the relative performance improvement over nearest-neighbor selection, so 100% is doing exactly the same as nearest-neighbor selection, and anything below that means you're doing better than nearest-neighbor selection — what seems interesting is that as you get a larger and larger memory, this informativeness becomes more important. I didn't plot it here, but you see the same thing in absolute numbers: of course, if you have a much larger memory, you will essentially be able to do better. And I think what is really interesting here is that this SIFT approach, and also these representations, allow us to search a really large memory very efficiently; essentially you can parallelize this to no end. People have done this at internet scale: you can really scale this memory up to include almost every artifact that humans have ever produced, and it would still run fairly quickly.

Yeah, thank you.