What if a computer could look at an image and understand exactly what it's seeing? That's the promise of using various vision models in a tool like Ollama, but there have been a few issues getting that to work, at least for the latest model offering vision. All of that is over as of a couple of days ago, when Ollama finally released version 0.4.0. It was a pre-release for a number of weeks, but it's finally here, and it works pretty great.

Have you ever watched any of Matt Berman's videos? I know it's sometimes hard to keep track of all the Matts doing AI videos: you've got Matt Wolfe, you've got Matt Berman, and you've got Matt Williams, and we're all so dashingly attractive. But I'm the Matt who was part of the original Ollama team, and now I'm focused on building out my YouTube channel to help you learn everything you need to know about local AI solutions.

Now, about Matt Berman. You might remember a video he released a few weeks back. I think it was done a little bit prematurely, but then again, Llama 3.2 had just been released, and Matt does a great job of posting things that are relevant to the moment. It was made on the first platform that started to really support Llama 3.2 Vision, and the video is pretty negative based on the answers he was getting, though it seemed to be more of a problem with the platform than with the model. Unfortunately, since there weren't any alternatives available, there wasn't really any rebuttal that could be posted, and folks generally came away thinking Llama 3.2 was a terrible vision model. It didn't help that llama.cpp, the runner that Ollama and so many other tools use, just didn't support Llama 3.2 Vision; in fact, it looks like they aren't all that interested in supporting it in the future either. That's completely opposite to how I'm thinking about this channel: I want to support everything and everybody. The best way you can help me do that is by liking the video and subscribing to the channel. If you've already done that, then share it with someone else. I am so excited about every new subscriber I get. Thanks so much for being here.

With the release of Ollama version 0.4.0, I think everything you knew about vision models changes, so in this video we're going to take another look at what the model can do. Now, it's not designed to do crazy things like read a QR code; there is much simpler code that can process those faster than a model would even be able to load on the fastest of machines. Plus, the QR code Matt used wasn't even a valid QR code; my phone didn't recognize it. That said, there is an amazing Veritasium video on reading QR codes where he builds one on a board for playing Go, not Go the programming language but the centuries-old board game. You should definitely watch that one after you get to the end of this one.

So let's take a look at what a vision model can do. I'm going to be using Llama 3.2 Vision at 90 billion parameters for at least part of this video, running on my M1 Max MacBook Pro with 64 gigs of RAM. For the video I wanted to show a more visual UI, but when I tried this model in Page Assist, Msty, and Open WebUI, all of them just sat there and never responded. I don't know if there's anything they need to do to update, but the model works fine at the CLI, which just uses the standard API, so that's where we'll go.
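To give you an idea of what "the standard API" means here, this is a minimal sketch of sending an image to a local Ollama server from Python. The image path and prompt are just placeholders for this example; it assumes Ollama 0.4.0 or later is running on the default port with the vision model pulled.

```python
import base64
import requests

# The Ollama API expects images as base64-encoded strings.
with open("example-photo.jpg", "rb") as f:  # placeholder image path
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Ask the vision model about the image via the local Ollama server.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2-vision",  # 11B tag; llama3.2-vision:90b for the big one
        "prompt": "Who is in this picture?",
        "images": [image_b64],
        "stream": False,
    },
    timeout=600,  # big vision models can take minutes on consumer hardware
)
print(resp.json()["response"])
```

If you'd rather not write any code, `ollama run llama3.2-vision` accepts an image by including its file path in the prompt, which is essentially what I'm doing at the CLI in the demo.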
Before I started recording, I showed the model a picture of Bill Gates and asked who it was. The model told me it couldn't be sure, but it thought it was Zuckerberg. I told it it was wrong and asked it to try again, and it said Bill Gates. I was kind of shocked that it even tried to guess, so I started writing the script, and then remembered I shouldn't actually write anything down until I have the recording of the demo. So I started recording, and I couldn't get the model to ever say that again. Instead, after three minutes, I got this, though most of the time it was faster to give me the bad news. There are two ways to look at this. First, you could say the model is censored. That may or may not be true, but I think the more accurate answer is that the model just doesn't know. The only thing these two people have in common in terms of looks is that they're white men; I wouldn't be surprised if I showed it a selfie and it said I was Zuckerberg too. So maybe it just doesn't know, and we're all interpreting the answer wrong.

But this picture isn't a total loss. We can still ask it to describe the image, and it took five minutes to come up with this. At first it seems okay, but then, um, he isn't wearing a suit. It's like the model forgot what was in the image and is trying to describe it anyway. I'm wondering if this model is just too big for my machine to handle along with the image, so I swapped it out for the 11B model, and in 30 seconds I got this. The description is shockingly good: not only is it describing the image well, but it identified the person as Bill Gates. So maybe the model isn't censored like we thought, and instead just answers in a weird way when it doesn't know. I asked again just to make sure.

Next I gave it a captcha, and it identified it correctly. So what are the letters in it? It got close; I can see how it could think that little squiggle was a J. Then I gave it the sketch of a website that Matt Berman used, and it describes how to build it, and at the end gives what looks like some convincing HTML. I'd say that's a win. Now let's look at the meme image that Matt used. I don't think this is much of a test, because it's pretty obvious what it means, and yeah, it got it. Here's another one from Matt: a screenshot of his phone. When did I last use Mail? April 29th, and that is correct. How much space does it take? 689 megs, perfect. And what apps take more? It gets that right too.

Then he showed a Where's Waldo image. One of the problems with finding something small in an image like this, which isn't actually all that big, is that it's hard to make out any details at such a low resolution. It suggested Waldo is wearing a different shirt, though that's not Waldo, and that he's near a suitcase, a suitcase I can't find. I clarified the shirt color detail, and it seemed to double down on him being near the suitcase. I wonder if this is the suitcase and the green shirt it was talking about. For reference, here's where Waldo actually is in this image.

I showed a picture of my pantry and asked if I had any popcorn. It didn't find any, but it's kind of hard to see that this is popcorn. So I asked about the appliances shown, and it got some of them right. How about the bourbon on the shelf? Yep, it got that right. Sadly, the Yamazaki is empty, and the French stuff is hidden behind the bag of oatmeal.

Next I wrote a couple of lines of text in a notebook and took a photo. I asked what I wrote, and the answer is kind of right, so I asked for the actual text, and that is, well, perfect. Often folks test a vision model by having it recognize a screenshot of regular text with a normal font, but traditional OCR can always do that faster and more accurately. Handwriting, on the other hand, is hard; ICR (intelligent character recognition) tries to do that, and the good stuff costs money, so a result like this is pretty great.
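For comparison, here's a minimal sketch of the traditional OCR route for the screenshot-of-text case. It assumes the Tesseract engine is installed along with the pytesseract and Pillow packages, and the file name is just a placeholder; the point is that for clean printed text this runs in a fraction of a second on a CPU.

```python
from PIL import Image   # Pillow, for loading the screenshot
import pytesseract      # thin wrapper around the Tesseract OCR engine

# Plain printed text: traditional OCR handles this quickly and accurately.
screenshot = Image.open("screenshot-of-text.png")  # placeholder file name
print(pytesseract.image_to_string(screenshot))
```

Handwriting is exactly where that approach falls apart, which is why the vision model's result is interesting despite the speed.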
The model is dog slow compared to what a good ICR solution could do even on low-end hardware, but this stuff is getting better and better, and in a few years it might be on par with more traditional solutions. So I asked it to do the same thing on some text with some intentional errors. Again it gives a weird summary, but when asked for the actual text it does a pretty good job; the mistakes definitely confused it. Here's another picture of the same text from a weird angle, and it still does a really good job. Now here's one more image, one I took of a moth on the back door of my house the other night. It's kind of scary-looking if you don't know the scale, but the model cleared things up, and I'm mostly good now.

So that's a look at what Llama 3.2 Vision can do. I was mostly using the 11-billion-parameter model because I think my MacBook with 64 gigs of unified memory just can't handle the bigger model. I think the big surprise is that the model seems to not be as censored as folks thought. It's definitely a big improvement over previous models, and I can't wait to see the next few models as they come out. I am super excited that Ollama was the first tool to come out that really supports this, too. The team did a lot of work to make it run, and that work should allow for a lot of other improvements in the future.

What do you think of this new vision model? What are some use cases you have in mind for it? I would love to learn more about them in the comments down below. Thanks so much for watching. Goodbye.