- [
“Thanks so much Matt. Just to note, I didn’t have any luck dragging the photo into the terminal when not using the local machine. I too have an M1 Mac with 64GB RAM and it worked perfectly. However, when trying on my LLM server (2 x 3090) it didn’t work - both running the same Ollama version (except Linux for the LLM server) and model.”,
“It works with my web UI”,
“The LLaVA LLM did that a long time ago. You should refer to it and make a comparison in the future.”,
“ - But llama3.2 is just so much better”,
“Beautiful”,
“This is an awesome tutorial Matt! You are definitely instilling solid foundations into these videos and I love it! Keep up the amazing work! Have a great week!”,
“The 11B model runs fine inside of Open WebUI on a 12GB RTX 3060. I will be exploring ancient symbols to modern art in an open discussion format, thanks Matt.”,
“That means the Ollama nodes in ComfyUI will be fantastic to use with this vision model. Downloading it now to play with it.”,
“I tested it successfully converting a GNU Radio dataflow diagram into a list of functional blocks and their interconnections. Maybe this can help migrating away from LabVIEW.”,
“LOL should I change my name to MattNLP?? LOL”,
“Personally, I don’t think it got the meme photo right at all. It looked as though it assumed/hallucinated that people were taking an aggressive approach to flag planting, versus the corporate one that showed one worker digging one deep hole, which would be appropriate for flag planting. I think it’s obvious to a human that it’s a joke saying that startups generate more output due to more contribution, versus corporations that have a lot of administrative overhead that does not add to the output.”,
“ - I think that comment means that you haven’t worked at many startups.”,
“Matt Williams > Berman > Wolfe
Those other guys… I swear they have no idea what they’re talking about. 😅”,
“I am going to try it here soon, but I’ve been trying to make a Tesseract OCR script and been failing due to borders that are too close to letters and slight variations in where the text falls. I can give it to ChatGPT and it can read it perfectly, but mine is too inconsistent to actually use! What would you suggest, or how would you personally build a Tesseract OCR office program that can read and detect invoices, their contents, and company names, given the regular structure of invoices from a consistent group of different companies?”,
“ - I haven’t worked with Tesseract or any other open source OCR. But I have heard lots of folks have issues with it.”,
“ - @technovangelist yeah, it’s a bugger to make reliable. Printer paper quality and ink and any wrinkles or slight misplacement throw the boundaries off and can catch half of stuff, causing weird symbols. Any suggestions?”,
“I just downloaded and tried this model myself (running Open WebUI in Dockge with Ollama in a separate LXC container on Proxmox with a 20GB Nvidia RTX 4000 Ada passed through). I was blown away by the accuracy of the pictures being recognized! Even the numbers shown on my electricity meter’s display were identified correctly. I am really impressed - especially about the correct guess of my son’s age on some pictures, where the model suggested him being 10-12 years old … with him being indeed 11 years. WOW!”,
“I love to feed an image to an LLM and ask it to create a text-to-image prompt from what it sees.”,
“Thank you, Matt. You’re definitely the most dashing and handsome Matt.
To give you an n of 1 for researching your audience, here’s what I’m looking for when I view AI channels on YouTube: 1. Information that I can employ immediately or within a very short time frame, preferably with no cash outlay. 2. Something I can employ locally on extremely low budget hardware. I scraped enough money together for a used i7 desktop with 32 gigs of RAM and a 2 gig NVIDIA card (not useful). With this, the non-vision Llama 3.2 on Ollama works in Docker on Ubuntu. An M1 Mac is a pipe dream for me, so information on hardware like that or better is not applicable to me in particular. 3. The very latest information, and that’s why I look for Matt Berman’s new releases each day. 4. Information that is accessible to a non-programmer, meaning information not in programmer jargon-y language. 5. Information that’s going to improve my life or help boost my income (but no "Make $5,000 per day with AI" bull$#|^). 6. Something new and interesting, something I haven’t heard of or thought of. (End of list) To be honest, I don’t really like to sit through tests of the models, so I skip to the recap at the end of test videos. I would much rather have brief bullet points of what a model can do compared to others and then get right to using it. Also, please consider a summary video short for each video you do. I’d love those and would watch both. What really draws me is something novel that a model can do. I deeply appreciate that you’re here to educate us. Also, thanks for your role in creating Ollama. That’s power to the people, especially cash-strapped people like me.”,
“ - Good luck on finding that ideal channel.”,
“Enjoyed this. Sub’d a while back and will put the bell on, so long and until I hear you call Llama open source”,
“Why, when I try to make a website in Ollama, can’t I get it to make me an entire PHP page, but ChatGPT and Claude can?”,
“We want to use this model to control a robot arm. Can you guide us?”,
“Matt Williams is the best of the Matts! 😆”,
“I want to check one vehicle from one image and tell if that same vehicle showed up in another video stream, and when it was seen last.
(Same tech for this case) I want to check where my TV remote was seen last.”,
“I want to check whether shoplifting is happening or not. This needs fast comparison of many images per second from the video. Will this work? Is there an easier way to infer shoplifting?”,
“I have a real estate use case that needs to identify problems with a house from a distance.”,
“The biggest issue using the 11B is getting info on where words or other shapes are found in the image file - in particular the x,y coordinate or, even better, the width and height of the word block. So far an OCR/OpenCV solution gives all this info in a second. Describing the image as a whole seems to be where the vision model works best, with or without words found.”,
“Thanks Matt for another great video! I would like to be able to use llama3.2 vision with Open Interpreter instead of ChatGPT to control my computer”,
“I would love to try Ollama and Llama3.2:11B-Vision-Instruct and 90B too, but the Ollama code won’t compile properly on my Nvidia Orin AGX 64GB - it won’t detect the CUDA device during the compile and installation process. If anyone has a fix I would love to know.”,
“¡Gracias!”,
“ - That is so nice. Thank you so much!!”,
“I think your shirt dropped acid.”,
“ - It’s a bright one. The other ones I would wear outside in the real world, but this one is … special”,
“Love the shirt Matt!
It’s fun 😊”,
“ - Its boldness almost reminds me of the shirt Theo asked his sister (played by Lisa Bonet) to make for him on the Cosby Show decades ago. Though that shirt ended up being a disaster.”,
“ - Huh?”,
“My dude, the sweeping hand gestures are getting out of control.”,
“ - If you don’t like how I naturally talk, it was nice of you to stop by, and I will miss you, but I won’t change who I am for YouTube.”,
“ - @ jesus dude 😂”,
“ - I often misread comments….”,
“he doesn’t even answer the comments”,
“ - One of the other Matts? I answer almost all of the questions that are questions. Which Matt doesn’t?”,
“😀👍THANK YOU MATT.”,
“I tried using it to tag some photos to make them easy to search. It didn’t seem to do very well at that task.”,
“I have a question. I have an M1 MacBook with 16GB of RAM… is even the smallest model too heavy for the machine, or am I doing something wrong?”,
“ - I have a similar setup. I can run 7-8B models but at quite slow speed, so I think the 11B vision model would be too much. I suggest trying the Groq playground if it’s only for testing.”,
“ - It’s not hard to try, but I would imagine it will be difficult”,
“Tried llama3.2-vision:90b with Open WebUI on my Mac Studio M2 Ultra 128GB - unfortunately it doesn’t work ("Error fetching website content: 404 Client Error: Not Found for url: https://www.example.com/image.jpg"). Plus, even if it had worked: 1) the ollama logs say "multimodal models don’t support parallel requests yet", hence it would process only 1 image at a time, and 2) the moment I tried from the CLI it processed the image but replied in a totally censored manner. 1 + 2 = K.O. for my current use case (automated reporting on VM evidence screenshots). It’s a pity though, because the very same use case works flawlessly on my OpenAI CustomGPT. Any other open source model I could possibly use (vision + OCR + web search)?”,
“ - You have to use a valid URL. And with OpenAI you have to be OK with the security and privacy issues”,
“ - @technovangelist yeah… that is exactly the whole point: I’d like to port my CustomGPTs away from OpenAI, possibly onto my local lab. I’ll do a few more tests using different frontends… thx for the yet-another-excellent-video btw….”,
“With my Mac with 16GB RAM it is impossible to use this model 😞”,
“I would like a vision model to be able to assist in grading handwritten exam questions. (This is a thing in my country, not sure if it is continued in the West.) A good assistance function would be the capability to "average" handwritten answers to a particular knowledge or essay question. So not just OCR’ing answers, but comparing all the answers in some form against the perfect answer perhaps, and seeing which sub-element of the answer was right. A question related to social sciences such as literature would be particularly difficult, but useful.”,
“ - @samibilal It is still a thing in the U.S. as well. This is a fantastic idea. I bet a fine-tuned model for grading could do it.”,
“What do I try to do? Very simple: describing/tagging/grouping my own pictures without uploading them to a cloud like Google. Basically a Raspberry Pi would be enough, as it doesn’t matter how long it takes until it has gone through all the pictures…”,
“ - Nice”,
“Good video. Thanks Matt. By the way, you are my favorite Matt, but don’t let the others know. You’re the one with the best videos and the best taste in shirts.
Now seriously, your content is truly technical and educational.
Thank you.”,
“@0:45 you forgot Matt from MattVidPro
(I am also called Mat, but I don’t regularly produce AI content)”,
“Top notch production as always! 🎉”,
“Works great for me”,
“Sometimes it repeats the same output again and again (11B). I set the penalty but it didn’t work. Do you have any other suggestion?”,
“ - Did you update the modelfile? Use model weights from HF? Those are behaviors I expect from a model with a bad prompt. How much VRAM do you have?”,
“Hi Matt, I’d like to use a local RAG to find pictures of my kids and other people in my personal photo library. I previously spent a lot of time identifying faces and labeling them in digiKam, so the bounding boxes and face names are stored in the Exif metadata. Unfortunately Ollama still completely ignores the image metadata and generates an anonymous description of my photos… I would then need to use an external library to load this data, but I don’t know how I’d tell who is where in the picture.”,
“@Matt, loved your explanation and the tests you conducted… do you think that Ollama in the future would try MLX? If so, do you think it would increase performance?”,
“ - I don’t know. LM Studio added it and it is now marginally faster on limited models. May be a lot of work for not a lot of benefit. But it’s hard to know what the team is going to do. I have been gone longer than I was part of the team after we pivoted to building Ollama.”,
“Great shirts too, not sure llama3.2 can handle those lol.”,
“Is there reason to expect that a Mac with more than 64GB of RAM would be able to perform better with the larger parameter models? I’m thinking of upgrading next year.”,
“ - Faster? No, inference speed is mainly determined by GPU speed.
Higher quality? Possibly. More RAM allows you to choose larger models or less quantized versions.”,
“ - I think so. The M4 is the first one that has performance significantly better than the M1, so I would love to try it.”,
“ - I don’t think I agree with the higher quality statement. If that were true then Llama 3.1 70B would always be better than 7B or 8B, and that is often not the case. But since there is a big gain in performance on the M4 it should be better. The M1 maxes out at 64GB, so more implies getting a different system.”,
“ - @simonosterloh1800 GPU speed, OK. How about the number of GPU cores? I have an M1 Studio Ultra with 64GB. Presumably the M4 Ultra will have more GPU cores”,
“ - Tests I have seen have shown that the M4 Max is nearly twice as fast as an M1 Max with AI models in Ollama.”,
“Thanks for the update. I have a question on the best way to run local AI models (with low resource settings). Is the new Mac Mini base model good for these? Mx Pro/Max are expensive machines. Will a PC with xx90s do? Pointers will be helpful, with some sort of budgeting. Much confusion with Apple MLX / Ollama etc.”,
“ - So a PC with a good Nvidia card will go faster but will be more expensive than the comparable Mac AND use a lot more power. A Mac will be a good machine for you for 10-15 years. The shortest life span for any Mac I have owned was 8 years. I would recommend getting at least 32GB RAM on the Mac and at least a 1TB disk.”,
“I tried Llama 3.2 90B vision on Groq (API provider) and 99% of the time it refused to describe the image for "security purposes" or such, and the remaining 1% it got it completely wrong.”,
“ - I don’t see why folks are excited about Groq.
It’s fast but fails to work on so many things.”,
“ - @technovangelist I don’t understand what you mean; it’s just a fast API for Llama models and a few others. Though I did see some difference from other hosts in following instructions.”,
“ - What I mean is that the fail rate for Groq is pretty high. I hit limits on it all the time. I tend to not waste my time with it.”,
“ - @technovangelist The base models have big limits, but the instruct or other modified models have higher limits. Though Groq made those limits smaller, from 20k to 8k for 70B Versatile. But honestly I hardly ever hit even such a limit, and I’m slowly switching to SambaNova (mostly for 405B and speed)”,
“ - @technovangelist Same for me; I’d rather pay some for an actual working API, there are good and very well working ones out there. My environment cannot handle something like 70B/90B so…

Thanks for this video, always like your style and demos, subbed! Cheers from AT 🙋‍♂️”,
“Do you drink at the end of the videos to draw attention to the importance of hydration? You are absolutely right, I support your mission :) Cheers!”,
“ - I did it in a few videos at the beginning and just kept doing it. Some folks really like it and comment when I leave it out.”,
“You just need 128GB of memory instead :) you’re welcome. Joking aside, an M4 Max fully loaded might get better results. Once people get their hands on them, I’m sure we will see people try”,
“I look forward to trying this. I’ve been getting great results with LLaVA already.”,
“Is there a local model that you consider "good" for vision?”,
“ - Good? This one. It seemed to work well for most things”,
“ - @technovangelist thanks for answering”,
“I’d love to see if I can give it an image of a floor map along with icons of access points spread across it, give it a scale of pixels per meter, and see if it can say what the density of those APs is per map (not sure if anything can do that)”,
“I wanted to see what you think about this technically. I was doing a proof of concept around a RAG system and research documents that had a lot of charts, images, and tables. I decided to convert the PDF pages to images and then ask the LLM to pull all the text out of them and to summarize the details about any chart or graph, adding that to the output.
This seemed to do better than the LangChain API we had been using to extract the data. Then, for the second part, I asked the LLM to return JSON representing the chunking of the data, breaking it up not by word count but by meaning, to see if that would do better than chunking by size with overlap.
Anyway, it went well on this small POC; just curious about your thoughts on this type of process. I know pricing is higher for vision, but this is just a couple of hundred documents without a ton of changes over time.”,
“ - I would save as images and use a good traditional OCR tool for most of it and use the model to try to interpret the charts. But there are often many ways to interpret a chart so that may be a challenge.”,
“Thank you for the video. I have the same MacBook and I will try out that 90B parameter model. Like you said, it might just be too large to run on it considering you have to run the OS too”,
“Sir,
Your video was awesome and very informative. I need a suggestion from you, sir.
I have tested Ollama’s new release with the LLaMA 3.2 Vision 11B model, but it’s not working on my GPU. I tested some other models between 11B and 16B on the same device, and every model except LLaMA 3.2 Vision utilizes the GPU.
However, the LLaMA Vision model is only running on the CPU.
Could you suggest any way to run this model on the GPU, like the others?

I’m using an NVIDIA 3050 with 4GB VRAM, updated drivers, and the latest version of Kali Linux.”,
“ - That’s an easy one to solve. You need a better GPU with more memory.”,
“Very nice video. I was waiting for Llama Vision on Ollama since the release of the models; I would love to see support for Pixtral as well”,
“So my use case is professional: we have a bunch of procedures that we try to use with RAG to answer questions.
The problem is there are many screenshots, so the test is finding the right prompt to ask the vision model to create a text description of what is in the image”,
“Thanks Matt for your clear explanations. It’s hard to separate the wheat from the chaff on YouTube, but with your help, that’s what I manage to do.”,
“Thanks Matt for the great video! I have also tried the vision model on Ollama and got results close to yours. Can you clarify the context size? You said previously all Ollama models are capped at 2k tokens except embedding models, which are capped at 8k. What about this vision model? I saw in the model file its context size is 128k. Is it capped at 2k or does it use 128k? Thanks in advance for replying!”,
“ - Doh! I just made that video and forgot… so yeah, it will be 2k until I change it. It’s not capped, because it’s easy to change.”,
“I tested it yesterday and unfortunately the model doesn’t support tools 😞”,
“I tried it this morning and was quite surprised (the smaller version). It was a cartoon caricature of a blind man on a galloping horse (a saying my father-in-law uses). The description it gave was surprisingly good. It didn’t get some of the details and misread the blind man’s facial expression, i.e. it saw fear as having a great time.

Your videos are extremely helpful. I appreciate the effort.”,
“You need to get Molmo running in Ollama; it is severely underrated and probably one of the best models for automation tasks.”,
“ - @IvarDaigon agreed, that pointing feature is OP”,
“Cool, but unfortunately the model doesn’t load for me. I got this error:

Error: llama runner process has terminated: GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2) failed

my ollama version:

pacman -Q | grep ollama
ollama 0.4.0-1.1”,
“Is there currently any vision-capable Ollama model suitable for ordinary domestic PCs?”,
“ - I just showed one. As long as it has a decent GPU you are set. Or an Apple Silicon Mac is perfect too. I would say both are very ordinary machines these days”,
“Sorry for being so off-topic, but where did you get that shirt? It’s freakin epic”,
“ - Link in the description. Amazon”,
“As far as I can see the model works really well in Open WebUI. It gives me quite accurate answers and I would like to thank the Ollama team for setting this up. Really cool!”,
“I’m confused, Matt, on how to use it correctly. The goal is to iteratively ask questions about the image. What are the best practices? The documentation implies submitting the path to the image. Other models need a base64 string. I cannot find working examples.
(Python library)”,
“I try to remember this stuff doesn’t cost us a cent in licensing and hosting and keeps on getting better”,
“Awesome. I wonder how this model would go with "computer use" locally - faster and much cheaper than Claude.”,
“ - From what I saw of Claude computer use, it specifies exact pixel locations, and I don’t think it’s so easy for a model to find the exact pixels to click some button”,
“Love your videos and course, which I am currently following. Keep it up. ❤️ from London.”,
“ - And as you can tell from my accent, I was born in London… Kingsbury to be exact. And no, most can’t tell. But it was nice for the 10 years I lived in Amsterdam to have an EU passport (pre Brexit)”,
“ - @ Well if you’re ever in London then let me buy you a warm beer.”,
“ - There will definitely be some visits. I want my daughter to meet more of her extended family”
]
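One commenter above asks how to pass an image to llama3.2 vision through the Python library and then ask follow-up questions about it. A minimal sketch, assuming the official `ollama` Python package, a running Ollama server with the model pulled, and a hypothetical local file `photo.jpg`: the `images` field on a message accepts a file path, raw bytes, or a base64 string, and iterative questioning works by resending the growing message history.

```python
# Sketch: asking iterative questions about one image with the ollama Python package.
# Assumes `pip install ollama`, `ollama pull llama3.2-vision`, and a local photo.jpg
# (hypothetical path).
import ollama

messages = [{
    "role": "user",
    "content": "Describe this image.",
    # images accepts file paths, raw bytes, or base64 strings; the library
    # encodes the data before sending the request.
    "images": ["photo.jpg"],
}]

first = ollama.chat(model="llama3.2-vision", messages=messages)
print(first["message"]["content"])

# Follow-up question: append the assistant reply and the new question so the
# model keeps the image and the earlier answer in context.
messages.append({"role": "assistant", "content": first["message"]["content"]})
messages.append({"role": "user", "content": "What text, if any, is visible in it?"})

second = ollama.chat(model="llama3.2-vision", messages=messages)
print(second["message"]["content"])
```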
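On the context-size exchange (2k by default until changed), one way to raise it for a single request, again assuming the `ollama` package, is the `options` argument; `num_ctx` is the standard Ollama option name and 8192 is just an illustrative value. The same setting can be made persistent in a custom modelfile with `PARAMETER num_ctx` if you prefer.

```python
# Sketch: overriding the default context window for one request.
# Assumes the ollama Python package and that llama3.2-vision is already pulled;
# screenshot.png is a hypothetical path.
import ollama

response = ollama.chat(
    model="llama3.2-vision",
    messages=[{
        "role": "user",
        "content": "Summarize everything you can read in this screenshot.",
        "images": ["screenshot.png"],
    }],
    options={"num_ctx": 8192},  # raise the context window from the 2k default
)
print(response["message"]["content"])
```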
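The RAG proof-of-concept comment (PDF pages rendered to images, then the model pulls out text and chart descriptions) could be run against a local vision model along similar lines. A sketch under these assumptions: the `pdf2image` package (which needs Poppler installed on the system), the `ollama` package, and a hypothetical `report.pdf`; the prompt wording is illustrative, not the commenter's.

```python
# Sketch: render each PDF page to an image and ask a local vision model to
# extract the text and summarize any charts. Assumes `pip install pdf2image ollama`
# plus a Poppler install, and a hypothetical input file report.pdf.
import io

import ollama
from pdf2image import convert_from_path

PROMPT = (
    "Extract all text on this page. For any chart, graph, or table, "
    "add a short summary of what it shows."
)

pages = convert_from_path("report.pdf", dpi=200)  # one PIL image per page

for number, page in enumerate(pages, start=1):
    # The ollama library accepts raw image bytes, so render the page to PNG in memory.
    buffer = io.BytesIO()
    page.save(buffer, format="PNG")

    response = ollama.chat(
        model="llama3.2-vision",
        messages=[{"role": "user", "content": PROMPT, "images": [buffer.getvalue()]}],
    )
    print(f"--- page {number} ---")
    print(response["message"]["content"])
```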
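Finally, one comment notes that a traditional OCR/OpenCV pipeline already returns the word positions (x, y, width, height) that the vision model does not give you. For reference, a minimal sketch of that side of the comparison, assuming the `pytesseract` package with the Tesseract binary installed and a hypothetical `page.png`:

```python
# Sketch: word-level bounding boxes from traditional OCR, i.e. the positional data
# the commenter says the vision model lacks. Assumes `pip install pytesseract pillow`
# and a Tesseract install; page.png is a hypothetical input image.
import pytesseract
from PIL import Image

data = pytesseract.image_to_data(
    Image.open("page.png"), output_type=pytesseract.Output.DICT
)

# image_to_data returns parallel lists; non-empty text entries are detected words.
for i, word in enumerate(data["text"]):
    if word.strip():
        x, y = data["left"][i], data["top"][i]
        w, h = data["width"][i], data["height"][i]
        print(f"{word!r}: x={x} y={y} w={w} h={h} conf={data['conf'][i]}")
```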