Ollama LLaVA 7b performance

Started by monochrome, February 05, 2025, 11:30:51 AM

Previous topic - Next topic

monochrome

Downloaded and installed it, and got up at 5 am to play with the new AutoTagger.

Anyone have performance data for AutoTagger with Ollama + LLaVA 7b?

I get about 30 seconds per image because I have to run it in CPU mode, and I'm considering upgrading my graphics card to something Ollama can use. For daily use, 30 seconds per image is OK - it's only for massive batch auto-tagging that I'd like GPU acceleration, but then I would like to see the time drop to something like 0.1s/image before I drop €500 on a new GPU.

Mario

#1
That's to be expected. If Ollama (or any of the other AI runners) has to fall back to the CPU and normal RAM, things will be a lot slower than running on a GPU.

On my 4060 TI (16 GB VRAM, about 400$-ish), Ollama takes between 1 and 5 seconds per image, depending on how complex the prompt is. Asking only for keywords takes maybe one or two seconds per image.

The 4060 mobile version I have in my notebook is maybe 25% slower, which is quite impressive and perfectly usable.
It "only" has 8 GB VRAM, which is too little for running LLaVA 13B or the Llama Vision model entirely in VRAM. They will run, but parts of the model get swapped to RAM and CPU, and this slows things down. The 7B runs fine, though.

But you have to put that in relation to the response times you get from OpenAI, Mistral, and other cloud-based AIs.
Depending on how busy they are and on your prompt, response times from these servers also vary (in Germany) between 2 and 5 seconds. Sometimes even longer if they are swamped.
They may have 100,000 H100 GPUs in their data centers, but they also have millions of users.

Or compare it with how long it would take you to enter keywords and descriptions for a file by hand. Maybe you're faster than 5 seconds, I don't know. If you do some pre-sorting and run AutoTagger once for a whole bunch of images, you can get a lot done in a short time.


Quote: to something like 0.1s/image before
Ah, that would be nice :)
But, today, you won't get that response time even when running two NVIDIA 4090s in parallel, each costing you roughly 2,000US$.

A 30,000US$ H200 plus periphery would probably give you that response time. :o ;D

monochrome

Quote: A 30,000US$ H200 plus periphery would probably give you that response time
You know what? I think AWS might rent me one for an hour for less than that. Since IMatch just needs an endpoint, I could use the CPU (slow but cheap) for daily use, and when I feel like tagging a lot, I can rent something big. I have no idea how this works out cost-wise in the end, but I'm going to try it.
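For what it's worth, the "IMatch just needs an endpoint" idea boils down to pointing a client at whatever machine runs Ollama. A minimal sketch against Ollama's /api/generate endpoint; the host name, image file, and model tag are placeholders:

```python
# Minimal sketch: request keywords from a (possibly remote) Ollama server.
# "my-gpu-box" is a placeholder - localhost, a rented cloud instance, whatever.
import base64
import requests  # pip install requests

OLLAMA_URL = "http://my-gpu-box:11434/api/generate"

with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

response = requests.post(OLLAMA_URL, json={
    "model": "llava:7b",
    "prompt": "Return five keywords describing this image.",
    "images": [image_b64],
    "stream": False,
}, timeout=300)
response.raise_for_status()
print(response.json()["response"])
```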

Mario

You could rent an H200 GPU with plenty of VRAM at AWS and run Ollama on it, even with large models that have not been quantized (quantization makes them smaller). That is, if you know how all this works, understand the billing and all that. I surely don't.

Or you use OpenAI or Mistral AI, rent some time on the hardware in their data centers, and get access to their massively large models.

It's really affordable and takes a few minutes to set up. You pay only for what you use, and you can disable auto-renew and thus have full cost control. If the money you have deposited is used up, they stop processing requests. I did a lot of testing over the past months, processing thousands of images, and I still have €4.20 left from my initial €10 payment.

Note: OpenAI imposes an initial hard rate limit (3 requests per minute); see my comments in the AI services help topic.
They need to make sure you don't abuse their services.

In my case, this was lifted after two days, and since then I can run 4, 6, or 8 requests in parallel. AutoTagger does this automatically, unless you are on the entry-level tier. This gets AutoTagger down to maybe one second per image or so.
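To illustrate why parallel requests shrink the effective per-image time - this is not AutoTagger's actual code, just the general pattern, shown with the OpenAI Python SDK and made-up file names:

```python
# Illustration only: several vision requests in flight at once.
import base64
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY in the environment

client = OpenAI()

def tag_image(path: str) -> str:
    """Send one image to gpt-4o-mini and return the generated keywords."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Return five keywords describing this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

paths = ["img_001.jpg", "img_002.jpg", "img_003.jpg", "img_004.jpg"]  # placeholders
with ThreadPoolExecutor(max_workers=4) as pool:  # 4 requests in parallel
    for path, keywords in zip(paths, pool.map(tag_image, paths)):
        print(path, "->", keywords)
```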

monochrome

> OpenAI 

It seems like they quote prices in "tokens". Do you know how many tokens are used up per image, on average, for the gpt-4o-mini model? I've seen anything from 33 to 33,000 tokens thrown about.

Mario

See the Some Notes about Pricing section in the IMatch help, with the example given under Pay per Token:
"A good average is about 75 tokens per 100 characters in English text."

And here is the link to OpenAI's token calculator: https://platform.openai.com/tokenizer
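If you prefer counting locally instead of pasting text into the web tokenizer, the tiktoken library can do it. Note this only covers the text part of a request; image inputs are billed by OpenAI with a separate per-image formula:

```python
# Count the *text* tokens of a prompt locally with tiktoken.
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by the gpt-4o model family

prompt = "Return five keywords and a one-sentence description for this image."
tokens = enc.encode(prompt)
print(len(tokens), "tokens for", len(prompt), "characters")
```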

monochrome

#6
FYI: I ran Ollama on Google Compute Engine, a standard N1 instance with an NVIDIA T4 and the "Deep Learning VM with CUDA 11.8 M126" image. Performance averaged 2.6 seconds/image for keywords, description, and place names. At $0.48/h that works out to $0.344 per 1000 images.

OpenAI with gpt-4o-mini uses about 3300 input tokens per image and 160 output tokens, which at $0.15/1M input tokens and $0.60/1M output tokens works out to about $0.603 per 1000 images.

So it's about 42% cheaper to run it on your own. Probably not worth the inconvenience, but on my 200k images it saves about $50 ($70 instead of $120).
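For transparency, here is the arithmetic behind those per-1000-image figures, re-run with the round numbers above. The OpenAI total comes out a few cents lower than $0.603 because the real per-image token counts vary around the 3300 average:

```python
# Re-running the cost arithmetic with the round numbers quoted above.

# Self-hosted: rented T4 instance
seconds_per_image = 2.6
usd_per_hour = 0.48
gce_per_1000 = 1000 * seconds_per_image / 3600 * usd_per_hour
print(f"GCE T4:      ${gce_per_1000:.3f} per 1000 images")    # ~$0.347

# OpenAI gpt-4o-mini, pay per token
input_tokens, output_tokens = 3300, 160
input_rate = 0.15 / 1_000_000    # $ per input token
output_rate = 0.60 / 1_000_000   # $ per output token
openai_per_1000 = 1000 * (input_tokens * input_rate + output_tokens * output_rate)
print(f"gpt-4o-mini: ${openai_per_1000:.3f} per 1000 images")  # ~$0.591

print(f"Self-hosting saves about {1 - gce_per_1000 / openai_per_1000:.0%}")  # ~41%
```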


Mario

3,330 input tokens seems to be a verrrry long prompt?
200,000 images for 120$ seems affordable to me? I doubt that many people in the IMatch user base can set up compute engines at Google or AWS. But whatever works for you is good!

monochrome

Quote from: Mario on February 05, 2025, 10:54:45 PM: 3,330 input tokens
I think OpenAI have changed how they compute "tokens". The number is taken straight from the "Activity" tab in the dashboard. I'm using default IMatch prompts and settings (just cranked up the rate limits to what I have as Tier 1).

David_H

Quote from: monochrome on February 05, 2025, 10:22:43 PM: FYI: I ran Ollama on Google Compute Engine, a standard N1 instance with an NVIDIA T4 and the "Deep Learning VM with CUDA 11.8 M126" image. Performance averaged 2.6 seconds/image for keywords, description, and place names. At $0.48/h that works out to $0.344 per 1000 images.

OpenAI with gpt-4o-mini uses about 3300 input tokens per image and 160 output tokens, which at $0.15/1M input tokens and $0.60/1M output tokens works out to about $0.603 per 1000 images.

So it's about 42% cheaper to run it on your own. Probably not worth the inconvenience, but on my 200k images it saves about $50 ($70 instead of $120).



Or put the money toward a better graphics card, if you can?

(Ollama 7b locally for me: 399 images in 10 minutes, or ~1.5s/image), and I've still got the graphics card at the end...

Mario

Quote from: monochrome on February 05, 2025, 11:12:07 PM
Quote from: Mario on February 05, 2025, 10:54:45 PM: 3,330 input tokens
I think OpenAI have changed how they compute "tokens". The number is taken straight from the "Activity" tab in the dashboard. I'm using default IMatch prompts and settings (just cranked up the rate limits to what I have as Tier 1).
What makes you think that?
As I wrote in another thread (to you?), they deliver the number of input and output tokens calculated for each response, and AutoTagger accumulates these and displays the token counts. If the information returned by OpenAI does not match how they actually count tokens, that's on them?!

monochrome


Mario

#12
On my 400$ NVIDIA 4060 TI*, Ollama takes less than one second per image for keywords only, with this prompt: [[-c-]] Return five keywords describing this image.

The first image takes several seconds, because Ollama must first load the model (the model is unloaded automatically after 10 minutes without use, to conserve GPU VRAM and energy).
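Side note for batch runs: Ollama's API lets you control how long a model stays resident via the keep_alive parameter, so the load cost is only paid once. A minimal sketch; the host and the 30-minute duration are just examples:

```python
# Preload LLaVA and keep it resident; an empty prompt just loads the model.
# keep_alive accepts durations like "30m", or -1 for "until Ollama stops".
import requests  # pip install requests

requests.post("http://localhost:11434/api/generate", json={
    "model": "llava:7b",
    "prompt": "",
    "keep_alive": "30m",
})
```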

On my Dell notebook with the built-in NVIDIA 4060 mobile GPU, it takes about 1.2 seconds per image.
I spent 100$ extra to have this GPU on board, which was a good decision. I can work comfortably with local AI on the notebook.

* I picked a TI with 16 GB VRAM, which allows me to load larger models. It only has a 128-bit memory bus, though.
From the 4060 TI up, things get a lot more expensive if you want more than 8 or 12 GB VRAM. 16 GB VRAM and a 256-bit memory bus start at about 850$ or so - that's gamer land.

4090 cards with 24 GB VRAM and super-fast processing cost 2,000$, and I don't need AI that much ;D