Gemma 3 12B on NVIDIA GeForce RTX 4070

Started by jch2103, March 23, 2025, 10:07:04 PM


jch2103

I've run into some bumps using my RTX 4070 (12 GB VRAM) with IMatch/Gemma 3 12B. It seems that Gemma 3 12B requires every bit of the available memory on the GPU. There are times when I try to run AutoTagger with 12B after the computer's been used for a while, and I get an error message (full log attached):
03.23 14:36:33+  282 [26F0C] 02  I> AIConnectorOpenAI: 1 HTTP Status Code: 404 'No models loaded. Please load a model in the developer page or use the `lms load` command.'
03.23 14:36:33+    0 [2684C] 01  W> AutoTagger: Aborting because of error
 'No models loaded. Please load a model in the developer page or use the `lms load` command.'  'V:\develop\IMatch5\src\IMEngine\IMEngineAIAutoTagger.cpp(369)'
03.23 14:36:33+  15 [29594] 01  W> UpdateQueue (AutoTagger): Service error 0 'No models loaded. Please load a model in the developer page or use the `lms load` command.' for file [143645]  'V:\develop\IMatch5\src\IMEngine\IMEngineUpdateQueueAutoTagger.cpp(233)'
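
The error message itself points at the fix (`lms load`). As a quick sanity check from outside IMatch, you can ask LM Studio's server what it has available. A minimal sketch, assuming the default server address http://localhost:1234 (adjust host/port if you changed them):

```python
# Ask LM Studio's OpenAI-compatible server which models it exposes.
# The default address http://localhost:1234 is an assumption; adjust
# host/port if you changed them in the Developer settings.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:1234/v1/models") as resp:
    data = json.load(resp)

models = data.get("data", [])
if not models:
    print("No models available - load one in LM Studio or via `lms load`.")
for model in models:
    print("Available:", model["id"])
```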

If I switch to Gemma 3 4B in LM Studio or to an OpenAI model, AutoTagger runs OK. I suspect the issue with 12B is lack of available GPU memory.

If I reboot the computer and immediately run IMatch, I can run AutoTagger with Gemma 3 12B w/o problems. If my issue is indeed lack of GPU memory, is there a simpler way to clear its memory than a reboot?

Gemma 3 12B does seem to return more useful/accurate information than 4B, although OpenAI has expected advantages for things like more obscure landmarks. 

John

bekesizl

I have a much less powerful GPU in my laptop, but the 12B was running fine on it after raising the timeout. 
I am using LM Studio for this.
I haven't really experimented much; I raised it straight to 120 s.
"Fine" means it eats up all the available resources, but at least it runs without problems.
I have an RTX 3050 with 6 GB VRAM and an AMD Ryzen 5 7640HS with 16 GB RAM.
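
For illustration only, this is roughly what that timeout means for any client talking to the local server; the endpoint and model name here are assumptions, not what IMatch actually does internally:

```python
# Sketch of a request with a generous read timeout; on a small GPU the
# response can take well over a minute. URL and model id are assumptions -
# match them to your LM Studio setup.
import json
import urllib.request

payload = json.dumps({
    "model": "gemma-3-12b",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
}).encode()

req = urllib.request.Request(
    "http://localhost:1234/v1/chat/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
)
# 120 s mirrors the timeout I set in IMatch.
with urllib.request.urlopen(req, timeout=120) as resp:
    reply = json.load(resp)
print(reply["choices"][0]["message"]["content"])
```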

Mario

12 GB is quite low for Gemma 3 12B. Windows needs about 1 GB or so for itself, which forces Ollama or LM Studio to swap the model between GPU VRAM and normal RAM. LM Studio seems to handle that better. Windows' GPU VRAM usage also increases over time.

Make sure to configure LM Studio as explained in the help: AutoStart, Developer Mode, Service on. It will then load Gemma 3 as soon as IMatch requests it. When there is not enough GPU VRAM, it will swap with regular RAM.

You can see the amount of free GPU memory in the Performance tab of Task Manager after clicking on your GPU. The amount may shrink depending on how much the Windows Desktop Window Manager needs. Web browsers also use a lot of GPU memory.
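
If you prefer a script over Task Manager, nvidia-smi reports the same numbers. A small sketch, assuming an NVIDIA driver with nvidia-smi on PATH:

```python
# Print total/used/free GPU memory via nvidia-smi (NVIDIA driver required,
# nvidia-smi must be on PATH). These are the same numbers Task Manager shows.
import subprocess

result = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=memory.total,memory.used,memory.free",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print("total, used, free:", result.stdout.strip())
```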

Quote from jch2103: "Gemma 3 12B does seem to return more useful/accurate information than 4B, although OpenAI has expected advantages for things like more obscure landmarks."
That's normal. Larger models know more, and the Mistral/OpenAI models probably have around 500B parameters, which would require a TB of VRAM.
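
Back-of-envelope, just for scale (quantization level, KV cache and runtime overhead shift the real numbers):

```python
# Back-of-envelope: VRAM for the weights alone is
#   parameters * bits_per_weight / 8 bytes
# (KV cache and runtime overhead come on top).
def weight_gb(params_billion: float, bits: int) -> float:
    return params_billion * bits / 8  # decimal GB

print(f"Gemma 3 12B @ 4-bit:  ~{weight_gb(12, 4):.0f} GB")    # ~6 GB
print(f"Gemma 3 12B @ 8-bit:  ~{weight_gb(12, 8):.0f} GB")    # ~12 GB
print(f"500B model  @ 16-bit: ~{weight_gb(500, 16):.0f} GB")  # ~1000 GB, i.e. ~1 TB
```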

jch2103

Thanks. It appears that I hadn't loaded the 12B model into LM Studio, although I had downloaded it onto my computer and selected it in IMatch Preferences. It seems to be running OK now, although perhaps slower than ideal despite the 12 GB of memory.
John

Mario

When you have downloaded the model, it should show in the list of models:

Image1.jpg (screenshot: list of downloaded models in LM Studio)

and when you open the properties, it should look like this if there is enough VRAM:


Image2.jpg (screenshot: model properties dialog)

The GPU Offload value tells you whether LM Studio can fit the model in VRAM. If the value is less than 48 (the number of layers in this model), LM Studio needs normal RAM and has to swap, which reduces performance considerably.
The 4B model will of course fit nicely into 12 GB of VRAM.
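
Rough arithmetic on what an offload of 45 out of 48 layers means; the ~7 GB weight file for a 4-bit 12B build is an assumption, check your actual download:

```python
# Rough estimate of what spills into system RAM at GPU offload 45/48.
# The ~7 GB weight file for a 4-bit Gemma 3 12B build is an assumption;
# check the size of your actual download.
model_gb = 7.0
total_layers = 48
gpu_layers = 45

cpu_gb = model_gb * (total_layers - gpu_layers) / total_layers
print(f"~{cpu_gb:.2f} GB of weights served from system RAM")  # ~0.44 GB
```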

jch2103

John

Mario

They are shown when you load a model in the drop-down at the top, or when you click on the gear icon in the "My Models" view.

jch2103

Thanks. I should have realized that, but I was misled by your dark-theme screenshot. GPU offload for my card defaults to 45. I see I can use the slider to change the number (including to 48), but I don't know how that affects performance and other things.
John

Mario

The offload (45) being less than the maximum (48) means that LM Studio thinks the model does not fit in VRAM, so it uses regular RAM instead, which costs performance. You can increase the value and keep an eye on GPU memory in Task Manager's Performance tab. Close VRAM-heavy software like browsers to make more room.

On my system, all LM Studio processes together take about 12 GB of VRAM (out of 16), so this might be a tight squeeze on your system with 12 GB in total.