Are You Using Gemma 3 with AutoTagger? If yes, what do you think?

Started by Mario, April 05, 2025, 11:37:16 AM




photophart

Using Gemma 3 12B. Results are impressive, more so than Llama. Accuracy of subject identification is quite good, depending on subject matter.

It will occasionally be incorrect on some locations, particularly in the American southwest where, admittedly, landscape, topography, color scheme and some vegetation can be quite similar. It also does quite well in forests and other pastoral scenes. I haven't tried it at all in urban settings, but then I tend to avoid urban settings.

Oddly, I used it to add descriptions to images taken at several car shows featuring Ford Mustangs. It surprised me with its accuracy, though it tended to falter when presented with closeups of car design features, especially when using odd or "Dutch angle" image framing.

Overall my experience with it has been quite positive and I continue to use it. It is much, much faster than adding image descriptions the "old" way, to which I'm unlikely to return.
Regards, Mark

Mario

Thanks for reporting your findings.
Tip: Press <Enter> occasionally to start a new paragraph. This makes your posts easier to read, especially on mobile devices with small screens.

Quote: though it tended to falter when presented with closeups of car design features, especially when using odd or "Dutch angle" image framing

That's quite a specific requirement, for sure ;)
The full-size Gemini model running in Google's data centers will probably handle that better. Some detail is lost when models are quantized so they can run on "normal" hardware.

The next release of IMatch includes support for Google Gemini (bring your own API key), and I wonder how well this model will work for your particular "Gemma 3 fails" cases. Maybe give it a try.
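
If you want a taste of what "bring your own API key" means in practice before the release ships, here is a minimal sketch using Google's Python SDK. To be clear: this is not how IMatch implements it; the model name, key handling and file name are just examples.

# Minimal sketch: sending an image plus a prompt to Gemini with your own API key.
# Model name and image path are examples, not the AutoTagger implementation.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # your own key from Google AI Studio
model = genai.GenerativeModel("gemini-2.0-flash-lite")

image = Image.open("mustang_closeup.jpg")
response = model.generate_content(
    [image, "Describe this image in the style of a news caption. Use factual language."]
)
print(response.text)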

jch2103

I'm pretty impressed so far (Gemma 3 12B), especially after some fine-tuning of my prompts.

AutoTagger prompt
This image was taken in {File.MD.Composite\MWG-Location\Location\0}, {File.MD.city}, {File.MD.state}, {File.MD.country}.
This image was taken at the GPS coordinates {File.MD.Composite\GPS-GPSLatitude\GPSLatitude\0} and {File.MD.Composite\GPS-GPSLongitude\GPSLongitude\0}. Consider this when you analyze the image if {File.MD.Composite\MWG-Location\Location\0}, {File.MD.city}, {File.MD.state}, {File.MD.country} are empty.

AutoTagger settings
Description: [[-c-]] Describe this image in the style of a news caption. Use factual language.
Keywords: [[-c-]] Return five to ten keywords describing this image.
Landmarks: [[-c-]] Return the names of known landmarks and tourist spots in this image. If you cannot detect any landmarks or tourist spots, return ''.

Custom traits
AI.ScientificName: Return the scientific name of the object shown in this image.
AI.BlackAndWhite: If this image is monochrome, respond with 'black and white' else return nothing.
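
If you want to test prompts like these outside IMatch, the same prompt (with the variables already expanded) can be sent to a local Gemma 3 through the Ollama Python client. A quick sketch; the model tag, the pre-expanded location text and the file name are mine, not anything AutoTagger produces:

# Sketch: an AutoTagger-style prompt, variables already expanded, sent to a
# local Gemma 3 via the Ollama Python client. Model tag and paths are examples.
import ollama

prompt = (
    "This image was taken in Yosemite Valley, Mariposa, California, USA. "
    "Describe this image in the style of a news caption. Use factual language."
)
response = ollama.chat(
    model="gemma3:12b",
    messages=[{"role": "user", "content": prompt, "images": ["IMG_0001.jpg"]}],
)
print(response["message"]["content"])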

I sometimes also run the same prompts with OpenAI. The results are usually very similar, although OpenAI can be (but isn't always) more accurate about landmarks and animal IDs. So far, I'm keeping AI results in the database, but am considering putting them in image metadata.

I'll definitely try Google Gemini when it's available in IM.
John

Stenis

Quote from: Mario on April 05, 2025, 03:25:10 PM: ...The next release of IMatch includes support for Google Gemini (bring your own API key), and I wonder how well this model will work for your particular "Gemma 3 fails" cases. Maybe give it a try.

That sounds very interesting. I will definitely try Gemini, but when I was about to start using Gemini 2.5 Pro, I read that it was aimed at text rather than image analysis, so I paused that and went with OpenAI instead.

I have been playing with Gemma 3 4B alongside my OpenAI setup on some old safari pictures, and I must say Gemma really struggles despite some active prompting. If I know the species and include it in the prompt, even asking for the Latin name, it mostly gets that right, but Gemma has never managed to distinguish an East African Waterbuck from a South African Nyala, nor a Thomson's gazelle from a Grant's gazelle. I have seen many other problems too if I don't prompt it.

In this case OpenAI worked better when I prompted manually: it identified all of these animals correctly when I supplied their names. As I said, Gemma didn't get the Waterbuck right even with that kind of help, so I'm waiting for Gemini 2.5 and a large model running in the cloud instead.

Mario

Quote: to start to use Gemini 2.5 Pro
Different models are trained for different things. Gemini 2.5 is a reasoning model, for math, coding etc.
Gemini 2.0 Flash (Lite), supported by IMatch, is a multimodal model with excellent image analysis capabilities.

Quote: Gemma really struggles despite some active prompting
Of course it does. You are comparing a quantized (downsized) 12B model, capable of running locally at no cost and with no privacy implications, against a cloud-based 500B model running in a data center. They cannot perform on the same level when every detail matters (like animal species). Size matters for AI models, at least for now.
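
Some back-of-envelope arithmetic makes the gap tangible (ballpark figures only, and the 500B parameter count for the cloud models is an assumption):

# Rough memory footprints: why a quantized 12B model fits a desktop GPU
# while a full-size model needs a data center. Approximations only.
def gib(params, bits_per_weight):
    return params * bits_per_weight / 8 / 1024**3

print(f"Gemma 3 12B at 4-bit quantization: {gib(12e9, 4):.0f} GiB")   # ~6 GiB, fits a consumer GPU
print(f"Gemma 3 12B at full FP16:          {gib(12e9, 16):.0f} GiB")  # ~22 GiB
print(f"500B model at FP16:                {gib(500e9, 16):.0f} GiB") # ~931 GiB, data-center territory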


Stenis

Well, OpenAI and Ollama / Gemma 3 4B are my alternatives until you release support for Gemini 2.x. When that happens I will immediately test whether it does better in these respects that are important to me, but until then I'll just adapt my prompting, because it is remarkable what a little prompting can do to the results.

I have looked at some problem-solving examples with Gemini 2.5 on YouTube, and what was impressive there were the dialogue capabilities, but I find it hard to see how to take advantage of that in AutoTagger. Who knows, we might one day prompt AutoTagger by voice too, like they do with Gemini 2.5, which still seems to be experimental.

Gemma surprised me a little, though, in that it often ignored when I explicitly supplied the animal present in my pictures, while OpenAI on the other hand made use of that extra input.

Mario

As I said, Gemini 2.5 is not designed for image analysis; it's a reasoning model and still in beta.
And really expensive: 30 times the cost of Gemini Flash Lite.

Quote: Gemma surprised me a little, though, in that it often ignored when I explicitly supplied the animal present in my pictures
Rephrase your prompt. Not all prompts work the same for all AIs.

Ask Gemma 3 (via LM Studio) or the AI Chat app how to provide context, like the animal species, when prompting for an image description. That may give you some insights and things to try.
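
If you prefer scripting over the chat window: LM Studio exposes an OpenAI-compatible server on localhost, so you can ask the model the same question from code. A sketch; the port and model identifier depend on what your local setup actually loads:

# Sketch: asking a local Gemma 3 (served by LM Studio's OpenAI-compatible
# endpoint) how to phrase a context-rich image prompt.
# Port and model name depend on your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
reply = client.chat.completions.create(
    model="gemma-3-12b-it",
    messages=[{
        "role": "user",
        "content": "How should I provide context, like the animal species, "
                   "when prompting you for an image description?",
    }],
)
print(reply.choices[0].message.content)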

jch2103

Quote from: Mario on April 06, 2025, 01:48:52 PM: Ask Gemma 3 (via LM Studio) or the AI Chat app how to provide context, like the animal species, when prompting for an image description. That may give you some insights and things to try.
Using a prompt like this results in a description of how to use AI image generation (i.e., to create an image from a description). I'll need to play around some more to try to get it to explain how to generate a (scientific name) description from an image...

"how to create a prompt that returns animal species information from an image"
"Okay, this is a different challenge than generating images with context about animals. You want the AI to analyze an existing image and identify the animal species present. Here's how you can approach it, broken down by difficulty/AI capability and with considerations for different platforms.  Keep in mind that current AI models aren't perfect at this; accuracy will vary."

This looks more promising (text attached for anyone going down this rabbit hole).
John

Mario

If you need specific things, it often helps to be precise in the prompt, giving the AI more to "work with".
I've recently posted an example showing how to make Gemma 3 produce hierarchical keywords, and, after 5,000 images processed, it works very well. It saves a lot of time when setting up keyword mapping in my use case.

It helps to think about what you would include in the description and what style you would use, and then explain to the AI what you expect. Chatting with the AI you use, and asking it how to create a description in a specific style containing specific bits of information, often helps to create a better prompt.

Within the limits of the model, of course. A downsized 12-billion-parameter Gemma 3 model running locally won't have as much "knowledge" as a full-size 500-billion-parameter OpenAI, Mistral or Gemini model. Details like specific taxonomies may not be available with a local model.

And a fine-tuned (trained) model that can run locally and was trained specifically to identify animals, plants, insects etc. will probably perform much better than a cloud-based multi-purpose model.

Or, for really involved users or corporate settings, fine-tuning an existing model hosted by one of the cloud AI providers is always a possibility: basically teaching the model new stuff and then benefiting from it in AutoTagger. Still complicated and more or less expensive, but definitely something to look into.

The beauty of all of this is: When such models become available, AutoTagger can use them.

jch2103

Quote from: Mario on April 06, 2025, 08:50:21 PM: And a fine-tuned (trained) model that can run locally and was trained specifically to identify animals, plants, insects etc. will probably perform much better than a cloud-based multi-purpose model.
For animals, plants, and other living things, it would be great if the iNaturalist database were available for AI use [1]. iNaturalist has its own model within the platform, which usually does an excellent job of making a preliminary ID from user-uploaded images; users of the platform add or correct identifications, so the identification is constantly being refined, I believe. iNaturalist is a not-for-profit.

[1] I do think this would be well beyond the iNaturalist scope/mission.
John

Mario

Quote from: jch2103 on April 06, 2025, 09:42:25 PM: For animals, plants, and other living things, it would be great if the iNaturalist database were available for AI use...

I'm aware of that platform. If only they would make their model (?) available in a standard format, Ollama, LM Studio etc. could run it, and you could use it with AutoTagger and any other software that can interface with these tools.

Lukas52

I am using the smaller 4B model for now and it still does a lot better than the "old" local IMatch model :)

It seems to struggle to describe poses at all, but does a great job of capturing the general "vibe" of almost every photo I asked it to describe.
I'm still experimenting with prompts to get more information that is useful to me out of it.

Mario

This is the "minimal" model of Gemma 3. It's still good for general purposes and runs on "low end" graphic cards, but there are trade-offs of course. I have this a quick try using this image:

Image3.jpg

and got

Image2.jpg

which ain't that bad? Not sure what exactly you mean by pose, though.

And for this image

Image4.jpg

I've got:
"description": "The image shows a woman seated at a table, leaning forward with her laptop. She is looking slightly to the right of the frame. In the background, other people are seated at tables, some with laptops and others engaged in conversation. The overall scene is casual and relaxed.",
"poses": [
{
"person": "Woman",
"pose": "Leaning forward, seated at a table, focused on a laptop, looking to the right",
"details": "Sitting upright, shoulders relaxed, head slightly tilted, engaged with the screen."




This is unstructured output from the AI Chat App, of course.

Maybe try the same prompt for your description. Or use a trait only for the pose (so you can use a pose-specific prompt) and store it in an AI traits tag or an XMP tag of your choice.
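
For example, a pose-only trait along the lines of the custom traits shown earlier in this thread (the wording is just a suggestion, not a tested prompt):

AI.Pose: Describe the pose of the main person in this image in one short sentence. If no person is visible, return ''.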



Lukas52

I set up the 12B version on my local server and, while it is slow (as expected given my hardware), it seems to work, processing about 2-3 images per minute running all three prompts I currently use.

I'll give your suggestions a try; maybe I can get it to output something I can translate into keywords.

The more I think about it, the more it seems like I should change my tagging system, though...
Right now I use a lot of very detailed tags. Often one could express 95% of the same information in just two or three sentences...
Exporting that information remains a problem, though, since almost everything still relies on text search, which does work a lot better with keywords.
I might need to have another look at IMatch Anywhere :)

Mario

Look at my Gemma 3 keywording tip example here: https://www.photools.com/community/index.php/topic,15045.0.html

You could use something like:

Quote: 1. Create keywords describing the poses of the persons in this image in the following format: "pose|[your pose description]".
The "pose|" is a fixed part of each keyword.

In a quick test, this produces quite good hierarchical keywords for the poses:

Image1.jpg
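
And if you ever post-process such keywords outside IMatch, splitting on the "|" separator gives you the hierarchy levels. A tiny sketch with made-up model output:

# Sketch: turning "pose|..." keywords from the model into hierarchy levels.
# The sample keywords below are made up.
raw_keywords = ["pose|leaning forward", "pose|seated at table", "landscape"]

for kw in raw_keywords:
    levels = kw.split("|")         # IMatch-style hierarchy separator
    if levels[0] == "pose":        # keep only the pose branch
        print(" > ".join(levels))  # e.g. "pose > leaning forward"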