write IPTC:CodedCharacterSet tag when creating/updating IPTC records

Started by joel23, October 12, 2014, 06:24:46 PM


joel23

Sorry, I'm pretty late with my answer to your last reply, but unfortunately you moved the original thread this morning without further comment. So I am not sure whether you solved it (that is, fulfilled my request) or just believe everything is okay as it is...
Here's the answer I prepared some time ago; being short on time over the past weeks, I was not able to send it until now.

Quote from: Mario on September 15, 2014, 07:48:39 AM
Quote from: joel23 on September 14, 2014, 07:18:58 PM
IMatch creates IPTC records in UTF-8 when it was set to do so, but it does not write the mentioned tag, as MWG demands it.
IMatch does not create IPTC records in a specific encoding.
I can't tell what it does (in the context of internal -> external encoding), but with the Default setting it writes © (hex A9) and when set to UTF8 for r/w, it writes © (hex C2 A9) to the file for the Copyright symbol "©"
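For reference, the two byte sequences mentioned above can be reproduced outside IMatch. This is only a minimal Python sketch, not what IMatch or ExifTool do internally. It assumes that "Default" on this system corresponds to Python's latin-1 codec, and the helper name hex_bytes is made up for illustration.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the bytes of `text` in `encoding` as upper-case hex pairs."""
    return text.encode(encoding).hex(" ").upper()


COPYRIGHT = "\u00a9"  # the copyright symbol

print(hex_bytes(COPYRIGHT, "latin-1"))  # A9
print(hex_bytes(COPYRIGHT, "utf-8"))    # C2 A9
```

The same character is one byte in Latin1 and two bytes in UTF-8, so a reader that does not know which encoding was used cannot interpret the stored bytes reliably.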
Quote
Also see http://www.sno.phy.queensu.ca/~phil/exiftool/exiftool_pod.html under -charset [[TYPE=]CHARSET] for details. Also check Faq #10 here: http://www.sno.phy.queensu.ca/~phil/exiftool/faq.html
I know those FAQs and documents, but what catches my eye (of course it does) is that even Phil says:
Quote
Note that unless CodedCharacterSet is UTF-8, applications have no reliable way to determine the IPTC character encoding. For this reason, it is recommended that CodedCharacterSet be set to "UTF8" when creating new IPTC.

Let me just say that all my images get this tag set before they are imported into IMatch (because Geosetter takes care of it), so I have neither mixed data nor a need for help.

But we got some reports from other users at the beginning of September, and their problems are not necessarily caused by other software.
There are environments in which IMatch is the Creator (in the MWG sense) of a full IPTC record, for example for RAW/DNG. Some users want it that way, and they don't care what an application uses to do its job.

I read everything that was said here and what is written in the help, and some of it is IMHO contradictory; in addition, neither setting gives satisfying results.
Quote from the help:
Quote
In the (Default) setting IMatch (and ExifTool) assume that the character set used for IPTC is UTF-8 if the IPTC record in the file does not explicitly specify a character set (see also ExifTool FAQ 10).
This is IMHO wrong.
When reading, ExifTool always assumes IPTC is in the local code page when the marker is missing, as you said here:
Quote
When reading legacy IPTC data, ExifTool checks for the UTF8 marker. [...] If no UTF8 marker is found, ET assumes the local code page.

Since I was pretty sure it was different in IM V3, I installed it again the other week and (yes!) when "Default" and "Always save IPTC data in UTF-8" were chosen for IPTC, IM3 nicely created this tag. The explanation at that time was: "this allows you to work with non-ASCII/ANSI characters, including Japanese or other wide character encodings"

But this does not work anymore when IPTC is created by IMatch.

Let me describe what happens with V5 (at least on my systems) when IMatch takes care of the code page. For new CR2 files, all metadata is applied within IMatch, and the IPTC record is created by IMatch.

1. As said above, when the Default setting is used, it seems that IMatch always reads and writes in the local code page (here Latin1).
This usually is fine, but we'll get an ET output like "Warning: Some character(s) could not be encoded in Latin" on write when diacritical characters are used (e.g. in Location data like "City").
Umlauts and e.g. "©" work (well, AFAIK Latin1 contains both), but there is no way to use, for example, Cambodian or Japanese characters. "Kâmpóng Kântuŏt" in the "City" tag just doesn't work.

When using reverse geo-coding, IMatch writes it as "Kâmpóng Kântu?t" to the IPTC "City" and "XMP IPTC Extension\Location Shown City" tags. Users might not be aware of this, because they only look at the XMP data in the default panel and see the right values. See attachment #1

1.1 When such a file is re-imported or "force updated", the value from IPTC (Kâmpóng Kântu?t) is displayed for all tags in the panel. See attachment #2
If users notice this, it might be highly irritating for them, and at this point they may try to solve it by changing the IPTC encoding settings, which would mess up their files even more.
The next time (other) metadata is edited, the XMP values (and only the XMP values) get back to normal.

1.2 From here on all might be good for the next decade, as long as the XMP doesn't get lost. ;)
But in the rather rare cases when users lose or delete their embedded XMP or sidecars, the IPTC value "Kâmpóng Kântu?t" is of course also written to the newly created XMP.

1.3 In addition to the above: when such wide character encodings are used in Keywords and "write IPTC" is enabled, IMatch loops, as some users reported in the discussion area about two weeks before I started this thread. Same ET output as above.

2. Switching IPTC character encoding to UTF8, both for read/write
As said above, here IM/ET writes © to IPTC for the copyright symbol, but it does not create a UTF8 marker.
When a forced update is done or the images are imported again, it seems IMatch reads IPTC in the local code page. This first messes up IPTC "City" and gives me Ã,© for the Copyright after two edits. And with every edit IPTC gets messed up more. See attachment #3
My reading is: "tell IMatch about your IPTC encoding (here UTF8) to avoid garble". Which is what was done here.

3. Switching IPTC character encoding to Default for write and UTF8 for read
Quote from the help:
Quote
You should always use the (Default) value, which means "write as read".
The results are even worse. See attachment #4
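
The effects described in points 1 and 2 can be reproduced with plain string encoding, independent of IMatch. The following Python sketch rests on two assumptions that are not confirmed in this thread: that the lossy write behaves like encoding with a "?" replacement, and that the "local code page" is Windows-1252 (Python's cp1252). The function names are hypothetical.

```python
def write_lossy(text: str, codepage: str = "cp1252") -> str:
    """Simulate point 1: characters missing from the code page become '?'."""
    return text.encode(codepage, errors="replace").decode(codepage)


def misread_rounds(text: str, rounds: int, codepage: str = "cp1252") -> str:
    """Simulate point 2: text is stored as UTF-8 without a marker and then
    read back in the local code page, once per edit."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode(codepage)
    return text


CITY = "K\u00e2mp\u00f3ng K\u00e2ntu\u014ft"  # "Kâmpóng Kântuŏt"
COPYRIGHT = "\u00a9"                           # "©"

print(write_lossy(CITY))                   # Kâmpóng Kântu?t
print(len(misread_rounds(COPYRIGHT, 1)))   # 2
print(len(misread_rounds(COPYRIGHT, 2)))   # 4
```

In this model the copyright symbol doubles in length with each of the first two read/write cycles, which matches the observation that the garble gets worse with every edit.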




IMHO you should consider setting this tag as it was done in IM3. The question is IMHO not why someone might use r/w UTF8 in their settings, but why this tag is not set anymore, when all the mentioned problems could be solved (or rather would not appear) if this tag were created. Very easy.
I even believe the "write" option in the "IPTC character encoding" dialog would be obsolete when doing so, but I might be wrong here.

The results of my tests are similar to what other users have reported. When they migrated their images, some got weird characters in the Copyright, others in Keywords, and for others IMatch looped because wide characters were used in Keywords.

Quote
Or to use ExifTool once to convert the existing IPTC data from whatever code page it was written in into UTF8, with the marker, for example:
exiftool -tagsfromfile @ -iptc:all -codedcharacterset=utf8 your file name.jpg
Yes, users could do so, but since this is not just a display issue it might be too late to use this command afterwards, or it may mess up the data even more.
IMHO it depends on whether IPTC was written in UTF8 and only the marker is missing, or whether the data was already converted.
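
Which of the two cases applies decides whether a later conversion can still help. As a rough illustration only (this is not something ExifTool or IMatch is known to do), a strict UTF-8 decode can serve as a heuristic for the first case, and a reverse round trip can undo a single misread in the second. Windows-1252 as the local code page and both function names are assumptions.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Heuristic: True if `raw` contains non-ASCII bytes and decodes
    cleanly as UTF-8. Pure ASCII is ambiguous, so it returns False."""
    if raw.isascii():
        return False
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def undo_one_misread(text: str, codepage: str = "cp1252") -> str:
    """Reverse one 'UTF-8 bytes read as local code page' step.
    Returns the text unchanged if it cannot be reversed."""
    try:
        return text.encode(codepage).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
```

A value that was already written lossily, such as "Kâmpóng Kântu?t", cannot be recovered this way, because the original character is gone from the file.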

I expect IMatch to do so (once) when it creates an IPTC record, and not only to be MWG compliant. I can't see a reason why this shouldn't be done whenever IPTC is created. Geosetter, which by the way uses ExifTool as well, creates a valid IPTC record including the IPTC:CodedCharacterSet tag, as PS does for JPG, PSD and TIF.

Another thing: it seems I have lost access to the "CodedCharacterSet" variable ;)
I don't remember when it was, but at one time I was able to choose this variable in a metadata template; I am not able to do so anymore. I still have no problems using it in the Apps panel and Value Filters.

[attachment deleted by admin]
regards,
Joerg

Mario


joel23

Quote from: Mario on October 12, 2014, 06:48:48 PM
Is this for a bug report? If so, which?
See the link above in my first sentence. You moved it this morning to the archives.
regards,
Joerg

Mario

Ah, I see. A link hidden behind the word reply. Sorry I did not see that right away. But I often work with a tablet or a smart phone, and it's sometimes hard to see such things.

Instead of filing a new bug report or asking me to re-open the bug you've made a post in General Discussions. Not a good idea. I only keep an eye on bug reports and feature requests. I don't search General Discussions for bug reports.

I have re-opened the original bug report for you. It seemed complete to me so I archived it this morning...

Please be so good as to copy your post from there into a reply to the bug report so we have everything together. The bug report is a mile long already, with replies from me and others. It is difficult to understand already, especially when I revisit it in a few weeks' time.

I explained how ExifTool behaves when reading and writing IPTC data, and how mixing different encodings can lead to problems.
IPTC data should always be in UTF-8 and marked as such. IMatch 3 has done that since 2006 or 2008, if I recall correctly.



joel23

Quote from: Mario on October 12, 2014, 07:04:41 PM
IPTC data should always be in UTF-8 and marked as such. IMatch 3 has done that since 2006 or 2008, if I recall correctly.
Yes, and this is what is missing in IM5 and IMHO unnecessarily produces problems.
I will append my post to the bug report later.
regards,
Joerg

Mario

Quote
Yes, and this is what is missing in IM5 and IMHO unnecessarily produces problems.
ExifTool by default writes IPTC data in UTF8. No need to do something special from IMatch.
Open the ExifTool output panel, do a write-back and then copy the contents of the panel to your bug report. I need to see if special character sets are contained in the commands IMatch sends to ExifTool.

sinus

Quote from: joel23 on October 12, 2014, 07:12:03 PM
Quote from: Mario on October 12, 2014, 07:04:41 PM
IPTC data should always be in UTF-8 and marked as such. IMatch 3 has done that since 2006 or 2008, if I recall correctly.
Yes, and this is what is missing in IM5 and IMHO unnecessarily produces problems.
I will append my post to the bug report later.

Hi Joerg
It seems to me that you did a lot of research on the topic.
Unfortunately my English is not good enough to understand it all fully, and my technical understanding is not that great either  :-[  :-\

In short: do you mean that, with regard to umlauts, it is a good idea to set "read and write NOT default, but UTF-8" in the IM5 preferences?
Because I use umlauts (I also did so in IM3), this is interesting for me.
Best wishes from Switzerland! :-)
Markus

joel23

Quote from: sinus on October 13, 2014, 09:11:45 AM
Quote from: joel23 on October 12, 2014, 07:12:03 PM
Quote from: Mario on October 12, 2014, 07:04:41 PM
IPTC data should always be in UTF-8 and marked as such. IMatch 3 has done that since 2006 or 2008, if I recall correctly.
Yes, and this is what is missing in IM5 and IMHO unnecessarily produces problems.
I will append my post to the bug report later.

In short: do you mean that, with regard to umlauts, it is a good idea to set "read and write NOT default, but UTF-8" in the IM5 preferences?
Because I use umlauts (I also did so in IM3), this is interesting for me.
No, no Markus. I don't want to advise anything. Leave it as it is at the moment.
Umlauts work fine, but Czech characters, for example, don't, as some users reported in the past. Neither do some Cambodian or Japanese characters, or anything else in a wide character encoding.

You might have used UTF8 already in IM3, and because of this the -IPTC:CodedCharacterSet=UTF8 tag may exist; in this case everything is fine anyway. I had that set in IM3 and therefore had no problems migrating my images.
The problem arises with new images when an IPTC record is created by IMatch, or when ingesting older images in which wide character encodings were used, write IPTC is set to yes, and the above tag is missing. Something like that.
regards,
Joerg

sinus

Thanks, Joerg, fine to hear!

BTW: are umlauts the reason you write Joerg instead of Jörg, or is Joerg your real name? Just curious ... I also often spell my last name "Hässig" as Haessig. Ahhhhh, are these English-speaking people happy, I think, they even do not know (mostly), how lucky they are!  ::)
Best wishes from Switzerland! :-)
Markus

joel23

Quote from: sinus on October 13, 2014, 11:08:33 AM
Thanks, Joerg, fine to hear!

BTW: are umlauts the reason you write Joerg instead of Jörg, or is Joerg your real name?
Yes, umlauts are the reason. Jörg is my real name of course.
But it's easier for international writers to write Joerg and pronounce it like Georg. ;)
regards,
Joerg

sinus

Quote from: joel23 on October 13, 2014, 11:44:30 AM
Quote from: sinus on October 13, 2014, 11:08:33 AM
Thanks, Joerg, fine to hear!

BTW: are umlauts the reason you write Joerg instead of Jörg, or is Joerg your real name?
Yes, umlauts are the reason. Jörg is my real name of course.
But it's easier for international writers to write Joerg and pronounce it like Georg. ;)

:)  :D
Best wishes from Switzerland! :-)
Markus

Richard

Quote from: sinus on October 13, 2014, 11:08:33 AM
Ahhhhh, are these English-speaking people happy, I think, they even do not know (mostly), how lucky they are!  ::)

I may be lucky that English has become something of a common language around the world but that does not mean that I am happy with the situation. In my opinion English is a hard language to learn. For example: I have read that 17 different words, with different meanings, in ancient Greek all translate to "love" in English. That makes the meaning of "love" nearly impossible to know.

Worse yet is the fact that native English speakers will use the wrong spelling of words. Instead of writing "you're" they will write "your". Like "your wrong" when they really mean "you are wrong" and could use the contraction "you're".

Just in the 48 contiguous states of the United States a word may be pronounced differently in different areas. Add our neighbors to the North (Canada) and things get worse. Add all the other "English" speaking countries and the  problem gets worse.

How nice it would be if we all spoke a language where a word had only one meaning and would be pronounced the same the world over. Where a name like "Jörg", "Hässig", or "Brögger" would be spelled the same and pronounced the same the world over.

cytochrome

I stumbled on this diacritic character problem very early (it was in late beta test) because I write headline and description in French, and most of my localization data is in French too. It was really annoying, and difficult to make any sense of.

I did not do such an in-depth analysis as Joerg, but my tests showed that one cause was that IM failed to raise the IPTC UTF-8 flag at metadata write-back. So I raised it (with ECP or Photomechanic) in all my raw files; it is then transferred to the versions by IM. And I kept the Read/write setting in the metadata panel at default.

I now have very little gibberish in my location/headline/description, but from time to time it creeps back to the surface when I modify something in a file and IMatch does an update. I correct it by hand...

With new files I have had no problems at all for a year, since I realized that I had to set Photomechanic to output everything as Unicode.

Francis

joel23

Quote from: cytochrome on October 13, 2014, 09:58:30 PM
I stumbled on this diacritic character problem very early (it was in late beta test) because I write headline and description in French, and most of my localization data is in French too. It was really annoying, and difficult to make any sense of.

I did not do such an in-depth analysis as Joerg, but my tests showed that one cause was that IM failed to raise the IPTC UTF-8 flag at metadata write-back. So I raised it (with ECP or Photomechanic) in all my raw files; it is then transferred to the versions by IM. And I kept the Read/write setting in the metadata panel at default.

I now have very little gibberish in my location/headline/description, but from time to time it creeps back to the surface when I modify something in a file and IMatch does an update. I correct it by hand...

With new files I have had no problems at all for a year, since I realized that I had to set Photomechanic to output everything as Unicode.

Francis
You might want to follow us in the Bug Reports section, same subject.
regards,
Joerg