Multiple data records for one pdf file

Started by UdoL, April 10, 2015, 09:00:52 PM

UdoL

I'd like to manage articles in pdf files of complete newspaper or magazine issues in imatch. Therefore multiple data records (for different articles in one issue) in imatch should refer to the same pdf file. Is this possible? Moreover it would be ideal to refer to distinct pages in the pdf.

BanjoTom

I would think you could accomplish this with careful development of categories and also using attributes, with which you could build multiple (searchable) records for each PDF file that you need to reference or retrieve.
   
— Tom, in Lexington, Kentucky, USA

Richard

I would consider one category per article and then assign the PDF to the proper categories. As Tom said, Attributes can be used for more data.

An added thought: I would tend to keep the categories for articles somewhat generic. Let's say that a category for articles is labeled "Crime". Many PDFs could be assigned and thus aid finding articles on crime.

Mario

IMatch does not perform 'text extraction' and does not 'look' inside PDF files. It extracts the standard metadata via ExifTool, which is usually sufficient when you just want to track PDF files in general and you use proper metadata in the PDF file. Magazines, eBooks etc. usually fill PDF metadata so you can automatically group your PDF files by publication, headline, year etc. using data-driven categories  (IMatch by default creates data-driven sample categories for PDF files).

As outlined above, IMatch categories allow you to group/cluster/organize PDF files in any way you can think of.

If you want to record additional data for each PDF file, e.g. article summaries or snippets, you can do this with IMatch Attributes. Attributes allow you to create your own metadata and manage it in the IMatch database. You could, for example, create an AttributeSet (per file) which stores a summary text and a page number. You can then add as many of these records for each PDF file as there are articles in the file, specifying a summary and the page number for each article. IMatch displays this data in the Attributes Panel, and you can use it also everywhere IMatch supports variables, for import & export, in scripts etc.

See the help topic on Attributes in the IMatch help for information and examples.
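
To make the data model concrete, here is a minimal sketch in plain Python. It is for illustration only and is not IMatch code or its API: one PDF file carries a list of article records, each with a summary and a page number. The names `ArticleRecord`, `articles_by_file` and `find_articles`, the file name and the page numbers are all made up for this example.

```python
from dataclasses import dataclass


@dataclass
class ArticleRecord:
    """One record of a per-file attribute set (hypothetical field names)."""
    summary: str
    page: int


# One PDF file maps to any number of article records.
# File name and page numbers are invented sample data.
articles_by_file: dict[str, list[ArticleRecord]] = {
    "magazine-2015-04.pdf": [
        ArticleRecord(summary="Fashion Shooting in Mailand", page=12),
        ArticleRecord(summary="Berlin for Aliens", page=34),
    ],
}


def find_articles(term: str) -> list[tuple[str, ArticleRecord]]:
    """Return (file name, record) pairs whose summary contains the term."""
    term = term.lower()
    return [
        (file_name, record)
        for file_name, records in articles_by_file.items()
        for record in records
        if term in record.summary.lower()
    ]
```

Searching for "berlin" in this sketch finds the second record of the sample file, which is the kind of per-article lookup the Attributes approach is meant to give you.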

monochrome

Quote from: UdoL on April 10, 2015, 09:00:52 PM
I'd like to manage articles in pdf files of complete newspaper or magazine issues in imatch. Therefore multiple data records (for different articles in one issue) in imatch should refer to the same pdf file. Is this possible? Moreover it would be ideal to refer to distinct pages in the pdf.

Short answer: No, not really.

Oh boy did you set yourself a task...  :-\ I have a similar problem that I'm giving a lot of thought. Here's a bit of what I've come up with - it's not final and not the word from on high, but maybe it can help you.

First, IMatch deals with metadata for files. This is just what the tool is made to do. If you have a photo file (JPG) of a restaurant, then IMatch is built for keeping the information that it shows a certain restaurant; it's not built to maintain data about the restaurant itself. Therefore, in order to fit with what IMatch does, the solution must take the form of metadata in files that describes those files. This isn't a defect in IMatch. That's how everyone (the researchers who came up with the format, the architects who wrote the spec and the programmers who implemented it) thought about it. Applying multiple metadata records to a single file was considered something that dedicated metadata systems might, just might, support. It was such an outlier in use, and so difficult to incorporate in a user interface, that it was essentially left for the interested reader to figure out and implement.

The easiest way would be to split the PDFs and include in each part PDF a dc:relation tag that refers back to the main PDF, plus a dc:relation in the main PDF that points to each small PDF. (Alternatively, just keep the small PDFs and put the issue title and number in the dc:relation tag.) Then tag the parts as you want. This is in line with what IMatch is designed to do.

Doing this "properly" would require you to define your own metadata types, maybe analogous to the MWG-Region spec. But again, don't expect IMatch to magically grow a new UI for you to comfortably edit those tags in. It can be done, though.

Mario

I think you are over-complicating things in this case.

The IMatch Attribute database can handle this case easily. It allows the OP to record any number of article records per PDF file. This is no different from how it is done in the systems employed by the publishers. If I needed to store article data for a PDF file containing multiple articles, I would just set up an Attribute Set with attributes like 'summary', 'page', 'author' etc. I would probably use the official PDF metadata schema as an example. Then I would fill in the data as needed.

When I then select a PDF file in the file window, the Attributes panel displays detailed information about each article in the file. I can use this data for display purposes, import & export it, search for it, filter for it and so on. This already gets you very far:



If you want to rely entirely on metadata, the native PDF metadata will not get you far. But you can use standard XMP metadata, including the IPTC tags, with PDF files. This allows you to record a lot of information for each PDF file and in the PDF file.

UdoL

Thanks a lot for all your answers. I'm diving more and more into IMatch. There is so much to explore - and it's fun!

@monochrome: I also thought of splitting the PDFs in order to get what I want. But this would mean some effort for each article and I'd prefer to avoid that. It would, however, perfectly suit the metadata idea, and it would have the advantage that the same metadata tags could be used for different media types, i.e. Author & Title for photos, videos and text documents. That way a precise search over several media types would be easier to execute.

@Mario: I think your suggestion meets my needs nearly perfectly. I hadn't yet read the Attributes section of the help because I assumed a simple 1:1 association between file and custom attributes and didn't expect IMatch to be able to manage multiple sets of attribute values for the same attributes of one file.

Until now I've been using an old MediaDex Version of Canto Cumulus Single User. I always was reluctant to update because the price rose to four times the price of MediaDex after Canto took it back inhouse and I was always fearing that Canto would discontinue the support of the single user edition some day - and this now is actually the case. So I'm on the way to another system and this will be most likely IMatch.

Coming from MediaDex, I've been searching for a solution similar to theirs. MediaDex is based on a somewhat different concept than IMatch. The database holds records with attributes, each of which refers to a file. So every entry in the table or thumbnail view represents a data record with a reference to a file, whereas in IMatch every entry represents the file itself. In MediaDex I could therefore simply copy and paste a record and get a second record referring to the same file.

Thinking this way, I looked for a similar approach in IMatch. But your solution, Mario, suits my needs very well too. As far as I can see it has only one disadvantage compared to MediaDex: the information is spread over two panels, the file window and the Attributes panel, whereas in MediaDex all information (thumbnail, metadata and attributes) can be viewed in one table, which does not seem to be possible in IMatch.

By the way, another point that speaks very much in favour of IMatch for me is your excellent support here in the community, Mario. Many thanks for that!

sinus

I think that with the app possibilities it is easy to view everything you want in one single window, though I am not sure.
Best wishes from Switzerland! :-)
Markus

Mario

Quote
As far as I can see it has only one disadvantage compared to MediaDex, that is that the information is spread over two panels, the file window and the attribute panel whereas in MediaDex all information (thumbnail, metadata and attributes) can be viewed in one table which seems not to be possible in IMatch.

This is, again, an area where IMatch excels!  ;)  IMatch is customizable in so many areas. People coming from other software often miss that, because they don't expect software to be designed to adapt that much...

You can create your own file window layout to display all the info you want to see per file, including metadata and attributes.

You have the App Panel where you can define your own HTML-based metadata panel, which displays all the information you want, in any way you want. For example, look at the "Category Dashboard" or the sample panels for HTML, GIF files, Nikon and Canon RAW files. Or the MP3 template (there is even an MP3 player!). The App Panel supports simple HTML templates and can also run full-fledged Apps written in HTML and JavaScript. A pretty unique feature.

For a start, I suggest you look into customizing the file window layout to your needs. I made a quick test, and the results are already pretty cool (PDF covers from www.textnein.com):



This tabular file window layout uses standard fields for the file name and date.
For the 3rd column I used IMatch Variables representing the Attributes stored for the file.

My Attribute Set has the Attributes Summary, Page, Pages. It could have any number of other Attributes as well. Whatever you need.

When you use the variable for the Summary attribute (Tip: Use the Var Toy in the App Panel to try this out)

{File.AT.Articles.Summary}

IMatch returns all Summary Attributes of the file, separated by a semicolon (;), for example:

Fashion Shooting in Mailand;Berlin for Aliens

This works the same for page and pages. IMatch variables allow you to use additional parameters, e.g. the index parameter which allows you to access a specific element in the list. To get the first Attribute in the list, I change the variable to:

{File.AT.Articles.Summary|index:0} (returns Fashion Shooting in Mailand)
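
For readers who think in code, the behaviour can be mimicked with a few lines of Python. This is only a model of what this post describes (a semicolon-separated list without an index, one element with an index, nothing for an index that does not exist), not how IMatch evaluates variables internally. The function name `resolve` is made up for the sketch.

```python
from typing import Optional


def resolve(values: list[str], index: Optional[int] = None) -> str:
    """Mimic a multi-value variable with an optional index parameter."""
    if index is None:
        # No index: all values, joined with a semicolon.
        return ";".join(values)
    if 0 <= index < len(values):
        # Zero-based index: exactly one element of the list.
        return values[index]
    # Index does not exist: no output at all.
    return ""


summaries = ["Fashion Shooting in Mailand", "Berlin for Aliens"]
all_summaries = resolve(summaries)
first_summary = resolve(summaries, index=0)
```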

For my test file window layout, I used that idea to create a custom template which displays one summary, page (pages) per row. I assumed a maximum of 5 articles per magazine (you just need to add more rows to cover for more articles). The template looks a bit unusual at first, but when you play with it a bit, you'll see how it works. See the Variables help topic in the IMatch help for all details.

Each row accesses one Summary, Page and Pages value using the index (0, 1, 2, ...).
I've even added a <Bold> formatting tag to make the summary appear in bold font.
When the variables access an index which does not exist, they don't output anything. So if there are only two articles, you get only two rows. Perfect!

<Bold>{File.AT.Articles.Summary|index:0;prefix:1: }</Bold> {File.AT.Articles.Page|index:0;prefix:p} {File.AT.Articles.Pages|index:0;prefix:(;postfix:)}
<Bold>{File.AT.Articles.Summary|index:1;prefix:2: }</Bold> {File.AT.Articles.Page|index:1;prefix:p} {File.AT.Articles.Pages|index:1;prefix:(;postfix:)}
<Bold>{File.AT.Articles.Summary|index:2;prefix:3: }</Bold> {File.AT.Articles.Page|index:2;prefix:p} {File.AT.Articles.Pages|index:2;prefix:(;postfix:)}
<Bold>{File.AT.Articles.Summary|index:3;prefix:4: }</Bold> {File.AT.Articles.Page|index:3;prefix:p} {File.AT.Articles.Pages|index:3;prefix:(;postfix:)}
<Bold>{File.AT.Articles.Summary|index:4;prefix:5: }</Bold> {File.AT.Articles.Page|index:4;prefix:p} {File.AT.Articles.Pages|index:4;prefix:(;postfix:)}


In the file window layout editor, this looks like:



Now you have something to play with. Try that in another DAM!
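
If you need more than five rows, typing them by hand gets tedious and it is easy to slip on an index. A small helper outside IMatch can write the rows for you. The Python sketch below only builds the template text for pasting into the layout editor; the function name `layout_rows` is made up, and it assumes an Attribute Set called Articles with the attributes Summary, Page and Pages as in my test.

```python
def layout_rows(max_articles: int, attribute_set: str = "Articles") -> str:
    """Build one template row per article, using zero-based indexes."""
    base = f"File.AT.{attribute_set}"
    rows = []
    for i in range(max_articles):
        rows.append(
            # Doubled braces produce the literal braces of the variable.
            f"<Bold>{{{base}.Summary|index:{i};prefix:{i + 1}: }}</Bold> "
            f"{{{base}.Page|index:{i};prefix:p}} "
            f"{{{base}.Pages|index:{i};prefix:(;postfix:)}}"
        )
    return "\n".join(rows)


template = layout_rows(5)
print(template)
```

The row label counts from 1 while the index counts from 0, so row n always uses index n-1.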

sinus

Mario, a very good and interesting answer, I guess, also for other users - like me.  ;D
Best wishes from Switzerland! :-)
Markus

Richard

Quote from: sinus on April 12, 2015, 03:55:28 PM
Mario, a very good and interesting answer, I guess, also for other users - like me.  ;D

Not only an interesting answer but one that should not lie hidden among over 9,100 posts by Mario.

sinus

Quote from: Richard on April 12, 2015, 06:34:05 PM
Quote from: sinus on April 12, 2015, 03:55:28 PM
Mario, a very good and interesting answer, I guess, also for other users - like me.  ;D

Not only an interesting answer but one that should not lie hidden among over 9,100 posts by Mario.

You are completely right, Richard.

BTW: I tried this and it works great. Attributes are fascinating, and I know, that one day I will add some feature requests for them  ;D ... but I am not ready now, because I do not know, what I want!  8)
Best wishes from Switzerland! :-)
Markus

jch2103

Quote from: Richard on April 12, 2015, 06:34:05 PM
Quote from: sinus on April 12, 2015, 03:55:28 PM
Mario, a very good and interesting answer, I guess, also for other users - like me.  ;D

Not only an interesting answer but one that should not lie hidden among over 9,100 posts by Mario.

+1
John

Mario

I have planned to reuse my post for a new KB article.

jeknepley

Quote from: Richard on April 12, 2015, 06:34:05 PM
Quote from: sinus on April 12, 2015, 03:55:28 PM
Mario, a very good and interesting answer, I guess, also for other users - like me.  ;D

Not only an interesting answer but one that should not lie hidden among over 9,100 posts by Mario.

Gems like this go into my jewel box - i.e., in a bookmark bar tab labeled IM5 Nuggets, a repository for photools.com Community posts that I know I'll be coming back to.


UdoL

Sounds pretty cool, thanks! I'll definitely give it a try at a later time. For the moment I'm satisfied with your first answer because I know that there is a good solution.

But first I have to solve some other issues. Concerning the attributes, my question now is how to move my data from MediaDex to IMatch. I've already found the basic solution for the data transfer. I'm able to export data from MD to a text file. After some reworking I can import this into IMatch using the CSV import.

But if I have multiple rows with the same file reference, how does IMatch handle this during the CSV import? Or, asked the other way round: what does the CSV have to look like so that I'm able to import my multiple records into IMatch Attribute Sets? To find out, I exported file attributes and reimported them, but this didn't really work. This was my test:


  • I defined an attribute set "Article" with two attributes "Author" and "Title"
  • I added 2 values for a pdf file: "Author 1" and "Title 1" plus "Author 2" and "Title 2"
  • I made a text export for the file with {File.FullName} and both attributes. The result was "Author 1_Author 2" for "Author" and "Title 1_Title 2" for "Title"
  • I deleted the attribute values in the database
  • I then reimported the txt file as CSV

Now I expected to get my two attribute sets back, but unfortunately didn't. The result was the value "Author 1_Author 2" in "Author" and "Title 1_Title 2" in "Title". So the semantics doesn't seem to be the same for export and import. What's the one for the import?

Mario

The native import/export format for IMatch attributes is XML, not CSV. See the IMatch help for details.
Please use this format if you want to export/import Attribute data and schemata.

The CSV file format is a pretty simple format, defined 30 years ago to transfer 'records' or Excel table rows between applications. It cannot handle any non-trivial case, e.g. multiple 'records' per entity, in any standardized way.

The text export in IMatch is designed to transfer text data of any kind from IMatch to other applications in a variety of formats, CSV being one of them. When you export multi-value variables (e.g. keywords or Attributes), IMatch concatenates the values in the output; it does not produce multiple records for each file or anything like that. You may be able to work around this by using the same index trick I used above, but trying all this out will take quite some time that I don't have.

Neither the text export nor the CSV import is designed as a round-trip facility or to handle the non-trivial 'multiple records per file' case.

I would look into a custom script which reads the data exported by MediaDex and then imports it directly into the database. The scripting language in IMatch is perfectly capable of doing this, and if you know a bit about programming or can ask somebody in your IT department or a friend, it should be doable in a couple of hours.
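
As a rough idea of the first half of such a script, here is a sketch in plain Python (not the IMatch scripting language). It reads an export in which several rows refer to the same file and groups them into one list of records per file. Everything specific is an assumption: the column names file, author and title, the semicolon delimiter and the sample rows are invented, and the real MediaDex export will look different. Writing the grouped records into the IMatch database is deliberately not shown, because that part depends on the IMatch scripting facilities or the XML Attribute import described in the help.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export: one row per article, several rows per file.
SAMPLE_EXPORT = """file;author;title
issue-01.pdf;Author 1;Title 1
issue-01.pdf;Author 2;Title 2
issue-02.pdf;Author 3;Title 3
"""


def group_records(export_text: str, delimiter: str = ";") -> dict[str, list[dict[str, str]]]:
    """Group export rows by file name; each row becomes one record."""
    grouped = defaultdict(list)
    reader = csv.DictReader(io.StringIO(export_text), delimiter=delimiter)
    for row in reader:
        file_name = row.pop("file")  # the remaining columns form the record
        grouped[file_name].append(row)
    return dict(grouped)


records_by_file = group_records(SAMPLE_EXPORT)
for file_name, records in records_by_file.items():
    print(file_name, len(records), "record(s)")
```

Each list in the result corresponds to the set of Attribute records one file should end up with.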


Mario

I wrote a knowledge base article today, which also covers the topic of displaying multiple rows of Attribute data:

http://www.photools.com/3808/tune-file-window-layouts/

sinus

Quote from: Mario on April 13, 2015, 05:23:23 PM
I wrote a knowledge base article today, which also covers the topic of displaying multiple rows of Attribute data:

http://www.photools.com/3808/tune-file-window-layouts/

Great knowledge base article, Mario! Really!

I am and always was a fan of the file window layout possibilities. We can really create a lot of things with them, and I use them all the time.
Best wishes from Switzerland! :-)
Markus

Richard

I may have spotted a typo.

Quote
Summary File Window Layouts in IMatch enable you to control the date displayed in the file window. You can easily control the attributes or metadata to display, and also choose font sizes and colors.
My guess is that you meant Data.

Mario


UdoL

Thanks, Mario, for your answer. The only reason to try the round trip was to find out what the CSV would have to look like. I hoped that if I imported the Attribute Set in the same format that I get from the export, it would work with Attribute Sets. It didn't, and you have now told me that there is no way to manage it without scripting. I fear I won't be able to manage that, so I'll probably live with the concatenated values for migrated data and use the full flavour of Attribute Sets only for new articles.