Structure information #26

Open
2 of 5 tasks
mikegerber opened this issue Oct 25, 2023 · 16 comments

@mikegerber
Member

mikegerber commented Oct 25, 2023

@labusch had questions regarding structure information (from METS metadata) and @joergleh already had suggestions regarding missing information (#23, #24).

Some information is, in my view, out of scope for this tool (such as the location of a title page; the original METS should be used for that), but other information should certainly be included (such as the count/presence of a title page).

(Edit: Moved the missing field documentation to #27.)

@mikegerber mikegerber added documentation Improvements or additions to documentation enhancement New feature or request labels Oct 25, 2023
@mikegerber mikegerber self-assigned this Oct 25, 2023
@mikegerber mikegerber removed the documentation Improvements or additions to documentation label Oct 25, 2023
@mikegerber
Member Author

We're currently completely ignoring the structMap, so this needs new code ;)

@mikegerber
Member Author

@labusch told me there's interest in per-page information ("This is a title page" or whatever types exist). IMHO this should not go into the mods_info but into a separate file/table, and I would only consider it after reviewing the structure first (I am simply not familiar enough with mets:structMap yet).

@cneud
Member

cneud commented Nov 1, 2023

AFAICT, what we need is this: for a given list of PPNs, we want to know which image files per PPN correspond to each (if any) of the Strukturdaten (structure data) listed here (scroll down to "Strukturtypen - vollständige Liste", i.e. the complete list of structure types): https://digital.staatsbibliothek-berlin.de/features-und-hilfe/suchen-und-stoebern.

@mikegerber
Member Author

Yeah that's per-page information :) Need to look at it for specifics.

I still think reading the METS directly might be an option, but now that I understand the use case, I might be able to come up with something easier to digest than having to (1) deal with all that XML and (2) understand the METS structure.

Just to restate it, because the grammar reverses it, what you want is this:

Given ppn=PPN12345 and type=illustration, give me the matching pages (and the images).

That should be possible, as far as I can see now. There will be some difficulties (omnibus volumes (Sammelbände) etc.), but it could work.

@cneud
Member

cneud commented Nov 1, 2023

> Given ppn=PPN12345 and type=illustration, give me the matching pages (and the images).

Exactly. The use case is to automatically ingest the structural types (e.g. title pages) as tags into the image search db, so that users can easily select all images from all PPNs that are title pages and annotate the corresponding regions on those images.

@mikegerber
Member Author

That's the other way around ;) But I think I understand now and will implement it.

@mikegerber
Member Author

mikegerber commented Nov 9, 2023

The details are a bit tricky, but this seems a good way to do it:

Per PPN and page (as in structMap TYPE=PHYSICAL → TYPE=page):

  • Have the filenames for this page in all fileGrp
  • Have flags for every type if they exist (TYPE=LOGICAL div TYPE=)
    • These are hierarchical, e.g. we could have (1) an illustration in the (2) bookend of the (3) binding (just making it up). I don't see an elegant way to keep the hierarchy while still exposing all the types. I think flags (booleans) for the types are sufficient, and people wanting more need to read the METS.

This way there's an immediate link between a file name and a logical type. (Hardcoding a fileGrp, or guessing e.g. the filename from the ID in structMap[@TYPE=LOGICAL], would be only slightly easier and would probably fail here and there → prefer the correct version and resolve filenames using all fileGrps.)

  • Implement it
  • Write some examples
  • Read up to confirm that I understood the structMap[@TYPE=LOGICAL] correctly
  • Look at some illustrations, just to make sure
  • New export
  • Sammelbände (I avoided those, and should probably know these better)
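
A minimal sketch of the per-page plan above, assuming a made-up miniature METS document and the standard library's `xml.etree.ElementTree`. The function name `page_rows`, the sample IDs, and the example.org URLs are hypothetical and are not taken from the actual implementation; the column names follow the naming pattern described in this thread.

```python
import xml.etree.ElementTree as ET

METS = "{http://www.loc.gov/METS/}"
XLINK = "{http://www.w3.org/1999/xlink}"
NS = {"mets": "http://www.loc.gov/METS/"}

# Made-up miniature METS document, for illustration only.
METS_XML = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="DEFAULT">
      <mets:file ID="FILE_0001_DEFAULT">
        <mets:FLocat xlink:href="https://example.org/0001.jpg"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="THUMBS">
      <mets:file ID="FILE_0001_THUMBS">
        <mets:FLocat xlink:href="https://example.org/0001_thumb.jpg"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="LOGICAL">
    <mets:div ID="LOG_0000" TYPE="monograph">
      <mets:div ID="LOG_0001" TYPE="illustration"/>
    </mets:div>
  </mets:structMap>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div ID="PHYS_0000" TYPE="physSequence">
      <mets:div ID="PHYS_0001" TYPE="page">
        <mets:fptr FILEID="FILE_0001_DEFAULT"/>
        <mets:fptr FILEID="FILE_0001_THUMBS"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
  <mets:structLink>
    <mets:smLink xlink:from="LOG_0000" xlink:to="PHYS_0001"/>
    <mets:smLink xlink:from="LOG_0001" xlink:to="PHYS_0001"/>
  </mets:structLink>
</mets:mets>
"""


def page_rows(mets_xml):
    """Return one dict per physical page: filenames from all fileGrps plus type flags."""
    root = ET.fromstring(mets_xml)

    # FILEID -> (fileGrp USE, href), so no fileGrp has to be hardcoded.
    files = {}
    for grp in root.iter(METS + "fileGrp"):
        for f in grp.iter(METS + "file"):
            href = f.find(METS + "FLocat").get(XLINK + "href")
            files[f.get("ID")] = (grp.get("USE"), href)

    # Logical div ID -> TYPE.
    logical = root.find("mets:structMap[@TYPE='LOGICAL']", NS)
    log_types = {d.get("ID"): d.get("TYPE") for d in logical.iter(METS + "div")}

    # Physical ID -> list of linked logical IDs (from the smLinks).
    links = {}
    for sm in root.iter(METS + "smLink"):
        links.setdefault(sm.get(XLINK + "to"), []).append(sm.get(XLINK + "from"))

    rows = []
    physical = root.find("mets:structMap[@TYPE='PHYSICAL']", NS)
    for page in physical.iter(METS + "div"):
        if page.get("TYPE") != "page":
            continue
        row = {"ID": page.get("ID")}
        for fptr in page.iter(METS + "fptr"):
            use, href = files[fptr.get("FILEID")]
            row[f"fileGrp_{use}_file_FLocat_href"] = href
        for log_id in links.get(page.get("ID"), []):
            row[f"structmap_LOGICAL_TYPE_{log_types[log_id]}"] = 1
        rows.append(row)
    return rows


rows = page_rows(METS_XML)
```

Each resulting dict ties the page's filenames (one per fileGrp) directly to its logical types, which is the "immediate link" described above.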

@mikegerber
Member Author

I've been held up by the joy of Sammelbände...

Implementing it as above is also a bit trickier than I thought (e.g. I need to read the fileGrp to know which one a file's FILEID belongs to, and associating structMap PHYSICAL with structMap LOGICAL is full of ID pointers, too), but I have started on it.

@mikegerber
Member Author

mikegerber commented Nov 22, 2023

Alright, this looks like it's going somewhere:

{'ID': 'PHYS_0594',
 'fileGrp_DEFAULT_file_FLocat_href': 'https://content.staatsbibliothek-berlin.de/dc/PPN821507109-0000>
 'fileGrp_PRESENTATION_file_FLocat_href': 'file:///goobi/tiff001/sbb/PPN821507109/00000594.tif',
 'fileGrp_THUMBS_file_FLocat_href': 'https://content.staatsbibliothek-berlin.de/dc/PPN821507109-00000>
 'ppn': 'TODO',
 'structmap_LOGICAL_TYPE_illustration': 1,
 'structmap_LOGICAL_TYPE_monograph': 1,
 'structmap_LOGICAL_TYPE_section': 1}

This dict is going to be a row in a DataFrame and gives both the filenames (as they are in the METS) and the associated TYPEs in the mets:structMap[@TYPE="LOGICAL"]. What the three indicator variables (the 1s) mean:

  • This page is part of a monograph
  • This page is part of a section
  • This page is part of an illustration ("part of" is as fine-grained as it gets)

The way our METS is set up, you get these TYPEs explicitly, but I also made it transitive, e.g. if an illustration is part of a section, you would get section in any case.

With this info you can get the structMap TYPEs for a given page and also go the other way, i.e. get the pages that have illustrations on them.
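
Both directions of that lookup can be sketched with plain Python dicts standing in for the DataFrame rows. The sample rows, the `title_page` type, and the helper names `types_of_page` and `pages_with_type` are made up for illustration; only the `structmap_LOGICAL_TYPE_` column prefix comes from the dict shown above.

```python
PREFIX = "structmap_LOGICAL_TYPE_"

# Made-up rows in the shape of the dict above (one row per page).
rows = [
    {"ppn": "PPN12345", "ID": "PHYS_0001",
     "fileGrp_DEFAULT_file_FLocat_href": "https://example.org/0001.jpg",
     PREFIX + "monograph": 1, PREFIX + "title_page": 1},
    {"ppn": "PPN12345", "ID": "PHYS_0002",
     "fileGrp_DEFAULT_file_FLocat_href": "https://example.org/0002.jpg",
     PREFIX + "monograph": 1, PREFIX + "illustration": 1},
    {"ppn": "PPN67890", "ID": "PHYS_0001",
     "fileGrp_DEFAULT_file_FLocat_href": "https://example.org/other_0001.jpg",
     PREFIX + "monograph": 1, PREFIX + "illustration": 1},
]


def types_of_page(row):
    """Page -> structure types: collect all set type flags of one row."""
    return sorted(k[len(PREFIX):] for k, v in row.items()
                  if k.startswith(PREFIX) and v)


def pages_with_type(rows, ppn, type_):
    """Type -> pages: rows of one PPN that have the given type flag set."""
    return [r for r in rows if r["ppn"] == ppn and r.get(PREFIX + type_)]


page_types = types_of_page(rows[1])
matches = pages_with_type(rows, "PPN12345", "illustration")
images = [r["fileGrp_DEFAULT_file_FLocat_href"] for r in matches]
```

This corresponds to the query "given ppn=PPN12345 and type=illustration, give me the matching pages (and the images)"; on a real DataFrame the same filter would be a boolean mask on the two columns.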

@mikegerber
Member Author

(Currently working in branch feat/page_info).

@mikegerber
Member Author

mikegerber commented Nov 23, 2023

Implementation is too slow: For sbb-mets-PPN821507109.xml (~1300 pages), it takes 80s to process... For now, I'll ignore this and improve later. It's probably all the XPath in here.

  • Write a test
  • Test how to improve this (use XPath class instead of .xpath()? Avoid XPath? Resolve links the other way around)
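
One way "resolving links the other way around" could look, as a rough sketch: read all smLinks once into a dict instead of searching the document again for every page. This is an assumption about a possible approach, not a description of what was actually changed; the helper names and the synthetic structLink are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

METS = "{http://www.loc.gov/METS/}"
XLINK = "{http://www.w3.org/1999/xlink}"


def make_struct_link(n_pages):
    """Build a synthetic structLink: each page links to the root and one section."""
    root = ET.Element(METS + "mets")
    struct_link = ET.SubElement(root, METS + "structLink")
    for i in range(n_pages):
        for log_id in ("LOG_0000", f"LOG_{i // 10 + 1:04d}"):
            ET.SubElement(struct_link, METS + "smLink",
                          {XLINK + "from": log_id, XLINK + "to": f"PHYS_{i:04d}"})
    return root


def logical_ids_slow(root, phys_id):
    """Per-page search: scans every smLink again for each page (quadratic overall)."""
    return [sm.get(XLINK + "from") for sm in root.iter(METS + "smLink")
            if sm.get(XLINK + "to") == phys_id]


def build_link_index(root):
    """Single pass: physical ID -> list of linked logical IDs."""
    index = defaultdict(list)
    for sm in root.iter(METS + "smLink"):
        index[sm.get(XLINK + "to")].append(sm.get(XLINK + "from"))
    return index


root = make_struct_link(50)
index = build_link_index(root)
```

After the one-time pass, each page lookup is a dict access, so the total work grows linearly with the number of links rather than with pages times links.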

@mikegerber
Member Author

mikegerber commented Nov 28, 2023

Structure types were given transitively in my test files, e.g. a page that had type cover_front and was part of a binding had smLinks to both the cover_front and the binding logical elements. This was not the case for all documents in @labusch's selection.

I had code that would sanitize this, i.e. walk up the tree and add the types. Coincidentally, this code was buggy and failed for a few hundred documents, which is how I noticed the (possible) inconsistency.

  • Investigate files
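
The "walk up the tree and add the types" step can be sketched as follows, assuming a made-up logical structMap (namespaces omitted for brevity) in which the page is linked only to the innermost element. The function name `types_with_ancestors` is hypothetical and this is not the code referred to above.

```python
import xml.etree.ElementTree as ET

# Made-up logical structMap; namespaces are omitted to keep the sketch short.
LOGICAL_XML = """
<structMap TYPE="LOGICAL">
  <div ID="LOG_0000" TYPE="monograph">
    <div ID="LOG_0001" TYPE="binding">
      <div ID="LOG_0002" TYPE="cover_front"/>
    </div>
  </div>
</structMap>
"""


def types_with_ancestors(struct_map, linked_ids):
    """Return the TYPEs of the linked divs plus the TYPEs of all their ancestor divs."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent = {child: p for p in struct_map.iter() for child in p}
    by_id = {d.get("ID"): d for d in struct_map.iter("div")}
    types = set()
    for log_id in linked_ids:
        node = by_id[log_id]
        # Walk up until we leave the div hierarchy (i.e. reach the structMap).
        while node is not None and node.tag == "div":
            types.add(node.get("TYPE"))
            node = parent.get(node)
    return types


struct_map = ET.fromstring(LOGICAL_XML)
# The page is only linked to cover_front, not to binding or monograph.
types = types_with_ancestors(struct_map, ["LOG_0002"])
```

With this normalization, a document whose smLinks are not transitive yields the same flags as one that links the page to every enclosing element explicitly.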

@mikegerber
Member Author

mikegerber commented Jul 25, 2024

Feature is now merged into master; may need more work on performance (VERY slow interpretation of the structure information, slow as in "takes 2 weeks to export")

See above.

@mikegerber
Member Author

mikegerber commented Jul 29, 2024

Saving per-page information (like this structure information) is now optional (--page-info).

@mikegerber
Member Author

> Implementation is too slow: For sbb-mets-PPN821507109.xml (~1300 pages), it takes 80s to process... For now, I'll ignore this and improve later. It's probably all the XPath in here.
>
> * [ ]  Write a test
> * [ ]  Test how to improve this (use XPath class instead of `.xpath()`? Avoid XPath? Resolve links the other way around)

I already fixed this months ago and forgot about it; mods4pandas is now two orders of magnitude faster.

@mikegerber
Member Author

> structMap[@TYPE="LOGICAL"]: should count the divs, grouped by their type. They are nested, so this needs to be accounted for.

This is done; nesting is not represented directly, but by a page having multiple structure types at once.
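
Counting the divs grouped by type can be sketched with `collections.Counter`. The sample structMap is made up (namespaces omitted for brevity), and `count_div_types` is a hypothetical helper, not the function used in the tool.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up logical structMap with nested divs; namespaces omitted for brevity.
LOGICAL_XML = """
<structMap TYPE="LOGICAL">
  <div ID="LOG_0000" TYPE="monograph">
    <div ID="LOG_0001" TYPE="section">
      <div ID="LOG_0002" TYPE="illustration"/>
      <div ID="LOG_0003" TYPE="illustration"/>
    </div>
  </div>
</structMap>
"""


def count_div_types(struct_map):
    """Count divs per TYPE; iter() visits nested divs at every depth."""
    return Counter(d.get("TYPE") for d in struct_map.iter("div"))


counts = count_div_types(ET.fromstring(LOGICAL_XML))
```

Because `iter()` descends through all levels, nested divs are counted without tracking the hierarchy itself.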
