This repository contains layout analysis and OCR of 17th-century books, encoded as TEI files, together with their ODD.
These files were created with the following pipeline:
- Segmentation and transcription with eScriptorium, using models from the datasetsOCRSegmenter17 GitHub repository
- Manual correction of the ALTO4 files exported from eScriptorium
- A Python script pipeline that transforms those ALTO4 files into a single TEI file (see the Extractor repository) and adds some metadata (extracted from the IIIF manifest and from SPARQL queries to data.bnf.fr).
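As an illustration of the last step, here is a minimal sketch of how line-level data could be read from an ALTO4 file before being converted to TEI. It is not the code of the Extractor repository: the function name `extract_lines` and the sample document are hypothetical, and the sketch assumes that each `TextLine` carries a `BASELINE` attribute and a single `String` child.

```python
import xml.etree.ElementTree as ET

# ALTO v4 namespace
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

# Hypothetical, heavily reduced ALTO4 document used only for this example.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="page_1" WIDTH="1000" HEIGHT="1500">
      <PrintSpace>
        <TextBlock ID="block_1" HPOS="100" VPOS="100" WIDTH="800" HEIGHT="200">
          <TextLine ID="line_1" BASELINE="100 150 900 150">
            <String CONTENT="Exemple de ligne"/>
          </TextLine>
          <TextLine ID="line_2" BASELINE="100 210 900 210">
            <String CONTENT="Seconde ligne"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def extract_lines(alto_xml):
    """Return one dict per TextLine: block id, line id, baseline, text."""
    root = ET.fromstring(alto_xml)
    lines = []
    for block in root.iterfind(".//alto:TextBlock", ALTO_NS):
        for line in block.iterfind("alto:TextLine", ALTO_NS):
            string = line.find("alto:String", ALTO_NS)
            lines.append({
                "block": block.get("ID"),
                "line": line.get("ID"),
                "baseline": line.get("BASELINE"),
                # A line without a String child yields an empty transcription.
                "text": string.get("CONTENT", "") if string is not None else "",
            })
    return lines


if __name__ == "__main__":
    for entry in extract_lines(SAMPLE_ALTO):
        print(entry)
```

Each returned dictionary keeps the identifiers needed to connect a transcription to its zone later on, which is the information the TEI file has to preserve.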
Each TEI file follows the TEI All schema as closely as possible.
It therefore contains:
- teiHeader: all the metadata retrieved from the IIIF manifest and the SPARQL queries, some information about the encoding (use of the SegmOnto vocabulary), and some information about the book's printer(s)
- facsimile: all the layout information about the different zones, lines, and baselines, with pixel coordinates and links to the IIIF images
- text: the full transcription, with each piece of text linked to the line it belongs to
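The link between the text and the facsimile can be followed programmatically. The sketch below is only an assumption about how such a file might look: it supposes that each line zone has an `xml:id` with `ulx`/`uly`/`lrx`/`lry` coordinates and that the transcription uses `lb` elements whose `facs` attribute points to the zone. The actual element and attribute choices in this corpus may differ, so check a file and the ODD before reusing it. The function name `link_lines_to_zones` and the sample are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, reduced TEI document; the real files are much richer.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <facsimile>
    <surface>
      <zone xml:id="line_1" type="DefaultLine" ulx="100" uly="110" lrx="900" lry="160"/>
      <zone xml:id="line_2" type="DefaultLine" ulx="100" uly="170" lrx="900" lry="220"/>
    </surface>
  </facsimile>
  <text><body><p><lb facs="#line_1"/>Exemple de ligne
<lb facs="#line_2"/>Seconde ligne</p></body></text>
</TEI>"""


def link_lines_to_zones(tei_xml):
    """Pair each transcribed line with the bounding box of its zone."""
    root = ET.fromstring(tei_xml)

    # Index every zone by its xml:id.
    boxes = {}
    for zone in root.iter(TEI + "zone"):
        boxes[zone.get(XML_ID)] = tuple(
            int(zone.get(name)) for name in ("ulx", "uly", "lrx", "lry")
        )

    # The text of a line is assumed to follow its <lb/> (the element's tail).
    pairs = []
    for lb in root.iter(TEI + "lb"):
        zone_id = lb.get("facs", "").lstrip("#")
        text = (lb.tail or "").strip()
        pairs.append((text, boxes.get(zone_id)))
    return pairs


if __name__ == "__main__":
    for text, box in link_lines_to_zones(SAMPLE_TEI):
        print(box, text)
```

A line whose `facs` target is missing from the facsimile is returned with `None` as its box, which makes broken links easy to spot.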
Documents have been encoded by Claire Jahan with the help of Simon Gabay, as part of the E-ditiones project.
Claire Jahan : claire.jahan[at]chartes.psl.eu
Simon Gabay : Simon.Gabay[at]unige.ch
Claire Jahan, Simon Gabay. 2021. CORPUS17+ - Corpus of TEI encoded 17th French prints. Paris/Geneva: ENS Paris/UniGE, 2021, https://github.com/Heresta/CORPUS17plus.