Conference paper

Analyzing Historical Legal Textcorpora: German VET and CVET regulations

Document type

Text/Conference Paper

Date

2024

Publisher

Gesellschaft für Informatik e.V.

Abstract

The digitization of historical documents has gained particular interest in recent years. The majority of research endeavors aim at digitizing historical documents by extracting text from scanned images. A pipeline that transcribes scanned documents into fully structured texts was utilized to digitize over 900 German VET and CVET regulations. As a preliminary investigation, a basic corpus analysis was conducted to assess the usability of the digitized documents and the necessity for document digitization methods that can generate transcripts that maintain the logical text structure and hierarchy. This paper focuses on the processing of the transcripts created from German VET and CVET regulation images to demonstrate the advantages of fully structured text over plain OCR results and to illustrate that even simple analyses require more information for more comprehensive document understanding.

Description

Reiser, Thomas; Dörpinghaus, Jens; Steiner, Petra (2024): Analyzing Historical Legal Textcorpora: German VET and CVET regulations. INFORMATIK 2024. DOI: 10.18420/inf2024_174. Bonn: Gesellschaft für Informatik e.V. ISSN: 2944-7682. PISSN: 1617-5468. EISSN: 2944-7682. ISBN: 978-3-88579-746-3. pp. 2007-2018. Digitalization and AI for and in Education and Educational Research (DAI-EaR'24). Wiesbaden, 24-26 September 2024.
