Online Lecture by Svenja Wagner: "In Medias Res - Semantische Informationsextraktion in philosophischen Texten anhand von Kants Gesamtwerk mithilfe von Transformer Modellen"
Finding relevant or specific passages in ever-growing amounts of data requires increasingly precise search methods. If the goal is not merely to match characters between the search input and candidate results but to compare their content, new methods are needed. Modern language models are of particular interest for such a "semantic comparison." Transformer models in particular promise good results with moderate computational effort. However, these models tend to perform worse on non-standard texts (e.g., dialectal, historical, or specialized ones) and are difficult to train on small data sets. Since the Kant corpus is not particularly large and consists of historical texts, a method was developed that makes it possible to train multiple models effectively on a small data set (without specific question-answer pairs), to evaluate them, and to offer a simple, understandable way of using them. The lecture presents this method and discusses potential pitfalls. It focuses on the underlying data processing, the theoretical background, and, finally, how the results are made available. The project is based on the master’s thesis "In Medias Res - Semantische Suche in philosophischen Texten anhand von Kants Gesamtwerk mithilfe von Transformer Modellen." Further information and access to the search tool can be found on the corresponding website.
Svenja Wagner is a research assistant in the research focus digitale_kultur at the FernUniversität in Hagen. She also works at the Trier Center for Digital Humanities on the project "Libraries of Princesses and Knowledge Practices in the German-speaking Area of the 18th Century," where she works primarily in research data management.