ATF: WTF?

Everything you always wanted to know about derived text formats but were afraid to ask.

Date:

November 12, 2025 to November 13, 2025

Place:

Virtual

Categories:

Event

Contact:

Dr. Keli Du

Derived text formats (ATFs, from the German “abgeleitete Textformate”) are an exciting tool for research in the humanities, but what exactly are they?

This workshop offers a practical introduction to the concept, possible applications, and legal and technical framework conditions of ATFs. Researchers from various fields will provide insights into specific workflows, use cases, and standards such as DIN 19461.

The first day (November 12) will focus on keynote speeches and discussions. On the second day (November 13, internal working group), teaching and learning materials will be developed collaboratively to establish the topic for the long term.

The workshop is aimed at anyone, especially within Text+, who works with digital texts and is curious about how to get more out of them – whether in research, teaching, or infrastructure development. The second day is aimed particularly at people in Text+, but other interested participants are welcome to join.

Wednesday, November 12

  • 1pm – 1:10pm: Welcome
  • 1:10pm – 1:30pm: What are derived text formats?
    Derived text formats are part of a strategy with which the digital humanities are responding to the following situation: Although copyright law allows for the analysis of many interesting, current text collections, it does not permit important open science practices for transparency, reproducibility, and reusability of data. Specifically, derived text formats are targeted transformations of full texts in which copyright-relevant information is removed, but in such a way that many DH methods can still be applied. These ATFs can therefore be made freely available to other researchers.

    Speaker: Christof Schöch

  • 1:30pm – 1:50pm: What can derived text formats be used for?
    This presentation will introduce the general application scenarios for ATFs as well as the following four steps: selection, preparation, application, and publication of ATFs. The aim of the presentation is to familiarize participants with ATF workflows so that they can select and use ATFs according to their research requirements.

    Speaker: Keli Du

  • 1:50pm – 2:40pm: What research questions can I answer with derived text formats?

    In this slot, we will use specific examples to present research questions that can be addressed with ATFs:

    • Derivatives in the meat grinder: ATFs for large language models
      Arden Zimmermann (German National Library)
      What legal and technical problems do copyright-protected texts pose in the training of large language models? How can the insatiable hunger for data of machine learning algorithms and researchers be satisfied without revealing the original texts to users? Findings from the CORAL project will be presented and solutions using ATFs will be highlighted.
    • Pulp novels as ATFs
      Fotis Jannidis/Leo Konle (University of Würzburg)
      Three analyses of pulp novels will be presented: the identification of cultural references, the geographical processing of place names, and the investigation of the introduction of new entities in science fiction narratives. The examples demonstrate how ATFs enable scalable insights into narrative patterns and the cultural frame of reference of the texts, and where the limits lie.
    • Sentiment analysis with ATFs
      Keli Du (University of Trier)
      Texts as ATFs can be used to fine-tune a BERT model for sentiment classification, and the accuracy of the classification can be maintained to a certain extent as long as the reduction of information stays within certain limits. For example, when 40% of tokens are replaced by POS tags, the readability and recognizability of the texts are drastically reduced, but the performance of sentiment classification is only slightly affected.
       
  • 2:40pm – 3:10pm: Coffee break
  • 3:10pm – 3:20pm: Standardized derivation: DIN 19461 for ATFs
    This presentation provides insight into the structure, objectives, and possible applications of DIN 19461, a standard for describing and classifying derived text formats. These formats make it possible to prepare digital texts in a legally compliant and technically usable way for research and analysis – for example, through the targeted reduction or enrichment of information. Concrete examples will be used to show how the standard helps to promote transparency and reusability in working with text data, especially in the tension between legal requirements and scientific reuse.

    Speaker: Thorsten Trippel (Eberhard Karls University of Tübingen)

  • 3:20pm – 3:40pm: What is the legal situation?
    In this presentation, we will provide an overview of the legal basis for the creation and use of ATFs. We will first address the question of which copyright restrictions allow the conversion of copyrighted material into an ATF, and then look at the conditions under which an ATF is no longer subject to copyright protection and can therefore be used freely.

    Speakers: Gianna Iacino, Paweł Kamocki

  • 3:40pm – 4:30pm: Workflows for the production of ATFs
    • Publication workflows for ATFs
      José Calvo Tello, Mathias Göbel, Florian Barth (Göttingen State and University Library)
      This presentation will showcase two projects that have modeled ATFs in TEI and published them in the TextGrid Repository (TGR): “American Drama Corpus” and “CoNSSA: Corpus of Novels of the Spanish Silver Age.” One advantage of TEI documents is that they contain different types of data – text, structure, metadata, annotations, etc. – and these structural annotations can be mapped in ATFs. The presentation will also discuss how new publication workflows and specific metadata parameters improve the findability of ATFs.
    • Methods for creating ATFs
      Keli Du (University of Trier), Florian Barth (Göttingen State and University Library)
      The presentation will explain key methods for creating ATFs based on a combination of specific operations (“replace,” “randomize,” “keep”) at different levels of granularity in the text (e.g., sentences, clauses). Depending on the operation, certain structural elements can be replaced (e.g., POS tags). We demonstrate the concrete implementation of the methods using a component in the community-based NLP pipeline MONAPipe.
    • Evaluation of reconstructability and basic research
      Philippe Genêt (German National Library)
      Using pulp novels as an example, this presentation illustrates how experimental evaluation can help find a middle-ground ATF that serves researchers' interests while protecting copyright. As an outlook, the DFG pilot project “Researching with Derivatives” will be presented; it aims to systematically identify suitable ATFs for specific research questions and evaluate them from a legal perspective.
  • 4:30pm – 5:00pm: Final discussion Day 1
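
For orientation, the operations named in the session “Methods for creating ATFs” (“keep,” “replace,” “randomize”) and the 40% POS-tag replacement mentioned in the sentiment analysis talk can be illustrated with a minimal Python sketch. This is not the MONAPipe component or any speaker's implementation: the function names, the hand-tagged example sentence, and the choice of token-level granularity are assumptions made purely for illustration.

```python
import random

# Hypothetical, hand-tagged example sentence as (token, POS tag) pairs.
# In practice the tags would come from an NLP pipeline.
tagged = [
    ("The", "DET"), ("old", "ADJ"), ("captain", "NOUN"), ("slowly", "ADV"),
    ("opened", "VERB"), ("the", "DET"), ("door", "NOUN"), ("again", "ADV"),
    ("today", "ADV"), (".", "PUNCT"),
]


def keep(pairs):
    """'keep': return the tokens unchanged."""
    return [token for token, _ in pairs]


def replace_with_pos(pairs, fraction, seed=0):
    """'replace': substitute a fraction of the tokens with their POS tags."""
    rng = random.Random(seed)  # fixed seed so the derivation is reproducible
    n_replace = round(len(pairs) * fraction)
    chosen = set(rng.sample(range(len(pairs)), n_replace))
    return [pos if i in chosen else token
            for i, (token, pos) in enumerate(pairs)]


def randomize(tokens, seed=0):
    """'randomize': shuffle token order within the given unit (here: one sentence)."""
    rng = random.Random(seed)
    shuffled = list(tokens)  # copy, so the input list is left untouched
    rng.shuffle(shuffled)
    return shuffled


atf_replaced = replace_with_pos(tagged, fraction=0.4)
atf_shuffled = randomize(keep(tagged))

print(" ".join(atf_replaced))
print(" ".join(atf_shuffled))
```

With ten tokens and a fraction of 0.4, four tokens are replaced by their tags. Shuffling preserves word frequencies but destroys word order, so the two derived versions remove different kinds of information from the text.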

Thursday, November 13

  • 9:00am – 12:00pm: Creation of teaching, learning, and documentation materials
    • Blog posts on the topics from Day 1
    • Learning materials on the topics from Day 1