Zeta and Company

Measures of Distinctiveness for Computational Literary Studies

Project Management: Prof Dr Christof Schöch (Universität Trier - Computerlinguistik & Digital Humanities; Universität Trier - Trier Center for Digital Humanities (TCDH))

Project Participants: Universität Trier - Trier Center for Digital Humanities (TCDH)

Sponsors: Deutsche Forschungsgemeinschaft (DFG)

Duration: 2020 - 2023

Contact person (TCDH): Prof Dr Christof Schöch; Dr Joëlle Weis

References:

Schöch, Christof: Zeta für die kontrastive Analyse literarischer Texte. Theorie, Implementierung, Fallstudie. In: Quantitative Ansätze in den Literatur- und Geisteswissenschaften. Systematische und historische Perspektiven, ed. by Toni Bernhart, Sandra Richter, Marcel Lepper, Marcus Willand, and Andrea Albrecht, pp. 77–94. Berlin: de Gruyter, 2018. Open Access.

Research Area: Digital Literary and Cultural Studies

Keywords: Quantitative Analysis

Technologies:

Python

XML

Website of the Project: The project website

Measures of distinctiveness are used to identify those words (or other features) that are characteristic of one group of texts in comparison with a second group. This project is concerned with the modeling, implementation, evaluation, use and dissemination of different types of distinctiveness measures that can be used in Digital Literary Studies.

Comparison as a methodological and epistemological paradigm is deeply rooted in the humanities. Whether in qualitative or quantitative research, it is through comparison that similarities and differences, affinities and contrasts can be highlighted; comparison sharpens the observer's eye, and analyses gain in contour and expressiveness. Against this background, the project aims to improve our understanding of methods for the quantitative, comparative analysis of two or more text collections in the field of Digital Literary Studies.

The focus will be on a central method in the field of quantitative, comparative analysis: statistical measures of distinctiveness (also called 'keyness' measures), which allow researchers to determine features (e.g. word forms or word types) that are characteristic of one text group in comparison to another text group. A wide range of statistical measures of distinctiveness has been developed in different fields such as Information Retrieval, Computational Linguistics and Computational Literary Studies. At least three types of measures can be distinguished, each based on different information. The first type compares the relative frequencies of features in each of the two text groups (for example, the log-likelihood test). The second type compares the distributions of feature frequencies across the individual texts of both text groups (for example, the t-test). The third type examines the dispersion of features across the texts in each group, that is, it compares how evenly the features are distributed in each group of texts (for example, Zeta).
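The three types of measures can be illustrated with a minimal Python sketch. It is not the project's implementation: the function names, the toy corpus, and the simplifications are assumptions made for illustration. The log-likelihood function uses the two-term form common in corpus linguistics, the t-test is Welch's variant, and Zeta is computed as the difference in document proportions, with each whole text treated as one segment (in practice, texts are usually split into segments of equal length first).

```python
import math
from statistics import mean, variance


def log_likelihood(count_a, total_a, count_b, total_b):
    """Type 1: compare pooled frequencies of one feature in two groups (G2)."""
    pooled = (count_a + count_b) / (total_a + total_b)
    g2 = 0.0
    for observed, total in ((count_a, total_a), (count_b, total_b)):
        expected = total * pooled
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def welch_t(rel_freqs_a, rel_freqs_b):
    """Type 2: compare per-text relative frequencies (needs >= 2 texts per group)."""
    std_err = math.sqrt(
        variance(rel_freqs_a) / len(rel_freqs_a)
        + variance(rel_freqs_b) / len(rel_freqs_b)
    )
    if std_err == 0:
        return 0.0
    return (mean(rel_freqs_a) - mean(rel_freqs_b)) / std_err


def zeta(texts_a, texts_b, feature):
    """Type 3: difference in the share of texts that contain the feature."""
    share_a = sum(feature in text for text in texts_a) / len(texts_a)
    share_b = sum(feature in text for text in texts_b) / len(texts_b)
    return share_a - share_b


# Hypothetical toy corpus: each text is a list of tokens.
group_a = [
    ["the", "detective", "found", "the", "clue"],
    ["a", "detective", "asked", "a", "question"],
    ["the", "night", "was", "dark"],
]
group_b = [
    ["the", "garden", "was", "quiet"],
    ["a", "letter", "arrived"],
    ["the", "detective", "smiled"],
]

word = "detective"
count_a = sum(text.count(word) for text in group_a)
count_b = sum(text.count(word) for text in group_b)
total_a = sum(len(text) for text in group_a)
total_b = sum(len(text) for text in group_b)
rel_a = [text.count(word) / len(text) for text in group_a]
rel_b = [text.count(word) / len(text) for text in group_b]

print("log-likelihood:", log_likelihood(count_a, total_a, count_b, total_b))
print("Welch t:", welch_t(rel_a, rel_b))
print("Zeta:", zeta(group_a, group_b, word))
```

The three functions consume different inputs on purpose: pooled counts, per-text relative frequencies, and per-text presence or absence. This mirrors the point that each type of measure relies on different information about the same feature.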

In order to achieve a deeper understanding of the different measures of distinctiveness and to propose improvements in their implementation and application, we will create and publish appropriate reference corpora, analyze a wide range of existing measures of distinctiveness, determine and compare their statistical properties and formally present them in a common conceptual model. Based on this model, we will implement these measures in a common framework; we will also apply various evaluation strategies to empirically determine and compare the characteristics and performance of the measures. Furthermore, we will apply them to different subgenres of contemporary French fiction (canonized novels compared to popular literature such as crime novels, romance novels and science fiction novels) in a detailed application study. We will also disseminate the main results of the study in academic publications and in the form of an interactive, educational web portal.