Investigating Measures of Distinctiveness for the Genre-Based Classification of Entire Novels

SIG DLS Workshop at DH2023

DH2023, Graz, Workshop Venue 9

1:30pm – 5:30pm




Keli Du
Seven years have passed since the founding of the ADHO Special Interest Group “Digital Literary Stylistics” (SIG-DLS). In the meantime, the research field has changed with the advent of new methodologies (drawing on the constantly growing field of computational linguistics) and the establishment of emerging theories and definitions (such as the modeling paradigm in “computational literary studies”). At the same time, some reference points have remained stable, such as the continued success of frequency-based approaches and the pivotal role of statistical methods. With this workshop, we would like to provide an updated overview of the field of Digital Literary Stylistics, in dialogue with the seminal workshop organized at DH2016, which led to the founding of the SIG-DLS. The team of the project 'Zeta und Konsorten' will also present its work at the workshop.

The study by Du et al. (2022) aimed to evaluate various measures of distinctiveness (or keyness, see e.g. Lijffijt et al. 2014, Paquot & Bestgen 2009), such as Zeta and Welch's t-test, for a classification task in the context of Computational Literary Studies. In that study, the distinctive words identified by each measure were used as features for text classification, in which segments of novels were classified by subgenre. Classification was performed at the level of novel segments on the assumption that human readers can determine the genre of a novel from one or a few paragraphs, without reading the entire novel.
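As an illustration of one of the measures named above, the following is a minimal sketch of Welch's t statistic applied to a single word. It assumes that the input is the word's relative frequency in each text of a target group and of a comparison group; the function name `welch_t` and the toy values are hypothetical and are not taken from Du et al. (2022), whose implementation may differ.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(target_freqs, comparison_freqs):
    """Welch's t statistic for one word's relative frequencies in two groups.

    Hypothetical helper for illustration; each group needs at least two values.
    """
    n1, n2 = len(target_freqs), len(comparison_freqs)
    # Sample variances (n - 1 in the denominator), not pooled across groups.
    v1, v2 = variance(target_freqs), variance(comparison_freqs)
    standard_error = sqrt(v1 / n1 + v2 / n2)
    if standard_error == 0:
        raise ValueError("both groups have zero variance; t is undefined")
    return (mean(target_freqs) - mean(comparison_freqs)) / standard_error


# Toy relative frequencies (invented), e.g. occurrences per 1,000 tokens.
target = [4.0, 5.0, 6.0]
comparison = [1.0, 2.0, 3.0]
t_value = welch_t(target, comparison)
```

A large positive value suggests the word is more frequent in the target group, a large negative value that it is more frequent in the comparison group; ranking words by this score gives one possible list of distinctive words.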

The results showed that, when only a small number of features was used, dispersion-based measures (like Zeta) identified distinctive words more effectively and produced better classification results than frequency-based measures (like the log-likelihood ratio test). However, the study left open the question of how effective measures of distinctiveness are for classifying entire novels rather than novel segments.
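To make the idea of a dispersion-based measure concrete, here is a minimal sketch of Zeta in its basic document-proportion form: the share of target segments containing a word minus the share of comparison segments containing it. This formulation is an assumption, since the text above gives no formula, and the variant evaluated by Du et al. (2022) may differ. The function names and toy segments are hypothetical.

```python
def document_proportion(word, segments):
    """Share of segments (each given as a set of word types) that contain `word`."""
    return sum(1 for seg in segments if word in seg) / len(segments)


def zeta(word, target_segments, comparison_segments):
    """Zeta as a difference of document proportions; ranges from -1 to 1."""
    return (document_proportion(word, target_segments)
            - document_proportion(word, comparison_segments))


# Toy data (invented for illustration): each segment reduced to its word types.
crime = [{"detective", "murder", "night"}, {"detective", "clue"}, {"murder", "police"}]
romance = [{"love", "night"}, {"heart", "love"}, {"love", "detective"}]

# Rank the shared vocabulary from most to least distinctive for the crime group.
vocabulary = set().union(*crime, *romance)
ranked = sorted(vocabulary, key=lambda w: zeta(w, crime, romance), reverse=True)
```

Because the score depends only on how many segments contain a word, not on how often it occurs within them, a word concentrated in a single segment scores low even if its raw frequency is high. The top and bottom of `ranked` could then serve as classification features.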

To address this question, we adopted a strategy closely modeled on the previous work to ensure comparability of results, departing from it in only one crucial parameter: the unit of classification is the entire novel rather than the novel segment. By evaluating the effectiveness of different measures of distinctiveness for the classification of entire novels, our study aims to provide further insights into the use of stylometric methods for genre analysis in Computational Literary Studies (e.g. Calvo Tello 2021, Henny-Krahmer 2023).

Projects: Zeta and Company

Hashtags: #DH2023