Investigating Measures of Distinctiveness for the Genre-Based Classification of Entire Novels
SIG DLS Workshop at DH2023
The study by Du et al. (2022) evaluated various measures of distinctiveness (or keyness, see e.g. Lijffijt et al. 2014, Paquot & Bestgen 2009), such as Zeta and Welch's t-test, in a classification task in the context of Computational Literary Studies. Distinctive words identified by each measure were used as features for text classification, in which segments of novels were classified by subgenre. Classification was performed at the level of novel segments on the assumption that human readers can determine the genre of a novel from one or several paragraphs, without reading the entire novel.
The results showed that, when only a small number of features was used, dispersion-based measures (such as Zeta) identified distinctive words more effectively and yielded better classification results than frequency-based measures (such as the log-likelihood ratio test). However, the study left open the question of how effective measures of distinctiveness are for classifying entire novels rather than novel segments.
To address this question, our study follows a strategy closely modeled on the previous work, so that results remain comparable, but departs from it in one crucial parameter: entire novels are classified instead of novel segments. By evaluating the effectiveness of different measures of distinctiveness for the classification of entire novels, we aim to provide further insights into the use of stylometric methods for genre analysis in Computational Literary Studies (e.g. Calvo Tello 2021, Henny-Krahmer 2023).
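To illustrate what a dispersion-based measure does, the following minimal sketch computes a Zeta-style score as the difference between the proportion of target documents and the proportion of comparison documents in which a word occurs at least once. This formulation is an assumption, since the abstract does not state which Zeta variant was used. The function names and the toy documents are hypothetical and serve illustration only.

```python
from collections import Counter


def document_proportions(docs):
    """Return, for each word, the fraction of documents containing it."""
    if not docs:
        raise ValueError("docs must contain at least one document")
    counts = Counter()
    for doc in docs:
        # set() ensures each word is counted once per document,
        # so the score reflects dispersion rather than raw frequency
        counts.update(set(doc))
    return {word: n / len(docs) for word, n in counts.items()}


def zeta_scores(target_docs, comparison_docs):
    """Zeta-style score: document proportion in target minus in comparison."""
    dp_target = document_proportions(target_docs)
    dp_comparison = document_proportions(comparison_docs)
    vocabulary = set(dp_target) | set(dp_comparison)
    return {
        word: dp_target.get(word, 0.0) - dp_comparison.get(word, 0.0)
        for word in vocabulary
    }


def top_distinctive(scores, n):
    """The n words most distinctive for the target group (ties broken A-Z)."""
    return sorted(scores, key=lambda word: (-scores[word], word))[:n]


# Hypothetical toy data: each document is a list of tokens.
target = [
    ["murder", "detective", "night"],
    ["detective", "clue", "night"],
]
comparison = [
    ["love", "night", "letter"],
    ["love", "garden", "letter"],
]

scores = zeta_scores(target, comparison)
features = top_distinctive(scores, 1)
```

In a setup like the one described above, the top-ranked words would serve as classification features. Here each document could stand for either a novel segment or an entire novel, which is the parameter that distinguishes the two studies.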