Deviation of Proportions as the Basis for a Keyness Measure

Lecture by Keli Du, Julia Dudar, Cora Rok and Christof Schöch at the "43. Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft (DGfS): Modell und Evidenz"

Date:

25.02.2022 to 26.02.2022

Place:

Online

Categories:

Event

Contact:

Keli Du

In corpus linguistics, numerous statistical measures and instruments have been adopted to investigate and analyze large amounts of textual data, especially from a contrastive perspective (e.g. Rayson et al. 1997; Oakes & Farrow 2007; Newman et al. 2008). Despite several important studies (e.g. Paquot & Bestgen 2009; Lijffijt et al. 2014), there is still a lack of in-depth understanding of the key characteristics of these measures and of how these characteristics affect the results. In our project "Zeta and Company" we aim to enhance our understanding of the statistical keyness measures that are used for the comparative, quantitative analysis of two or more text collections. Working with literary texts, we will implement these measures in a Python framework and evaluate which of them perform best for different tasks and kinds of textual data.

The most widely used statistical keyness measures (chi-squared, log-likelihood, etc.) are based on word frequency and do not consider how a word is distributed within a corpus. As a result, a word can appear to be important for the whole corpus even though it is merely used very frequently in a small number of its texts. To address this problem, several dispersion measures have been suggested (Lyne 1985). Stefan Gries (2008) gives a detailed overview of such measures and proposes his own, deviation of proportions (DP). For each document in a corpus, DP compares the observed share of a word's occurrences with the share expected from that document's size, and it sums up these differences to quantify how evenly the word is dispersed. This measure seems to have several advantages over other dispersion measures: for example, it can handle corpus parts of different sizes, it is simple, and it can distinguish between slight variations in distribution without being overly sensitive.
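To make the definition of DP concrete, the following is a minimal Python sketch of the measure as it is usually stated: half the sum, over all documents, of the absolute difference between the observed and the expected proportion. The function name, the tokenized toy corpus and the variable names are assumptions made for illustration; this is not the implementation used in the project's keyness framework.

```python
def deviation_of_proportions(documents, word):
    """Return DP for `word` over a list of tokenized documents.

    DP = 0.5 * sum(|observed_i - expected_i|), where expected_i is the share
    of the whole corpus made up by document i and observed_i is the share of
    the word's occurrences that fall into document i.
    """
    sizes = [len(doc) for doc in documents]
    counts = [doc.count(word) for doc in documents]
    total_size = sum(sizes)
    total_count = sum(counts)
    if total_size == 0 or total_count == 0:
        raise ValueError("word does not occur in the corpus")
    return 0.5 * sum(
        abs(count / total_count - size / total_size)
        for count, size in zip(counts, sizes)
    )


# Toy corpus invented for illustration: three documents of equal length.
docs = [
    ["le", "crime", "le", "inspecteur"],
    ["le", "amour", "le", "coeur"],
    ["le", "vaisseau", "le", "robot"],
]
dp_even = deviation_of_proportions(docs, "le")        # spread evenly over all documents
dp_clumped = deviation_of_proportions(docs, "crime")  # occurs in one document only
```

Values close to 0 indicate that a word is distributed as evenly as the document sizes would lead one to expect, whereas values approaching 1 indicate that its occurrences are concentrated in a small part of the corpus.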

However, there is still a lack of empirical evidence supporting the use of DP. For this contribution, we will implement this measure of dispersion in our keyness framework (see Schöch et al. 2018; for a use of dispersion, though not of DP, in keyness analysis, see Egbert & Biber 2019). First, using a collection of 160 French novels from the 1980s belonging to four subgenres (sentimental novels, crime fiction, science fiction and high-brow novels), we will examine how DP behaves with varying numbers of texts, varying numbers of words and varying proportions of particular words in the corpus. For example, we aim to understand DP better by examining whether DP values change when the number of texts increases and whether DP values correlate with relative word frequencies. One of the open questions about dispersion is whether it can be used to compare two collections of texts, especially when document length varies. Therefore, we will also investigate how useful DP is as a basis for keyword extraction in contrastive analysis.
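One of the checks mentioned above, whether DP values correlate with relative word frequencies, could be set up roughly as in the following Python sketch. The helper names (`dp`, `pearson`, `dp_frequency_correlation`), the choice of Pearson's coefficient and the toy corpus are assumptions made for illustration only; the abstract does not specify how the correlation will actually be measured.

```python
from math import sqrt


def dp(documents, word):
    """DP of `word`: half the summed absolute differences between the
    observed and the expected proportions per document."""
    sizes = [len(doc) for doc in documents]
    counts = [doc.count(word) for doc in documents]
    total_count, total_size = sum(counts), sum(sizes)
    return 0.5 * sum(
        abs(c / total_count - s / total_size) for c, s in zip(counts, sizes)
    )


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def dp_frequency_correlation(documents):
    """Correlate each word's relative corpus frequency with its DP value."""
    corpus_size = sum(len(doc) for doc in documents)
    vocabulary = sorted({word for doc in documents for word in doc})
    rel_freqs = [
        sum(doc.count(word) for doc in documents) / corpus_size
        for word in vocabulary
    ]
    dp_values = [dp(documents, word) for word in vocabulary]
    return pearson(rel_freqs, dp_values)


# Toy corpus invented for illustration; real input would be the tokenized novels.
docs = [
    ["le", "crime", "le", "inspecteur"],
    ["le", "amour", "le", "coeur"],
    ["le", "vaisseau", "le", "robot"],
]
r = dp_frequency_correlation(docs)
```

A rank-based coefficient such as Spearman's rho may be the better choice for real word frequency data, which are heavily skewed; Pearson is used here only to keep the sketch free of dependencies.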


Projects: Zeta and Company

Keywords: Quantitative Analysis