Full Text Digitization

Making Basic Works Searchable

Digitizing basic works in the humanities requires recording the text as accurately as possible. This can be achieved either with fully automatic text recognition software (Optical Character Recognition, OCR) or through manual double keying. Both methods have advantages and disadvantages, depending on the text. OCR is usually cheaper and faster, but may not provide sufficient accuracy for older prints (even 99.9% can be too low here). Double keying is considerably more expensive and time-consuming in practice, but normally achieves a recording quality of nearly 100%. Since the sources recorded in our projects are templates with complex typography and layout, we usually use double keying and have worked with our reliable and experienced partner 'TQY DoubleKey' in Nanjing (PR China) since the foundation of the TCDH. A major advantage of this cooperation is that the Chinese data typists, accustomed to the complexity and delicacy of their own script, recognize even the finest differences in fonts and characters and, as non-native speakers, do not make unintentional corrective “improvements”.

Two teams working independently of each other each produce a complete digital copy of the template. In addition to the actual text content, all typographical features such as italics, letter-spacing, superscripts and subscripts, font size changes etc. are reproduced using unambiguous codes. The original line, column and page breaks are also reproduced. This character and page encoding yields a diplomatic reproduction of the original. After double keying, both versions are automatically compared with one another, and a line-by-line synoptic difference report is created. The recorded differences between the first and second version are checked manually against the original and merged into a final full text. The only errors that remain are those that both typists made at the same point and in the same form, which the automatic comparison cannot detect. Random quality checks show that the resulting text versions have an accuracy of at least 99.997% (i.e. no more than 3 errors per 100,000 characters are to be expected).
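
The automatic comparison step can be illustrated with a minimal Python sketch. It is not the software actually used in the workflow. It assumes that both keyings preserve the original line breaks, so that line n of one version corresponds to line n of the other. The function name difference_report and the sample lines are hypothetical.

```python
from itertools import zip_longest


def difference_report(first_keying, second_keying):
    """List the lines on which two independent keyings disagree.

    Assumption: both versions keep the original line breaks, so the
    lines can be compared pairwise by position. Returns a list of
    (line_number, first_version, second_version) tuples; a missing
    line is reported as None.
    """
    pairs = zip_longest(
        first_keying.splitlines(),
        second_keying.splitlines(),
        fillvalue=None,
    )
    report = []
    for line_number, (first, second) in enumerate(pairs, start=1):
        if first != second:
            report.append((line_number, first, second))
    return report


# Hypothetical sample: the second team misread "Haus" as "Hans" in line 2.
team_a = "Das ist ein Text.\nDas Haus ist alt."
team_b = "Das ist ein Text.\nDas Hans ist alt."

for line_number, first, second in difference_report(team_a, team_b):
    print(f"line {line_number}: {first!r} <> {second!r}")
```

Each reported line would then be checked by a person against the original. If both teams had typed "Hans", the two lines would be identical and the report would stay empty, which is exactly the kind of residual error that the automatic comparison cannot detect.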

Examples for the Full-Text Digitization of Complex Templates

“German Dictionary by Jacob Grimm and Wilhelm Grimm”: data volume 33 volumes (DTV edition) with approx. 300,000,000 characters, acquisition costs approx. 170,000 €, acquisition time approx. 18 months

“Economic Encyclopedia by Johann Georg Krünitz”: data volume 242 volumes with approx. 240,000,000 characters, 90% of which in Fraktur, acquisition costs approx. 150,000 €, acquisition time approx. 12 months
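
To put these figures in relation to the accuracy rates mentioned above, the following Python sketch estimates the number of residual errors and the cost per 1,000 characters for both works. It only uses the approximate figures given in this text. Treating an accuracy rate as a uniform error rate across a whole work is a simplifying assumption, and the names WORKS, expected_errors and cost_per_1000_chars are hypothetical.

```python
# Approximate figures as stated in the examples above.
WORKS = {
    "Grimm": {"characters": 300_000_000, "cost_eur": 170_000},
    "Kruenitz": {"characters": 240_000_000, "cost_eur": 150_000},
}

# Error rates expressed as errors per 100,000 characters:
# 99.9% accuracy   -> 100 errors per 100,000 characters
# 99.997% accuracy ->   3 errors per 100,000 characters
OCR_ERRORS_PER_100K = 100
DOUBLE_KEYING_ERRORS_PER_100K = 3


def expected_errors(characters, errors_per_100k):
    """Expected number of errors, assuming a uniform error rate."""
    return characters * errors_per_100k // 100_000


def cost_per_1000_chars(cost_eur, characters):
    """Acquisition cost in euros per 1,000 recorded characters."""
    return cost_eur / characters * 1000


for name, work in WORKS.items():
    chars = work["characters"]
    at_ocr_rate = expected_errors(chars, OCR_ERRORS_PER_100K)
    at_keying_rate = expected_errors(chars, DOUBLE_KEYING_ERRORS_PER_100K)
    cost = cost_per_1000_chars(work["cost_eur"], chars)
    print(f"{name}: {at_ocr_rate} errors at 99.9%, "
          f"at most {at_keying_rate} at 99.997%, "
          f"{cost:.3f} EUR per 1,000 characters")
```

Under these assumptions, an accuracy of 99.9% would leave roughly 300,000 errors in a work of 300,000,000 characters, compared with at most about 9,000 at 99.997%.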