Automated re-typesetting, indexing and content enhancement for scanned marriage registers
For much of England and Wales marriage registers began to be kept in 1537. The marriage details were recorded locally, and in longhand, until 1st July 1837, when central records began. All registers were kept in the local parish church. In the period from 1896 to 1922 an attempt was made, by the Phillimore company of London, using volunteer help, to transcribe marriage registers for as many English parishes as possible and to have them printed. This paper describes an experiment in the automated re-typesetting of Volume 2 of the 15-volume Phillimore series relating to the county of Derbyshire. The source material was plain text derived from running Optical Character Recognition (OCR) on a set of page scans taken from the original printed volume. The aim of the experiment was to avoid any idea of labour-intensive page-by-page rebuilding with tools such as Acrobat Capture. Instead, it proved possible to capitalise on the regular, tabular, structure of the Register pages as a means of automating the re-typesetting process, using UNIX troff software and its tbl preprocessor. A series of simple software tools helped to bring about the OCR-to-troff transformation. However, the re-typesetting of the text was not just an end in itself but, additionally, a step on the way to content enhancement and content repurposing. This included the indexing of the marriage entries and their potential transformation into XML and GEDCOM notations. The experiment has shown, for highly regular material, that the efforts of one programmer, with suitable low-level tools, can be far more effective than attempting to recreate the printed material using WYSIWYG software.