An analysis of the named entity recognition problem in digital library metadata
Information resources in digital libraries are usually described, along with their context, by structured data records, commonly referred as metadata. Those records often contain unstructured information in natural language text, since they typically follow a data model which defines generic semantics for its data elements, or includes data elements modeled to contain free text. The information contained in these data elements, although machine readable, resides in unstructured natural language texts that are difficult to process by computers. This paper addresses a particular task of information extraction, typically called named entity recognition, which deals with the references to entities made by names occurring in the texts. This paper presents the results of a study of how the named entity recognition problem manifests itself in digital library metadata. In particular, we present the main differences between performing named entity recognition in natural language and in the text within metadata. The paper finalizes with a novel approach for named entity recognition in metadata.