W7 MODEL OF PROVENANCE AND ITS USE IN THE CONTEXT OF WIKIPEDIA
Data provenance refers to the lineage or pedigree of data, including information such as its origin and the key events that affect it over the course of its lifecycle. In recent years, provenance has become increasingly important as more and more people use data that they did not generate themselves. Tracking data provenance helps ensure that data from many different providers and sources can be trusted and used appropriately. Data provenance also has several other critical uses, including data quality assessment, generation of data replication recipes, and data security management.
One of the major objectives of our research is to investigate the semantics, or meaning, of data provenance. We describe a generic ontology of data provenance, called the W7 model, that represents these semantics. Formalized in the conceptual graph formalism, the W7 model represents provenance as a combination of seven interconnected elements: "what," "when," "where," "how," "who," "which," and "why." The W7 model is designed to be general and comprehensive enough to cover a broad range of provenance-related vocabularies. However, no matter how comprehensive it is, the W7 model alone is insufficient for capturing all domain-specific provenance requirements. We therefore present a novel approach to developing domain ontologies of provenance. This approach relies on various conceptual graph mechanisms, including schema definitions and canonical formation rules, and enables us to easily adapt and extend the W7 model into domain ontologies of provenance. The W7 model has been widely adopted and adapted for use within Raytheon Missile Systems and the iPlant Collaborative, as well as in the US Army's ATRAP IV (Asymmetric Threat Response and Analysis Program) system.
We also developed a domain ontology of provenance for Wikipedia based on the W7 model. This domain ontology enables us to extract provenance for each Wikipedia article.
We present a study that uses this provenance to assess the quality of Wikipedia articles. Assessing and guaranteeing data quality has become a critical concern that, to a large extent, determines the future success and survival of Wikipedia. Since its launch in 2001, the quality of Wikipedia has been continuously called into question because of various incidents of vandalism and misinformation. Our study shows that the quality of Wikipedia articles depends not only on the different types of contributors but also on how they collaborate. We identify a number of contributor roles based on the provenance. Using these roles and the provenance, we then identify several collaboration patterns that are beneficial or detrimental to data quality, thus providing insights for designing tools and mechanisms to improve Wikipedia article quality.