An analysis of human factors and label accuracy in crowdsourcing relevance judgments
Crowdsourcing relevance judgments for the evaluation of search engines is increasingly used to overcome the scalability issues that hinder traditional approaches relying on a fixed group of trusted expert judges. However, the benefits of crowdsourcing come with risks due to the engagement of a self-forming group of individuals (the crowd), motivated by different incentives, who complete the tasks with varying levels of attention and success. This increases the need for careful design of crowdsourcing tasks that attracts the right crowd for the given task and promotes quality work. In this paper, we describe a series of experiments using Amazon’s Mechanical Turk, conducted to explore the ‘human’ characteristics of the crowds involved in a relevance assessment task. In the experiments, we vary the level of pay offered, the effort required to complete a task, and the qualifications required of the workers. We observe the effects of these variables on the quality of the resulting relevance labels, measured by agreement with a gold set, and correlate them with self-reported measures of various human factors. We elicit information from the workers about their motivations, interest in and familiarity with the topic, perceived task difficulty, and satisfaction with the offered pay. We investigate how these factors combine with aspects of the task design and how they affect the accuracy of the resulting relevance labels. Based on the analysis of 960 HITs and 2,880 HIT assignments resulting in 19,200 relevance labels, we arrive at insights into the complex interaction of the observed factors and provide practical guidelines for crowdsourcing practitioners. In addition, we highlight challenges in the data analysis that stem from the peculiarities of the crowdsourcing environment, where the sample of individuals engaged under specific work conditions is inherently influenced by the conditions themselves.