Objects predict fixations better than early saliency.
Humans move their eyes while looking at scenes and pictures. Eye movements correlate with shifts in attention and are thought to be a consequence of optimal resource allocation for high-level tasks such as visual recognition. Models of attention, such as "saliency maps," are often built on the assumption that "early" features (color, contrast, orientation, motion, and so forth) drive attention directly. We explore an alternative hypothesis: Observers attend to "interesting" objects. To test this hypothesis, we measure the eye position of human observers while they inspect photographs of common natural scenes. Our observers perform different tasks: artistic evaluation, analysis of content, and search. Immediately after each presentation, our observers are asked to name objects they saw. Weighted with recall frequency, these objects predict fixations in individual images better than early saliency, irrespective of task. Also, saliency combined with object positions predicts which objects are frequently named. This suggests that early saliency has only an indirect effect on attention, acting through recognized objects. Consequently, rather than treating attention as mere preprocessing step for object recognition, models of both need to be integrated.