A new cognitive architecture for artificial vision is proposed. The architecture is intended for an autonomous intelligent system, and several cognitive hypotheses have been postulated as guidelines for its design. The design is based on a conceptual representation level that lies between the subsymbolic level, which processes the sensory data, and the linguistic level, which describes scenes by means of a high-level language. The architecture also relies on the active role of a focus-of-attention mechanism in the link between the conceptual and linguistic levels. This link is modelled as a time-delay attractor neural network.