
Visualising biological data: a semantic approach to tool and database integration.

BMC bioinformatics, Vol. 10 Suppl 6 (2009)

Abstract

MOTIVATION: In the biological sciences, the need to analyse vast amounts of information has become commonplace. Such large-scale analyses often involve drawing together data from a variety of different databases, held remotely on the internet or locally on in-house servers. Supporting these tasks are ad hoc collections of data-manipulation tools, scripting languages and visualisation software, which are often combined in arcane ways to create cumbersome systems that have been customized for a particular purpose, and are consequently not readily adaptable to other uses. For many day-to-day bioinformatics tasks, the sizes of current databases, and the scale of the analyses necessary, now demand increasing levels of automation; nevertheless, the unique experience and intuition of human researchers is still required to interpret the end results in any meaningful biological way. Putting humans in the loop requires tools to support real-time interaction with these vast and complex data-sets. Numerous tools do exist for this purpose, but many do not have optimal interfaces, most are effectively isolated from other tools and databases owing to incompatible data formats, and many have limited real-time performance when applied to realistically large data-sets: much of the user's cognitive capacity is therefore focused on controlling the software and manipulating esoteric file formats rather than on performing the research.

METHODS: To confront these issues, harnessing expertise in human-computer interaction (HCI), high-performance rendering and distributed systems, and guided by bioinformaticians and end-user biologists, we are building reusable software components that, together, create a toolkit that is both architecturally sound from a computing point of view, and addresses both user and developer requirements. Key to the system's usability is its direct exploitation of semantics, which, crucially, gives individual components knowledge of their own functionality and allows them to interoperate seamlessly, removing many of the existing barriers and bottlenecks from standard bioinformatics tasks.

RESULTS: The toolkit, named Utopia, is freely available from http://utopia.cs.man.ac.uk/.


This article has been bookmarked 8 times, initially on 2009-06-28.

2009-08-15 User flipip23
2009-07-10 Group Orengo Group Journal Picks
User sillitoe
2009-06-30 User mikel_egana
2009-06-28 User kshameer
User imrchen
User dullhunk, 2 notes

A classic example of an integrated database environment is the unified protein family resource, InterPro [13]. In the beginning, InterPro amalgamated four different protein signature databases: PROSITE [14], which houses regular expressions and profiles; PRINTS [15], which exploits position-specific scoring matrix-based fingerprints; Pfam [16], which uses hidden Markov models; and ProDom [17], which uses automatically generated sequence clusters. The diagnostic methods exploited by these resources are different but complementary, providing different perspectives on protein family relationships.

2009-06-28 15:40:01

Ten years ago, when the component databases were relatively small, integrating and rationalising their data was relatively straightforward; but as each resource has grown, the familial boundaries defined by their different approaches have been blurred, and the relationships between families have become fuzzier. Over time, managing and representing these biological overlaps in a meaningful way for end users has consequently become a major challenge. Consider, for a moment, the entry for rhodopsin-like G protein-coupled receptors (IPR000276). According to InterPro, the superfamily contains 19898 members: of these, 19592 were identified by Pfam's hidden Markov model, 16868 by the PRINTS fingerprint and 16478 by the PROSITE regular expression. By contrast, the source databases themselves quote 16975, 1143 and 2029 members respectively. Clearly, these numbers are very different, and at the very least point to a synchronisation problem: InterPro tracks the latest version of UniProt, but the source databases lack the manpower to achieve this. Users are therefore left to work out the relationships among the family membership suggested by the source databases (16975, 1143, 2029), that suggested by InterPro's implementation of the source databases' diagnostic tools (19592, 16868, 16478), and the unified number endorsed by InterPro itself (19898), which is larger than the number identified by any of the component tools.
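To make the size of these discrepancies concrete, the short Python sketch below works through the arithmetic using only the figures quoted in the note above. The variable and function names are illustrative, not taken from InterPro or Utopia. The reading of the unified total as a union of the per-method hit sets is an assumption; the note does not say how InterPro derives that figure.

```python
# Member counts for IPR000276, copied from the note above.
interpro_total = 19898
via_interpro = {"Pfam": 19592, "PRINTS": 16868, "PROSITE": 16478}
at_source = {"Pfam": 16975, "PRINTS": 1143, "PROSITE": 2029}


def count_gaps(via_interpro, at_source):
    """Per method: InterPro's count minus the count quoted by the source database."""
    return {name: via_interpro[name] - at_source[name] for name in via_interpro}


def beyond_largest(total, counts):
    """How far the unified total exceeds the largest single-method count."""
    return total - max(counts.values())


gaps = count_gaps(via_interpro, at_source)
extra = beyond_largest(interpro_total, via_interpro)

for name, gap in gaps.items():
    print(f"{name}: InterPro reports {gap} more members than the source database")

# ASSUMPTION: if the unified total is a union of the per-method hit sets,
# this is the number of members that the best single method (Pfam) misses.
print(f"Unified total exceeds the largest single-method count by {extra}")
```

Under these figures the gaps are 2617 for Pfam, 15725 for PRINTS and 14449 for PROSITE, and the unified total exceeds the largest single-method count by 306.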

2009-06-28 15:40:16
User guhjy