| |
Cell, Vol. 118, No. 1. (9 July 2004), pp. 31-44.
by S. Mnaimneh, A. P. Davierwala, J. Haynes, J. Moffat, W. T. Peng, W. Zhang, X. Yang, J. Pootoolal, G. Chua, A. Lopez, M. Trochesset, D. Morse, N. J. Krogan, S. L. Hiley, Z. Li, Q. Morris, J. Grigull, N. Mitsakakis, C. J. Roberts, J. F. Greenblatt, C. Boone, C. A. Kaiser, B. J. Andrews, T. R. Hughes
Abstract
Nearly 20% of yeast genes are required for viability, hindering genetic analysis with knockouts. We created promoter-shutoff strains for over two-thirds of all essential yeast genes and subjected them to morphological analysis, size profiling, drug sensitivity screening, and microarray expression profiling. We then used this compendium of data to ask which phenotypic features characterized different functional classes and used these to infer potential functions for uncharacterized genes. We identified genes involved in ribosome biogenesis (HAS1, URB1, and URB2), protein secretion (SEC39), ...
|
| |
Bioinformatics, Vol. 21, No. 9. (1 May 2005), pp. 2049-2058.
Abstract
MOTIVATION: The advent of high-throughput experiments in molecular biology creates a need for methods to efficiently extract and use information for large numbers of genes. Recently, the associative concept space (ACS) has been developed for the representation of information extracted from biomedical literature. The ACS is a Euclidean space in which thesaurus concepts are positioned and the distances between concepts indicate their relatedness. The ACS uses co-occurrence of concepts as a source of information. In this paper we evaluate how well ...
|
| |
Nucleic acids research, Vol. 33, No. 5. (2005), pp. 1544-1552.
Abstract
Genome-wide techniques such as microarray analysis, Serial Analysis of Gene Expression (SAGE), Massively Parallel Signature Sequencing (MPSS), linkage analysis and association studies are used extensively in the search for genes that cause diseases, and often identify many hundreds of candidate disease genes. Selection of the most probable of these candidate disease genes for further empirical analysis is a significant challenge. Additionally, identifying the genes that cause complex diseases is problematic due to low penetrance of multiple contributing genes. Here, we describe ...
|
| |
Bioinformatics, Vol. 17, No. 4. (1 April 2001), pp. 319-326.
Abstract
Motivation: High-density microarray technology permits the quantitative and simultaneous monitoring of thousands of genes. The interpretation challenge is to extract relevant information from this large amount of data. A growing variety of statistical analysis approaches are available to identify clusters of genes that share common expression characteristics, but provide no information regarding the biological similarities of genes within clusters. The published literature provides a potential source of information to assist in interpretation of clustering results. Results: We describe ...
|
| |
Bioinformatics, Vol. 20, No. 1. (1 January 2004), pp. 120-121.
Abstract
Summary: We present a biomedical text-mining system focused on four types of gene-related information: biological functions, associated diseases, related genes and gene-gene relations. The aim of this system is to provide researchers with an easy-to-use bio-information service that will rapidly survey the burgeoning biomedical literature. Availability: http://iir.csie.ncku.edu.tw/~yuhc/gis/ DOI: 10.1093/bioinformatics/btg369 ...
|
| |
Genome biology, Vol. 4, No. 10. (2003)
Abstract
EASE is a customizable software application for rapid biological interpretation of gene lists that result from the analysis of microarray, proteomics, SAGE and other high-throughput genomic data. The biological themes returned by EASE recapitulate manually determined themes in previously published gene lists and are robust to varying methods of normalization, intensity calculation and statistical selection of genes. EASE is a powerful tool for rapidly converting the results of functional genomics studies from 'genes' to 'themes'. ...
|
| |
Biotechniques, Vol. 27, No. 6. (December 1999)
Abstract
The trend toward high-throughput techniques in molecular biology and the explosion of online scientific data threaten to overwhelm the ability of researchers to take full advantage of available information. This problem is particularly severe in the rapidly expanding area of gene expression experiments, for example, those carried out with cDNA microarrays or oligonucleotide chips. We present an Internet-based hypertext program, MedMiner, which filters and organizes large amounts of textual and structured information returned from public search engines like GeneCards and PubMed. ...
|
| |
Nature genetics, Vol. 28, No. 1. (May 2001), pp. 21-28.
Abstract
We have carried out automated extraction of explicit and implicit biomedical knowledge from publicly available gene and text databases to create a gene-to-gene co-citation network for 13,712 named human genes by automated analysis of titles and abstracts in over 10 million MEDLINE records. The associations between genes have been annotated by linking genes to terms from the medical subject heading (MeSH) index and terms from the gene ontology (GO) database. The extracted database and accompanying web tools for gene-expression analysis have ...
|
| |
Gene, Vol. 259, No. 1-2. (23 December 2000), pp. 245-252.
Abstract
We describe a system which automatically identifies gene and protein names in journal articles, an important and non-trivial first step in knowledge extraction of protein and gene actions. Our system uses a database of gene and protein names and is based on BLAST [Altschul et al., Nucleic Acids Res. 25 (1997) 3389-3402], a popular tool for DNA and protein sequence comparison. We describe a method that consists of mapping sequences of text characters into sequences of nucleotides that can be processed ...
|
| |
Genome Inform Ser Workshop Genome Inform, Vol. 9 (1998), pp. 72-80.
Abstract
Gathering data on molecular interactions to be fed into a specialized database has motivated the development of a computer system to help extract pertinent information from texts, relying on advanced linguistic tools complemented by object-oriented knowledge modeling capabilities. As a first step toward this challenging objective, a program for the identification of gene symbols and names inside sentences has been devised. The main difficulty is that these names and symbols do not appear to follow construction rules. The program is thus ...
|
| |
Nucleic Acids Res, Vol. 32, No. Web Server issue. (1 July 2004)
by H. Pan, L. Zuo, V. Choudhary, Z. Zhang, S. H. Leow, F. T. Chong, Y. Huang, V. W. Ong, B. Mohanty, S. L. Tan, S. P. Krishnan, V. B. Bajic
Abstract
We present Dragon TF Association Miner (DTFAM), a system for text-mining of PubMed documents for potential functional association of transcription factors (TFs) with terms from Gene Ontology (GO) and with diseases. DTFAM has been trained and tested in the selection of relevant documents on a manually curated dataset containing >3000 PubMed abstracts relevant to transcription control. On our test data the system achieves sensitivity of 80% with specificity of 82%. DTFAM provides comprehensive tabular and graphical reports linking terms to relevant ...
|
| |
J Bioinform Comput Biol, Vol. 1, No. 4. (January 2004), pp. 611-626.
Abstract
The identification of gene/protein names in natural language text is an important problem in named entity recognition. In previous work we have processed MEDLINE documents to obtain a collection of over two million names of which we estimate that perhaps two thirds are valid gene/protein names. Our problem has been how to purify this set to obtain a high quality subset of gene/protein names. Here we describe an approach which is based on the generation of certain classes of names that ...
|
| |
OMICS, Vol. 7, No. 2. (2003), pp. 193-202.
Abstract
A large-scale in silico evaluation of gene deletions in Saccharomyces cerevisiae was conducted using a genome-scale reconstructed metabolic model. The effect of 599 single gene deletions on cell viability was simulated in silico and compared to published experimental results. In 526 cases (87.8%), the in silico results were in agreement with experimental observations when growth on synthetic complete medium was simulated. Viable phenotypes were correctly predicted in 89.4% (496 out of 555) and lethal phenotypes were correctly predicted in 68.2% (30 ...
|
| |
Bioinformatics, Vol. 16, No. 6. (June 2000), pp. 548-557.
Abstract
MOTIVATION: Genome sequencing projects are making available complete records of the genetic make-up of organisms. These core data sets are themselves complex, and present challenges to those who seek to store, analyse and present the information. However, in addition to the sequence data, high throughput experiments are making available distinctive new data sets on protein interactions, the phenotypic consequences of gene deletions, and on the transcriptome, proteome, and metabolome. The effective description and management of such data is of considerable importance ...
|
| |
In Silico Biol, Vol. 2, No. 3. (2002), pp. 213-217.
Abstract
GOBASE is a relational database that integrates data associated with mitochondria and chloroplasts. The most important data in GOBASE, i.e., molecular sequences and taxonomic information, are obtained from the public sequence data repository at the National Center for Biotechnology Information (NCBI), and are validated by our experts. Maintaining a curated genomic database comes with a towering labor cost, due to the sheer volume of available genomic sequences and the plethora of annotation errors and omissions in records retrieved from public ...
|
| |
Pac Symp Biocomput (2000), pp. 517-528.
Abstract
EDGAR (Extraction of Drugs, Genes and Relations) is a natural language processing system that extracts information about drugs and genes relevant to cancer from the biomedical literature. This automatically extracted information has remarkable potential to facilitate computational analysis in the molecular biology of cancer, and the technology is straightforwardly generalizable to many areas of biomedicine. This paper reports on the mechanisms for automatically generating such assertions and on a simple application, conceptual clustering of documents. The system uses a stochastic part ...
|
| |
Bioinformatics, Vol. 18, No. 8. (August 2002), pp. 1124-1132.
Abstract
MOTIVATION: The MEDLINE database of biomedical abstracts contains scientific knowledge about thousands of interacting genes and proteins. Automated text processing can aid in the comprehension and synthesis of this valuable information. The fundamental task of identifying gene and protein names is a necessary first step towards making full use of the information encoded in biomedical text. This remains a challenging task due to the irregularities and ambiguities in gene and protein nomenclature. We propose to approach the detection of gene and ...
|
| |
J Biomed Inform, Vol. 35, No. 5-6. (2002), pp. 322-330.
Abstract
MOTIVATION: Natural language processing (NLP) techniques are used to extract information automatically from computer-readable literature. In biology, the identification of terms corresponding to biological substances (e.g., genes and proteins) is a necessary step that precedes the application of other NLP systems that extract biological information (e.g., protein-protein interactions, gene regulation events, and biochemical pathways). We have developed GPmarkup (for "gene/protein-full name mark up"), a software system that automatically identifies gene/protein terms (i.e., symbols or full names) in MEDLINE abstracts. As a ...
|
| |
Bioinformatics, Vol. 19 Suppl 1 (2003)
Abstract
MOTIVATION: Genes and proteins are often associated with multiple names. More names are added as new functional or structural information is discovered. Because authors can use any one of the known names for a gene or protein, information retrieval and extraction would benefit from identifying the gene and protein terms that are synonyms of the same substance. RESULTS: We have explored four complementary approaches for extracting gene and protein synonyms from text, namely the unsupervised, partially supervised, and supervised machine-learning techniques, ...
|
| |
Bioinformatics, Vol. 20, No. 18. (12 December 2004), pp. 3710-3715.
Abstract
SUMMARY: GO::TermFinder comprises a set of object-oriented Perl modules for accessing Gene Ontology (GO) information and evaluating and visualizing the collective annotation of a list of genes to GO terms. It can be used to draw conclusions from microarray and other biological data, calculating the statistical significance of each annotation. GO::TermFinder can be used on any system on which Perl can be run, either as a command line application, in single or batch mode, or as a web-based CGI script. AVAILABILITY: ...
|
| |
Genome Biology, Vol. 5 (2004), R43.
|
| |
BMC Bioinformatics, Vol. 3 (2002), 16.
|
| |
BMC Bioinformatics, Vol. 4 (2003), 20.
|
| |
BMC Bioinformatics, Vol. 5 (2004), 116.
|
| |
Bioinformatics, Vol. 21, No. 2. (15 January 2005), pp. 248-256.
Abstract
MOTIVATION: With more and more scientific literature published online, the effective management and reuse of this knowledge has become problematic. Natural language processing (NLP) may be a potential solution by extracting, structuring and organizing biomedical information in online literature in a timely manner. One essential task is to recognize and identify genomic entities in text. 'Recognition' can be accomplished using pattern matching and machine learning. But for 'identification' these techniques are not adequate. In order to identify genomic entities, NLP needs ...
|
| |
Bioinformatics, Vol. 21, No. 1. (1 January 2005), pp. 104-115.
Abstract
MOTIVATION: A major challenge in the interpretation of high-throughput genomic data is understanding the functional associations between genes. Previously, several approaches have been described to extract gene relationships from various biological databases using term-matching methods. However, more flexible automated methods are needed to identify functional relationships (both explicit and implicit) between genes from the biomedical literature. In this study, we explored the utility of Latent Semantic Indexing (LSI), a vector space model for information retrieval, to automatically identify conceptual gene relationships ...
|