Systematic analysis of coding and noncoding DNA sequences using methods of statistical linguistics

Physical Review E, Vol. 52, No. 3 (1995), p. 2939.


Abstract

We compare the statistical properties of coding and noncoding regions in eukaryotic and viral DNA sequences by adapting two tests developed for the analysis of natural languages and symbolic sequences. The data set comprises all 30 sequences of length above 50 000 base pairs in GenBank Release No. 81.0, as well as the recently published sequences of C. elegans chromosome III (2.2 Mbp) and yeast chromosome XI (661 Kbp). We find that for the three chromosomes we studied the statistical properties of noncoding regions appear to be closer to those observed in natural languages than those of the coding regions. In particular, (i) an n-tuple Zipf analysis of noncoding regions reveals a regime close to power-law behavior while the coding regions show logarithmic behavior over a wide interval, while (ii) an n-gram entropy measurement shows that the noncoding regions have a lower n-gram entropy (and hence a larger "n-gram redundancy") than the coding regions. In contrast to the three chromosomes, we find that for vertebrates such as primates and rodents, and for viral DNA, the difference between the statistical properties of coding and noncoding regions is not pronounced and therefore the results of the analyses of the investigated sequences are less conclusive. After noting the intrinsic limitations of the n-gram redundancy analysis, we also briefly discuss the failure of zero- and first-order Markovian models or simple nucleotide repeats to account fully for these "linguistic" features of DNA. Finally, we emphasize that our results by no means prove the existence of a "language" in noncoding DNA.
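
The two tests named in the abstract, an n-tuple Zipf (rank-frequency) analysis and an n-gram entropy measurement, can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the use of overlapping windows shifted by one base, and the redundancy formula R = 1 - H(n) / (n log2 4) are assumptions made for illustration, and the paper's exact definitions may differ.

```python
from collections import Counter
from math import log2


def ntuple_counts(seq, n):
    """Count overlapping n-tuples (windows shifted by one base; an assumption)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def zipf_ranking(seq, n):
    """Return (rank, relative frequency) pairs, most frequent n-tuple first."""
    counts = ntuple_counts(seq, n)
    total = sum(counts.values())
    ordered = sorted(counts.values(), reverse=True)
    return [(rank, c / total) for rank, c in enumerate(ordered, start=1)]


def ngram_entropy(seq, n):
    """Shannon entropy, in bits, of the empirical n-tuple distribution."""
    counts = ntuple_counts(seq, n)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def ngram_redundancy(seq, n, alphabet_size=4):
    """Assumed definition: 1 - H(n) / (n * log2(alphabet_size))."""
    return 1.0 - ngram_entropy(seq, n) / (n * log2(alphabet_size))
```

Plotting the output of `zipf_ranking` on log-log axes would show whether the rank-frequency curve is close to a power law. A lower `ngram_entropy` corresponds to a higher `ngram_redundancy` at the same n. For short sequences and large n the estimates are dominated by undersampling, which is one of the intrinsic limitations the abstract alludes to.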

