Identification of protein coding regions by database similarity search

Published on Mar 1, 1993 in Nature Genetics
· DOI: 10.1038/ng0393-266
Warren Gish (National Institutes of Health),
David J. States (Washington University in St. Louis)
Abstract
Sequence similarity between a translated nucleotide sequence and a known biological protein can provide strong evidence for the presence of a homologous coding region, even between distantly related genes. The computer program BLASTX performed conceptual translation of a nucleotide query sequence followed by a protein database search in one programmatic step. We characterized the sensitivity of BLASTX recognition to the presence of substitution, insertion and deletion errors in the query sequence and to sequence divergence. Reading frames were reliably identified in the presence of 1% query errors, a rate that is typical for primary sequence data. BLASTX is appropriate for use in moderate and large scale sequencing projects at the earliest opportunity, when the data are most prone to containing errors.
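The conceptual translation step described in the abstract can be illustrated with a minimal Python sketch. It translates a nucleotide query in all six reading frames (three offsets on each strand) using the standard genetic code. This is not the BLASTX implementation and performs no database search; the function names (`reverse_complement`, `translate`, `six_frame_translations`) and the frame labels are hypothetical and chosen only for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame labels '+1'..'+3' and '-1'..'-3' to conceptual translations."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCAAATAA").items():
        print(label, protein)
```

For the query `ATGGCCAAATAA`, frame `+1` yields `MAK*` and frame `-1` yields `LFGH`. A single-base insertion or deletion in the query moves the downstream coding sequence into a different frame, which is one reason to translate and search every frame rather than only the first.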