Vol.3 No.1 2010

Research paper: A bioinformatics strategy to produce a cyclically developing project structure (M. Suwa et al.), Synthesiology - English edition Vol.3 No.1 (2010)

1) Gene discovery phase
The genome DNA sequence is scanned in all six reading frames, the codons are translated into amino acid sequences, and every fragment region (corresponding to an exon region) that matches a known GPCR amino acid sequence above a certain similarity score is listed (tblastn program). This narrows down the regions where genes are present. Using ALN[3], the full-length gene corresponding to the known sequence is then assembled by extending the search region 1,000 bases upstream and downstream. In parallel, a sequence is obtained with Gene Decoder[4], which is a probabilistic model of the gene region. Because several sequences may match completely or partially, some regions overlap; parts with significant overlap are joined to determine the longest amino acid sequence.

2) GPCR gene refinement phase
The determined amino acid sequence is passed to the sequence search program (blastp), the functional motif identification program (HMMER[5]), and the transmembrane helix prediction program (SOSUI[6]) (Fig. 3). By combining the maximum-selectivity and maximum-sensitivity thresholds determined for each program in section 3.1, data sets with various detection selectivities and sensitivities are created. To extract all GPCRs while allowing some false positives (prediction errors), the union of the outputs obtained at the maximum-sensitivity thresholds for blastp, HMMER, and SOSUI (E-value < 10^-30, E-value < 10^-1, and a predicted number of transmembrane helices of 6 to 8, respectively) is taken. This gives 100 % sensitivity at 20.4 % selectivity (level D) for the learning set.
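As a rough illustration of the maximum-sensitivity combination (level D) described above, the following Python sketch takes the union of the three per-program criteria. The `Candidate` record, its field names, and the helper functions are hypothetical and are not part of the SEVENS pipeline; only the threshold values come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

# Maximum-sensitivity thresholds quoted in the text (level D).
BLASTP_MAX_E = 1e-30
HMMER_MAX_E = 1e-1
SOSUI_HELIX_RANGE = (6, 8)  # inclusive range of predicted helices


@dataclass
class Candidate:
    """Hypothetical per-sequence summary of the three program outputs."""
    name: str
    blastp_evalue: Optional[float]  # None when blastp reports no hit
    hmmer_evalue: Optional[float]   # None when HMMER reports no hit
    tm_helices: int                 # number of helices predicted by SOSUI


def passes_level_d(c: Candidate) -> bool:
    """Union of criteria: passing any one program is enough."""
    by_blastp = c.blastp_evalue is not None and c.blastp_evalue < BLASTP_MAX_E
    by_hmmer = c.hmmer_evalue is not None and c.hmmer_evalue < HMMER_MAX_E
    low, high = SOSUI_HELIX_RANGE
    by_sosui = low <= c.tm_helices <= high
    return by_blastp or by_hmmer or by_sosui


def level_d_set(candidates: List[Candidate]) -> List[str]:
    """Names of all candidates kept at the maximum-sensitivity level."""
    return [c.name for c in candidates if passes_level_d(c)]
```

Because the criteria are joined with a logical OR, a candidate with no sequence hit at all is still kept when SOSUI predicts 6 to 8 helices, which is consistent with the high sensitivity and low selectivity reported for this level.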
On the other hand, the most accurate data set (level A) is the union of the outputs at the maximum-selectivity thresholds of blastp and HMMER (E-value < 10^-80 and E-value < 10^-10, respectively). This shows 99.4 % sensitivity and 96.6 % selectivity for the learning set. We also created level B (sensitivity 99.8 %, selectivity 70 %) and level C (sensitivity 99.9 %, selectivity 48.4 %) data sets as intermediates between these two levels. Finally, the data set is matched against sequence data for non-GPCR genes, and wrongly predicted sequences are eliminated.

3) Functional analysis phase
Among the identified GPCR sequences, those related at E-value < 10^-30 are grouped together and added to the known families. Sequences that show a resemblance of 96 % or more over 100 residues or more to a known GPCR sequence are considered identical to that known sequence; all other sequences are considered new. If stop-start codons are found in the exon region, the sequence is considered a pseudogene. Based on the analysis conducted in the GPCR gene refinement phase, functional and structural information is added to each sequence: the coordinates on the chromosome, the number of exons, the sequence length, the sequence search information, the transmembrane helix regions, the functional motif regions, and the domain regions.

Fig. 3 SEVENS pipeline. An analysis pipeline in which various tools are combined sequentially, at optimal thresholds and in optimal order, to comprehensively identify GPCR genes from the genome sequence.
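The known-versus-new rule of the functional analysis phase can be written as a small check. This is a minimal sketch under stated assumptions: the function name and parameters are hypothetical, and "resemblance" is treated as a single percentage reported for the best alignment to a known GPCR sequence, since the text does not say how it is computed.

```python
def label_sequence(resemblance_pct: float,
                   aligned_residues: int,
                   min_resemblance_pct: float = 96.0,
                   min_aligned_residues: int = 100) -> str:
    """Label a predicted GPCR sequence as 'known' or 'new'.

    A sequence counts as 'known' only when both conditions from the
    text hold: resemblance of 96 % or more, over 100 residues or more.
    """
    is_known = (resemblance_pct >= min_resemblance_pct
                and aligned_residues >= min_aligned_residues)
    return "known" if is_known else "new"
```

For example, `label_sequence(97.5, 320)` returns `"known"`, whereas a 99 % match over only 60 residues is labeled `"new"` because the alignment is too short.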
[Fig. 3 flowchart. Gene discovery phase: known GPCR sequences (N) and the genome sequence are the inputs; mapping onto the genome sequence (program: tblastn) gives a gene candidate region; the region is extended upstream/downstream (length of extension ΔL) and genes are re-constructed (program: ALN) until the entire length of the known GPCR sequence is covered, giving a gene candidate; the loop runs from k = 0 with k = k + 1 while k < N. GPCR gene refinement phase: sequence search (program: blastp), GPCR-specific motif search (program: HMMER), transmembrane helix prediction (program: SOSUI), and elimination of sequences other than GPCR using the non-GPCR sequence DB, giving the GPCR genes. Functional analysis phase.]
