Vol.3 No.1 2010
Research paper: A bioinformatics strategy to produce a cyclically developing project structure (M. Suwa et al.), Synthesiology - English edition Vol.3 No.1 (2010)
or where it is read from the opposite end (reading frame). To capture the gene region computationally, a model is built for each reading frame by learning the codon at which translation into the amino acid sequence starts (start codon or initiation codon), the codon at which it stops (stop codon or termination codon), and the sequence features of characteristic regions such as the exon-intron boundaries. The regions that match the model are then extracted. If the search target is GPCR, the model includes the characteristic regions common to GPCR proteins in addition to the general characteristics of a gene. These include the seven transmembrane helices, the glycosylation site on the NH2-terminal side of the amino acid sequence, the fatty acid binding site on the COOH-terminal side, short common functional sequences (functional motifs) such as the three-residue sequence Asp-Arg-Tyr (DRY sequence), and domains that are common globally over several residues.
The elemental technologies for informatics used in gene identification are groups of programs that capture the above-mentioned characteristics of the gene. An experimental researcher who devotes every effort to finding new genes without error may be reluctant to use such a program, even if it predicts correctly at a certain rate of success. Such a researcher demands that the predictions be almost entirely correct. Therefore, to allow predictions at extremely high accuracy, we selected a group of suitable programs developed in Japan and abroad and evaluated their performance.
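As a rough illustration of the start codon, stop codon and reading-frame features described above, the following Python sketch scans both strands of a sequence in three frames each and reports every stretch that runs from ATG to the first in-frame stop codon. It is a deliberately naive scan, not the learned gene model used by the programs discussed in this paper: it ignores introns and all GPCR-specific features, and the function names and the 0-based frame offsets (Fig. 2 numbers the frames 1 to 3) are assumptions made for this example.

```python
# Illustrative sketch only: a naive open-reading-frame scan, not the
# learned model (ALN, Gene Decoder) described in the text.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence read from the opposite end of the other strand."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=1):
    """Return (strand, frame, start, end, orf) tuples.

    frame is the 0-based offset (0, 1 or 2) of the reading frame; start and
    end are 0-based, end-exclusive positions on the scanned strand.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            # Step through the strand one codon at a time in this frame.
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START:
                    start = i  # first start codon opens a candidate ORF
                elif start is not None and codon in STOPS:
                    orf = s[start:i + 3]  # include the stop codon
                    if len(orf) // 3 >= min_codons:
                        results.append((strand, frame, start, i + 3, orf))
                    start = None  # look for the next ORF in this frame
    return results


# Example sequence taken from the Fig. 2 residue on this page.
example = "aacgccaggtcATGGGTCAGAATTCGTCGTGA"
orfs = find_orfs(example)
for hit in orfs:
    print(hit)
```

A real gene finder replaces the exact-match tests with probabilistic scores learned from known genes, which is why the programs evaluated in this section need training data.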
First, we evaluated a program that aligns a known gene sequence to the genome while modeling the exon-intron boundaries (ALN[3]), and a program that applies a probabilistic model of the emission and transition of nucleotide bases (hidden Markov model) to gene structure (Gene Decoder[4]). Using training data consisting of nucleotide sequence regions of known genes whose exon-intron structure had been determined, we confirmed the maximum length of a gene region, evaluated the programs to clarify how far the search must be extended upstream and downstream of an arbitrary exon (additional extension) to cover the entire gene, and examined the sequence similarity score that identifies exons most accurately.
Next, as tools for checking whether a candidate gene sequence is actually a GPCR, we evaluated a sequence search program (blastp), a program that checks for motifs characteristic of GPCR (HMMER[5]), and a program that predicts transmembrane helix regions (SOSUI[6]). The parameters for selecting GPCR were the similarity expectation score (E-value) from a blastp search of the protein sequence; the E-value from an HMMER search against functional motifs (Pfam) expressed as hidden Markov models; and the number of helices predicted by SOSUI. Using a learning set of known GPCR sequences and non-GPCR sequences from protein sequence DBs (such as UniProt and GPCRDB), the thresholds of these parameters for identifying correct GPCR sequences were set while evaluating sensitivity (percentage of correct sequences that are predicted) and selectivity (percentage of predictions that are correct sequences).
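The two measures defined above can be written down in a few lines. The Python sketch below computes sensitivity and selectivity for one E-value threshold over a labeled learning set and then sweeps several thresholds. The data are invented toy values, and the rule "predict GPCR when the E-value is at or below the threshold" is an assumption made for this example rather than the exact decision rule used in the study.

```python
def sensitivity_selectivity(scored, threshold):
    """Return (sensitivity, selectivity) for one E-value threshold.

    scored is a list of (e_value, is_gpcr) pairs; a sequence is predicted
    to be a GPCR when its E-value is at or below the threshold (assumed rule).
    """
    tp = sum(1 for e, is_gpcr in scored if e <= threshold and is_gpcr)
    fp = sum(1 for e, is_gpcr in scored if e <= threshold and not is_gpcr)
    fn = sum(1 for e, is_gpcr in scored if e > threshold and is_gpcr)
    # Sensitivity: fraction of true GPCRs that are predicted.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Selectivity: fraction of predictions that are true GPCRs.
    selectivity = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, selectivity


# Hypothetical toy learning set: (E-value, known to be a GPCR?)
learning_set = [
    (1e-50, True),
    (1e-30, True),
    (1e-5, True),
    (1e-8, False),
    (1e-2, False),
]

for threshold in (1e-10, 1e-4, 1.0):
    sens, sel = sensitivity_selectivity(learning_set, threshold)
    print(f"threshold={threshold:g} sensitivity={sens:.2f} selectivity={sel:.2f}")
```

In this toy set a strict threshold gives perfect selectivity but misses one true GPCR, while a looser threshold recovers every GPCR at the cost of false positives; the same trade-off applies to the HMMER E-value and to the predicted helix count.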
The threshold at which almost 100 % selectivity could be achieved while false-negative results (correct sequences that are not predicted) were kept to a minimum was defined as the “maximum selectivity threshold,” and the threshold at which nearly 100 % sensitivity could be achieved while false-positive results (sequences other than GPCR that are predicted) were kept to a minimum was defined as the “maximum sensitivity threshold.” Since the objective was to “understand” the properties of the elemental programs, which was basic knowledge necessary for solving the issues of the research, this phase could be considered Type 1 Basic Research.
3.2 Gene identification and function analysis pipeline
Based on the research of section 3.1, we developed a system for comprehensively identifying GPCR genes from the genome sequence. Each elemental program was regarded as a pipe with an input and a resulting output, and these pipes were joined together step by step in the optimal order and with the optimal thresholds (SEVENS pipeline, Fig. 3). The pipeline is composed of phases for extracting the protein-coding regions from the genome sequence (gene discovery phase), determining the GPCR gene candidates (GPCR gene refinement phase), and adding function and structure information (functional analysis phase). Because this part systematizes the elemental programs by combining them and controlling the result, it may be considered Type 2 Basic Research.
Fig. 2 Conceptual diagram of gene region on the DNA sequence.
[Figure labels: DNA sequence, complementary DNA sequence, regulatory region, gene region, start exon, exon, intron (GT, AG), start codon (ATG), stop codons (TGA, TAG, TAA), open reading frame, transcription, mRNA sequence, mature mRNA sequence (AAAA…), translation, amino acid sequence. The example sequence aacgccaggtcATGGGTCAGAATTCGTCGTGA is shown with reading frames 1, 2 and 3.]
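The idea in section 3.2 of treating each elemental program as a pipe and joining the pipes in a fixed order with fixed thresholds can be sketched as a chain of filter stages. The Python example below is only a schematic of the GPCR gene refinement phase, not the SEVENS implementation: the `Candidate` fields stand in for precomputed blastp, HMMER and SOSUI results, the threshold values are invented for illustration, and the way the real pipeline combines the three criteria may differ.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """Hypothetical record of precomputed scores for one gene candidate."""
    name: str
    blastp_evalue: float  # best blastp E-value against known GPCRs
    pfam_evalue: float    # best HMMER E-value against a GPCR Pfam model
    n_helices: int        # number of predicted transmembrane helices


Stage = Callable[[List[Candidate]], List[Candidate]]


def keep_if(predicate: Callable[[Candidate], bool]) -> Stage:
    """Wrap a per-candidate test as a pipe: list in, filtered list out."""
    def stage(candidates: List[Candidate]) -> List[Candidate]:
        return [c for c in candidates if predicate(c)]
    return stage


def run_pipeline(candidates: List[Candidate], stages: List[Stage]) -> List[Candidate]:
    """Feed the output of each stage into the next, in order."""
    for stage in stages:
        candidates = stage(candidates)
    return candidates


# Threshold values below are made up for illustration only.
refinement_stages = [
    keep_if(lambda c: c.blastp_evalue <= 1e-10),
    keep_if(lambda c: c.pfam_evalue <= 1e-5),
    keep_if(lambda c: 6 <= c.n_helices <= 8),
]

candidates = [
    Candidate("cand_A", 1e-40, 1e-20, 7),
    Candidate("cand_B", 1e-3, 1e-20, 7),   # fails the blastp stage
    Candidate("cand_C", 1e-40, 1e-20, 3),  # fails the helix-count stage
]

selected = run_pipeline(candidates, refinement_stages)
print([c.name for c in selected])
```

Keeping each stage as an independent function with a list as input and output makes it straightforward to reorder stages or swap a threshold, which is the property the pipe analogy in the text relies on.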