Vol.2 No.1 2009

Research paper: Advanced in-silico drug screening to achieve high hit ratio (Y. Fukunishi et al.)
Synthesiology - English edition Vol.2 No.1 (2009), p. 65

of structures with a low probability of existence. (3) The distribution of data generated using commercially available software for developing compound DBs is prohibited by the licensing policy.

As described later in this paper, our objective was to develop and distribute novel DBs for VS using protein-compound affinity matrices based on the compound DBs. However, this cannot be achieved using commercially available software. Developing in-house software that generates compound data and compound DBs resolves these issues. Once such DBs are distributed, VS becomes accessible to users who find it difficult to afford costly licenses, such as small- and medium-sized enterprises and academic researchers, and novel and advanced VS methods can be widely disseminated even to large enterprises. Economic and technological benefits are expected as a result.

4 Processes
4.1 Overall perspective
The overall software development process consisted of approximately 10 steps, as follows (Fig. 1). First, we eliminated duplicated compounds listed in the 2D SD files provided by reagent vendors (for example, methanol is sold by nearly every vendor). Since hydrogen atoms (protons) are normally omitted in 2D structures, protons were added. Parameters such as interatomic distances and bond angles were assigned to all of the atoms. Based on this information, the 3D structural coordinates, as well as enantiomers if they existed, were generated from the 2D coordinates. Atomic charges were then evaluated by quantum chemical calculations so that equivalent atoms were assigned equivalent charges. The generated 3D data were compiled into a relational database. For each step, we developed our software so as to avoid infringing the patents covering a number of commercially available software products.
Each development step is explained in detail below.

4.2 Handling of massive data sets
Massive data sets are difficult to handle. If millions of compound records are stored in a single file, the file size exceeds the limit that a computer can handle, whereas if each compound record is stored in its own file, the resulting millions of files cannot be placed in a single folder because of constraints imposed by the computer system. We therefore stored the information on each compound in its own file and placed the files for approximately 10,000 compounds in each of several hundred folders, so that millions of compound records could be handled in a hierarchical structure. The completed compound DB could be stored as a single relational database on a system with a 64-bit architecture.

4.3 Exclusion of duplicated compounds: Determination of compound identity
It is necessary to determine whether or not two molecules are identical. Since naively checking 4 million compounds against one another requires on the order of the square of 4 million (about 1.6 × 10^13) comparisons, we developed a high-speed discrimination method that consists of several steps, described below. We prioritized speed, accepting some loss of discrimination accuracy. Since the catalog structures of a few percent of commercially available compounds differ from the actual structures because of incorrect structure identification and insufficient quality control, excessive pursuit of mathematical strictness would be meaningless.

4.3.1 Determination of chemical compositions based on pseudo-molecular mass
The chemical composition is a description of the type and number of atoms contained in a molecule; in the case of methanol (CH3-OH), for example, it is C1O1H4. Comparing chemical compositions is a quick method of discriminating compounds. No further discrimination is necessary if the chemical compositions of two molecules differ from each other.
However, comparing chemical compositions as character strings takes too much time. We therefore computed a molecular mass for each molecule using atomic masses given to three decimal places, obtaining a six-digit number per molecule. In practical terms, this allows chemical compositions to be compared accurately with a single molecular-mass calculation, without comparing character strings.

4.3.2 Identification of molecular topology based on graph invariants
The structural formulas of two compounds may differ even when their chemical compositions match. Although molecules can be compared by superimposing their graphs, the graphical superposition of molecules is a nondeterministic polynomial time (NP)-complete problem, for which no algorithm is known whose computation time is bounded by a polynomial in the number of atoms. In general, high-speed algorithms exist for problems with polynomial computation times; however, no effective algorithm exists in the case of

[Fig. 1 Process of development of 3D compound structures. Input (information in electronic catalogs): 2D planar structures; no protons; no charge information; no enantiomers. Steps shown under "Development of 3D compound database from 2D electronic catalogs": identification of atom types; identification of aromatic rings; protonation; generation of enantiomers; generation of energetically stable 3D structures; generation of atomic charges by quantum chemical calculations; conversion to mol2 file format; compilation of database.]
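The pseudo-molecular mass comparison of Section 4.3.1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `pseudo_mass` and `bucket_by_pseudo_mass`, the four-element mass table, and the toy molecule list are all assumptions made for illustration. Atomic masses are stored as integers in thousandths of a mass unit (three decimal places), so sums compare exactly, and molecules are grouped by the resulting key, so that only molecules within the same group need further comparison rather than every pair.

```python
from collections import defaultdict

# Atomic masses to three decimal places, stored as integers
# (thousandths of a mass unit). Illustrative subset only; a real
# table would cover every element found in the vendor catalogs.
MILLI_MASS = {"H": 1008, "C": 12011, "N": 14007, "O": 15999}


def pseudo_mass(atoms):
    """Return the pseudo-molecular mass as an integer key.

    `atoms` is a list of element symbols, one per atom, in any order.
    """
    return sum(MILLI_MASS[element] for element in atoms)


def bucket_by_pseudo_mass(molecules):
    """Group molecule names by pseudo-molecular mass.

    Molecules in different buckets have different chemical
    compositions and need no further comparison.
    """
    buckets = defaultdict(list)
    for name, atoms in molecules.items():
        buckets[pseudo_mass(atoms)].append(name)
    return dict(buckets)


# Hypothetical catalog entries (names and atom lists are made up).
molecules = {
    "methanol_vendor_a": ["C", "O", "H", "H", "H", "H"],
    "methanol_vendor_b": ["C", "H", "H", "H", "O", "H"],
    "ethanol": ["C", "C", "O", "H", "H", "H", "H", "H", "H"],
    "dimethyl_ether": ["C", "O", "C", "H", "H", "H", "H", "H", "H"],
    "water": ["O", "H", "H"],
}

buckets = bucket_by_pseudo_mass(molecules)
for key, names in sorted(buckets.items()):
    print(key, sorted(names))
```

In this toy data, the two methanol entries share one key and are duplicate candidates, while ethanol and dimethyl ether also share a key despite having different structures. That second case is why a topology comparison such as the one in Section 4.3.2 is still needed within each group. The key is a practical filter rather than a strict one, since two different compositions could in principle sum to the same value.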
