Enhancing Gene Function Prediction Using Gene Neighborhoods Kwangmin Choi Bioinformatics Program School of Informatics Indiana University, Bloomington, IN

Introduction : PLATCOM (A Platform for Computational Comparative Genomics)
PLATCOM is a system for the comparative analysis of multiple genomes. PLATCOM consists of 3 components:
Databases of biological entities, e.g. fna, faa, ptt, gbk…
Databases of relationships among entities, e.g. genome-genome and protein-protein pairwise comparisons
Mining tools over the databases
The web interface of the PLATCOM system is located at http://biokdd.informatics.indiana.edu/kwchoi/platcom/

PLATCOM Web Interface Frontpage of Genome Plot

Background : What is an operon?
http://biocyc.org:1555/ECOLI/new-image?object=Transcription-Units
The operon structure was discovered in 1960 by two French scientists.
Jacob, F. and Monod, J. (1961) Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol., 3, 318–356.
An operon is a group of genes that encode functionally linked proteins. Its member genes are:
Adjacent (200-300 nt)
On the same strand (+ or -)
Co-expressed from one promoter

Background : How to identify or predict operon structure?
When a promoter and terminator are known: Gene clusters = Transcription units (the classical concept of an operon)
When a promoter is not known: Gene clusters = Directons (hypothetical operon candidates), defined by transcription direction and a suitable intergenic distance (200-300 nt)
Computational methods have been developed to find gene clusters in bacterial genomes.
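The directon idea above (same strand, bounded intergenic distance) can be sketched in a few lines. This is a minimal illustration in Python, not the author's Perl implementation; the `Gene` record, the `group_directons` name, and the example coordinates are assumptions made for the sketch, and the 300 nt cutoff is taken from the upper end of the range quoted on the slide.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    """Hypothetical gene record used only for this sketch."""
    name: str
    start: int   # 1-based start coordinate
    end: int     # 1-based end coordinate (inclusive)
    strand: str  # "+" or "-"


def group_directons(genes: List[Gene], max_gap: int = 300) -> List[List[Gene]]:
    """Group genes into runs on the same strand whose intergenic gap is <= max_gap nt."""
    clusters: List[List[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        if clusters:
            prev = clusters[-1][-1]
            # Intergenic distance; negative when the two genes overlap.
            gap = gene.start - prev.end - 1
            if gene.strand == prev.strand and gap <= max_gap:
                clusters[-1].append(gene)
                continue
        clusters.append([gene])
    return clusters


# Invented coordinates, for illustration only.
genes = [
    Gene("a", 100, 400, "+"),
    Gene("b", 450, 900, "+"),    # 49 nt after a, same strand
    Gene("c", 1000, 1500, "-"),  # strand switch starts a new cluster
    Gene("d", 1600, 2000, "-"),  # 99 nt after c
    Gene("e", 2500, 3000, "-"),  # 499 nt after d: gap too large
]
cluster_names = [[g.name for g in c] for c in group_directons(genes)]
print(cluster_names)
```

Single-gene runs are kept in the output so that the caller can decide whether to discard them.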

PCBBH and PCH
R. Overbeek et al. PNAS, 1999, Vol. 96, pp. 2896-2901
PCBBH : Pair of Close Bidirectional Best Hits
BBH : Bidirectional Best Hits
PCH : Pair of Close Homologs
COG : Clusters of Orthologous Groups
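A bidirectional best hit can be illustrated with a short Python sketch: gene a in genome A and gene b in genome B form a BBH when b is the best-scoring hit of a and a is the best-scoring hit of b. The score dictionaries, function names, and example values are assumptions made for the illustration; the additional "close" criterion that turns two BBHs into a PCBBH is not shown.

```python
from typing import Dict, List, Tuple

Scores = Dict[Tuple[str, str], float]  # (query, subject) -> similarity score


def best_hits(scores: Scores) -> Dict[str, str]:
    """Map each query gene to its highest-scoring subject gene."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b: Scores, b_vs_a: Scores) -> List[Tuple[str, str]]:
    """Return (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented scores, for illustration only.
a_vs_b = {("a1", "b1"): 900, ("a1", "b2"): 300, ("a2", "b2"): 500, ("a3", "b1"): 400}
b_vs_a = {("b1", "a1"): 880, ("b2", "a2"): 510, ("b2", "a1"): 290}
bbh = bidirectional_best_hits(a_vs_b, b_vs_a)
print(bbh)
```

Here a3 hits b1 best, but b1 prefers a1, so (a3, b1) is not reported.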

Background : Über-operon
P. Bork et al. Trends Biochem. Sci., Vol. 25, pp. 474-479
Über-operon : A set of genes with a close functional and regulatory context that tends to be conserved despite numerous rearrangements.
This concept focuses on the functional themes of operons, not on specific genes or gene order.

Background : Why are gene clusters conserved?
Certain operons, particularly those that encode subunits of multiprotein complexes (e.g. ribosomal proteins), are conserved in phylogenetically distant bacterial genomes.
These gene clusters may have been conserved since the last universal common ancestor. Why?
Selfish operon hypothesis : Horizontal transfer of an entire operon is favored by natural selection over transfer of individual genes, because co-expression and co-regulation are preserved.

Background : Problems in Operon Prediction
More than 150 genomes have been completely sequenced to date; however:
The biological functions of some genes are still unknown.
There are only a few promoter detection algorithms, and they are not fully satisfactory.
In many cases, genomic data files do not give full information on genes and their products (e.g. gene name, COG, PID).
Operons tend to undergo numerous rearrangements during evolution. As a result, gene order at a level above the operon is poorly conserved (e.g. genes involved in de novo purine biosynthesis).

Background : Problems in Computational Algorithms to Predict Operons
Direct signal finding
Experiment-based approach
Transcription promoters (5'-end) and terminators (3'-end) are searched for. This is effective only for species whose transcription signals are well known, e.g. E. coli.
Combination of gene expression data, functional annotation and other experimental data.
Literature-based approach
Primarily applicable to well-studied genomes, such as E. coli, because data files are incomplete for other genomes.
In many cases, genomic data files do not give full information on genes and their products (e.g. gene name, COG, PID).

Procedure
As part of the PLATCOM project, an integrated whole-genome comparison system was built on the BIOKDD server. A web interface to the all-to-all pairwise comparison DB and tools are also provided.
Several tools for multiple-genome comparison were written in Perl, and gene neighborhoods were then reconstructed from the sequence data.
My gene clustering algorithm was used to compensate for the shortcomings of the literature-based approach.
Connected gene neighborhoods were analyzed to predict gene function and functional coupling between clusters.

Materials/Tools
Raw Data
22 genomes were chosen for this study (14 groups).
Protein-protein pairwise comparison data, e.g. http://biokdd.informatics.indiana.edu/kwchoi/Thesis/L42023.faa.U00096.faa.cmp.txt
PTT files from the NCBI site, e.g. http://biokdd.informatics.indiana.edu/kwchoi/Thesis/U00096.ptt.txt
Data Generated by Web Tools
Gene clustering data (based on sequence homology), e.g. http://biokdd.informatics.indiana.edu/kwchoi/Thesis/clustering_13321_23_750.txt
Gene clusters generated from a PTT file (given an intergenic distance), e.g. http://biokdd.informatics.indiana.edu/kwchoi/Thesis/candidates_22211.htm
E. coli databases for a reality check
http://biocyc.org/
http://ecocyc.org/
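The PTT files listed above are tab-delimited protein tables. The Python sketch below reads one into a list of records; it assumes the usual NCBI layout of a few free-text lines followed by a header row starting with "Location" and the columns Location, Strand, Length, PID, Gene, Synonym, Code, COG, Product. The `read_ptt` name and the two sample rows are invented for the illustration and are not taken from the study's files.

```python
import csv
import io
from typing import Dict, List, Optional, TextIO, Union

# Invented two-gene table in the assumed PTT layout.
SAMPLE_PTT = (
    "Example genome, complete genome - 1..5000\n"
    "2 proteins\n"
    "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct\n"
    "190..255\t+\t21\t1000001\tgenA\tx0001\t-\t-\thypothetical protein\n"
    "337..2799\t+\t820\t1000002\tgenB\tx0002\t-\tCOG0527E\thypothetical protein\n"
)

Record = Dict[str, Union[int, str, Optional[str]]]


def read_ptt(handle: TextIO) -> List[Record]:
    """Parse a PTT-style table into one dictionary per gene."""
    lines = handle.read().splitlines()
    # Skip the free-text lines that precede the column header.
    header_idx = next(i for i, line in enumerate(lines) if line.startswith("Location\t"))
    genes: List[Record] = []
    for row in csv.DictReader(lines[header_idx:], delimiter="\t"):
        start, end = (int(x) for x in row["Location"].split(".."))
        genes.append({
            "start": start,
            "end": end,
            "strand": row["Strand"],
            "pid": row["PID"],
            "gene": row["Gene"],
            # A "-" in the COG column means no COG assignment is given.
            "cog": row["COG"] if row["COG"] != "-" else None,
            "product": row["Product"],
        })
    return genes


parsed = read_ptt(io.StringIO(SAMPLE_PTT))
for gene in parsed:
    print(gene["gene"], gene["start"], gene["end"], gene["strand"], gene["cog"])
```

Records with `cog` set to `None` correspond to the incomplete annotations that the background slides describe.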

Genomes http://www.infobiogen.fr/administrations/deambulum/english/genomes2a.html

Procedure : My Approach to Reconstruct Genomic Neighborhoods
The idea underlying this study is that:
Different genomes contain different, overlapping parts of evolutionarily and functionally connected gene neighborhoods.
By generating a "Tiling Path", the entire neighborhood can be reconstructed.
The genomic context of a well-known genome (e.g. E. coli) is used as a contextual framework. Start by looking at this framework, then search for a group of similar gene neighborhoods in the target genomes.
"Genomic context" means the ordered pattern of COGs. If a COG is not given, we can predict the function of an unknown gene based on my gene clustering data.
We can also identify some "Hitchhikers". "Hitchhikers" are inserted genes that originate from different contexts/themes.
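The tiling idea can be sketched as a simple set-growing loop: start from the reference framework, then repeatedly absorb any target-genome cluster that overlaps the current neighborhood. This Python sketch is only one plausible reading of the approach, not the author's algorithm; the `tile_neighborhood` name, the `min_shared` threshold, and the single-letter gene families are assumptions made for the illustration.

```python
from collections import Counter
from typing import List, Set, Tuple


def tile_neighborhood(
    framework: List[str],
    clusters: List[List[str]],
    min_shared: int = 2,
) -> Tuple[Set[str], Counter]:
    """Grow a neighborhood from the framework by absorbing overlapping clusters.

    Returns the reconstructed neighborhood and, for each family, the number of
    absorbed clusters that contain it.
    """
    neighborhood: Set[str] = set(framework)
    used: Set[int] = set()
    grew = True
    while grew:  # repeat until no further cluster can be absorbed
        grew = False
        for idx, cluster in enumerate(clusters):
            if idx not in used and len(neighborhood & set(cluster)) >= min_shared:
                neighborhood |= set(cluster)
                used.add(idx)
                grew = True
    support = Counter(fam for idx in used for fam in set(clusters[idx]))
    return neighborhood, support


# Invented example: letters stand for gene families (e.g. COGs).
framework = ["A", "B", "C", "D", "E", "F"]
clusters = [
    ["A", "B", "C"],       # covers the left part of the framework
    ["C", "D", "E", "X"],  # overlaps the middle and carries an extra family X
    ["E", "F"],            # covers the right end
    ["Y", "Z"],            # unrelated cluster, never absorbed
]
neighborhood, support = tile_neighborhood(framework, clusters)
extras = neighborhood - set(framework)
print(sorted(neighborhood), sorted(extras), support["C"])
```

Families in `extras` with low support are candidates for hitchhikers, while well-supported ones may be genuine members missing from the reference genome; telling the two apart still requires inspection.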

Tiling Path
E.V. Koonin et al. Nucleic Acids Research, 2002, Vol. 30, No. 10, pp. 2212-2223

Gene Neighborhoods

Results
Case 1 : Relationship between Gene Order and Phylogenetic Distance
Case 2 : One theme : Typical Operon (rbs operon)
Reconstruct gene neighborhoods
Find missing components in the reconstructed gene clusters
Case 3 : Two or more themes : Functional Coupling?
Find genomic hitchhikers
Predict the gene function of uncharacterized proteins
Predict functional coupling

Case 1 : Gene Order and Phylogenetic Distance
If the gene order of two genomes is well conserved, the set of homologs should appear as a line on the genome comparison diagonal plot.
What is the relationship between phylogenetic distance and the conservation of gene order?
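One simple way to put a number on "appears as a line" is to count how often consecutive homologs in genome A are also neighbors in genome B. The Python sketch below does this for a mapping from gene index in A to gene index in B; the function names, the adjacency criterion, and the example mapping are assumptions made for the illustration, not the measure used in the study.

```python
from typing import Dict, List, Tuple


def dot_plot_points(homologs: Dict[int, int]) -> List[Tuple[int, int]]:
    """Return (index in A, index in B) points, sorted by position in A."""
    return sorted(homologs.items())


def adjacency_conservation(homologs: Dict[int, int]) -> float:
    """Fraction of consecutive homolog pairs in A whose partners are adjacent in B."""
    points = dot_plot_points(homologs)
    if len(points) < 2:
        return 0.0
    conserved = sum(
        1
        for (_, b1), (_, b2) in zip(points, points[1:])
        if abs(b2 - b1) == 1  # adjacent in B, in either orientation
    )
    return conserved / (len(points) - 1)


# Invented mapping: genes 0-2 are collinear, 3-4 are inverted, 5 is relocated.
homologs = {0: 10, 1: 11, 2: 12, 3: 40, 4: 39, 5: 5}
fraction = adjacency_conservation(homologs)
print(dot_plot_points(homologs))
print(fraction)
```

A value near 1 corresponds to a long diagonal on the plot; a value near 0 corresponds to scattered points.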

Phylogenetic Tree V.Daubin et al . Genome Research, Vol 12, Issue 7, 1080-1090

Genome Comparison Diagonal Plot : Phylogenetically-Distant Species (Z-score > 500)

Genome Comparison Diagonal Plot : Phylogenetically-Close Species (Z-score > 1000)

Fragmented Gene Clusters

Case 1 : Conclusion
Gene order in phylogenetically distant species is poorly conserved.
However, this observation does not mean that gene order is conserved very well among phylogenetically close species.
Even in the case of close species (e.g. E. coli vs. H. influenzae), gene order is completely scrambled.
In most cases, only a small number of genes are seen as a short line or cluster, and we may consider such a cluster a putative operon.
In the next step, this possibility will be investigated in depth.

Case 2 : Rbs Operon (Typical Operon)
Theme : Ribose transport across the membrane
COG1869 D-ribose high-affinity transport system; membrane-associated protein
COG1129 ATP-binding component of D-ribose high-affinity transport system
COG1172 D-ribose high-affinity transport system
COG1879 D-ribose periplasmic binding protein
COG0524 ribokinase
COG1609 regulator for rbs operon
http://biocyc.org:1555/ECOLI/new-image?type=OPERON&object=TU00206

Case 2 : Rbs Operon
Z-score > 750, Intergenic Distance = 300

Case 2 : Conclusion
All components are involved in ribose transport across the bacterial cell membrane.
In the Rbs operon system, the gene order pattern is 1869-1129-1172-1879-0524-1609.
10 out of 22 genomes have this operon system.
Except in some cases, this gene order pattern is conserved very well.
So it is possible that there exists a kind of "General Contextual Framework" of gene order.
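Checking a genome for the 1869-1129-1172-1879-0524-1609 pattern can be sketched as a search for a contiguous run of COGs in either orientation. The Python sketch below uses exact matching, which is stricter than the "except in some cases" conservation described above; the `find_pattern` name and the placeholder flanking genes X1 to X4 are assumptions made for the illustration.

```python
from typing import List, Optional, Tuple

RBS_PATTERN = ["COG1869", "COG1129", "COG1172", "COG1879", "COG0524", "COG1609"]


def find_pattern(cog_order: List[str], pattern: List[str]) -> Optional[Tuple[int, str]]:
    """Return (start index, orientation) of the first exact contiguous match, or None."""
    cogs = list(cog_order)
    n = len(pattern)
    for orientation, pat in (("forward", list(pattern)), ("reverse", list(pattern)[::-1])):
        for i in range(len(cogs) - n + 1):
            if cogs[i:i + n] == pat:
                return i, orientation
    return None


# Invented gene orders; X1..X4 are placeholder flanking genes.
genome_a = ["X1"] + RBS_PATTERN + ["X2"]
genome_b = ["X3"] + RBS_PATTERN[::-1]                     # operon on the opposite strand
genome_c = ["X4"] + [c for c in RBS_PATTERN if c != "COG1879"]  # one member missing

hit_a = find_pattern(genome_a, RBS_PATTERN)
hit_b = find_pattern(genome_b, RBS_PATTERN)
hit_c = find_pattern(genome_c, RBS_PATTERN)
print(hit_a, hit_b, hit_c)
```

Allowing gaps or a missing member would require a looser comparison, for example counting shared COGs within a window rather than demanding an exact run.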

Case 3 : Functional Coupling of Two or More Themes
Theme 1 : Transcription
COG0779 Uncharacterized conserved protein
COG0195 Transcription elongation factor
COG2740 Predicted nucleic-acid-binding protein (transcription termination?)
Theme 2 : Translation
COG1358 Ribosomal protein S17E
COG0532 Translation initiation factor 2 (GTPase)
COG1550 Uncharacterized conserved protein
COG0858 Ribosome-binding factor A
COG0184 Ribosomal protein S15P/S13E
COG0130 tRNA pseudouridine synthase
Hitchhiker?
COG0196 FAD synthase (hitchhiker?)
http://biocyc.org:1555/ECOLI/new-image?type=OPERON&object=TU341

Case 3 : Functional Coupling
Z-score > 750, Intergenic Distance = 300
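How tightly two gene families are coupled across genomes can be approximated by a co-occurrence fraction: of the genomes that contain family a in some cluster, how many also have family b in that same cluster. The Python sketch below is an assumed scoring scheme for illustration, not the measure used in the study; the `cooccurrence` name, the genome labels g1 to g4, and the placeholder families P, Q, H, R and S are all invented.

```python
from typing import Dict, List


def cooccurrence(fam_a: str, fam_b: str, clusters_by_genome: Dict[str, List[List[str]]]) -> float:
    """Fraction of genomes containing fam_a where fam_b shares a cluster with it."""
    with_a = 0
    together = 0
    for clusters in clusters_by_genome.values():
        if any(fam_a in cluster for cluster in clusters):
            with_a += 1
            if any(fam_a in cluster and fam_b in cluster for cluster in clusters):
                together += 1
    return together / with_a if with_a else 0.0


# Invented clusters: P and Q always travel together, H only sometimes joins them.
clusters_by_genome = {
    "g1": [["P", "Q", "R"], ["H"]],
    "g2": [["P", "Q", "H"]],
    "g3": [["P", "Q"]],
    "g4": [["Q", "S"]],
}
tight = cooccurrence("P", "Q", clusters_by_genome)
loose = cooccurrence("H", "Q", clusters_by_genome)
print(tight, loose)
```

A family that scores high against its neighbors behaves like a functional partner, while one that scores markedly lower behaves like a hitchhiker candidate.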

Case 3 : Conclusion
Functional coupling : In bacteria, transcription, translation and RNA modification/degradation are coupled, and the advantages of co-regulating the corresponding genes are obvious.
COG0779 (uncharacterized) is almost inseparable from COG0195 (transcription elongation factor), so it is likely to be a functional partner of COG0195.
Hitchhiker : The association of COG0196 (FAD synthase) is not as tight as the associations between the genes belonging to the theme.
Gene function prediction : The functions of 3