See this page online at: http://www.bioscienceworld.ca/BioinformaticsInTheEraOfGenomics
By Nansheng Chen, Ph.D.
One of the biggest changes in the field of biology in the past decade is the shift in the paradigm of how biologists carry out research.
Thanks to this change, researchers in biology and related fields have entered an unprecedented golden period, marked by a wider landscape, more excitement, profound opportunities, and greater hopes.
At the heart of this change is the birth of bioinformatics, an interdisciplinary field that firmly builds on the strength of molecular biology and information technology, two rapidly developing fields.
Although the history of bioinformatics can be traced back to the 1970s, when protein structural biologists started to curate protein sequences and structures and to set up simple protein databases, it is the explosive development of genomics that has driven the rapid establishment and wide acceptance of bioinformatics as a field.
Genomics is the study of an organism’s genome and can be simply taken as modern genetics. The establishment of the field of genomics is highlighted by the successful completion of the Human Genome Project in 2003. The success of genomics represents one of many remarkable breakthroughs in the glorious history of genetics in the past 100 years, which include the discovery and rediscovery of Mendel’s laws (1900s), the establishment of chromosome theory (1920s), the discovery of the DNA double helix structure by James Watson and Francis Crick (1950s), and the invention of molecular cloning and DNA sequencing technologies (1970s). In particular, the invention of molecular cloning served as the foundation of a brand new industry, biotechnology, in the 1980s.
The invention of DNA sequencing technologies by Fred Sanger at the MRC Laboratory of Molecular Biology in Cambridge and Walter Gilbert at Harvard University, which earned them a Nobel Prize in Chemistry in 1980, paved the way for genome sequencing and the establishment of the field of genomics. In fact, using his own invention, now called the Sanger sequencing method, Fred Sanger himself was able to sequence the first complete genomes of a virus and a mitochondrion in the late 1970s and early 1980s. By the time the international consortium of scientists in China, France, Germany, Japan, the United Kingdom, and the United States announced the completion of the Human Genome Project in 2003, the genomes of many other species had been sequenced, analyzed, and published, including the budding yeast (Saccharomyces cerevisiae) in 1996, the nematode worm (Caenorhabditis elegans) in 1998, the fruit fly (Drosophila melanogaster) in 2000, and a small flowering plant (Arabidopsis thaliana), also in 2000. The successful completion of these genome sequencing projects and the analysis of these genomes marked the arrival of the era of genomics.
While the success of these genome sequencing projects turned a historical page in genetics, it presented two major challenges that are beyond the scope of genomics. The first challenge is data management. Data generated by genome sequencing projects can no longer be comfortably and adequately written into pages of traditional lab notebooks. Obviously, much more effective methods for data collection, storage, and retrieval are needed. The then rapidly developing computing science came to the rescue.
However, even experienced computer scientists struggled at first. Lincoln Stein, who is currently a Professor and a Platform Leader of Informatics and Biocomputing at the Ontario Institute for Cancer Research and was a bioinformatician at the MIT Whitehead Institute Genome Sequencing Center, wrote in 1996: “The scale of the human DNA sequencing project is enough to send your average Unix system administrator running for cover. From the information-handling point of view, DNA is a very long string consisting of the four letters G, A, T and C (the letters are abbreviations for the four chemical units that form the ‘rungs’ of the DNA double helix ladder). The goal of the project is to determine the order of letters in the string.”
The second challenge is data analysis. Although genome sequences contain the ultimate genetic instructions that define the structure and function of essentially all aspects of living organisms, ranging from a single virus to highly sophisticated species such as humans, it is not immediately obvious how such instructions are encoded in the genomic sequences produced by the growing number of genome sequencing projects. Sophisticated computer programs need to be designed to comb through the genome sequences and extract relevant information about functional sequences such as protein-coding genes, RNA genes, and regulatory elements, among many other types of functional elements.
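To make the idea of "combing through" a sequence concrete, here is a minimal Python sketch that treats DNA as a string and scans it for open reading frames (a start codon followed in frame by a stop codon). It is a toy under stated assumptions, not how real gene prediction programs work: it looks only at the forward strand and ignores introns, codon usage, and regulatory signals. The function name `find_orfs` and the parameter `min_codons` are illustrative choices, not part of any existing tool.

```python
# Toy open reading frame (ORF) scanner. Illustration only: real gene
# finders also model splicing, codon bias, both strands, and more.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) pairs for forward-strand ORFs.

    `start` is the index of the A in ATG; `end` is exclusive and
    includes the stop codon. An ORF is reported only if it has at
    least `min_codons` codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i                       # open a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # close it and keep scanning
    return sorted(orfs)


example = "CCATGAAATTTGGGTAACC"
print(find_orfs(example))  # [(2, 17)]
```

Even this small example hints at why the problem is hard: a short `min_codons` threshold reports many spurious frames, while a long one misses short genes.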
These two major challenges presented by genomics attracted innovative minds in computing science and in many other fields, including mathematics, statistics, and physics. It is their joint effort that quickly laid a solid foundation for the establishment of the new field of bioinformatics. The NIH working definition of bioinformatics (2000) is “Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.”
Bioinformatics can have many different faces or different branches. The origins of bioinformatics lie in the field of protein structural biology and many early bioinformatics programs and databases were developed to store, compare, and analyze protein structures. Now this branch of bioinformatics is sometimes called structural bioinformatics. Bioinformatics is sometimes referred to as neuroinformatics when the content under consideration is collected in the field of neuroscience. Although the general goal of bioinformatics is to increase our understanding of biological processes, major research effort in the field of bioinformatics is devoted to genomics.
In addition to its close relationship to genomics, bioinformatics also goes hand in hand with another field called computational biology, which is defined by NIH as “The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.” These two fields, bioinformatics and computational biology, overlap significantly, as shown by the fact that most bioinformaticians are also computational biologists, and vice versa.
They also have significant differences. In essence, bioinformatics is concerned more with data management and analysis, while computational biology is concerned more with program development.
A fundamental task, and one of the most dynamic activities, in the field of bioinformatics is the creation of biological databases.
Since the creation of protein databases in the 1970s, many biological databases have been established. These databases can be broadly divided into four categories: (1) Primary databases, which host primary data, including primary nucleotide databases such as GenBank (http://www.ncbi.nlm.nih.gov/Genbank/), developed by the National Center for Biotechnology Information (NCBI) in collaboration with the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL), and primary protein databases such as the Protein Data Bank (PDB, http://www.rcsb.org/pdb/home/home.do); (2) Model organism databases (MODs), which focus on a single model organism, including WormBase (http://www.wormbase.org/) and FlyBase (http://www.flybase.org/); (3) Comparative genomics databases, which host a number of closely and distantly related species, including the UCSC Genome Browser (http://genome.ucsc.edu/) and the Ensembl database (http://www.ensembl.org/); and (4) Systems biology databases, which describe pathways and interactions, including the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (http://www.genome.ad.jp/kegg/) and the Reactome database (http://www.reactome.org/).
In addition to these well-established databases, new databases are constantly being created. Some were created to accommodate results from specific data analysis projects. For example, the InParanoid database (http://inparanoid.sbc.su.se/cgi-bin/index.cgi) has been set up to host pairwise orthologs. Others are set up to accommodate new data types. For example, the accelerating accumulation of sequence reads from next-generation sequencing technologies calls for the establishment of extremely large databases. Still others are set up to accommodate specific large-scale projects, such as the HapMap project (http://www.hapmap.org/) and dbGaP (the database of Genotype and Phenotype), which was developed to archive and distribute the results of studies focused on the interaction between genotype and phenotype. Due to the dynamic nature of biological studies, new biological databases are constantly established and existing databases are constantly updated. These new databases and updates are regularly reported in the Database issue of Nucleic Acids Research, which is published each January.
All biological databases provide tools for data retrieval and visualization. A typical MOD provides five different approaches for users to retrieve data: (1) Interactive searching (by text or by similarity) and browsing through a web interface; (2) Batch retrieval of data; (3) Query language searching; (4) Bulk downloads through FTP sites; and (5) Scripting (for advanced users with programming literacy).
For better interpretation of genomics and other types of data, an increasing number of biological databases are providing various types of visualization tools. Most genome databases provide genome browsers and comparative genome browsers. Other databases provide more specialized visualization tools. For example, PDB allows users to visualize three-dimensional protein structures using an array of different viewers, including KiNG, Jmol, WebMol, MBT SimpleViewer, MBT Protein Workshop, and Quick PDB.
The most challenging and exciting area in the field of bioinformatics is arguably data analysis, including sequence alignment, database searches, phylogenetic analysis, genome analysis, gene prediction, transcriptional regulation analysis, and molecular evolution analysis. Many bioinformatics tools were developed in the 1980s and early 1990s, including the popular similarity search tools BLAST and FASTA, the widely used multiple sequence alignment tool ClustalW, and the phylogenetic analysis toolkit PHYLIP. In genome analysis, accurate gene prediction has been a long-standing challenge.
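As a small illustration of what sequence alignment involves, the Python sketch below scores the best global alignment of two sequences with the classic Needleman-Wunsch dynamic programming recurrence. This is not the algorithm used by BLAST or FASTA, which rely on faster heuristics for database searches; it is simply the textbook starting point. The scoring values (match +1, mismatch -1, gap -2) and the function name are arbitrary assumptions chosen for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b (Needleman-Wunsch).

    Only two rows of the dynamic programming table are kept, so memory
    use grows with len(b) rather than len(a) * len(b).
    """
    # Row 0: aligning an empty prefix of `a` against prefixes of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of `a` against an empty string
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diag = prev[j - 1] + pair      # align a[i-1] with b[j-1]
            up = prev[j] + gap             # gap in b
            left = curr[j - 1] + gap       # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # 7
print(global_alignment_score("ACGT", "ACT"))         # 1
```

The second call scores 1 because the best alignment pairs three matching letters and pays for one gap (3 - 2).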
Although many gene prediction programs, including Gene Finder by Phil Green, GenScan by Christopher Burge, GeneWise by Ewan Birney and TwinScan by Michael Brent, have been developed and applied to identify genes and make gene predictions, they suffer from both high false negative and false positive rates. On the one hand, many true genes have not been found; on the other, many predicted genes are not real genes at all. Experimental evidence, including that gathered from EST and cDNA sequencing, has been used to circumvent this challenge. Recent advances in sequencing technology, including the next generation sequencing methods, have made large-scale sequencing of cDNA libraries more affordable in labs of any size, offering a means of overcoming the first obstacle. Accordingly, new bioinformatics systems will be set up to accommodate such cDNA sequences and gene identification.
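The two kinds of error described above can be quantified by comparing a set of predictions against a trusted reference set. The Python sketch below does this at the level of gene identifiers, which is a deliberate simplification: real evaluations usually compare coordinates at the nucleotide, exon, and gene levels. The gene names and the function name are made up for illustration.

```python
def evaluate_predictions(predicted, reference):
    """Compare predicted gene IDs with a reference set of real genes."""
    predicted, reference = set(predicted), set(reference)
    true_pos = predicted & reference    # predicted and real
    false_pos = predicted - reference   # predicted but not real
    false_neg = reference - predicted   # real but missed
    return {
        "false_positives": sorted(false_pos),
        "false_negatives": sorted(false_neg),
        # Fraction of real genes that were found.
        "sensitivity": len(true_pos) / len(reference) if reference else 0.0,
        # Fraction of predictions that are real genes.
        "precision": len(true_pos) / len(predicted) if predicted else 0.0,
    }


# Hypothetical data: four confirmed genes, three predictions.
reference = ["geneA", "geneB", "geneC", "geneD"]
predicted = ["geneA", "geneB", "geneX"]
print(evaluate_predictions(predicted, reference))
```

In this made-up case the predictor misses two real genes (geneC and geneD) and reports one gene that does not exist (geneX), giving a sensitivity of 0.5 and a precision of about 0.67.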
At a global level, bioinformatics is critical for understanding genome architecture and structural variation through comparative genomics. Accumulating evidence suggests that genes are not randomly distributed in a genome but form functional gene clusters such as operons.
Examples of such clusters include the HOX gene clusters and clusters of tissue-specific genes. In contrast, extensive genome rearrangement events, including indels (insertions and deletions), inversions, transpositions, reciprocal translocations, and segmental duplications, have also been identified in all sequenced genomes. Thus, genome architecture is highly dynamic and is shaped by two opposing forces: conservation and diversification. Many tools have been developed to identify conserved gene clusters among species (synteny blocks) as well as genome rearrangements, but most of these have been designed for specific purposes, and hence they show critical limitations for general applications. More generalized bioinformatics methods are expected to be developed for understanding genome architecture as well as the molecular evolution of genomes.
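The notion of a conserved block can be sketched in a few lines of Python. The example below takes the gene order of two genomes, with orthologous genes sharing the same name, and reports runs of genes that are adjacent and in the same order in both. It is a minimal sketch under strong assumptions: each gene occurs once per genome, and inverted blocks, gaps, and duplications are not handled, which is exactly the kind of limitation noted above. The function name and gene labels are hypothetical.

```python
def conserved_blocks(genome_a, genome_b, min_genes=2):
    """Find runs of genes that are consecutive and co-oriented in both genomes.

    Each genome is a list of gene names in chromosomal order. Assumes
    every gene name appears at most once per genome.
    """
    position_b = {gene: idx for idx, gene in enumerate(genome_b)}
    blocks, current = [], []

    def flush():
        # Keep the run only if it is long enough to count as a block.
        if len(current) >= min_genes:
            blocks.append(list(current))
        current.clear()

    for gene in genome_a:
        if gene not in position_b:
            flush()                     # no ortholog in genome B: block ends
        elif current and position_b[gene] == position_b[current[-1]] + 1:
            current.append(gene)        # extends the run in both genomes
        else:
            flush()
            current.append(gene)        # starts a new candidate run
    flush()
    return blocks


# Hypothetical gene orders: genome B carries a rearrangement that
# moved the g4-g6 segment ahead of the g1-g3 segment.
genome_a = ["g1", "g2", "g3", "g4", "g5", "g6"]
genome_b = ["g4", "g5", "g6", "g9", "g1", "g2", "g3"]
print(conserved_blocks(genome_a, genome_b))
# [['g1', 'g2', 'g3'], ['g4', 'g5', 'g6']]
```

The breakpoint between the two reported blocks marks where a rearrangement separates otherwise conserved gene order.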
In addition to gene prediction and genome architecture analysis, bioinformatics resources and tools have been applied to understanding pathogen genomes for the discovery of virulence factors and effective drug targets, to identifying human disease genes, to the general understanding of genome expression, and to realizing the concept of personalized medicine, to name a few. Bioinformatics is still a young field, growing together with genomics on the one side and computational biology on the other. Over time, these three fields have become essentially inseparable: it is hard to imagine how scientists in the field of genomics would carry out research without interacting with databases (bioinformatics) and applying computer programs (computational biology). With the declining cost and increasing power of DNA sequencing, as demonstrated by the next-generation sequencing methods, bioinformatics is poised to play an even more critical role in the fast-developing era of genomics.
Acknowledgement:
Research in the author’s laboratory is supported by grants from Natural Sciences and Engineering Research Council (NSERC) of Canada, the Michael Smith Foundation for Health Research (MSFHR), Simon Fraser University, and the SFU Community Trust Endowment Fund (CTEF). The author thanks Dr. Maja Tarailo-Graovac for comments.
Nansheng Chen is an Associate Professor in the Department of Molecular Biology and Biochemistry at Simon Fraser University.
He is also a Michael Smith Foundation for Health Research (MSFHR) Scholar. He is interested in developing bioinformatics programs and genomics (and molecular) tools for understanding genome architecture and expression. In particular, he is engaged in characterizing gene structures, transcription factors and regulatory elements, sequence and structural variations, and, more importantly, regulatory mechanisms. His ultimate goal is to understand how genome sequences and their organization (including various modifications) direct development and biological functions. Current research projects in his lab fall into two major categories: (1) Identification and characterization of novel genes, synteny blocks, functional elements, genome variations and genome rearrangements; (2) Transcriptional regulation of genes and gene batteries.