Advisory Board

Professor Mark B. Gerstein

Mark B. Gerstein, Ph.D., is co-director of the Yale Computational Biology and Bioinformatics program, Albert L. Williams Professor of Biomedical Informatics, Professor of Molecular Biophysics & Biochemistry, and Professor of Computer Science at Yale University.
 
Mark is on the editorial boards of Genome Research, Functional and Integrative Genomics, Molecular Systems Biology, J Struc Func Genetics, PLoS Computational Biology, Genome Biology, BMC Bioinformatics, Molecular & Cellular Proteomics, and Molecular Biology & Evolution.
 
His research involves applying quantitative approaches such as data mining and simulation to problems in molecular biology. He is specifically interested in human genome annotation, molecular networks, and macromolecular geometry.
 
Research Summary: Protein Bioinformatics
 
As the 21st century unfolds, the biological sciences are being transformed by the advent of large-scale data. The sequencing of the human genome is a dramatic example of this. Alongside this increase in biological data, computers and computation have transformed the way information is handled, stored, and mined. These computational advances apply, of course, to many facets of life. The goal of his lab is to connect these two developments: harnessing computational advances for the analysis of large-scale biological data, principally by performing integrative surveys and systematic data mining.
 
More specifically, he is focused on protein bioinformatics: understanding the structure, function, and evolution of proteins through analyzing populations of them in databases and in whole-genome experiments. Overall he has four research foci, which follow a progression from surveying the overall genomic landscape to analyzing individual proteins and their interactions in more detail, to zooming in on the chemical structure of specific molecules.
 
1. Genomics: Mining and Annotating Intergenic Regions, especially in relation to Pseudogenes
 
Mark is involved in a number of large-scale collaborations (e.g. ENCODE) to probe the activity of intergenic regions with tiling array technology. His lab has developed tools to design, score, and interpret these arrays and to highlight particular array artifacts. The overall conclusion from this work has been that much of the intergenic portion of the human genome appears to be active, both transcriptionally and in terms of protein binding. In connection with tiling array experiments, he has done extensive intergenic annotation, with a particular focus on mining intergenic regions for pseudogenes (protein fossils). His lab was, in fact, one of the first groups to perform comprehensive surveys of pseudogenes on a genome-wide scale in terms of protein families, which it did for human, worm, yeast and a number of other organisms. Collectively, his studies enable him to determine the common “pseudofolds” and “pseudofamilies” in various genomes and to address important evolutionary questions about the types of proteins that were present in an organism's past.
 
2. Proteomics: Using Networks to Mine Functional Genomic Data and Understand Protein Function
 
After the main elements of the human genome are identified, we need to characterize their function. His lab is trying to characterize gene function through molecular networks. He works on systematically integrating many weak functional genomic features with data mining techniques to predict protein networks (comprising protein interactions and other functional linkages). Some of the features integrated are obviously related to protein interactions (e.g. expression correlations), but many others such as gene essentiality are much less so. In addition, he has studied the structure of protein networks, both on a large scale in terms of global statistics (e.g. the diameter) and on a small scale in terms of local network motifs (e.g. hubs). In particular, he has correlated network hubs with gene essentiality. Most importantly, he extensively studies the dynamics of networks. This has allowed him to show how a network dramatically changes in different conditions.
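The two kinds of network statistics mentioned above, a global one (the diameter) and a local one (hubs), can be illustrated with a minimal sketch. This is not code from the lab: the toy edge list, the protein names P1 to P6, the degree cutoff, and all function names are assumptions made for illustration, and the network is assumed to be undirected and connected.

```python
from collections import deque

def adjacency(edges):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def hubs(edges, min_degree):
    """Return nodes with at least `min_degree` distinct partners (cutoff is arbitrary)."""
    adj = adjacency(edges)
    return sorted(node for node, partners in adj.items() if len(partners) >= min_degree)

def eccentricity(adj, start):
    """Longest shortest-path distance from `start`, found by breadth-first search."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return max(dist.values())

def diameter(edges):
    """Largest eccentricity over all nodes; assumes the network is connected."""
    adj = adjacency(edges)
    return max(eccentricity(adj, node) for node in adj)

# Hypothetical toy interaction network, not real data.
TOY_EDGES = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P4", "P5"), ("P5", "P6")]

if __name__ == "__main__":
    print("hubs:", hubs(TOY_EDGES, min_degree=3))
    print("diameter:", diameter(TOY_EDGES))
```

In this toy network P1 has three partners and is the only node meeting the cutoff, while the longest shortest path (P2 to P6) spans four edges. Correlating hubs with essentiality, as described above, would additionally need an essentiality label per protein, which is not sketched here.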
 
3. Structural Genomics: Analysis of Folds, Families and Functions on a Large Scale
 
Another area of research in his lab is structural genomics. Here, he conceptualizes proteins not purely as character sequences or abstract network nodes, but more in terms of their molecular structure. He has examined the large-scale relationships between sequence, structure, and function in order to understand the extent to which structural and functional annotation can reliably be transferred between similar sequences, particularly when similarity is expressed in modern probabilistic language. He has related the occurrence of protein folds and families to phylogeny and deep evolutionary history. His studies enabled him to recognize that particular folds are more common in certain organisms than in others. Finally, as part of his work on structural genomics, he relates the properties of proteins to their eventual success at being purified and structurally characterized. This work uses a database and decision-tree mining framework that he has built for the NESG structural genomics consortium.
 
4. Computational Biophysics: Relating Macromolecular Motions and Packing
 
The final area of focus in his lab is analyzing small populations of structures in terms of their detailed 3D-geometry and physical properties. Here, he tries to interpret macromolecular motions in terms of packing. He has set up a database of macromolecular motions and coupled it with simulation tools to interpolate between structural conformations; the database also has tools to predict likely motions based on simple models, such as normal modes and localized hinges connecting rigid domains. Part of this project involves devising a system for characterizing motions in a highly standardized fashion. His motions classification scheme is motivated by the fact that protein interiors are packed exceedingly tightly, and the tight packing can greatly constrain a protein’s mobility. His lab has developed tools for measuring and comparing the packing efficiency at different interfaces (e.g. inter-domain, protein surface, helix-helix, protein vs. RNA) using specialized geometric constructions (e.g. Voronoi polyhedra).
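As a rough illustration of what interpolating between two structural conformations means, the sketch below moves each atom along a straight line from its starting to its ending position. Straight-line interpolation is a deliberately simple stand-in chosen here, not the simulation method used by the database, and the function name, coordinate format, and example coordinates are all assumptions.

```python
def interpolate_conformations(start, end, n_frames):
    """Linearly interpolate atom coordinates between two conformations.

    `start` and `end` are equal-length lists of (x, y, z) tuples with atoms
    in the same order. Returns `n_frames` frames including both endpoints.
    This is a naive sketch: it ignores bond lengths and steric clashes.
    """
    if len(start) != len(end):
        raise ValueError("conformations must contain the same number of atoms")
    if n_frames < 2:
        raise ValueError("need at least two frames (the two endpoints)")
    frames = []
    for k in range(n_frames):
        t = k / (n_frames - 1)  # fraction of the way from start to end
        frame = [
            tuple(s + t * (e - s) for s, e in zip(atom_start, atom_end))
            for atom_start, atom_end in zip(start, end)
        ]
        frames.append(frame)
    return frames

# Hypothetical two-atom example: the second atom shifts 2 units along y.
START = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
END = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]

if __name__ == "__main__":
    for frame in interpolate_conformations(START, END, 3):
        print(frame)
```

Because intermediate frames from a straight-line path can place atoms unrealistically close together, this kind of sketch also hints at why tight packing constrains which motions are physically plausible.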
 
Summary & Broader Societal Issues
 
In summary, his lab acts as a connector, bringing quantitative approaches from disciplines such as CS and applied math to bear on real questions and data in molecular biology. In particular, he has extensively applied classical computational approaches involving simulation, machine learning, and database design to biological problems. This often happens in the framework of practical, experimental collaborations, where his lab functions as part of multi-disciplinary teams. Team participation is a key feature of his lab. Finally, as part of his mission to connect biology with computation, he has also extensively analyzed how a number of larger issues relating to computation in society impact biological research. In particular, he has examined how general aspects of e-publishing and digital libraries relate to biomedical databases and how various legal and security concerns significantly impact genomics database interoperation.
 
Mark has authored/coauthored over 400 scientific publications including “Paired-End Mapping Reveals Extensive Structural Variation in the Human Genome”, “What is a gene, post-ENCODE? History and updated definition”, “Relating three-dimensional structures to protein networks provides evolutionary insights”, “PeakSeq: Systematic Scoring of ChIP-Seq Experiments Relative to Controls”, “Unlocking the secrets of the genome”, “Mapping copy number variation by population scale genome sequencing”, “Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project”, “The genomic complexity of primary human prostate cancer”, and “Variation in Transcription Factor Binding Among Humans”. Read the full list of his publications!
 
Mark earned his BA (summa cum laude) in Physics at Harvard in 1989. He earned his Ph.D. in Biophysics/Chemistry at Cambridge in 1993. He completed his post-doc in Bioinformatics at Stanford in 1996.
 
Watch CRG Symposium: Mark Gerstein and Insights from integrative analysis of the C. elegans genome — Mark Gerstein. Read Yale team finds order amidst the chaos within the human genome; Mom and Dad’s contributions counted and fossil DNA not dead after all. Read his Google+ profile, LinkedIn profile, and his Wikipedia profile. Follow his Twitter feed.