Professor Mark B. Gerstein
Mark B. Gerstein, Ph.D., is
co-director of the Yale Computational Biology and Bioinformatics
program, and Albert L. Williams Professor of Biomedical Informatics,
Professor of Molecular Biophysics & Biochemistry, and Professor of
Computer Science at Yale University.
Mark is on the editorial boards of Genome Research,
Functional and Integrative Genomics,
Molecular Systems Biology,
J Struc Func Genetics,
PLoS Comp Bio,
Genome Biology,
BMC Bioinformatics,
Molecular & Cellular Proteomics, and
Molecular Biology & Evolution.
His research involves applying quantitative approaches such as data mining
and simulation to problems in molecular biology. He is specifically
interested in human genome annotation, molecular networks, and
macromolecular geometry.
Research Summary: Protein Bioinformatics
As the 21st century unfolds, the biological sciences are being
transformed by the advent of large-scale data. The sequencing of the
human genome is a dramatic example of this. Alongside this
increase in biological data, computers and computation have had a
transformative effect on the way information is handled, stored, and
mined. These computational advances apply, of course, to many facets of
life. The goal of his lab is to connect these two developments:
harnessing computational advances for the analysis of large-scale
biological data, principally by performing integrative surveys and
systematic data mining.
More specifically, he is focused on protein bioinformatics:
understanding the structure, function, and evolution of proteins through
analyzing populations of them in databases and in whole-genome
experiments. Overall he has four research foci, which follow a
progression from surveying the overall genomic landscape to analyzing
individual proteins and their interactions in more detail, to zooming in
on the chemical structure of specific molecules.
1. Genomics: Mining and Annotating Intergenic Regions, especially in relation to Pseudogenes
Mark is involved in a number of large-scale collaborations (e.g.
ENCODE) to probe the activity of intergenic regions with tiling array
technology. His lab has developed tools to design, score, and interpret
these arrays and to highlight particular array artifacts. The overall
conclusion from this work has been that many of the intergenic regions
of the human genome appear to be active, both transcriptionally and in
terms of protein binding. In connection with tiling array experiments,
he has done extensive intergenic annotation, with a
particular focus on mining intergenic regions for pseudogenes (protein
fossils). His lab was, in fact, one of the first groups to perform
comprehensive surveys of pseudogenes on a genome-wide scale in terms of
protein families, which it did for human, worm, yeast and a number of
other organisms. Collectively, his studies enable him to determine the
common “pseudofolds” and “pseudofamilies” in various genomes and to
address important evolutionary questions about the types of proteins that
were present in an organism’s past.
2. Proteomics: Using Networks to Mine Functional Genomic Data and
Understand Protein Function
After the main elements of the human genome are identified, we need
to characterize their function. His lab is trying to characterize gene
function through molecular networks. He works on systematically
integrating many weak functional genomic features with data mining
techniques to predict protein networks (comprising protein interactions
and other functional linkages). Some of the features integrated are
obviously related to protein interactions (e.g. expression
correlations), but many others such as gene essentiality are much less
so. In addition, he has studied the structure of protein networks, both
on a large scale in terms of global statistics (e.g. the diameter) and
on a small scale in terms of local network motifs (e.g. hubs). In
particular, he has correlated network hubs with gene essentiality. Most
importantly, he extensively studies the dynamics of networks. This has
allowed him to show how a network dramatically changes in different
conditions.
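
The two scales of network analysis mentioned above, global statistics such as the diameter and local features such as hubs, can be illustrated with a minimal sketch. The toy network, the node names, and the degree cutoff used to call a node a hub are all hypothetical choices for illustration; they are not taken from his lab's data or software.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def diameter(graph):
    """Longest shortest path between any two nodes (assumes a connected graph)."""
    return max(max(bfs_distances(graph, node).values()) for node in graph)


def hubs(graph, min_degree):
    """Nodes with at least min_degree neighbors (the cutoff is an assumption)."""
    return sorted(node for node, nbrs in graph.items() if len(nbrs) >= min_degree)


# Hypothetical undirected interaction network, stored as adjacency sets.
toy_network = {
    "A": {"B", "C", "D", "E"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A"},
    "E": {"A", "F"},
    "F": {"E"},
}

print(diameter(toy_network))   # global statistic
print(hubs(toy_network, 3))    # highly connected nodes
```

In a real analysis, the hub list would then be compared against an independent list of essential genes; that step is omitted here because it depends on experimental data.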
3. Structural Genomics: Analysis of Folds, Families and Functions on a Large Scale
Another area of research in his lab is structural genomics. Here, he
conceptualizes proteins not purely as character sequences or abstract
network nodes, but more in terms of their molecular structure. He has
examined the large-scale relationships between sequence, structure, and
function in order to understand the extent to which structural and
functional annotation can reliably be transferred between similar
sequences, particularly when similarity is expressed in modern
probabilistic language. He has related the occurrence of protein folds
and families to phylogeny and deep evolutionary history. His studies
enabled him to recognize that particular folds are more common in
certain organisms than in others. Finally, as part of his work on structural
genomics, he relates the properties of proteins to their eventual
success at being purified and structurally characterized. This work has
been carried out within a database and decision-tree mining framework that
he has built for the NESG structural genomics consortium.
4. Computational Biophysics: Relating Macromolecular Motions and Packing
The final area of focus in his lab is analyzing small populations of
structures in terms of their detailed 3D geometry and physical
properties. Here, he tries to interpret macromolecular motions in terms
of packing. He has set up a database of macromolecular motions and coupled
it with simulation tools to interpolate between structural
conformations; the database also has tools to predict likely motions
based on simple models, such as normal modes and localized hinges
connecting rigid domains. Part of this project involves devising a
system for characterizing motions in a highly standardized fashion. His
motions classification scheme is motivated by the fact that protein
interiors are packed exceedingly tightly, and the tight packing can
greatly constrain a protein’s mobility. His lab has developed tools for
measuring and comparing the packing efficiency at different interfaces
(e.g. inter-domain, protein surface, helix-helix, protein vs. RNA) using
specialized geometric constructions (e.g. Voronoi polyhedra).
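
As a rough illustration of what interpolating between two structural conformations means, the sketch below uses plain linear interpolation of atomic coordinates. This is a deliberately simplified stand-in, not the simulation-based tools described above: straight-line paths can pass through physically unrealistic intermediates (for example, distorted bond lengths), which is one reason more careful methods exist. The function name and the two-atom example coordinates are hypothetical.

```python
def interpolate_conformations(start, end, n_frames):
    """Return n_frames coordinate sets moving linearly from start to end.

    start and end are equal-length lists of (x, y, z) tuples, one per atom,
    assumed to be already superimposed in the same reference frame.
    """
    if len(start) != len(end):
        raise ValueError("conformations must have the same number of atoms")
    if n_frames < 2:
        raise ValueError("need at least two frames (start and end)")

    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)  # fraction of the way from start (0) to end (1)
        frame = [
            tuple(a + t * (b - a) for a, b in zip(atom_start, atom_end))
            for atom_start, atom_end in zip(start, end)
        ]
        frames.append(frame)
    return frames


# Hypothetical two-atom example: the second atom moves 2 units along y.
open_form = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
closed_form = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]

path = interpolate_conformations(open_form, closed_form, 3)
for frame in path:
    print(frame)
```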
Summary & Broader Societal Issues
In summary, his lab acts as a connector, bringing quantitative
approaches from disciplines such as CS and applied math to bear on real
questions and data in molecular biology. In particular, he has
extensively applied classical computational approaches involving
simulation, machine learning, and database design to biological
problems. This often happens in the framework of practical, experimental
collaborations, where his lab functions as part of multi-disciplinary
teams.
Team participation is a key feature of his lab. Finally, as part of his
mission to connect biology with computation, he has also extensively
analyzed how a number of larger issues relating to computation in
society impact biological research. In particular, he has examined how
general aspects of e-publishing and digital libraries relate to
biomedical databases and how various legal and security concerns
significantly impact genomics database interoperation.
Mark has authored/coauthored over 400 scientific publications including
Paired-End Mapping Reveals Extensive Structural Variation in the Human Genome,
What is a gene, post-ENCODE? History and updated definition,
Relating three-dimensional structures to protein networks
provides evolutionary insights,
PeakSeq: Systematic Scoring of ChIP-Seq Experiments Relative to
Controls,
Unlocking the secrets of the genome,
Mapping copy number variation by population scale genome
sequencing,
Integrative Analysis of the Caenorhabditis elegans Genome by the
modENCODE Project,
The genomic complexity of primary human prostate cancer, and
Variation in Transcription Factor Binding Among Humans.
Read the
full list of his publications!
Mark earned his BA (summa cum laude) in Physics at Harvard in 1989. He
earned his Ph.D.
in Biophysics/Chemistry at Cambridge in 1993. He completed his post-doc
in Bioinformatics at Stanford in 1996.
Watch
CRG Symposium: Mark Gerstein and
Insights from integrative analysis of the C. elegans genome —
Mark Gerstein.
Read
Yale team finds order amidst the chaos within the human genome; Mom and
Dad’s contributions counted and fossil DNA not dead after all.
Read his
Google+ profile,
LinkedIn profile, and his
Wikipedia profile.
Follow his
Twitter feed.
