Interpreting genetic variation using large genomic datasets.
At the Karczewski lab at the Broad Institute, our research is focused on assembling and analyzing massive public datasets of genetic variation, and developing novel strategies using these to aid in the interpretation of putative disease variants, in order to better distinguish causal disease variants and improve our understanding of human biology.
Broadly speaking, we are interested in genome sequencing and its future role in our daily lives. As sequencing costs rapidly decrease, it is not difficult to imagine a future in which personal genetic information plays an important role in medicine and daily life. We are constantly discovering more of the genetic basis of disease, but much work remains to fully explain the genetic components of disease and other phenotypes.
A non-exhaustive list of projects
Interpreting genetic variants:
- leveraging the demographic history of human populations
- using first principles of the genome (LOFTEE)
Understanding the function of genes and the genome:
I am a computational biologist working on interpreting genetic variation from large-scale datasets. As exome and genome data grow to massive scales, the methods and analytical frameworks need to scale at the same rate, and interpreting results from large-scale analyses is an exciting challenge. My group builds methods to interpret genetic variation, to learn about the function of human genes and the regulation of the genome as a whole. I earned a B.A. in Molecular Biology from Princeton University and a Ph.D. in Biomedical Informatics from Stanford University.
Siwei is a Postdoctoral Research Fellow in Dr. Benjamin Neale's lab at ATGU and the Broad Institute of Harvard and MIT. Prior to joining the ATGU, Siwei earned her Ph.D. in Computational Biology from Cornell University, where her dissertation focused on identifying and interpreting disease mutations in the human protein interactome. Siwei is broadly interested in studying the genetic basis of human diseases and is currently working on whole-exome and whole-genome variation analysis to understand how genetic risk translates into biological mechanisms.
Associate Computational Scientists
Wenhan is an Associate Computational Biologist in the Karczewski and Neale Labs at the Broad Institute. She earned her B.S. in Mathematics from Nankai University and her M.S. in Biostatistics from Yale University. She is interested in developing statistical methods to uncover the signals underlying large-scale biomedical data, as well as building pipelines for the quality control of large data and data virtualization.
Friends of the lab
Rahul is an MD-PhD student in the Harvard-MIT Health Sciences and Technology program, and is currently doing his PhD in Genetics and Genomics at Harvard. He completed his undergraduate work in Chemical Engineering and Biology at the University of Pennsylvania in 2016. Co-mentored by Drs. Ben Neale and Vamsi Mootha, Rahul focuses on using human genetics to better understand perturbed pathways and complex disease mechanisms, with an emphasis on mitochondrial function and dysfunction.
We are committed to training the next generation of computational scientists. A crucial component of this training is building a collaborative spirit in the team, promoting a happy and healthy environment where everyone can excel, and ensuring that we move the science forward together in a rigorous fashion.
What we do
Our core mission is to use massive datasets to learn about human disease and the biology of the genome. To advance human genetics, we value high-quality data and code, reproducibility, and openness.
The onslaught of genetic data has arrived, and we are fortunate to work in a time when massive data sizes enable rigorous approaches and robust statistics. However, these data volumes also require special handling: at the Karczewski lab, we take a "cloud-first" approach to computational biology. As we are managing datasets in the 1 TiB range (with some crossing the 100 TiB threshold), scalable computation is a must. We value our mutually beneficial relationship with the Hail team: we build our pipelines in Hail and feed back our progress and issues, and they enable our ideas for massive-scale analysis.
High quality data
One of the cornerstones of efficient scientific progress in a field is the veracity of the literature, which creates confidence and builds trust. At the same time, high-dimensional datasets that represent millions to quadrillions of measurements have an inherent error rate. Understanding the error modes of each dataset, developing methods to address them, and faithfully reporting our results, including all raw data and code (see Reproducibility below), are crucial steps in ensuring high quality. This is an iterative process that occasionally takes longer than we'd like, but the end result is something we can all be proud of.
Reproducibility
Similarly, as our data grow and computational pipelines become more complex, having a foundation of public code creates a record of our analytical approach. For the community and for ourselves alike, the ability to reproduce each step of an analysis promotes code reusability and pays dividends on the inevitable need to rerun code after peer review, dataset updates, or that one additional QC step upstream. Additionally, public reproducible code leads to fewer mistakes due to out-of-sync intermediate files, and ensures that even when mistakes do occur, it is easy to document which steps are affected and to what degree, allowing for rapid correction of the scientific record if needed.
Open science
The publication policy at ATGU pledges that, in the spirit of rapid open science, we will submit all manuscripts to a preprint server at the time of journal submission. Furthermore, the large-scale datasets that we manage are the property of the community, not of the individual researcher who happened to aggregate them or run a particular analysis on them. As stewards of these data, we commit to releasing intermediate data products on completion of quality control rather than at the time of publication, if allowed by regulatory bodies. This may include hosting datasets, primarily for initial releases of data, or browser frameworks to enable exploration of the data.
How we do it
Equally important to the science we advance is the manner in which we perform the research and the development of each lab member. As every career level and trajectory will be different, I encourage each lab member to reach out to define a set of goals, so that we can work together to achieve them. Lab members are expected to pursue rigorous research and effective dissemination of this research, which often involves publications, though other forms of communication, including code release and browser development, are also encouraged.
The ATGU and the Broad Institute are highly dynamic and collaborative environments, where we work with experts in computational biology, statistical genetics, population genetics, scalable computing, medical genetics, and more. As all trainees build a research program, we encourage discussion of the ideas, methods, and results with those in the local and broader environments, in order to advance the science as effectively as possible. Credit is infinitely divisible, and collaborations lead to new ideas and additional publications for all involved. Trainees should feel empowered to form collaborations, and I am happy to advise how to navigate these. Similarly, pursuing new directions that may or may not be related to your primary projects can be a valuable endeavor, and I encourage spending 10-20% of your time on average on exploring new avenues. Feel free to discuss these directions with me so that we can ensure your time is spent effectively.
Work-life balance
While our work is important, I believe that a happy and healthy team is an effective team. To this end, I encourage each member of the team to identify and adopt their most efficient working style, and to respect the choices of others. I expect an amount of work commensurate with what is written in your contract, but understand that different stages of work and life may require different considerations. Accordingly, I am flexible on working hours, and though Slack messages may come at all times or on all days, no one should feel pressured to respond to messages outside of their working hours. Occasional exceptions may arise, such as the lead-up to conferences like ASHG, but in these cases, taking a much-needed break after the event is encouraged.
A critical aspect of being a successful scientist in the modern era is the ability to communicate our research to others in our field, other scientists, and the public. I will provide opportunities for lab members to present their research internally and externally, and I commit to helping each individual craft a narrative for their research as they present to the public. I also encourage public discussion of research before publication (see Open Science, above), as well as teaching opportunities as desired.
For individuals further along in their training, mentoring junior researchers is a key component of well-rounded training. I am happy to discuss an arrangement that suits everyone's research questions and career trajectories.
Full list available at Google Scholar.
Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, et al., "The mutational constraint spectrum quantified from variation in 141,456 humans." Nature. 2020 May 27. doi: 10.1038/s41586-020-2308-7. Flagship article for the gnomAD package. See Press below.
Karczewski KJ and Martin AR. "Analytic and Translational Genetics." Annu Rev Biomed Data Sci. 2020. doi: 10.1146/annurev-biodatasci-072018-021148.
Karczewski KJ and Snyder M. "Integrative Omics for Health and Disease." Nat Rev Genet. 2018 May; 19(5):299-310. doi: 10.1038/nrg.2018.4.
Karczewski KJ, Weisburd B, Thomas B, Ruderfer DM, Kavanagh D, Hamamsy T, et al., "The ExAC Browser: Displaying reference data information from over 60,000 exomes." Nucleic Acids Res. 2017 Jan 4; 45(D1):D840-D845. doi: 10.1093/nar/gkw971. Epub 2016 Nov 28. (bioRxiv. doi: 10.1101/070581. 2016 Aug 19.)
Lek M, Karczewski KJ*, Minikel EV*, Samocha KE*, Banks E, Fennell T, et al., "Analysis of protein-coding genetic variation in 60,706 humans." Nature. 2016 Aug 17; 536(7616):285-91. doi: 10.1038/nature19057. (bioRxiv. doi: 10.1101/030338. 2015 Oct 30).
Karczewski KJ, Snyder M, Altman RB, Tatonetti NP. "Coherent functional modules improve transcription factor target identification, cooperativity prediction, and disease association." PLoS Genetics. 2014; 10(2):e1004122. doi: 10.1371/journal.pgen.1004122.
Karczewski KJ*, Dudley JT*, Kukurba KR, Chen R, Butte AJ, Montgomery SB, Snyder M. "Systematic functional regulatory assessment of disease-associated variants." Proc Natl Acad Sci U S A. Epub 2013 May 20. doi: 10.1073/pnas.1219099110.
Dudley JT and Karczewski KJ. Exploring Personal Genomics. January 2013. Oxford University Press.
Karczewski KJ*, Tirrell RP*, Tatonetti NP, Dudley JT, Cordero P, Salari K, et al., "Interpretome: A Freely Available, Modular, and Secure Personal Genome Interpretation Engine." Pac Symp Biocomput. 2012; 17:339-350. Epub 2011 Oct 25.
Karczewski KJ, Tatonetti NP, Landt SG, Yang X, Slifer T, Altman RB, Snyder M. "Cooperative Transcription Factor Associations Discovered using Regulatory Variation." Proc Natl Acad Sci U S A. 2011 Aug 9;108(32):13353-8. doi: 10.1073/pnas.1103105108. Epub 2011 Jul 26.
gnomAD: The Genome Aggregation Database, dataset and browser.
Genebass: Rare variant associations for >3k phenotypes in the UK Biobank.
LOFTEE: loss-of-function variation annotation.
Hail (contributor): open-source library for large-scale data analysis.