Spring 2020 seminars

 


Monday April 13th 2020, 11:00am - 12:15pm, Zoom

Kannan Premnath,
Department of Mechanical Engineering, CU Denver

Central Moment Lattice Boltzmann Methods: Modeling and Applications for Complex Flow Simulations

Lattice Boltzmann methods (LBM) are stream-and-collide based algorithms involving the evolution of the particle distribution functions inspired by kinetic theory for computational fluid dynamics. They are efficient, naturally parallelizable and deliver second-order accuracy with high fidelity involving relatively low dissipative truncation errors. The collision step in LBM, generally represented as a relaxation process, plays a prominent role in modeling the relevant physics and in tuning the desired numerical properties, such as the numerical stability of the scheme. Among different possible techniques, the central moment LBM, a promising formulation, is generally based on the relaxation of various central moments of the distribution functions to their equilibria at different rates under collision. The latter is usually constructed either directly from the continuous Maxwell distribution function or by exploiting its factorization property. In this presentation, we will first provide an introductory survey of the LBM, and then discuss our contributions based on this classical formulation of the central moment LBM related to enhancing its efficiency and enabling new modeling capabilities for a variety of complex flow applications, such as thermal convection, non-Newtonian flows, and multiphase flows with surface tension effects, including its modulation by the presence of surfactants. In addition, we have recently developed a new formulation for the collision operator of the central moment LBM from a different perspective involving the continuous space-time Fokker-Planck (FP) kinetic equation, which was originally proposed for representing stochastic diffusive processes such as Brownian dynamics, by adapting it as a collision model for the Boltzmann equation for flow simulations. The resulting discrete formulation can be interpreted in terms of the relaxation of the various central moments to “equilibria” that depend only on the adjacent, lower order post-collision moments. We designate this newly constructed chain of equilibria as the Markovian central moment attractors; the relaxation rates are based on scaling the drift coefficient of the FP model by the order of the participating moment. We will conclude by presenting this more recent modified FP-guided central moment LBM and demonstrate its accuracy and robustness through comparisons against other collision models for some benchmark flows.
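For orientation (standard textbook notation, not specific to the speaker's formulation), the simplest single-relaxation-time lattice Boltzmann update reads

$$ f_i(\mathbf{x} + \mathbf{e}_i \Delta t,\, t + \Delta t) = f_i(\mathbf{x}, t) - \frac{1}{\tau}\left[ f_i(\mathbf{x}, t) - f_i^{eq}(\mathbf{x}, t) \right], $$

where the $f_i$ are distribution functions along discrete velocities $\mathbf{e}_i$ and $\tau$ is a single relaxation time; central moment methods of the kind discussed here instead relax each central moment $\hat{m}_\alpha$ toward its own equilibrium, $\hat{m}_\alpha \leftarrow \hat{m}_\alpha + \omega_\alpha\,(\hat{m}_\alpha^{eq} - \hat{m}_\alpha)$, with a separate rate $\omega_\alpha$ per moment.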


Friday April 13th 2020, 11:00am - 12:15pm, Zoom

Michael Santana, Grand Valley State University, Allendale, MI 

Finding disjoint doubly chorded cycles

Abstract: In 1963, Corrádi and Hajnal verified a conjecture of Erdős by showing that for all $k \in \mathbb{Z}^+$, if a graph $G$ has at least $3k$ vertices and $\delta(G) \ge 2k$, then $G$ will contain $k$ disjoint cycles. This result, which is best possible, has served as motivation behind many recent results in finding sharp minimum degree conditions that guarantee the existence of a variety of structures. In particular, Qiao and Zhang in 2010 showed that if $G$ has at least $4k$ vertices and $\delta(G) \ge \lfloor \frac{7k}{2}\rfloor$, then $G$ contains $k$ disjoint doubly chorded cycles. Then in 2015, Gould, Hirohata, and Horn proved that if $G$ has at least $6k$ vertices, then $\delta(G) \ge 3k$ is sufficient to guarantee $k$ disjoint doubly chorded cycles. In this talk, we extend the result of Gould et al. by showing that $\delta(G) \ge 3k$ suffices for graphs on at least $5k$ vertices (which is best possible for the given minimum degree), and we present an improvement of the Qiao and Zhang condition to $\delta(G) \ge \lceil \frac{10k-1}{3}\rceil$, which is sharp. This is joint work with Maia Wichman.


Wednesday April 8th 2020, 2:00pm - 3:00pm, Zoom

Lu Vy
Department of Mathematical and Statistical Sciences, CU Denver

MS Project Presentation: Variance Reduction Methods Based on Multilevel Monte Carlo

Advisor: Yaning Liu
Committee: Erin Austin, Yaning Liu (Committee Chair), Burt Simon

If we could see into the future, then finance would be a lot easier. Unfortunately, we can’t, so stock traders work with mathematicians. When a time machine isn’t available, the next best option is a good mathematical model. While many excellent models exist to predict stock prices, their complexity often precludes an analytic solution. When this is the case, simulation becomes the best alternative. What began at Los Alamos as Monte Carlo estimation has evolved over the past 80 years into something ubiquitous in financial mathematics. Today, Monte Carlo computational methods are so heavily used that pseudo-random numbers alone hardly suffice. Predicting the modern market requires efficiency, and to this end, a number of variance reduction techniques have emerged. In this project, I juxtapose two of them and find that their combined effects are synergistic.

In 2004, Okten introduced a method of generating high-quality random numbers. Somewhat paradoxically, he proposed randomizing the deterministic Halton sequence. Doing so, he argued, allows us to keep the space-filling property of quasi-random sequences without the troublesome correlation between terms. In 2008, Giles published the paper that introduced the multi-level Monte Carlo (MLMC) algorithm. Instead of high-quality numbers, he sought efficiency in the structure of the algorithm. Monte Carlo algorithms naturally require averaging many estimates, but due to computational expense, one must choose between averaging many poor estimates and averaging a few good estimates. Giles found a way to incorporate both, thus producing an even better estimate. Both methods work well by themselves, but no one has yet tried to combine them.
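As a rough illustration of the multilevel idea (a minimal sketch under simplifying assumptions, in Python, and not the code developed in this project), the MLMC estimator writes $E[P_L] = E[P_0] + \sum_{l=1}^{L} E[P_l - P_{l-1}]$ and estimates each term with its own sample size, using many cheap coarse samples and only a few expensive fine ones:

```python
import numpy as np

# Toy multilevel Monte Carlo estimate of E[S_T] for geometric Brownian motion
# dS = mu*S dt + sigma*S dW, discretized with Euler-Maruyama.
# Level l uses 2**l time steps; coarse and fine paths share Brownian increments.

rng = np.random.default_rng(0)
mu, sigma, S0, T = 0.05, 0.2, 100.0, 1.0

def level_estimator(l, n_samples):
    """Average of P_l - P_{l-1} (or of P_0 at the coarsest level) over n_samples paths."""
    nf = 2 ** l                      # number of fine steps
    dtf = T / nf
    dW = rng.normal(0.0, np.sqrt(dtf), size=(n_samples, nf))
    Sf = np.full(n_samples, S0)
    for k in range(nf):              # fine-path Euler steps
        Sf = Sf + mu * Sf * dtf + sigma * Sf * dW[:, k]
    if l == 0:
        return Sf.mean()
    nc = nf // 2                     # coarse path reuses summed Brownian increments
    dtc = T / nc
    Sc = np.full(n_samples, S0)
    for k in range(nc):
        dWc = dW[:, 2 * k] + dW[:, 2 * k + 1]
        Sc = Sc + mu * Sc * dtc + sigma * Sc * dWc
    return (Sf - Sc).mean()

L = 5
samples = [20000 // (2 ** l) + 100 for l in range(L + 1)]  # fewer samples on finer levels
estimate = sum(level_estimator(l, n) for l, n in zip(range(L + 1), samples))
print(estimate, "vs exact", S0 * np.exp(mu * T))
```

Replacing the pseudo-random increments with randomized quasi-random points, such as Okten's randomized Halton sequence, is the combination the project investigates.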


Monday April 6th 2020, 11:00am - 12:15pm, Zoom

Chase Viss
Department of Mathematical and Statistical Sciences, CU Denver

PhD Defense: Circuits in Optimization

Committee:  Stephen Hartke (Chair), Steffen Borgwardt (Advisor), Weldon Lodwick, Florian Pfender, and Tamon Stephen

Abstract: Circuits play a fundamental role in the theory of linear programming due to their intimate connection to algorithms of combinatorial optimization and the efficiency of the simplex method. Generalizing edge walks, circuit walks follow the edge directions of the underlying polyhedron and often have useful combinatorial interpretations. Further, circuits are used as step directions in various augmentation schemes for solving linear programs. We are interested in better understanding the properties of circuit walks in polyhedra as well as working toward viable implementations of circuit augmentation schemes.
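For context, a standard definition from the circuit augmentation literature (included here for readers new to the term): for a polyhedron $P = \{x \in \mathbb{R}^n : Ax = b,\ Bx \le d\}$, the circuits are the vectors $g \in \ker(A) \setminus \{0\}$, normalized to coprime integer components, for which $Bg$ has inclusion-minimal support; they comprise all potential edge directions of $P$ as the right-hand sides $b$ and $d$ vary.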

We first introduce a hierarchy for integral polyhedra based on different types of behavior exhibited by their circuit walks. Second, we relate circuits to a fundamental task in data analytics and machine learning: the clustering of large data sets. In particular, we consider an application in which one clustering is gradually transformed into another via circuit walks. Third, we address significant challenges regarding the computation of circuits via a proposed polyhedral model. This model serves as a universal framework for representing the set of circuits of any polyhedron and enables the efficient computation of so-called steepest-descent circuits. Lastly, we work toward a viable implementation of a steepest-descent circuit augmentation scheme in which our dynamic model provides the required augmenting directions.


Monday March 30th 2020, 11:00am - 12:15pm, Zoom

Jan van Leeuwen,
Colorado State University and University of Reading, UK

Particle Flow-based Bayesian Inference for high-dimensional geophysical problems

Bayesian Inference is the science of how to optimally combine existing information encoded in computational models with information in observations of the modeled system. Many systems in the geosciences are highly nonlinear, calling for fully nonlinear Bayesian Inference. Particle filters are one of the few fully nonlinear methods that seem to be feasible to serve that goal. The vanilla particle filters are highly sensitive to the likelihood, and the particle ensemble size needed for accurate results grows roughly exponentially with the number of independent observations. Localization methods for particle filters have been developed since the early 1990s and have undergone much refinement, but fundamental problems remain. These problems are related to the fact that even with localization the local areas contain too many observations to avoid degeneracy, and creating smooth posterior particles via gluing local particles together remains troublesome (although some remarkable successes have been booked recently). Equal-weight particle filters have been developed, exploring e.g. ideas from synchronization, but up to now only their first and second moments can be made unbiased.

An older development that has recently gained new attention is particle flows. The basic idea is to iteratively move all particles from being samples from the prior to equal-weight samples from the posterior. This motion of all particles through state space is defined via iteratively decreasing the distance between the actual particle density and the posterior density. Many methods to achieve this have been proposed and will be discussed briefly. Then we will describe a solution based on minimizing the relative entropy. By restricting the transport map to a Reproducing Kernel Hilbert Space, a practical solution is found.
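One well-known member of this family is Stein variational gradient descent, which also transports particles along a kernelized, KL-decreasing direction; the toy sketch below (our illustration on a 2D Gaussian target, not the speaker's method or a geophysical system) shows the generic mechanics:

```python
import numpy as np

# Toy kernelized particle flow in the spirit of Stein variational gradient descent:
# particles drawn from a broad prior are iteratively transported toward a 2D Gaussian
# "posterior" along a KL-decreasing direction restricted to an RKHS.

rng = np.random.default_rng(1)
target_mean = np.array([1.0, -1.0])
target_prec = np.linalg.inv(np.array([[1.0, 0.6], [0.6, 1.0]]))  # inverse covariance

def grad_log_p(x):
    """Score of the Gaussian target, evaluated row-wise for an (n, 2) array."""
    return -(x - target_mean) @ target_prec

particles = rng.normal(0.0, 3.0, size=(200, 2))   # samples from a wide prior
eps = 0.05                                        # step size

for _ in range(1000):
    diffs = particles[:, None, :] - particles[None, :, :]        # x_i - x_j
    sq = np.sum(diffs ** 2, axis=-1)
    h2 = np.median(sq) / np.log(len(particles) + 1)              # median bandwidth heuristic
    K = np.exp(-sq / h2)                                         # RBF kernel k(x_j, x_i)
    grad_K = 2.0 * diffs * (K / h2)[:, :, None]                  # grad of k(x_j, x_i) w.r.t. x_j
    phi = (K @ grad_log_p(particles) + grad_K.sum(axis=1)) / len(particles)
    particles += eps * phi                                       # transport step

print("empirical mean:", particles.mean(axis=0), "target mean:", target_mean)
```

For the geophysical setting described above, the practical questions are exactly the ones raised next: how to represent the prior accurately and how to choose the kernel covariance.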

When applying this to geophysical problems two issues arise: how to accurately represent the prior and how to choose the kernel covariance. These two problems are related, and we will demonstrate practical methods to solve this problem, based on iterative refinement of the kernel covariances. The behaviour of the new methodology will be investigated using toy problems and a highly nonlinear high-dimensional one-layer model of the atmosphere.


Wednesday March 11th 2020, 11:00am - 12:15pm, Student Commons Building room 4017

Ethan Anderes
Statistics, UC Davis

Gravitational wave and lensing inference from the CMB polarization

In the last decade cosmologists have spent a considerable amount of effort mapping the radially-projected large-scale mass distribution in the universe by measuring the distortion it imprints on the CMB. Indeed, all the major surveys of the CMB produce estimated maps of the projected gravitational potential generated by mass density fluctuations over the sky. These maps contain a wealth of cosmological information and, as such, are an important data product of CMB experiments. However, the most profound impact from CMB lensing studies may not come from measuring the lensing effect, per se, but rather from our ability to remove it, a process called delensing. This is due to the fact that lensing and emission of millimeter-wavelength radiation from the interstellar medium in our own galaxy are the two dominant sources of foreground contaminants for primordial gravitational wave signals in the CMB polarization. As such, delensing, i.e. the process of removing the lensing contaminant, together with our ability to either model or remove galactic foreground emission, sets the noise floor on upcoming gravitational wave science.

In this talk we will present a complete Bayesian solution for simultaneous inference of lensing, delensing and gravitational wave signals in the CMB polarization as characterized by the tensor-to-scalar ratio r parameter. Our solution relies crucially on a physically motivated re-parameterization of the CMB polarization which is designed specifically, along with the design of the Gibbs Markov chain itself, to result in an efficient Gibbs sampler---in terms of mixing time and the computational cost of each step---of the Bayesian posterior. This re-parameterization also takes advantage of a newly developed lensing algorithm, which we term LenseFlow, that lenses a map by solving a system of ordinary differential equations. This description has conceptual advantages, such as allowing us to give a simple non-perturbative proof that the lensing determinant is equal to unity in the weak-lensing regime. The algorithm itself maintains this property even on pixelized maps, which is crucial for our purposes and unique to LenseFlow as compared to other lensing algorithms we have tested. It also has other useful properties such as that it can be trivially inverted (i.e. delensing) for the same computational cost as the forward operation, and can be used for fast and exact likelihood gradients with respect to the lensing potential. Incidentally, the ODEs for calculating these derivatives are exactly analogous to the backpropagation techniques used in deep neural networks but are derived in this case completely from ODE theory.


Wednesday March 11th 2020, 9:00am - 11:00am, Student Commons Building room 4017

Stephan Patterson
PhD candidate, CU Denver

PhD Defense: Algorithms for Discrete Barycenters

Committee: Ethan Anderes, Stephen Billups (chair), Steffen Borgwardt (advisor), Yaning Liu, and Burt Simon

Discrete barycenters are solutions to a class of optimal transport problems in which the inputs are probability measures with finite support sets. Exact solutions can be computed through exponential-sized linear programs, a prohibitively expensive approach. Efficient computations are highly desirable, as applications arise in a variety of fields including economics, physics, statistics, manufacturing, facility location, and more. To illustrate the extremes of problem difficulty, we focus on two examples: a best-case setting based on the MNIST Digits data set and a worst-case setting based on Denver crime locations. We describe improved linear programming models and solving strategies for each setting, supported with implementations that demonstrate significant improvements in total running time and memory requirements. We conclude with a brief examination of our proof that a decision variant of the problem is computationally hard; that is, through a reduction from planar three-dimensional matching, we show it is NP-hard to decide whether there exists a solution with non-mass-splitting transport cost and support set size below prescribed bounds.
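One standard way to write the exact linear program (a common formulation from the literature, stated here for context rather than as the precise model of the thesis): with measures $P_1,\dots,P_n$ and weights $\lambda_i > 0$, $\sum_i \lambda_i = 1$, introduce a variable $w_{\mathbf{x}} \ge 0$ for every tuple $\mathbf{x} = (x_1,\dots,x_n)$ in the product of the support sets and solve

$$ \min \sum_{\mathbf{x}} w_{\mathbf{x}} \sum_{i=1}^{n} \lambda_i \left\| x_i - c(\mathbf{x}) \right\|^2, \qquad c(\mathbf{x}) = \sum_{i=1}^{n} \lambda_i x_i, $$

subject to $\sum_{\mathbf{x}\,:\,x_i = s} w_{\mathbf{x}} = P_i(s)$ for every support point $s$ of every measure $P_i$. The number of variables is the product of the support sizes, which is the exponential growth referred to above.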


Monday March 9th 2020, 11:00am - 12:15pm, Student Commons Building room 4017

Jan Mandel
Department of Mathematical and Statistical Sciences
CU Denver

Unlimited computing with the Open Science Grid

The Open Science Grid (OSG) is a consortium of institutions which make their facilities available to run a large number of jobs. I will describe the architecture, the power, and the limitations of OSG, show how you can get an account, and how to run a quick job. Bring your laptops! In the second part of the talk, I will provide a brief overview of a proposal submitted earlier this year for a new cluster. The cluster, if funded, should provide a part of its capacity to the OSG and integrate submitting jobs locally and to the OSG.
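As a flavor of what running a quick job can look like (a hedged sketch: OSG access points schedule work through HTCondor, but the file names, script, and resource requests below are placeholders, not the exact commands that will be shown in the talk):

```python
import pathlib
import subprocess

# Write a minimal HTCondor submit description for one test job and submit it.
# "hello.sh" is a placeholder executable; the resource requests are illustrative only.
submit = """\
executable     = hello.sh
log            = hello.log
output         = hello.out
error          = hello.err
request_cpus   = 1
request_memory = 1GB
request_disk   = 1GB
queue 1
"""

pathlib.Path("hello.sub").write_text(submit)
subprocess.run(["condor_submit", "hello.sub"], check=True)  # check progress later with condor_q
```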


Wednesday March 4th 2020, 2:00pm - 4:00pm, Student Commons Building room 4229

Mohammad Meysami 
PhD candidate, CU Denver

PhD thesis proposal: Retrospective and prospective disease surveillance using scan methods

Committee: Erin Austin (chair), Katie Colborn, Mike Ferrara, Joshua French (advisor), and Burt Simon

Identifying disease patterns and clusters correctly and quickly is a vital part of public health and epidemiology that will help us to prevent outbreaks and control disease spread at the earliest time. Numerous methods can be used for detecting clustering and clusters, such as Moran’s I for a global index of spatial autocorrelation or circular, flexibly-shaped, and cylindrical space-time scan methods for detecting local spatial and spatio-temporal clusters. In this dissertation proposal, we first briefly describe relevant methods for detecting clusters of disease. Next, a new methodology is proposed for estimating the population upper bound used in the circular scan method. We then propose new spatial and space-time scan methods to address the limitations and weaknesses of existing methods. Lastly, different performance measures such as power, sensitivity, PPV, and misclassification are used to evaluate the efficiency of the proposed approaches and provide a comparison to the other existing methods.
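For reference, two of the standard quantities mentioned above, in common textbook form (included only for context): Moran's I for observations $x_1, \dots, x_n$ with spatial weights $w_{ij}$ is

$$ I = \frac{n}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}, $$

and circular (Kulldorff-type) scan methods score each candidate zone $Z$ with a Poisson likelihood-ratio statistic of the form

$$ \Lambda(Z) = \left(\frac{c_Z}{\mu_Z}\right)^{c_Z} \left(\frac{C - c_Z}{C - \mu_Z}\right)^{C - c_Z} \mathbb{1}\{c_Z > \mu_Z\}, $$

where $c_Z$ and $\mu_Z$ are the observed and expected case counts in $Z$ and $C$ is the total case count; significance is typically assessed by Monte Carlo replication, and the population upper bound referred to above restricts how large the candidate zones $Z$ may grow.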


Friday February 28th 2020, 10:30am - 11:30am, Student Commons Building room 4017

Dr. Alvin C. Bronstein, PhD
Hawaii State Department of Health

Understanding the COVID-19 Emerging Infectious Disease

Background: 
On 31 December 2019, China reported a cluster of pneumonia cases in people associated with Huanan Seafood Wholesale Market exposure in Wuhan, Hubei Province. On 7 January 2020, Chinese health authorities confirmed a case associated with a novel coronavirus, 2019-nCoV. As the outbreak developed, cases appeared with no history of being in the Wuhan seafood market. Epidemiologic data began to indicate that person-to-person transmission of the virus was occurring. Person-to-person spread has been reported outside China, including in the United States (30 January 2020) and other countries. Chinese officials report that sustained person-to-person spread in the community is occurring in China. Though originally called novel coronavirus, on 11 February the World Health Organization (WHO) renamed the disease COVID-19.

Objectives:
1.    Describe the known coronaviruses
2.    Understand the anatomy of the COVID-19 virus
3.    Describe the COVID-19 clinical syndrome
4.    List clinical COVID-19 risk factors
5.    Describe how the virus is spread
6.    List common infection control practices
7.    Understand treatment options
8.    Discuss tracking COVID-19

Methods:
We will review the known history of COVID-19 and discuss the current global outbreak.  A basic understanding of the clinical picture of disease will be presented. Prevention best practices will be discussed. We will look at a dashboard developed to track the virus.

Question:
Is there a role for adaptive systems in describing and monitoring the outbreak?

Conclusion:
COVID-19 is an emerging infectious disease with medical, societal and economic impacts. A basic understanding of the virus and prevention practices is essential for everyone. 


Monday February 24th 2020, 11:00am - 12:15pm, Student Commons Building room 4017

Michelle Daya, PhD
Colorado Center for Personalized Medicine

A whirlwind tour of tools for genome-wide genetic analyses in diverse populations (with some “real life” examples)

Genome-wide genetic analyses in populations with substructure due to ancestry require special considerations. In this talk I will present a number of statistical techniques that I have found useful in my research to both account for and leverage substructure. I will summarize a number of methods: 1) Calculating ancestry-related principal components that reflect ancestry and not recent relatedness, and in turn estimating recent relatedness that is not biased by ancestry. I will show how these results can be used to ensure proper calibration of test statistics in genome-wide association studies (GWAS). 2) A meta-regression approach for trans-ethnic meta-analysis of GWAS that quantifies heterogeneity due to differences in ancestry. 3) Leveraging local ancestry in association testing. 4) Polygenic risk scoring in underrepresented populations. 5) Special considerations when estimating heritability from GWAS summary statistics in admixed populations.
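As a concrete anchor for item 4 (a generic textbook definition, not specific to the speaker's pipeline), a polygenic risk score for individual $i$ is the weighted allele count $\mathrm{PRS}_i = \sum_{j=1}^{M} \hat\beta_j\, G_{ij}$, where $G_{ij} \in \{0,1,2\}$ counts effect alleles at variant $j$ and $\hat\beta_j$ is the GWAS effect estimate; a central difficulty in underrepresented populations is that weights estimated in one ancestry often transfer poorly to another.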


Monday February 17th 2020, 11:00am - 12:15pm, Student Commons Building room 4017

Megan Null, PhD candidate, CU Denver

RAREsim: Simulating Rare Variant Genetic Data

Simulating realistic rare variant genetic data is vital for accurate evaluation of new statistical methods. Research suggests that large sample sizes and functional information are necessary for sufficiently powered rare variant association tests. Further, the distribution of simulated variants should be similar to that observed in sequencing data. Currently there is no simulation software that produces large sample sizes with realistic functional annotation and the expected allele frequency spectrum (AFS) across all, including very rare, variants.
We developed RAREsim, a flexible software that simulates large sample sizes of genetic data with an AFS similar to that observed in sequencing data. Because RAREsim simulates from a sample of real haplotypes, existing functional and other genetic annotation can be used, capturing known and unknown complexities of real data. RAREsim is a two-step algorithm. First, RAREsim simulates haplotypes using HAPGEN2 (Su 2011) allowing for mutations to occur at most sites across the region. Second, RAREsim prunes the rare variants using the expected number of variants at each minor allele count. The expected number of variants is calculated from an estimate of the total number of variants in the region and the AFS. Since the AFS and total number of variants have been shown to vary by ancestry and variant type (e.g. synonymous, intron), we provide tuning parameters to enable user flexibility while maintaining the general relationship between the number of variants, AFS, and sample size. While we derive default parameters from the Genome Aggregation Database (gnomAD), the user has the ability to vary the number and distribution of variants to reflect their desired distribution for the region.
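A hedged sketch of the pruning step described above (in Python for illustration; the function and column names are ours and not the RAREsim API):

```python
import numpy as np
import pandas as pd

# Illustrative pruning step: given simulated variants with minor allele counts (MAC)
# and the expected number of variants per MAC bin, randomly retain variants so the
# pruned data match the target allele frequency spectrum.

def prune_to_expected(variants: pd.DataFrame, expected_per_mac: dict, seed: int = 42) -> pd.DataFrame:
    """variants: one row per simulated variant with a 'mac' column.
    expected_per_mac: e.g. {1: 800, 2: 400, ...} from the fitted AFS model."""
    rng = np.random.default_rng(seed)
    kept = []
    for mac, group in variants.groupby("mac"):
        target = int(expected_per_mac.get(mac, len(group)))
        if len(group) > target:   # too many variants simulated at this allele count: thin them
            kept.append(group.sample(n=target, random_state=int(rng.integers(1 << 31))))
        else:                     # at or below the expected count: keep them all
            kept.append(group)
    return pd.concat(kept).sort_index()

# Toy usage:
rng = np.random.default_rng(0)
toy = pd.DataFrame({"mac": rng.integers(1, 6, size=5000)})
pruned = prune_to_expected(toy, {1: 800, 2: 400, 3: 200, 4: 100, 5: 50})
print(pruned["mac"].value_counts().sort_index())
```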
RAREsim is available as an R package and provides the ability to simulate large samples of rare variant data with functional annotation and the expected AFS for all, including very rare, variants. Realistic rare variant simulations are critical for rare variant method development. In turn, advances in these methods will allow for a greater understanding of the role rare variants play within health and disease.


Monday February 10th 2020, 11:00am - 12:15pm, Student Commons Building room 4017

Yaning Liu
Department of Mathematical and Statistical Sciences
CU Denver

From HDMR to FAST-HDMR: Surrogate Modeling for Uncertainty Quantification

Surrogate modeling is a popular and practical method to meet the needs of a large number of queries of computationally demanding models in the analysis of uncertainty, sensitivity and system reliability. We first explore various methods that can improve the accuracy of a particular class of surrogate models, the high dimensional model representation (HDMR), and their performances in uncertainty quantification and variance-based global sensitivity analysis. The efficiency of our proposed methods is demonstrated by a few analytical examples that are commonly studied for uncertainty and sensitivity analysis algorithms. HDMR techniques are also applied to an operational wildland fire model that is widely employed in fire prevention and safety control, and a chemical kinetics H2/air combustion model predicting the ignition delay time, which plays an important role in studying fuel and combustion system reliability and safety. We then show how the traditional Fourier Amplitude Sensitivity Testing (FAST), heavily used for variance-based global sensitivity analysis, can be treated in the framework of HDMR. The resulting surrogate model, named FAST-HDMR, is shown to be computationally more efficient than the original FAST. Various improvements that further enhance the accuracy of FAST-HDMR are discussed and illustrated by examples.
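For context, the HDMR expansion underlying these surrogates writes a model $f$ of inputs $x = (x_1, \dots, x_n)$ as

$$ f(x) = f_0 + \sum_i f_i(x_i) + \sum_{i<j} f_{ij}(x_i, x_j) + \cdots + f_{12\cdots n}(x_1, \dots, x_n), $$

and when the component functions are constructed to be mutually orthogonal the output variance decomposes additively, so that the first-order Sobol index of $x_i$ is $S_i = \operatorname{Var}[f_i(x_i)] / \operatorname{Var}[f(x)]$; truncating the expansion at low order gives the computationally cheap surrogates discussed in the talk.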


Wednesday February 5th 2020, 4:00pm - 5:30pm, Student Commons Building room 4113

James T. Campbell
University of Memphis

Lightning Strikes (see flyer)

A pair of undergraduate students in an honors seminar proposed the following discrete model for the formation of lightning. Place randomly generated numbers (levels) in each cell of an m x n grid, creating a configuration. Choose a starting cell along the top row, examine the neighboring cells, and (i) draw an edge to any neighbor whose level is less than or equal to our current level (such a cell has become visited), (ii) list the visited cells in a queue, and (iii) start the process over at the beginning of the queue, proceeding until the queue is empty.
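A direct transcription of this process (a sketch following the description above, with the level set {0, 1, 2} and the center starting cell described in the next paragraph):

```python
import random
from collections import deque

def lightning_strike(m, n, levels=(0, 1, 2), start_col=None, seed=None):
    """Run the queue-based process on an m-by-n grid of random levels.
    Returns True if some visited cell lies in the bottom row (a 'strike')."""
    rng = random.Random(seed)
    grid = [[rng.choice(levels) for _ in range(n)] for _ in range(m)]
    start = (0, n // 2 if start_col is None else start_col)

    visited = {start}
    queue = deque([start])
    while queue:                                             # (iii) process the queue until empty
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):    # examine the neighboring cells
            nr, nc = r + dr, c + dc
            if 0 <= nr < m and 0 <= nc < n and (nr, nc) not in visited:
                if grid[nr][nc] <= grid[r][c]:               # (i) edge to neighbors with level <= ours
                    visited.add((nr, nc))                    # (ii) add the newly visited cell
                    queue.append((nr, nc))
    return any(r == m - 1 for r, _ in visited)

# Monte Carlo estimate of the strike probability on a 50x50 grid:
trials = 2000
hits = sum(lightning_strike(50, 50, seed=t) for t in range(trials))
print(hits / trials)
```

The fraction of trials returning True estimates the strike probability discussed below.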

The pictures in Figure 1 (see flyer) were computer generated from this model, with a 50x50 grid, the cell values chosen uniformly from the set {0,1,2}, and the center cell in the top row as the starting point. Each picture corresponds to a different initial distribution of the integers in the cells.

We are interested in the fate of the resulting path, and would especially like to be able to compute the probability that some portion of the path reaches the bottom of the grid. We think of this case as success, or more colloquially, a lightning strike. Besides being fun to think about, it turns out that in its proper generality, the question is highly non-trivial. There are tons of related open questions, most of which are accessible to undergraduates.

Early results were obtained in collaboration with Lauren Sobral, who was an undergraduate at the time.

 


Monday February 3rd 2020, 11:00am - 12:15pm, Student Commons Building room 4017

Matt Strand
Professor and Head of Biostatistics
National Jewish Health

Building a practical survival model for the COPDGene study

There are different types of survival models, ranging from nonparametric (e.g., Kaplan-Meier), to semi-parametric (e.g., Cox proportional hazards), to parametric (e.g., accelerated failure time [AFT]) models. An AFT survival model using the Weibull distribution was built for subjects in the COPDGene study to quantify mortality risk as a function of several predictors spanning behavioral, physiologic, demographic and imaging categories. The risk model can be used to motivate patients to modify behaviors to decrease risk with the help of clinician input, and to identify subjects who can be targeted for therapeutic intervention or clinical trials. A point system was developed based on the fitted models (one for men, one for women) for ease of use and interpretation of predictors.
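In generic AFT form (standard notation, included for context), the Weibull model can be written as

$$ \log T = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \sigma W, $$

where $T$ is the survival time, $W$ follows a standard extreme-value distribution (so that $T$ is Weibull given the covariates), and $\exp(\beta_k)$ is the factor by which a one-unit increase in $x_k$ multiplies survival time; a point system of the kind described above typically converts the contributions $\beta_k x_k$ into rounded integer scores.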


Wednesday January 29th 2020, 11:00am - 12:15pm, Student Commons Building room 4017

Aakash Sahai
CU Denver Department of Electrical Engineering

Nonlinear Cylindrical Ion Soliton-driven cKdV equation

This talk will introduce how effectively modeling complex nonlinear collective phenomena using large-scale computations is critical for future scientific discovery. Collective motion of particles forms the basis of physical processes ranging from the astrophysical to the atomic scale. Lab-based nonlinear collective modes strongly driven as wakefields in gases have now paved the way to controllably access electric fields exceeding 100 GV/m. These fields can effect dramatic advances in particle acceleration technology by offering at least two orders of magnitude reduction in the size of future discovery machines that will succeed the 27 km-long LHC at CERN.

However, a major challenge lies in understanding how the electron modes interact with ions, especially in the region where the particle beam is accelerated. My work shows that a cylindrical ion-soliton can be driven by the steepened nonlinear electron modes excited as wakefields. A hollow region naturally excited in the plasma solves the critical problem of collisions and related undesirable effects. Proof of principle of the theoretical model of a driven cKdV equation is established using a computational model. Experiments have recently confirmed the existence of such long-lived soliton modes, which paves the way for transformative directions in accelerator technology.
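For reference, one common normalization of the (undriven) cylindrical KdV equation is

$$ \frac{\partial u}{\partial \tau} + 6\,u\,\frac{\partial u}{\partial \xi} + \frac{\partial^3 u}{\partial \xi^3} + \frac{u}{2\tau} = 0, $$

whose extra $u/(2\tau)$ term is what distinguishes it from the planar KdV equation; the talk concerns a driven variant of this equation, and the specific driving term is part of the speaker's model and is not reproduced here.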


Monday January 27th 2020, 11:00am - 12:15pm, Student Commons Building room 4017

Megan Sorenson, CU Denver PhD Student

Empirical simulation of very rare variant genetic data

Simulating realistic rare variant genetic data is vital for accurate evaluation of new statistical methods. Research suggests that large sample sizes and functional information are necessary for sufficiently powered rare variant association tests. Further, the distribution of simulated variants should be similar to that observed in sequencing data. HAPGEN2 (Su 2011) accurately simulates common genetic variants, but is unable to simulate data that reflects the observed allele frequency spectrum (AFS) for very rare variants, such as singletons and doubletons. Currently there is no simulation software that produces large sample sizes with realistic functional annotation and the expected AFS across all, including very rare, variants.

We developed RAREsim, a flexible software that simulates large sample sizes of genetic data with an AFS similar to that observed in sequencing data. Because RAREsim simulates from a sample of real haplotypes, existing functional and other genetic annotation can be used, capturing known and unknown complexities of real data. RAREsim is a two-step algorithm. First, RAREsim simulates haplotypes using HAPGEN2 (Su 2011) allowing for mutations to occur at most sites across the region. Second, RAREsim prunes the rare variants using the expected number of variants at each minor allele count. The expected number of variants is calculated from an estimate of the total number of variants in the region and the AFS. RAREsim is available as an R package, with a Shiny App for the user to select and visualize the allele distribution parameters. RAREsim provides the ability to simulate large samples of rare variant data with functional annotation and the expected AFS for all, including very rare, variants.

Jessica Murphy, CU Denver student

Accessible Analysis of Longitudinal Data with Linear Mixed Effects Models: There’s an App for That

Longitudinal mouse models are commonly used to study possible causal factors associated with human health and disease. However, the statistical models applied in these studies are often incorrect. If correlated observations in longitudinal data are not modeled correctly, they can lead to biased and imprecise results. Therefore, we provide an interactive Shiny App to enable appropriate analysis of correlated data using linear mixed effects (LME) models. Using the app, we re-analyze a dataset published by Blanton et al (Science 2016) that modeled mice growth trajectories after microbiome implantation from nourished or malnourished children. We then compare the fit and stability of LME models with different parameterizations. While the model with the best fit and zero convergence warnings differed substantially from the two-way ANOVA model chosen by Blanton et al, both models found significantly different growth trajectories for microbiota from nourished vs. malnourished children. We also show through simulation that the results from the two-way ANOVA and LME models will not always be consistent, supporting the need to model correlated data correctly. Hence, our app provides easy implementation of LME models for accessible and appropriate analysis of studies with longitudinal data.
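A hedged illustration of the kind of model involved (a minimal random-intercept-and-slope growth model in Python's statsmodels; the simulated data, column names, and package choice are ours, not the app's implementation):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate toy longitudinal data: repeated weight measurements per mouse, two groups.
rng = np.random.default_rng(0)
rows = []
for i in range(20):
    group = "nourished" if i < 10 else "malnourished"
    base_slope = 0.30 if group == "nourished" else 0.18
    intercept = 20 + rng.normal(0, 1.0)          # mouse-specific baseline weight
    slope = base_slope + rng.normal(0, 0.03)     # mouse-specific growth rate
    for day in range(0, 36, 5):
        rows.append({"mouse": i, "day": day, "group": group,
                     "weight": intercept + slope * day + rng.normal(0, 0.5)})
df = pd.DataFrame(rows)

# Linear mixed effects model: the fixed day-by-group interaction tests whether growth
# trajectories differ between groups, while a random intercept and slope per mouse
# model the correlation among repeated measures on the same animal.
model = smf.mixedlm("weight ~ day * group", data=df,
                    groups=df["mouse"], re_formula="~day")
print(model.fit(reml=True).summary())
```

Whether to include the random slope, and how to parameterize the random effects, are exactly the fit and stability questions the app is designed to make accessible.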


Monday January 15th 2020, 11:00am - 12:15pm, Student Commons Building room 4017

Joanne B. Cole, Ph.D.
Instructor, Harvard Medical School, 
Postdoctoral Research Fellow, Medical and Population Genetics Program, Broad Institute of MIT and Harvard

Genetics of dietary intake in UK Biobank: You eat what you are

Unhealthful diet is a leading risk factor for several life-altering metabolic diseases such as obesity, type 2 diabetes (T2D), and coronary artery disease (CAD), all of which substantially increase mortality and decrease quality of life. The recent advent of large biobanks with both genetic data and deep phenotyping enables us to study the genetics of modestly heritable traits, such as diet. We derived 170 data-driven dietary habits in UK Biobank (UKB), including single food quantitative traits and principal component (PC) analysis dietary patterns, and found most (84%) had a significant proportion of phenotypic variance that could be explained by common genetic markers (‘SNP heritability’), with milk type, butter consumption, dietary pattern PC1, alcohol intake, water intake, and adding salt to food having the largest genetic contributions. Genome-wide association studies (GWAS) testing for an association between genetic variants throughout the genome and 143 heritable dietary habits using linear mixed models in ~450K European individuals identified 814 independent genetic loci, of which 205 are novel and 136 were uniquely associated with dietary patterns and not single foods. We conducted genetic instrumental variable analysis (‘Mendelian randomization’) to identify causal relationships involving our lead “healthy” vs. “unhealthy” PC1 dietary pattern. Though we find little evidence that PC1, largely driven by type of bread consumed, has a causal effect on cardio-metabolic disease, we do find a significant bidirectional causal relationship with educational attainment, where the relative strengths of the causal estimates suggest that higher educational attainment and/or correlated traits, such as socioeconomic status, shift individuals towards healthier eating habits. Overall, this work uses comprehensive genetic analysis in a well-powered sample to expand our understanding of the genetic contributors to dietary intake, and uses these findings as tools to dissect relationships with human health and disease.