Hedvig Sundelin

Information about Hedvig

Recent work:

Machine Learning for Genetic Studies

Sundelin, Hedvig (2023).

Abstract

Preterm delivery (PTD) is a significant contributor to infant mortality and morbidity worldwide, influenced by environmental and genetic factors. Although previous studies have identified genetic variants associated with PTD and gestational duration, their effect sizes remain relatively small, leaving a substantial portion of the hereditary variation unexplained. This thesis explores the potential of machine learning (ML) techniques to uncover additional insights into PTD and gestational duration using genetic data. The background section underscores the global impact of preterm birth on child mortality and long-term health outcomes, emphasising the role of genetics, with an estimated heritability of around 30%. This project aims to apply ML techniques to improve the prediction of gestational duration and PTD based on genetic data. The research questions address ML model selection, the impact of variables on prediction performance, and a comparison to previous studies. The study is based on the Norwegian Mother, Father and Child Cohort Study (MoBa) and uses data from the Medical Birth Registry of Norway (MBRN). The scope includes the use of genetic data and a focus on the 23 loci previously identified in a related study. The theory chapter provides an overview of genetics and its application in studying complex conditions like preterm delivery. It also introduces ML and explains the theoretical foundations of different ML models. Subsequently, the methods and materials chapter describes the data acquisition process, preprocessing steps, employed ML classifiers, and model evaluation methods. The chapter highlights the use of neural networks, classic ML algorithms, and libraries for implementation. Results reveal varying AUC scores among classic models, with logistic regression (LR) performing the best. The choice of variables had a significant impact, with the maternal genome and the Top 23 set offering the best conditions. Network models achieved comparable scores for binary classification. Additional analyses on the predicted probabilities demonstrated higher AUC scores compared to binary classifications, identifying RMSprop as the best-performing network model. The study reveals a slight improvement in results compared to Polygenic Risk Scores (PRS) but a modest predictive ability overall. The findings in this study suggest that more extensive research is needed to unveil the potential of ML models in improving predictions based on genetic data.
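The gap the abstract reports between AUC on predicted probabilities and AUC on binary classifications can be illustrated with a small sketch. This is not code from the thesis: the labels and probabilities below are made-up toy values (not MoBa data), the 0.5 threshold is an assumption, and `roc_auc` is a hypothetical helper written for this example.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = preterm, 0 = term (illustrative values only).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80]

# Hard labels from an assumed 0.5 decision threshold.
hard = [1 if p >= 0.5 else 0 for p in probs]

auc_probs = roc_auc(y_true, probs)  # uses the full ranking
auc_hard = roc_auc(y_true, hard)    # only two score levels remain

print(f"AUC on probabilities: {auc_probs:.3f}")
print(f"AUC on hard labels:   {auc_hard:.3f}")
```

In this toy case the probabilities give an AUC of about 0.933 and the thresholded labels about 0.833. Thresholding collapses the ranking into two levels, so the positive scored 0.35 ties with every negative instead of outranking four of them.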
