Machine Learning for Genetic Studies



Additional plots and information:


Genetic data

This study was based on a subset of genetic regions previously identified in the largest GWAS to date [1]. The subset consisted of 7 833 SNPs and is henceforth referred to as the “full” set. From the “full” set, three smaller feature subsets of varying size were created. The feature set “top5” contains the five SNPs with the lowest p-values among all genome-wide significant SNPs in the previous GWAS, while “tops” contains all SNPs with a p-value below 1e-8 (n = 23). For the last feature set, “selected”, we applied an additional feature reduction method: scikit-learn's ‘SelectKBest’ was used to select the 100 highest-scoring features according to the ANOVA F-value.

| Name | Reduction | Selection | Features (nr) |
|---|---|---|---|
| Full | GWAS | No | 7 833 |
| Tops | GWAS | p-value $< 10^{-8}$ | 23 |
| Top5 | GWAS | min p-value | 5 |
| Selected | GWAS + f-anova | top k | 100 |

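
The four feature sets described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the genotype matrix, the case/control labels, and the GWAS p-values are synthetic stand-ins, and the variable names (`gwas_p`, `tops_idx`, `top5_idx`, `selected_idx`) are hypothetical. Only `SelectKBest` and the ANOVA scoring are taken from the text; `f_classif` is assumed to be the scoring function used.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Synthetic stand-in for the genotype matrix: rows are individuals, columns
# are SNPs coded as 0/1/2 allele counts. The study's "full" set has 7 833
# SNPs; a smaller number keeps this example fast.
n_samples, n_snps = 200, 500
X = rng.integers(0, 3, size=(n_samples, n_snps))
y = rng.integers(0, 2, size=n_samples)  # hypothetical case/control labels

# Hypothetical GWAS p-values, with exactly 23 SNPs forced below 1e-8.
gwas_p = rng.uniform(1e-6, 1.0, size=n_snps)
gwas_p[:23] = rng.uniform(1e-12, 1e-9, size=23)

# "tops": every SNP with a GWAS p-value below 1e-8.
tops_idx = np.flatnonzero(gwas_p < 1e-8)

# "top5": the five SNPs with the lowest GWAS p-values.
top5_idx = np.argsort(gwas_p)[:5]

# "selected": the 100 SNPs with the highest ANOVA F-value against the labels.
selector = SelectKBest(score_func=f_classif, k=100).fit(X, y)
selected_idx = selector.get_support(indices=True)

subsets = {
    "full": X,
    "tops": X[:, tops_idx],
    "top5": X[:, top5_idx],
    "selected": X[:, selected_idx],
}
for name, data in subsets.items():
    print(name, data.shape)
```

Unlike the p-value thresholds, which come from an external GWAS, `SelectKBest` uses the labels of the data it is fitted on. In a cross-validated setting it would normally be fitted on the training folds only; the source does not state how this was handled.
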
Plots
Bernoulli Naive Bayes, AUC by subset and genotype origin
Decision Tree, AUC by subset and genotype origin
Linear Discriminant Analysis, AUC by subset and genotype origin
Neural Network, AUC by subset and genotype origin
Polygenic score, AUC by subset and genotype origin
Quadratic Discriminant Analysis, AUC by subset and genotype origin
Random Forest, AUC by subset and genotype origin
Support Vector Machine Classifier, AUC by subset and genotype origin

Models
Linear Discriminant Analysis
Support Vector Machine Classifier
Decision Tree
Random Forest
Neural Network
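
As a rough illustration of how the models listed above could be compared by AUC, the sketch below fits each one with scikit-learn and scores it with 5-fold cross-validation. Everything beyond the model names is an assumption: the data are synthetic, the hyperparameters are arbitrary, and the source does not describe the evaluation protocol, so this is not the study's configuration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic genotypes (0/1/2 allele counts) for 300 individuals and 20 SNPs.
X = rng.integers(0, 3, size=(300, 20)).astype(float)

# Hypothetical phenotype: case probability depends on the first three SNPs.
logit = 0.8 * (X[:, 0] + X[:, 1] - X[:, 2] - 1.0)
y = (rng.uniform(size=300) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Illustrative hyperparameters only; none of these values come from the source.
models = {
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
    "Support Vector Machine Classifier": SVC(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                                    random_state=0),
}

# Mean ROC AUC over 5 stratified folds for each model.
auc = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}
for name, score in auc.items():
    print(f"{name}: AUC = {score:.3f}")
```

Repeating the loop once per feature set would give one AUC per model and subset, which is the quantity the plots above report.
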