Machine Learning for Genetic Studies


plot

Additional plots and information:


Subsets

The full set of data obtained from the GWAS 1 consists of 7 833 SNPs and is henceforth referred to as the “full” set. Of these SNPs, 23 have a p-value below 1e-8 and are henceforth referred to as the “tops”. Two optional feature reduction methods are implemented as well: ‘SelectKBest’, which selects the most significant features according to the ANOVA F-score (the “selected” set), and PCA, which reduces the number of features in the full set to 100 “artificial” SNPs (the “reduced” set). Results from the full set are not yet available.

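| Name     | Reduction              | Selection           | Features (no.) | SNPs |
|----------|------------------------|---------------------|----------------|------|
| Full     | GWAS                   | No                  | 7 833          | Yes  |
| Tops     | GWAS                   | p-value $< 10^{-8}$ | 23             | Yes  |
| Top5     | GWAS                   | min p-value         | 5              | Yes  |
| Reduced  | GWAS + PCA             | top k               | 100            | No   |
| Selected | GWAS + ANOVA F-score   | top k               | 100            | Yes  |
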
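
The subsets described above could be derived roughly as follows. This is a minimal sketch using scikit-learn, not the project's actual pipeline: the genotype matrix `X`, the labels `y`, the `p_values` array, and the helper `build_subsets` are all hypothetical stand-ins, and the data is random and much smaller than the real 7 833-SNP set.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 300 individuals, 500 SNPs coded as 0/1/2 allele counts.
n_samples, n_snps = 300, 500
X = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)
y = rng.integers(0, 2, size=n_samples)            # case/control labels
p_values = rng.uniform(0.01, 1.0, size=n_snps)    # placeholder GWAS p-values
p_values[:23] = 1e-10                             # pretend 23 SNPs pass the threshold


def build_subsets(X, y, p_values, threshold=1e-8, k=100, n_top=5):
    """Return a dict mapping subset name to feature matrix (assumed layout)."""
    return {
        "full": X,
        # SNPs whose GWAS p-value is below the threshold.
        "tops": X[:, p_values < threshold],
        # The n_top SNPs with the smallest p-values.
        "top5": X[:, np.argsort(p_values)[:n_top]],
        # k SNPs with the highest ANOVA F-score; columns are still real SNPs.
        "selected": SelectKBest(f_classif, k=k).fit_transform(X, y),
        # k principal components; columns are "artificial" SNPs.
        "reduced": PCA(n_components=k, random_state=0).fit_transform(X),
    }


subsets = build_subsets(X, y, p_values)
for name, matrix in subsets.items():
    print(name, matrix.shape)
```

Note that fitting `SelectKBest` or `PCA` on all samples before evaluation can leak label or test-set information; in a cross-validated setup these steps would normally be fitted on the training folds only.
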
Plots
Bernoulli Naive Bayes, AUC by subset and genotype origin
Decision Tree, AUC by subset and genotype origin
Linear Discriminant Analysis, AUC by subset and genotype origin
Neural Network, AUC by subset and genotype origin
Polygenic score, AUC by subset and genotype origin
Quadratic Discriminant Analysis, AUC by subset and genotype origin
Random Forest, AUC by subset and genotype origin
Support Vector Machine Classifier, AUC by subset and genotype origin
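
As an illustration of how a per-subset AUC like the ones plotted above might be computed, here is a small sketch using cross-validated ROC AUC. The subset matrices are synthetic, the helper `auc_by_subset` is hypothetical, and the choice of 5-fold cross-validation is an assumption; the source does not state how the reported AUCs were obtained.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_samples = 200
y = rng.integers(0, 2, size=n_samples)  # synthetic case/control labels

# Hypothetical subsets: Gaussian noise with a small class-dependent shift.
subsets = {
    "tops": rng.normal(size=(n_samples, 23)) + 0.5 * y[:, None],
    "top5": rng.normal(size=(n_samples, 5)) + 0.5 * y[:, None],
}


def auc_by_subset(model, subsets, y, cv=5):
    """Mean cross-validated ROC AUC of `model` on each subset."""
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
        for name, X in subsets.items()
    }


aucs = auc_by_subset(LinearDiscriminantAnalysis(), subsets, y)
for name, auc in aucs.items():
    print(f"{name}: AUC = {auc:.3f}")
```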

Models
Linear Discriminant Analysis
Support Vector Machine Classifier
Decision Tree
Random Forest
Neural Network
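
For orientation, the models listed above all have counterparts in scikit-learn. The sketch below instantiates them and scores each on a held-out split of synthetic data. The hyperparameters shown (mostly library defaults), the 70/30 split, and the use of `make_classification` are assumptions for illustration, not the project's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumed settings; probability=True lets the SVM expose predict_proba.
models = {
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
    "Support Vector Machine Classifier": SVC(probability=True, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                                    random_state=0),
}

# Synthetic stand-in for a 23-feature subset such as "tops".
X, y = make_classification(n_samples=300, n_features=23, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

test_auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of the positive class is used as the ranking score.
    scores = model.predict_proba(X_test)[:, 1]
    test_auc[name] = roc_auc_score(y_test, scores)
    print(f"{name}: AUC = {test_auc[name]:.3f}")
```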