For instance, when training a gestational age clock model from placental methylation, a sample can only be collected after delivery of the baby and the placenta. Most samples therefore have a gestational age greater than 30 weeks, which corresponds to moderate preterm and full-term births. Samples with younger gestational ages are scarce, so the sample distribution is heavily skewed toward large gestational ages, which impairs the ability of the trained model to predict small ones. However, differences in gestational age as small as one week can significantly influence neonatal morbidity, mortality, and long-term outcomes [18–23]. Hence, the model's accuracy across the whole gestational age range is essential.
To solve this problem, we developed the R package eClock (ensemble-based clock). It adapts the traditional machine learning strategy for handling imbalanced categorical data [24], combining bagging with SMOTE (Synthetic Minority Over-sampling Technique) to adjust the biased age distribution and to predict DNAm age with an ensemble model. This is the first application of these techniques to clock models, and it provides a new framework for clock model construction. eClock also provides other functions, such as training the traditional clock model, displaying features, and converting methylation probe/gene/DMR (DNA methylation region) values. To test the performance of the package, we used three different datasets, and the results show that the package can effectively improve clock model performance on rare samples.
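The general idea of combining bagging with SMOTE-style oversampling can be illustrated with a minimal sketch. It is not the eClock implementation (eClock is an R package, and its internals are not described here); it is a simplified Python illustration under stated assumptions. The function names (`smote_like`, `fit_line`, `train_ensemble`, `predict`), the 30-week cutoff used to define "rare" samples, the single scalar feature, and the linear base learner are all hypothetical choices made for illustration. Because the target is a continuous age rather than a class label, the sketch interpolates the age together with the feature, which is an assumption rather than part of the original SMOTE definition.

```python
import random
from statistics import mean


def smote_like(rare, n_new, k=3, rng=random):
    """Create n_new synthetic (feature, age) pairs by interpolating between a
    rare sample and one of its k nearest rare neighbours in feature space.
    Requires at least two rare samples."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rare)
        neighbours = sorted(
            (p for p in rare if p is not base),
            key=lambda p: abs(p[0] - base[0]),
        )[:k]
        nb = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append((base[0] + t * (nb[0] - base[0]),
                          base[1] + t * (nb[1] - base[1])))
    return synthetic


def fit_line(points):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    slope = sum((x - mx) * (y - my) for x, y in points) / sxx if sxx else 0.0
    return slope, my - slope * mx


def train_ensemble(data, rare_cutoff=30.0, n_models=10, seed=0):
    """Train n_models base learners, each on a rebalanced bag.

    rare_cutoff is an assumed threshold (in weeks) that splits the samples
    into a rare group and a common group."""
    rng = random.Random(seed)
    rare = [p for p in data if p[1] < rare_cutoff]
    common = [p for p in data if p[1] >= rare_cutoff]
    models = []
    for _ in range(n_models):
        # Bagging step: bootstrap the common samples with replacement.
        bag = [rng.choice(common) for _ in common]
        # Oversampling step: top up the rare group with synthetic samples
        # until it matches the size of the common group.
        n_new = max(0, len(common) - len(rare))
        bag += rare + smote_like(rare, n_new, rng=rng)
        models.append(fit_line(bag))
    return models


def predict(models, x):
    """Ensemble prediction: average of the base learners' predictions."""
    return mean(slope * x + intercept for slope, intercept in models)
```

Each base learner sees a different bootstrap of the common samples and a different set of synthetic rare samples, so averaging their predictions reduces the variance introduced by the oversampling. A real clock model would use many methylation features and a regularized regression rather than a one-feature line.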