Development of a Dynamic Diagnosis Grading System for Infertility Using Machine Learning

This prognostic study assesses whether machine learning can be used to develop a dynamic scoring system for predicting the severity of infertility in patients.


eAppendix 1. Entropy-based feature discretization algorithm
The specific steps of the Entropy-based feature discretization algorithm are as follows: Step 1: Sort all sample values of feature A in ascending order. Step 2: Traverse the sample values of feature A, using each value as a candidate segmentation point that divides the samples into 2 subsets. Step 3: Calculate the weighted average entropy after division at each candidate point, E = Σ_j w_j · Entropy(S_j), where w_j is the proportion of samples in the j-th subset relative to the total number of samples, and select the point with the smallest weighted average entropy as the segmentation point.
Step 4: While the entropy after division is greater than the set threshold and the number of groups is less than the specified maximum, repeat Steps 2-3 to continue dividing; otherwise, stop and output the division result.
Two parameters determine the effect of Entropy-based discretization. The first is the maximum number of groups, that is, the number of intervals after discretization; the second is the minimum entropy at which division stops. Together, these two parameters prevent the segmentation from continuing endlessly. For each feature, we found the best discretization parameters through repeated trials combined with clinical experience.
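The steps above can be sketched as follows. This is a minimal illustration of entropy-based binary splitting with the two stopping parameters (`max_groups`, `min_entropy`); the parameter values and function names are illustrative assumptions, not values from the study.

```python
import math

def entropy(labels):
    """Shannon entropy of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_split(values, labels):
    """Steps 1-3: sort by value, try each cut point, return the one
    with the smallest weighted average entropy."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_ent = None, float("inf")
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot cut between identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # weighted average entropy: E = sum_j w_j * Entropy(S_j)
        w_ent = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if w_ent < best_ent:
            best_cut, best_ent = pairs[i][0], w_ent
    return best_cut, best_ent

def discretize(values, labels, max_groups=4, min_entropy=0.3):
    """Step 4: keep splitting the highest-entropy segment until the
    entropy threshold or the maximum number of groups is reached."""
    segments = [(list(values), list(labels))]
    cuts = []
    while len(segments) < max_groups:
        idx = max(range(len(segments)), key=lambda i: entropy(segments[i][1]))
        v, y = segments[idx]
        if entropy(y) <= min_entropy or len(set(v)) < 2:
            break
        cut, _ = best_split(v, y)
        left = [(a, b) for a, b in zip(v, y) if a < cut]
        right = [(a, b) for a, b in zip(v, y) if a >= cut]
        segments[idx:idx + 1] = [
            ([a for a, _ in left], [b for _, b in left]),
            ([a for a, _ in right], [b for _, b in right]),
        ]
        cuts.append(cut)
    return sorted(cuts)
```

For example, `discretize([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1], max_groups=2, min_entropy=0.1)` places a single cut at 10, separating the two classes.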

eAppendix 2. RF feature weighting method
The specific steps of the RF feature weighting method are as follows: Step 1: Use bootstrapping to generate multiple decision trees from the original data set and construct a random forest model; identify the out-of-bag (OOB) data for each tree, and calculate the OOB data error OOBerror1 from the generated model.
Step 2: Randomly permute the values of feature A in the OOB data (i.e., add noise interference to feature A), and calculate the OOB data error again as OOBerror2.
Step 3: Suppose there are n trees in the forest; the importance of feature A is Σ(OOBerror2 - OOBerror1)/n, summed over the trees. The magnitude of the change in error determines the degree of feature importance. Finally, the importances of all features are normalized to obtain the weight of each feature.
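The permutation-importance idea above can be sketched as follows. Note one simplifying assumption: the study permutes each feature within the per-tree OOB samples, whereas this sketch uses a single held-out set as a stand-in for the OOB data, which typically yields a similar ranking. The data set and all parameter values are synthetic and illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: 2 informative features (columns 0-1), 3 noise features.
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
X_tr, X_oob, y_tr, y_oob = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
error1 = 1.0 - rf.score(X_oob, y_oob)          # baseline error (OOBerror1)

rng = np.random.default_rng(0)
raw = np.empty(X.shape[1])
for j in range(X.shape[1]):
    X_perm = X_oob.copy()
    rng.shuffle(X_perm[:, j])                  # noise-interfere feature j (Step 2)
    error2 = 1.0 - rf.score(X_perm, y_oob)     # error after permutation (OOBerror2)
    raw[j] = error2 - error1                   # importance of feature j (Step 3)

# Normalize the importances to obtain a weight per feature.
weights = raw.clip(min=0) / raw.clip(min=0).sum()
```

The two informative features receive most of the weight, while permuting the noise features barely changes the error.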
For the features in the data, a variable with a larger importance score is more important for classification and receives a correspondingly larger weight. Variable importance measurement is a natural by-product of the RF mechanism; it has good statistical robustness and has been applied successfully in different fields.

eAppendix 3. 10-fold cross validation
The specific steps of 10-fold cross validation are as follows: Step 1: The total scores of all samples are randomly divided into 10 equal parts to obtain 10 sample subsets.
Step 2: 9 sample subsets are selected in turn to form the training set, and the remaining one serves as the test set. The total score of each training set is graded by the Entropy-based method, so that 10 grading systems are established after the 10 divisions.
Step 3: Test the stability of each grading system using the test set corresponding to its training set, and define a stability index for each of the 10 systems (i = 1, 2, …, 10).
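The fold structure of this procedure can be sketched as follows. Here `grade_scores` is a hypothetical placeholder (simple quantile cuts) standing in for the Entropy-based grading of eAppendix 1, and the simulated scores and cut counts are illustrative assumptions; the paper's stability index formula is not reproduced.

```python
import numpy as np
from sklearn.model_selection import KFold

def grade_scores(scores, n_grades=4):
    """Hypothetical stand-in for the Entropy-based grading step:
    returns n_grades - 1 quantile cut points for the given scores."""
    return np.quantile(scores, np.linspace(0, 1, n_grades + 1)[1:-1])

rng = np.random.default_rng(0)
total_scores = rng.normal(50, 10, size=200)    # simulated total scores

cut_systems = []
kf = KFold(n_splits=10, shuffle=True, random_state=0)   # Step 1: 10 subsets
for train_idx, test_idx in kf.split(total_scores):      # Step 2: 9 train + 1 test
    cuts = grade_scores(total_scores[train_idx])        # grade the training scores
    cut_systems.append(cuts)                            # one grading system per fold

# Step 3: compare the 10 systems; a small spread of the cut points
# across folds indicates a stable grading.
spread = np.ptp(np.vstack(cut_systems), axis=0)
```

Each fold yields one set of cut points, and the spread of those cut points across the 10 folds gives a simple picture of how stable the grading is.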