Supplementary Information

Gene Expression Profiling of Pediatric Acute Myelogenous Leukemia

Section II: Methods

Additional Statistical Methods

The ANN supervised learning algorithms have been described previously.² To assess the performance of each ANN model, a confidence threshold was determined for each diagnostic subtype using a modification of the method described by Khan et al.³ Each model had two possible outputs: subgroup and non-subgroup. Three ANN models were built by 3-fold cross-validation using only samples in the training set. The training set samples were then shuffled and 3 additional ANN models were built. This model-building process was repeated 100 times, giving 100 validation nodal values for each training sample. To determine the 95% confidence threshold, an empirical probability distribution of the ANN output node value was constructed using only nodal values greater than 0.5. For each sample in the training set, the 100 validation subtype nodal values were averaged, and the sample was assigned to the subgroup only when its average subtype nodal value exceeded the 95% confidence threshold. Similarly, the nodal values for each test set sample were averaged, and the sample was assigned to a subgroup only when its average nodal value exceeded the 95% confidence threshold defined on the training set.
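The thresholding and assignment steps described above can be illustrated with a minimal Python sketch. It is not the code used in the study. It assumes the ANN output-node values have already been collected into an array with one row per sample and one column per repetition; the function names, the array layout, and the demonstration values are hypothetical. The text does not state which tail of the empirical distribution defines the 95% confidence threshold, so the sketch assumes the 5th percentile of the retained values (the value exceeded by 95% of outputs greater than 0.5) and exposes that choice as a parameter.

```python
import numpy as np


def confidence_threshold(validation_outputs, floor=0.5, percentile=5.0):
    """Empirical threshold from validation output-node values.

    validation_outputs: array of shape (n_samples, n_repetitions).
    Only values strictly greater than `floor` enter the distribution.
    ASSUMPTION: the "95% confidence threshold" is taken as the 5th
    percentile of the retained values; the source does not specify this.
    """
    values = np.asarray(validation_outputs, dtype=float).ravel()
    retained = values[values > floor]
    if retained.size == 0:
        raise ValueError("no output values above the floor")
    return float(np.percentile(retained, percentile))


def assign_to_subgroup(outputs, threshold):
    """Average each sample's output values and compare to the threshold.

    outputs: array of shape (n_samples, n_models).
    Returns a boolean array; True means "assigned to the subgroup".
    """
    mean_output = np.asarray(outputs, dtype=float).mean(axis=1)
    return mean_output > threshold


# Hypothetical demonstration values: 3 samples, 2 repetitions each.
demo_validation = np.array([[0.90, 0.80],
                            [0.20, 0.60],
                            [0.95, 0.10]])
demo_threshold = confidence_threshold(demo_validation)
demo_calls = assign_to_subgroup(demo_validation, demo_threshold)
```

The same `assign_to_subgroup` call would be applied to test set outputs, reusing the threshold computed from the training set so that no test set information enters the threshold.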