Subset selection from multi-experiment data sets with application to milk fatty acid profiles
Routine analyses that can handle large numbers of samples while avoiding costly and time-consuming analytical techniques are of high value. Such routine analyses usually require calibration, with the detailed analyses serving as reference values. This calls for a representative subset that reflects the complete range of the variables of interest. In this paper, this subset selection problem is tackled for multi-experiment data sets. Conventional techniques, such as the Kennard and Stone algorithm and OptiSim, are compared to a new approach based on Genetic Algorithms. The challenge for the latter is to find an adequate objective function and to modify the standard crossover and mutation operators so that the number of selected samples stays fixed. These techniques are applied to a data set containing the concentrations of 45 fatty acids, determined by a simplified reference method, in 1033 milk samples stemming from six different experiments. The objective is to select a subset of 100 samples in which each of the six experiments is sufficiently represented. While there is no obvious way to generalize the conventional methods to multi-experiment data sets, this can be accomplished quite easily for Genetic Algorithms by modifying the objective function. Our results indicate that Genetic Algorithms are very capable of handling the subset selection problem for multi-experiment data sets.
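To make the idea of size-preserving operators concrete, the following minimal Python sketch shows one possible way to keep the subset size fixed during crossover and mutation, together with one possible penalty term for under-represented experiments. It is an illustration under stated assumptions, not the operators or objective function used in this work: the set-of-indices encoding, the function names crossover, mutate and representation_shortfall, and the min_per_experiment parameter are all hypothetical.

```python
import random


def crossover(parent_a, parent_b, rng):
    """Size-preserving crossover (assumed design, not the paper's operator).

    Both parents are sets of sample indices with the same size k. The child
    keeps every index the parents share and fills the remaining slots with a
    random draw from the indices held by exactly one parent.
    """
    k = len(parent_a)
    shared = parent_a & parent_b
    only_one = sorted(parent_a ^ parent_b)  # sorted for reproducibility
    fill = rng.sample(only_one, k - len(shared))
    return shared | set(fill)


def mutate(subset, n_total, rng):
    """Swap one selected index for one unselected index; size is unchanged."""
    outside = [i for i in range(n_total) if i not in subset]
    if not outside:  # every sample is already selected, nothing to swap
        return set(subset)
    removed = rng.choice(sorted(subset))
    added = rng.choice(outside)
    return (subset - {removed}) | {added}


def representation_shortfall(subset, experiment_of, n_experiments,
                             min_per_experiment):
    """Hypothetical penalty: how many samples are missing, summed over all
    experiments that contribute fewer than min_per_experiment samples."""
    counts = [0] * n_experiments
    for i in subset:
        counts[experiment_of[i]] += 1
    return sum(max(0, min_per_experiment - c) for c in counts)
```

In a setting like the one described above, such a penalty could be subtracted from a spread-based score so that subsets neglecting one of the six experiments are ranked lower; the weighting between the two terms would have to be tuned and is not specified here.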