Learning Curve Plus Plus (LCPP)
This directory contains implementations of dataset containers, an OpenML connection, data sampling, manipulation, and transformation strategies. It also includes minor helpers for saving and loading that wrap mlpack functions, a Gram class for computing approximate or exact Gram matrices, and a reporting function for the dataset container of LCPP, which prints statistical information about the inputs and labels of the given dataset. Also, there is a wrapper for transforming these dataset containers with mlpack scaling and transformation functionality, for both inputs and labels.
The dataset containers live in the data namespace. The label type is templated, so a container can hold one- or multi-dimensional labels for regression tasks or class labels for classification.
Depending on the label type, different constructors must be used. For regression tasks, only the input dimensionality dim is required, whereas for classification tasks the number of classes num_class must be specified (unless the chosen toy dataset already has a fixed number of classes).
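One way such label-dependent construction could be expressed is sketched below. The class `ToyContainer`, its members, and the use of `std::is_integral` to tell the two cases apart are illustrative assumptions, not the LCPP container itself.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical container: an integral label type means classification,
// a floating-point label type means regression.
template <typename LabelT>
class ToyContainer {
 public:
  static constexpr bool is_classification = std::is_integral<LabelT>::value;

  // Regression: only the input dimensionality is needed.
  explicit ToyContainer(std::size_t dim) : dim_(dim), num_class_(0) {
    static_assert(!std::is_integral<LabelT>::value,
                  "classification containers also need num_class");
  }

  // Classification: the number of classes must be given as well.
  ToyContainer(std::size_t dim, std::size_t num_class)
      : dim_(dim), num_class_(num_class) {
    static_assert(std::is_integral<LabelT>::value,
                  "num_class only applies to classification labels");
    assert(num_class > 0);
  }

  std::size_t Dim() const { return dim_; }
  std::size_t NumClass() const { return num_class_; }

 private:
  std::size_t dim_;
  std::size_t num_class_;
};
```

In this sketch, `ToyContainer<double> reg(3);` and `ToyContainer<std::size_t> cls(2, 4);` both compile, while mixing the constructor and the label type is rejected at compile time.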
Several built-in functions are available for generating toy datasets with a specified number of samples. For example, the Linear method creates a dataset with a linear relationship, and the level of Gaussian noise added to the labels can be controlled. Similarly, the Sine method generates a sinusoidal relationship with configurable noise. For classification tasks, the Banana dataset provides two separable banana-shaped clusters, Dipping corresponds to a special dataset introduced in [1], and Gaussian generates spherical Gaussian blobs by specifying the means of the clusters.
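To make the idea behind the Linear generator concrete, here is a minimal standalone sketch of a noisy linear toy dataset. The struct `ToyRegression`, the function `MakeLinear`, its parameters, and the input range are assumptions made for illustration; the actual LCPP method may differ in signature and sampling details.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-in for a one-dimensional regression dataset.
struct ToyRegression {
  std::vector<double> inputs;
  std::vector<double> labels;
};

// Draws n inputs uniformly from [-1, 1] and sets
// label = slope * x + intercept + noise_std * eps, with eps ~ N(0, 1).
inline ToyRegression MakeLinear(std::size_t n, double noise_std,
                                double slope = 1.0, double intercept = 0.0,
                                unsigned seed = 42) {
  assert(noise_std >= 0.0);
  std::mt19937 rng(seed);
  std::uniform_real_distribution<double> x_dist(-1.0, 1.0);
  std::normal_distribution<double> eps(0.0, 1.0);

  ToyRegression d;
  d.inputs.reserve(n);
  d.labels.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    const double x = x_dist(rng);
    d.inputs.push_back(x);
    // Scaling a standard normal keeps noise_std == 0 well defined.
    d.labels.push_back(slope * x + intercept + noise_std * eps(rng));
  }
  return d;
}
```

Setting `noise_std` to zero yields labels that lie exactly on the line, which is a convenient sanity check before adding noise.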
Alternatively, users can define their own datasets and place them into this container, either at initialization or through the Update method.
The data::oml namespace contains another dataset container, which retrieves its data through the REST API of the OpenML database. Only the OpenML ID of the dataset is needed. The metadata is fetched first into a directory of your choice (a default location is used if none is given), and then the dataset itself is downloaded. The download happens only once: if the file already exists it is simply loaded from disk, until the specified directory is deleted or changed. In this case the target is always one-dimensional, and the problem type is determined by the label type: size_t for classification, double or float for regression.
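The download-once behaviour can be sketched with the standard library alone. In the sketch below, `FetchOnce`, the file naming scheme, and the `download` callback (which stands in for the actual OpenML request) are all hypothetical; they are not part of the LCPP or OpenML APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <functional>
#include <iterator>
#include <string>

namespace fs = std::filesystem;

// Returns the content for dataset `id`. The `download` callback is invoked
// only when no cached copy exists under `cache_dir`.
inline std::string FetchOnce(
    std::size_t id, const fs::path& cache_dir,
    const std::function<std::string(std::size_t)>& download) {
  fs::create_directories(cache_dir);
  // Hypothetical naming scheme: one file per dataset ID.
  const fs::path file = cache_dir / (std::to_string(id) + ".data");

  if (!fs::exists(file)) {
    std::ofstream out(file, std::ios::binary);
    out << download(id);
  }  // the output stream is flushed and closed here

  std::ifstream in(file, std::ios::binary);
  assert(in.good());
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}
```

Calling `FetchOnce` twice with the same ID and directory triggers the callback only on the first call; deleting or changing the directory forces a fresh download.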
Three sampling strategies are provided here: Random Selection, Bootstrap, and Additive Sampling. Each generates training and testing splits for machine learning experiments in a slightly different way.
All three methods share similar parameters:
(train, test) splits generated by the procedure. Note that all sampling strategies work with indices rather than with copies of the data splits, for efficiency. These structures can be used as template parameters of the LCurve class in src, so that the splits used to generate a learning curve are reproducible.
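The index-based approach can be illustrated with a small standalone sketch. The function names, parameters, and seeding scheme below are assumptions for illustration only, and using the out-of-bag samples as the bootstrap test set is one common convention rather than a confirmed detail of LCPP. Additive Sampling is not sketched here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

using Indices = std::vector<std::size_t>;
using Split = std::pair<Indices, Indices>;  // (train, test)

// Random selection: train and test are disjoint, drawn without replacement.
inline std::vector<Split> RandomSelect(std::size_t n, std::size_t n_train,
                                       std::size_t repeat, unsigned seed) {
  assert(n_train <= n);
  std::mt19937 rng(seed);
  Indices all(n);
  std::iota(all.begin(), all.end(), std::size_t{0});
  const auto cut = static_cast<std::ptrdiff_t>(n_train);
  std::vector<Split> splits;
  splits.reserve(repeat);
  for (std::size_t r = 0; r < repeat; ++r) {
    std::shuffle(all.begin(), all.end(), rng);
    splits.emplace_back(Indices(all.begin(), all.begin() + cut),
                        Indices(all.begin() + cut, all.end()));
  }
  return splits;
}

// Bootstrap: train is drawn with replacement; test holds the indices that
// were never drawn (out-of-bag), which is an assumption of this sketch.
inline std::vector<Split> Bootstrap(std::size_t n, std::size_t n_train,
                                    std::size_t repeat, unsigned seed) {
  assert(n > 0);
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> pick(0, n - 1);
  std::vector<Split> splits;
  splits.reserve(repeat);
  for (std::size_t r = 0; r < repeat; ++r) {
    Indices train(n_train);
    std::vector<bool> used(n, false);
    for (auto& i : train) {
      i = pick(rng);
      used[i] = true;
    }
    Indices test;
    for (std::size_t i = 0; i < n; ++i) {
      if (!used[i]) test.push_back(i);
    }
    splits.emplace_back(std::move(train), std::move(test));
  }
  return splits;
}
```

Because only indices are stored, the data is never copied, and a fixed seed reproduces the same sequence of splits on a given standard library implementation.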
This module provides tools for dataset manipulation that extend the standard functionality of mlpack. While mlpack includes a dataset splitter, its default interface only supports splitting by ratio (e.g., 80/20). For tasks such as constructing learning curves, it is often necessary to control the exact number of training samples at each step. To support this, we reimplemented splitting functions with the same signatures as mlpack but with the additional flexibility of specifying the exact training set size.
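The difference from a ratio-based split can be seen in a small sketch that takes the training set size directly. `SplitExact` and its parameters are hypothetical and operate on `std::vector` for self-containment; the LCPP functions follow the mlpack signatures and work on mlpack/Armadillo types instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Shuffles the samples and places exactly `train_size` of them in `train`;
// the remaining samples go to `test`.
template <typename T>
void SplitExact(const std::vector<T>& data, std::vector<T>& train,
                std::vector<T>& test, std::size_t train_size,
                unsigned seed = 0) {
  assert(train_size <= data.size());
  std::vector<std::size_t> order(data.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::mt19937 rng(seed);
  std::shuffle(order.begin(), order.end(), rng);

  train.clear();
  test.clear();
  train.reserve(train_size);
  test.reserve(data.size() - train_size);
  for (std::size_t k = 0; k < order.size(); ++k) {
    (k < train_size ? train : test).push_back(data[order[k]]);
  }
}
```

For a learning curve, such a function would be called once per training size (for example 10, 20, 50 samples), which avoids the rounding that comes with converting each size into a ratio.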
The Collect class provides a convenient way to gather datasets from OpenML. It can be initialized using a study ID to collect all datasets within that study, or by providing a list of dataset IDs to fetch specific datasets. Optionally, a local path can be specified to save the datasets and their metadata in a structured folder. Once the collection is created, datasets can be accessed sequentially using a function that retrieves the next dataset, or individually by requesting a dataset with a specific ID. The class also provides utility functions to check the total number of datasets in the collection, track the current position when iterating through the collection, and retrieve all the available dataset IDs.
...
// Collect datasets from an OpenML study with ID 1234
data::oml::Collect collector1(1234);

// Collect specific datasets by their OpenML IDs
arma::Row<size_t> dataset_ids = {10, 20, 30};
data::oml::Collect collector2(dataset_ids);

// Collect datasets from a study and save them locally
std::filesystem::path save_path = "local_data";
data::oml::Collect collector3(1234, save_path);

// Retrieve datasets from the collection
auto dataset1 = collector1.GetNext();   // Get the next dataset in the collection
auto dataset2 = collector2.GetID(20);   // Get a dataset by its ID

// Check collection info
size_t total = collector1.GetSize();       // Total number of datasets
size_t counter = collector1.GetCounter();  // Current position in iteration
auto keys = collector1.GetKeys();          // All available dataset IDs
...
[1] Loog, M., & Duin, R. P. W. (2012). The dipping phenomenon.