data

This directory contains the dataset containers, an OpenML connection, and strategies for sampling, manipulating, and transforming data. It also includes small helpers for saving and loading that wrap mlpack functions, a Gram class for computing approximate or exact Gram matrices, and a reporting function that prints statistical information about the inputs and labels of a given LCPP dataset container. Finally, a wrapper applies mlpack's scaling and transformation functionality to both the inputs and the labels of these containers.

2. dataset

The dataset containers live in the data namespace. The label type is a template parameter, so a container can hold one-dimensional or multi-dimensional labels for regression tasks, or class labels for classification. An example is given below.

...
    data::Dataset<arma::Row<double>> regdataset(2); // Creates a 2D regression dataset
    data::Dataset<arma::Mat<double>> regdataset2(3); // Creates a 3D regression dataset with multi-dimensional output
    data::Dataset<arma::Row<size_t>> classdataset(2,3); // Creates a 2D 3-class classification dataset
...

Depending on the label type, different constructors must be used. For regression tasks, only the input dimensionality dim is required, whereas for classification tasks the number of classes num_class must be specified (unless the chosen toy dataset already has a fixed number of classes).

Several built-in functions are available for generating toy datasets with a specified number of samples. For example, the Linear method creates a dataset with a linear relationship, and the level of Gaussian noise added to the labels can be controlled. Similarly, the Sine method generates a sinusoidal relationship with configurable noise. For classification tasks, the Banana dataset provides two separable banana-shaped clusters, Dipping corresponds to a special dataset introduced in [1], and Gaussian generates spherical Gaussian blobs by specifying the means of the clusters.
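To make the idea of a noisy linear toy dataset concrete, here is a minimal standard-library sketch. It is not the LCPP implementation: the `ToyRegression` struct, the `MakeLinear` function, the input range, and all parameter names are assumptions chosen for illustration only.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-in for a 1D regression toy dataset (not the LCPP container).
struct ToyRegression {
    std::vector<double> inputs;
    std::vector<double> labels;
};

// Draws n inputs uniformly from [-1, 1] (an assumed range) and sets
// label = slope * input + noise_std * e, with e drawn from N(0, 1).
ToyRegression MakeLinear(std::size_t n, double slope, double noise_std,
                         unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> input_dist(-1.0, 1.0);
    std::normal_distribution<double> unit_noise(0.0, 1.0);

    ToyRegression data;
    data.inputs.reserve(n);
    data.labels.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double x = input_dist(rng);
        data.inputs.push_back(x);
        // Scaling a unit normal keeps noise_std == 0 valid (noise-free labels).
        data.labels.push_back(slope * x + noise_std * unit_noise(rng));
    }
    return data;
}
```

Calling `MakeLinear(100, 2.0, 0.1, 42)` would give 100 samples around the line y = 2x; a sinusoidal variant would only change the expression that computes the label.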

Alternatively, users can define their own datasets and place them into this container, either at initialization or through the Update method.

In the data::oml namespace there is another dataset container, which retrieves its data through the REST API of the OpenML database. Only the OpenML id of the dataset is needed. First the metadata is fetched to a directory of your choice (a default location is used if none is given), then the dataset itself is fetched. Downloading happens only once: if the file is already present it is simply loaded, until the specified directory is deleted or changed. In this container the target is always one-dimensional, and the problem type is determined by the label type: size_t for classification, double or float for regression. An example is given below.

...
    data::oml::Dataset<double> regdataset(id); // Pulls the dataset with id as a regression task
    data::oml::Dataset<size_t> classdataset(id); // Pulls the dataset with id as a classification task
...
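
The download-once behaviour described above can be sketched with the standard library alone. This is only an illustration of the caching pattern, not the LCPP code: the function `LoadOrFetch`, the `<id>.arff` file name, and the `download` callback (a stand-in for the real OpenML request) are all hypothetical.

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <functional>
#include <sstream>
#include <string>

namespace fs = std::filesystem;

// Returns the contents of cache_dir/<id>.arff. The `download` callback is
// invoked only when that file does not exist yet; afterwards the cached copy
// is read from disk. File name and callback are assumptions for this sketch.
std::string LoadOrFetch(std::size_t id, const fs::path& cache_dir,
                        const std::function<std::string(std::size_t)>& download) {
    const fs::path file = cache_dir / (std::to_string(id) + ".arff");
    if (!fs::exists(file)) {
        fs::create_directories(cache_dir);
        std::ofstream out(file, std::ios::binary);
        out << download(id);
    }  // the output stream is flushed and closed at the end of the if-block

    std::ifstream in(file, std::ios::binary);
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}
```

Deleting or changing the cache directory makes `fs::exists` return false again, which triggers a fresh download on the next call.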

3. sample

This module contains three sampling strategies: Random Selection, Bootstrap, and Additive Sampling. Each generates training and testing splits for machine learning experiments in a slightly different way.

Random Selection chooses a subset of the dataset without replacement. That means once an element is picked, it cannot be chosen again in the same split. This procedure can be repeated multiple times, and with different subset sizes, to produce a collection of splits.
Bootstrap sampling chooses a subset of the dataset with replacement. This means the same element can be selected multiple times in the subset, while some elements may not be chosen at all.
Additive Sampling builds subsets in a cumulative manner. Instead of drawing independent subsets each time, it gradually enlarges the training set. The first subset is created by selecting.

All three methods share similar parameters:

Dataset size: Total number of elements in the dataset.
N (or Ns): The number(s) of elements to be selected. Can be a single value or a sequence of sizes.
Repeat count: How many times the splitting or sampling procedure should be repeated.
Random seed: Used to ensure reproducibility of results.
Output collection: A container that stores all the (train, test) splits generated by the procedure.
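
The shared parameters listed above map naturally onto a function signature. The sketch below shows Random Selection and Bootstrap producing index-based (train, test) pairs. The names `RandomSelect`, `Bootstrap`, and `TrainTest` are hypothetical, and using the never-drawn indices as the Bootstrap test set is an assumption of this sketch, not something the library is known to do.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

using Indices = std::vector<std::size_t>;
using TrainTest = std::pair<Indices, Indices>;  // (train indices, test indices)

// Random selection: n distinct training indices; the remaining ones are the test set.
void RandomSelect(std::size_t size, const std::vector<std::size_t>& ns,
                  std::size_t repeat, unsigned seed, std::vector<TrainTest>& out) {
    std::mt19937 rng(seed);
    Indices all(size);
    std::iota(all.begin(), all.end(), std::size_t{0});
    for (std::size_t n : ns) {
        const auto k = static_cast<std::ptrdiff_t>(std::min(n, size));
        for (std::size_t r = 0; r < repeat; ++r) {
            std::shuffle(all.begin(), all.end(), rng);
            out.emplace_back(Indices(all.begin(), all.begin() + k),
                             Indices(all.begin() + k, all.end()));
        }
    }
}

// Bootstrap: n draws with replacement; indices never drawn form the test set
// (an assumption made for this sketch).
void Bootstrap(std::size_t size, const std::vector<std::size_t>& ns,
               std::size_t repeat, unsigned seed, std::vector<TrainTest>& out) {
    if (size == 0) return;
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, size - 1);
    for (std::size_t n : ns) {
        for (std::size_t r = 0; r < repeat; ++r) {
            Indices train(n);
            std::vector<bool> drawn(size, false);
            for (std::size_t& idx : train) {
                idx = pick(rng);
                drawn[idx] = true;
            }
            Indices test;
            for (std::size_t i = 0; i < size; ++i) {
                if (!drawn[i]) test.push_back(i);
            }
            out.emplace_back(std::move(train), std::move(test));
        }
    }
}
```

With sizes `{3, 5}` and a repeat count of 2, either function appends four splits to the output collection, and the same seed reproduces the same splits.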

Note that all sampling strategies should operate on indices rather than on the data splits themselves, for efficiency. These structures can be passed as template parameters to the LCurve class in src, so that the splits used for learning curve generation are reproducible.
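
As an illustration of cumulative, index-based splitting, the sketch below shuffles the indices once and uses growing prefixes of that permutation as training sets, so every training set contains the previous one. This is one plausible reading of Additive Sampling, not the library's confirmed procedure; `AdditiveSplits` and its parameters are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

using IndexSet = std::vector<std::size_t>;
using IndexSplit = std::pair<IndexSet, IndexSet>;  // (train indices, test indices)

// Builds one split per requested size. Sizes are processed in ascending
// order, and each training set is a prefix of the same shuffled permutation,
// so smaller training sets are nested inside larger ones.
std::vector<IndexSplit> AdditiveSplits(std::size_t size,
                                       std::vector<std::size_t> ns,
                                       unsigned seed) {
    std::sort(ns.begin(), ns.end());

    IndexSet order(size);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);

    std::vector<IndexSplit> out;
    for (std::size_t n : ns) {
        const auto cut =
            order.begin() + static_cast<std::ptrdiff_t>(std::min(n, size));
        out.emplace_back(IndexSet(order.begin(), cut),
                         IndexSet(cut, order.end()));
    }
    return out;
}
```

Because only indices are stored, the underlying data is never copied until a split is actually materialized.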

4. manip

This module provides tools for dataset manipulation that extend the standard functionality of mlpack. While mlpack includes a dataset splitter, its default interface only supports splitting by ratio (e.g., 80/20). For tasks such as constructing learning curves, it is often necessary to control the exact number of training samples at each step. To support this, we reimplemented splitting functions with the same signatures as mlpack but with the additional flexibility of specifying the exact training set size.

Set Difference (<tt>SetDiff</tt>)

Computes the difference between two sorted vectors.
Useful for identifying elements present in one dataset but not in another.
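
The same operation on sorted index vectors can be sketched with `std::set_difference` from the standard library. The wrapper name `SortedDifference` is hypothetical, and the actual `SetDiff` may work on Armadillo vectors rather than `std::vector`.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Returns the elements of `a` that do not appear in `b`.
// Both inputs must already be sorted in ascending order.
std::vector<std::size_t> SortedDifference(const std::vector<std::size_t>& a,
                                          const std::vector<std::size_t>& b) {
    std::vector<std::size_t> result;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(result));
    return result;
}
```

A typical use is recovering test indices: the difference between all indices and the sorted training indices.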

Migration (<tt>Migrate</tt>)

Transfers a specified number of samples from the test set to the training set.
Supports both raw Armadillo matrices and higher-level dataset containers.
Useful for incremental or curriculum learning scenarios, where the training set is gradually expanded.
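
A minimal sketch of the migration idea on plain vectors follows. The `Sample` struct and `MigrateSketch` function are hypothetical, and moving the first n test samples is an assumption; the library's `Migrate` may choose the samples differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sample type pairing one input with its label.
struct Sample {
    double input;
    double label;
};

// Moves up to n samples from the front of `test` to the back of `train`.
// Taking them from the front is an assumption made for this sketch.
void MigrateSketch(std::vector<Sample>& train, std::vector<Sample>& test,
                   std::size_t n) {
    const std::size_t k = std::min(n, test.size());
    const auto cut = test.begin() + static_cast<std::ptrdiff_t>(k);
    train.insert(train.end(), test.begin(), cut);
    test.erase(test.begin(), cut);
}
```

Calling this repeatedly grows the training set step by step while the test set shrinks by the same amount.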

Splitting (<tt>Split</tt>)

Provides dataset splitting by exact number of training samples, not just by ratio.
Essential for generating learning curves where fine-grained control of training set size is required.
Works with both inputs and labels, ensuring alignment.
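
Splitting by an exact training size while keeping inputs and labels aligned can be sketched as follows. This is not the LCPP or mlpack signature: `SplitByCount`, `SplitResult`, and the use of `std::vector` instead of Armadillo types are assumptions for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical result type holding aligned inputs and labels for both parts.
struct SplitResult {
    std::vector<double> train_x, test_x;
    std::vector<int> train_y, test_y;
};

// Shuffles the sample indices once, then sends the first train_size samples
// to the training part and the rest to the test part. Inputs and labels are
// moved through the same index, so they stay aligned.
// Assumes x.size() == y.size().
SplitResult SplitByCount(const std::vector<double>& x, const std::vector<int>& y,
                         std::size_t train_size, unsigned seed) {
    std::vector<std::size_t> order(x.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);

    SplitResult result;
    for (std::size_t pos = 0; pos < order.size(); ++pos) {
        const std::size_t i = order[pos];
        if (pos < train_size) {
            result.train_x.push_back(x[i]);
            result.train_y.push_back(y[i]);
        } else {
            result.test_x.push_back(x[i]);
            result.test_y.push_back(y[i]);
        }
    }
    return result;
}
```

For a learning curve, the same function would be called with an increasing `train_size`, which a ratio-based interface can only approximate.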

Stratified Splitting (<tt>StratifiedSplit</tt>)

Ensures class proportions are preserved in both training and test sets.
Essential for classification tasks with imbalanced data.
Fixes a known limitation in mlpack's stratified splitting that can fail on balanced datasets.
Available for both raw Armadillo data holders and LCPP dataset containers.
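
One way to preserve class proportions while still hitting an exact training size is to give each class a proportional quota and hand out leftover slots by largest remainder. The sketch below illustrates that idea on index lists; it is an assumption about how such a split could be done, not the actual `StratifiedSplit` implementation, and `StratifiedSplitSketch` is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <random>
#include <utility>
#include <vector>

using IdxList = std::vector<std::size_t>;

// Returns (train, test) index lists with exactly min(train_size, N) training
// indices, where each class contributes roughly in proportion to its size.
std::pair<IdxList, IdxList> StratifiedSplitSketch(
    const std::vector<std::size_t>& labels, std::size_t train_size,
    unsigned seed) {
    const std::size_t total = labels.size();
    if (total == 0) return {};
    train_size = std::min(train_size, total);

    // Group sample indices by class label.
    std::map<std::size_t, IdxList> by_class;
    for (std::size_t i = 0; i < total; ++i) by_class[labels[i]].push_back(i);

    std::mt19937 rng(seed);
    std::map<std::size_t, std::size_t> quota;
    std::vector<std::pair<std::size_t, std::size_t>> remainders;  // (remainder, class)
    std::size_t assigned = 0;
    for (auto& entry : by_class) {
        std::shuffle(entry.second.begin(), entry.second.end(), rng);
        const std::size_t scaled = entry.second.size() * train_size;
        quota[entry.first] = scaled / total;  // proportional share, rounded down
        assigned += quota[entry.first];
        remainders.emplace_back(scaled % total, entry.first);
    }

    // Hand out the remaining slots to the classes with the largest remainders.
    std::sort(remainders.rbegin(), remainders.rend());
    for (std::size_t k = 0; assigned < train_size; ++k, ++assigned) {
        ++quota[remainders[k].second];
    }

    IdxList train, test;
    for (const auto& entry : by_class) {
        const IdxList& idx = entry.second;
        const auto take = static_cast<std::ptrdiff_t>(quota[entry.first]);
        train.insert(train.end(), idx.begin(), idx.begin() + take);
        test.insert(test.end(), idx.begin() + take, idx.end());
    }
    return {std::move(train), std::move(test)};
}
```

With 8 samples of class 0, 4 samples of class 1, and a training size of 6, the quotas come out as 4 and 2, matching the 2:1 class ratio of the full dataset.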

5. collect

The Collect class provides a convenient way to gather datasets from OpenML. It can be initialized using a study ID to collect all datasets within that study, or by providing a list of dataset IDs to fetch specific datasets. Optionally, a local path can be specified to save the datasets and their metadata in a structured folder. Once the collection is created, datasets can be accessed sequentially using a function that retrieves the next dataset, or individually by requesting a dataset with a specific ID. The class also provides utility functions to check the total number of datasets in the collection, track the current position when iterating through the collection, and retrieve all the available dataset IDs.

@code{cpp}

...
// Collect datasets from an OpenML study with ID 1234
data::oml::Collect collector1(1234);

// Collect specific datasets by their OpenML IDs
arma::Row<size_t> dataset_ids = {10, 20, 30};
data::oml::Collect collector2(dataset_ids);

// Collect datasets from a study and save them locally
std::filesystem::path save_path = "local_data";
data::oml::Collect collector3(1234, save_path);

// Retrieve datasets from the collection
auto dataset1 = collector1.GetNext(); // Get the next dataset in the collection
auto dataset2 = collector2.GetID(20); // Get a dataset by its ID

// Check collection info
size_t total = collector1.GetSize(); // Total number of datasets
size_t counter = collector1.GetCounter(); // Current position in iteration
auto keys = collector1.GetKeys(); // All available dataset IDs
...

@endcode

[1] Loog, M., & Duin, R. P. W. (2012). The dipping phenomenon.