
Experimental Data for the Paper "An Empirical Evaluation of Constrained Feature Selection"

Bach, Jakob 1; Zoller, Kolja 2; Schulz, Katrin 2
1 Institut für Programmstrukturen und Datenorganisation (IPD), Karlsruher Institut für Technologie (KIT)
2 Institut für Angewandte Materialien – Computational Materials Science (IAM-CMS), Karlsruher Institut für Technologie (KIT)

Abstract:

These are the experimental data for the paper

> Bach, Jakob, et al. "An Empirical Evaluation of Constrained Feature Selection"

published in the journal [*SN Computer Science*](https://www.springer.com/journal/42979).
You can find the paper [here](https://doi.org/10.1007/s42979-022-01338-z) and the code [here](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection).
See the `README` for details.

Some of the datasets used in our study (which we also provide here) originate from [OpenML](https://www.openml.org) and are CC-BY-licensed.
Please see the paragraph `Licensing` in the `README` for details, e.g., on the authors of these datasets.


Affiliated KIT institution(s): Institut für Angewandte Materialien – Computational Materials Science (IAM-CMS)
Institut für Programmstrukturen und Datenorganisation (IPD)
Publication type: Research data
Publication date: 26.07.2022
Creation date: 27.03.2021
Identifier: DOI: 10.5445/IR/1000148891
KITopen ID: 1000148891
License: Creative Commons Attribution 4.0 International
Keywords: Feature selection, Constraints, Domain knowledge, Theory-guided data science
Readme

Experimental Data for the Paper "An Empirical Evaluation of Constrained Feature Selection"

These are the experimental data for the paper

> Bach, Jakob, et al. "An Empirical Evaluation of Constrained Feature Selection"

accepted by the journal SN Computer Science.

Check our GitHub repository for the code and instructions to reproduce the experiments.
The data were obtained on a server with an AMD EPYC 7551 CPU (32 physical cores, base clock of 2.0 GHz) and 128 GB RAM.
The Python version was 3.8.
Our paper contains two studies, and we provide data for both of them.

Running the experimental pipeline for the study with synthetic constraints (syn_pipeline.py) took several hours.
The commit hash for the last run of this pipeline is acc34cf5d2.
The commit hash for the last run of the corresponding evaluation (syn_evaluation.py) is c1a7e7e99e.

Running the experimental pipeline for the case study in materials science (ms_pipeline.py) took less than one hour.
The commit hash for the last run of this pipeline is ba30bf9f11.
The commit hash for the last run of the corresponding evaluation (ms_evaluation.py) is c1a7e7e99e.

All these commits are also tagged.

In the following, we describe the structure/content of each data file.
All files are plain CSVs, so you can read them with pandas.read_csv().
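As a minimal sketch, reading one of these CSVs with pandas looks like the following; a small in-memory sample with made-up columns stands in for the shipped files here.

```python
import io

import pandas as pd

# Stand-in for one of the data files; with the real data, you would pass a
# path such as "ms-results/results.csv" instead of the StringIO object.
# The columns here are hypothetical, just to illustrate the call.
sample = io.StringIO(
    "num_selected,split_idx,dataset_name\n"
    "5,0,voxel_data\n"
    "10,1,voxel_data\n"
)
results = pd.read_csv(sample)
print(results.shape)  # (2, 3)
```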

ms/

The input data for the case study in materials science (ms_pipeline.py).
Output of the script prepare_ms_dataset.py.
As the raw simulation dataset is quite large, we only provide a pre-processed version of it
(we do not provide the input to prepare_ms_dataset.py).
In this pre-processed version, the feature and target parts of the data are already separated into two files: voxel_data_predict_glissile_X.csv and voxel_data_predict_glissile_y.csv.
In voxel_data_predict_glissile_X.csv, each column is a numeric feature.
voxel_data_predict_glissile_y.csv only contains one column, the numeric prediction target (reaction density of glissile reactions).
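To illustrate the two-file split of features and target, here is a minimal sketch with tiny in-memory stand-ins for the two files (the feature and column names below are made up):

```python
import io

import pandas as pd

# Tiny stand-ins for the real files; replace the StringIO objects with the
# paths "ms/voxel_data_predict_glissile_X.csv" and "..._y.csv".
X = pd.read_csv(io.StringIO("feature_1,feature_2\n0.1,0.2\n0.3,0.4\n"))
y = pd.read_csv(io.StringIO("glissile\n1.5\n2.5\n")).squeeze("columns")
assert len(X) == len(y)  # rows of X and y are aligned by position
```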

ms-results/

Only contains one result file (results.csv) for the case study in materials science.
Output of the script ms_pipeline.py, input to the script ms_evaluation.py.
The columns of the file mostly correspond to evaluation metrics used in the paper;
see Appendix A.1 for definitions of the latter.

  • objective_value (float): Objective Q(s, X, y), the sum of the qualities of the selected features.
  • num_selected (int): n_{se}, the number of selected features.
  • selected (string, but actually a list of strings): Names of the selected features.
  • num_variables (int): n, the total number of features in the dataset.
  • num_constrained_variables (int): n_{cf}, the number of features involved in constraints.
  • num_unique_constrained_variables (int): n_{ucf}, the number of unique features involved in constraints.
  • num_constraints (int): n_{co}, the number of constraints.
  • frac_solutions (float): n_{so}^{norm}, the number of valid (regarding constraints) feature sets relative to the total number of feature sets.
  • linear-regression_train_r2 (float): R^2 (coefficient of determination) for linear-regression models, trained with the selected features, predicting on the training set.
  • linear-regression_test_r2 (float): R^2 (coefficient of determination) for linear-regression models, trained with the selected features, predicting on the test set.
  • regression-tree_train_r2 (float): R^2 (coefficient of determination) for regression-tree models, trained with the selected features, predicting on the training set.
  • regression-tree_test_r2 (float): R^2 (coefficient of determination) for regression-tree models, trained with the selected features, predicting on the test set.
  • xgb-linear_train_r2 (float): R^2 (coefficient of determination) for linear XGBoost models, trained with the selected features, predicting on the training set.
  • xgb-linear_test_r2 (float): R^2 (coefficient of determination) for linear XGBoost models, trained with the selected features, predicting on the test set.
  • xgb-tree_train_r2 (float): R^2 (coefficient of determination) for tree-based XGBoost models, trained with the selected features, predicting on the training set.
  • xgb-tree_test_r2 (float): R^2 (coefficient of determination) for tree-based XGBoost models, trained with the selected features, predicting on the test set.
  • evaluation_time (float): Runtime (in s) for evaluating one set of constraints.
  • split_idx (int): Index of the cross-validation fold.
  • quality_name (string): Measure for feature quality (absolute correlation or mutual information).
  • constraint_name (string): Name of the constraint type (see paper).
  • dataset_name (string): Name of the dataset.

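A typical evaluation step over such a result file is grouping by one of the categorical columns; the sketch below aggregates test performance per constraint type on a hypothetical mini-version of results.csv (constraint names and values are made up for illustration).

```python
import io

import pandas as pd

# Hypothetical mini-version of ms-results/results.csv with a subset of the
# columns described above; constraint names are invented for illustration.
sample = io.StringIO(
    "constraint_name,num_selected,xgb-tree_test_r2\n"
    "UNCONSTRAINED,10,0.5\n"
    "UNCONSTRAINED,10,0.75\n"
    "AT-MOST-5,5,0.25\n"
)
results = pd.read_csv(sample)
# Average test performance per constraint type
mean_r2 = results.groupby("constraint_name")["xgb-tree_test_r2"].mean()
print(mean_r2["UNCONSTRAINED"])  # 0.625
```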
openml/

The input data for the study with synthetic constraints (syn_pipeline.py).
Output of the script prepare_openml_datasets.py.
We downloaded 35 datasets from OpenML and removed non-numeric columns.
Also, we separated the feature part (*_X.csv) and the target part (*_y.csv) of each dataset.
_data_overview.csv contains metadata for the datasets, including dataset id, dataset version, and uploader.

Licensing

Please consult each dataset's website on OpenML for licensing information and citation requests.
According to OpenML's terms, OpenML datasets fall under the CC-BY license.
The datasets used in our study were uploaded by:

  • Jan van Rijn (user id: 1)
  • Joaquin Vanschoren (user id: 2)
  • Rafael Gomes Mantovani (user id: 64)
  • Tobias Kuehn (user id: 94)
  • Richard Ooms (user id: 8684)
  • R P (user id: 15317)

See _data_overview.csv to match each dataset to its uploader.
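A lookup of a dataset's uploader from the overview file could look like the sketch below; the column names and the sample row are assumptions, so check the actual header of _data_overview.csv.

```python
import io

import pandas as pd

# With the real data, read "openml/_data_overview.csv" instead of the
# in-memory sample. Column names and values here are assumptions.
overview = pd.read_csv(io.StringIO(
    "dataset_name,data_id,version,uploader\n"
    "some_dataset,123,1,Some Uploader\n"
))
uploader_of = overview.set_index("dataset_name")["uploader"]
print(uploader_of["some_dataset"])  # Some Uploader
```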

openml-results/

Result files for the study with synthetic constraints.
Output of the script syn_pipeline.py, input to the script syn_evaluation.py.
One result file for each combination of the 10 constraint generators and the 35 datasets, plus one overall (merged) file, results.csv.
The columns of the result files are those of ms-results/results.csv, minus selected and evaluation_time;
see above for detailed descriptions.
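Since the per-combination files share one schema, the merged frame can be rebuilt by concatenating them; the helper below is a sketch under that assumption (the folder layout is taken from this description, not verified against the archive).

```python
from pathlib import Path

import pandas as pd


def merge_result_files(folder: str) -> pd.DataFrame:
    """Concatenate all per-combination result CSVs in `folder`, skipping
    the already-merged results.csv. Assumes all files share one schema."""
    parts = [
        pd.read_csv(path)
        for path in sorted(Path(folder).glob("*.csv"))
        if path.name != "results.csv"
    ]
    return pd.concat(parts, ignore_index=True)
```

For example, `merge_result_files("openml-results")` would yield one frame over all constraint-generator/dataset combinations.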

Type of research data: Dataset