
Experimental Data for the Paper "An Empirical Evaluation of Constrained Feature Selection"

Bach, Jakob 1; Zoller, Kolja 2; Schulz, Katrin 2
1 Institut für Programmstrukturen und Datenorganisation (IPD), Karlsruher Institut für Technologie (KIT)
2 Institut für Angewandte Materialien – Computational Materials Science (IAM-CMS), Karlsruher Institut für Technologie (KIT)

Abstract:

These are the experimental data for the paper

> Bach, Jakob, et al. "An Empirical Evaluation of Constrained Feature Selection"

published in the journal [*SN Computer Science*](https://www.springer.com/journal/42979).
You can find the paper [here](https://doi.org/10.1007/s42979-022-01338-z) and the code [here](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection).
See the `README` for details.

Some of the datasets used in our study (which we also provide here) originate from [OpenML](https://www.openml.org) and are CC-BY-licensed.
Please see the paragraph `Licensing` in the `README` for details, e.g., on the authors of these datasets.


Affiliated KIT institution(s): Institut für Angewandte Materialien – Computational Materials Science (IAM-CMS)
Institut für Programmstrukturen und Datenorganisation (IPD)
Publication type: Research data
Publication date: 26.07.2022
Creation date: 27.03.2021
Identifier: DOI: 10.5445/IR/1000148891
KITopen ID: 1000148891
License: Creative Commons Attribution 4.0 International
Keywords: Feature selection, Constraints, Domain knowledge, Theory-guided data science
Readme

Experimental Data for the Paper "An Empirical Evaluation of Constrained Feature Selection"

These are the experimental data for the paper

> Bach, Jakob, et al. "An Empirical Evaluation of Constrained Feature Selection"

accepted by the journal SN Computer Science.

Check our GitHub repository for the code and instructions to reproduce the experiments.
The data were obtained on a server with an AMD EPYC 7551 CPU (32 physical cores, base clock of 2.0 GHz) and 128 GB RAM.
The Python version was 3.8.
Our paper contains two studies, and we provide data for both of them.

Running the experimental pipeline for the study with synthetic constraints (syn_pipeline.py) took several hours.
The commit hash for the last run of this pipeline is acc34cf5d2.
The commit hash for the last run of the corresponding evaluation (syn_evaluation.py) is c1a7e7e99e.

Running the experimental pipeline for the case study in materials science (ms_pipeline.py) took less than one hour.
The commit hash for the last run of this pipeline is ba30bf9f11.
The commit hash for the last run of the corresponding evaluation (ms_evaluation.py) is c1a7e7e99e.

All these commits are also tagged.

In the following, we describe the structure/content of each data file.
All files are plain CSVs, so you can read them with pandas.read_csv().
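As a minimal sketch, reading one of these CSVs with pandas looks like the following; a small in-memory sample with made-up columns stands in for the shipped files here.

```python
import io

import pandas as pd

# Stand-in for one of the data files; with the real data, you would pass a
# path such as "ms-results/results.csv" instead of the StringIO object.
# The columns here are hypothetical, just to illustrate the call.
sample = io.StringIO(
    "num_selected,split_idx,dataset_name\n"
    "5,0,voxel_data\n"
    "10,1,voxel_data\n"
)
results = pd.read_csv(sample)
print(results.shape)  # (2, 3)
```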

ms/

The input data for the case study in materials science (ms_pipeline.py).
Output of the script prepare_ms_dataset.py.
As the raw simulation dataset is quite large, we only provide a pre-processed version of it
(we do not provide the input to prepare_ms_dataset.py).
In this pre-processed version, the feature and target parts of the data are already separated into two files: voxel_data_predict_glissile_X.csv and voxel_data_predict_glissile_y.csv.
In voxel_data_predict_glissile_X.csv, each column is a numeric feature.
voxel_data_predict_glissile_y.csv only contains one column, the numeric prediction target (reaction density of glissile reactions).
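To illustrate the two-file split of features and target, here is a minimal sketch with tiny in-memory stand-ins for the two files (the feature and column names below are made up):

```python
import io

import pandas as pd

# Tiny stand-ins for the real files; replace the StringIO objects with the
# paths "ms/voxel_data_predict_glissile_X.csv" and "..._y.csv".
X = pd.read_csv(io.StringIO("feature_1,feature_2\n0.1,0.2\n0.3,0.4\n"))
y = pd.read_csv(io.StringIO("glissile\n1.5\n2.5\n")).squeeze("columns")
assert len(X) == len(y)  # rows of X and y are aligned by position
```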

ms-results/

Only contains one result file (results.csv) for the case study in materials science.
Output of the script ms_pipeline.py, input to the script ms_evaluation.py.
The columns of the file mostly correspond to evaluation metrics used in the paper;
see Appendix A.1 for definitions of the latter.

  • objective_value (float): Objective Q(s, X, y), the sum of the qualities of the selected features.
  • num_selected (int): n_{se}, the number of selected features.
  • selected (string, but actually a list of strings): Names of the selected features.
  • num_variables (int): n, the total number of features in the dataset.
  • num_constrained_variables (int): n_{cf}, the number of features involved in constraints.
  • num_unique_constrained_variables (int): n_{ucf}, the number of unique features involved in constraints.
  • num_constraints (int): n_{co}, the number of constraints.
  • frac_solutions (float): n_{so}^{norm}, the number of valid (regarding constraints) feature sets relative to the total number of feature sets.
  • linear-regression_train_r2 (float): R^2 (coefficient of determination) for linear-regression models, trained with the selected features, predicting on the training set.
  • linear-regression_test_r2 (float): R^2 (coefficient of determination) for linear-regression models, trained with the selected features, predicting on the test set.
  • regression-tree_train_r2 (float): R^2 (coefficient of determination) for regression-tree models, trained with the selected features, predicting on the training set.
  • regression-tree_test_r2 (float): R^2 (coefficient of determination) for regression-tree models, trained with the selected features, predicting on the test set.
  • xgb-linear_train_r2 (float): R^2 (coefficient of determination) for linear XGBoost models, trained with the selected features, predicting on the training set.
  • xgb-linear_test_r2 (float): R^2 (coefficient of determination) for linear XGBoost models, trained with the selected features, predicting on the test set.
  • xgb-tree_train_r2 (float): R^2 (coefficient of determination) for tree-based XGBoost models, trained with the selected features, predicting on the training set.
  • xgb-tree_test_r2 (float): R^2 (coefficient of determination) for tree-based XGBoost models, trained with the selected features, predicting on the test set.
  • evaluation_time (float): Runtime (in s) for evaluating one set of constraints.
  • split_idx (int): Index of the cross-validation fold.
  • quality_name (string): Measure for feature quality (absolute correlation or mutual information).
  • constraint_name (string): Name of the constraint type (see paper).
  • dataset_name (string): Name of the dataset.

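A typical evaluation step over such a result file is grouping by one of the categorical columns; the sketch below aggregates test performance per constraint type on a hypothetical mini-version of results.csv (constraint names and values are made up for illustration).

```python
import io

import pandas as pd

# Hypothetical mini-version of ms-results/results.csv with a subset of the
# columns described above; constraint names are invented for illustration.
sample = io.StringIO(
    "constraint_name,num_selected,xgb-tree_test_r2\n"
    "UNCONSTRAINED,10,0.5\n"
    "UNCONSTRAINED,10,0.75\n"
    "AT-MOST-5,5,0.25\n"
)
results = pd.read_csv(sample)
# Average test performance per constraint type
mean_r2 = results.groupby("constraint_name")["xgb-tree_test_r2"].mean()
print(mean_r2["UNCONSTRAINED"])  # 0.625
```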
openml/

The input data for the study with synthetic constraints (syn_pipeline.py).
Output of the script prepare_openml_datasets.py.
We downloaded 35 datasets from OpenML and removed non-numeric columns.
Also, we separated the feature part (*_X.csv) and the target part (*_y.csv) of each dataset.
_data_overview.csv contains metadata for the datasets, including dataset id, dataset version, and uploader.

Licensing

Please consult each dataset's website on OpenML for licensing information and citation requests.
According to OpenML's terms, OpenML datasets fall under the CC-BY license.
The datasets used in our study were uploaded by:

  • Jan van Rijn (user id: 1)
  • Joaquin Vanschoren (user id: 2)
  • Rafael Gomes Mantovani (user id: 64)
  • Tobias Kuehn (user id: 94)
  • Richard Ooms (user id: 8684)
  • R P (user id: 15317)

See _data_overview.csv to match each dataset to its uploader.
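A lookup of a dataset's uploader from the overview file could look like the sketch below; the column names and the sample row are assumptions, so check the actual header of _data_overview.csv.

```python
import io

import pandas as pd

# With the real data, read "openml/_data_overview.csv" instead of the
# in-memory sample. Column names and values here are assumptions.
overview = pd.read_csv(io.StringIO(
    "dataset_name,data_id,version,uploader\n"
    "some_dataset,123,1,Some Uploader\n"
))
uploader_of = overview.set_index("dataset_name")["uploader"]
print(uploader_of["some_dataset"])  # Some Uploader
```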

openml-results/

Result files for the study with synthetic constraints.
Output of the script syn_pipeline.py, input to the script syn_evaluation.py.
One result file for each combination of the 10 constraint generators and the 35 datasets, plus one overall (merged) file, results.csv.
The columns of the result files are those of ms-results/results.csv, minus selected and evaluation_time;
see above for detailed descriptions.
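Since the per-combination files share one schema, the merged frame can be rebuilt by concatenating them; the helper below is a sketch under that assumption (the folder layout is taken from this description, not verified against the archive).

```python
from pathlib import Path

import pandas as pd


def merge_result_files(folder: str) -> pd.DataFrame:
    """Concatenate all per-combination result CSVs in `folder`, skipping
    the already-merged results.csv. Assumes all files share one schema."""
    parts = [
        pd.read_csv(path)
        for path in sorted(Path(folder).glob("*.csv"))
        if path.name != "results.csv"
    ]
    return pd.concat(parts, ignore_index=True)
```

For example, `merge_result_files("openml-results")` would yield one frame over all constraint-generator/dataset combinations.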

Type of research data: Dataset