A Review of the Structure and Application of Scikit-Learn Datasets in Machine Learning Model Development

Authors

  • Mohammad Mahdi Salehi*, Department of Mathematics and Computer Science, Shiraz Branch, Islamic Azad University, Shiraz, Iran.
  • Hossein Zarei, Department of Mathematics and Computer Science, Shiraz Branch, Islamic Azad University, Shiraz, Iran.

https://doi.org/10.48314/ijorai.v1i2.64

Abstract

Scikit-learn is a widely used Python library that provides a diverse range of datasets alongside robust machine learning algorithms. This paper presents a comprehensive review of the structure and practical applications of key datasets available in Scikit-learn, emphasizing their role in developing and evaluating machine learning models. The datasets are categorized into built-in, fetchable, and synthetic types, each suited for different research and educational purposes. Through Exploratory Data Analysis (EDA), visualization, and baseline modeling on representative datasets such as Iris, Breast Cancer Wisconsin, and California Housing, this study highlights how these datasets facilitate various machine learning tasks, including classification and regression. Insights into dataset characteristics like class distribution, feature separability, and target variability guide the selection and optimization of algorithms. Overall, this review underscores the value of Scikit-learn datasets as foundational resources for prototyping and education, while also advocating for integration with larger, real-world datasets to address complex industrial challenges. 
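
As a rough illustration of the workflow summarized above, the following minimal Python sketch loads one built-in dataset (Iris), generates one synthetic dataset, inspects the class distribution, and fits a simple baseline classifier. It is not taken from the paper: the choice of logistic regression with standardization, the 5-fold cross-validation, the synthetic-data parameters, and the helper name `baseline_accuracy` are all assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris, make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Built-in dataset: ships with the library, so no download is needed.
iris = load_iris()
X, y = iris.data, iris.target

# Simple EDA step: number of samples in each class.
class_counts = np.bincount(y)

# Synthetic dataset: generated on demand; the parameters here are
# illustrative assumptions, not values from the paper.
X_syn, y_syn = make_classification(
    n_samples=200, n_features=10, n_informative=5, random_state=0
)

# Fetchable datasets (for example fetch_california_housing) are downloaded
# on first use and cached locally; none is called here so that the sketch
# runs offline.


def baseline_accuracy(features, labels):
    """Hypothetical helper: mean 5-fold accuracy of a scaled logistic regression."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(model, features, labels, cv=5).mean()


iris_score = baseline_accuracy(X, y)
syn_score = baseline_accuracy(X_syn, y_syn)

print("Iris shape:", X.shape, "class counts:", class_counts.tolist())
print("Baseline accuracy (Iris):", round(iris_score, 3))
print("Baseline accuracy (synthetic):", round(syn_score, 3))
```

A baseline of this kind is only a reference point: it gives a first impression of class balance and feature separability before a more careful model is selected.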

Keywords:

Scikit-learn, Machine learning datasets, Data preprocessing

Published

2025-06-16

How to Cite

Salehi, M. M., & Zarei, H. (2025). A Review of the Structure and Application of Scikit-Learn Datasets in Machine Learning Model Development. International Journal of Operations Research and Artificial Intelligence, 1(2), 90-100. https://doi.org/10.48314/ijorai.v1i2.64
