


Datasets — Open Data for AI4Science
I believe that science moves faster when data is shared.
Every AI4Science project generates real-world data — from fabric spectra to climate indicators — that can help others explore, replicate, or extend my work. These datasets are a record of experiments and ideas across environmental sensing, sustainability, and health.
I curate and release them so that educators, researchers, and students can learn not just from the results, but from the process: how data is collected, cleaned, modeled, and used responsibly.
Textile & Sustainability Datasets
NIRS Fabric Sorting Dataset
Predicting material composition using near-infrared spectra.
A collection of spectral signatures from different textile types used to train AI models for automated sorting and recycling.
Includes: raw spectra, labeled materials, calibration scripts, and test sets.
Use case: sensor calibration, classification algorithms, sustainability research.
Access: Via GitHub here. Educational download under the CC BY-NC 4.0 license.
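As a rough illustration of how labeled spectra like these might feed a sorting classifier, here is a minimal sketch. It assumes each spectrum is a list of absorbance values on a shared wavelength grid, applies standard normal variate (SNV) scaling, and uses a nearest-centroid rule. The toy spectra, the labels, and the function names are all hypothetical; this is not the calibration code shipped with the dataset, and the real file layout should be checked on the dataset page.

```python
import math
from statistics import mean, pstdev

def snv(spectrum):
    """Standard normal variate: center and scale one spectrum."""
    mu = mean(spectrum)
    sigma = pstdev(spectrum)
    if sigma == 0:
        return [0.0 for _ in spectrum]
    return [(v - mu) / sigma for v in spectrum]

def fit_centroids(spectra, labels):
    """Average the SNV-normalized spectra of each material label."""
    grouped = {}
    for spectrum, label in zip(spectra, labels):
        grouped.setdefault(label, []).append(snv(spectrum))
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in grouped.items()
    }

def predict(centroids, spectrum):
    """Return the label whose centroid is closest in Euclidean distance."""
    normalized = snv(spectrum)
    return min(centroids, key=lambda label: math.dist(centroids[label], normalized))

# Synthetic toy spectra (NOT taken from the dataset): 5 wavelength bins each.
train_spectra = [
    [0.10, 0.30, 0.80, 0.30, 0.10],  # hypothetical "cotton": peak in the middle
    [0.12, 0.33, 0.85, 0.31, 0.11],
    [0.80, 0.30, 0.10, 0.30, 0.80],  # hypothetical "polyester": dip in the middle
    [0.78, 0.28, 0.12, 0.33, 0.82],
]
train_labels = ["cotton", "cotton", "polyester", "polyester"]

centroids = fit_centroids(train_spectra, train_labels)

# Same shape as the first cotton sample but twice the intensity;
# SNV removes the scale difference before the distance comparison.
prediction = predict(centroids, [0.20, 0.60, 1.60, 0.60, 0.20])
print(prediction)
```

SNV is used here because it removes per-sample offset and scale differences, which often vary between scans of the same material. A real pipeline would likely need more careful preprocessing and a stronger model.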


Mycoremediation Forecasting Dataset
Modeling fungal growth and pollutant breakdown.
Combines environmental variables (temperature, pH, moisture) with image-based mycelium growth tracking to predict remediation efficiency.
Use case: environmental modeling, time-series forecasting, bio-AI research.
Access: Via GitHub here. Educational download under the CC BY-NC 4.0 license.
Papers using this dataset and describing its creation have been presented at NeurIPS Climate 2024 and IEEE AI-SI 2025. See publications for details.
UV Marked Symbols on New and Washed Fabrics (for Traceability Studies)
This dataset captures ultraviolet-illuminated markings printed on fabric samples, both newly manufactured and washed. Each image reveals high-contrast symbols or patterns, applied with UV inks or treatments, that are otherwise invisible under standard lighting conditions. The dataset supports exploration of material detection, pattern recognition, and surface verification tasks in textile settings.
Contents
- A collection of fabric samples with UV-marked symbols, imaged under controlled illumination
- High-resolution images that showcase contrast between UV symbols and base fabric
- Metadata for each sample: symbol label, fabric type, illumination parameters, exposure settings
Intended Use Cases
- Training computer vision models to detect hidden or low-contrast symbols on materials
- Developing fabric authentication or verification systems (e.g. counterfeit detection)
- Supporting research in computer vision, fluorescence imaging, and textile inspection
- Serving as an educational resource for students tackling real-world sensing challenges
Access the dataset for new fabrics at Kaggle here and the dataset for washed fabrics at Kaggle here.
A paper based on this work was published in ACM COMPASS 2024 (see publications).


Fabric Microscope Images for Classification
This dataset contains high-resolution optical microscope images of various fabric samples. Each image captures the fine-grained textures, weave patterns, and fiber characteristics of a different textile material. The goal is to support research in material classification, textile recycling, sensor calibration, and AI-driven sustainability.
Use Cases
- Training models to classify fabric types based on microtextural features
- Assisting sensor systems (e.g. NIRS, hyperspectral imaging) in validating predictions
- Enabling educational modules in textile science and materials data exploration
- Providing baseline data for research in textile recycling and sustainable materials
Available at Kaggle here.
Released under Kaggle terms (please check the dataset page for details). Free for educational and research use.
This work was also published in ACM COMPASS and is listed in publications.
Public Health - Mosquito Surveillance
Object Detection Datasets for Mosquito Detection and Classification
These datasets provide annotated images of mosquitoes captured with camera sensors, with each mosquito instance marked by a bounding box. They are intended to support computer vision tasks in entomology, vector surveillance, and automated monitoring systems. The datasets were created by applying DINO to existing mosquito image datasets, producing bounding-box-labeled versions suitable for object detection, which the original datasets did not support.
Contents
- A collection of images containing one or more mosquito specimens
- For each image, one or more bounding box annotations (coordinates + label) indicating the location of each mosquito
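To make the "coordinates + label" annotations concrete, here is a minimal sketch of one way to represent a box and score a predicted box against a ground-truth box with intersection over union (IoU). The corner-based pixel layout (`x_min`, `y_min`, `x_max`, `y_max`) and the `Box` class are assumptions for illustration. The published annotation files may use a different convention, such as normalized center coordinates, so check the dataset page before adapting this.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (assumed layout, see lead-in)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    label: str = "mosquito"

    def area(self) -> float:
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height

def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, a value in [0, 1]."""
    inter_w = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    inter_h = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes do not overlap
    intersection = inter_w * inter_h
    union = a.area() + b.area() - intersection
    return intersection / union if union > 0 else 0.0

# Hypothetical boxes, not taken from the dataset.
ground_truth = Box(10, 10, 50, 50)
predicted = Box(30, 30, 70, 70)

overlap = iou(ground_truth, predicted)
print(f"IoU = {overlap:.3f}")  # 400 / 2800, prints "IoU = 0.143"
```

Detection benchmarks commonly count a prediction as correct when its IoU with a ground-truth box exceeds a chosen threshold such as 0.5, so the pair above would count as a miss at that setting.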
Use Cases
- Training object detection models for mosquito detection
- Evaluating model performance in identifying and localizing mosquitoes in field imagery
- Developing automated surveillance systems (edge AI) for vector-borne disease monitoring
- Serving as an educational benchmark in computer vision & entomology
The dataset for mosquitoes in varied contexts can be found here. The dataset for mosquitoes on human skin can be found here.
My mosquito work has been published in IEEE BigData 2024, SPIE Digital Image Processing 2024, and will appear in NeurIPS Climate 2025.


Environment
Evapotranspiration Forecasting Dataset
Predicting water stress using satellite and meteorological data.
Aggregated daily time series for 10 California cities, extracted from the OpenET data repository. This subset simplifies experimentation and education for evapotranspiration forecasting algorithms.
Use case: environmental forecasting, remote sensing ML.
This dataset was used in the Cloudera Machine Learning Hackathon, where it won first place for forecasting with Prophet and Kats.
Dataset can be accessed from Kaggle here. The code used in the hackathon can be downloaded from Cloudera here, with an article here.
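For readers who want a reference point before trying Prophet or Kats, a seasonal-naive baseline is a common first step for daily series like these. The sketch below is not the hackathon code. It runs on a synthetic sine-wave series rather than real OpenET values, and the function names and 365-day period are assumptions chosen for illustration.

```python
import math

def seasonal_naive_forecast(history, horizon, period):
    """Repeat the last full seasonal cycle of `history` for `horizon` steps."""
    if len(history) < period:
        raise ValueError("need at least one full period of history")
    last_cycle = history[-period:]
    return [last_cycle[i % period] for i in range(horizon)]

def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Synthetic daily series with a yearly cycle (NOT real OpenET values).
period = 365
series = [
    3.0 + 2.0 * math.sin(2 * math.pi * day / period)
    for day in range(3 * period)
]

# Hold out the final 30 days and forecast them from the rest.
train, test = series[:-30], series[-30:]
forecast = seasonal_naive_forecast(train, horizon=30, period=period)
mae = mean_absolute_error(test, forecast)
print(f"seasonal-naive MAE over 30 days: {mae:.4f}")
```

Because the synthetic series repeats exactly each year, the error here is essentially zero. On real evapotranspiration data the same baseline gives a non-trivial error that more expressive models should aim to beat.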
Public Health - COVID Testing
COVID-19 Test Strip Dataset
Computer vision for lateral-flow assay analysis.
Image dataset capturing color and texture variations across at-home test strips to train classifiers for result interpretation.
Use case: biomedical image analysis, healthtech ML education.
Dataset is available at Kaggle here.

Dataset Use and Citation
All datasets are released for educational and non-commercial research use. When referencing or reusing them, please cite:
Gupta, D. (2025). AI4Science Open Datasets. AI4Science.ai. Retrieved from https://ai4science.ai/datasets