About - Qualitative Data Saturation Estimation using AI

TÜBİTAK Project

This web application is a scientific output of a TÜBİTAK-funded research project conducted under the ARDEB program (Project Code: 124K233). The project is led by Principal Investigator Hasan TUTAR and aims to develop intelligent tools that support methodological decisions in qualitative research.

Project Purpose

Q-Sat AI (Qualitative Data Saturation Estimation using AI) is an AI-supported prediction tool that estimates an appropriate sample size for qualitative research. It guides researchers with machine learning models trained on more than 3,000 qualitative research data points.

Technical Specifications

Dataset

3000+ qualitative research samples
6 different research designs
11 different parameters
95th percentile data cleaning
Balanced dataset (n=500 per design)
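The cleaning and balancing steps listed above could be sketched as follows. This is only an illustration under stated assumptions: the source does not say which variable the 95th percentile cut applies to or whether it is applied per design, so the sketch assumes a cut on the sample size within each design. The field names `design` and `sample_size` and the function names are hypothetical.

```python
import random
from collections import defaultdict


def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def clean_and_balance(rows, n_per_design=500, q=95, seed=0):
    """Drop rows above the q-th percentile of sample size within each
    design (assumed rule), then keep at most n_per_design rows per design."""
    by_design = defaultdict(list)
    for row in rows:
        by_design[row["design"]].append(row)

    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    balanced = []
    for design, group in by_design.items():
        cutoff = percentile([r["sample_size"] for r in group], q)
        kept = [r for r in group if r["sample_size"] <= cutoff]
        rng.shuffle(kept)
        balanced.extend(kept[:n_per_design])
    return balanced
```

With six designs and `n_per_design=500`, a call such as `clean_and_balance(rows)` would return at most 3,000 rows, which matches the dataset size stated above.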

Model Architecture

Ensemble Learning Model
Multiple machine learning base models
Multiple meta-models
5-fold cross-validation (k=5)
R² (coefficient of determination) of about 0.85
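To make the evaluation terms above concrete, the following sketch shows how an average test R² over k=5 folds can be computed. It is written in plain Python for readability and is not the project's code, which, given the technologies listed on this page, presumably relies on scikit-learn. The function names and the `fit` callback are hypothetical.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the row indices and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_r2(fit, X, y, k=5, seed=0):
    """Average test R² over k folds.

    `fit(X_train, y_train)` must return a function that maps one
    feature row to a prediction.
    """
    scores = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        y_pred = [predict(X[i]) for i in test_idx]
        scores.append(r2_score([y[i] for i in test_idx], y_pred))
    return sum(scores) / k
```

Each row is used for testing exactly once, so the averaged score reflects performance on data the model did not see during fitting. The sketch assumes every fold contains at least two distinct target values; otherwise `ss_tot` would be zero.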

Research Designs

Narrative Research: Focus on storytelling
Ethnographic Research: Cultural analysis
Phenomenology: Experience analysis
Grounded Theory: Theory development
Case Study: In-depth examination
Other Designs: Mixed approaches

Model Performance (Detailed)

Model	Test R² (Avg.)	Train R² (Avg.)	Test MAE (Avg.)	Best Params
KNeighbors	0.852742	0.909446	0.151285	{'model__n_neighbors': 15, 'model__p': 1, 'model__weights': 'distance'}
GradientBoosting	0.852534	0.907133	0.175680	{'learning_rate': 0.1, 'loss': 'squared_error', 'max_depth': 7, 'n_estimators': 200, 'subsample': 0.8}
RandomForest	0.852449	0.905714	0.183468	{'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
XGBoost	0.849898	0.904114	0.186369	{'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 200, 'reg_alpha': 0, 'reg_lambda': 1.5, 'subsample': 0.8}
DecisionTree	0.845724	0.912250	0.147432	{'criterion': 'squared_error', 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2}
SVR	0.763296	0.849767	0.263101	{'model__C': 10.0, 'model__degree': 2, 'model__gamma': 'scale', 'model__kernel': 'rbf'}
MLP	0.685608	0.779126	0.370804	{'activation': 'logistic', 'alpha': 0.01, 'early_stopping': True, 'hidden_layer_sizes': (30,), 'learning_rate': 'constant', 'solver': 'lbfgs'}
AdaBoost	0.423281	0.438722	0.586717	{'learning_rate': 0.05, 'loss': 'square', 'n_estimators': 100}
Ridge	0.391545	0.400687	0.575250	{'model__alpha': 50.0}
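One way to read the table is to compare each model's train and test R². The sketch below copies the reported averages and computes the difference between them, a common rough indicator of overfitting. The variable and function names are illustrative, and the gap is an interpretation aid rather than a metric reported by the project.

```python
# (model, test R², train R², test MAE) copied from the table above
results = [
    ("KNeighbors", 0.852742, 0.909446, 0.151285),
    ("GradientBoosting", 0.852534, 0.907133, 0.175680),
    ("RandomForest", 0.852449, 0.905714, 0.183468),
    ("XGBoost", 0.849898, 0.904114, 0.186369),
    ("DecisionTree", 0.845724, 0.912250, 0.147432),
    ("SVR", 0.763296, 0.849767, 0.263101),
    ("MLP", 0.685608, 0.779126, 0.370804),
    ("AdaBoost", 0.423281, 0.438722, 0.586717),
    ("Ridge", 0.391545, 0.400687, 0.575250),
]


def generalization_gap(train_r2, test_r2):
    """Train R² minus test R²; larger values suggest more overfitting."""
    return train_r2 - test_r2


gaps = {
    name: round(generalization_gap(train, test), 6)
    for name, test, train, _ in results
}
largest_gap = max(gaps, key=gaps.get)
smallest_gap = min(gaps, key=gaps.get)
best_test_r2 = max(results, key=lambda r: r[1])[0]
lowest_mae = min(results, key=lambda r: r[3])[0]

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:<17} gap = {gap:.6f}")
```

With the values in the table, KNeighbors has the highest test R², DecisionTree has the lowest test MAE, MLP shows the largest train-test gap (about 0.094), and Ridge the smallest (about 0.009), although Ridge also fits the data poorly overall.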

User Guide

Select Data Quality: Choose the expected quality of your data.
Select Information Power: Assess the knowledge level of your participants.
Select Homogeneity/Heterogeneity: Define if your group is similar or diverse.
Select Number of Interviews: Specify the number of interviews per participant.
Select Researcher Competence: Rate the researcher's experience level.
Select Research Scope: Define if the scope is narrow or broad.
Select Data Diversity (Triangulation): Assess the diversity of your data sources.
Select Participant Originality: Rate the originality of participant insights.
Select Interview Duration: Specify the average length of the interviews.
Get Prediction: Click "Predict Sample Size" to see the results.
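The steps above amount to filling in nine inputs before requesting a prediction. A minimal completeness check might look like the sketch below. The field names are hypothetical labels derived from the guide, not the application's actual form fields, and the check does not validate value ranges because the source does not specify them.

```python
# Hypothetical field names, one per selection step in the guide above
REQUIRED_INPUTS = (
    "data_quality",
    "information_power",
    "homogeneity",
    "interviews_per_participant",
    "researcher_competence",
    "research_scope",
    "data_diversity",
    "participant_originality",
    "interview_duration",
)


def missing_inputs(form):
    """Return the names of guide inputs that are absent or left empty."""
    return [name for name in REQUIRED_INPUTS if form.get(name) in (None, "")]
```

A caller could refuse to run the prediction while `missing_inputs(form)` is non-empty and show the returned names to the user.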

Important Warnings

Points to Consider

Prediction results are for guidance and are not absolute values.
You can make adjustments based on your research topic and methodology.
Consider data saturation criteria.
Seek expert opinions and conduct a literature review.
Obtain ethics committee approval and any other necessary permissions.

Technical Details

Technologies Used

Python 3.10+
Scikit-learn
Pandas & NumPy
Flask Web Framework
Tailwind CSS

Model Algorithms

Ridge Regression
K-Nearest Neighbors (KNeighbors)
Support Vector Regression (SVR)
Decision Tree
Random Forest
Gradient Boosting
AdaBoost
Neural Network (MLP)
XGBoost
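As an illustration of one algorithm from this list, the sketch below implements K-Nearest Neighbors regression with the settings reported as best in the performance table: 15 neighbors, Manhattan distance (p=1), and inverse-distance weighting. It is a plain-Python teaching sketch, not the project's scikit-learn model, and it assumes the features are already numeric and on comparable scales. The function names are hypothetical.

```python
def manhattan(a, b):
    """Manhattan (L1) distance, i.e. Minkowski distance with p=1."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(X_train, y_train, query, n_neighbors=15):
    """Predict a target as the inverse-distance-weighted mean of the
    n_neighbors closest training rows."""
    nearest = sorted(
        (manhattan(row, query), target)
        for row, target in zip(X_train, y_train)
    )[:n_neighbors]

    # An exact match would give an infinite weight, so average the
    # exact matches instead of dividing by zero.
    exact = [target for dist, target in nearest if dist == 0]
    if exact:
        return sum(exact) / len(exact)

    weights = [1.0 / dist for dist, _ in nearest]
    weighted_sum = sum(w * target for w, (_, target) in zip(weights, nearest))
    return weighted_sum / sum(weights)
```

For example, with training rows `[[0], [3]]`, targets `[0.0, 4.0]`, and `n_neighbors=2`, the query `[1]` is twice as close to the first row, so the prediction is pulled toward 0.0 and comes out at 4/3.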

About the System