DL-Studio Documentation

A comprehensive guide to understanding, setting up, and using DL-Studio for end-to-end machine learning and deep learning model development.

What is DL-Studio?

Understanding the core purpose and capabilities of DL-Studio

Local-First ML Platform

DL-Studio is a local development environment for building, training, and deploying machine learning and deep learning models. It runs entirely on your machine with no cloud dependencies, ensuring complete data privacy and control.

Unified Framework

Combines traditional ML algorithms (XGBoost, Random Forest, SVM) with deep learning models (MLP, RNN, Transformer) in a single, intuitive interface for easy model comparison and selection.

Built-in Explainability

Features integrated XAI (Explainable AI) capabilities including SHAP, LIME, sensitivity analysis, and correlation matrices to understand and interpret model decisions.

Experiment Tracking

Every training run is logged with metrics, visualizations, and artifacts. Compare models side-by-side and track performance improvements over time.

End-to-End Workflow

From raw data to deployed model in five simple steps

1. Data Upload & Analysis

Upload your CSV or Excel dataset. DL-Studio automatically analyzes the data, detects feature types, identifies missing values, and provides distribution insights.

What happens:

  • Automatic data type detection (numerical, categorical)
  • Missing value identification and reporting
  • Statistical summary generation
  • Target variable selection
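The analysis steps above can be approximated with pandas; the dataframe and column names here are illustrative placeholders, not DL-Studio internals:

```python
import pandas as pd

# Hypothetical uploaded dataset; DL-Studio reads this from a CSV/Excel file.
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["NY", "LA", "NY", "SF"],
    "price": [100.0, 150.0, 120.0, 180.0],
})

# Automatic data type detection (numerical vs. categorical)
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Missing value identification and reporting
missing_report = df.isna().sum()

# Statistical summary generation
summary = df.describe()
```

The target variable would then be chosen from these detected columns depending on the task.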
2. Data Preprocessing

Clean and transform your data with built-in preprocessing pipelines. Handle missing values, encode categories, and scale features automatically.

Available transformations:

  • Missing value imputation (mean, median, mode, drop)
  • Categorical encoding (one-hot, label)
  • Feature scaling (standardization, normalization)
  • Outlier detection and removal
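A sketch of such a preprocessing pipeline with scikit-learn; the column names are placeholders, and DL-Studio's actual pipeline internals may differ:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column layout
numeric_cols = ["age", "income"]
categorical_cols = ["city"]

preprocess = ColumnTransformer([
    # Missing value imputation (median) + feature scaling (standardization)
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Categorical encoding (one-hot); unseen categories are ignored at predict time
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

df = pd.DataFrame({
    "age": [25, np.nan, 41],
    "income": [40_000, 55_000, 70_000],
    "city": ["NY", "LA", "NY"],
})
X = preprocess.fit_transform(df)  # 2 scaled numeric cols + 2 one-hot cols
```

Fitting the transformer on the training split only (and reusing it on validation/test) avoids leaking statistics across splits.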
3. Model Selection & Training

Choose from 20+ ML/DL algorithms with configurable hyperparameters. Train with automatic 80/10/10 train/val/test split.

Training features:

  • One-click model training with smart defaults
  • Real-time training progress monitoring
  • Learning curve visualization
  • Live training logs streaming
  • Configurable hyperparameters per model
4. Model Evaluation & Comparison

Evaluate models using multiple metrics across train/val/test splits. Compare performance across different algorithms to find the best fit.

Evaluation metrics and tools (per split):

  • R² Score (train, validation, test)
  • Mean Absolute Error (MAE)
  • Mean Squared Error (MSE)
  • Side-by-side model comparison charts
  • Residual analysis and diagnostics
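The per-split metrics map directly onto scikit-learn; the arrays below are illustrative values, not DL-Studio output:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative predictions for one split; DL-Studio computes these for
# train, validation, and test separately.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.1, 6.9, 9.3]

r2 = r2_score(y_true, y_pred)                 # ≈ 0.9925
mae = mean_absolute_error(y_true, y_pred)     # 0.175
mse = mean_squared_error(y_true, y_pred)      # 0.0375
```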
5. Explainability & Research Plots

Generate SHAP, LIME, and sensitivity analyses. Export paper-quality visualizations for publication and comprehensive reports.

Export options:

  • Trained model serialization (Keras format)
  • Feature importance reports (PNG)
  • SHAP/LIME explanation plots
  • Research-quality plots (correlation, residuals, distributions)
  • Complete run artifacts and logs

Supported Algorithms

Comprehensive collection of machine learning and deep learning models

Boosting Ensemble Methods

Gradient boosting algorithms that build trees sequentially, correcting errors from previous iterations. Industry-standard for tabular data performance.

XGBoost

Extreme Gradient Boosting with regularization (L1/L2). Best for structured data competitions.

LightGBM

Leaf-wise tree growth. Faster training on large datasets with similar accuracy.

CatBoost

Native categorical handling with ordered boosting. Minimal preprocessing needed.

Gradient Boosting (sklearn)

Scikit-learn implementation. Slower but reliable for smaller datasets.

Tree-Based Models

Decision tree algorithms that split data based on feature values. Easy to interpret and fast to train.

Decision Tree

Single tree. Good baseline model, prone to overfitting.

Random Forest

Ensemble of decision trees with feature bagging. Reduces overfitting.

Extra Trees

Extremely randomized trees. Faster than Random Forest, often similar performance.

Support Vector Machines

SVMs find optimal hyperplanes to separate classes or fit regression lines. Effective in high-dimensional spaces.

SVM (RBF Kernel)

Radial Basis Function. Handles non-linear relationships.

SVM (Linear)

Linear kernel. Fast for high-dimensional sparse data.

Linear Models

Simple linear models that assume linear relationships between features and target. Fast, interpretable, and good baselines.

Linear Regression

Basic linear relationship. Fastest training, assumes linearity.

Ridge Regression

L2 regularization. Prevents overfitting with many features.

Lasso Regression

L1 regularization. Feature selection by shrinking coefficients to zero.

Deep Learning Models

Neural network architectures for learning complex patterns. Require more data but can capture non-linear relationships automatically.

MLP (Multi-Layer Perceptron)

Fully connected feedforward network. Universal approximator.

CNN (1-D Convolutional)

For sequential data with local patterns. Learns spatial hierarchies.

RNN / LSTM / GRU

For sequential data. Captures temporal dependencies.

Transformer

Attention-based. State-of-the-art for NLP and sequences.

Dataset Requirements

Recommended dataset sizes and guidelines for optimal results

Algorithm Type        Minimum Samples   Recommended Samples   Features     Best For
Linear Models         50                500+                  1-100        Small Data
Decision Tree         100               1,000+                1-50         Small Data
Random Forest         200               2,000+                1-200        Medium Data
SVM                   100               5,000+                1-10,000     Medium Data
XGBoost / LightGBM    500               10,000+               1-1,000      Large Data
CatBoost              500               10,000+               1-500        Large Data
MLP                   1,000             50,000+               1-10,000     Medium-Large

Good Data Quality Signs

No missing values or minimal (< 5%) missing data. Balanced classes for classification (within 10:1 ratio). Clean labels without typos or inconsistencies. Relevant features that correlate with target.

Data Quality Red Flags

High missing rates (> 20%) need careful imputation. Heavy class imbalance requires resampling or class weights. Data leakage where future info leaks into training. Outliers that may be errors or genuine extreme values.
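The first red flag above (high missing rates) is quick to check with pandas; the column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "ok": [1, 2, 3, 4, 5],
    "mostly_missing": [None, None, 7.0, None, None],
})

# Share of missing values per column; > 20% warrants careful imputation
missing_rate = df.isna().mean()
flagged = missing_rate[missing_rate > 0.20].index.tolist()
```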

Explainable AI (XAI) Techniques

Built-in interpretability methods for understanding model decisions

SHAP (SHapley Additive exPlanations)

Based on game theory, SHAP calculates the marginal contribution of each feature to each prediction. Provides both global feature importance (overall model) and local explanations (individual predictions).

Theoretically grounded · Local accuracy guaranteed · Missingness handled · Global + Local views

LIME (Local Interpretable Model-agnostic Explanations)

LIME explains individual predictions by approximating the model locally with an interpretable model. Perturbs input data around the point of interest and weights samples by proximity.

Model-agnostic · Human-interpretable · Works on any model · Feature perturbation

Sensitivity Analysis

Examines how model predictions change when individual features are varied while keeping others constant. Creates feature response curves showing the relationship between each feature and output.

Intuitive visualization · Detects non-linearity · No model assumption · Domain expert friendly
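A one-at-a-time sensitivity sweep can be sketched as follows (an illustration under synthetic data, not DL-Studio's implementation):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2
model = GradientBoostingRegressor(random_state=0).fit(X, y)

def response_curve(model, X, feature, n_points=50):
    """Vary one feature over its observed range, holding others at their median."""
    grid = np.linspace(X[:, feature].min(), X[:, feature].max(), n_points)
    base = np.median(X, axis=0)
    probe = np.tile(base, (n_points, 1))
    probe[:, feature] = grid
    return grid, model.predict(probe)

grid, curve = response_curve(model, X, feature=0)  # the feature response curve
```

Plotting `curve` against `grid` gives the feature response curve described above; a non-flat, non-linear shape signals a non-linear learned relationship.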

Correlation Analysis

Computes Pearson correlation coefficients between all numerical features. Essential for understanding feature relationships, detecting multicollinearity, and feature selection.

Multicollinearity detection · Feature engineering hints · Heatmap visualization · Quick insights
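The correlation computation is one pandas call; the columns here are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with "a"
    "c": [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with "a"
})
corr = df.corr(method="pearson")

# Multicollinearity check: off-diagonal |r| above a threshold
mask = ~np.eye(len(corr), dtype=bool)
multicollinear = corr.where(mask).abs() > 0.95
```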

Residual Analysis

Plots residuals (actual - predicted) to diagnose model fit. Reveals patterns that the model failed to capture, outliers, and heteroscedasticity.

Model diagnostic · Outlier detection · Assumption checking · Improvement guidance
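The residual diagnostics reduce to simple array arithmetic; the values below are illustrative:

```python
import numpy as np

# residual = actual - predicted
y_true = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
y_pred = np.array([9.5, 12.4, 9.2, 14.0, 11.1])
residuals = y_true - y_pred

# A well-fit model shows residuals centered near zero with no trend vs. y_pred;
# a funnel shape suggests heteroscedasticity, large |residuals| flag outliers.
mean_resid = residuals.mean()
outliers = np.abs(residuals - mean_resid) > 2 * residuals.std()
```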

Native Feature Importance

Built-in importance scores from tree-based models (Gini/MDI for sklearn, gain-based for XGBoost/LightGBM). Quick ranking of features by their contribution to splits.

Fast computation · No extra library needed · Tree-specific accuracy · Baseline comparison
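Native importances come for free after fitting a tree model; a sketch with synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 5 * X[:, 0] + X[:, 1]  # features 2 and 3 are pure noise

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Gini/MDI importances sum to 1; sorting gives the quick feature ranking
importances = model.feature_importances_
ranking = np.argsort(importances)[::-1]  # most important feature first
```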

Key Features

What makes DL-Studio powerful for ML development

Easy Data Upload

Drag-and-drop CSV/Excel files. Auto-detection of columns and types.

Auto-Preprocessing

Automatic handling of missing values, encoding, and scaling.

Model Comparison

Train multiple models and compare metrics side-by-side across train/val/test splits.

Research Plots

Paper-quality visualizations for publications: regression, distributions, correlation, importance.

Experiment Tracking

Every run logged with parameters, metrics, and artifacts. History sidebar for comparing runs.

Built-in XAI

SHAP, LIME, Sensitivity Analysis, Correlation Matrix, Residual Analysis.

Configurable Hyperparameters

All models have customizable parameters via the Architecture tab. Click the gear icon.

Dynamic Architecture

Visual neural network diagram updates based on selected model and hyperparameters.

Easy Export

Download trained models, plots, and complete run reports.

Studio Tabs Guide

Understanding the seven main tabs in the DL-Studio workspace

Architecture
  Select model family, configure hyperparameters via the gear icon, and preview the dynamic neural network diagram.
  Key actions: choose model, set params, see architecture preview, set benchmark mode.

Training Hub
  Monitor training with live logs, loss curves, and MAE over epochs. Shows data split row counts.
  Key actions: view learning curves, MAE charts, real-time metrics, live log stream.

Verification
  Test the model with random samples. Auto-loads random data from the dataset range; randomize to get fresh samples.
  Key actions: randomize inputs, edit values, run prediction, view target outputs.

Split Results
  Comprehensive train/val/test metrics table with R², MAE, MSE, and RMSE per split, plus quality and overfit badges.
  Key actions: sort by any metric, radar chart comparison, best-per-split cards, fit status.

Benchmark
  Side-by-side comparison across all trained algorithms, with a top-3 podium and grouped bar charts.
  Key actions: compare R² and MAE by split, full leaderboard, winner highlight.

Intelligence
  SHAP feature importance, LIME rules, sensitivity analysis curves, and correlation matrix.
  Key actions: view explanations, understand feature contributions, analyze residuals.

Research Plots
  Publication-quality charts: regression scatter, importance bar, correlation heatmap, distributions.
  Key actions: export PNG, customize colors, paper-ready visualizations.

Run Management

Track, compare, and manage your training experiments

Run History

Every training run is saved with full metadata: parameters, metrics, model artifacts, and logs. Browse past runs via the History sidebar or Run Manager panel. Load any run by ID to restore its state and make new predictions.

Delete Runs

Delete Active: Remove the currently loaded run and all its artifacts. Clear All: Remove all training history at once (with confirmation). Use the cleanup buttons in the Run Manager panel to free disk space and keep your workspace organized.

Prediction Verification

The Verification tab auto-loads a random sample from your dataset with values within each feature's actual min/max range. Click Randomize to get a fresh sample. Modify any value and click Run Prediction to see the model's output. This helps you understand how individual features affect predictions.
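The random-sample behaviour can be sketched as uniform draws within each numeric feature's observed range (an illustration, not the actual implementation):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng()

def random_sample(df: pd.DataFrame) -> dict:
    """Draw one value per numeric feature, uniformly within its min/max range."""
    return {
        col: float(rng.uniform(df[col].min(), df[col].max()))
        for col in df.select_dtypes(include="number").columns
    }

# Hypothetical dataset; DL-Studio uses the uploaded file's actual ranges
df = pd.DataFrame({"age": [18, 65, 40], "income": [20e3, 90e3, 50e3]})
sample = random_sample(df)
```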

Data Split Strategy

Automatic 80/10/10 train/validation/test split for robust model evaluation

Training Set (80%)

Used to train all models including traditional ML and deep learning. The model learns patterns from this data.

Validation Set (10%)

Used for hyperparameter tuning, early stopping, and model selection. Prevents overfitting by monitoring validation loss during training.

Test Set (10%)

Held-out data for final model evaluation. Provides unbiased performance estimate. Benchmark results show metrics for all three splits.
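The 80/10/10 split can be reproduced with two chained `train_test_split` calls (a sketch, not DL-Studio's exact code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000, dtype=float)

# First carve off 20%, then split that 20% half-and-half into val/test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42
)
# Result: 800 train / 100 validation / 100 test rows
```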

Benchmark Metrics by Split

The Benchmark tab displays R², MAE, and MSE for each model across all three splits. This helps identify overfitting (high train R², low test R²) and underfitting (low R² across all splits). Best practice: test R² should be within 5% of validation R².

Configurable Hyperparameters

Click the ⚙️ icon next to any model to configure its parameters

XGBoost / LightGBM

Trees: Number of trees (50-500)
Max Depth: Tree depth limit (3-12)
Learning Rate: Step size (0.01-0.3)
Subsample: Row sampling (0.5-1.0)
Colsample: Feature sampling (0.5-1.0)

Neural Networks (MLP)

Hidden Layers: Number of layers (1-5)
Neurons: Units per layer (8-512)
Activation: ReLU, Tanh, or Sigmoid
Dropout: Regularization rate (0-0.5)

LSTM / GRU

Units: Memory units (32-256)
Layers: Recurrent layers (1-3)
Bidirectional: Forward + backward (LSTM only)

Transformer

Attention Heads: Parallel attention (1-8)
Layers: Transformer blocks (1-4)
FFN Dimension: Feed-forward size (64-512)

When to Use DL-Studio

Ideal use cases and scenarios

Best For

Tabular data analysis with structured datasets. Quick prototyping to test multiple algorithms. Explainability requirements needing SHAP/LIME. Local development without cloud dependencies.

Not Ideal For

Very large datasets (> 1M rows) may need distributed computing. Real-time inference requiring low-latency APIs. Complex NLP/Vision requiring state-of-the-art transformers. Production pipelines needing CI/CD integration.