Metadata-Version: 2.4
Name: Ak-dskit
Version: 1.0.1
Summary: A comprehensive data science toolkit with 221+ functions for ML workflows
Home-page: https://github.com/Programmers-Paradise/imputeKit
Author: Programmers' Paradise
Author-email: Aksh Agrawal <akshagr10@gmail.com>
Project-URL: Homepage, https://github.com/Programmers-Paradise/imputeKit
Project-URL: Documentation, https://github.com/Programmers-Paradise/imputeKit/blob/main/COMPLETE_FEATURE_DOCUMENTATION.md
Project-URL: Repository, https://github.com/Programmers-Paradise/imputeKit
Project-URL: Bug Tracker, https://github.com/Programmers-Paradise/imputeKit/issues
Project-URL: Changelog, https://github.com/Programmers-Paradise/imputeKit/releases
Keywords: data-science,machine-learning,eda,preprocessing,automl,feature-engineering,hyperplane-analysis,visualization,model-explainability
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.3.0
Requires-Dist: numpy>=1.20.0
Requires-Dist: matplotlib>=3.5.0
Requires-Dist: seaborn>=0.11.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: shap>=0.40.0
Requires-Dist: openpyxl>=3.0.0
Requires-Dist: requests>=2.25.0
Requires-Dist: joblib>=1.0.0
Requires-Dist: scipy>=1.7.0
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: pytest-cov>=2.0; extra == "dev"
Requires-Dist: black>=22.0; extra == "dev"
Requires-Dist: flake8>=4.0; extra == "dev"
Requires-Dist: mypy>=0.900; extra == "dev"
Requires-Dist: pre-commit>=2.15.0; extra == "dev"
Provides-Extra: full
Requires-Dist: plotly>=5.0.0; extra == "full"
Requires-Dist: wordcloud>=1.8.0; extra == "full"
Requires-Dist: nltk>=3.7; extra == "full"
Requires-Dist: textblob>=0.17.0; extra == "full"
Requires-Dist: hyperopt>=0.2.5; extra == "full"
Requires-Dist: optuna>=2.10.0; extra == "full"
Requires-Dist: xgboost>=1.5.0; extra == "full"
Requires-Dist: lightgbm>=3.3.0; extra == "full"
Requires-Dist: catboost>=1.0.0; extra == "full"
Requires-Dist: imbalanced-learn>=0.8.0; extra == "full"
Requires-Dist: pandas-profiling>=3.0.0; extra == "full"
Requires-Dist: feature-engine>=1.4.0; extra == "full"
Provides-Extra: visualization
Requires-Dist: plotly>=5.0.0; extra == "visualization"
Provides-Extra: nlp
Requires-Dist: wordcloud>=1.8.0; extra == "nlp"
Requires-Dist: nltk>=3.7; extra == "nlp"
Requires-Dist: textblob>=0.17.0; extra == "nlp"
Provides-Extra: automl
Requires-Dist: hyperopt>=0.2.5; extra == "automl"
Requires-Dist: optuna>=2.10.0; extra == "automl"
Requires-Dist: xgboost>=1.5.0; extra == "automl"
Requires-Dist: lightgbm>=3.3.0; extra == "automl"
Requires-Dist: catboost>=1.0.0; extra == "automl"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# 🚀 DSKit - A Unified Wrapper Library for Data Science & ML

**DSKit** is a comprehensive, community-driven, open-source Python library that wraps complex Data Science and ML operations in intuitive one-line commands.

Instead of writing hundreds of lines for cleaning, EDA, plotting, preprocessing, modeling, evaluation, and explainability, DSKit makes everything **simple**, **readable**, **reusable**, and **production-ready**.

The goal is to bring a **complete end-to-end Data Science ecosystem** into one place through wrapper-style functions and classes that cover everything from basic data manipulation to advanced AutoML.

---

## 🎯 Project Objective

To create a Python library that lets users perform complete Data Science workflows with minimal code:

```python
from dskit import DSKit

# Complete ML Pipeline in 4 lines!
kit = DSKit.load("data.csv")
kit.comprehensive_eda(target_col="target").clean().engineer_features()
kit.train_advanced("xgboost").auto_tune().evaluate()
kit.explain()  # Generate SHAP explanations
```

The library remains:

- ✅ **Simple**: One-line commands for complex operations
- ✅ **Comprehensive**: 221+ functions covering the entire ML pipeline
- ✅ **Extensible**: Modular design for easy customization
- ✅ **Beginner-friendly**: Intuitive API with smart defaults
- ✅ **Expert-ready**: Advanced features and customization options
- ✅ **Production-ready**: Robust error handling and optimization

---

## 📦 Installation

### From PyPI (Recommended)

```bash
# Basic installation (the PyPI distribution is named Ak-dskit; the import name is dskit)
pip install Ak-dskit

# Full installation with all optional dependencies
pip install "Ak-dskit[full]"

# Install specific feature sets
pip install "Ak-dskit[visualization]"  # Plotly support
pip install "Ak-dskit[nlp]"            # NLP utilities
pip install "Ak-dskit[automl]"         # AutoML algorithms

# Development installation
pip install "Ak-dskit[dev]"
```

### From Source

```bash
git clone https://github.com/Programmers-Paradise/imputeKit.git
cd imputeKit
pip install -e .
```

### Verify Installation

```bash
# Test the package (run from the root of a source checkout, where test_package.py lives)
python test_package.py

# Check CLI
dskit --help
```
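
If you prefer to check the installation from Python, the standard library can report which version of the distribution is installed. This is a minimal sketch, not part of DSKit: `installed_version` is a hypothetical helper name, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "Ak-dskit" is the distribution name taken from this package's metadata.
    version = installed_version("Ak-dskit")
    print(version if version else "Ak-dskit is not installed")
```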

---

## 📦 Core Modules

DSKit includes comprehensive modules for:

### 📁 **Data I/O**

- Multi-format loading (CSV, Excel, JSON, Parquet)
- Batch folder processing
- Smart data type detection

### 🧹 **Data Cleaning**

- Auto-detect and fix data types
- Smart missing value imputation
- Outlier detection and removal
- Column name standardization
- Text preprocessing and NLP utilities
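
To make two of the items above concrete, here is a rough sketch of column name standardization and median imputation. It illustrates the ideas only and is not DSKit's implementation: the helper names are hypothetical, and it uses plain Python lists rather than the pandas objects DSKit works with.

```python
import re
from statistics import median


def standardize_name(name: str) -> str:
    """Convert a column label to lowercase snake_case."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return name.strip("_").lower()


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)  # raises StatisticsError if nothing was observed
    return [fill if v is None else v for v in values]


columns = ["Customer ID", " Total-Amount ($)", "signupDate"]
print([standardize_name(c) for c in columns])
# ['customer_id', 'total_amount', 'signupdate']

print(impute_median([10, None, 30, 50, None]))
# [10, 30, 30, 50, 30]
```

The median is used here because it is less sensitive to outliers than the mean; which strategy DSKit picks by default is not stated in this document.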

### 📊 **Exploratory Data Analysis**

- Comprehensive EDA reports
- Data health scoring
- Interactive visualizations
- Statistical summaries
- Correlation analysis
- Missing data patterns
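
As an illustration of what a missing-data summary and a correlation analysis compute, the sketch below works on a list of row dictionaries. The function names and sample data are made up for this example; DSKit's own reports are built on pandas and may compute these quantities differently.

```python
from math import sqrt


def missing_ratio(rows, column):
    """Fraction of rows where `column` is None."""
    return sum(1 for row in rows if row[column] is None) / len(rows)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


rows = [
    {"age": 25, "income": 40_000},
    {"age": 35, "income": None},
    {"age": 45, "income": 80_000},
    {"age": 55, "income": 100_000},
]
print(missing_ratio(rows, "income"))  # 0.25

# Correlate only the rows where both values are present.
complete = [row for row in rows if row["income"] is not None]
corr = pearson([row["age"] for row in complete],
               [row["income"] for row in complete])
print(round(corr, 3))  # 1.0 (income is an exact linear function of age here)
```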

### 🔧 **Feature Engineering**

- Polynomial and interaction features
- Date/time feature extraction
- Binning and discretization
- Target encoding
- Dimensionality reduction (PCA)
- Text feature extraction
- Sentiment analysis
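
Polynomial and interaction features are simply products of the original columns. The sketch below expands one row up to a given degree; `polynomial_features` is a hypothetical helper written for this README, and unlike scikit-learn's `PolynomialFeatures` it does not add a constant bias column.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_features(row, degree=2):
    """Expand a dict of numeric features with all products up to `degree`."""
    names = sorted(row)
    expanded = {}
    for d in range(1, degree + 1):
        # Each combination such as ('x', 'y') becomes the feature x*y.
        for combo in combinations_with_replacement(names, d):
            expanded["*".join(combo)] = prod(row[name] for name in combo)
    return expanded


print(polynomial_features({"x": 2, "y": 3}))
# {'x': 2, 'y': 3, 'x*x': 4, 'x*y': 6, 'y*y': 9}
```

The number of generated columns grows quickly with the degree and the number of inputs, which is why a low degree is usually a sensible starting point.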

### 🤖 **Machine Learning**

- 15+ algorithms (including XGBoost, LightGBM, CatBoost)
- AutoML capabilities
- Hyperparameter optimization
- Cross-validation
- Ensemble methods
- Imbalanced data handling
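
Cross-validation splits the data into k folds and uses each fold once as the test set. The following sketch shows only the index bookkeeping, under the assumption of a simple shuffled split; `k_fold_indices` is a hypothetical name, and DSKit presumably delegates this to scikit-learn.

```python
import random


def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


for train_idx, test_idx in k_fold_indices(10, k=3):
    print(len(train_idx), len(test_idx))
# 6 4
# 7 3
# 7 3
```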

### 📈 **Visualization**

- Static plots (matplotlib/seaborn)
- Interactive plots (plotly)
- Model performance charts
- Feature importance plots
- Advanced correlation heatmaps

### 🧠 **Model Explainability**

- SHAP integration
- Feature importance analysis
- Model performance metrics
- Error analysis
- Learning curves

### 📐 **Hyperplane Analysis**

- Algorithm-specific hyperplane visualization
- SVM margins and support vectors
- Logistic regression probability contours
- Perceptron misclassification highlighting
- LDA class centers and projections
- Linear regression residual analysis
- Multi-algorithm comparison tools
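
The geometry behind these plots is compact: a linear model defines the hyperplane `w·x + b = 0`, a point's signed distance to it is `(w·x + b) / ||w||`, and an SVM's margin has width `2 / ||w||`. The sketch below computes both with hypothetical helper names; it does not call any DSKit plotting function.

```python
from math import sqrt


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin_width(w):
    """Distance between the planes w.x + b = +1 and w.x + b = -1."""
    return 2 / sqrt(sum(wi * wi for wi in w))


w, b = [3.0, 4.0], -5.0
print(signed_distance(w, b, [3.0, 4.0]))  # 4.0
print(signed_distance(w, b, [0.0, 0.0]))  # -1.0
print(margin_width(w))                    # 0.4
```

The sign of the distance tells you which side of the boundary a point falls on, which is what a linear classifier uses to assign a class.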

### 🎯 **AutoML Features**

- Automated preprocessing pipelines
- Model comparison and selection
- Hyperparameter tuning (Grid, Random, Bayesian, Optuna)
- Automated feature selection
- Pipeline optimization
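
Of the tuning strategies listed, random search is the easiest to sketch: sample parameter sets, score each one, keep the best. The example below is an illustration under stated assumptions, not DSKit's tuner. `random_search`, `toy_loss`, and the search space are invented here, and the toy loss stands in for a real validation score.

```python
import random


def random_search(objective, space, n_trials=50, seed=0):
    """Sample parameter sets uniformly and keep the lowest-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high)
                  for name, (low, high) in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_loss(p):
    """Stand-in for a validation loss; its minimum is at lr=0.1, depth=6."""
    return (p["lr"] - 0.1) ** 2 + ((p["depth"] - 6) / 10) ** 2


space = {"lr": (0.001, 0.5), "depth": (2, 12)}
best_params, best_score = random_search(toy_loss, space, n_trials=200)
print(best_params, round(best_score, 5))
```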

---

## 🚀 Quick Start

### Installation

```bash
# Basic installation
pip install Ak-dskit

# Full installation with all optional dependencies
pip install "Ak-dskit[full]"

# Development installation
git clone https://github.com/Programmers-Paradise/imputeKit.git
cd imputeKit
pip install -e ".[dev,full]"
```

### Basic Usage

```python
from dskit import DSKit

# Load and explore data
kit = DSKit.load("your_data.csv")
kit.data_health_check()  # Get data quality score
kit.comprehensive_eda(target_col="target")  # Full EDA report

# Clean and preprocess
kit.clean()  # Auto-clean: fix types, handle missing, normalize columns
kit.engineer_features()  # Create polynomial, date, and text features

# Train and evaluate models
kit.train_test_auto(target="your_target")
kit.compare_models("your_target")  # Compare multiple algorithms
kit.train_advanced("xgboost").auto_tune()  # Train with hyperparameter tuning
kit.evaluate().explain()  # Evaluate and generate SHAP explanations
```

### Advanced Features

```python
# Advanced text processing
kit.sentiment_analysis(["text_column"])
kit.extract_text_features(["text_column"])
kit.generate_wordcloud("text_column")

# Feature engineering
kit.create_polynomial_features(degree=3)
kit.create_date_features(["date_column"])
kit.apply_pca(variance_threshold=0.95)

# AutoML
kit.auto_tune(method="optuna", max_evals=100)
best_models = kit.compare_models("target", task="classification")

# Advanced visualizations
kit.plot_feature_importance(top_n=20)
kit.plot_learning_curves()
kit.plot_validation_curves()

# Algorithm-specific hyperplane visualization
# (svm_model, lr_model and perceptron_model are already-fitted models; X, y are their training data)
import dskit

dskit.plot_svm_hyperplane(svm_model, X, y)  # SVM with margins
dskit.plot_logistic_hyperplane(lr_model, X, y)  # Probability contours
dskit.plot_perceptron_hyperplane(perceptron_model, X, y)  # Misclassified points

# Compare multiple algorithm hyperplanes
models = {'SVM': svm_model, 'LR': lr_model, 'Perceptron': perceptron_model}
dskit.compare_algorithm_hyperplanes(models, X, y)
```

---

## 📚 Complete Feature Documentation

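### 🧩 Implemented Features (All Tasks Complete)

The full feature list is maintained in [COMPLETE_FEATURE_DOCUMENTATION.md](https://github.com/Programmers-Paradise/imputeKit/blob/main/COMPLETE_FEATURE_DOCUMENTATION.md). Each task is numbered and written in simple language with enough theory that any contributor, including new ones, can understand exactly what was built.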

---

## 📖 Examples & Tutorials

### Complete ML Pipeline Example

```python
import pandas as pd
from dskit import DSKit

# 1. Load and explore
kit = DSKit.load("customer_data.csv")
health_score = kit.data_health_check()  # Data quality score, e.g. 85.3 out of 100

# 2. Comprehensive EDA
kit.comprehensive_eda(target_col="churn", sample_size=1000)
kit.generate_profile_report("eda_report.html")  # Automated EDA report

# 3. Advanced text processing (if text columns exist)
kit.advanced_text_clean(["feedback"])
kit.sentiment_analysis(["feedback"])
kit.extract_text_features(["feedback"])

# 4. Feature engineering
kit.create_date_features(["registration_date"])
kit.create_polynomial_features(degree=2, interaction_only=True)
kit.create_binning_features(["age", "income"], n_bins=5)

# 5. Preprocessing
kit.clean()  # Auto-clean pipeline
kit.handle_imbalanced_data(method="smote")  # Handle class imbalance

# 6. Model training and optimization
X_train, X_test, y_train, y_test = kit.train_test_auto("churn")
comparison = kit.compare_models("churn")  # Compare 10+ algorithms
kit.train_advanced("xgboost").auto_tune(method="optuna", max_evals=50)

# 7. Evaluation and explainability
kit.evaluate().explain()  # Comprehensive evaluation + SHAP
kit.plot_feature_importance()
kit.cross_validate(cv=5)
```

### NLP Pipeline Example

```python
# Text analysis workflow
from dskit import DSKit

kit = DSKit.load("reviews.csv")
kit.text_stats(["review_text"])  # Basic text statistics
kit.advanced_text_clean(["review_text"], remove_urls=True, expand_contractions=True)
kit.sentiment_analysis(["review_text"])  # Add sentiment scores
kit.generate_wordcloud("review_text", max_words=100)
kit.extract_keywords("review_text", top_n=20)
```

### Time Series Feature Engineering

```python
# Date/time feature extraction
# (assumes `kit` is a DSKit instance loaded as in the examples above)
kit.create_date_features(["transaction_date"])
# Creates: year, month, day, weekday, quarter, is_weekend columns

kit.create_aggregation_features("customer_id", ["amount"], ["mean", "std", "count"])
# Creates aggregated features grouped by customer
```
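
For readers who want to see what these two steps compute, here is a plain-Python sketch of the same ideas. The derived columns match the ones named in the comment above; the helper names and sample rows are hypothetical, and DSKit's actual output column names may differ.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def date_features(d: date) -> dict:
    """Derive calendar features from a single date."""
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "weekday": d.weekday(),             # Monday = 0, Sunday = 6
        "quarter": (d.month - 1) // 3 + 1,
        "is_weekend": d.weekday() >= 5,
    }


def aggregate_by(rows, key, value):
    """Group rows by `key` and compute the mean and count of `value`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: {"mean": mean(v), "count": len(v)} for k, v in groups.items()}


print(date_features(date(2024, 3, 16)))
# {'year': 2024, 'month': 3, 'day': 16, 'weekday': 5, 'quarter': 1, 'is_weekend': True}

transactions = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 5.0},
]
print(aggregate_by(transactions, "customer_id", "amount"))
# {1: {'mean': 20.0, 'count': 2}, 2: {'mean': 5.0, 'count': 1}}
```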

---

## 🎯 AutoML Capabilities

DSKit includes comprehensive AutoML features:

- **Automated Preprocessing**: Smart data cleaning and feature engineering
- **Model Selection**: Automatic algorithm comparison and selection
- **Hyperparameter Optimization**: Grid, Random, Bayesian, and Optuna-based tuning
- **Feature Selection**: Univariate, RFE, and embedded methods
- **Ensemble Methods**: Voting classifiers and advanced ensembles
- **Performance Optimization**: Cross-validation and learning curve analysis

---

## 📊 Supported Algorithms

### Classification & Regression

- **Traditional**: Random Forest, Gradient Boosting, SVM, KNN, Naive Bayes
- **Advanced**: XGBoost, LightGBM, CatBoost, Neural Networks
- **Ensemble**: Voting Classifiers, Stacking, Bagging
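
A hard-voting ensemble combines models by taking the most common predicted label for each sample. The sketch below shows that rule on hand-written predictions; `majority_vote` is a hypothetical helper, not a DSKit function.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per sample by majority."""
    # zip(*predictions) regroups the labels sample by sample.
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*predictions)]


model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
print(majority_vote([model_a, model_b, model_c]))  # [1, 0, 1, 1]
```

Using an odd number of models avoids ties in binary problems; with ties, this sketch falls back to the label that appears first.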

### Preprocessing

- **Scaling**: Standard, MinMax, Robust, Quantile
- **Encoding**: Label, One-Hot, Target, Binary
- **Imputation**: Mean, Median, Mode, KNN, Iterative
- **Feature Selection**: SelectKBest, RFE, RFECV, Embedded
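
The scaling and encoding options above reduce to short formulas. As a rough illustration (hypothetical helper names, plain lists instead of DataFrames), standard scaling subtracts the mean and divides by the population standard deviation, min-max scaling maps values onto [0, 1], and one-hot encoding emits one indicator column per category.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Rescale linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encode categories as 0/1 indicator rows, one column per category."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]


print(min_max_scale([10, 20, 40]))      # [0.0, 0.3333333333333333, 1.0]
print(one_hot(["red", "blue", "red"]))  # [[0, 1], [1, 0], [0, 1]]
```

Both scalers divide by zero on a constant column, so a real pipeline needs to handle that case explicitly.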

---

## 🔧 Configuration

DSKit supports flexible configuration:

```python
# Global configuration
from dskit.config import set_config
set_config({
    'visualization_backend': 'plotly',  # or 'matplotlib'
    'auto_save_plots': True,
    'default_test_size': 0.2,
    'random_state': 42,
    'n_jobs': -1
})

# Method-specific parameters
kit.auto_tune(method="optuna", max_evals=100, timeout=3600)
kit.comprehensive_eda(sample_size=5000, include_correlations=True)
```
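
A global configuration of this kind is typically a dictionary of defaults that overrides are merged into. The sketch below shows one way such a registry could work; it is an assumption for illustration only, not the contents of `dskit.config`. The default values and the `get_config` helper are invented here, and rejecting unknown keys is a design choice rather than documented DSKit behavior.

```python
# Hypothetical defaults; the real values used by dskit.config are not documented here.
_DEFAULTS = {
    "visualization_backend": "matplotlib",
    "auto_save_plots": False,
    "default_test_size": 0.2,
    "random_state": 42,
    "n_jobs": -1,
}
_config = dict(_DEFAULTS)


def set_config(overrides):
    """Merge overrides into the global config, rejecting unknown keys."""
    unknown = set(overrides) - set(_DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    _config.update(overrides)


def get_config():
    """Return a copy so callers cannot mutate the global state by accident."""
    return dict(_config)


set_config({"visualization_backend": "plotly", "auto_save_plots": True})
print(get_config()["visualization_backend"])  # plotly
```

Rejecting unknown keys catches typos such as `n_job` early instead of silently ignoring them.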

---

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.

### Development Setup

```bash
git clone https://github.com/Programmers-Paradise/imputeKit.git
cd imputeKit
pip install -e ".[dev,full]"
pre-commit install
```

### Running Tests

```bash
pytest tests/ --cov=dskit --cov-report=html
```

---

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## 🙏 Acknowledgments

- Built on top of excellent libraries: pandas, scikit-learn, matplotlib, seaborn, plotly
- Inspired by the need for simplified data science workflows
- Community-driven development with contributions from data scientists worldwide

---

## 📞 Support

- 📚 Documentation: [dskit.readthedocs.io](https://dskit.readthedocs.io)
- 🐛 Issues: [GitHub Issues](https://github.com/Programmers-Paradise/imputeKit/issues)
- 💬 Discussions: [GitHub Discussions](https://github.com/Programmers-Paradise/imputeKit/discussions)
- 📧 Email: support@dskit.dev

---

**DSKit - Making Data Science Simple, Comprehensive, and Accessible! 🚀**
