Metadata-Version: 2.1
Name: future_sales_prediction_2024
Version: 2.2.6
Summary: A package for feature extraction, hyperopt, and validation schemas
Author: Polina Yatsko
Author-email: yatsko_polina1@mail.ru
License: MIT
Keywords: machine-learning xgboost hyperopt data-science regression
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7,<3.13
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: scikit-learn
Requires-Dist: lightgbm
Requires-Dist: xgboost
Requires-Dist: matplotlib
Requires-Dist: seaborn
Requires-Dist: hyperopt
Requires-Dist: shap
Requires-Dist: dvc
Requires-Dist: dvc-gs
Requires-Dist: dvc[gs]
Requires-Dist: dvc[gcs]
Requires-Dist: google-auth
Requires-Dist: google-auth-oauthlib
Requires-Dist: google-cloud-storage
Requires-Dist: argparse
Requires-Dist: gcsfs

# Data Science: Sales Prediction

#### Project Status: Completed

## Overview:
This project provides a Python package, future_sales_prediction_2024, that simplifies common tasks in data science workflows. It includes tools for feature extraction, validation schema creation, hyperparameter optimization, and model training.


### Methods Used
* Feature Engineering: Automates the creation and selection of important features, including memory optimization.
* Validation: Implements schema validation to ensure data consistency, identify missing values, and prevent duplicate records.
* Hyperparameter Tuning: Leverages tools like hyperopt for efficient parameter search.
* Visualization: Includes plotting tools for feature importance and error analysis.
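
As a rough illustration of the feature-engineering and validation bullets above, the sketch below downcasts numeric columns to save memory and reports missing values and duplicate keys using plain pandas. It is not the package's own API: the function names downcast_numeric and validate_frame, and the columns shop_id, item_id, and item_cnt_day, are assumptions made for this example.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose numeric columns use the smallest dtype that fits."""
    out = df.copy()
    for col in out.columns:
        if is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


def validate_frame(df: pd.DataFrame, key_columns: list[str]) -> dict:
    """Report columns with missing values and the number of duplicated keys."""
    missing = df.isna().sum()
    return {
        "missing_by_column": missing[missing > 0].to_dict(),
        "duplicate_rows": int(df.duplicated(subset=key_columns).sum()),
    }


# Hypothetical sales data: one repeated (shop_id, item_id) pair, one missing count.
sales = pd.DataFrame(
    {
        "shop_id": [1, 1, 2, 2],
        "item_id": [10, 10, 11, 12],
        "item_cnt_day": [1.0, 1.0, None, 3.0],
    }
)

report = validate_frame(sales, key_columns=["shop_id", "item_id"])
compact = downcast_numeric(sales)

print(report)
print(compact.dtypes)
```

Returning a report instead of raising lets the caller decide whether a missing value or a duplicate key should stop the pipeline or only be logged.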

### Technologies
* Python
* pandas, NumPy, scikit-learn
* DVC, Google Cloud Storage (gcloud)


## Features
- Automatically fetches raw and preprocessed data from Google Cloud Storage after installation.
- Data Version Control (DVC) integration. To use it, run the following in a terminal:

      git clone -b DS-4.1 https://github.com/YPolina/Trainee.git
      cd /{path}/Trainee
      dvc init
      dvc pull

- Provides an easy interface for working with datasets for sales forecasting.

### Challenges:
* Complexity in Generalization: Making the tools generic enough to work with diverse datasets while maintaining simplicity.
* Performance Optimization: Balancing ease of use with computational efficiency.
* Error Handling: Ensuring clear and helpful error messages for data validation and model failures.

### Conclusion:
This package is a modular and flexible solution for streamlining data science workflows. It provides data scientists and ML engineers with reusable tools to focus on solving domain-specific problems.

## [0.1.1] - 2024-11-25
### Added
- Changes in loader function: upload files using filenames.

## [0.2.1] - 2024-11-26
- Added support for Google Cloud Storage.
- Improved deployment pipeline.
- Bug fixes and performance improvements.

## [0.2.2] - 2024-11-27
- Bug fixes.

## [0.2.3] - 2024-11-28
- Enhanced Explainability and Error Analysis
    - Users can now save plots generated by the Explainability and ErrorAnalysis classes to files.
    - The directory and filenames are customizable, and existing files with the same name are overwritten automatically.
- Customizable Hyperparameter Tuning. Users can now fully customize the hyperparameter tuning process:
    - Define the search space for hyperparameters.
    - Specify the optimization algorithm and objective function.
    - Tailor the evaluation process to their needs.
- FeatureImportanceLayer Enhancements
    - Plots for baseline and final model feature importances can now be saved directly to disk.
    - Customizable output directory (output_dir) and file names.
    - Plots overwrite existing files with the same name.

## [0.2.4] - 2024-11-29
- Bug fixes.

## [1.2.4] - 2024-11-29
- Cloud Storage Integration
    - The data_handling.py and feature_extraction.py scripts now support loading .csv files from GCS paths. Outputs are saved to a user-specified GCS directory via the --outdir parameter.
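
A command-line interface of this kind could look roughly like the sketch below, written with argparse. Only the --outdir parameter comes from the entry above. The --input flag, the helper names, and the example-bucket paths are hypothetical, and the real scripts may be organised differently.

```python
import argparse


def is_gcs_path(path: str) -> bool:
    """Return True for Google Cloud Storage URIs of the form gs://bucket/..."""
    return path.startswith("gs://")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Illustrative CLI sketch, not the package's actual interface."
    )
    # Hypothetical flag: the changelog does not name the input option.
    parser.add_argument("--input", required=True, help="Local or gs:// path to a .csv file")
    # --outdir is the one parameter named in the changelog.
    parser.add_argument("--outdir", required=True, help="Local or gs:// directory for outputs")
    return parser


def output_path(outdir: str, filename: str) -> str:
    """Join outdir and filename with exactly one slash (also valid for gs:// URIs)."""
    return outdir.rstrip("/") + "/" + filename


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    [
        "--input", "gs://example-bucket/raw/sales.csv",
        "--outdir", "gs://example-bucket/processed/",
    ]
)
target = output_path(args.outdir, "features.csv")
print(is_gcs_path(args.input), target)
```

With gcsfs installed (it is listed among the package requirements), pandas can usually read and write gs:// paths directly, so a path built this way could be passed to pandas.read_csv or DataFrame.to_csv.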

## [1.2.5, 1.2.6] - 2024-11-29
- Bug fixes.

## [2.2.6] - 2024-11-29
- Automatically fetches raw and preprocessed data from Google Cloud Storage after installation.



