DataLib Modules
- class src.visualization.Visualizations
Bases: object
A class used to create visualizations for data analysis.
Methods
- plot_histogram(df, column, bins=10, title=None, xlabel=None, ylabel=None)
Plots a histogram for a specified column in the DataFrame.
- plot_scatter(df, x, y, hue=None, title=None, xlabel=None, ylabel=None)
Plots a scatter plot for two specified columns in the DataFrame.
- plot_boxplot(df, column, by=None, title=None, xlabel=None, ylabel=None)
Plots a boxplot for a specified column in the DataFrame.
- plot_heatmap(df, title=None, xlabel=None, ylabel=None)
Plots a heatmap of the correlation matrix for the DataFrame.
- plot_line(df, x, y, title=None, xlabel=None, ylabel=None)
Plots a line chart for two specified columns in the DataFrame.
- plot_bar(df, x, y, title=None, xlabel=None, ylabel=None)
Plots a bar chart for two specified columns in the DataFrame.
- plot_pie(df, column, title=None)
Plots a pie chart for a specified column in the DataFrame.
- plot_pairplot(df, hue=None)
Plots a pairplot for the DataFrame.
- plot_violin(df, x, y, hue=None, title=None, xlabel=None, ylabel=None)
Plots a violin plot for the DataFrame.
- static plot_bar(df, x, y, title=None, xlabel=None, ylabel=None)
Plots a bar chart for two specified columns in the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- x : str
The column for the x-axis.
- y : str
The column for the y-axis.
- title : str, optional
The title of the plot.
- xlabel : str, optional
The label for the x-axis.
- ylabel : str, optional
The label for the y-axis.
Returns
None
- static plot_boxplot(df, column, by=None, title=None, xlabel=None, ylabel=None)
Plots a boxplot for a specified column in the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- column : str
The column to plot.
- by : str, optional
The column to group by.
- title : str, optional
The title of the plot.
- xlabel : str, optional
The label for the x-axis.
- ylabel : str, optional
The label for the y-axis.
Returns
None
- static plot_heatmap(df, title=None, xlabel=None, ylabel=None)
Plots a heatmap of the correlation matrix for the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- title : str, optional
The title of the plot.
- xlabel : str, optional
The label for the x-axis.
- ylabel : str, optional
The label for the y-axis.
Returns
None
- static plot_histogram(df, column, bins=10, title=None, xlabel=None, ylabel=None)
Plots a histogram for a specified column in the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- column : str
The column to plot.
- bins : int, optional
The number of bins for the histogram (default is 10).
- title : str, optional
The title of the plot.
- xlabel : str, optional
The label for the x-axis.
- ylabel : str, optional
The label for the y-axis.
Returns
None
- static plot_line(df, x, y, title=None, xlabel=None, ylabel=None)
Plots a line chart for two specified columns in the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- x : str
The column for the x-axis.
- y : str
The column for the y-axis.
- title : str, optional
The title of the plot.
- xlabel : str, optional
The label for the x-axis.
- ylabel : str, optional
The label for the y-axis.
Returns
None
- static plot_pairplot(df, hue=None)
Plots a pairplot for the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- hue : str, optional
The column to use for color encoding.
Returns
None
- static plot_pie(df, column, title=None)
Plots a pie chart for a specified column in the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- column : str
The column to plot.
- title : str, optional
The title of the plot.
Returns
None
- static plot_scatter(df, x, y, hue=None, title=None, xlabel=None, ylabel=None)
Plots a scatter plot for two specified columns in the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- x : str
The column for the x-axis.
- y : str
The column for the y-axis.
- hue : str, optional
The column to use for color encoding.
- title : str, optional
The title of the plot.
- xlabel : str, optional
The label for the x-axis.
- ylabel : str, optional
The label for the y-axis.
Returns
None
- static plot_violin(df, x, y, hue=None, title=None, xlabel=None, ylabel=None)
Plots a violin plot for the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- x : str
The column for the x-axis.
- y : str
The column for the y-axis.
- hue : str, optional
The column to use for color encoding.
- title : str, optional
The title of the plot.
- xlabel : str, optional
The label for the x-axis.
- ylabel : str, optional
The label for the y-axis.
Returns
None
- class src.utils.Utils
Bases: object
A class containing general utility functions for data operations.
Methods
- calculate_mean(df, column)
Calculates the mean of a specified column in the DataFrame.
- calculate_sum(df, column)
Calculates the sum of a specified column in the DataFrame.
- calculate_max(df, column)
Calculates the maximum value of a specified column in the DataFrame.
- calculate_min(df, column)
Calculates the minimum value of a specified column in the DataFrame.
- calculate_std(df, column)
Calculates the standard deviation of a specified column in the DataFrame.
- normalize_array(arr)
Normalizes a numpy array.
- calculate_median(df, column)
Calculates the median of a specified column in the DataFrame.
- calculate_variance(df, column)
Calculates the variance of a specified column in the DataFrame.
- calculate_mode(df, column)
Calculates the mode of a specified column in the DataFrame.
- calculate_iqr(df, column)
Calculates the interquartile range (IQR) of a specified column in the DataFrame.
- static calculate_iqr(df, column)
Calculates the interquartile range (IQR) of a specified column in the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- column : str
The column to calculate the IQR for.
Returns
- float
The IQR of the specified column.
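The IQR is the distance between the third and first quartiles. The following is a minimal sketch of that computation with pandas, not DataLib's actual implementation; the helper name `iqr` and the sample data are hypothetical, and pandas' default linear interpolation between data points is assumed.

```python
import pandas as pd


def iqr(df: pd.DataFrame, column: str) -> float:
    """Return Q3 - Q1 for one numeric column (hypothetical helper)."""
    q1 = df[column].quantile(0.25)  # first quartile
    q3 = df[column].quantile(0.75)  # third quartile
    return float(q3 - q1)


df = pd.DataFrame({"value": [1, 2, 3, 4, 5, 6, 7, 8]})
result = iqr(df, "value")
print(result)
```

Other quantile interpolation rules give slightly different quartiles on small samples, so results may differ from tools that use another convention.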
- static calculate_max(df, column)
Calculates the maximum value of a specified column in the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- column : str
The column to calculate the maximum value for.
Returns
- float
The maximum value of the specified column.
- static calculate_mean(df, column)
Calculates the mean of a specified column in the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- column : str
The column to calculate the mean for.
Returns
- float
The mean of the specified column.
- static calculate_median(df, column)
Calculates the median of a specified column in the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- column : str
The column to calculate the median for.
Returns
- float
The median of the specified column.
- static calculate_min(df, column)
Calculates the minimum value of a specified column in the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- column : str
The column to calculate the minimum value for.
Returns
- float
The minimum value of the specified column.
- static calculate_mode(df, column)
Calculates the mode of a specified column in the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- column : str
The column to calculate the mode for.
Returns
- float
The mode of the specified column.
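A column can have several equally frequent values, so reducing the mode to one number requires a tie-breaking rule. The sketch below shows one possible rule with pandas, picking the smallest of the tied values. The helper `first_mode` is hypothetical, and whether DataLib resolves ties this way is an assumption.

```python
import pandas as pd


def first_mode(df: pd.DataFrame, column: str):
    """Return the smallest of the most frequent values (hypothetical helper)."""
    modes = df[column].mode()  # Series of all modes, in sorted order
    return modes.iloc[0]


df = pd.DataFrame({"score": [3, 1, 3, 2, 1]})
all_modes = df["score"].mode().tolist()  # 1 and 3 both occur twice
single = first_mode(df, "score")
print(all_modes, single)
```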
- static calculate_std(df, column)
Calculates the standard deviation of a specified column in the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- column : str
The column to calculate the standard deviation for.
Returns
- float
The standard deviation of the specified column.
- static calculate_sum(df, column)
Calculates the sum of a specified column in the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- column : str
The column to calculate the sum for.
Returns
- float
The sum of the specified column.
- class src.statistics.Statistics
Bases: object
A class used to perform statistical analysis on data.
Methods
- describe_data(df)
Provides descriptive statistics for the DataFrame.
- correlation_matrix(df)
Computes the correlation matrix for the DataFrame.
- t_test(sample1, sample2)
Performs a t-test to compare the means of two samples.
- chi_square_test(observed, expected)
Performs a chi-square test to compare observed and expected frequencies.
- anova(*samples)
Performs a one-way ANOVA test to compare the means of multiple samples.
- linear_regression(x, y)
Performs a linear regression analysis.
- z_score(df, column)
Computes the z-scores for a column in the DataFrame.
- moving_average(df, column, window)
Computes the moving average for a column in the DataFrame.
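As an illustration of the `z_score(df, column)` entry listed above, a z-score rescales each value to its distance from the column mean in units of standard deviation. This is a minimal sketch only: the helper `z_scores` and its `ddof` parameter are hypothetical, and whether DataLib uses the population (`ddof=0`) or sample (`ddof=1`) standard deviation is not stated in this reference.

```python
import pandas as pd


def z_scores(df: pd.DataFrame, column: str, ddof: int = 0) -> pd.Series:
    """Standardize one column: (value - mean) / std (hypothetical helper)."""
    s = df[column]
    return (s - s.mean()) / s.std(ddof=ddof)


df = pd.DataFrame({"x": [2.0, 4.0, 6.0]})
z = z_scores(df, "x")
print(z.tolist())
```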
- static anova(*samples)
Performs a one-way ANOVA test to compare the means of multiple samples.
Parameters
- *samples : array-like
The samples to compare.
Returns
- tuple
The F-statistic and the p-value.
- static chi_square_test(observed, expected)
Performs a chi-square test to compare observed and expected frequencies.
Parameters
- observed : array-like
The observed frequencies.
- expected : array-like
The expected frequencies.
Returns
- tuple
The chi-square statistic and the p-value.
- static correlation_matrix(df)
Computes the correlation matrix for the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame to compute the correlation matrix for.
Returns
- pandas.DataFrame
The correlation matrix of the DataFrame.
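For orientation, this is what a correlation matrix looks like when computed directly with pandas. The sketch assumes Pearson correlation over numeric columns, which is the pandas default; the reference does not say which method DataLib uses, and the sample data is made up.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "b": [2, 4, 6, 8],  # perfectly correlated with "a"
        "c": [4, 3, 2, 1],  # perfectly anti-correlated with "a"
    }
)

# Square DataFrame indexed by column name on both axes.
corr = df.corr()
print(corr)
```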
- static describe_data(df)
Provides descriptive statistics for the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame to describe.
Returns
- pandas.DataFrame
The descriptive statistics of the DataFrame.
- static linear_regression(x, y)
Performs a linear regression analysis.
Parameters
- x : array-like
The independent variable.
- y : array-like
The dependent variable.
Returns
- tuple
The slope, intercept, r-value, p-value, and standard error of the estimate.
- static moving_average(df, column, window)
Computes the moving average for a column in the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame containing the column.
- column : str
The column to compute the moving average for.
- window : int
The window size for the moving average.
Returns
- pandas.Series
The moving average of the column.
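A moving average replaces each value with the mean of the last `window` observations. The sketch below assumes a simple trailing window built with pandas' `rolling`; the helper name is hypothetical, and DataLib may treat the leading positions differently.

```python
import pandas as pd


def moving_average_sketch(df: pd.DataFrame, column: str, window: int) -> pd.Series:
    """Trailing simple moving average (hypothetical helper)."""
    return df[column].rolling(window=window).mean()


df = pd.DataFrame({"price": [1.0, 2.0, 3.0, 4.0, 5.0]})
ma = moving_average_sketch(df, "price", window=3)
print(ma.tolist())
```

With this approach the first `window - 1` entries are NaN because the window is not yet full.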
- class src.analysis.exploratory_analysis.ExploratoryAnalysis
Bases: object
A class used to perform exploratory data analysis (EDA) on data.
Methods
- load_and_clean_data(source, cleaning_strategy='drop')
Loads and cleans the data from a specified source.
- describe_and_visualize_data(df)
Provides descriptive statistics and visualizations for the DataFrame.
- analyze_correlations(df)
Analyzes and visualizes correlations in the DataFrame.
- perform_statistical_tests(df, column1, column2)
Performs statistical tests between two columns in the DataFrame.
- visualize_distributions(df, columns)
Visualizes the distributions of specified columns in the DataFrame.
- static analyze_correlations(df)
Analyzes and visualizes correlations in the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame to analyze.
Returns
None
- static describe_and_visualize_data(df)
Provides descriptive statistics and visualizations for the DataFrame.
Parameters
- df : pandas.DataFrame
The DataFrame to analyze.
Returns
None
- static load_and_clean_data(source, cleaning_strategy='drop')
Loads and cleans the data from a specified source.
Parameters
- source : str or pandas.DataFrame
The source of the data. Can be a URL, file path, or DataFrame object.
- cleaning_strategy : str, optional
The strategy to use for cleaning ('drop', 'fill_mean', 'fill_median').
Returns
- pandas.DataFrame
The cleaned DataFrame.
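The three strategy names suggest how missing values are handled. The sketch below shows one plausible reading with pandas: 'drop' removes rows containing missing values, while 'fill_mean' and 'fill_median' impute per-column statistics. The helper `clean` is hypothetical, covers only the cleaning step (not loading from a URL or path), and may not match DataLib's behavior in detail.

```python
import pandas as pd


def clean(df: pd.DataFrame, cleaning_strategy: str = "drop") -> pd.DataFrame:
    """Handle missing values with one of three strategies (hypothetical helper)."""
    if cleaning_strategy == "drop":
        return df.dropna()  # remove every row with a missing value
    if cleaning_strategy == "fill_mean":
        return df.fillna(df.mean(numeric_only=True))  # per-column mean
    if cleaning_strategy == "fill_median":
        return df.fillna(df.median(numeric_only=True))  # per-column median
    raise ValueError(f"Unknown cleaning strategy: {cleaning_strategy!r}")


df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [10.0, 20.0, None]})
dropped = clean(df, "drop")
filled = clean(df, "fill_mean")
print(dropped)
print(filled)
```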
- class src.analysis.advanced_analysis.AdvancedAnalysis
Bases: object
A class used to perform advanced analysis on data.
Methods
- preprocess_data(df, target_column, test_size=0.2, scale_data=True)
Preprocesses the data by splitting into training and testing sets and scaling if required.
- perform_pca(df, n_components=2)
Performs Principal Component Analysis (PCA) on the data.
- perform_tsne(df, n_components=2, perplexity=30.0, n_iter=1000)
Performs t-Distributed Stochastic Neighbor Embedding (t-SNE) on the data.
- logistic_regression(df, target_column)
Performs logistic regression on the data.
- decision_tree(df, target_column)
Performs decision tree classification on the data.
- random_forest(df, target_column)
Performs random forest classification on the data.
- linear_regression(df, target_column)
Performs linear regression on the data.
- ridge_regression(df, target_column, alpha=1.0)
Performs ridge regression on the data.
- lasso_regression(df, target_column, alpha=1.0)
Performs lasso regression on the data.
- decision_tree_regression(df, target_column)
Performs decision tree regression on the data.
- random_forest_regression(df, target_column)
Performs random forest regression on the data.
- static decision_tree(df, target_column)
Performs decision tree classification on the data.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- target_column : str
The target column for classification.
Returns
- dict
The results of the decision tree classification including accuracy, classification report, and confusion matrix.
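To make the reported classification metrics concrete, the sketch below computes accuracy and a confusion matrix from true and predicted labels using NumPy only. The function name and the dictionary keys `accuracy` and `confusion_matrix` are assumptions for illustration; the actual keys of DataLib's result dict are not documented here.

```python
import numpy as np


def classification_summary(y_true, y_pred) -> dict:
    """Accuracy and confusion matrix (rows: true class, columns: predicted)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))
    index = {label: i for i, label in enumerate(labels.tolist())}

    matrix = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true.tolist(), y_pred.tolist()):
        matrix[index[t], index[p]] += 1

    # Correct predictions sit on the diagonal.
    accuracy = float(np.trace(matrix) / matrix.sum())
    return {"accuracy": accuracy, "confusion_matrix": matrix}


summary = classification_summary([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
print(summary["accuracy"])
print(summary["confusion_matrix"])
```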
- static decision_tree_regression(df, target_column)
Performs decision tree regression on the data.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- target_column : str
The target column for regression.
Returns
- dict
The results of the decision tree regression including mean squared error and R-squared score.
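The two regression metrics mentioned here follow standard definitions: mean squared error is the average squared residual, and R-squared is one minus the ratio of residual to total sum of squares. A minimal NumPy sketch follows; the function name and the keys `mse` and `r2` are hypothetical, not DataLib's documented output.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Mean squared error and R-squared for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mse = float(np.mean(residuals ** 2))
    ss_res = float(np.sum(residuals ** 2))                    # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))     # total variation
    return {"mse": mse, "r2": 1.0 - ss_res / ss_tot}


metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
print(metrics)
```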
- static lasso_regression(df, target_column, alpha=1.0)
Performs lasso regression on the data.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- target_column : str
The target column for regression.
- alpha : float, optional
Regularization strength (default is 1.0).
Returns
- dict
The results of the lasso regression including mean squared error and R-squared score.
- static linear_regression(df, target_column)
Performs linear regression on the data.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- target_column : str
The target column for regression.
Returns
- dict
The results of the linear regression including mean squared error and R-squared score.
- static logistic_regression(df, target_column)
Performs logistic regression on the data.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- target_column : str
The target column for classification.
Returns
- dict
The results of the logistic regression including accuracy, classification report, and confusion matrix.
- static perform_pca(df, n_components=2)
Performs Principal Component Analysis (PCA) on the data.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- n_components : int, optional
The number of components to keep (default is 2).
Returns
- pandas.DataFrame
The DataFrame with the principal components.
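PCA projects centered data onto the directions of greatest variance. The sketch below does this with a singular value decomposition in NumPy; it is an illustration of the technique, not DataLib's code, which may rely on another library and may scale the features first. The helper name and the `PC1`, `PC2` column labels are hypothetical.

```python
import numpy as np
import pandas as pd


def pca_sketch(df: pd.DataFrame, n_components: int = 2) -> pd.DataFrame:
    """Project numeric columns onto their leading principal axes."""
    X = df.to_numpy(dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA requires centered data

    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = X_centered @ vt[:n_components].T

    columns = [f"PC{i + 1}" for i in range(n_components)]
    return pd.DataFrame(scores, columns=columns, index=df.index)


df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0],
        "y": [2.0, 4.1, 5.9, 8.0],
        "z": [0.5, 0.4, 0.6, 0.5],
    }
)
components = pca_sketch(df, n_components=2)
print(components)
```

The sign of each component is arbitrary, so two implementations can return mirrored columns that describe the same projection.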
- static perform_tsne(df, n_components=2, perplexity=30.0, n_iter=1000)
Performs t-Distributed Stochastic Neighbor Embedding (t-SNE) on the data.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- n_components : int, optional
The number of components to keep (default is 2).
- perplexity : float, optional
The perplexity parameter for t-SNE (default is 30.0).
- n_iter : int, optional
The number of iterations for optimization (default is 1000).
Returns
- pandas.DataFrame
The DataFrame with the t-SNE components.
- static preprocess_data(df, target_column, test_size=0.2, scale_data=True)
Preprocesses the data by splitting into training and testing sets and scaling if required.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- target_column : str
The target column for analysis.
- test_size : float, optional
The proportion of the dataset to include in the test split (default is 0.2).
- scale_data : bool, optional
Whether to scale the data (default is True).
Returns
- tuple
The training and testing sets (X_train, X_test, y_train, y_test).
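The split-then-scale order matters: scaling statistics should come from the training rows only, so that no information from the test set leaks into training. The sketch below shows this with pandas alone. The helper `split_and_scale`, its `seed` parameter, and the choice of standardization (zero mean, unit variance) are assumptions; DataLib's actual scaler and shuffling are not described in this reference.

```python
import pandas as pd


def split_and_scale(df, target_column, test_size=0.2, scale_data=True, seed=0):
    """Shuffle, split into train/test, then standardize features (hypothetical)."""
    shuffled = df.sample(frac=1.0, random_state=seed)
    n_test = max(1, int(round(len(shuffled) * test_size)))
    test, train = shuffled.iloc[:n_test], shuffled.iloc[n_test:]

    X_train, y_train = train.drop(columns=[target_column]), train[target_column]
    X_test, y_test = test.drop(columns=[target_column]), test[target_column]

    if scale_data:
        # Statistics come from the training rows only.
        mean, std = X_train.mean(), X_train.std(ddof=0)
        X_train = (X_train - mean) / std
        X_test = (X_test - mean) / std
    return X_train, X_test, y_train, y_test


df = pd.DataFrame(
    {
        "f1": range(10),
        "f2": [v * 2.0 + 1 for v in range(10)],
        "label": [0, 1] * 5,
    }
)
X_train, X_test, y_train, y_test = split_and_scale(df, "label")
print(X_train.shape, X_test.shape)
```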
- static random_forest(df, target_column)
Performs random forest classification on the data.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- target_column : str
The target column for classification.
Returns
- dict
The results of the random forest classification including accuracy, classification report, and confusion matrix.
- static random_forest_regression(df, target_column)
Performs random forest regression on the data.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- target_column : str
The target column for regression.
Returns
- dict
The results of the random forest regression including mean squared error and R-squared score.
- static ridge_regression(df, target_column, alpha=1.0)
Performs ridge regression on the data.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- target_column : str
The target column for regression.
- alpha : float, optional
Regularization strength (default is 1.0).
Returns
- dict
The results of the ridge regression including mean squared error and R-squared score.
- class src.models.classification.Classification
Bases: object
A class used to perform classification and clustering on data.
Methods
- preprocess_data(df, target_column, test_size=0.2, scale_data=True)
Preprocesses the data by splitting into training and testing sets and scaling if required.
- logistic_regression(X_train, y_train, X_test, y_test)
Performs logistic regression on the data.
- decision_tree(X_train, y_train, X_test, y_test)
Performs decision tree classification on the data.
- random_forest(X_train, y_train, X_test, y_test)
Performs random forest classification on the data.
- kmeans_clustering(df, n_clusters)
Performs KMeans clustering on the data.
- static decision_tree(X_train, y_train, X_test, y_test)
Performs decision tree classification on the data.
Parameters
- X_train : array-like
The training data.
- y_train : array-like
The training labels.
- X_test : array-like
The testing data.
- y_test : array-like
The testing labels.
Returns
- dict
The results of the decision tree classification including accuracy, classification report, and confusion matrix.
- static kmeans_clustering(df, n_clusters)
Performs KMeans clustering on the data.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- n_clusters : int
The number of clusters to form.
Returns
- dict
The results of the KMeans clustering including cluster centers and labels.
- static logistic_regression(X_train, y_train, X_test, y_test)
Performs logistic regression on the data.
Parameters
- X_train : array-like
The training data.
- y_train : array-like
The training labels.
- X_test : array-like
The testing data.
- y_test : array-like
The testing labels.
Returns
- dict
The results of the logistic regression including accuracy, classification report, and confusion matrix.
- static preprocess_data(df, target_column, test_size=0.2, scale_data=True)
Preprocesses the data by splitting into training and testing sets and scaling if required.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- target_column : str
The target column for classification.
- test_size : float, optional
The proportion of the dataset to include in the test split (default is 0.2).
- scale_data : bool, optional
Whether to scale the data (default is True).
Returns
- tuple
The training and testing sets (X_train, X_test, y_train, y_test).
- static random_forest(X_train, y_train, X_test, y_test)
Performs random forest classification on the data.
Parameters
- X_train : array-like
The training data.
- y_train : array-like
The training labels.
- X_test : array-like
The testing data.
- y_test : array-like
The testing labels.
Returns
- dict
The results of the random forest classification including accuracy, classification report, and confusion matrix.
- class src.models.regression.Regression
Bases: object
A class used to perform regression analysis on data.
Methods
- preprocess_data(df, target_column, test_size=0.2, scale_data=True)
Preprocesses the data by splitting into training and testing sets and scaling if required.
- linear_regression(X_train, y_train, X_test, y_test)
Performs linear regression on the data.
- ridge_regression(X_train, y_train, X_test, y_test, alpha=1.0)
Performs ridge regression on the data.
- lasso_regression(X_train, y_train, X_test, y_test, alpha=1.0)
Performs lasso regression on the data.
- decision_tree_regression(X_train, y_train, X_test, y_test)
Performs decision tree regression on the data.
- random_forest_regression(X_train, y_train, X_test, y_test)
Performs random forest regression on the data.
- static decision_tree_regression(X_train, y_train, X_test, y_test)
Performs decision tree regression on the data.
Parameters
- X_train : array-like
The training data.
- y_train : array-like
The training target values.
- X_test : array-like
The testing data.
- y_test : array-like
The testing target values.
Returns
- dict
The results of the decision tree regression including mean squared error and R-squared score.
- static lasso_regression(X_train, y_train, X_test, y_test, alpha=1.0)
Performs lasso regression on the data.
Parameters
- X_train : array-like
The training data.
- y_train : array-like
The training target values.
- X_test : array-like
The testing data.
- y_test : array-like
The testing target values.
- alpha : float, optional
Regularization strength (default is 1.0).
Returns
- dict
The results of the lasso regression including mean squared error and R-squared score.
- static linear_regression(X_train, y_train, X_test, y_test)
Performs linear regression on the data.
Parameters
- X_train : array-like
The training data.
- y_train : array-like
The training target values.
- X_test : array-like
The testing data.
- y_test : array-like
The testing target values.
Returns
- dict
The results of the linear regression including mean squared error and R-squared score.
- static preprocess_data(df, target_column, test_size=0.2, scale_data=True)
Preprocesses the data by splitting into training and testing sets and scaling if required.
Parameters
- df : pandas.DataFrame
The DataFrame containing the data.
- target_column : str
The target column for regression.
- test_size : float, optional
The proportion of the dataset to include in the test split (default is 0.2).
- scale_data : bool, optional
Whether to scale the data (default is True).
Returns
- tuple
The training and testing sets (X_train, X_test, y_train, y_test).
- static random_forest_regression(X_train, y_train, X_test, y_test)
Performs random forest regression on the data.
Parameters
- X_train : array-like
The training data.
- y_train : array-like
The training target values.
- X_test : array-like
The testing data.
- y_test : array-like
The testing target values.
Returns
- dict
The results of the random forest regression including mean squared error and R-squared score.
- static ridge_regression(X_train, y_train, X_test, y_test, alpha=1.0)
Performs ridge regression on the data.
Parameters
- X_train : array-like
The training data.
- y_train : array-like
The training target values.
- X_test : array-like
The testing data.
- y_test : array-like
The testing target values.
- alpha : float, optional
Regularization strength (default is 1.0).
Returns
- dict
The results of the ridge regression including mean squared error and R-squared score.