DataLib Modules

class src.visualization.Visualizations

Bases: object

A class used to create visualizations for data analysis.

Methods

plot_histogram(df, column, bins=10, title=None, xlabel=None, ylabel=None)

Plots a histogram for a specified column in the DataFrame.

plot_scatter(df, x, y, hue=None, title=None, xlabel=None, ylabel=None)

Plots a scatter plot for two specified columns in the DataFrame.

plot_boxplot(df, column, by=None, title=None, xlabel=None, ylabel=None)

Plots a boxplot for a specified column in the DataFrame.

plot_heatmap(df, title=None, xlabel=None, ylabel=None)

Plots a heatmap of the correlation matrix for the DataFrame.

plot_line(df, x, y, title=None, xlabel=None, ylabel=None)

Plots a line chart for two specified columns in the DataFrame.

plot_bar(df, x, y, title=None, xlabel=None, ylabel=None)

Plots a bar chart for two specified columns in the DataFrame.

plot_pie(df, column, title=None)

Plots a pie chart for a specified column in the DataFrame.

plot_pairplot(df, hue=None)

Plots a pairplot for the DataFrame.

plot_violin(df, x, y, hue=None, title=None, xlabel=None, ylabel=None)

Plots a violin plot for the DataFrame.

static plot_bar(df, x, y, title=None, xlabel=None, ylabel=None)

Plots a bar chart for two specified columns in the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

x : str

The column for the x-axis.

y : str

The column for the y-axis.

title : str, optional

The title of the plot.

xlabel : str, optional

The label for the x-axis.

ylabel : str, optional

The label for the y-axis.

Returns

None

static plot_boxplot(df, column, by=None, title=None, xlabel=None, ylabel=None)

Plots a boxplot for a specified column in the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

column : str

The column to plot.

by : str, optional

The column to group by.

title : str, optional

The title of the plot.

xlabel : str, optional

The label for the x-axis.

ylabel : str, optional

The label for the y-axis.

Returns

None

static plot_heatmap(df, title=None, xlabel=None, ylabel=None)

Plots a heatmap of the correlation matrix for the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

title : str, optional

The title of the plot.

xlabel : str, optional

The label for the x-axis.

ylabel : str, optional

The label for the y-axis.

Returns

None

static plot_histogram(df, column, bins=10, title=None, xlabel=None, ylabel=None)

Plots a histogram for a specified column in the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

column : str

The column to plot.

bins : int, optional

The number of bins for the histogram (default is 10).

title : str, optional

The title of the plot.

xlabel : str, optional

The label for the x-axis.

ylabel : str, optional

The label for the y-axis.

Returns

None

static plot_line(df, x, y, title=None, xlabel=None, ylabel=None)

Plots a line chart for two specified columns in the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

x : str

The column for the x-axis.

y : str

The column for the y-axis.

title : str, optional

The title of the plot.

xlabel : str, optional

The label for the x-axis.

ylabel : str, optional

The label for the y-axis.

Returns

None

static plot_pairplot(df, hue=None)

Plots a pairplot for the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

hue : str, optional

The column to use for color encoding.

Returns

None

static plot_pie(df, column, title=None)

Plots a pie chart for a specified column in the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

column : str

The column to plot.

title : str, optional

The title of the plot.

Returns

None

static plot_scatter(df, x, y, hue=None, title=None, xlabel=None, ylabel=None)

Plots a scatter plot for two specified columns in the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

x : str

The column for the x-axis.

y : str

The column for the y-axis.

hue : str, optional

The column to use for color encoding.

title : str, optional

The title of the plot.

xlabel : str, optional

The label for the x-axis.

ylabel : str, optional

The label for the y-axis.

Returns

None

static plot_violin(df, x, y, hue=None, title=None, xlabel=None, ylabel=None)

Plots a violin plot for the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

x : str

The column for the x-axis.

y : str

The column for the y-axis.

hue : str, optional

The column to use for color encoding.

title : str, optional

The title of the plot.

xlabel : str, optional

The label for the x-axis.

ylabel : str, optional

The label for the y-axis.

Returns

None

class src.utils.Utils

Bases: object

A class containing general utility functions for data operations.

Methods

calculate_mean(df, column)

Calculates the mean of a specified column in the DataFrame.

calculate_sum(df, column)

Calculates the sum of a specified column in the DataFrame.

calculate_max(df, column)

Calculates the maximum value of a specified column in the DataFrame.

calculate_min(df, column)

Calculates the minimum value of a specified column in the DataFrame.

calculate_std(df, column)

Calculates the standard deviation of a specified column in the DataFrame.

normalize_array(arr)

Normalizes a numpy array.

calculate_median(df, column)

Calculates the median of a specified column in the DataFrame.

calculate_variance(df, column)

Calculates the variance of a specified column in the DataFrame.

calculate_mode(df, column)

Calculates the mode of a specified column in the DataFrame.

calculate_iqr(df, column)

Calculates the interquartile range (IQR) of a specified column in the DataFrame.

static calculate_iqr(df, column)

Calculates the interquartile range (IQR) of a specified column in the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

column : str

The column to calculate the IQR for.

Returns

float

The IQR of the specified column.
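The IQR is the difference between the 75th and 25th percentiles. The following is a minimal sketch of that computation with pandas, not the library's own implementation; `iqr_sketch` is a hypothetical stand-in name, and the default linear quantile interpolation is an assumption.

```python
import pandas as pd


def iqr_sketch(df: pd.DataFrame, column: str) -> float:
    """Hypothetical stand-in for calculate_iqr: returns Q3 - Q1 for one column."""
    q1 = df[column].quantile(0.25)  # first quartile
    q3 = df[column].quantile(0.75)  # third quartile
    return float(q3 - q1)


df = pd.DataFrame({"score": [1, 2, 3, 4, 5, 6, 7, 8]})
iqr_value = iqr_sketch(df, "score")
print(iqr_value)
```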

static calculate_max(df, column)

Calculates the maximum value of a specified column in the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

column : str

The column to calculate the maximum value for.

Returns

float

The maximum value of the specified column.

static calculate_mean(df, column)

Calculates the mean of a specified column in the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

column : str

The column to calculate the mean for.

Returns

float

The mean of the specified column.

static calculate_median(df, column)

Calculates the median of a specified column in the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

column : str

The column to calculate the median for.

Returns

float

The median of the specified column.

static calculate_min(df, column)

Calculates the minimum value of a specified column in the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

column : str

The column to calculate the minimum value for.

Returns

float

The minimum value of the specified column.

static calculate_mode(df, column)

Calculates the mode of a specified column in the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

column : str

The column to calculate the mode for.

Returns

float

The mode of the specified column.

static calculate_std(df, column)

Calculates the standard deviation of a specified column in the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

column : str

The column to calculate the standard deviation for.

Returns

float

The standard deviation of the specified column.

static calculate_sum(df, column)

Calculates the sum of a specified column in the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

column : str

The column to calculate the sum for.

Returns

float

The sum of the specified column.

static calculate_variance(df, column)

Calculates the variance of a specified column in the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

column : str

The column to calculate the variance for.

Returns

float

The variance of the specified column.

static normalize_array(arr)

Normalizes a numpy array.

Parameters

arr : numpy.ndarray

The array to normalize.

Returns

numpy.ndarray

The normalized array.
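The reference does not say which normalization normalize_array applies. As one plausible reading, the sketch below uses min-max scaling to the range [0, 1]; both that choice and the name `min_max_normalize` are assumptions, and the real method may instead standardize or scale to unit norm.

```python
import numpy as np


def min_max_normalize(arr: np.ndarray) -> np.ndarray:
    """Hypothetical sketch: rescale values linearly so min -> 0 and max -> 1."""
    arr = np.asarray(arr, dtype=float)
    span = arr.max() - arr.min()
    if span == 0:
        # A constant array has no spread; return zeros to avoid dividing by zero.
        return np.zeros_like(arr)
    return (arr - arr.min()) / span


scaled = min_max_normalize(np.array([2.0, 4.0, 6.0, 10.0]))
print(scaled)
```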

class src.statistics.Statistics

Bases: object

A class used to perform statistical analysis on data.

Methods

describe_data(df)

Provides descriptive statistics for the DataFrame.

correlation_matrix(df)

Computes the correlation matrix for the DataFrame.

t_test(sample1, sample2)

Performs a t-test to compare the means of two samples.

chi_square_test(observed, expected)

Performs a chi-square test to compare observed and expected frequencies.

anova(*samples)

Performs a one-way ANOVA test to compare the means of multiple samples.

linear_regression(x, y)

Performs a linear regression analysis.

z_score(df, column)

Computes the z-scores for a column in the DataFrame.

moving_average(df, column, window)

Computes the moving average for a column in the DataFrame.

static anova(*samples)

Performs a one-way ANOVA test to compare the means of multiple samples.

Parameters

*samples : array-like

The samples to compare.

Returns

tuple

The F-statistic and the p-value.

static chi_square_test(observed, expected)

Performs a chi-square test to compare observed and expected frequencies.

Parameters

observed : array-like

The observed frequencies.

expected : array-like

The expected frequencies.

Returns

tuple

The chi-square statistic and the p-value.
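For reference, the chi-square statistic is the sum of (observed - expected)^2 / expected over all categories. The sketch below computes only that statistic with NumPy; it omits the p-value, and `chi_square_statistic` is a hypothetical name rather than part of the documented API.

```python
import numpy as np


def chi_square_statistic(observed, expected) -> float:
    """Hypothetical sketch: Pearson chi-square statistic (no p-value)."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((observed - expected) ** 2 / expected))


stat = chi_square_statistic([18, 22, 20], [20, 20, 20])
print(stat)
```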

static correlation_matrix(df)

Computes the correlation matrix for the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame to compute the correlation matrix for.

Returns

pandas.DataFrame

The correlation matrix of the DataFrame.
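A correlation matrix of this shape can be produced directly with pandas. The sketch below assumes Pearson correlation over numeric columns, which is the `DataFrame.corr` default; whether correlation_matrix uses the same settings is not stated in this reference.

```python
import pandas as pd

# Small illustrative frame: y rises with x, z falls as x rises.
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]})

# Pairwise Pearson correlations; the result is a square DataFrame
# indexed and labelled by the column names.
corr = df.corr()
print(corr)
```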

static describe_data(df)

Provides descriptive statistics for the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame to describe.

Returns

pandas.DataFrame

The descriptive statistics of the DataFrame.

static linear_regression(x, y)

Performs a linear regression analysis.

Parameters

x : array-like

The independent variable.

y : array-like

The dependent variable.

Returns

tuple

The slope, intercept, r-value, p-value, and standard error of the estimate.
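The five returned values match the fields that `scipy.stats.linregress` reports, although this reference does not say which library is used underneath. As an illustration of the first three, the sketch below derives slope, intercept, and r-value for a simple least-squares fit with NumPy; `simple_linear_fit` is a hypothetical name, and the p-value and standard error are deliberately left out.

```python
import numpy as np


def simple_linear_fit(x, y):
    """Hypothetical sketch: least-squares slope, intercept and Pearson r."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centered x
    yc = y - y.mean()  # centered y
    slope = (xc @ yc) / (xc @ xc)
    intercept = y.mean() - slope * x.mean()
    r_value = (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))
    return slope, intercept, r_value


slope, intercept, r_value = simple_linear_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept, r_value)
```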

static moving_average(df, column, window)

Computes the moving average for a column in the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the column.

column : str

The column to compute the moving average for.

window : int

The window size for the moving average.

Returns

pandas.Series

The moving average of the column.
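A moving average over a fixed window can be sketched with a pandas rolling mean. This assumes a trailing, unweighted window, so the first `window - 1` entries are NaN; `moving_average_sketch` is a hypothetical name, and the documented method may treat the edges differently.

```python
import pandas as pd


def moving_average_sketch(df: pd.DataFrame, column: str, window: int) -> pd.Series:
    """Hypothetical sketch: trailing unweighted mean over `window` rows."""
    return df[column].rolling(window=window).mean()


df = pd.DataFrame({"price": [10.0, 20.0, 30.0, 40.0, 50.0]})
smoothed = moving_average_sketch(df, "price", window=3)
print(smoothed)
```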

static t_test(sample1, sample2)

Performs a t-test to compare the means of two samples.

Parameters

sample1 : array-like

The first sample.

sample2 : array-like

The second sample.

Returns

tuple

The t-statistic and the p-value.

static z_score(df, column)

Computes the z-scores for a column in the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the column.

column : str

The column to compute z-scores for.

Returns

pandas.Series

The z-scores of the column.
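A z-score expresses each value as its distance from the column mean in units of standard deviation. The sketch below assumes the population standard deviation (`ddof=0`); the reference does not state which convention z_score follows, and `z_score_sketch` is a hypothetical name.

```python
import pandas as pd


def z_score_sketch(df: pd.DataFrame, column: str) -> pd.Series:
    """Hypothetical sketch: (value - mean) / population standard deviation."""
    col = df[column]
    return (col - col.mean()) / col.std(ddof=0)


df = pd.DataFrame({"height": [2, 4, 4, 4, 5, 5, 7, 9]})
z = z_score_sketch(df, "height")
print(z.tolist())
```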

class src.analysis.exploratory_analysis.ExploratoryAnalysis

Bases: object

A class used to perform exploratory data analysis (EDA) on data.

Methods

load_and_clean_data(source, cleaning_strategy='drop')

Loads and cleans the data from a specified source.

describe_and_visualize_data(df)

Provides descriptive statistics and visualizations for the DataFrame.

analyze_correlations(df)

Analyzes and visualizes correlations in the DataFrame.

perform_statistical_tests(df, column1, column2)

Performs statistical tests between two columns in the DataFrame.

visualize_distributions(df, columns)

Visualizes the distributions of specified columns in the DataFrame.

static analyze_correlations(df)

Analyzes and visualizes correlations in the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame to analyze.

Returns

None

static describe_and_visualize_data(df)

Provides descriptive statistics and visualizations for the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame to analyze.

Returns

None

static load_and_clean_data(source, cleaning_strategy='drop')

Loads and cleans the data from a specified source.

Parameters

source : str or pandas.DataFrame

The source of the data. Can be a URL, file path, or DataFrame object.

cleaning_strategy : str, optional

The strategy to use for cleaning: 'drop', 'fill_mean', or 'fill_median' (default is 'drop').

Returns

pandas.DataFrame

The cleaned DataFrame.
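The three strategy names suggest dropping rows with missing values or filling them with a column statistic. The sketch below covers only the cleaning step for a DataFrame that is already loaded; `clean_sketch` is a hypothetical name, and the exact behavior of each strategy is an assumption inferred from its name.

```python
import pandas as pd


def clean_sketch(df: pd.DataFrame, cleaning_strategy: str = "drop") -> pd.DataFrame:
    """Hypothetical sketch of the three cleaning strategies."""
    if cleaning_strategy == "drop":
        return df.dropna()  # remove any row containing a missing value
    if cleaning_strategy == "fill_mean":
        return df.fillna(df.mean(numeric_only=True))  # per-column mean
    if cleaning_strategy == "fill_median":
        return df.fillna(df.median(numeric_only=True))  # per-column median
    raise ValueError(f"Unknown cleaning strategy: {cleaning_strategy!r}")


raw = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, None]})
dropped = clean_sketch(raw, "drop")
filled = clean_sketch(raw, "fill_mean")
print(filled)
```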

static perform_statistical_tests(df, column1, column2)

Performs statistical tests between two columns in the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the columns.

column1 : str

The first column for the test.

column2 : str

The second column for the test.

Returns

None

static visualize_distributions(df, columns)

Visualizes the distributions of specified columns in the DataFrame.

Parameters

df : pandas.DataFrame

The DataFrame containing the columns.

columns : list

The columns to visualize.

Returns

None

class src.analysis.advanced_analysis.AdvancedAnalysis

Bases: object

A class used to perform advanced analysis on data.

Methods

preprocess_data(df, target_column, test_size=0.2, scale_data=True)

Preprocesses the data by splitting into training and testing sets and scaling if required.

perform_pca(df, n_components=2)

Performs Principal Component Analysis (PCA) on the data.

perform_tsne(df, n_components=2, perplexity=30.0, n_iter=1000)

Performs t-Distributed Stochastic Neighbor Embedding (t-SNE) on the data.

logistic_regression(df, target_column)

Performs logistic regression on the data.

decision_tree(df, target_column)

Performs decision tree classification on the data.

random_forest(df, target_column)

Performs random forest classification on the data.

linear_regression(df, target_column)

Performs linear regression on the data.

ridge_regression(df, target_column, alpha=1.0)

Performs ridge regression on the data.

lasso_regression(df, target_column, alpha=1.0)

Performs lasso regression on the data.

decision_tree_regression(df, target_column)

Performs decision tree regression on the data.

random_forest_regression(df, target_column)

Performs random forest regression on the data.

static decision_tree(df, target_column)

Performs decision tree classification on the data.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

target_column : str

The target column for classification.

Returns

dict

The results of the decision tree classification including accuracy, classification report, and confusion matrix.

static decision_tree_regression(df, target_column)

Performs decision tree regression on the data.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

target_column : str

The target column for regression.

Returns

dict

The results of the decision tree regression including mean squared error and R-squared score.

static lasso_regression(df, target_column, alpha=1.0)

Performs lasso regression on the data.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

target_column : str

The target column for regression.

alpha : float, optional

Regularization strength (default is 1.0).

Returns

dict

The results of the lasso regression including mean squared error and R-squared score.

static linear_regression(df, target_column)

Performs linear regression on the data.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

target_column : str

The target column for regression.

Returns

dict

The results of the linear regression including mean squared error and R-squared score.

static logistic_regression(df, target_column)

Performs logistic regression on the data.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

target_column : str

The target column for classification.

Returns

dict

The results of the logistic regression including accuracy, classification report, and confusion matrix.

static perform_pca(df, n_components=2)

Performs Principal Component Analysis (PCA) on the data.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

n_components : int, optional

The number of components to keep (default is 2).

Returns

pandas.DataFrame

The DataFrame with the principal components.
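To make the returned shape concrete, the sketch below projects a numeric DataFrame onto its leading principal components using a NumPy SVD of the centered data. It is an illustration of the technique, not the documented implementation; `pca_sketch` and the `PC1`, `PC2` column names are hypothetical, and no feature scaling is applied.

```python
import numpy as np
import pandas as pd


def pca_sketch(df: pd.DataFrame, n_components: int = 2) -> pd.DataFrame:
    """Hypothetical sketch: PCA scores via SVD of the mean-centered data."""
    X = df.to_numpy(dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = X_centered @ Vt[:n_components].T
    columns = [f"PC{i + 1}" for i in range(n_components)]
    return pd.DataFrame(scores, columns=columns, index=df.index)


df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.1, 5.9, 8.0],
    "c": [0.5, 0.4, 0.6, 0.5],
})
components = pca_sketch(df, n_components=2)
print(components)
```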

static perform_tsne(df, n_components=2, perplexity=30.0, n_iter=1000)

Performs t-Distributed Stochastic Neighbor Embedding (t-SNE) on the data.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

n_components : int, optional

The number of components to keep (default is 2).

perplexity : float, optional

The perplexity parameter for t-SNE (default is 30.0).

n_iter : int, optional

The number of iterations for optimization (default is 1000).

Returns

pandas.DataFrame

The DataFrame with the t-SNE components.

static preprocess_data(df, target_column, test_size=0.2, scale_data=True)

Preprocesses the data by splitting into training and testing sets and scaling if required.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

target_column : str

The target column for analysis.

test_size : float, optional

The proportion of the dataset to include in the test split (default is 0.2).

scale_data : bool, optional

Whether to scale the data (default is True).

Returns

tuple

The training and testing sets (X_train, X_test, y_train, y_test).
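The split-then-scale flow can be sketched with NumPy and pandas alone. This is an illustration under stated assumptions, not the documented implementation: `preprocess_sketch` and its `seed` parameter are hypothetical, rows are shuffled before splitting, and scaling is standardization fitted on the training rows only so that no test information leaks into the scaler.

```python
import numpy as np
import pandas as pd


def preprocess_sketch(df, target_column, test_size=0.2, scale_data=True, seed=0):
    """Hypothetical sketch: shuffle, split, then standardize the features."""
    X = df.drop(columns=[target_column]).to_numpy(dtype=float)
    y = df[target_column].to_numpy()

    # Shuffle row positions reproducibly, then carve off the test portion.
    order = np.random.default_rng(seed).permutation(len(df))
    n_test = max(1, int(round(len(df) * test_size)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    if scale_data:
        mean = X_train.mean(axis=0)
        std = X_train.std(axis=0)
        std[std == 0] = 1.0  # leave constant features unscaled
        X_train = (X_train - mean) / std
        X_test = (X_test - mean) / std
    return X_train, X_test, y_train, y_test


df = pd.DataFrame({
    "f1": np.arange(10.0),
    "f2": np.arange(10.0) ** 2,
    "label": [0, 1] * 5,
})
X_train, X_test, y_train, y_test = preprocess_sketch(df, "label")
print(X_train.shape, X_test.shape)
```

The return order mirrors the tuple documented above, so the four values can be passed on to a model-fitting method that expects separate training and testing inputs.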

static random_forest(df, target_column)

Performs random forest classification on the data.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

target_column : str

The target column for classification.

Returns

dict

The results of the random forest classification including accuracy, classification report, and confusion matrix.

static random_forest_regression(df, target_column)

Performs random forest regression on the data.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

target_column : str

The target column for regression.

Returns

dict

The results of the random forest regression including mean squared error and R-squared score.

static ridge_regression(df, target_column, alpha=1.0)

Performs ridge regression on the data.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

target_column : str

The target column for regression.

alpha : float, optional

Regularization strength (default is 1.0).

Returns

dict

The results of the ridge regression including mean squared error and R-squared score.

class src.models.classification.Classification

Bases: object

A class used to perform classification and clustering on data.

Methods

preprocess_data(df, target_column, test_size=0.2, scale_data=True)

Preprocesses the data by splitting into training and testing sets and scaling if required.

logistic_regression(X_train, y_train, X_test, y_test)

Performs logistic regression on the data.

decision_tree(X_train, y_train, X_test, y_test)

Performs decision tree classification on the data.

random_forest(X_train, y_train, X_test, y_test)

Performs random forest classification on the data.

kmeans_clustering(df, n_clusters)

Performs KMeans clustering on the data.

static decision_tree(X_train, y_train, X_test, y_test)

Performs decision tree classification on the data.

Parameters

X_train : array-like

The training data.

y_train : array-like

The training labels.

X_test : array-like

The testing data.

y_test : array-like

The testing labels.

Returns

dict

The results of the decision tree classification including accuracy, classification report, and confusion matrix.
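The accuracy and confusion matrix in such a results dictionary are derived from the true and predicted labels. The sketch below computes both with NumPy for a set of already-made predictions; `classification_summary` and the dictionary keys are hypothetical, since this reference does not list the actual key names, and the classification report is omitted.

```python
import numpy as np


def classification_summary(y_true, y_pred) -> dict:
    """Hypothetical sketch: accuracy and confusion matrix from predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))
    index = {label: i for i, label in enumerate(labels)}

    # Rows are true classes, columns are predicted classes.
    matrix = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[index[t], index[p]] += 1

    accuracy = float(np.trace(matrix) / matrix.sum())
    return {"accuracy": accuracy, "confusion_matrix": matrix}


summary = classification_summary([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
print(summary["accuracy"])
print(summary["confusion_matrix"])
```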

static kmeans_clustering(df, n_clusters)

Performs KMeans clustering on the data.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

n_clusters : int

The number of clusters to form.

Returns

dict

The results of the KMeans clustering including cluster centers and labels.

static logistic_regression(X_train, y_train, X_test, y_test)

Performs logistic regression on the data.

Parameters

X_train : array-like

The training data.

y_train : array-like

The training labels.

X_test : array-like

The testing data.

y_test : array-like

The testing labels.

Returns

dict

The results of the logistic regression including accuracy, classification report, and confusion matrix.

static preprocess_data(df, target_column, test_size=0.2, scale_data=True)

Preprocesses the data by splitting into training and testing sets and scaling if required.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

target_column : str

The target column for classification.

test_size : float, optional

The proportion of the dataset to include in the test split (default is 0.2).

scale_data : bool, optional

Whether to scale the data (default is True).

Returns

tuple

The training and testing sets (X_train, X_test, y_train, y_test).

static random_forest(X_train, y_train, X_test, y_test)

Performs random forest classification on the data.

Parameters

X_train : array-like

The training data.

y_train : array-like

The training labels.

X_test : array-like

The testing data.

y_test : array-like

The testing labels.

Returns

dict

The results of the random forest classification including accuracy, classification report, and confusion matrix.

class src.models.regression.Regression

Bases: object

A class used to perform regression analysis on data.

Methods

preprocess_data(df, target_column, test_size=0.2, scale_data=True)

Preprocesses the data by splitting into training and testing sets and scaling if required.

linear_regression(X_train, y_train, X_test, y_test)

Performs linear regression on the data.

ridge_regression(X_train, y_train, X_test, y_test, alpha=1.0)

Performs ridge regression on the data.

lasso_regression(X_train, y_train, X_test, y_test, alpha=1.0)

Performs lasso regression on the data.

decision_tree_regression(X_train, y_train, X_test, y_test)

Performs decision tree regression on the data.

random_forest_regression(X_train, y_train, X_test, y_test)

Performs random forest regression on the data.

static decision_tree_regression(X_train, y_train, X_test, y_test)

Performs decision tree regression on the data.

Parameters

X_train : array-like

The training data.

y_train : array-like

The training target values.

X_test : array-like

The testing data.

y_test : array-like

The testing target values.

Returns

dict

The results of the decision tree regression including mean squared error and R-squared score.

static lasso_regression(X_train, y_train, X_test, y_test, alpha=1.0)

Performs lasso regression on the data.

Parameters

X_train : array-like

The training data.

y_train : array-like

The training target values.

X_test : array-like

The testing data.

y_test : array-like

The testing target values.

alpha : float, optional

Regularization strength (default is 1.0).

Returns

dict

The results of the lasso regression including mean squared error and R-squared score.

static linear_regression(X_train, y_train, X_test, y_test)

Performs linear regression on the data.

Parameters

X_train : array-like

The training data.

y_train : array-like

The training target values.

X_test : array-like

The testing data.

y_test : array-like

The testing target values.

Returns

dict

The results of the linear regression including mean squared error and R-squared score.
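To show how the two reported metrics relate to the inputs, the sketch below fits ordinary least squares on the training split with NumPy and scores it on the test split. It is an illustration, not the documented implementation; `linear_regression_sketch` and the `"mse"` and `"r2"` keys are hypothetical, since this reference does not list the actual key names.

```python
import numpy as np


def linear_regression_sketch(X_train, y_train, X_test, y_test) -> dict:
    """Hypothetical sketch: OLS fit on train, MSE and R-squared on test."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    y_test = np.asarray(y_test, dtype=float)

    # Prepend a column of ones so the fit includes an intercept term.
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

    pred = A_test @ coef
    ss_res = float(np.sum((y_test - pred) ** 2))          # residual sum of squares
    ss_tot = float(np.sum((y_test - y_test.mean()) ** 2))  # total sum of squares
    return {"mse": ss_res / len(y_test), "r2": 1.0 - ss_res / ss_tot}


results = linear_regression_sketch(
    X_train=[[0.0], [1.0], [2.0], [3.0]],
    y_train=[1.0, 3.0, 5.0, 7.0],
    X_test=[[4.0], [5.0]],
    y_test=[9.0, 11.0],
)
print(results)
```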

static preprocess_data(df, target_column, test_size=0.2, scale_data=True)

Preprocesses the data by splitting into training and testing sets and scaling if required.

Parameters

df : pandas.DataFrame

The DataFrame containing the data.

target_column : str

The target column for regression.

test_size : float, optional

The proportion of the dataset to include in the test split (default is 0.2).

scale_data : bool, optional

Whether to scale the data (default is True).

Returns

tuple

The training and testing sets (X_train, X_test, y_train, y_test).

static random_forest_regression(X_train, y_train, X_test, y_test)

Performs random forest regression on the data.

Parameters

X_train : array-like

The training data.

y_train : array-like

The training target values.

X_test : array-like

The testing data.

y_test : array-like

The testing target values.

Returns

dict

The results of the random forest regression including mean squared error and R-squared score.

static ridge_regression(X_train, y_train, X_test, y_test, alpha=1.0)

Performs ridge regression on the data.

Parameters

X_train : array-like

The training data.

y_train : array-like

The training target values.

X_test : array-like

The testing data.

y_test : array-like

The testing target values.

alpha : float, optional

Regularization strength (default is 1.0).

Returns

dict

The results of the ridge regression including mean squared error and R-squared score.