Metadata-Version: 2.4
Name: falgueras
Version: 1.0.0
Summary: Common code for Python projects involving GCP, Pandas, and Spark.
Author-email: Aleix Falgueras Casals <falguerasaleix@gmail.com>
License: MIT
License-File: LICENSE
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.10
Requires-Dist: colorama~=0.4.6
Requires-Dist: db-dtypes~=1.3.1
Requires-Dist: findspark==1.4.2
Requires-Dist: google-api-core~=2.24.0
Requires-Dist: google-api-python-client~=2.156.0
Requires-Dist: google-auth~=2.37.0
Requires-Dist: google-cloud-bigquery-storage~=2.27.0
Requires-Dist: google-cloud-bigquery~=3.27.0
Requires-Dist: google-cloud-language~=2.16.0
Requires-Dist: google-cloud-secret-manager~=2.22.0
Requires-Dist: google-cloud-storage~=2.19.0
Requires-Dist: numpy~=2.2.1
Requires-Dist: pandas~=2.2.2
Requires-Dist: protobuf~=5.29.2
Requires-Dist: pyspark==3.5.2
Requires-Dist: pytz~=2024.1
Requires-Dist: requests~=2.32.3
Description-Content-Type: text/markdown


# Falgueras 🪴

[![PyPI version](https://img.shields.io/pypi/v/falgueras?color=4CBB17)](https://pypi.org/project/falgueras/)

Development framework for Python projects involving GCP, Pandas, and Spark. 

The main goal is to accelerate the development of data-driven projects by providing a common framework for developers
with different backgrounds: software engineers, big data engineers, and data scientists.

## Installation

`pip install falgueras` (requires Python >= 3.10)

Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to enable GCP services.
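For example, from Python (the key path below is a placeholder; use your own service account key):

```python
import os

# Placeholder path: point this at your own service-account key file.
# GCP client libraries read this variable when creating credentials.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"
```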

### Run local Spark applications in Windows from IntelliJ

_try fast, fail fast, learn fast_

For local Spark execution on Windows, the following environment variables must be set appropriately: 
- `SPARK_HOME`: points to spark-3.5.2-bin-hadoop3.
- `HADOOP_HOME`: same value as `SPARK_HOME`.
- `JAVA_HOME`: Java SDK 11 is recommended.
- `PATH` += `%HADOOP_HOME%\bin`, `%JAVA_HOME%\bin`.

`%HADOOP_HOME%\bin` must contain the files `winutils.exe` and `hadoop.dll`, which can be downloaded from 
[here](https://github.com/kontext-tech/winutils/blob/master/hadoop-3.3.0/bin).

Additionally, call `findspark.init()` at the beginning of the script so that the Spark environment 
variables are set and the pyspark dependencies are added to `sys.path`.

### Connect to BigQuery from Spark

As shown in `spark_session_utils.py`, the SparkSession used must include the jar
`com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.41.1` 
in order to communicate with BigQuery.
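One way to attach the connector is at session build time (a sketch; `spark_session_utils.py` may configure this differently, and downloading the jar requires network access on first run):

```python
from pyspark.sql import SparkSession

# Pull in the Spark-BigQuery connector when the session starts.
spark = (
    SparkSession.builder
    .appName("bq-demo")
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.41.1",
    )
    .getOrCreate()
)

# DataFrames can then be read with the "bigquery" format, e.g.:
# df = spark.read.format("bigquery").option("table", "project.dataset.table").load()
```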

## Packages

### `falgueras.common`

Code shared between the other packages and utility functions: datetime, json, enums, logging.

### `falgueras.gcp`

The functionality of various Google Cloud Platform (GCP) services is wrapped in 
custom client classes. This approach improves clarity and encapsulation.

For instance, Google Cloud Storage (GCS) operations are wrapped in the `gcp.GcsClient` class,
which has an attribute that holds the actual `storage.Client` object from GCS. Multiple `GcsClient` 
instances can share the same `storage.Client` object.
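The sharing pattern can be sketched like this (the classes below are illustrative stand-ins; the real `GcsClient` constructor may differ):

```python
class StorageClient:
    """Stand-in for google.cloud.storage.Client (illustrative only)."""

class GcsClient:
    """Thin wrapper that holds a reference to a shared storage client."""
    def __init__(self, client: StorageClient):
        self.client = client

shared = StorageClient()
buckets_a = GcsClient(shared)
buckets_b = GcsClient(shared)
assert buckets_a.client is buckets_b.client  # one client, many wrappers
```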

### `falgueras.pandas`

Pandas related code.

The `pandas_repo.py` file provides a modular and extensible framework for handling pandas DataFrame operations 
across various storage systems. Using the `PandasRepo` abstract base class and `PandasRepoProtocol`, 
it standardizes read and write operations while enabling custom implementations for specific backends 
such as BigQuery (`BqPandasRepo`). These implementations encapsulate backend-specific logic, allowing 
users to interact with data sources using a consistent interface.
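The repo pattern can be sketched as follows (a simplified interface with a toy in-memory backend; the real classes and method signatures may differ):

```python
from abc import ABC, abstractmethod

import pandas as pd

class PandasRepo(ABC):
    """Sketch of the repo interface described above."""

    @abstractmethod
    def read(self) -> pd.DataFrame: ...

    @abstractmethod
    def write(self, df: pd.DataFrame) -> None: ...

class InMemoryPandasRepo(PandasRepo):
    """Toy backend: keeps the DataFrame in memory instead of BigQuery."""

    def __init__(self) -> None:
        self._df = pd.DataFrame()

    def read(self) -> pd.DataFrame:
        return self._df.copy()

    def write(self, df: pd.DataFrame) -> None:
        self._df = df.copy()

# Callers depend only on the PandasRepo interface, not on the backend.
repo: PandasRepo = InMemoryPandasRepo()
repo.write(pd.DataFrame({"x": [1, 2]}))
print(repo.read()["x"].sum())  # 3
```

Swapping `InMemoryPandasRepo` for a BigQuery-backed implementation leaves calling code unchanged, which is the point of the shared interface.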

### `falgueras.spark`

Spark related code.

Like `pandas_repo.py`, the `spark_repo.py` file provides a modular and extensible 
framework for handling Spark DataFrame operations across various storage systems. Using the `SparkRepo` abstract base 
class and `SparkRepoProtocol`, it standardizes read and write operations while enabling custom implementations for 
specific backends such as BigQuery (`BqSparkRepo`). These implementations encapsulate backend-specific logic, allowing
users to interact with data sources using a consistent interface.
