Source code for babilonia.tools.parse

# SPDX-License-Identifier: GPL-3.0-or-later
#
# Copyright (C) 2025 The Project Authors
# See pyproject.toml for authors/maintainers.
# See LICENSE for license details.
"""
Parse and standardize bank statement CSV files.

This script scans a target directory for bank statement files
(T0 source format), standardizes their structure using the appropriate
``CashFlow`` parser, and writes the processed outputs as new CSV files
(T1 canonical format) in the same yearly subdirectories.

Processing can be restricted to a single year or applied to all available
years. When multiple years are present, files are processed year by year
with structured terminal output to facilitate log inspection.
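
The year-by-year grouping described above can be sketched as follows. This is
a minimal illustration, assuming the year is the name of each file's parent
directory; ``group_by_year`` is a hypothetical helper, not part of this module.

```python
from pathlib import Path


def group_by_year(paths):
    # Hypothetical helper: the year is taken from the parent directory name.
    groups = {}
    for p in map(Path, paths):
        groups.setdefault(p.parent.name, []).append(p)
    return groups


files = [
    "bb/cc/2023/EXTRATO_BB_CC_2023-01_T0.csv",
    "bb/cc/2022/EXTRATO_BB_CC_2022-01_T0.csv",
    "bb/cc/2022/EXTRATO_BB_CC_2022-02_T0.csv",
]
groups = group_by_year(files)

# Iterate in sorted order so years are reported chronologically.
for year in sorted(groups):
    print(year, [p.name for p in groups[year]])
```

Sorting the dictionary keys works here because four-digit year strings sort
chronologically.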

Script Examples
---------------

The script is intended for command-line execution.

.. dropdown:: Minimal PowerShell example (Windows)
    :icon: code-square
    :open:

    Save as ``run_parse.ps1`` and execute from PowerShell.

    .. code-block:: powershell

        # ! Warning -- change paths and parameters

        # Paths
        $REPO   = "C:\\path\\to\\repo"
        $SCRIPT = "$REPO\\babilonia\\tools\\parse.py"
        $DATA   = "C:\\data\\bank_statements"

        # Parameters
        $TYPE = "bb-cc"
        $YEAR = 2024

        # Run script
        python $SCRIPT `
            --folder $DATA `
            --type $TYPE `
            --year $YEAR


.. dropdown:: Minimal shell example (Linux)
    :icon: code-square
    :open:

    Save as ``run_parse.sh`` and execute from a terminal.

    .. code-block:: bash

        #!/usr/bin/env bash

        # ! Warning -- change paths and parameters

        # Paths
        REPO="/path/to/repo"
        SCRIPT="$REPO/babilonia/tools/parse.py"
        DATA="/data/bank_statements"

        # Parameters
        TYPE="bb-cc"
        YEAR=2024

        # Run script
        python "$SCRIPT" --folder "$DATA" --type "$TYPE" --year "$YEAR"


Expected Folder Structure
-------------------------

The input data is expected to follow a simple hierarchical layout:

::

    bb/                                 # Bank
    └── cc/                             # Bank account
        ├── 2022/
        │   ├── EXTRATO_BB_CC_2022-01_T0.csv
        │   └── EXTRATO_BB_CC_2022-02_T0.csv
        ├── 2023/
        │   └── EXTRATO_BB_CC_2023-01_T0.csv
        └── 2024/
            ├── EXTRATO_BB_CC_2024-01_T0.csv
            └── EXTRATO_BB_CC_2024-02_T0.csv

Each ``*_T0.csv`` file represents a raw (Tier 0) bank statement.

During execution, the script generates standardized Tier 1 outputs
alongside the original files:

::

    bb/                                 # Bank
    └── cc/                             # Bank account
        └── 2024/
            ├── EXTRATO_BB_CC_2024-01_T0.csv   # original
            └── EXTRATO_BB_CC_2024-01_T1.csv   # standardized (canonical)
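
The naming rule for Tier 1 outputs can be sketched as below. This is a small
illustration that assumes the tier tag appears as ``T0`` in the file stem;
``t1_path`` is a hypothetical helper name used only for this example.

```python
from pathlib import Path


def t1_path(t0_file):
    # Hypothetical helper: swap the tier tag in the stem, keep the same folder.
    t0_file = Path(t0_file)
    return t0_file.parent / (t0_file.stem.replace("T0", "T1") + ".csv")


out = t1_path("bb/cc/2024/EXTRATO_BB_CC_2024-01_T0.csv")
print(out.name)
```

Because the replacement acts on the whole stem, a file name that contains
``T0`` elsewhere would also be altered; the layout shown above avoids that.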

Data Tiers
----------

- **Tier 0 (T0)**: Raw statement files as exported by the bank.
  Column names, formats, and ordering may vary.
- **Tier 1 (T1)**: Canonical, standardized CSV files produced by this
  script, suitable for downstream analysis and aggregation.

The original Tier 0 files are never modified; Tier 1 files are written
only when they do not already exist.
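
The write-once rule can be sketched as below, assuming a plain existence check
on the output path; ``write_if_missing`` is a hypothetical helper and the file
contents are placeholders.

```python
import tempfile
from pathlib import Path


def write_if_missing(path, text):
    # Hypothetical helper: write only when the target does not exist yet.
    path = Path(path)
    if path.exists():
        return False  # skipped, existing file left untouched
    path.write_text(text, encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "EXTRATO_BB_CC_2024-01_T1.csv"
    first = write_if_missing(target, "a;b")   # file is created
    second = write_if_missing(target, "c;d")  # file exists, call is a no-op
    content = target.read_text(encoding="utf-8")
```

To regenerate a Tier 1 file under this rule, the existing output has to be
deleted first.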

"""

# IMPORTS
# ***********************************************************************
# import modules from other libs

# Native imports
# =======================================================================
import glob
import argparse
import pprint
from pathlib import Path

# ... {develop}

# External imports
# =======================================================================
# import {module}
# ... {develop}

# Project-level imports
# =======================================================================
# import {module}
from babilonia.tools.core import *


# ... {develop}

# CONSTANTS
# ***********************************************************************
# ... {develop}

# FUNCTIONS
# ***********************************************************************
# ... {develop}


def main():
    args = get_arguments()

    data_folder = Path(args.folder)
    data_type = args.type.lower()
    year_arg = args.year

    print("\n\n")
    print("=" * 80)
    print(" Parsing Bank Statements\n".upper())
    print(f" Folder : {data_folder}")
    print(f" Bank : {BANK_NAMES[data_type]}")
    print(f" Account : {ACCOUNT_NAMES[data_type]}")
    print(f" Year : {year_arg if year_arg is not None else 'ALL'}")
    print("=" * 80)

    # Resolve file pattern (year wildcard handled inside helper)
    pattern_files = get_file_pattern_statement_t0(data_type, data_folder, year_arg)
    ls_files = glob.glob(pattern_files)

    if not ls_files:
        print(" No input files found. Nothing to process.")
        print("=" * 80)
        return None

    # ------------------------------------------------------------------
    # Group files by year (assumes year is the parent directory name)
    # ------------------------------------------------------------------
    files_by_year = {}
    for f in ls_files:
        fpath = Path(f)
        try:
            year = fpath.parent.name
        except IndexError:
            continue
        files_by_year.setdefault(year, []).append(fpath)

    cf = PARSERS[data_type]()

    total_processed = 0

    for year in sorted(files_by_year):
        print()
        # print("-" * 80)
        print(f" Year {year}")
        print("-" * 80)

        yearly_processed = 0

        for i, fpath in enumerate(files_by_year[year], start=1):
            name = fpath.stem
            new_name = name.replace("T0", "T1")
            file_out = fpath.parent / f"{new_name}.csv"

            print(f"[{i:02d}] {fpath.name}", end=" -> ")

            if file_out.exists():
                print(f"{file_out.name} SKIPPED")
                continue

            cf.load_data(file_data=str(fpath))
            cf.standardize()
            cf.data.to_csv(file_out, sep=";", index=False)

            print(f"{file_out.name} PARSED")

            total_processed += 1
            yearly_processed += 1

        print(f"\n Year completed. Output files written: {yearly_processed}")

    print()
    print("=" * 80)
    print(f" Completed. Output files written: {total_processed}")
    print("=" * 80)
    print("\n\n")

    return None


# SCRIPT
# ***********************************************************************
# standalone behaviour as a script
if __name__ == "__main__":
    main()