Source code for babilonia.tools.parse
# SPDX-License-Identifier: GPL-3.0-or-later
#
# Copyright (C) 2025 The Project Authors
# See pyproject.toml for authors/maintainers.
# See LICENSE for license details.
"""
Parse and standardize bank statement CSV files.

This script scans a target directory for bank statement files
(T0 source format), standardizes their structure using the appropriate
``CashFlow`` parser, and writes the processed outputs as new CSV files
(T1 canonical format) in the same yearly subdirectories.

Processing can be restricted to a single year or applied to all available
years. When multiple years are present, files are processed year by year
with structured terminal output to facilitate log inspection.
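
As a rough illustration of the year-by-year processing described above, the
following sketch groups statement paths by the name of their parent
directory, which is assumed to be the year. The helper name
``group_by_year`` and the sample paths are hypothetical and are not part of
this module.

```python
from pathlib import PurePosixPath


def group_by_year(paths):
    # Hypothetical helper: map each parent-directory name (assumed to be
    # the year) to the sorted list of files found under it.
    groups = {}
    for p in sorted(PurePosixPath(x) for x in paths):
        groups.setdefault(p.parent.name, []).append(p)
    return groups


sample = [
    "bb/cc/2024/EXTRATO_BB_CC_2024-01_T0.csv",
    "bb/cc/2022/EXTRATO_BB_CC_2022-02_T0.csv",
    "bb/cc/2022/EXTRATO_BB_CC_2022-01_T0.csv",
]
grouped = group_by_year(sample)
for year in sorted(grouped):
    print(year, [p.name for p in grouped[year]])
```

Sorting the paths before grouping keeps the per-year order stable, which is
not guaranteed when the list comes straight from ``glob.glob``.
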

Script Examples
---------------

The script is intended for command-line execution.

.. dropdown:: Minimal PowerShell example (Windows)
    :icon: code-square
    :open:

    Save as ``run_parse.ps1`` and execute from PowerShell.

    .. code-block:: powershell

        # ! Warning -- change paths and parameters
        # Paths
        $REPO = "C:\\path\\to\\repo"
        $SCRIPT = "$REPO\\babilonia\\tools\\parse.py"
        $DATA = "C:\\data\\bank_statements"

        # Parameters
        $TYPE = "bb-cc"
        $YEAR = 2024

        # Run script
        python $SCRIPT `
            --folder $DATA `
            --type $TYPE `
            --year $YEAR

.. dropdown:: Minimal shell example (Linux)
    :icon: code-square
    :open:

    Save as ``run_parse.sh`` and execute from a terminal.

    .. code-block:: bash

        #!/usr/bin/env bash
        # ! Warning -- change paths and parameters
        # Paths
        REPO="/path/to/repo"
        SCRIPT="$REPO/babilonia/tools/parse.py"
        DATA="/data/bank_statements"

        # Parameters
        TYPE="bb-cc"
        YEAR=2024

        # Run script
        python "$SCRIPT" --folder "$DATA" --type "$TYPE" --year "$YEAR"

Expected Folder Structure
-------------------------
The input data is expected to follow a simple hierarchical layout:

::

    bb/                                      # Bank
    └── cc/                                  # Bank account
        ├── 2022/
        │   ├── EXTRATO_BB_CC_2022-01_T0.csv
        │   └── EXTRATO_BB_CC_2022-02_T0.csv
        ├── 2023/
        │   └── EXTRATO_BB_CC_2023-01_T0.csv
        └── 2024/
            ├── EXTRATO_BB_CC_2024-01_T0.csv
            └── EXTRATO_BB_CC_2024-02_T0.csv

Each ``*_T0.csv`` file represents a raw (Tier 0) bank statement.
During execution, the script generates standardized Tier 1 outputs
alongside the original files:

::

    bb/                                      # Bank
    └── cc/                                  # Bank account
        └── 2024/
            ├── EXTRATO_BB_CC_2024-01_T0.csv   # original
            └── EXTRATO_BB_CC_2024-01_T1.csv   # standardized (canonical)
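
The naming convention shown above can be sketched as follows. This is only a
minimal illustration that assumes every input stem ends in ``_T0``; the
helper name ``t1_path`` is hypothetical and is not part of this module.

```python
from pathlib import PurePosixPath


def t1_path(t0_path):
    # Hypothetical helper: swap only the trailing tier tag of the stem,
    # keeping the directory and the file extension unchanged.
    p = PurePosixPath(t0_path)
    if not p.stem.endswith("_T0"):
        raise ValueError(f"not a Tier 0 file name: {p.name}")
    return p.with_name(f"{p.stem[:-3]}_T1{p.suffix}")


print(t1_path("bb/cc/2024/EXTRATO_BB_CC_2024-01_T0.csv"))
```

Replacing only the suffix avoids touching any other ``T0`` substring that
might appear earlier in a file name.
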

Data Levels
-----------

- **Tier 0 (T0)**: Raw statement files as exported by the bank.
  Column names, formats, and ordering may vary.
- **Tier 1 (T1)**: Canonical, standardized CSV files produced by this
  script, suitable for downstream analysis and aggregation.

The original Tier 0 files are never modified; Tier 1 files are written
only when they do not already exist.
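
The write-once rule for Tier 1 files can be illustrated with the sketch
below. It is an assumption-laden example, not the code used by this script:
the helper name ``write_if_missing`` and the sample content are hypothetical.

```python
import tempfile
from pathlib import Path


def write_if_missing(path, text):
    # Hypothetical helper: create the file only when it does not exist yet.
    # Returns True when a new file was written, False when it was skipped.
    path = Path(path)
    if path.exists():
        return False
    path.write_text(text, encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "EXTRATO_BB_CC_2024-01_T1.csv"
    first = write_if_missing(out, "date;value")
    second = write_if_missing(out, "overwritten")
    content = out.read_text(encoding="utf-8")

print(first, second)
```

Because existing outputs are skipped, re-running the script over the same
folder only processes statements that have no Tier 1 file yet.
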
"""
# IMPORTS
# ***********************************************************************
# import modules from other libs
# Native imports
# =======================================================================
import glob
import argparse
import pprint
from pathlib import Path
# ... {develop}
# External imports
# =======================================================================
# import {module}
# ... {develop}
# Project-level imports
# =======================================================================
# import {module}
from babilonia.tools.core import (
    ACCOUNT_NAMES,
    BANK_NAMES,
    PARSERS,
    get_arguments,
    get_file_pattern_statement_t0,
)
# ... {develop}
# CONSTANTS
# ***********************************************************************
# ... {develop}
# FUNCTIONS
# ***********************************************************************
# ... {develop}
def main():
    """Parse raw (T0) statements in the target folder into T1 CSV files."""
    args = get_arguments()
    data_folder = Path(args.folder)
    data_type = args.type.lower()
    year_arg = args.year

    print("\n\n")
    print("=" * 80)
    print(" Parsing Bank Statements\n".upper())
    print(f" Folder : {data_folder}")
    print(f" Bank : {BANK_NAMES[data_type]}")
    print(f" Account : {ACCOUNT_NAMES[data_type]}")
    print(f" Year : {year_arg if year_arg is not None else 'ALL'}")
    print("=" * 80)

    # Resolve file pattern (year wildcard handled inside helper)
    pattern_files = get_file_pattern_statement_t0(data_type, data_folder, year_arg)
    ls_files = glob.glob(pattern_files)
    if not ls_files:
        print(" No input files found. Nothing to process.")
        print("=" * 80)
        return None

    # ------------------------------------------------------------------
    # Group files by year (assumes year is the parent directory name)
    # ------------------------------------------------------------------
    files_by_year = {}
    # glob returns files in arbitrary order; sort for stable output
    for f in sorted(ls_files):
        fpath = Path(f)
        files_by_year.setdefault(fpath.parent.name, []).append(fpath)

    cf = PARSERS[data_type]()
    total_processed = 0
    for year in sorted(files_by_year):
        print()
        # print("-" * 80)
        print(f" Year {year}")
        print("-" * 80)
        yearly_processed = 0
        for i, fpath in enumerate(files_by_year[year], start=1):
            name = fpath.stem
            # Replace only the last "T0" so earlier matches in the stem survive
            new_name = "T1".join(name.rsplit("T0", 1))
            file_out = fpath.parent / f"{new_name}.csv"
            print(f"[{i:02d}] {fpath.name}", end=" -> ")
            # Never overwrite an existing Tier 1 file
            if file_out.exists():
                print(f"{file_out.name} SKIPPED")
                continue
            cf.load_data(file_data=str(fpath))
            cf.standardize()
            cf.data.to_csv(file_out, sep=";", index=False)
            print(f"{file_out.name} PARSED")
            total_processed += 1
            yearly_processed += 1
        print(f"\n Year completed. Output files written: {yearly_processed}")

    print()
    print("=" * 80)
    print(f" Completed. Output files written: {total_processed}")
    print("=" * 80)
    print("\n\n")
    return None


# SCRIPT
# ***********************************************************************
# standalone behaviour as a script
if __name__ == "__main__":
    main()