Metadata-Version: 2.4
Name: md-spreadsheet-parser
Version: 0.1.3
Summary: A lightweight, pure Python library for parsing Markdown tables
Project-URL: Homepage, https://f-y.github.io/md-spreadsheet-parser/
Project-URL: Repository, https://github.com/f-y/md-spreadsheet-parser
Project-URL: Issues, https://github.com/f-y/md-spreadsheet-parser/issues
Author: f-y
License: MIT License
        
        Copyright (c) 2025 f-y
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: csv,markdown,parser,spreadsheet,streaming,table
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Markup :: Markdown
Requires-Python: >=3.12
Description-Content-Type: text/markdown

# Markdown Spreadsheet Parser

<p align="center">
  <a href="https://github.com/f-y/md-spreadsheet-parser/blob/main/LICENSE">
    <img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="License" />
  </a>
  <a href="https://pypi.org/project/md-spreadsheet-parser/">
    <img src="https://img.shields.io/badge/pypi-v0.1.3-blue" alt="PyPI" />
  </a>
  <a href="https://github.com/f-y/md-spreadsheet-parser">
    <img src="https://img.shields.io/badge/repository-github-green.svg" alt="Repository" />
  </a>
</p>

<p align="center">
  <strong>A robust, zero-dependency Python library for parsing, validating, and manipulating Markdown tables.</strong>
</p>

---

**md-spreadsheet-parser** turns loose Markdown text into strongly-typed data structures. It validates content against schemas and generates clean Markdown output. Ideal for building spreadsheet-like interfaces, data pipelines, and automation tools.

## Table of Contents

- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
    - [1. Basic Parsing](#1-basic-parsing)
    - [2. Type-Safe Validation](#2-type-safe-validation-recommended)
        - [Pydantic Integration](#pydantic-integration-optional)
    - [3. JSON / Dict Export](#3-json--dict-export)
    - [4. Markdown Generation](#4-markdown-generation-round-trip)
    - [5. Advanced Features](#5-advanced-features)
    - [6. Advanced Type Conversion](#6-advanced-type-conversion)
    - [7. Robustness](#7-robustness-handling-malformed-tables)
    - [8. In-Cell Line Break Support](#8-in-cell-line-break-support)
    - [9. Performance & Scalability (Streaming API)](#9-performance--scalability-streaming-api)
    - [Command Line Interface (CLI)](#command-line-interface-cli)
- [Configuration](#configuration)
- [Future Roadmap](#future-roadmap)
- [License](#license)

## Features

- **Pure Python & Zero Dependencies**: Lightweight and portable. Runs anywhere Python runs, including **WebAssembly (Pyodide)**.
- **Type-Safe Validation**: Convert loose Markdown tables into strongly-typed Python `dataclasses` with automatic type conversion, including customizable boolean logic (I18N) and custom type converters.
- **Round-Trip Support**: Parse to objects, modify data, and generate Markdown back. Perfect for editors.
- **Robust Parsing**: Gracefully handles malformed tables (missing/extra columns) and escaped characters.
- **Multi-Table Workbooks**: Support for parsing multiple sheets and tables from a single file, including metadata.
- **JSON-Friendly**: Easy export to dictionaries/JSON for integration with other tools (e.g., Pandas, APIs).

## Installation

```bash
pip install md-spreadsheet-parser
```

## Usage

### 1. Basic Parsing

**Single Table**
Parse a standard Markdown table into a structured object.

```python
from md_spreadsheet_parser import parse_table

markdown = """
| Name | Age |
| --- | --- |
| Alice | 30 |
| Bob | 25 |
"""

result = parse_table(markdown)

print(result.headers)
# ['Name', 'Age']

print(result.rows)
# [['Alice', '30'], ['Bob', '25']]
```
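
Because `headers` and `rows` are plain lists of strings, they combine easily into one dictionary per row. A minimal sketch using only the standard library, with the values copied from the output above rather than produced by the parser:

```python
# Values copied from the parse_table example above.
headers = ["Name", "Age"]
rows = [["Alice", "30"], ["Bob", "25"]]

# Pair each cell with its column header to get one dict per row.
records = [dict(zip(headers, row)) for row in rows]

print(records)
# [{'Name': 'Alice', 'Age': '30'}, {'Name': 'Bob', 'Age': '25'}]
```

Note that every value is still a string at this point; typed conversion is covered in the next section.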

**Multiple Tables (Workbook)**
Parse a file containing multiple sheets (sections). By default, it looks for `# Tables` as the root marker and `## Sheet Name` for sheets.

```python
from md_spreadsheet_parser import parse_workbook, MultiTableParsingSchema

markdown = """
# Tables

## Users
| ID | Name |
| -- | ---- |
| 1  | Alice|

## Products
| ID | Item |
| -- | ---- |
| A  | Apple|
"""

# Use default schema
schema = MultiTableParsingSchema()
workbook = parse_workbook(markdown, schema)

for sheet in workbook.sheets:
    print(f"Sheet: {sheet.name}")
    for table in sheet.tables:
        print(table.rows)
```

**File Loading Helpers**

For convenience, you can parse directly from a file path (`str` or `Path`) or a file-like object using the `_from_file` variants:

```python
from md_spreadsheet_parser import parse_workbook_from_file

# Clean and easy
workbook = parse_workbook_from_file("data.md")
```

Available helpers:
- `parse_table_from_file(path_or_file)`
- `parse_workbook_from_file(path_or_file)`
- `scan_tables_from_file(path_or_file)`

### 2. Type-Safe Validation (Recommended)

The most powerful feature of this library is converting loose Markdown tables into strongly-typed Python objects using `dataclasses`. This ensures your data is valid and easy to work with.

```python
from dataclasses import dataclass
from md_spreadsheet_parser import parse_table, TableValidationError

@dataclass
class User:
    name: str
    age: int
    is_active: bool = True

markdown = """
| Name | Age | Is Active |
|---|---|---|
| Alice | 30 | yes |
| Bob | 25 | no |
"""

try:
    # Parse and validate in one step
    users = parse_table(markdown).to_models(User)
    
    for user in users:
        print(f"{user.name} is {user.age} years old.")
        # Alice is 30 years old.
        # Bob is 25 years old.

except TableValidationError as e:
    print(e)
```

**Features:**
*   **Type Conversion**: Automatically converts strings to `int`, `float`, `bool` using standard rules.
*   **Boolean Handling (Default)**: Supports standard pairs out-of-the-box: `true/false`, `yes/no`, `on/off`, `1/0`. (See [Advanced Type Conversion](#6-advanced-type-conversion) for customization).
*   **Optional Fields**: Handles `Optional[T]` by converting empty strings to `None`.
*   **Validation**: Raises detailed errors if data doesn't match the schema.
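
To make the conversion rules above concrete, here is a rough standard-library sketch of how a single cell string could be converted based on a type hint. It is an illustration only, not the library's implementation: `convert_cell`, `TRUE_VALUES`, and `FALSE_VALUES` are hypothetical names, and the real rules may differ in detail.

```python
from typing import Optional, Union, get_args, get_origin

# Default boolean spellings listed above (assumed to be case-insensitive).
TRUE_VALUES = {"true", "yes", "on", "1"}
FALSE_VALUES = {"false", "no", "off", "0"}


def convert_cell(value: str, target):
    """Convert one cell string to `target` (hypothetical helper)."""
    text = value.strip()
    # Optional[T] is Union[T, None]: an empty cell becomes None.
    if get_origin(target) is Union:
        if text == "":
            return None
        inner = [t for t in get_args(target) if t is not type(None)]
        return convert_cell(text, inner[0])
    if target is bool:
        lowered = text.lower()
        if lowered in TRUE_VALUES:
            return True
        if lowered in FALSE_VALUES:
            return False
        raise ValueError(f"not a boolean: {value!r}")
    if target in (int, float):
        return target(text)  # raises ValueError on bad input
    return text


print(convert_cell("30", int))            # 30
print(convert_cell("yes", bool))          # True
print(convert_cell("", Optional[int]))    # None
```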

### Pydantic Integration (Optional)

For more advanced validation (email format, ranges, regex), you can use [Pydantic](https://docs.pydantic.dev/) models instead of dataclasses. This feature is enabled automatically if `pydantic` is installed.

```python
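from pydantic import BaseModel, Field, EmailStr
from md_spreadsheet_parser import parse_table

class User(BaseModel):
    name: str = Field(alias="User Name")
    age: int = Field(gt=0)
    email: EmailStr

# Sample input (illustrative); the "User Name" header matches the alias above
markdown = """
| User Name | Age | Email |
| --- | --- | --- |
| Alice | 30 | alice@example.com |
"""

# Automatically detects Pydantic model and uses it for validation
users = parse_table(markdown).to_models(User)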
```

The parser respects Pydantic's `alias` and `Field` constraints.

### 3. JSON / Dict Export

All result objects (`Workbook`, `Sheet`, `Table`) have a `.json` property that returns a dictionary, making it easy to serialize or pass to other libraries (like Pandas).

```python
import json
import pandas as pd

# Export to JSON (`workbook` is a parsed workbook, e.g. from `parse_workbook` above)
print(json.dumps(workbook.json, indent=2))

# Convert to Pandas DataFrame
table_data = workbook.sheets[0].tables[0].json
df = pd.DataFrame(table_data["rows"], columns=table_data["headers"])
```

### 4. Markdown Generation (Round-Trip)

You can modify parsed objects and convert them back to Markdown strings using `to_markdown()`. This enables a complete "Parse -> Modify -> Generate" workflow.

```python
from md_spreadsheet_parser import parse_table, ParsingSchema

markdown = "| A | B |\n|---|---| \n| 1 | 2 |"
table = parse_table(markdown)

# Modify data
table.rows.append(["3", "4"])

# Generate Markdown
# You can customize the output format using a schema
schema = ParsingSchema(require_outer_pipes=True)
print(table.to_markdown(schema))
# | A | B |
# | --- | --- |
# | 1 | 2 |
# | 3 | 4 |
```
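
For intuition about the generated format, the output above can be reproduced with a few lines of plain Python. This is a sketch of the rendering idea only, not the library's `to_markdown()` implementation; `render_table` is a hypothetical helper that always emits outer pipes.

```python
def render_table(headers, rows):
    """Render headers and rows as a Markdown table with outer pipes."""
    def line(cells):
        return "| " + " | ".join(cells) + " |"

    separator = line(["---"] * len(headers))
    return "\n".join([line(headers), separator, *map(line, rows)])


headers = ["A", "B"]
rows = [["1", "2"]]
rows.append(["3", "4"])  # same modification as in the example above

print(render_table(headers, rows))
# | A | B |
# | --- | --- |
# | 1 | 2 |
# | 3 | 4 |
```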

### 5. Advanced Features

**Metadata Extraction (Table Names & Descriptions)**
You can configure the parser to extract table names (from headers) and descriptions (text preceding the table).

```python
from md_spreadsheet_parser import parse_workbook, MultiTableParsingSchema

markdown = """
# Tables

## Sales Data

### Q1 Results
Financial performance for the first quarter.

| Month | Revenue |
| ----- | ------- |
| Jan   | 1000    |
"""

# Configure schema to capture table headers (level 3) and descriptions
schema = MultiTableParsingSchema(
    table_header_level=3,     # Treat ### Header as table name
    capture_description=True  # Capture text between header and table
)

workbook = parse_workbook(markdown, schema)
table = workbook.sheets[0].tables[0]

print(f"Table: {table.name}")        # "Q1 Results"
print(f"Desc: {table.description}")  # "Financial performance for the first quarter."
```

**Lookup API**
Retrieve sheets and tables directly by name instead of iterating.

```python
sheet = workbook.get_sheet("Sales Data")
if sheet:
    table = sheet.get_table("Q1 Results")
    if table:
        print(table.rows)
```

**Simple Scan Interface**
If you want to extract *all* tables from a document regardless of its structure (ignoring sheets and headers), use `scan_tables`.

```python
from md_spreadsheet_parser import scan_tables

markdown = """
Here is some text.

| ID | Name |
| -- | ---- |
| 1  | Alice|

More text...

| ID | Item |
| -- | ---- |
| A  | Apple|
"""

# Returns a flat list of all tables found
tables = scan_tables(markdown)

print(len(tables))
# 2
```

### 6. Advanced Type Conversion

You can customize how string values are converted to Python objects by passing a `ConversionSchema` to `to_models()`. This is useful for internationalization (I18N) and handling custom types.

**Internationalization (I18N): Custom Boolean Pairs**

Configure which string pairs map to `True`/`False` (case-insensitive).

```python
from md_spreadsheet_parser import parse_table, ConversionSchema

markdown = """
| User | Active? |
| --- | --- |
| Tanaka | はい |
| Suzuki | いいえ |
"""

# Configure "はい" -> True, "いいえ" -> False
schema = ConversionSchema(
    boolean_pairs=(("はい", "いいえ"),)
)

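# `User` is assumed to be a dataclass whose fields match these columns
users = parse_table(markdown).to_models(User, conversion_schema=schema)
# The boolean field is True for Tanaka and False for Suzuki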
```
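
Conceptually, a set of boolean pairs acts as a case-insensitive lookup table. The sketch below shows that idea with the standard library only; `make_pair_lookup` and `to_bool` are hypothetical names, and the library's own matching logic may differ.

```python
def make_pair_lookup(boolean_pairs):
    """Build a case-insensitive text -> bool mapping from (true, false) pairs."""
    lookup = {}
    for true_text, false_text in boolean_pairs:
        lookup[true_text.casefold()] = True
        lookup[false_text.casefold()] = False
    return lookup


lookup = make_pair_lookup((("はい", "いいえ"), ("Yes", "No")))


def to_bool(text):
    """Look up a cell value, ignoring case and surrounding whitespace."""
    key = text.strip().casefold()
    if key not in lookup:
        raise ValueError(f"not a recognized boolean: {text!r}")
    return lookup[key]


print(to_bool("はい"))    # True
print(to_bool("いいえ"))  # False
print(to_bool(" YES "))   # True
```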

**Custom Type Converters**

Register custom conversion functions for specific types. You can use **ANY Python type** as a key, including:

- **Built-ins**: `int`, `float`, `bool` (to override default behavior)
- **Standard Library**: `Decimal`, `datetime`, `date`, `ZoneInfo`, `UUID`
- **Custom Classes**: Your own data classes or objects

Example using standard library types and a custom class:

```python
from dataclasses import dataclass
from uuid import UUID
from zoneinfo import ZoneInfo
from md_spreadsheet_parser import ConversionSchema, parse_table

@dataclass
class Color:
    r: int
    g: int
    b: int

@dataclass
class Config:
    timezone: ZoneInfo
    session_id: UUID
    theme_color: Color

markdown = """
| Timezone | Session ID | Theme Color |
| --- | --- | --- |
| Asia/Tokyo | 12345678-1234-5678-1234-567812345678 | 255,0,0 |
"""

schema = ConversionSchema(
    custom_converters={
        # Standard Library Types
        ZoneInfo: lambda v: ZoneInfo(v),
        UUID: lambda v: UUID(v),
        # Custom Class
        Color: lambda v: Color(*map(int, v.split(",")))
    }
)

data = parse_table(markdown).to_models(Config, conversion_schema=schema)
# data[0].timezone is ZoneInfo("Asia/Tokyo")
# data[0].theme_color is Color(255, 0, 0)
```

**Field-Specific Converters**

For granular control, you can define converters for specific field names, which take precedence over type-based converters.

```python
from datetime import datetime
from decimal import Decimal
from md_spreadsheet_parser import ConversionSchema, parse_table

# Placeholder converters; supply your own parsing logic
def parse_usd(val): ...
def parse_jpy(val): ...

schema = ConversionSchema(
    # Type-based defaults (Low priority)
    custom_converters={
        Decimal: parse_usd
    },
    # Field-name overrides (High priority)
    field_converters={
        "price_jpy": parse_jpy,
        "created_at": lambda x: datetime.strptime(x, "%Y/%m/%d")
    }
)

# price_usd (no override) -> custom_converters (parse_usd)
# price_jpy (override)    -> field_converters (parse_jpy)
# `Product` and `markdown` are assumed to be defined elsewhere
data = parse_table(markdown).to_models(Product, conversion_schema=schema)
```
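
The precedence rule described above (field name first, then type) can be pictured as a simple two-step lookup. The following is a standalone sketch of that resolution order, not the library's code; `resolve_converter` is a hypothetical helper, and the two currency parsers are simplified stand-ins.

```python
from decimal import Decimal


def parse_usd(val):
    # Simplified stand-in: strip "$" and thousands separators.
    return Decimal(val.replace("$", "").replace(",", ""))


def parse_jpy(val):
    # Simplified stand-in: strip "¥" and thousands separators.
    return Decimal(val.replace("¥", "").replace(",", ""))


custom_converters = {Decimal: parse_usd}
field_converters = {"price_jpy": parse_jpy}


def resolve_converter(field_name, field_type):
    """Pick a converter: field-name override, then type-based, then the type itself."""
    if field_name in field_converters:
        return field_converters[field_name]
    if field_type in custom_converters:
        return custom_converters[field_type]
    return field_type


print(resolve_converter("price_usd", Decimal)("$1,000.50"))  # 1000.50
print(resolve_converter("price_jpy", Decimal)("¥500"))       # 500
```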

**Standard Converters Library**

For common patterns (currencies, lists), you can use the built-in helper functions in `md_spreadsheet_parser.converters` instead of writing your own.

```python
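from datetime import datetime
from decimal import Decimal
from zoneinfo import ZoneInfo
from md_spreadsheet_parser import ConversionSchema
from md_spreadsheet_parser.converters import (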
    to_decimal_clean,        # Handles "$1,000", "¥500" -> Decimal
    make_datetime_converter, # Factory for parse/TZ logic
    make_list_converter,     # "a,b,c" -> ["a", "b", "c"]
    make_bool_converter      # Custom strict boolean sets
)

schema = ConversionSchema(
    custom_converters={
        # Currency: removes $, ¥, €, £, comma, space
        Decimal: to_decimal_clean,
        # DateTime: ISO format default, attach Tokyo TZ if naive
        datetime: make_datetime_converter(tz=ZoneInfo("Asia/Tokyo")),
        # Lists: Split by comma, strip whitespace
        list: make_list_converter(separator=",")
    },
    field_converters={
        # Custom boolean for specific field
        "is_valid": make_bool_converter(true_values=["OK"], false_values=["NG"])
    }
)
```

### 7. Robustness (Handling Malformed Tables)

The parser is designed to handle imperfect markdown tables gracefully.

*   **Missing Columns**: Rows with fewer columns than the header are automatically **padded** with empty strings.
*   **Extra Columns**: Rows with more columns than the header are automatically **truncated**.

```python
from md_spreadsheet_parser import parse_table

# Row 1 is missing a column; row 2 has an extra one.
markdown = """
| A | B |
|---|---|
| 1 |
| 1 | 2 | 3 |
"""

table = parse_table(markdown)

print(table.rows)
# [['1', ''], ['1', '2']]
```

This ensures that `table.rows` always matches the structure of `table.headers`, preventing crashes during iteration or validation.
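
The padding and truncation rules amount to normalizing every row to the header width. A minimal standard-library sketch of that idea (`normalize_row` is a hypothetical helper, not part of the library's API):

```python
def normalize_row(row, width):
    """Pad short rows with empty strings and cut long rows to `width` cells."""
    padded = row + [""] * (width - len(row))  # no-op when the row is long enough
    return padded[:width]


headers = ["A", "B"]
raw_rows = [["1"], ["1", "2", "3"]]

rows = [normalize_row(row, len(headers)) for row in raw_rows]
print(rows)
# [['1', ''], ['1', '2']]
```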

### 8. In-Cell Line Break Support

The parser automatically converts HTML line breaks to Python newlines (`\n`). This enables handling multiline cells naturally.

**Supported Tags (Case-Insensitive):**
- `<br>`
- `<br/>`
- `<br />`

```python
from md_spreadsheet_parser import parse_table

markdown = "| Line1<br>Line2 |"
table = parse_table(markdown)
# table.rows[0][0] == "Line1\nLine2"
```

**Round-Trip Support:**
When generating Markdown (e.g., `table.to_markdown()`), Python newlines (`\n`) are automatically converted back to `<br>` tags to preserve the table structure.

To disable this, set `convert_br_to_newline=False` in `ParsingSchema`.
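
The tag handling described in this section can be approximated with one case-insensitive regular expression. This is a sketch of the idea rather than the library's implementation; `br_to_newline` and `newline_to_br` are hypothetical names.

```python
import re

# Matches <br>, <br/>, and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def br_to_newline(cell):
    """Parsing direction: HTML line breaks become Python newlines."""
    return BR_TAG.sub("\n", cell)


def newline_to_br(cell):
    """Generation direction: newlines become <br> so the row stays on one line."""
    return cell.replace("\n", "<br>")


parsed = br_to_newline("Line1<br>Line2<BR/>Line3<br />Line4")
print(repr(parsed))           # 'Line1\nLine2\nLine3\nLine4'
print(newline_to_br(parsed))  # Line1<br>Line2<br>Line3<br>Line4
```

Note that the round trip normalizes every variant to `<br>`, so the original spelling of the tag is not preserved.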

### 9. Performance & Scalability (Streaming API)

**Do you really have a 10GB Markdown file?**

Probably not. We sincerely hope you don't. Markdown wasn't built for that.

But *if you do* (perhaps you're generating extensive logs or auditing standard converters), this library has your back. While Excel gives up after 1,048,576 rows, `md-spreadsheet-parser` can stream files of arbitrary size, keeping memory usage bounded by the largest single table rather than by the size of the file.

**scan_tables_iter**:
This function reads the file line-by-line and yields `Table` objects as they are found. It does **not** load the entire file into memory.

```python
from md_spreadsheet_parser import scan_tables_iter

# Process a massive log file (e.g., 10GB)
# Memory usage remains low (only the size of a single table block)
for table in scan_tables_iter("huge_server_log.md"):
    print(f"Found table with {len(table.rows)} rows")
    
    # Process rows...
    for row in table.rows:
        pass
```

This is ideal for data pipelines, log analysis, and processing exports that are too large to open in standard spreadsheet editors.
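
To illustrate why streaming keeps memory low, here is a simplified generator that consumes lines lazily and holds only the current table block. It is a sketch of the general technique, not the implementation of `scan_tables_iter`: `iter_table_blocks` is a hypothetical function, it yields raw lines instead of `Table` objects, and it assumes every table row starts with a pipe.

```python
import io
from typing import Iterable, Iterator


def iter_table_blocks(lines: Iterable[str]) -> Iterator[list[str]]:
    """Yield each run of consecutive pipe-prefixed lines as one block."""
    block: list[str] = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("|"):
            block.append(line)
        elif block:
            # A non-table line ends the current block.
            yield block
            block = []
    if block:
        yield block  # the file ended in the middle of a table


# Any iterable of lines works, including an open file object.
document = io.StringIO(
    "Here is some text.\n\n"
    "| ID | Name |\n| -- | ---- |\n| 1  | Alice|\n\n"
    "More text...\n\n"
    "| ID | Item |\n| -- | ---- |\n| A  | Apple|\n"
)

blocks = list(iter_table_blocks(document))
print(len(blocks))
# 2
```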

### Command Line Interface (CLI)

You can use the `md-spreadsheet-parser` command to parse Markdown files and output JSON. This is useful for piping data to other tools.

```bash
# Read from file
md-spreadsheet-parser input.md

# Read from stdin (pipe)
cat input.md | md-spreadsheet-parser
```

**Options:**
- `--scan`: Scan for all tables ignoring workbook structure (returns a list of tables).
- `--root-marker`: Set the root marker (default: `# Tables`).
- `--sheet-header-level`: Set sheet header level (default: 2).
- `--table-header-level`: Set table header level (default: 3).
- `--capture-description`: Capture table descriptions (default: True).
- `--column-separator`: Character used to separate columns (default: `|`).
- `--header-separator-char`: Character used in the separator row (default: `-`).
- `--no-outer-pipes`: Allow tables without outer pipes (default: False).
- `--no-strip-whitespace`: Do not strip whitespace from cell values (default: False).
- `--no-br-conversion`: Disable automatic conversion of `<br>` tags to newlines (default: False).

## Configuration

Customize parsing behavior using `ParsingSchema` and `MultiTableParsingSchema`.

| Option | Default | Description |
| :--- | :--- | :--- |
| `column_separator` | `\|` | Character used to separate columns. |
| `header_separator_char` | `-` | Character used in the separator row. |
| `require_outer_pipes` | `True` | If `True`, generated markdown tables will include outer pipes. |
| `strip_whitespace` | `True` | If `True`, whitespace is stripped from cell values. |
| `convert_br_to_newline` | `True` | If `True`, `<br>` tags are converted to `\n` (and back). |
| `root_marker` | `# Tables` | (MultiTable) Marker indicating start of data section. |
| `sheet_header_level` | `2` | (MultiTable) Header level for sheets. |
| `table_header_level` | `3` | (MultiTable) Header level for tables. |
| `capture_description` | `True` | (MultiTable) Capture text between header and table. |
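
As a rough picture of how `root_marker` and `sheet_header_level` interact, the sketch below splits a document into per-sheet line lists. It is a deliberately simplified illustration under stated assumptions, not the library's parser: `split_sheets` is a hypothetical function, and it ignores table headers, descriptions, and any heading that would end the data section.

```python
def split_sheets(text, root_marker="# Tables", sheet_header_level=2):
    """Group the lines after `root_marker` by sheet heading."""
    prefix = "#" * sheet_header_level + " "
    sheets = {}
    current = None
    in_root = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == root_marker:
            in_root = True
            continue
        if not in_root:
            continue  # everything before the root marker is ignored
        if stripped.startswith(prefix):
            current = stripped[len(prefix):].strip()
            sheets[current] = []
        elif current is not None:
            sheets[current].append(line)
    return sheets


document = """
# Tables

## Users
| ID | Name |
| -- | ---- |
| 1  | Alice|

## Products
| ID | Item |
| -- | ---- |
| A  | Apple|
"""

sheets = split_sheets(document)
print(list(sheets))
# ['Users', 'Products']
```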

## Future Roadmap

We plan to extend the library to support **Visual Metadata** for better integration with rich Markdown editors.

- **Column Widths**: Persisting user-adjusted column widths.
- **Conditional Formatting**: Highlighting cells based on values.
- **Data Types**: Explicitly defining column types (e.g., currency, date) for better editor UX.

## License

This project is licensed under the [MIT License](LICENSE).
