Metadata-Version: 2.4
Name: datashard
Version: 0.2.5
Summary: Safe concurrent data operations for ML/AI workloads, Python implementation of Apache Iceberg concepts with S3 support
Author-email: RODMENA LIMITED <contact@rodmena.com>
License:                                  Apache License
                                   Version 2.0, January 2004
                                http://www.apache.org/licenses/
        
           TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
        
           1. Definitions.
        
              "License" shall mean the terms and conditions for use, reproduction,
              and distribution as defined by Sections 1 through 9 of this document.
        
              "Licensor" shall mean the copyright owner or entity authorized by
              the copyright owner that is granting the License.
        
              "Legal Entity" shall mean the union of the acting entity and all
              other entities that control, are controlled by, or are under common
              control with that entity. For the purposes of this definition,
              "control" means (i) the power, direct or indirect, to cause the
              direction or management of such entity, whether by contract or
              otherwise, or (ii) ownership of fifty percent (50%) or more of the
              outstanding shares, or (iii) beneficial ownership of such entity.
        
              "You" (or "Your") shall mean an individual or Legal Entity
              exercising permissions granted by this License.
        
              "Source" form shall mean the preferred form for making modifications,
              including but not limited to software source code, documentation
              source, and configuration files.
        
              "Object" form shall mean any form resulting from mechanical
              transformation or translation of a Source form, including but
              not limited to compiled object code, generated documentation,
              and conversions to other media types.
        
              "Work" shall mean the work of authorship, whether in Source or
              Object form, made available under the License, as indicated by a
              copyright notice that is included in or attached to the work
              (an example is provided in the Appendix below).
        
              "Derivative Works" shall mean any work, whether in Source or Object
              form, that is based on (or derived from) the Work and for which the
              editorial revisions, annotations, elaborations, or other modifications
              represent, as a whole, an original work of authorship. For the purposes
              of this License, Derivative Works shall not include works that remain
              separable from, or merely link (or bind by name) to the interfaces of,
              the Work and Derivative Works thereof.
        
              "Contribution" shall mean any work of authorship, including
              the original version of the Work and any modifications or additions
              to that Work or Derivative Works thereof, that is intentionally
              submitted to Licensor for inclusion in the Work by the copyright owner
              or by an individual or Legal Entity authorized to submit on behalf of
              the copyright owner. For the purposes of this definition, "submitted"
              means any form of electronic, verbal, or written communication sent
              to the Licensor or its representatives, including but not limited to
              communication on electronic mailing lists, source code control systems,
              and issue tracking systems that are managed by, or on behalf of, the
              Licensor for the purpose of discussing and improving the Work, but
              excluding communication that is conspicuously marked or otherwise
              designated in writing by the copyright owner as "Not a Contribution."
        
              "Contributor" shall mean Licensor and any individual or Legal Entity
              on behalf of whom a Contribution has been received by Licensor and
              subsequently incorporated within the Work.
        
           2. Grant of Copyright License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              copyright license to reproduce, prepare Derivative Works of,
              publicly display, publicly perform, sublicense, and distribute the
              Work and such Derivative Works in Source or Object form.
        
           3. Grant of Patent License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              (except as stated in this section) patent license to make, have made,
              use, offer to sell, sell, import, and otherwise transfer the Work,
              where such license applies only to those patent claims licensable
              by such Contributor that are necessarily infringed by their
              Contribution(s) alone or by combination of their Contribution(s)
              with the Work to which such Contribution(s) was submitted. If You
              institute patent litigation against any entity (including a
              cross-claim or counterclaim in a lawsuit) alleging that the Work
              or a Contribution incorporated within the Work constitutes direct
              or contributory patent infringement, then any patent licenses
              granted to You under this License for that Work shall terminate
              as of the date such litigation is filed.
        
           4. Redistribution. You may reproduce and distribute copies of the
              Work or Derivative Works thereof in any medium, with or without
              modifications, and in Source or Object form, provided that You
              meet the following conditions:
        
              (a) You must give any other recipients of the Work or
                  Derivative Works a copy of this License; and
        
              (b) You must cause any modified files to carry prominent notices
                  stating that You changed the files; and
        
              (c) You must retain, in the Source form of any Derivative Works
                  that You distribute, all copyright, patent, trademark, and
                  attribution notices from the Source form of the Work,
                  excluding those notices that do not pertain to any part of
                  the Derivative Works; and
        
              (d) If the Work includes a "NOTICE" text file as part of its
                  distribution, then any Derivative Works that You distribute must
                  include a readable copy of the attribution notices contained
                  within such NOTICE file, excluding those notices that do not
                  pertain to any part of the Derivative Works, in at least one
                  of the following places: within a NOTICE text file distributed
                  as part of the Derivative Works; within the Source form or
                  documentation, if provided along with the Derivative Works; or,
                  within a display generated by the Derivative Works, if and
                  wherever such third-party notices normally appear. The contents
                  of the NOTICE file are for informational purposes only and
                  do not modify the License. You may add Your own attribution
                  notices within Derivative Works that You distribute, alongside
                  or as an addendum to the NOTICE text from the Work, provided
                  that such additional attribution notices cannot be construed
                  as modifying the License.
        
              You may add Your own copyright statement to Your modifications and
              may provide additional or different license terms and conditions
              for use, reproduction, or distribution of Your modifications, or
              for any such Derivative Works as a whole, provided Your use,
              reproduction, and distribution of the Work otherwise complies with
              the conditions stated in this License.
        
           5. Submission of Contributions. Unless You explicitly state otherwise,
              any Contribution intentionally submitted for inclusion in the Work
              by You to the Licensor shall be under the terms and conditions of
              this License, without any additional terms or conditions.
              Notwithstanding the above, nothing herein shall supersede or modify
              the terms of any separate license agreement you may have executed
              with Licensor regarding such Contributions.
        
           6. Trademarks. This License does not grant permission to use the trade
              names, trademarks, service marks, or product names of the Licensor,
              except as required for reasonable and customary use in describing the
              origin of the Work and reproducing the content of the NOTICE file.
        
           7. Disclaimer of Warranty. Unless required by applicable law or
              agreed to in writing, Licensor provides the Work (and each
              Contributor provides its Contributions) on an "AS IS" BASIS,
              WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
              implied, including, without limitation, any warranties or conditions
              of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
              PARTICULAR PURPOSE. You are solely responsible for determining the
              appropriateness of using or redistributing the Work and assume any
              risks associated with Your exercise of permissions under this License.
        
           8. Limitation of Liability. In no event and under no legal theory,
              whether in tort (including negligence), contract, or otherwise,
              unless required by applicable law (such as deliberate and grossly
              negligent acts) or agreed to in writing, shall any Contributor be
              liable to You for damages, including any direct, indirect, special,
              incidental, or consequential damages of any character arising as a
              result of this License or out of the use or inability to use the
              Work (including but not limited to damages for loss of goodwill,
              work stoppage, computer failure or malfunction, or any and all
              other commercial damages or losses), even if such Contributor
              has been advised of the possibility of such damages.
        
           9. Accepting Warranty or Additional Liability. While redistributing
              the Work or Derivative Works thereof, You may choose to offer,
              and charge a fee for, acceptance of support, warranty, indemnity,
              or other liability obligations and/or rights consistent with this
              License. However, in accepting such obligations, You may act only
              on Your own behalf and on Your sole responsibility, not on behalf
              of any other Contributor, and only if You agree to indemnify,
              defend, and hold each Contributor harmless for any liability
              incurred by, or claims asserted against, such Contributor by reason
              of your accepting any such warranty or additional liability.
        
           END OF TERMS AND CONDITIONS
        
           APPENDIX: How to apply the Apache License to your work.
        
              To apply the Apache License to your work, attach the following
              boilerplate notice, with the fields enclosed by brackets "[]"
              replaced with your own identifying information. (Don't include
              the brackets!)  The text should be enclosed in the appropriate
              comment syntax for the file format. We also recommend that a
              file or class name and description of purpose be included on the
              same "printed page" as the copyright notice for easier
              identification within third-party archives.
        
           Copyright [2025] [name of copyright owner]
        
           Licensed under the Apache License, Version 2.0 (the "License");
           you may not use this file except in compliance with the License.
           You may obtain a copy of the License at
        
               http://www.apache.org/licenses/LICENSE-2.0
        
           Unless required by applicable law or agreed to in writing, software
           distributed under the License is distributed on an "AS IS" BASIS,
           WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
           See the License for the specific language governing permissions and
           limitations under the License.
        
Project-URL: Homepage, https://github.com/rodmena-limited/datashard
Project-URL: Repository, https://github.com/rodmena-limited/datashard
Project-URL: Documentation, https://datashard.readthedocs.io/
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyarrow>=10.0.0
Requires-Dist: typing-extensions>=3.10.0; python_version < "3.10"
Provides-Extra: s3
Requires-Dist: boto3>=1.26.0; extra == "s3"
Provides-Extra: pandas
Requires-Dist: pandas>=1.3.0; extra == "pandas"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: pytest-xdist>=3.0; extra == "dev"
Requires-Dist: black>=22.0; extra == "dev"
Requires-Dist: ruff>=0.0.247; extra == "dev"
Requires-Dist: isort>=5.0; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Requires-Dist: types-setuptools>=65.0; extra == "dev"
Requires-Dist: boto3>=1.26.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=4.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.0; extra == "docs"
Requires-Dist: sphinx-autodoc-typehints>=1.0; extra == "docs"
Dynamic: license-file

# DataShard

```
                  //
                 //
                //
               //
              //
             //
            //
           //
          //
         //
        //
       //
      //___________________
     //                   /
    //___________________/

       D A T A S H A R D
```

**Iceberg-Inspired Safe Concurrent Data Operations for Python**

---

## What is DataShard?

DataShard is a Python implementation of Apache Iceberg's core concepts, providing ACID-compliant data operations for ML/AI workloads. It ensures your data stays safe even when multiple processes read and write simultaneously.

### Key Features

- **ACID Transactions:** Operations fully complete or fully rollback
- **Time Travel:** Query data as it existed at any point in time
- **Safe Concurrency:** Multiple processes can write without corruption
- **Optimistic Concurrency Control (OCC):** Automatic conflict resolution
- **S3-Compatible Storage:** AWS S3, MinIO, DigitalOcean Spaces support (NEW in v0.2.2)
- **Distributed Workflows:** Run workers across different machines with shared S3 storage
- **Pure Python:** No Java dependencies, easy setup
- **pandas Integration:** Native DataFrame support

---

## Why DataShard Matters: The Data Corruption Problem

**The Problem:** Regular file operations in Python are NOT safe for concurrent access.

When multiple processes write to the same file, you get:
- Data loss from race conditions
- Partial writes that corrupt your data
- Silent failures that go unnoticed
- Unpredictable results

### Our Stress Test Results

**Test Setup:** 12 processes, each performing 10,000 operations
**Expected Result:** 120,000 total operations

**Regular Files:**
- Actual Result: Only 8,566 operations completed
- Data Loss: 111,434 operations LOST (93% failure rate!)
- Cause: Race conditions from simultaneous file access

**DataShard (Iceberg):**
- Actual Result: 120,000 operations completed
- Data Loss: ZERO (0% failure rate)
- Cause: ACID transactions with OCC prevent all race conditions

---

## Installation

Basic installation:
```bash
pip install datashard
```

With pandas support (recommended):
```bash
pip install datashard[pandas]
```

With S3 support (AWS S3, MinIO, etc.):
```bash
pip install datashard[s3]
```

With all features:
```bash
pip install datashard[pandas,s3]
```

With development tools:
```bash
pip install datashard[dev]
```

---

## Quick Start

```python
from datashard import create_table, Schema

# 1. Define your schema
schema = Schema(
    schema_id=1,
    fields=[
        {"id": 1, "name": "user_id", "type": "long", "required": True},
        {"id": 2, "name": "event", "type": "string", "required": True},
        {"id": 3, "name": "timestamp", "type": "long", "required": True}
    ]
)

# 2. Create a table
table = create_table("/path/to/my_table", schema)

# 3. Append records safely (even from multiple processes!)
records = [
    {"user_id": 100, "event": "login", "timestamp": 1700000000},
    {"user_id": 101, "event": "logout", "timestamp": 1700000100}
]
success = table.append_records(records, schema)

# 4. Query current data
snapshot = table.current_snapshot()
print(f"Snapshot ID: {snapshot.snapshot_id}")

# 5. Time travel to previous state
old_snapshot = table.snapshot_by_id(previous_snapshot_id)
```

---

## Real-World Use Case: Workflow Execution Logging

### Problem

A distributed workflow engine runs 100s of tasks across multiple workers. Each task needs to log its execution details (status, duration, errors) to a central location. Traditional approaches fail:

- **Database:** Too slow, adds latency to task execution
- **Log Files:** Corrupted by concurrent writes, not queryable
- **JSON Files:** Race conditions cause data loss

### Solution: DataShard as a Queryable Audit Log

DataShard provides ACID-compliant logging with pandas-queryable tables:

```python
from datashard import create_table, Schema
import os
import pandas as pd

# Define task log schema
task_log_schema = Schema(
    schema_id=1,
    fields=[
        {"id": 1, "name": "task_id", "type": "string", "required": True},
        {"id": 2, "name": "workflow_id", "type": "string", "required": True},
        {"id": 3, "name": "status", "type": "string", "required": True},
        {"id": 4, "name": "started_at", "type": "long", "required": True},
        {"id": 5, "name": "completed_at", "type": "long", "required": True},
        {"id": 6, "name": "duration_ms", "type": "long", "required": True},
        {"id": 7, "name": "error_message", "type": "string", "required": False},
        {"id": 8, "name": "worker_id", "type": "string", "required": True}
    ]
)

# Create task logs table (shared across all workers)
task_logs = create_table("/shared/storage/task_logs", task_log_schema)

# Worker 1: Log successful task execution
task_logs.append_records([{
    "task_id": "task-001",
    "workflow_id": "wf-123",
    "status": "success",
    "started_at": 1700000000000,
    "completed_at": 1700000150000,
    "duration_ms": 150000,
    "error_message": "",
    "worker_id": "worker-1"
}], task_log_schema)

# Worker 2: Log failed task (concurrent with Worker 1)
task_logs.append_records([{
    "task_id": "task-002",
    "workflow_id": "wf-123",
    "status": "failed",
    "started_at": 1700000000000,
    "completed_at": 1700000200000,
    "duration_ms": 200000,
    "error_message": "Connection timeout",
    "worker_id": "worker-2"
}], task_log_schema)

# Query logs with pandas (from any machine)
from datashard import load_table
from datashard.file_manager import FileManager

table = load_table("/shared/storage/task_logs")
snapshot = table.current_snapshot()
file_manager = FileManager(table.table_path, table.metadata_manager)

# Read manifest list
manifest_list_path = os.path.join(table.table_path, snapshot.manifest_list)
manifests = file_manager.read_manifest_list_file(manifest_list_path)

# Read all data files
data_frames = []
for manifest_file in manifests:
    manifest_path = os.path.join(table.table_path, manifest_file.manifest_path)
    data_files = file_manager.read_manifest_file(manifest_path)

    for data_file in data_files:
        parquet_path = os.path.join(table.table_path, data_file.file_path.lstrip('/'))
        df = pd.read_parquet(parquet_path)
        data_frames.append(df)

logs_df = pd.concat(data_frames, ignore_index=True)

# Analyze logs
failed_tasks = logs_df[logs_df['status'] == 'failed']
avg_duration = logs_df['duration_ms'].mean()
print(f"Failed tasks: {len(failed_tasks)}")
print(f"Average duration: {avg_duration}ms")
```

### Why DataShard Wins for Logging

- **Safe Concurrent Writes:** 100 workers can log simultaneously without corruption
- **Queryable with pandas:** Analyze logs with familiar DataFrame operations
- **Time Travel:** Debug by querying logs as they existed at incident time
- **ACID Guarantees:** Never lose log entries, even during crashes
- **Scalable:** Handles millions of log entries efficiently
- **No Database Needed:** File-based storage, no DB maintenance overhead

---

## Other Real-World Use Cases

### 1. ML Training Metrics
- Multiple training runs logging metrics simultaneously
- Query metrics history for model comparison
- Time travel to analyze model performance at specific epochs

### 2. A/B Test Results
- Concurrent experiments writing results safely
- Aggregate results across all test variants
- Historical analysis of test performance

### 3. Data Pipeline Checkpointing
- Multiple pipeline stages writing progress markers
- Safe recovery from failures using checkpoints
- Query pipeline state for monitoring

### 4. Feature Store
- Multiple processes computing and storing features
- Time travel for point-in-time feature retrieval
- Consistent feature snapshots for reproducibility

---

## S3-Compatible Storage (NEW in v0.2.2)

DataShard now supports S3-compatible storage backends, enabling truly distributed workflows across multiple machines without shared filesystems.

### Supported Storage Backends

- **AWS S3** - Amazon's Simple Storage Service
- **MinIO** - Self-hosted S3-compatible storage
- **DigitalOcean Spaces** - Cloud object storage
- **Wasabi** - S3-compatible cloud storage
- **Any S3-compatible API** - Including on-premise solutions

### Configuration

DataShard automatically selects the storage backend based on environment variables:

```bash
# Local filesystem (default - no configuration needed)
# Just use create_table("/path/to/table", schema)

# S3-compatible storage
export DATASHARD_STORAGE_TYPE=s3
export DATASHARD_S3_ENDPOINT=https://s3.amazonaws.com  # AWS S3
export DATASHARD_S3_ACCESS_KEY=your-access-key
export DATASHARD_S3_SECRET_KEY=your-secret-key
export DATASHARD_S3_BUCKET=my-datashard-bucket
export DATASHARD_S3_REGION=us-east-1
export DATASHARD_S3_PREFIX=optional/prefix/  # Optional: namespace within bucket
```

### Usage Example: Distributed Workflow Logging

**Scenario:** Workers running on different EC2 instances need to log to the same table.

```python
from datashard import create_table, load_table, Schema

# Define schema (same across all workers)
schema = Schema(
    schema_id=1,
    fields=[
        {"id": 1, "name": "task_id", "type": "string", "required": True},
        {"id": 2, "name": "worker_id", "type": "string", "required": True},
        {"id": 3, "name": "status", "type": "string", "required": True},
        {"id": 4, "name": "timestamp", "type": "long", "required": True},
    ]
)

# Worker 1 on EC2 instance us-east-1a
table = create_table("task-logs", schema)  # Creates in S3
table.append_records([
    {"task_id": "t1", "worker_id": "worker-1", "status": "success", "timestamp": 1700000000}
], schema)

# Worker 2 on EC2 instance us-east-1b (simultaneously)
table = load_table("task-logs")  # Loads from S3
table.append_records([
    {"task_id": "t2", "worker_id": "worker-2", "status": "success", "timestamp": 1700000001}
], schema)

# Both writes succeed with ACID guarantees!
# No shared filesystem required - just S3 access
```

### MinIO Example (Self-Hosted)

```bash
# Configure MinIO endpoint
export DATASHARD_STORAGE_TYPE=s3
export DATASHARD_S3_ENDPOINT=https://minio.mycompany.com
export DATASHARD_S3_ACCESS_KEY=minioadmin
export DATASHARD_S3_SECRET_KEY=minioadmin
export DATASHARD_S3_BUCKET=datashard
export DATASHARD_S3_REGION=us-east-1
```

```python
# Same code works with MinIO
from datashard import create_table, Schema

schema = Schema(schema_id=1, fields=[
    {"id": 1, "name": "metric", "type": "string", "required": True},
    {"id": 2, "name": "value", "type": "double", "required": True},
])

table = create_table("metrics", schema)
table.append_records([
    {"metric": "cpu_usage", "value": 45.2},
    {"metric": "memory_usage", "value": 67.8}
], schema)
```

### AWS S3 with IAM Roles

On EC2/ECS/Lambda, boto3 automatically uses IAM roles - no credentials needed:

```bash
# Minimal configuration (credentials from IAM role)
export DATASHARD_STORAGE_TYPE=s3
export DATASHARD_S3_BUCKET=my-bucket
export DATASHARD_S3_REGION=us-east-1
```

### Benefits of S3 Storage

**Distributed Workflows:**
- Workers on different machines/regions can share tables
- No NFS or shared filesystem required
- Auto-scaling friendly architecture

**Durability:**
- 99.999999999% (11 9's) durability with S3
- Automatic replication and redundancy
- No risk of local disk failures

**Cost-Effective:**
- Pay only for storage used
- ~$0.023/GB/month for S3 Standard
- Cheaper than EBS for cold data

**Performance:**
- Concurrent reads/writes from multiple workers
- Same ACID guarantees as local storage
- Optimistic concurrency control prevents conflicts

### Local vs S3 Performance

| Operation | Local Filesystem | S3 (same region) |
|-----------|-----------------|------------------|
| Write (1000 records) | ~1ms | ~50ms |
| Read (50k records) | ~20ms | ~100ms |
| Concurrent writes | ✅ Safe | ✅ Safe |
| Cross-machine access | ❌ Needs NFS | ✅ Native |
| Durability | Single disk | 11 9's |

**Recommendation:** Use local storage for single-machine workloads, S3 for distributed workflows.

### Switching Between Storage Backends

The same Python code works for both backends - just change environment variables:

```python
# This code works unchanged with local OR S3 storage
from datashard import create_table, Schema

schema = Schema(schema_id=1, fields=[...])
table = create_table("my-table", schema)
table.append_records(records, schema)
```

**Local mode:** No env vars (default)
**S3 mode:** Set `DATASHARD_STORAGE_TYPE=s3` + S3 credentials

### Security Best Practices

**Never hardcode credentials:**
```python
# ❌ WRONG - hardcoded
table = create_table("s3://bucket/table", schema)

# ✅ CORRECT - use environment variables
export DATASHARD_S3_ACCESS_KEY=...
table = create_table("table", schema)
```

**Use IAM roles on AWS:**
- EC2/ECS/Lambda auto-discover IAM credentials
- No access keys needed in code or env vars
- Automatic credential rotation

**Enable S3 encryption:**
- Server-side encryption (SSE-S3 or SSE-KMS)
- Configured at bucket level in AWS Console
- Transparent to DataShard - works automatically

---

## API Reference

### Creating and Loading Tables

```python
from datashard import create_table, load_table, Schema

# Create new table
schema = Schema(schema_id=1, fields=[...])
table = create_table("/path/to/table", schema)

# Load existing table
table = load_table("/path/to/table")
```

### Writing Data

```python
# Append records
records = [{"id": 1, "value": "data"}]
success = table.append_records(records, schema)

# Append with transactions (for complex operations)
with table.new_transaction() as tx:
    tx.append_data(records, schema)
    tx.commit()
```

### Reading Data

```python
# Get current snapshot
snapshot = table.current_snapshot()

# Get specific snapshot
snapshot = table.snapshot_by_id(snapshot_id)

# List all snapshots
snapshots = table.snapshots()

# Time travel
snapshot = table.time_travel(snapshot_id=12345)
snapshot = table.time_travel(timestamp=1700000000000)
```

### Schema Definition

```python
from datashard import Schema

schema = Schema(
    schema_id=1,  # Required: unique schema version ID
    fields=[      # Required: list of field definitions
        {
            "id": 1,              # Field ID (unique within schema)
            "name": "user_id",    # Field name
            "type": "long",       # Data type: long, int, string, double, boolean
            "required": True      # Whether field is required
        },
        # ... more fields
    ]
)
```

---

## How DataShard Compares to Apache Iceberg

### Apache Iceberg
- Java-based, requires JVM
- Built for big data platforms (Spark, Flink, Hive)
- Excellent performance for petabyte-scale data
- Complex setup, requires distributed infrastructure
- Production-grade for enterprise data lakes

### DataShard
- Pure Python, no JVM required
- Built for individual data scientists and ML engineers
- Optimized for personal projects and small teams
- Simple pip install, file-based storage
- Production-ready for Python-centric workflows

**Use Apache Iceberg if:** You're working with petabytes of data in a big data platform with Spark/Flink/Hive infrastructure.

**Use DataShard if:** You're a Python developer who needs safe concurrent data operations without Java dependencies or complex setup.

---

## Technical Details

DataShard implements:
- Optimistic Concurrency Control (OCC) with automatic retry
- Snapshot isolation for consistent reads
- Manifest-based metadata tracking (Iceberg's approach)
- Parquet format for efficient columnar storage
- ACID transaction semantics

---

## Performance Characteristics

- **Writes:** O(1) for append operations (no data rewriting)
- **Reads:** O(n) where n = number of data files in snapshot
- **Concurrency:** Scales linearly with number of processes
- **Storage:** Efficient columnar compression with Parquet
- **Metadata:** Lightweight JSON-based manifest tracking

---

## Limitations

### DataShard is optimized for:
- Append-heavy workloads (logging, metrics, events)
- Hundreds of concurrent writers
- Datasets up to hundreds of millions of records
- Local or network file systems

### DataShard is NOT optimized for:
- High-frequency updates/deletes (use a database)
- Complex queries (use a query engine like DuckDB)
- Petabyte-scale data (use Apache Iceberg with Spark)
- Distributed query processing (use Presto/Trino)

---

## Bug Fix: v0.1.3

**Fixed:** Manifest list creation bug where empty manifest arrays were written during append operations, causing queries to return no data despite parquet files being written correctly.

**Impact:** Queries using the manifest API now return correct data. If you're upgrading from v0.1.2, existing tables will continue to work but won't benefit from the fix until new snapshots are created.

**Details:** Transaction.commit() now correctly extracts data files from queued operations and passes them to _create_manifest_list_with_id().

---

## Development and Testing

Run tests:
```bash
pytest tests/ -v
```

Run specific test:
```bash
pytest tests/test_manifest_creation.py -v
```

With coverage:
```bash
pytest tests/ --cov=datashard --cov-report=html
```

---

## Why I Built DataShard

DataShard was born from my work on the Highway Workflow Engine, a strictly atomic workflow system. I needed to record massive amounts of execution metadata from hundreds of concurrent workers without risking data corruption.

I've had painful experiences with Java-based solutions in the past. Apache Iceberg conceptually fit my needs perfectly, but I wanted a pure Python solution that was simple to deploy and maintain.

Building a production-grade concurrent storage system seemed daunting, but we're in 2025 now, and modern AI tools (specifically Gemini) made it possible to implement complex Optimistic Concurrency Control logic in pure Python.

The result is DataShard: a battle-tested, production-ready solution for safe concurrent data operations that's helped Highway log millions of workflow executions without a single data corruption issue.

If you're building distributed systems in Python and need safe concurrent data operations, DataShard can help.

Enjoy using DataShard!
**Farshid**

---

## License

Copyright (c) RODMENA LIMITED
Licensed under Apache License 2.0

For full license text, see LICENSE file in the repository.

---

## Links

- **Homepage:** https://github.com/rodmena-limited/datashard
- **Documentation:** https://datashard.readthedocs.io/
- **Issues:** https://github.com/rodmena-limited/datashard/issues
