Metadata-Version: 2.1
Name: tsv2sqlite
Version: 0.5.1
Summary: Package for parsing a tab-delimited data file and loading the lines into SQLite3 database
Home-page: https://github.com/jai-python3/tsv2sqlite
Author: Jaideep Sundaram
Author-email: jai.python3@gmail.com
License: GNU General Public License v3
Keywords: tsv2sqlite
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.10
Requires-Python: >=3.10
License-File: LICENSE
License-File: AUTHORS.rst
Requires-Dist: Click >=7.0
Requires-Dist: Rich
Requires-Dist: PyYAML

==========
tsv2sqlite
==========

Package for parsing a tab-delimited data file and loading the lines into SQLite3 database


Installation

.. code-block:: shell

    pip install tsv2sqlite


tsv2sqlite
----------
The software will automatically assign datatype TEXT for all columns.
The target SQLite3 database file name can be specified on invocation via ``--database_file``.
The default database file name will be the basename of the input file with the filename extension replaced with .sqlite.

The target table name can be specified on invocation via ``--table_name``.
The default table name will be the basename of the input file with the filename extension removed.

The configuration file can be used to specify the datatype for each of the columns.
The configuration file can be used to specify which columns should be excluded.

Configure column datatypes
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: yaml

    table_schema:
      sample:
        CHROM: str
        POS: int
        ID: str
        REF: str
        ALT: str
        QUAL: int
        FILTER: str
        INFO: str

Configure column exclusion list
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: yaml

    ignore_columns:
      - FILTER
      - INFO

add-header-row
~~~~~~~~~~~~~~

The software will write a copy of the input tab-delimited file that does not have a header row.
The inserted header row will have column names:
- col_1
- col_2
- col_3
- etc.


Edit configuration file
~~~~~~~~~~~~~~~~~~~~~~~

Edit your configuration file e.g.: ``tsv2sqlite/conf/config.yaml``.


No column mapping mode
----------------------

Run tsv2sqlite with ``--no_column_mapping`` option.

.. code-block:: shell

    tsv2sqlite --infile ~/projects/tsv2sqlite/sample.tsv --config_file ~/projects/tsv2sqlite/tsv2sqlite/conf/config.yaml --no_column_mapping
    --outdir was not specified and therefore was set to '/tmp/tsv2sqlite/2023-12-22-102420'
    Created output directory '/tmp/tsv2sqlite/2023-12-22-102420'
    --logfile was not specified and therefore was set to '/tmp/tsv2sqlite/2023-12-22-102420/main.log'
    --database_file was not specified and therefore was set to '/home/sundaram/projects/tsv2sqlite/sample.tsv.sqlite3'
    The log file is '/tmp/tsv2sqlite/2023-12-22-102420/main.log'
    Execution of '/home/sundaram/projects/tsv2sqlite/venv/lib/python3.10/site-packages/tsv2sqlite/main.py' completed

Observe the SQLite3 database file created.

.. code-block:: shell

    (venv) ➜  tsv2sqlite git:(main) ✗ ls -ltr sample.tsv.sqlite3
    -rw-r--r-- 1 sundaram sundaram 12288 Dec 22 10:24 sample.tsv.sqlite3

Connect to the SQLite3 database.

.. code-block:: shell

    (venv) ➜  tsv2sqlite git:(main) ✗ sqlite3 sample.tsv.sqlite3
    SQLite version 3.37.2 2022-01-06 13:25:41
    Enter ".help" for usage hints.
    sqlite>

Observe the two tables created.

.. code-block:: sql

    sqlite> .schema
    CREATE TABLE provenance (
                bytesize INTEGER NOT NULL,
                abspath TEXT NOT NULL,
                md5checksum TEXT NOT NULL,
                date_created TEXT NOT NULL
            );
    CREATE TABLE sample (CHROM TEXT,
    POS TEXT,
    ID TEXT,
    REF TEXT,
    ALT TEXT,
    QUAL TEXT,
    line_number INTEGER NOT NULL);
    sqlite> 

Execute queries against both tables.

.. code-block:: sql

    sqlite> select * from provenance;
    776|/home/sundaram/projects/tsv2sqlite/sample.tsv|786b82b2414d3acf7af34c068e358759|2023-11-12 21:31:17.307717
    sqlite> select * from sample;
    1|12345|rs567|A|G|50|3
    2|56789|rs890|T|C|44|4
    3|98765|rs123|G|T|60|5
    1|34567|rs456|C|A|55|6
    sqlite> 

Column mapping mode
-------------------

Column mapping was implemented to address a particular use case and may be deprecated in the near future.

To use the exported script for parsing a tab-delimited file to be loaded into a SQLite3 database instance:

.. code-block:: shell

     tsv2sqlite --infile sample.tsv
    --config_file was not specified and therefore was set to '/home/sundaram/projects/tsv2sqlite/venv/lib/python3.10/site-packages/tsv2sqlite/conf/config.yaml'
    --outdir was not specified and therefore was set to '/tmp/tsv2sqlite/2023-11-13-053537'
    Created output directory '/tmp/tsv2sqlite/2023-11-13-053537'
    --logfile was not specified and therefore was set to '/tmp/tsv2sqlite/2023-11-13-053537/main.log'
    --database_file was not specified and therefore was set to 'sample.tsv.sqlite3'
    The log file is '/tmp/tsv2sqlite/2023-11-13-053537/main.log'
    Execution of '/home/sundaram/projects/tsv2sqlite/venv/lib/python3.10/site-packages/tsv2sqlite/main.py' completed

Execute SQL queries
~~~~~~~~~~~~~~~~~~~

Connect to the SQLite3 database:

.. code-block:: shell

    sqlite3 sample.tsv.sqlite3

Inspect the file metadata:

.. code-block:: sql

    sqlite> .schema provenance
    CREATE TABLE provenance (
                bytesize INTEGER NOT NULL,
                abspath TEXT NOT NULL,
                md5checksum TEXT NOT NULL,
                date_created TEXT NOT NULL
            );
    sqlite> select * from provenance;
    776|/home/sundaram/projects/tsv2sqlite/sample.tsv|786b82b2414d3acf7af34c068e358759|2023-11-12 21:31:17.307717

Inspect the column mappings:

.. code-block:: sql

    sqlite> .schema columnmaps
    CREATE TABLE columnmaps (
                column_num INTEGER NOT NULL PRIMARY KEY,
                name TEXT NOT NULL,
                norm_name TEXT NOT NULL,
                UNIQUE (name, norm_name)
            );
    sqlite>

    sqlite> select * from columnmaps;
    0|#CHROM|_CHROM
    1|POS|POS
    2|ID|ID
    3|REF|REF
    4|ALT|ALT
    5|QUAL|QUAL
    sqlite>

Inspect the records table:

.. code-block:: sql

    sqlite> .schema records
    CREATE TABLE records (
                line_num INTEGER NOT NULL,
                column_num INTEGER NOT NULL,
                value TEXT NOT NULL,
                UNIQUE (line_num, column_num),
                FOREIGN KEY (column_num) REFERENCES columnmaps (column_num)
            );
    sqlite> select * from records;
    3|0|1
    3|1|12345
    3|2|rs567
    3|3|A
    3|4|G
    3|5|50
    4|0|2
    4|1|56789
    4|2|rs890
    4|3|T
    4|4|C
    4|5|44
    5|0|3
    5|1|98765
    5|2|rs123
    5|3|G
    5|4|T
    5|5|60
    6|0|1
    6|1|34567
    6|2|rs456
    6|3|C
    6|4|A
    6|5|55
    sqlite>

Get the ID values for each line:

.. code-block:: sql

    sqlite> select r_id.value, r_id.line_num
    from records r_id, columnmaps c_id
    where c_id.name = "ID"
    and c_id.column_num = r_id.column_num;
    rs567|3
    rs890|4
    rs123|5
    rs456|6
    sqlite>

Get the ID, REF and line number:

.. code-block:: sql

    sqlite> select r_id.value, r_ref.value, r_ref.line_num
    from records r_id, records r_ref, columnmaps c_id, columnmaps c_ref
    where c_ref.name = "REF"
    and c_id.name = "ID"
    and c_id.column_num = r_id.column_num
    and c_ref.column_num = r_ref.column_num
    and r_id.line_num = r_ref.line_num;
    rs567|A|3
    rs890|T|4
    rs123|G|5
    rs456|C|6
    sqlite>


=======
History
=======

0.1.0 (2023-11-12)
------------------

* First release on PyPI.
