Metadata-Version: 2.4
Name: extract-list
Version: 0.3
Summary: Extract a list from JSON or XML, save to excel, csv, etc.
Author: Tom Björkholm
Author-email: Tom Björkholm <klausuler_linnet0q@icloud.com>
License-Expression: MIT
Project-URL: Source code, https://bitbucket.org/tom-bjorkholm/extract-list
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.13
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Requires-Dist: argcomplete>=3.6.3
Requires-Dist: config-as-json>=1.0
Requires-Dist: tableio>=0.9
Requires-Dist: tableio-cfg-json>=0.2
Requires-Dist: versionreporter>=0.2
Requires-Dist: xmltodict>=1.0.4
Requires-Dist: types-xmltodict>=1.0.1.20260408
Requires-Dist: pip>=26.1.1
Requires-Dist: setuptools>=82.0.1
Requires-Dist: build>=1.5.0
Requires-Dist: wheel>=0.47.0
Dynamic: author
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python

# extract-list

## Background

extract-list is a command line tool for turning selected data from JSON or XML
files into a list of columns. It is useful when the source data is structured,
but the result is needed in a spreadsheet, CSV file, text file, or another
table-like format.

## What it does

extract-list:

* reads data from an XML file or a JSON file
* extracts the configured parts of that input data
* writes the extracted data as rows and columns

Examples of output formats include:

* Excel
* CSV
* ODS
* HTML
* Markdown
* plain text
* JSON
* XML

The exact output choices available in an installed environment can be listed
with the `cfg-example --help` command. Some output formats are provided through
TableIO, and some are handled directly by extract-list.

## Installing it

extract-list requires Python 3.13 or newer. Python can be downloaded from
[https://www.python.org/downloads/](https://www.python.org/downloads/).

### Installing on macOS and Linux

```sh
pip3 install --upgrade extract-list
```

### Installing on Microsoft Windows

```sh
pip install --upgrade extract-list
```

## Version history

| Version | Date        | Python version  | Description                         |
|---------|-------------|-----------------|-------------------------------------|
| 0.2     | 06 Jan 2025 | 3.12.6 or newer | First released version              |
| 0.2.2   | 23 Mar 2025 | 3.13.2 or newer | Adapted to Python 3.13.2            |
| 0.2.3   | 23 Mar 2025 | 3.13.2 or newer | Fix in README only                  |
| 0.2.5   | 09 Apr 2025 | 3.13.3 or newer | Using newer dependencies            |
| 0.2.7   | 09 Jun 2025 | 3.12.x          | add version sub-command             |
| 0.2.8   | 09 Jun 2025 | 3.13 or newer   | add version sub-command             |
| 0.2.10  | 23 Jul 2025 | 3.12.x          | updated dependencies                |
| 0.2.11  | 23 Jul 2025 | 3.13 or newer   | updated dependencies                |
| 0.2.13  | 23 Nov 2025 | 3.12.x          | updated dependencies                |
| 0.2.14  | 23 Nov 2025 | 3.13 or newer   | updated dependencies                |
| 0.3     | 01 Jun 2026 | 3.13 or newer   | New output config; more formats     |

## Running the application

### Running on macOS and Linux

```sh
python3 -m extract_list --help
python3 -m extract_list cfg-example --help
python3 -m extract_list extract --help
python3 -m extract_list cfg-example -k example_json -t Excel -o example.cfg
python3 -m extract_list extract -c example.cfg -i input.json -o output.xlsx
```

### Running on Microsoft Windows

```sh
python -m extract_list --help
python -m extract_list cfg-example --help
python -m extract_list extract --help
python -m extract_list cfg-example -k example_json -t Excel -o example.cfg
python -m extract_list extract -c example.cfg -i input.json -o output.xlsx
```

## Suggested way to get started

1. Use the `cfg-example` sub-command to generate a few example configuration
   files.
2. Read each generated `.cfg` file together with its generated `.txt`
   description file.
3. Find the example that is closest to what you want to extract.
4. Modify that configuration file to match your input data and desired output.
5. Use the `extract` sub-command to read your data and write the output file.
6. Check the produced output. If needed, adjust the configuration and run the
   extraction again.

### Example configuration files

When `cfg-example` creates a configuration file, it also creates a text file
that describes the example and the configuration syntax. If the configuration
file is named `example.cfg`, the description file is named `example.txt`.

The generated `.txt` file is the best place to look for detailed configuration
reference text, because it is generated by the same installed version of
extract-list that will read the configuration.

## Migrating old configuration files

Version 0.3 uses the nested `output` configuration object. Older 0.2-style
configuration files that use settings such as `outfile_type`,
`outfile_encoding`, `out_csv_dialect`, or `outfile_excel_library` can be
migrated with the `migrate-cfg` sub-command.

### Migrating on macOS and Linux

```sh
python3 -m extract_list migrate-cfg -i old.cfg -o new.cfg
```

### Migrating on Microsoft Windows

```sh
python -m extract_list migrate-cfg -i old.cfg -o new.cfg
```

extract-list can still read the old configuration format with backward
compatibility handling, but migration is recommended before editing the
configuration further.

## Configuration file overview

The configuration file is written in JSON syntax:
[https://en.wikipedia.org/wiki/JSON](https://en.wikipedia.org/wiki/JSON).
The keywords and nesting are important. The order of keywords is not
significant. Indentation and line breaks are not significant.

The encoding for the configuration file must be UTF-8. US-ASCII is a subset of
UTF-8.

It is recommended to generate a configuration file with `cfg-example` and then
edit that file. Writing a configuration file from scratch is possible, but the
generated examples are a much safer starting point.

### Type of input file

The type of input file to read is determined by `infile_type`. `infile_type`
can have these values:

* `"JSON"`
* `"XML"`

Both JSON and XML are text file formats, so the input file has a character
encoding. This is specified with `infile_encoding`. Unless you know that you
need another encoding, leave this as it is in the generated example
configuration.

### Type of output file

Output files are configured in the nested `output` object. The output format is
selected with `output.format_name`. The value is case-insensitive and is
normalized when the configuration is read.

Examples of commonly useful `output.format_name` values include:

* `"Excel"`
* `"CSV"`
* `"ODS"`
* `"HTML"`
* `"md"`
* `"txt"`
* `"JSON"`
* `"XML"`
* `"Plaintxt"`

Run `python3 -m extract_list cfg-example --help` on macOS or Linux, or
`python -m extract_list cfg-example --help` on Windows, to see the full list of
output formats available in your installed environment.

`JSON`, `XML`, and `Plaintxt` output are handled inside extract-list. Other
table-style output is handled through TableIO. The same `output.format_name`
configuration member is used for both groups.

Text-based output formats can have a character encoding. This is configured
with `output.character_encoding`. Unless you know that you need another
encoding, leave this as it is in the generated example configuration.

Comma separated values files may differ slightly depending on the programs
used to read and write them and the locale used. CSV settings are configured in
the optional nested `output.csv` object.

TableIO may provide several implementations for a file format.
`output.implementation` can force a specific implementation. If it is omitted,
TableIO chooses the best available implementation.

The `outfile_border` and `outfile_filtered_area` values can be `"NO"`,
`"IF_AVAILABLE"`, or `"NEEDED"`. `"NO"` disables the feature request.
`"IF_AVAILABLE"` requests the feature when the selected output implementation
can provide it. `"NEEDED"` requires support for the feature.

### Data to extract

The input file is likely to contain more data than should appear in the
output. The data to extract is specified with `main_line` and `linked_lines`.

The `main_line` setting describes the main input records that become output
rows. Linked lines describe additional records that are tied to the main line.
A linked line is tied to a main line when a configured value in the linked line
has the same value as a configured value in the main line.

The `line` member of `main_line` and `linked_lines` is a list of strings. This
list is the path of keys to the input records. Directly below that path there
is either a list or a dictionary of records.

The records of the `main_line` are indexed in the input, either by integer
position in a list or by key in a dictionary. If you want this index or key in
the output, set `include_key` to `true`. If not, set it to `false`. The output
column name for this key is configured with `column_name_for_key`.

The `columns` member of `main_line` and `linked_lines` maps output column names
to paths inside each input record. Each path is written as a list of strings.

Sometimes one input record contains several nested records. For example, a
purchase order may contain several purchased items. Because the output is a
list of columns, such an input record must be expanded into several output
rows. The shared values from the input record are duplicated on those output
rows. The `expand_at` member contains the relative paths where this expansion
should happen.

There can be only one `main_line`. There may be any number of linked lines.
The linked lines are configured as an array in `linked_lines`.

For each linked line, `linked_column` and `linked_main_column` define how it is
tied to the main line. Both values are relative paths in the input records.
The value at `linked_column` in the linked-line record must equal the value at
`linked_main_column` in the main-line record.

The values used to link records may also be extracted as output columns, but
they do not have to be.

Several linked lines can be tied to the same main line.
`one_output_line_per_main_line` controls how this is handled. If it is `false`,
the main-line data is duplicated so the output has one row for each tied
combination of main-line and linked-line records. If it is `true`, multiple
linked lines for one main line are reported as an error.

Input records may be missing optional values. `missing_input_for_column`
controls how missing input data is handled. Its values are `"EMPTY"` and
`"ERROR"`. If it is `"EMPTY"`, the output column resulting from missing input
data is left empty.

### XML attributes

XML member values may be written either as nested objects or as attributes. If
the input has XML attributes, the attribute key has an `@` prefix. To handle
this, `in_xml_strip_at` can be set to `true` or `false`. If it is `true`, a
leading `@` character is stripped from each key.

The `out_xml_attributes` setting specifies column names that should be written
as XML attributes in XML output instead of as nested objects.

### Output column order

The order of the columns in the output is specified with `column_order`. The
value is a list of strings.

Specifying a column in `column_order` that has not been extracted is an error.
Extracting a column without specifying it in `column_order` is also an error.

### Output line order

The `order_rows_by` setting specifies which columns should be used for sorting
the output rows. The most significant column should be first in the list.

The default row order is based on the list of columns in `column_order`. Leave
`order_rows_by` as an empty list unless you need another specific order.

## Source code

Source code and tests are available at
[https://bitbucket.org/tom-bjorkholm/extract-list](https://bitbucket.org/tom-bjorkholm/extract-list).

## Test summary

- Test result: 1247 passed in 36s
- No flake8 warnings.
- No mypy errors found.
- No python layout warnings.
- Built version(s): 0.3
- Build and test using Python 3.14.5
