Metadata-Version: 2.4
Name: konsepy
Version: 0.5.0
Summary: Framework for build NLP information extraction systems using regular expressions.
Author-email: dcronkite <dcronkite+pypi@gmail.com>
Project-URL: Home, https://github.com/kpwhri/konsepy
Keywords: nlp,natural-language-processing,information-extraction,text-processing,text-mining,regular-expressions,regex,rule-based,clinical-nlp,healthcare,medical-nlp
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: Python :: 3.15
Classifier: Topic :: Text Processing :: Linguistic
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Healthcare Industry
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: loguru>=0.7.3
Provides-Extra: dev
Requires-Dist: pytest>=9.0.3; extra == "dev"
Provides-Extra: ssplit
Requires-Dist: spacy; extra == "ssplit"
Provides-Extra: sas
Requires-Dist: sas7bdat; extra == "sas"
Provides-Extra: model
Requires-Dist: datasets; extra == "model"
Requires-Dist: transformers; extra == "model"
Provides-Extra: all
Requires-Dist: spacy; extra == "all"
Requires-Dist: sas7bdat; extra == "all"
Requires-Dist: datasets; extra == "all"
Requires-Dist: transformers; extra == "all"

# konsepy

Framework for build NLP information extraction systems using regular expressions. `konsepy` then enables leveraging the
NLP system to create a silver standard for fine-tuning a transformer model.

## Installation

* `konsepy` is designed to be used with the [`knosepy_nlp_template`](https://github.com/kpwhri/konsepy_nlp_template)
    * See the README there for current installation instructions.
* To use `konsepy` as a standalone entity:
    * Install with `pip`:
        * `pip install konsepy[all]`
        * For sentence-splitting corpora from fine-tuning a sentence based transformer, `spacy` will also need to be
          installed and configured.

## Usage

The package provides a centralized CLI tool `konsepy`.

### Building your NLP Package

To use `konsepy`, you need to create an NLP package (e.g., `my_nlp_package`) with the following structure. The best way
to get this format is to clone the [konsepy_nlp_template](https://github.com/kpwhri/konsepy_nlp_template):

```text
my_nlp_package/
├── __init__.py
└── concepts/
    ├── __init__.py
    └── my_concept.py
```

Each concept file (e.g., `my_concept.py`) must define:

* `REGEXES`: A list of regex-category pairs (and optional context functions).
    * See [Regex Arguments](#regex-arguments)
* `RUN_REGEXES_FUNC`: A function that executes the regexes and returns categories/matches (
  see [search functions, below](#search-functions))
* `CategoryEnum`: An `Enum` defining the possible categories for the concept.

#### Regex Arguments

When defining `REGEXES`, you can supply a variable number of arguments. The can be entirely customized by your
own [search function](#search-functions), but the standard argument list is:

* Position 0: Compile pattern (e.g., `re.compile('score: (?P<val>\d+))`)
* Position 1: Default value (enum) if the compile pattern matches (e.g., `MyCategory.SCORE`)
* Position 2: Post-processing function(s) (use a list/tuple if > 1) (e.g., `[is_negated]`)
    * This function can accept contextual information provided as:
        * m: regex match object
        * precontext: text in `m.start() - window` (default to 20 characters)
        * postcontext: text in `m.end() + window` (default to 20 characters)
        * text: full text
        * window: character window (int)
        * word_window: word window (int)
        * around: text in `m.start() - window to m.end() + window`
* Position 3: Pre-processing function(s) (use a list/tuple if > 1)
    * The functions should return start/end indices of the text that should be processed.
    * They can return (or yield) `None` or `start_index == end_index` if not text should be searched.

## Regex search helpers

`rxsearch` provides small utilities for classifying or extracting values from text
with ordered regex definitions.

The canonical search functions are:

```python
search_all_regex()
search_first_regex()
```

## Regex definition format

Each regex definition may contain up to four positions:

```python
(regex, default_value, postprocessors, preprocessors)
```

### Position 0: regex

A compiled regex pattern.

```python
re.compile(r'score:\s*(?P<target>\d+)')
```

A `None` regex acts as a sentinel. If a non-`UNKNOWN` result has already been
found, searching stops at the sentinel.

```python
REGEXES = [
    (KNOWN_REGEX, 'KNOWN'),
    (None, None),
    (UNKNOWN_REGEX, 'UNKNOWN'),
]
```

### Position 1: default value

The value yielded when the regex matches and no postprocessor overrides or skips
the result.

```python
REGEXES = [
    (re.compile(r'Väinämöinen'), 'HERO'),
]
```

### Position 2: postprocessors

Optional function, list, or tuple of functions.

Postprocessors receive contextual keyword arguments, including:

- `m`: regex match object
- `precontext`: text before the match
- `postcontext`: text after the match
- `text`: full text
- `window`: character context window
- `word_window`: word context window
- `around`: text around the match

A postprocessor may return:

| Return value     | Meaning                                                                      |
|------------------|------------------------------------------------------------------------------|
| `None`           | no override; try the next postprocessor, then fall back to the default value |
| `SKIP`           | skip this match entirely                                                     |
| `value`          | yield `value` instead of the default value                                   |
| `(value, match)` | yield `value` and use `match` for match/index output                         |

Example:

```python
import re

from konsepy.rxsearch import SKIP, search_all_regex


def skip_negated(*, precontext, **_):
    if 'no ' in precontext.lower():
        return SKIP
    return None


REGEXES = [
    (re.compile(r'diabetes'), 'DIABETES', skip_negated),
]

search = search_all_regex(REGEXES)

print(list(search('diabetes')))
print(list(search('no diabetes')))
```

Output:

```python
['DIABETES']
[]
```

### Position 3: preprocessors

Optional function, list, or tuple of functions.

Preprocessors receive the full text and should return or yield searchable
`(start, end)` regions.

They may return or yield:

- `None`, which is ignored
- `(start, end)`, which is searched
- `(start, start)`, which is ignored

Example:

```python
import re

from konsepy.rxsearch import search_all_regex


def first_sentence_only(text):
    end = text.find('.')
    if end == -1:
        yield 0, len(text)
    else:
        yield 0, end


REGEXES = [
    (
        re.compile(r'score:\s*\d+'),
        'SCORE',
        None,
        first_sentence_only,
    ),
]

search = search_all_regex(REGEXES)

print(list(search('score: 10. score: 20.')))
```

Output:

```python
['SCORE']
```

## Basic classification

Use `search_all_regex()` to yield every matching result.

```python
import re

from konsepy.rxsearch import search_all_regex

REGEXES = [
    (re.compile(r'Väinämöinen'), 'HERO'),
    (re.compile(r'Kalevala'), 'PLACE'),
]

search = search_all_regex(REGEXES)

print(list(search('Väinämöinen sang in Kalevala.')))
```

Output:

```python
['HERO', 'PLACE']
```

## First result only

Use `search_first_regex()` to yield at most one result.

```python
import re

from konsepy.rxsearch import search_first_regex

REGEXES = [
    (re.compile(r'Väinämöinen'), 'HERO'),
    (re.compile(r'Kalevala'), 'PLACE'),
]

search = search_first_regex(REGEXES)

print(list(search('Väinämöinen sang in Kalevala.')))
```

Output:

```python
['HERO']
```

## Include match objects

Pass `include_match=True` to receive `(result, match)` tuples.

```python
import re

from konsepy.rxsearch import search_all_regex

REGEXES = [
    (re.compile(r'Väinämöinen'), 'HERO'),
]

search = search_all_regex(REGEXES)

for value, match in search('old Väinämöinen sang', include_match=True):
    print(value, match.group(), match.start(), match.end())
```

Output:

```python
HERO
Väinämöinen
4
15
```

## Return matched text and indices

Use `get_all_regex_by_index()` to yield:

```python
(result, match_text, start, end)
```

Example:

```python
import re

from konsepy.rxsearch import get_all_regex_by_index

REGEXES = [
    (re.compile(r'Väinämöinen'), 'HERO'),
]

get_by_index = get_all_regex_by_index(REGEXES)

print(list(get_by_index('old Väinämöinen sang')))
```

Output:

```python
[('HERO', 'Väinämöinen', 4, 15)]
```

## Extracting `(?P<target>...)`

Use `extract_all_regex_target()` or `extract_first_regex_target()` to return
regex group values instead of default classification values.

By default, these helpers extract the named group `target`.

```python
import re

from konsepy.rxsearch import extract_all_regex_target

REGEXES = [
    (re.compile(r'score:\s*(?P<target>\d+)'), 'SCORE'),
]

extract = extract_all_regex_target(REGEXES)

print(list(extract('score: 10 score: 25')))
```

Output:

```python
['10', '25']
```

## Extract and transform

Use `transform` to convert extracted values.

```python
import re

from konsepy.rxsearch import extract_all_regex_target

REGEXES = [
    (re.compile(r'score:\s*(?P<target>\d+)'), 'SCORE'),
]

extract = extract_all_regex_target(REGEXES, transform=int)

print(list(extract('score: 10 score: 25')))
```

Output:

```python
[10, 25]
```

Falsey transformed values, such as `0`, are preserved.

```python
print(list(extract('score: 0')))
```

Output:

```python
[0]
```

## Extract a different group

Use `target` to extract a different group name or group index.

```python
import re

from konsepy.rxsearch import extract_all_regex_target

REGEXES = [
    (re.compile(r'hero:\s*(?P<name>\w+)'), 'HERO'),
]

extract = extract_all_regex_target(REGEXES, target='name')

print(list(extract('hero: Aino')))
```

Output:

```python
['Aino']
```

## Configure extraction fallback

Extraction skips matches by default when the group is missing or unmatched.

```python
from konsepy.rxsearch import SKIP

extract = extract_all_regex_target(
    REGEXES,
    missing=SKIP,
    unmatched=SKIP,
)
```

To fall back to the regex default value, use `None`.

```python
extract = extract_all_regex_target(
    REGEXES,
    missing=None,
    unmatched=None,
)
```

If extraction returns `None`, later postprocessors may still run. If no
postprocessor returns a value, the default value is yielded.

## Use extraction as a postprocessor

Use `extract_group()` directly in position 2 when you want extraction behavior
inside regular `search_all_regex()` or `search_first_regex()` calls.

```python
import re

from konsepy.rxsearch import extract_group, search_all_regex

REGEXES = [
    (
        re.compile(r'score:\s*(?P<target>\d+)'),
        'SCORE',
        extract_group(),
    ),
]

search = search_all_regex(REGEXES)

print(list(search('score: 10')))
```

Output:

```python
['10']
```

Use `extract_group_as()` to transform the group.

```python
import re

from konsepy.rxsearch import extract_group_as, search_all_regex

REGEXES = [
    (
        re.compile(r'score:\s*(?P<target>\d+)'),
        'SCORE',
        extract_group_as(transform=int),
    ),
]

search = search_all_regex(REGEXES)

print(list(search('score: 10')))
```

Output:

```python
[10]
```

## Prevent overlapping duplicate matches

Pass `suppress_overlaps=True` to let earlier matches claim spans of text.
Later matches that overlap already-claimed spans are skipped.

This is useful when a specific pattern should override a more general one.

```python
import re

from konsepy.rxsearch import search_all_regex

REGEXES = [
    (re.compile(r'not\s+x'), 'NEGATED_X'),
    (re.compile(r'x'), 'X'),
]

search = search_all_regex(REGEXES)

print(list(search('not x')))
print(list(search('not x', suppress_overlaps=True)))
```

Output:

```python
['NEGATED_X', 'X']
['NEGATED_X']
```

The original text is not modified, so match indices and context windows remain
stable.

Non-overlapping later matches are still returned.

```python
print(list(search('not x and x', suppress_overlaps=True)))
```

Output:

```python
['NEGATED_X', 'X']
```

## Ignore preprocessing regions

Pass `ignore_indices=True` to search the whole text even when preprocessors are
defined. This is mainly useful in tests.

```python
import re

from konsepy.rxsearch import search_all_regex


def no_regions(text):
    return None


REGEXES = [
    (
        re.compile(r'Väinämöinen'),
        'HERO',
        None,
        no_regions,
    ),
]

search = search_all_regex(REGEXES)

print(list(search('Väinämöinen')))
print(list(search('Väinämöinen', ignore_indices=True)))
```

Output:

```python
[]
['HERO']
```

## Deprecated compatibility names

Use these names for new code:

```python
search_all_regex()
search_first_regex()
```

These older names remain available for compatibility, but emit
`DeprecationWarning`:

```python
search_all_regex_func()
search_first_regex_func()
search_all_regex_match_func()
search_and_replace_regex_func()
```

`search_and_replace_regex_func()` now delegates to overlap-suppressed search
instead of modifying the searched text. Prefer:

```python
search = search_all_regex(REGEXES)

results = list(search(text, suppress_overlaps=True))
```

## Regex Utilities

`konsepy` includes `KonsepyRegex` in `konsepy.rxutils` to allow for duplicate named groups in alternation branches:

```python
import re
from konsepy.rxutils import KonsepyRegex

pattern = KonsepyRegex(
    r'(?:score: (?P<val>\d+)|results: (?P<val>\d+))',
    flags=re.I,
    allow_dupe_names=True,
)
# m.group("val") will return whichever branch matched
```

You can also use the shorthand helper `rx_compile`:

```python
from konsepy.rxutils import rx_compile

pattern = rx_compile(r'(?:this: (?P<val>\d+)|results: (?P<val>\d+))')
```

Example of `my_concept.py`:

```python
import re
from enum import Enum
from konsepy.rxsearch import search_all_regex_func
from konsepy.context.negation import check_if_negated
from konsepy.context.other_subject import check_if_other_subject


class CategoryEnum(Enum):
    MENTION = 1
    NO = 0
    OTHER = 3


REGEXES = [
    (re.compile(r'my pattern', re.I),
     CategoryEnum.MENTION,
     [
         lambda **kwargs: check_if_negated(neg_concept=CategoryEnum.NO, **kwargs),
         lambda **kwargs: check_if_other_subject(other_concept=CategoryEnum.OTHER, **kwargs),
     ]
     ),
]

# word_window specifies the number of words to retrieve for context functions (instead of character):
RUN_REGEXES_FUNC = search_all_regex_func(REGEXES, word_window=5)
# to alter the character-based window:
RUN_REGEXES_FUNC = search_all_regex_func(REGEXES, window=50)  # defaults to 30
```

## Custom Search Functions

You can create your own search function by defining a function that returns a generator:

```python
def my_custom_search(regexes):
    def _search(text, include_match=False):
        for regex, category, *other in regexes:
            for m in regex.finditer(text):
                yield (category, m) if include_match else category

    return _search
```

## Running konsepy

```bash
# Run all concepts in a package against input files
konsepy run-all --package-name my_nlp_package --input-files data.csv --outdir output/

# Run and output individual matches as JSONL (useful for match-level analysis)
konsepy run-all-matches --package-name my_nlp_package --input-files data.csv --outdir output/

# Extract snippets for manual review
konsepy run4snippets --package-name my_nlp_package --input-files data.csv --outdir snippets/

# Generate BIO tagged data for model training
konsepy bio-tag --package-name my_nlp_package --input-files data.csv --outdir bio_data/
```

For more detailed documentation and a template,
see [konsepy_nlp_template](https://github.com/kpwhri/konsepy_nlp_template).

## Roadmap

* Change labels to some metadata object to allow more diverse input sources and run info
