Metadata-Version: 2.4
Name: litgenemap
Version: 0.3.0
Summary: Gene extraction and bibliometric analysis from literature tables
Author: LitGeneMap contributors
License-Expression: MIT
Project-URL: Homepage, https://github.com/whitecrowr/litgenemap
Project-URL: Source, https://github.com/whitecrowr/litgenemap
Project-URL: Issues, https://github.com/whitecrowr/litgenemap/issues
Keywords: gene,bibliometrics,literature-mining,bioinformatics,co-occurrence
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=2.0
Requires-Dist: numpy>=1.24
Requires-Dist: openpyxl>=3.1
Requires-Dist: networkx>=3.0
Requires-Dist: python-louvain>=0.16
Dynamic: license-file

# LitGeneMap

LitGeneMap is a Python package for extracting gene mentions from literature tables and running bibliometric analyses on them.

It is designed around a simple input rule: any literature table containing at least `title` and `abstract` can be used. By default, LitGeneMap analyzes `title`, `abstract`, and `keyword`, then produces gene frequency tables, optional temporal metrics, gene-gene co-occurrence, and full network/module outputs.

## Core design

- Minimum required literature columns: `title`, `abstract`
- Default analyzed text columns: `title`, `abstract`, `keyword`
- Compatible with `.xlsx`, `.csv`, and `.tsv`
- Works with bibliometrix/WoS-style exports, but is not limited to them
- Supports either:
  - `hgnc_complete_set.txt` for human analyses
  - a custom mapping table with `raw_term` and `standard_symbol` for any species
- Supports partial runs: `frequency`, `cooccurrence`, or `full`
- Includes a built-in default blacklist for highly ambiguous terms

## Installation

### From local source

```bash
pip install -e .
```

### From PyPI (once released)

```bash
pip install litgenemap
```

## Quick start

### Human HGNC, full analysis

```bash
litgenemap run --literature my_literature.xlsx --genes hgnc_complete_set.txt --gene-input hgnc --output results/
```

### Custom species dictionary, full analysis

```bash
litgenemap run --literature my_literature.csv --genes rice_dictionary.csv --gene-input dictionary --output results_rice/
```

### Frequency only

```bash
litgenemap run --literature my_literature.xlsx --genes hgnc_complete_set.txt --gene-input hgnc --analysis-level frequency --output results_freq/
```

### Add a custom blacklist for ambiguous terms

```bash
litgenemap run --literature my_literature.xlsx --genes hgnc_complete_set.txt --gene-input hgnc --blacklist blacklist.txt --output results_clean/
```

## Example dataset

A bundled example dataset is provided in `example_data/` for quick validation of the pipeline and for reproducible demonstration purposes.

Included files:

- `example_data/demo_literature.csv`: example literature records
- `example_data/demo_dictionary.csv`: custom gene dictionary with alias-to-symbol normalization
- `example_data/demo_blacklist.txt`: optional custom blacklist for filtering ambiguous terms

Run the full pipeline with the bundled example data:

```bash
litgenemap run --literature example_data/demo_literature.csv --genes example_data/demo_dictionary.csv --gene-input dictionary --analysis-level full --output demo_output
```

This example covers:

- custom dictionary-based gene matching
- alias normalization such as `p53 -> TP53` and `HER2 -> ERBB2`
- gene frequency analysis
- gene-gene co-occurrence analysis
- downstream network and module generation

Optional blacklist test:

```bash
litgenemap run --literature example_data/demo_literature.csv --genes example_data/demo_dictionary.csv --gene-input dictionary --analysis-level full --blacklist example_data/demo_blacklist.txt --output demo_output_blacklist
```

## Input requirements

### Minimum required literature columns

- `title`
- `abstract`

### Default analyzed text columns

- `title`
- `abstract`
- `keyword`

### Optional metadata columns

- `year`
- `doi`
- `keywords_plus`

## Supported literature file formats

- `.xlsx`
- `.csv`
- `.tsv`

## Automatic column alias mapping

LitGeneMap automatically maps common source column names when possible:

- `title` <- `title` / `TI` / `TI_raw`
- `abstract` <- `abstract` / `AB` / `AB_raw`
- `keyword` <- `keyword` / `keywords` / `author_keywords` / `DE` / `DE_raw`
- `keywords_plus` <- `keywords_plus` / `ID`
- `year` <- `year` / `PY`
- `doi` <- `doi` / `DI`
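
The mapping above can be pictured as a lookup from canonical names to accepted source names. The sketch below is illustrative only: `COLUMN_ALIASES`, `resolve_columns`, and `check_required` are hypothetical names, and the "first alias found wins, case-sensitive" rule is an assumption rather than documented LitGeneMap behavior.

```python
# Alias table copied from the list above; the package's internal table
# and matching rules may differ (this is an assumption).
COLUMN_ALIASES = {
    "title": ["title", "TI", "TI_raw"],
    "abstract": ["abstract", "AB", "AB_raw"],
    "keyword": ["keyword", "keywords", "author_keywords", "DE", "DE_raw"],
    "keywords_plus": ["keywords_plus", "ID"],
    "year": ["year", "PY"],
    "doi": ["doi", "DI"],
}


def resolve_columns(source_columns):
    """Return {canonical_name: source_name} for every alias that is present.

    Hypothetical helper: the first listed alias found in the table wins.
    """
    available = set(source_columns)
    resolved = {}
    for canonical, aliases in COLUMN_ALIASES.items():
        for alias in aliases:
            if alias in available:
                resolved[canonical] = alias
                break
    return resolved


def check_required(resolved, required=("title", "abstract")):
    """Raise ValueError if a minimum required column was not resolved."""
    missing = [name for name in required if name not in resolved]
    if missing:
        raise ValueError(f"missing required columns: {missing}")


# Header of a WoS-style export; "SO" has no alias here and is ignored.
wos_header = ["TI", "AB", "DE", "ID", "PY", "DI", "SO"]
mapping = resolve_columns(wos_header)
check_required(mapping)
print(mapping)
```

Columns without a known alias are simply left out of the mapping in this sketch; only a missing `title` or `abstract` is treated as an error.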

## Gene input modes

### 1. `--gene-input hgnc`

Use a human HGNC raw table such as `hgnc_complete_set.txt`.

LitGeneMap will automatically:

- read the HGNC file
- keep only approved genes by default
- keep only protein-coding genes by default
- expand searchable terms from `symbol`, `alias_symbol`, and `prev_symbol`

Human gene data can be obtained from the [HGNC website](https://www.genenames.org/).
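
To make the expansion step concrete, here is a minimal sketch using only the standard library. It assumes the HGNC file is tab-separated, that approved entries carry `status` equal to `Approved`, that protein-coding entries carry `locus_group` equal to `protein-coding gene`, and that multi-valued fields are separated by `|`. The function `expand_terms` and the inline sample rows are hypothetical and do not reflect LitGeneMap's internal code.

```python
import csv
import io

# Tiny stand-in for hgnc_complete_set.txt (illustrative rows only).
SAMPLE = (
    "symbol\tstatus\tlocus_group\talias_symbol\tprev_symbol\n"
    "TP53\tApproved\tprotein-coding gene\tp53|LFS1\t\n"
    "ERBB2\tApproved\tprotein-coding gene\tHER2|NEU\tNGL\n"
    "MIR21\tApproved\tnon-coding RNA\tmiR-21\tMIRN21\n"
)


def expand_terms(handle, protein_coding_only=True):
    """Yield (raw_term, standard_symbol) pairs from an HGNC-style table."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["status"] != "Approved":
            continue
        if protein_coding_only and row["locus_group"] != "protein-coding gene":
            continue
        symbol = row["symbol"]
        terms = [symbol]
        # Aliases and previous symbols become extra searchable terms.
        for field in ("alias_symbol", "prev_symbol"):
            terms.extend(t for t in row[field].split("|") if t)
        for term in terms:
            yield term, symbol


pairs = list(expand_terms(io.StringIO(SAMPLE)))
for raw_term, standard_symbol in pairs:
    print(raw_term, "->", standard_symbol)
```

The result has the same shape as a `raw_term` / `standard_symbol` dictionary, which is why both gene input modes can feed the same downstream matching step in this sketch.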

### 2. `--gene-input dictionary`

Use a custom mapping table for any species.

Minimum required columns:

- `raw_term`
- `standard_symbol`

Example:

```csv
raw_term,standard_symbol
TP53,TP53
p53,TP53
BRCA1,BRCA1
```

This mode is useful for:

- non-human species
- custom curated dictionaries
- domain-specific controlled vocabularies
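
As a rough illustration of how such a dictionary can drive matching, the sketch below loads the example table and finds whole-term matches in a piece of text. The helper names (`load_dictionary`, `build_pattern`, `find_genes`) are hypothetical, and the case-sensitive, alphanumeric-boundary matching rule is an assumption; LitGeneMap's actual matching logic may differ.

```python
import csv
import io
import re

DICTIONARY_CSV = "raw_term,standard_symbol\nTP53,TP53\np53,TP53\nBRCA1,BRCA1\n"


def load_dictionary(handle):
    """Read raw_term -> standard_symbol pairs from a CSV handle."""
    return {row["raw_term"]: row["standard_symbol"] for row in csv.DictReader(handle)}


def build_pattern(terms):
    """Compile one regex that matches any term as a whole token."""
    # Longest terms first so a longer alias wins over a shorter prefix.
    ordered = sorted(terms, key=len, reverse=True)
    body = "|".join(re.escape(term) for term in ordered)
    # Lookarounds reject matches embedded in longer alphanumeric tokens.
    return re.compile(rf"(?<![A-Za-z0-9])(?:{body})(?![A-Za-z0-9])")


def find_genes(text, term_to_symbol, pattern):
    """Return the sorted set of standard symbols mentioned in the text."""
    return sorted({term_to_symbol[m.group(0)] for m in pattern.finditer(text)})


term_to_symbol = load_dictionary(io.StringIO(DICTIONARY_CSV))
pattern = build_pattern(term_to_symbol)
found = find_genes(
    "Loss of p53 and BRCA1 in tumors; TP53BP1 was unaffected.",
    term_to_symbol,
    pattern,
)
print(found)
```

Here `p53` is normalized to `TP53`, while `TP53BP1` is not counted as a `TP53` mention because the match would end inside a longer token.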

## Default blacklist for ambiguous terms

LitGeneMap applies a built-in blacklist by default to reduce false positives caused by highly ambiguous short terms or common English words that may appear in HGNC aliases or custom dictionaries.

This is especially important for cases such as:

- the word `OF` being matched as an alias of `BRIP1`
- very short or common words producing inflated gene frequency or co-occurrence counts

Default behavior:

- the built-in blacklist is applied automatically
- `--blacklist my_blacklist.txt` adds your own blocked terms on top of the built-in blacklist
- `--no-default-blacklist` disables the built-in blacklist
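
The three behaviors above amount to a set union with an optional starting set. The following sketch shows that logic under stated assumptions: `DEFAULT_BLACKLIST` is a made-up stand-in (the real built-in list is not reproduced here), the helper names are hypothetical, and filtering on the raw term with exact, case-sensitive comparison is an assumption.

```python
# Hypothetical stand-in for the built-in list; not the package's real contents.
DEFAULT_BLACKLIST = {"OF", "AS", "IT"}


def effective_blacklist(user_terms=(), use_default=True):
    """Combine the built-in list (unless disabled) with user-supplied terms."""
    blocked = set(DEFAULT_BLACKLIST) if use_default else set()
    # Skip blank lines and strip whitespace, as when reading a text file.
    blocked.update(term.strip() for term in user_terms if term.strip())
    return blocked


def filter_terms(term_to_symbol, blocked):
    """Drop every raw term that appears in the blacklist."""
    return {t: s for t, s in term_to_symbol.items() if t not in blocked}


terms = {"OF": "BRIP1", "BRIP1": "BRIP1", "CAT": "CAT"}

# Default plus a user list: both "OF" and "CAT" are removed.
print(filter_terms(terms, effective_blacklist(["CAT\n"])))

# Built-in list disabled: "OF" is matched again, only "CAT" is removed.
print(filter_terms(terms, effective_blacklist(["CAT\n"], use_default=False)))
```

Note that blocking the alias `OF` does not block the gene itself: `BRIP1` is still matched through its own symbol.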

Recommended practice:

- keep the default blacklist enabled for routine analyses
- add your own blacklist for field-specific ambiguous terms
- only disable the default blacklist when you explicitly want raw matching behavior

## Recommended literature source

Literature tables exported from the R package **bibliometrix** are recommended.

However, LitGeneMap is **not limited to bibliometrix output**. Any tabular literature dataset containing at least `title` and `abstract` can be used.

## Analysis levels

### `frequency`

Outputs:

- normalized literature table
- article-gene hits
- article-gene matrix
- gene frequency table
- temporal metrics when `year` is available
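
For orientation, the sketch below shows one plausible way to derive a gene frequency table and per-year counts from article-gene hits. The data, the `(article_id, gene, year)` tuple layout, and the function names are hypothetical, and the specific temporal metrics that LitGeneMap reports are not defined here; per-year article counts are only an assumed starting point.

```python
from collections import Counter, defaultdict

# Illustrative article-gene hits: (article_id, gene, year).
hits = [
    ("a1", "TP53", 2019),
    ("a1", "BRCA1", 2019),
    ("a2", "TP53", 2021),
    ("a3", "TP53", 2021),
    ("a3", "ERBB2", 2021),
    ("a3", "TP53", 2021),  # repeated mention in the same article
]


def gene_frequency(hits):
    """Count the number of distinct articles that mention each gene."""
    articles = defaultdict(set)
    for article_id, gene, _year in hits:
        articles[gene].add(article_id)
    return Counter({gene: len(ids) for gene, ids in articles.items()})


def yearly_counts(hits):
    """Count distinct articles per gene and publication year."""
    seen = set()
    per_year = defaultdict(Counter)
    for article_id, gene, year in hits:
        if (article_id, gene) in seen:
            continue  # count each article once per gene
        seen.add((article_id, gene))
        per_year[gene][year] += 1
    return per_year


frequency = gene_frequency(hits)
by_year = yearly_counts(hits)
print(frequency.most_common())
print(dict(by_year["TP53"]))
```

Counting articles rather than raw mentions keeps a single paper that repeats a gene name many times from dominating the table.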

### `cooccurrence`

Adds:

- gene-gene co-occurrence table
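
A gene-gene co-occurrence table can be thought of as a count of unordered gene pairs that appear in the same article. The sketch below illustrates that idea with hypothetical data and a hypothetical `cooccurrence` function; it is not LitGeneMap's implementation, and any weighting or filtering the package applies is not shown.

```python
from collections import Counter
from itertools import combinations

# Illustrative input: the set of standard symbols found in each article.
article_genes = {
    "a1": {"TP53", "BRCA1"},
    "a2": {"TP53"},
    "a3": {"TP53", "ERBB2", "BRCA1"},
}


def cooccurrence(article_genes):
    """Count how many articles mention each unordered pair of genes."""
    pairs = Counter()
    for genes in article_genes.values():
        # Sorting gives each pair one canonical order, e.g. (BRCA1, TP53).
        for gene_a, gene_b in combinations(sorted(genes), 2):
            pairs[(gene_a, gene_b)] += 1
    return pairs


pair_counts = cooccurrence(article_genes)
for (gene_a, gene_b), count in sorted(pair_counts.items()):
    print(gene_a, gene_b, count)
```

Articles with a single gene contribute nothing to the table, and each pair is stored once regardless of the order in which the genes were found.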

### `full`

Adds:

- network edge table
- module assignments
- evidence scores
- top genes by module

## Output files

Depending on the analysis level, LitGeneMap may produce:

- `articles_normalized.csv`
- `article_gene_hits.csv`
- `article_gene_matrix.csv`
- `gene_frequency.csv`
- `gene_cooccurrence.csv`
- `gene_network_edges.csv`
- `gene_modules.csv`
- `gene_module_evidence_table.csv`
- `top_genes_by_module.csv`

With `--gene-input hgnc`, LitGeneMap may also export the intermediate cleaned gene tables derived from the raw HGNC file.

## Command-line help

```bash
litgenemap --help
litgenemap run --help
```

## Release workflow

### Build distributions

```bash
python -m pip install --upgrade build twine
python -m build
```

### Upload to TestPyPI

```bash
twine upload --repository testpypi dist/*
```

### Upload to PyPI

```bash
twine upload dist/*
```

## License

MIT
