Metadata-Version: 2.4
Name: dataflow-kg
Version: 0.10.0
Summary: Modern Data Centric AI system for Large Language Models
Author-email: Zhengpin Li <zpli@pku.edu.cn>
License: Apache-2.0
Project-URL: Github, https://github.com/Open-DataFlow/DataFlow-KG
Project-URL: Documentation, https://open-dataflow.github.io/DataFlow-Doc/
Project-URL: Bug Reports, https://github.com/Open-DataFlow/DataFlow-KG/issues
Keywords: Artificial Intelligence,Knowledge Graph
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3 :: Only
Requires-Python: <4,>=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy<2.0.0
Requires-Dist: datasets
Requires-Dist: scipy
Requires-Dist: torch
Requires-Dist: torchvision
Requires-Dist: torchaudio
Requires-Dist: tqdm
Requires-Dist: transformers<4.54.0
Requires-Dist: accelerate
Requires-Dist: rapidfuzz
Requires-Dist: colorlog
Requires-Dist: librosa
Requires-Dist: appdirs
Requires-Dist: datasketch
Requires-Dist: httpx[socks]
Requires-Dist: modelscope
Requires-Dist: addict
Requires-Dist: pytest
Requires-Dist: rich
Requires-Dist: chonkie
Requires-Dist: pydantic
Requires-Dist: nltk
Requires-Dist: colorama
Requires-Dist: json5
Requires-Dist: tiktoken
Requires-Dist: sqlglot
Requires-Dist: gradio>5
Requires-Dist: fasttext-wheel
Requires-Dist: openai
Requires-Dist: sentencepiece
Requires-Dist: presidio_analyzer[transformers]
Requires-Dist: vendi-score==0.0.3
Requires-Dist: google-api-core
Requires-Dist: google-api-python-client
Requires-Dist: contractions
Requires-Dist: cookiecutter
Requires-Dist: trafilatura
Requires-Dist: lxml_html_clean
Requires-Dist: pymupdf
Requires-Dist: cloudpickle
Requires-Dist: pandas
Requires-Dist: google-cloud-aiplatform>=1.55
Requires-Dist: google-cloud-bigquery
Requires-Dist: google-genai
Requires-Dist: gcsfs
Requires-Dist: networkx
Requires-Dist: pyvis
Provides-Extra: vllm
Requires-Dist: vllm<=0.9.2,>=0.7.0; extra == "vllm"
Requires-Dist: numpy<2.0.0; extra == "vllm"
Provides-Extra: vllm07
Requires-Dist: vllm<0.8; extra == "vllm07"
Requires-Dist: numpy<2.0.0; extra == "vllm07"
Provides-Extra: vllm08
Requires-Dist: vllm<0.9; extra == "vllm08"
Provides-Extra: kbc
Requires-Dist: vllm==0.6.3; extra == "kbc"
Requires-Dist: mineru[pipeline]==2.0.6; extra == "kbc"
Provides-Extra: mineru
Requires-Dist: mineru[all]; extra == "mineru"
Requires-Dist: numpy<2.0.0,>=1.24; extra == "mineru"
Requires-Dist: sglang[all]>=0.4.8; extra == "mineru"
Requires-Dist: pypdf; extra == "mineru"
Requires-Dist: reportlab; extra == "mineru"
Provides-Extra: myscale
Requires-Dist: clickhouse-driver; extra == "myscale"
Provides-Extra: sglang
Requires-Dist: sglang[all]; extra == "sglang"
Provides-Extra: litellm
Requires-Dist: litellm<2.0.0,>=1.70.0; extra == "litellm"
Provides-Extra: audio
Requires-Dist: librosa; extra == "audio"
Provides-Extra: vectorsql
Requires-Dist: sqlite-vec; extra == "vectorsql"
Requires-Dist: sqlite-lembed; extra == "vectorsql"
Requires-Dist: sentence_transformers; extra == "vectorsql"
Provides-Extra: pdf2model
Requires-Dist: llamafactory[metrics,torch]>=0.9.0; extra == "pdf2model"
Requires-Dist: vllm<0.9.2,>=0.7.0; extra == "pdf2model"
Requires-Dist: numpy<2.0.0,>=1.24; extra == "pdf2model"
Requires-Dist: mineru[pipeline]; extra == "pdf2model"
Requires-Dist: mineru-vl-utils; extra == "pdf2model"
Provides-Extra: eval
Requires-Dist: vllm<0.9.2,>=0.7.0; extra == "eval"
Provides-Extra: rag
Requires-Dist: lightrag-hku; extra == "rag"
Requires-Dist: asyncio; extra == "rag"
Dynamic: license-file

# DataFlow Knowledge Graph
*Knowledge graph data preparation with DataFlow-style operators and pipelines*

<p align="center">
  <img src="static/dataflow-KG%20framework.png" alt="DataFlow-KG framework" width="100%">
</p>

<p align="center">
  <b>DataFlow Knowledge Graph</b>: An LLM-Driven Knowledge Graph Processing Library
</p>

<p align="center">
  Build, enrich, reason over, and operationalize knowledge graphs with composable operators.
</p>

<p align="center">
  <a href="https://github.com/OpenDCAI/DataFlow-KG">GitHub</a> |
  <a href="https://zhp-li197.github.io/DataFlow-KG-Doc/zh/">Documentation</a> |
  <a href="README.zh.md">中文 README</a>
</p>

---

## 0. News

## 1. 🤖 Overview

**DataFlow-KG** (short for DataFlow Knowledge Graph) is an LLM-driven knowledge graph processing library built on top of the [DataFlow](https://github.com/OpenDCAI/DataFlow) ecosystem. It provides reusable, extensible, and modular operators for knowledge graph construction, reasoning, retrieval, querying, and domain-specific applications. DataFlow itself supplies the extensible foundation for building practical data-centric LLM workflows; DataFlow-KG adds the graph-specific operators on top of it.

Rather than treating KG workflows as isolated scripts, DataFlow-KG organizes graph capabilities into operator packages by graph type and application scenario. These operators can be composed into larger pipelines, including but not limited to:

- knowledge graph construction
- graph reasoning
- graph retrieval
- domain-specific knowledge graph applications

DataFlow-KG aims to serve as a unified infrastructure layer for research and development on graph-centric LLM applications.


## 2. ✨ Key Features

### 2.1. Modular Operator Library for KG Workflows
DataFlow-KG provides reusable operators that can be flexibly composed into pipelines for graph construction, graph enrichment, reasoning, retrieval, and task-specific graph processing. Operators are not standalone utilities. They are designed to be assembled into end-to-end workflows, enabling scalable and reproducible graph data engineering.
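
The idea of composing operators into a pipeline can be illustrated with a small, self-contained sketch. Everything here is hypothetical and for illustration only: `ToyOperator`, `ToyPipeline`, `extract_triples`, and `drop_duplicates` are not part of the DataFlow-KG API, and the "extraction" step is a plain string split standing in for an LLM-backed operator.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class ToyOperator:
    """Hypothetical operator: a named step that maps records to records."""
    name: str
    fn: Callable[[List[Record]], List[Record]]

    def run(self, records: List[Record]) -> List[Record]:
        return self.fn(records)


@dataclass
class ToyPipeline:
    """Hypothetical pipeline: runs operators in order, feeding each output forward."""
    operators: List[ToyOperator] = field(default_factory=list)

    def run(self, records: List[Record]) -> List[Record]:
        for op in self.operators:
            records = op.run(records)
        return records


def extract_triples(records: List[Record]) -> List[Record]:
    # Stand-in for an LLM-backed extractor: split "head|relation|tail" strings.
    out = []
    for r in records:
        head, relation, tail = str(r["text"]).split("|")
        out.append({"head": head, "relation": relation, "tail": tail})
    return out


def drop_duplicates(records: List[Record]) -> List[Record]:
    # Keep the first occurrence of each (head, relation, tail) triple.
    seen, out = set(), []
    for r in records:
        key = (r["head"], r["relation"], r["tail"])
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out


pipeline = ToyPipeline([
    ToyOperator("extract", extract_triples),
    ToyOperator("dedup", drop_duplicates),
])

triples = pipeline.run([
    {"text": "Paris|capital_of|France"},
    {"text": "Paris|capital_of|France"},
    {"text": "Berlin|capital_of|Germany"},
])
print(triples)
```

Because each step only consumes and produces a list of records, steps can be reordered, swapped, or reused across pipelines without touching the others. The real operators may use a different interface.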

### 2.2. Unified Support for Multiple KG Paradigms
The library supports a broad range of graph settings in one framework, including general KG, commonsense KG, temporal KG, multimodal KG, hyper-relational KG, Graph RAG, and domain-specific KGs. As an extension of DataFlow, DataFlow-KG follows the same design philosophy of composable operators and pipeline-based processing, making it easy to integrate with broader data preparation workflows.

### 2.3. Research-to-Application Coverage
The framework is designed for both research scenarios and practical vertical applications, supporting graph processing tasks from foundational KG construction to specialized domain deployment.


## 3. 🔍 Installation

### 3.1. Create and activate a Python environment

```bash
conda create -n dfkg python=3.10
conda activate dfkg
```

### 3.2. Install DataFlow-KG

```bash
pip install uv
uv pip install dataflow-kg
```

If you want to enable **local GPU inference**, use:

```bash
conda create -n dfkg python=3.10
conda activate dfkg

pip install uv
uv pip install "dataflow-kg[vllm]"
```

> DataFlow-KG supports Python >= 3.10.

### 3.3. Verify the installation

Check that the installation succeeded with:

```bash
dfkg -v
```

If the installation succeeded and you are on the latest release, the output looks similar to the following (version numbers will differ):

```log
open-dataflow-kg codebase version: 0.9.0
        Checking for updates...
        Local version:  0.9.0
        PyPI newest version:  0.9.0
        You are using the latest version: 0.9.0.
```

In addition, the `dfkg env` command can be used to inspect the current hardware and software environment, which is useful for bug reporting:

```bash
dfkg env
```
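
The same basic checks can also be made from Python, which is handy inside a notebook or a CI job. The following is a minimal sketch using only the standard library. It assumes the package is installed under the distribution name `dataflow-kg` (the name used in the install command above). The helper names `installed_version` and `python_is_supported` are illustrative and not part of DataFlow-KG.

```python
import sys
from importlib import metadata
from typing import Optional, Tuple


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def python_is_supported(minimum: Tuple[int, int] = (3, 10)) -> bool:
    """Compare the running interpreter against a minimum (major, minor) version."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("dataflow-kg version:", installed_version("dataflow-kg"))
```

A result of `None` for the version means the distribution is not visible to the interpreter you are running, which usually points to the wrong environment being active.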


## 4. 🚀 Quickstart

DataFlow-KG follows a **code generation + custom modification + script execution** workflow. In practice, you initialize a project with the CLI, customize the generated pipeline script if needed, and then run the Python file to execute your workflow.

You can get started in **three steps**.

### 4.1. Initialize a project

Run the following command in an empty directory:

```bash
dfkg init
```

### 4.2. Choose a pipeline type

Pipelines that share a name across folders are usually incremental variants of the same workflow that differ in the resources they require:

| Directory       | Required Resources        |
| --------------- | ------------------------- |
| `api_pipelines` | CPU + LLM API             |
| `gpu_pipelines` | CPU + LLM API + local GPU |

> **Tip:** If you are new to DataFlow-KG, start with `api_pipelines`.
> Later, if you have a local GPU, you can replace `LLMServing` with a local model backend.


### 4.3. Run your first pipeline

Go into any pipeline directory, for example:

```bash
cd api_pipelines
```
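
To see which scripts a directory contains, a short helper such as the one below can list them. This is a sketch under one assumption: that generated scripts end in `_pipeline.py`, inferred from the `xxx_pipeline.py` placeholder used in this README. The function name `list_pipeline_scripts` is illustrative, so adjust the suffix if your generated files are named differently.

```python
from pathlib import Path
from typing import List


def list_pipeline_scripts(directory: str, suffix: str = "_pipeline.py") -> List[str]:
    """Return the sorted names of files in `directory` that end with `suffix`."""
    root = Path(directory)
    if not root.is_dir():
        # A missing directory yields an empty list rather than an exception.
        return []
    return sorted(
        p.name for p in root.iterdir() if p.is_file() and p.name.endswith(suffix)
    )


if __name__ == "__main__":
    # Run from inside a pipeline directory such as api_pipelines.
    for name in list_pipeline_scripts("."):
        print(name)
```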

Open one of the generated Python pipeline files. In most cases, you only need to check two configurations:

#### 4.3.1. Input data path

```python
self.storage = FileStorage(
    first_entry_file_name="<path_to_dataset>"
)
```

By default, this points to the provided example dataset, so you can run it directly.
You can also replace it with your own dataset path.

#### 4.3.2. LLM serving configuration

If you are using an API-based serving backend, set the API key first.

**Linux / macOS**

```bash
export DF_API_KEY=sk-xxxxx
```

**Windows CMD**

```bat
set DF_API_KEY=sk-xxxxx
```

**PowerShell**

```powershell
$env:DF_API_KEY="sk-xxxxx"
```

Then run the pipeline script:

```bash
python xxx_pipeline.py
```
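
If a run fails with an authentication error, a quick check that the key is visible to the Python process can save time. The sketch below reads `DF_API_KEY` with the standard library only. The helpers `require_api_key` and `mask` are illustrative names, not DataFlow-KG functions, and how the library itself reads the variable may differ.

```python
import os


def require_api_key(var_name: str = "DF_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it before running the pipeline script."
        )
    return key


def mask(key: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the key is safe to log."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]


if __name__ == "__main__":
    print("Using API key:", mask(require_api_key()))
```

Printing only a masked form keeps the key out of shell history and shared logs.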

---



## 5. 📚 License

DataFlow-KG is released under the **Apache License 2.0**.



## 6. 🎓 Citation
If you use DataFlow-KG in your research, please cite:

```bibtex
@misc{dataflowkg2026,
  title={DataFlow-KG: LLM-Driven Knowledge Graph Processing Library},
  author={DataFlow-KG Team},
  year={2026},
  howpublished={\url{https://github.com/OpenDCAI/DataFlow-KG}}
}
```
