Metadata-Version: 2.4
Name: cvfile-haystack
Version: 0.1.0
Summary: Haystack integration for the .cv open file format.
Project-URL: Homepage, https://cvfile.org
Project-URL: Repository, https://github.com/cvfile/cv
Project-URL: Issues, https://github.com/cvfile/cv/issues
Author: cvfile.org
License: Apache-2.0
Keywords: ats,converter,cv,haystack,pdf,pdfa,rag,resume
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.10
Requires-Dist: cvfile<1,>=0.1.0
Requires-Dist: haystack-ai<3,>=2.8
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.7; extra == 'dev'
Description-Content-Type: text/markdown

# cvfile-haystack

Haystack 2.x converter component for the [`.cv`](https://cvfile.org) open file format.

A `.cv` file is a PDF/A-3u file carrying a Markdown copy of the same content
(plus optional HTML and JSON Resume) as PDF Associated Files. Instead of
running OCR on the PDF, this component reads the embedded text payloads
directly and emits Haystack `Document` objects ready for indexing.

## Install

```bash
pip install cvfile-haystack
```

## Use

```python
from haystack_integrations.components.converters.cvfile import CVFileToDocument

converter = CVFileToDocument()
result = converter.run(sources=["resume.cv"])
documents = result["documents"]

for doc in documents:
    print(doc.meta["payload"], doc.meta["mime_type"], len(doc.content))
```

You get one `Document` per textual payload found in the file. The Markdown
copy (typically `resume.md`) is the one flagged with `meta["primary"] = True`.
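
As a rough sketch of how the `primary` flag might be used downstream, the
helper below picks the primary document out of a converter result.
`pick_primary` is a hypothetical helper, not part of this package, and
`SimpleNamespace` objects stand in for Haystack `Document` instances so the
snippet runs without Haystack installed. The payload names and contents are
illustrative only.

```python
from types import SimpleNamespace


def pick_primary(documents):
    """Return the first document flagged as primary, or None if there is none."""
    for doc in documents:
        # .get() tolerates documents whose meta lacks the "primary" key.
        if doc.meta.get("primary"):
            return doc
    return None


# Stand-ins for Haystack Document objects (illustrative values only).
docs = [
    SimpleNamespace(content="<h1>Jane</h1>", meta={"payload": "resume.html", "primary": False}),
    SimpleNamespace(content="# Jane", meta={"payload": "resume.md", "primary": True}),
]

primary = pick_primary(docs)
print(primary.meta["payload"])  # resume.md
```

With real converter output, the same function should work on
`result["documents"]`, since it only reads each document's `meta` dictionary.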

### Primary only

To keep only the canonical Markdown copy and skip language alternates and
supplements:

```python
converter = CVFileToDocument(primary_only=True)
```

### Pipeline use

```python
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.cvfile import CVFileToDocument

store = InMemoryDocumentStore()
pipe = Pipeline()
pipe.add_component("read", CVFileToDocument(primary_only=True))
pipe.add_component("embed", SentenceTransformersDocumentEmbedder(model="BAAI/bge-m3"))
pipe.add_component("write", DocumentWriter(document_store=store))
pipe.connect("read.documents", "embed.documents")
pipe.connect("embed.documents", "write.documents")

pipe.run({"read": {"sources": ["resumes/jane.cv", "resumes/john.cv"]}})
```

## Metadata fields

| Key | Description |
|---|---|
| `source` | The file path (or stream name) the document came from |
| `payload` | Name of the embedded file (e.g. `resume.md`) |
| `mime_type` | MIME type of the payload (`text/markdown`, `text/html`, `application/json`) |
| `relationship` | PDF Associated Files relationship (`Alternative` for primary alternates) |
| `language` | BCP 47 language tag for this payload |
| `primary` | `True` for the payload declared as primary in the file's XMP metadata |
| `cv_version` | Version of the `.cv` spec the file conforms to |
| `cv_generator` | Tool that produced the file, if recorded |

## License

Apache-2.0.
