pyaegean
Copyright 2026 Ryan Pavlicek

This product includes software developed by Ryan Pavlicek, licensed under the
Apache License, Version 2.0.

================================================================================
Third-party data and attributions
================================================================================

Linear A corpus (bundled text JSON: src/aegean/data/bundled/lineara/)
  Derived from the GORILA edition — Godart, L. & Olivier, J.-P. (1976–1985),
  "Recueil des inscriptions en linéaire A" (Études Crétoises) — via the
  open dataset at https://github.com/mwenge/lineara.xyz.
  Used as a scholarly reference corpus. Transliterations and structured
  metadata only; no facsimile imagery is bundled or redistributed.

Linear A facsimile / photographic imagery (fetched on demand, NOT redistributed)
  © École Française d'Athènes and respective rights holders. Downloaded by the
  user on demand from the linearaworkbench repository for academic reference;
  never re-hosted by this package.

Ancient Greek Dependency Treebank — AGDT v2.1 (fetched on demand, NOT redistributed)
  Perseus Ancient Greek Dependency Treebank, v2.1
  (https://github.com/PerseusDL/treebank_data), licensed CC BY-SA 3.0. Downloaded
  by the user on demand via aegean.greek.use_treebank() / use_parser() / use_tagger() /
  use_lemmatizer(); pyaegean builds derived artifacts (a lemma/morphology lexicon, a
  dependency-parser model, a POS-tagger model, and a lemmatizer model) in the local cache
  and neither bundles nor re-hosts the treebank itself. Attribute the AGDT in academic
  work that relies on it.

Liddell-Scott-Jones lexicon — Perseus LSJ (fetched on demand, NOT redistributed)
  A Greek-English Lexicon (Liddell, Scott, Jones), digitized by the Perseus Digital
  Library (https://github.com/PerseusDL/lexica), licensed CC BY-SA 4.0. Text provided
  under a CC BY-SA license by Perseus Digital Library, http://www.perseus.tufts.edu,
  with funding from The National Endowment for the Humanities. Data accessed from
  https://github.com/PerseusDL/lexica/. Downloaded by the user on demand via
  aegean.greek.use_lsj(); pyaegean builds a derived index in the local cache and
  neither bundles nor re-hosts the lexicon.

Neural Greek lemmatizer model (fetched on demand, NOT bundled)
  An ONNX seq2seq lemmatizer fine-tuned from bowphs/GreTa (Riemenschneider & Frank,
  "Exploring Large Language Models for Ancient Greek and Latin", Apache-2.0;
  arXiv:2410.12055) on the AGDT (CC BY-SA 3.0), the Pedalion treebanks (CC BY-SA 4.0),
  and the Gorman Ancient Greek treebanks (CC0). The resulting model — ONNX weights plus a
  derived gold lemma lookup — is licensed CC BY-SA 4.0. Downloaded by the user on demand
  via aegean.greek.use_neural_lemmatizer() (the [neural] extra); fetched to the local cache,
  never bundled in the Apache-2.0 wheel. Attribute GreTa and the treebanks in academic work.

PROIEL treebank — Ancient Greek (fetched on demand, NOT redistributed)
  The PROIEL treebank (https://github.com/proiel/proiel-treebank) — the Greek New Testament
  and Herodotus — licensed CC BY-NC-SA 3.0. Downloaded by the user on demand via
  aegean.greek.evaluate_on_proiel() as a neutral, out-of-AGDT evaluation set; pyaegean reads it
  locally in the cache and neither bundles nor re-hosts it (NonCommercial + ShareAlike). Cite:
  Dag T. T. Haug and Marius L. Jøhndal (2008), "Creating a Parallel Treebank of the Old
  Indo-European Bible Translations", Proc. LaTeCH 2008, pp. 27–34.

Aegean syllabic sign data — Unicode Character Database (bundled)
  The Linear B (src/aegean/data/bundled/linearb/), Cypriot (.../cypriot/), and Cypro-Minoan
  (.../cyprominoan/) sign inventories — and the phonetic maps for the two deciphered scripts —
  are derived from the Unicode Character Database (UnicodeData.txt — the "Linear B Syllabary",
  "Linear B Ideograms", "Cypriot Syllabary", and "Cypro-Minoan" blocks), Copyright
  © 1991–present Unicode, Inc., distributed under the Unicode License v3
  (https://www.unicode.org/license.txt); the only obligation is to retain this notice. Cypro-Minoan
  is undeciphered, so its signs carry conventional numbers (CM001 …) and no phonetic values. The
  bundled illustrative samples of transliterations and sign sequences are scholarly fact (after
  Ventris & Chadwick, Documents in Mycenaean Greek; Masson, Les inscriptions chypriotes syllabiques;
  and Ferrara, Cypro-Minoan Inscriptions), included as excerpts, not corpora.

Linear B corpora — DAMOS, LiBER (NOT bundled, NOT fetched by default)
  No openly-licensed Linear B tablet corpus exists. DAMOS (Database of Mycenaean at Oslo) is
  CC BY-NC-SA 4.0; LiBER (CNR) is all-rights-reserved. pyaegean bundles neither and fetches
  neither by default; set PYAEGEAN_LINEARB_CORPUS_URL to load your own licensed copy, which
  pyaegean parses locally and never re-hosts.

Planned (future versions; included here for forward attribution):
  - First1KGreek / Open Greek and Latin — CC BY.
  - Morpheus morphological data; LSJ lexicon (per applicable license).
  Each will be attributed and license-gated as it is integrated.

The structured-data layer of this package is licensed Apache-2.0; underlying
scholarly editions and imagery remain under their respective rights. Cite the
original editions in academic work (see Corpus.provenance.cite()).
