Dilemma - Third-Party Notices and Attributions
================================================

Dilemma's own source code is released under the MIT License (see LICENSE).

The data artifacts that Dilemma builds and distributes - the SQLite lookup
database, the trained character-level model, the corpus-frequency tables, and
the lemma/form corpus-attestation databases - are DERIVED from the third-party
sources listed below and remain subject to their respective licenses. They are
not covered by the MIT license that applies to the code.

Public-domain texts, and faithful digitizations of public-domain texts, are
credited here for scholarship and provenance; the digitization itself is not
treated as a separately licensable layer.

Each entry gives, where applicable: title, author/maintainer, source link, and
license. "Derived/extracted" means Dilemma uses data computed from the source
(e.g. form-to-lemma pairs, headword lists, token statistics) rather than
redistributing the source verbatim.
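
As an illustration of what "derived/extracted" means in practice, the sketch
below pulls form-to-lemma pairs and token counts out of a treebank file and
discards everything else. It is a minimal sketch, not Dilemma's build code:
the function name `extract_pairs` and the inline sample are hypothetical, and
the `word` element with `form`, `lemma`, `head`, and `relation` attributes is
assumed to follow the AGDT-style XML layout.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample in the style of an AGDT sentence (illustration only).
SAMPLE = """<treebank>
  <sentence id="1">
    <word id="1" form="μῆνιν" lemma="μῆνις" relation="OBJ" head="2"/>
    <word id="2" form="ἄειδε" lemma="ἀείδω" relation="PRED" head="0"/>
    <word id="3" form="θεά" lemma="θεά" relation="ExD" head="2"/>
  </sentence>
</treebank>"""


def extract_pairs(xml_text):
    """Return (form-lemma pairs, per-form token counts).

    The head/relation attributes (the syntactic layer) are never read.
    """
    root = ET.fromstring(xml_text)
    pairs = set()
    counts = Counter()
    for word in root.iter("word"):
        form = word.get("form")
        lemma = word.get("lemma")
        if not form or not lemma:
            continue  # skip tokens without a usable annotation
        pairs.add((form, lemma))
        counts[form] += 1
    return pairs, counts


pairs, counts = extract_pairs(SAMPLE)
```

Only the pair set and the counts leave the function, so the source text and
its dependency structure are not carried into the output.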


Lexica and dictionaries
-----------------------

- English Wiktionary and Greek Wiktionary (Wikimedia community).
  https://en.wiktionary.org/ , https://el.wiktionary.org/
  Obtained as JSONL via kaikki.org (https://kaikki.org/).
  License: CC BY-SA 3.0  (https://creativecommons.org/licenses/by-sa/3.0/).
  Glosses and inflection data are extracted and transformed.

- A Greek-English Lexicon (LSJ), H.G. Liddell, R. Scott, H.S. Jones,
  9th ed. (Oxford, 1940). Public domain. Headwords, forms, and POS via the
  LSJ9 processed exports (https://github.com/ciscoriordan/lsj9).

- Greek Lexicon of the Roman and Byzantine Periods, E.A. Sophocles.
  Public domain. TEI via Ionian University / Internet Archive
  (https://archive.org/details/pateres).

- Perseus Digital Library lexica - Lewis & Short, Pape, Bailly (headword
  filter only). https://www.perseus.tufts.edu/

- Diccionario Griego-Español (DGE), CSIC (headwords only).
  http://dge.cchs.csic.es/

- Lexicon of Greek Personal Names (LGPN), University of Oxford (proper names
  only). https://www.lgpn.ox.ac.uk/


Treebanks and annotated corpora
-------------------------------

- GLAUx, Alek Keersmaekers (2021).
  https://github.com/alekkeersmaekers/glaux
  License: CC BY-SA 4.0 for most texts; some source texts are more restrictive
  (e.g. CC BY-NC), with the per-text license recorded in GLAUx's metadata.
  GLAUx aggregates the dependency treebanks listed below.

- The Ancient Greek Dependency Treebanks (AGDT / Perseus), Perseus Project
  (G. Crane et al.). https://github.com/PerseusDL/treebank_data
  License: CC BY-SA 3.0 US. Dilemma extracts form-lemma pairs from this
  openly licensed original, NOT from the Universal Dependencies release
  (UD_Ancient_Greek-Perseus), which is CC BY-NC-SA 3.0.

- PROIEL Treebank, the PROIEL project.
  https://github.com/UniversalDependencies/UD_Ancient_Greek-PROIEL
  License: CC BY-NC-SA 3.0  (NonCommercial). NOT used by Dilemma. GLAUx
  independently re-annotates some of the same texts under its own
  CC BY-SA 4.0 license, and Dilemma does use that GLAUx data.

- Gorman Treebanks, Vanessa Gorman.
  https://github.com/perseids-publications/gorman-trees
  License: CC BY-SA 4.0  (the authoritative TREEBANK_LICENSE).

- Pedalion Trees. https://github.com/perseids-publications/pedalion-trees
  License: CC BY-SA 4.0.

- Harrington Trees. https://github.com/perseids-publications/harrington-trees
  License: CC BY-SA 4.0.

- The Diorisis Ancient Greek Corpus, A. Vatri and B. McGillivray (2018).
  https://figshare.com/articles/dataset/The_Diorisis_Ancient_Greek_Corpus/6187256
  License: CC BY 4.0  (per the figshare record).

- First1KGreek, Open Greek and Latin.
  https://github.com/OpenGreekAndLatin/First1KGreek
  License: CC BY-SA 4.0.

- Patristic Text Archive (PTA).
  https://github.com/PatristicTextArchive/pta_data
  License: per-file Creative Commons licenses, recorded in each file's header.

- Patrologia Graeca (Migne). Public domain. Used via the Open Greek Corpus's
  corrected text (below), which derives from the Calfa OCR
  (https://github.com/calfa-co/Patrologia-Graeca, CC BY 4.0).

- Byzantine vernacular corpus, Francisco Riordan.
  https://github.com/ciscoriordan/byzantine-vernacular-corpus
  Now maintained as the `byzantine` source inside the Open Greek Corpus.

- Open Greek Corpus.
  https://github.com/open-greek/open-greek-corpus
  License: CC BY-SA 4.0, with per-work source and license recorded in its
  corpus_editions.json. Supplies two things: the corrected Patrologia Graeca
  text and OCR'd editions read by the form-attestation pass, and the
  public_lexicon.tsv open-text frequency rollup (First1KGreek, corrected
  Patrologia Graeca, Perseus canonical-greekLit, byzantium.gr, Byzantine
  vernacular), which is merged into corpus_freq.json.


Evaluation data
---------------

- HNC Golden Corpus, CLARIN:EL. License: openUnderPSI.
  https://inventory.clarin.gr/corpus/870

- DBBE datasets, Swaelens et al. License: CC BY 4.0.
  https://github.com/coswaele/ByzantineGreekDatasets


Frequencies and assets
----------------------

- glossAPI / Wikisource Greek texts.
  https://huggingface.co/datasets/glossAPI/Wikisource_Greek_texts

- Flag icons, svg-flags. https://github.com/ciscoriordan/svg-flags


A note on source licensing
--------------------------

Every source in Dilemma's shipped data and models is openly licensed: no
NonCommercial-licensed source is included. The PROIEL treebank (CC BY-NC-SA,
with no permissive release) is excluded entirely. The Ancient Greek Dependency
Treebank is taken from the Perseus AGDT original (CC BY-SA 3.0 US), not the
NonCommercial Universal Dependencies repackaging (UD_Ancient_Greek-Perseus).
Gorman is CC BY-SA 4.0 and Diorisis is CC BY 4.0. The handful of NonCommercial
source texts within GLAUx, and the per-file NonCommercial texts within the
Patristic Text Archive, are filtered out at build time (see build/nc_filter.py).
What Dilemma extracts from any treebank is factual form-to-lemma annotation and
aggregate token statistics, not the syntactic apparatus.
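
The build-time filtering described above can be sketched roughly as follows.
This is NOT the contents of build/nc_filter.py; it is a minimal sketch that
assumes each text carries a `license` string in its metadata. The names
`is_noncommercial` and `filter_open`, the record layout, and the sample ids
are all hypothetical.

```python
import re

# Matches the NonCommercial element of a license name,
# e.g. "CC BY-NC 4.0", "CC BY-NC-SA 3.0", or "NonCommercial".
_NC = re.compile(r"\bNC\b|non-?commercial", re.IGNORECASE)


def is_noncommercial(license_name):
    """True if the license name contains a NonCommercial element."""
    return bool(_NC.search(license_name))


def filter_open(texts):
    """Keep texts with a recorded license that has no NonCommercial element.

    Texts with no recorded license are dropped as well (an assumption of
    this sketch: unknown is treated as not cleared).
    """
    kept = []
    for text in texts:
        license_name = text.get("license")
        if not license_name or is_noncommercial(license_name):
            continue
        kept.append(text)
    return kept


# Hypothetical metadata records, for illustration only.
texts = [
    {"id": "text-a", "license": "CC BY-SA 4.0"},
    {"id": "text-b", "license": "CC BY-NC 4.0"},
    {"id": "text-c", "license": "CC BY-NC-SA 3.0"},
    {"id": "text-d", "license": "CC BY 4.0"},
    {"id": "text-e", "license": None},
]
kept_ids = [t["id"] for t in filter_open(texts)]
```

Dropping records with a missing license is the conservative choice here: a
text is included only when its metadata positively records an open license.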

The shipped Modern Greek POS / dependency tagger is trained on UD_Greek-GUD
(CC BY-SA 4.0) plus the CC BY-SA dialect treebanks, NOT on UD_Greek-GDT, which
is CC BY-NC-SA. The Modern Greek lemmatizer is built from Wiktionary and the
HNC Golden Corpus and is unaffected by the UD_Greek-GDT exclusion.
