Data4Now
D4N-STATIN-ARCH-002 · v2.1
Technical proposal · STATIN engagement

Statistical Datalake Architecture

A medallion-zone, federation-first design for the Statistical Institute of Jamaica — replacing manual ingestion and ad-hoc spreadsheets with auditable, on-prem data flows.

Prepared for
Statistical Institute of Jamaica
Office of the Director General
Prepared by
Data4Now · Architecture practice
M. Chen · principal architect
Classification
Internal — client & partner distribution
Date issued
14 May 2026 · supersedes v2.0
Review window
14 May — 04 June 2026
Approvers
STATIN CIO · D4N principal · UN DESA observer
Ecosystem partners referenced
United Nations Development Programme
UN DESA Statistics
The World Bank
Global Partnership for Sustainable Development Data
Sustainable Development Solutions Network
Table of contents

Sections are numbered for cross-reference. Indented entries are sub-sections. Page numbers match the printed pagination, not the PDF viewer.

01  Executive summary: The case for a federated, on-prem datalake, in two pages. (p. 3)
02  Current challenges & opportunities: Where statistical workflows break today. (p. 5)
    2.1  Manual ingestion bottlenecks (p. 6)
    2.2  Spreadsheet sprawl & reconciliation drift (p. 7)
    2.3  Storage & backup posture (p. 8)
03  Proposed architecture: NiFi · MinIO · Trino · Airflow · JupyterHub on Kubernetes. (p. 10)
    3.1  Ingestion — Apache NiFi (p. 11)
    3.2  Storage — MinIO & medallion zones (p. 13)
    3.3  Federation — Trino (p. 15)
    3.4  Orchestration — Apache Airflow (p. 17)
    3.5  Analysis — JupyterHub (p. 18)
04  Integration with existing infrastructure: Nutanix · Veeam · SQL Server · Active Directory. (p. 19)
05  Governance, RBAC & audit: Who can read what, and how we prove it. (p. 21)
06  Rollout plan & milestones (p. 23)
07  Appendix · Glossary & references (p. 24)
§ 03 · Proposed architecture

Section 3.1

Ingestion — Apache NiFi replaces all manual data collection

Every external source (SFTP drops, mail attachments, vendor REST endpoints, JDBC pulls from MariaDB) flows through one validated, audit-trailed ingestion layer. For the sources it covers, the admin's morning routine of downloading, renaming, and folder-moving disappears.

3.1.1 — Sources covered at go-live

The first NiFi release covers five sources that together account for ~83% of the manual hours logged in the 2025 STATIN time-tracking audit. Each becomes a separate ProcessGroup with its own schema validator, retry policy, and provenance hash.
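
As a rough illustration of what a per-source schema validator and provenance hash might do, the sketch below checks CSV rows against a column schema, sets failing rows aside for quarantine, and fingerprints the raw payload with SHA-256. It is plain Python rather than NiFi configuration. The column names, the `CPI_V3` mapping, and the function names are hypothetical, since the actual cpi.v3 schema is not reproduced in this document. Retry policy is left out of the sketch.

```python
import csv
import hashlib
import io

# Hypothetical stand-in for the "cpi.v3" schema: column name -> parser.
CPI_V3 = {"item_code": str, "region": str, "price": float}


def provenance_hash(payload: bytes) -> str:
    """Return the SHA-256 hex digest of the raw bytes, taken before parsing."""
    return hashlib.sha256(payload).hexdigest()


def validate(payload: bytes, schema=CPI_V3):
    """Split CSV rows into (valid, quarantined) lists against the schema."""
    valid, quarantined = [], []
    reader = csv.DictReader(io.StringIO(payload.decode("utf-8")))
    for row in reader:
        try:
            # Reject rows whose columns do not match the schema exactly.
            if set(row) != set(schema):
                raise ValueError("unexpected columns")
            # Parse each field; a bad or missing value raises and is caught.
            valid.append({name: parse(row[name]) for name, parse in schema.items()})
        except (ValueError, TypeError):
            quarantined.append(row)
    return valid, quarantined


sample = b"item_code,region,price\nA01,KGN,102.5\nA02,KGN,n/a\n"
good, bad = validate(sample)
print(provenance_hash(sample)[:12], len(good), len(bad))
```

In this sample the second row fails because `n/a` does not parse as a number, so one row is accepted and one is quarantined.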

Table 3.1 — Day-one ingestion sources

Source                Owner                Method               Records/day  Latency
CPI submissions       STATIN — Prices      NiFi SFTP listener   14,820       5 min
Customs declarations  Jamaica Customs      NiFi REST poll       9,610        15 min
Trade statistics      Min. of Industry     JDBC pull · MariaDB  22,180       10 min
Population estimates  STATIN — Demography  Manual upload        240          24 hr
Daily total                                                     50,250
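
To give a feel for the batch sizes these volumes imply, the sketch below divides each source's records per day by the number of latency windows in a day. It assumes, purely for illustration, that the Latency column equals the collection interval and that records arrive evenly through the day. Neither assumption is stated in the table, and the helper name is hypothetical.

```python
# Figures copied from the itemized rows of Table 3.1:
# source -> (records per day, latency in minutes).
sources = {
    "CPI submissions": (14_820, 5),
    "Customs declarations": (9_610, 15),
    "Trade statistics": (22_180, 10),
    "Population estimates": (240, 24 * 60),
}

MINUTES_PER_DAY = 24 * 60


def records_per_window(rec_per_day: int, latency_min: int) -> float:
    """Average records in one latency window, assuming even arrival (assumption)."""
    windows_per_day = MINUTES_PER_DAY / latency_min
    return rec_per_day / windows_per_day


for name, (rec, latency) in sources.items():
    print(f"{name}: {records_per_window(rec, latency):.1f} records per window")

# Sum of the itemized rows only.
itemized_total = sum(rec for rec, _ in sources.values())
print("itemized rows sum to", itemized_total)  # 46850
```

Under these assumptions the CPI listener would handle roughly 50 records per five-minute window. The last line sums only the four itemized rows, which can be compared against the stated daily total.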

3.1.2 — What this replaces

Replaces

Manual SFTP polling · CSV rename scripts · Excel VLOOKUP across three workbooks · USB-drive backups · the email thread titled "FINAL_v7_use this one.xlsx".

3.1.3 — Flows are deployed as code

Every ProcessGroup is versioned in the NiFi Registry, peer-reviewed in a pull request, and rolled out by Airflow. The flow descriptor below defines the CPI listener.

Listing 3.1 — cpi-sftp-listener.flow.json

JSON · NiFi flow descriptor
{
  "name": "cpi-sftp-listener",
  "zone": "raw",
  "schedule": "*/5 * * * *",
  "source": { "type": "sftp", "path": "/in/cpi/*.csv" },
  "validate": { "schema": "cpi.v3", "on_fail": "quarantine" },
  "sink": "s3://lake/raw/cpi/{yyyy}/{mm}/"
}
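
A descriptor like Listing 3.1 could be sanity-checked before it reaches review. The sketch below parses the JSON, confirms the top-level keys shown in the listing are present, checks that the schedule has five cron fields, and expands the sink template for a given date. It assumes `{yyyy}` and `{mm}` stand for a four-digit year and two-digit month. The `load_descriptor` and `resolve_sink` helpers are hypothetical and are not part of NiFi or the `d4n` tooling.

```python
import json
from datetime import datetime, timezone

# Descriptor text copied from Listing 3.1.
DESCRIPTOR = """
{
  "name": "cpi-sftp-listener",
  "zone": "raw",
  "schedule": "*/5 * * * *",
  "source": { "type": "sftp", "path": "/in/cpi/*.csv" },
  "validate": { "schema": "cpi.v3", "on_fail": "quarantine" },
  "sink": "s3://lake/raw/cpi/{yyyy}/{mm}/"
}
"""

# Top-level keys that appear in the listing.
REQUIRED = {"name", "zone", "schedule", "source", "validate", "sink"}


def load_descriptor(text: str) -> dict:
    """Parse the descriptor and run basic structural checks."""
    flow = json.loads(text)
    missing = REQUIRED - flow.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if len(flow["schedule"].split()) != 5:
        raise ValueError("schedule must be a five-field cron expression")
    return flow


def resolve_sink(flow: dict, when: datetime) -> str:
    """Expand {yyyy} and {mm} (assumed to mean year and month) in the sink path."""
    return flow["sink"].replace("{yyyy}", f"{when:%Y}").replace("{mm}", f"{when:%m}")


flow = load_descriptor(DESCRIPTOR)
print(resolve_sink(flow, datetime(2026, 5, 14, tzinfo=timezone.utc)))
# s3://lake/raw/cpi/2026/05/
```

Running the check on the issue date of this document resolves the sink to the May 2026 partition of the raw zone.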

Listing 3.2 — registering & deploying the flow

d4n@statin-jump — zsh
$ d4n flow register --file cpi-sftp-listener.flow.json
  ✔ schema cpi.v3 resolved
  ✔ published to NiFi Registry @ rev f3a91c
$ d4n flow deploy --env prod --rev f3a91c
  ✔ deployed · next run 06:00 UTC
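
The two steps in Listing 3.2 are linked by the revision identifier. As a sketch of how a script might chain them, the code below pulls the revision out of the register output and builds the matching deploy command. It assumes the revision is a lowercase hexadecimal string and that the output format stays as shown. The helper names are hypothetical, and the command is only assembled here, not executed.

```python
import re

# Output text copied from Listing 3.2.
REGISTER_OUTPUT = """\
  ✔ schema cpi.v3 resolved
  ✔ published to NiFi Registry @ rev f3a91c
"""

# Assumption: the revision is lowercase hex following "@ rev ".
REV_PATTERN = re.compile(r"@ rev ([0-9a-f]+)")


def extract_rev(output: str) -> str:
    """Return the revision identifier found in the register output."""
    match = REV_PATTERN.search(output)
    if match is None:
        raise ValueError("no revision found in register output")
    return match.group(1)


def deploy_command(rev: str, env: str = "prod") -> list[str]:
    """Build the argument list for the deploy step shown in Listing 3.2."""
    return ["d4n", "flow", "deploy", "--env", env, "--rev", rev]


rev = extract_rev(REGISTER_OUTPUT)
print(" ".join(deploy_command(rev)))
# d4n flow deploy --env prod --rev f3a91c
```

Passing the revision explicitly, rather than deploying whatever is latest, keeps the deployed flow tied to the exact version that was reviewed.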