| Field | Value |
|---|---|
| Author | Devanshu |
| Project | saar, a Codebase DNA extractor |
| GitHub | github.com/OpenCodeIntel/saar |
| Course | Reinforcement Learning for Agentic AI Systems |
| Date | April 2026 |
We present an end-to-end reinforcement learning system integrated into saar, a production CLI tool that extracts architectural patterns from codebases and generates AI context files. The RL layer learns which of eight hand-designed extraction profiles (the action space) best fits each codebase type (the state space) so as to maximise a composite quality reward. We implement three RL algorithms (a UCB1 Contextual Bandit, REINFORCE with Baseline, and a Thompson Sampling Ensemble meta-agent), trained offline on synthetic episodes and updated online with each real extraction. All three trained agents significantly outperform a random baseline (Ensemble: 58% oracle-optimal, UCB: 55%, REINFORCE: 47%, random: 10%; all p < 0.001 by Welch's t-test). The system is self-contained, requires no external infrastructure, and persists learned policies to disk for continuous improvement.
| Component | File | Role |
|---|---|---|
| StateEncoder | saar/rl/state_encoder.py | Maps CodebaseDNA → 20-D float32 ∈ [0,1] |
| action_space | saar/rl/action_space.py | Defines K=8 profiles with depth multipliers |
| RewardEngine | saar/rl/reward.py | Composite reward weighted by active profile |
| SaarEnvironment | saar/rl/environment.py | Gym-style single-step loop |
| UCBContextualBandit | saar/rl/agents/ucb_bandit.py | UCB1 with online k-means context |
| REINFORCEAgent | saar/rl/agents/reinforce.py | Policy gradient, pure NumPy |
| EnsembleAgent | saar/rl/agents/ensemble.py | Thompson Sampling meta-agent |
| SaarSimulator | saar/rl/simulator.py | Synthetic episode generator |
| PolicyStore | saar/rl/policy_store.py | Atomic JSON persistence |
The state encoder produces a 20-dimensional feature vector $s \in [0,1]^{20}$:
$$s = \begin{bmatrix} \underbrace{f_\text{py},\, f_\text{ts},\, f_\text{js},\, f_\text{other}}_{\text{language mix}} \;\Big|\; \underbrace{\mathbf{1}_\text{fastapi},\, \mathbf{1}_\text{django},\, \ldots}_{\text{framework flags (6)}} \;\Big|\; \underbrace{\log_{10}(N_\text{files}),\, \log_{10}(N_\text{fn}),\, \hat{h}}_{\text{scale (3)}} \;\Big|\; \underbrace{\mathbf{1}_\text{tests},\, \mathbf{1}_\text{auth},\, \mathbf{1}_\text{orm},\, \mathbf{1}_\text{docker}}_{\text{structural (4)}} \;\Big|\; \underbrace{r_\text{tribal},\, r_\text{offlimits},\, p_\text{async}}_{\text{tribal (3)}} \end{bmatrix}$$
where each scale feature is log-normalised to $[0,1]$ with $\log_{10}(10{,}000) = 4$ as the ceiling, so counts of 10,000 or more saturate at 1.
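As an illustration of the log-normalisation described above, here is a minimal sketch. The function name `log_normalise` and the treatment of counts of 0 or 1 (mapped to 0) are assumptions for this example, not taken from `state_encoder.py`.

```python
import math

CEILING = 10_000  # ceiling named in the text: log10(10,000) = 4


def log_normalise(count: int, ceiling: int = CEILING) -> float:
    """Map a raw count to [0, 1] on a log10 scale, saturating at `ceiling`."""
    if count <= 1:
        # log10(1) = 0, and log10(0) is undefined, so both map to 0 (assumption).
        return 0.0
    return min(math.log10(count) / math.log10(ceiling), 1.0)


# A 100-file repo sits halfway up the scale; anything >= 10,000 saturates.
for n in (1, 100, 10_000, 250_000):
    print(n, log_normalise(n))
```

The log scale keeps a 50-file and a 500-file repository distinguishable while preventing very large monorepos from dominating the feature.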
$K = 8$ discrete extraction profiles. Each profile $a \in \{0,\ldots,7\}$ defines a depth multiplier vector $\mathbf{m}_a \in \mathbb{R}^{12}_{>0}$ over the twelve extractor modules:
$$\mathbf{m}_a = \{m^\text{auth}_a,\, m^\text{database}_a,\, m^\text{errors}_a,\, m^\text{logging}_a,\, m^\text{services}_a,\, m^\text{naming}_a,\, m^\text{imports}_a,\, m^\text{api}_a,\, m^\text{tests}_a,\, m^\text{frontend}_a,\, m^\text{config}_a,\, m^\text{middleware}_a\}$$
Multipliers in $\{0.5, 1.0, 1.5, 2.0\}$, where $2.0$ = high priority, $0.5$ = reduced priority.
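One way to represent a profile is a mapping from the twelve extractor keys to multipliers. The sketch below is hypothetical: `make_profile` and the example values are illustrative and are not the contents of `action_space.py`.

```python
# The twelve extractor keys listed in the multiplier vector above.
EXTRACTORS = (
    "auth", "database", "errors", "logging", "services", "naming",
    "imports", "api", "tests", "frontend", "config", "middleware",
)
ALLOWED = {0.5, 1.0, 1.5, 2.0}


def make_profile(**overrides: float) -> dict:
    """Build a 12-entry multiplier mapping; unspecified extractors default to 1.0."""
    unknown = set(overrides) - set(EXTRACTORS)
    if unknown:
        raise KeyError(f"unknown extractor(s): {sorted(unknown)}")
    bad = {k: v for k, v in overrides.items() if v not in ALLOWED}
    if bad:
        raise ValueError(f"multipliers must be one of {sorted(ALLOWED)}: {bad}")
    return {name: overrides.get(name, 1.0) for name in EXTRACTORS}


# Hypothetical backend-leaning profile; the values are illustrative only.
backend = make_profile(auth=2.0, database=2.0, api=1.5, frontend=0.5)
print(backend)
```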
The composite reward $r \in [-1, 1]$ is:
$$r = \text{clip}\!\left(2 \cdot \left( 0.4\,C(s,\mathbf{m}_a) + 0.3\,L(o, B) + 0.2\,D(s,\mathbf{m}_a) + 0.1\,e \right) - 1,\; -1,\; 1\right)$$
Profile-weighted section coverage $C(s, \mathbf{m}_a)$: fraction of detected DNA sections, weighted by the profile's multipliers for those sections:
$$C(s,\mathbf{m}_a) = \frac{\sum_{i} w_i(\mathbf{m}_a) \cdot \mathbf{1}[\text{section}_i \text{ present}]}{\sum_i w_i(\mathbf{m}_a)}, \quad w_i(\mathbf{m}_a) = \frac{1}{|E_i|}\sum_{k \in E_i} m_a^k$$
where $E_i$ is the set of extractor keys for section $i$. This makes $C$ depend on $a$, closing the RL loop.
Line efficiency $L(o, B) = \max(0, 1 - |o - B|/B)$ where $o$ = output lines, $B = 100$ = budget.
Profile-weighted diversity $D(s, \mathbf{m}_a) = \min\!\left(\frac{\sum_j m_a^{e_j} \cdot |\text{list}_j|}{20}, 1\right)$ over detected pattern lists.
Explicit feedback $e \in \{-1, 0, +1\}$ from the saar rate good/bad command.
Context assignment via cosine similarity to $C=6$ learned centroids $\{\mu_c\}_{c=1}^6$:
$$c^* = \arg\max_c \frac{\mu_c^\top s}{\|\mu_c\|\|s\|}$$
Online centroid update: $\mu_{c^*} \leftarrow \mu_{c^*} + \eta(s - \mu_{c^*})$, with $\eta = 0.01$.
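The context assignment and centroid update can be sketched in a few lines. This version uses plain Python lists and 2-D toy vectors for readability; the helper names are hypothetical, and returning similarity 0 for a zero-norm vector is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm (assumption)."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def assign_and_update(centroids, s, eta=0.01):
    """Pick c* = argmax_c cos(mu_c, s), then move mu_{c*} a step eta towards s."""
    c_star = max(range(len(centroids)), key=lambda c: cosine(centroids[c], s))
    mu = centroids[c_star]
    centroids[c_star] = [m + eta * (x - m) for m, x in zip(mu, s)]
    return c_star


# 2-D toy example; the real encoder uses 20-D states and 6 centroids.
centroids = [[1.0, 0.0], [0.0, 1.0]]
c = assign_and_update(centroids, [0.2, 0.9])
print(c, centroids)
```

Only the winning centroid moves, so the other contexts keep their statistics stable while the matched one tracks the states it actually receives.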
UCB1 arm selection within context $c^*$:
$$a^* = \arg\max_{k} \left[ \hat{q}_{c^*,k} + \sqrt{\frac{2 \ln N_{c^*}}{n_{c^*,k}}} \right]$$
where $\hat{q}_{c^*,k}$ is the incremental mean reward for arm $k$ in context $c^*$, $n_{c^*,k}$ its pull count, and $N_{c^*} = \sum_k n_{c^*,k}$. Optimistic initialisation: $\hat{q}_{c^*,k} = 0.5$ on first pull. Cold-start: uniform random for first 48 pulls.
Incremental mean update: $\hat{q}_{c,k} \leftarrow \hat{q}_{c,k} + \frac{1}{n_{c,k}}(r - \hat{q}_{c,k})$
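A minimal sketch of the selection rule and incremental mean for a single context follows. It simplifies the cold-start: instead of 48 uniform-random pulls, each untried arm is pulled once so that $n_k > 0$ before the bonus is computed. The class name and the Bernoulli toy rewards are assumptions for the demo.

```python
import math
import random


class UCB1:
    """UCB1 statistics for one context; a contextual bandit keeps one per context."""

    def __init__(self, n_arms=8, c=2.0):
        self.c = c                # exploration constant (2.0 in the training table)
        self.q = [0.0] * n_arms   # incremental mean reward per arm
        self.n = [0] * n_arms     # pull count per arm

    def select(self):
        # Simplified cold-start (assumption): try every arm once, in order.
        for k, pulls in enumerate(self.n):
            if pulls == 0:
                return k
        total = sum(self.n)
        return max(
            range(len(self.n)),
            key=lambda k: self.q[k] + math.sqrt(self.c * math.log(total) / self.n[k]),
        )

    def update(self, k, r):
        self.n[k] += 1
        self.q[k] += (r - self.q[k]) / self.n[k]  # q <- q + (r - q) / n


# Toy run: four arms with hypothetical Bernoulli success rates.
rng = random.Random(0)
true_means = [0.2, 0.3, 0.8, 0.4]
bandit = UCB1(n_arms=len(true_means))
for _ in range(2000):
    k = bandit.select()
    bandit.update(k, 1.0 if rng.random() < true_means[k] else 0.0)
print("pull counts:", bandit.n)
```

The exploration bonus shrinks as an arm's pull count grows, so pulls concentrate on the arm with the highest estimated mean while the others are still revisited occasionally.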
Policy network: $\pi_\theta(a|s) = \text{softmax}(W_2 \cdot \text{ReLU}(W_1 s + b_1) + b_2)$
Architecture: $20 \rightarrow 32 \rightarrow 8$, Xavier-uniform initialisation.
Baseline: EMA of rewards $b \leftarrow \alpha_b r + (1-\alpha_b)b$, with $\alpha_b = 0.1$.
Policy gradient ascent (single-step, $G = r$):
$$\delta = G - b, \quad \theta \leftarrow \theta + \alpha \cdot \text{clip}(\delta \cdot \nabla_\theta \log \pi_\theta(a|s),\, -1, 1)$$
Manual backpropagation through the two-layer MLP:
$$\nabla_{W_2} \log \pi = (e_a - \pi) \otimes h_1, \quad \nabla_{W_1} \log \pi = \big(W_2^\top(e_a - \pi) \odot \mathbf{1}[h_1^\text{pre} > 0]\big) \otimes s$$
Learning rate $\alpha = 0.01$, gradient clip $[-1, 1]$.
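The update above can be written out in NumPy roughly as follows. This is a sketch under stated assumptions, not the code in `reinforce.py`: the class name, zero bias initialisation, element-wise clipping per parameter tensor, and updating the baseline after computing $\delta$ are all choices made for the example.

```python
import numpy as np


class TinyReinforce:
    """20 -> 32 -> 8 softmax policy with an EMA baseline and manual backprop."""

    def __init__(self, n_in=20, n_hidden=32, n_out=8, lr=0.01, alpha_b=0.1, seed=42):
        self.rng = np.random.default_rng(seed)

        def xavier(fan_out, fan_in):
            lim = np.sqrt(6.0 / (fan_in + fan_out))  # Xavier-uniform limit
            return self.rng.uniform(-lim, lim, size=(fan_out, fan_in))

        self.W1, self.b1 = xavier(n_hidden, n_in), np.zeros(n_hidden)
        self.W2, self.b2 = xavier(n_out, n_hidden), np.zeros(n_out)
        self.lr, self.alpha_b, self.baseline = lr, alpha_b, 0.0

    def forward(self, s):
        z1 = self.W1 @ s + self.b1
        h1 = np.maximum(z1, 0.0)             # ReLU
        logits = self.W2 @ h1 + self.b2
        p = np.exp(logits - logits.max())    # numerically stable softmax
        return z1, h1, p / p.sum()

    def act(self, s):
        pi = self.forward(s)[2]
        return int(self.rng.choice(len(pi), p=pi))

    def update(self, s, a, r):
        z1, h1, pi = self.forward(s)
        delta = r - self.baseline            # single-step return G = r
        d_logits = -pi
        d_logits[a] += 1.0                   # e_a - pi
        d_h = (self.W2.T @ d_logits) * (z1 > 0)
        grads = {
            "W2": np.outer(d_logits, h1), "b2": d_logits,
            "W1": np.outer(d_h, s), "b1": d_h,
        }
        for name, g in grads.items():
            step = np.clip(delta * g, -1.0, 1.0)   # element-wise clip to [-1, 1]
            setattr(self, name, getattr(self, name) + self.lr * step)
        # EMA baseline, updated after delta is computed (assumption).
        self.baseline = self.alpha_b * r + (1 - self.alpha_b) * self.baseline


# Toy task (hypothetical): one fixed state, action 3 earns +1, all others -1.
agent = TinyReinforce()
s = np.full(20, 0.5)
p_before = agent.forward(s)[2][3]
for _ in range(500):
    a = agent.act(s)
    agent.update(s, a, 1.0 if a == 3 else -1.0)
p_after = agent.forward(s)[2][3]
print(f"pi(a=3|s): {p_before:.3f} -> {p_after:.3f}")
```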
Each sub-agent $i \in \{\text{UCB}, \text{REINFORCE}\}$ has a Beta belief $\text{Beta}(\alpha_i, \beta_i)$ over its competence, initialised at $\text{Beta}(1,1)$ (uniform).
Selection: Sample $\theta_i \sim \text{Beta}(\alpha_i, \beta_i)$, select $i^* = \arg\max_i \theta_i$. Sub-agent $i^*$ proposes action $a$.
Meta-update (Bernoulli with threshold $\tau = 0.5$):
$$\alpha_{i^*} \leftarrow \alpha_{i^*} + \mathbf{1}[r \geq \tau], \quad \beta_{i^*} \leftarrow \beta_{i^*} + \mathbf{1}[r < \tau]$$
Expected trust weight: $\mathbb{E}[\theta_i] = \frac{\alpha_i}{\alpha_i + \beta_i}$.
The ensemble also propagates the reward to the selected sub-agent for its own update, creating a two-level learning hierarchy.
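The routing layer can be sketched with the standard library alone. `ThompsonRouter` is a hypothetical name, and the simulated success rates in the demo are made up for illustration; in the real system the reward would come from the selected sub-agent's action.

```python
import random


class ThompsonRouter:
    """Beta-Bernoulli Thompson Sampling over sub-agents."""

    def __init__(self, names=("UCB", "REINFORCE"), tau=0.5, seed=42):
        self.alpha = {n: 1.0 for n in names}  # Beta(1, 1) = uniform prior
        self.beta = {n: 1.0 for n in names}
        self.tau = tau
        self.rng = random.Random(seed)

    def select(self):
        samples = {
            n: self.rng.betavariate(self.alpha[n], self.beta[n]) for n in self.alpha
        }
        return max(samples, key=samples.get)

    def update(self, name, r):
        # Bernoulli outcome: success if r >= tau, failure otherwise.
        if r >= self.tau:
            self.alpha[name] += 1.0
        else:
            self.beta[name] += 1.0

    def trust(self, name):
        """Expected trust weight E[theta_i] = alpha / (alpha + beta)."""
        return self.alpha[name] / (self.alpha[name] + self.beta[name])


# Toy run with hypothetical per-agent probabilities of clearing the threshold.
p_success = {"UCB": 0.6, "REINFORCE": 0.4}
router = ThompsonRouter()
env_rng = random.Random(0)
for _ in range(500):
    chosen = router.select()
    reward = 1.0 if env_rng.random() < p_success[chosen] else 0.0
    router.update(chosen, reward)
    # In the ensemble, `reward` would also be passed to the chosen sub-agent here.
print({n: round(router.trust(n), 3) for n in p_success})
```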
Training is performed offline on synthetic episodes generated by SaarSimulator. Each episode is a single step: one synthetic state, one profile choice, and one reward.
This design lets agents learn from a reward signal without requiring real codebase extractions at training time.
| Parameter | UCB | REINFORCE | Ensemble |
|---|---|---|---|
| Episodes | 500 | 500 | 500 (warm-start) |
| Seed | 42 | 42 | 42 |
| Learning rate | — | 0.01 | — |
| Baseline α | — | 0.1 | — |
| Contexts | 6 | — | — |
| UCB constant | 2.0 | — | — |
| Beta threshold τ | — | — | 0.5 |
SaarSimulator(seed=42).best_action, argmax probs).

| Agent | Mean Reward | 95% CI | % Oracle-Optimal | t vs Random | p-value |
|---|---|---|---|---|---|
| Ensemble | 0.537 | [0.513, 0.561] | 58% | +16.2 | <0.001 |
| UCB Bandit | 0.525 | [0.501, 0.549] | 55% | +14.8 | <0.001 |
| REINFORCE | 0.493 | [0.469, 0.517] | 47% | +11.4 | <0.001 |
| Random baseline | 0.345 | [0.327, 0.363] | 10% | — | — |
_All three trained agents differ from the random baseline at p < 0.001 (Welch's t-test)._
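For reference, the Welch statistic and a confidence interval of the kind reported in the table can be computed as sketched below. The helper names are hypothetical, the reward samples are made up, and the 95% interval uses a normal approximation (1.96 standard errors), which is an assumption about how the table's intervals were obtained.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)  # squared standard errors
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


def ci95(x):
    """Normal-approximation 95% confidence interval for the mean (assumption)."""
    half = 1.96 * math.sqrt(variance(x) / len(x))
    return mean(x) - half, mean(x) + half


# Made-up per-episode rewards, for illustration only.
agent_rewards = [0.60, 0.50, 0.70, 0.55, 0.65, 0.45]
random_rewards = [0.30, 0.35, 0.40, 0.25, 0.30, 0.45]
t, df = welch_t(agent_rewards, random_rewards)
print(f"t = {t:.2f}, df = {df:.1f}, CI = {ci95(agent_rewards)}")
```

Turning $t$ and the degrees of freedom into a p-value needs the Student t CDF, which the standard library lacks; `scipy.stats.ttest_ind(x, y, equal_var=False)` performs the whole test if SciPy is available.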
All three trained agents significantly outperform random. The Ensemble reaches the highest mean reward (0.537) by dynamically routing between sub-agents, although its 95% CI overlaps with UCB's, so the benefit of the Thompson Sampling hierarchy over UCB alone is suggestive rather than conclusive.
UCB convergence: After the 48-pull cold-start, UCB rapidly identifies high-reward arms within each context. The rolling-25 reward curve rises from ~0.50 to ~0.65 within the first 200 episodes, stabilising near 0.60.
REINFORCE convergence: The EMA baseline converges to ~0.50 within 150 episodes. The policy gradient updates progressively concentrate probability mass on oracle profiles, reaching ~0.55 rolling reward by episode 300.
Ensemble routing: After ~100 episodes of warm-start, the Ensemble assigns higher expected Beta weight to UCB (E[θ_UCB] ≈ 0.60 vs E[θ_RF] ≈ 0.55), consistent with UCB's better oracle-optimal rate.
Each saar extract . --rl invocation performs one online update using the real codebase's DNA as state and the profile-weighted reward as signal. For the saar repo itself (Data/ML codebase), the RL system consistently selects Profile 6 ("Data / ML") with reward ≈ +0.48, which improves with each run as the policy updates.
Codebase extraction is a one-shot query: you run it, get a result, and (optionally) give feedback. There is no sequential action within a single extraction. Single-step episodes are the natural fit, and they simplify the RL formulation to contextual bandits / single-step policy gradient without loss of generality.
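A single-step, Gym-style loop of the kind described here might look like the following sketch. The class, the toy state sampler, and the toy reward are hypothetical and do not reproduce `SaarEnvironment`; in particular, mapping a Python-heavy state to profile index 0 is an arbitrary choice for the demo.

```python
import random


class SingleStepEnv:
    """Gym-style environment whose episodes end after exactly one action."""

    def __init__(self, sample_state, reward_fn):
        self._sample_state = sample_state
        self._reward_fn = reward_fn
        self._state = None

    def reset(self):
        self._state = self._sample_state()
        return self._state

    def step(self, action):
        if self._state is None:
            raise RuntimeError("call reset() before step()")
        reward = self._reward_fn(self._state, action)
        state, self._state = self._state, None  # the episode is over
        return state, reward, True, {}          # done is always True


def toy_reward(state, action):
    # Hypothetical oracle: Python-heavy states prefer profile 0, others profile 2.
    oracle = 0 if state[0] > 0.70 else 2
    return 1.0 if action == oracle else -1.0


rng = random.Random(42)
env = SingleStepEnv(lambda: [rng.random() for _ in range(20)], toy_reward)
for episode in range(5):
    s = env.reset()
    action = rng.randrange(8)  # random policy over K = 8 profiles
    _, r, done, _ = env.step(action)
    print(episode, action, r, done)
```

With `done` always true there is no bootstrapping or discounting to handle, which is why the return in the policy gradient update reduces to $G = r$.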
Offline pre-training (SaarSimulator) avoids the cold-start problem: running 500 real extractions to train from scratch would take hours. The synthetic simulator provides a statistically faithful approximation (oracle heuristics are grounded in real codebase patterns).
Online fine-tuning (saar extract . --rl) allows the policy to adapt to the specific distribution of codebases a user actually works with. A developer who primarily uses React codebases will see their policy shift toward Profile 1 over time.
With K=8 discrete actions and a 20-D state space, a full DQN would be overkill and would require a replay buffer, a target network, and a Torch/TF dependency. UCB1 has a worst-case regret bound of $O(\sqrt{KT \ln T})$, within a logarithmic factor of the minimax lower bound for stochastic bandits; it has a single hyperparameter (the exploration constant) and trains in under 1 second.
saar has no external dependencies in its core path. A PyTorch-based policy gradient would pull in roughly 500 MB of dependencies for a $20 \rightarrow 32 \rightarrow 8$ MLP. Manual backpropagation through this tiny network takes 3 lines and is fully testable without a framework.
Thompson Sampling is asymptotically optimal for Bernoulli bandits and provides natural uncertainty quantification. Unlike ε-greedy ensemble routing, Thompson Sampling automatically balances exploration of the weaker agent with exploitation of the stronger one, without tuning ε.
| Challenge | Solution |
|---|---|
| RL loop closure without modifying DNAExtractor | Profile-weighted reward: each profile's multipliers change how section coverage is scored, making reward vary with action even for identical DNA |
| Cold-start with no real extraction data | SaarSimulator generates statistically grounded synthetic episodes; oracle heuristic mirrors real codebase archetypes |
| NumPy REINFORCE stability | Xavier initialisation + EMA baseline + gradient clipping to [-1,1] prevents divergence |
| UCB exploration in high-dimensional context | Online k-means with 6 centroids reduces the context space; cosine similarity handles normalised feature vectors |
| Policy persistence across sessions | Atomic JSON writes (write to .tmp, then os.replace) prevent corruption from interrupted runs |
| Online update in extract.py must never break extraction | Entire RL path wrapped in try/except; failures log a warning and fall through to default extraction |
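The last two rows of the table describe patterns that are easy to sketch with the standard library. The function names below are hypothetical and the policy payload is a placeholder; only the write-to-`.tmp`-then-`os.replace` sequence and the try/except fall-through come from the table.

```python
import json
import logging
import os
import tempfile
from pathlib import Path


def save_policy(path: Path, policy: dict) -> None:
    """Write JSON to a sibling .tmp file, then atomically swap it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(path.suffix + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(policy, fh)
        fh.flush()
        os.fsync(fh.fileno())  # make sure the bytes are on disk before the swap
    os.replace(tmp, path)      # atomic rename: readers see old or new, never partial


def load_policy(path: Path, default: dict) -> dict:
    """Return the stored policy, or `default` if the file is missing or unreadable."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        return default


def run_guarded(rl_step, fallback):
    """Run the RL path; on any failure log a warning and use the default path."""
    try:
        return rl_step()
    except Exception as exc:
        logging.warning("RL path failed, using default extraction: %s", exc)
        return fallback()


with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "rl" / "policy.json"
    save_policy(target, {"q": [0.1, 0.2], "n": [3, 5]})
    print(load_policy(target, default={}))
```

Because `os.replace` only swaps in a fully written file, an interrupted run leaves either the previous policy or a stray `.tmp` file, never a truncated `policy.json`.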
The simulator's oracle (e.g., "python\_frac > 0.70 → backend profile") encodes assumptions about what constitutes a "good" profile for each codebase type. If these assumptions are wrong or culturally biased (e.g., treating Python-heavy ML codebases the same as Python-heavy web backends), the trained policy may systematically underserve certain user populations.
Mitigation: The oracle is transparent and editable in simulator.py. Users can retrain with modified heuristics. Online learning from real extractions corrects simulator bias over time.
saar rate good/bad feeds back into the reward function. If a subset of users systematically marks outputs "good" that are biased toward certain frameworks, the policy drifts.
Mitigation: Explicit feedback has the lowest weight (0.1 out of 1.0). The policy update per extraction is bounded by the UCB incremental mean / REINFORCE gradient clip.
Eight profiles is a coarse discretisation. A "Legacy / mixed" profile might be assigned to diverse codebases and generate suboptimal outputs for non-legacy mixed stacks.
Mitigation: The balanced Profile 2 ("Full-stack balanced") serves as a safe fallback. The reward function penalises profiles that don't fit (section coverage drops when high-weight sections are absent from the DNA).
State vectors are derived from local codebase analysis and never leave the machine. Policy files in ~/.saar/rl/ contain only learned numerical parameters, not code content.
SaarEnvironment.
```bash
# Clone and install
git clone https://github.com/OpenCodeIntel/saar
cd saar
python -m venv venv && source venv/bin/activate
pip install -e ".[rl]"

# Run full test suite (should pass 600+ tests)
pytest tests/ -q

# Train agents
python experiments/train_ucb.py
python experiments/train_reinforce.py

# Evaluate with statistical validation
python experiments/eval_comparison.py

# Run end-to-end
saar rl train --agent both
saar extract . --rl
saar rl status
```
All random seeds are fixed (seed=42 for training, seed=42 for test episodes). Results in experiments/results/ are deterministically reproducible.