Published · UYES Journal · 2026

Uncertainty-Gated Temporal Credit

A plug-in advantage estimator for actor-critic reinforcement learning

UGTC dynamically blends short-horizon (low-variance) and long-horizon (low-bias) advantage estimates using a sigmoid gate driven by critic ensemble disagreement, addressing the bias–variance trade-off in temporal credit assignment.


Key Features

Backbone-Agnostic

Drop UGTC into any actor-critic algorithm by replacing the advantage computation. Tested with PPO, TD3, and SAC.

Adaptive Credit Assignment

Automatically blends short-horizon and long-horizon GAE estimates, weighting them by per-state uncertainty.

Fixed Hyperparameters

λ_fast = 0.80, λ_slow = 0.99, M = 3, β = 5.0. The same values are used across all benchmarks, with no per-task tuning required.

Ensemble Uncertainty

Slow critic ensemble disagreement provides calibrated uncertainty estimates without Bayesian inference.

Lightweight Overhead

Three small MLP value heads. Minimal parameter and compute overhead relative to the actor network.

Multi-Language

Reference implementations in Python, C++ (header-only), and Java for portability.

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                              UGTC MODULE                                    │
│                                                                             │
│   Input: s (observation)                                                    │
│                                                                             │
│   ┌──────────────────┐      ┌────────────────────────────────────────────┐  │
│   │   Fast Critic    │      │          Slow Ensemble (M=3)               │  │
│   │   V_fast(s)      │      │   V¹(s)    V²(s)    V³(s)                  │  │
│   │   λ_fast = 0.80  │      │   (independent parameters, λ = 0.99)       │  │
│   └────────┬─────────┘      └──────────────────┬─────────────────────────┘  │
│            │                                   │                            │
│            │                     ┌─────────────┴───────────────┐            │
│            │                     │  σ(s) = std(V¹,V²,V³)(s)    │            │
│            │                     │  Ensemble Disagreement      │            │
│            │                     └─────────────┬───────────────┘            │
│            │                                   │                            │
│            │                     ┌─────────────▼───────────────┐            │
│            │                     │  EMA Normalization          │            │
│            │                     │  σ_EMA ← α·σ_EMA + (1-α)·σ  │            │
│            │                     │  σ̂(s) = σ(s) / (σ_EMA + ε)  │            │
│            │                     └─────────────┬───────────────┘            │
│            │                                   │                            │
│            │                     ┌─────────────▼───────────────┐            │
│            │                     │   Sigmoid Gate              │            │
│            │                     │   u(s) = σ(-β·(σ̂(s) - 1))   │            │
│            │                     └─────────────┬───────────────┘            │
│            │                                   │                            │
│   ┌────────▼───────────────────────────────────▼─────────────────────────┐  │
│   │   A^UGTC = u(s) · A^slow  +  (1 - u(s)) · A^fast                     │  │
│   │   Blended Advantage Estimate                                         │  │
│   └──────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘

Gate Behavior

Low uncertainty
u → 1 → use A^slow (accurate)
Medium uncertainty
u = 0.5 → equal blend
High uncertainty
u → 0 → use A^fast (stable)
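
To make the three regimes concrete, the gate can be evaluated at a few values of the normalized uncertainty. The snippet below is a minimal sketch using only the gate formula and β = 5.0 from this page; the function name `gate` and the sample values 0.5, 1.0, and 2.0 are illustrative choices, not part of the UGTC package.

```python
import math


def gate(sigma_hat: float, beta: float = 5.0) -> float:
    """Gate value u = sigmoid(-beta * (sigma_hat - 1))."""
    # sigmoid(x) = 1 / (1 + exp(-x)) with x = -beta * (sigma_hat - 1)
    return 1.0 / (1.0 + math.exp(beta * (sigma_hat - 1.0)))


# Normalized uncertainty below, at, and above the running average.
for sigma_hat in (0.5, 1.0, 2.0):
    print(f"sigma_hat={sigma_hat:.1f} -> u={gate(sigma_hat):.3f}")
# sigma_hat=0.5 -> u=0.924
# sigma_hat=1.0 -> u=0.500
# sigma_hat=2.0 -> u=0.007
```

The equal blend occurs exactly when a state's ensemble disagreement matches its running average, since the normalized uncertainty is then 1.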

Mathematical Foundation

Generalized Advantage Estimation

\[ \delta_t = r_t + \gamma V(s_{t+1})(1 - d_t) - V(s_t) \] \[ A_t^{\text{GAE}} = \sum_{k=0}^{\infty} (\gamma\lambda)^k \delta_{t+k} \]
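
In practice the infinite sum is evaluated with a backward recursion over a finite rollout. The following is a minimal pure-Python sketch of that recursion, not the package's implementation: the function name `gae` and its list-based signature are assumptions, and so is the choice to reset the running sum at episode boundaries.

```python
def gae(rewards, values, next_values, dones, gamma, lam):
    """Generalized Advantage Estimation over one rollout of length T.

    All inputs are equal-length sequences; dones[t] marks a terminal step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - float(dones[t])
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) * (1 - d_t) - V(s_t)
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        # Assumption: the sum is also cut at terminal steps so that credit
        # does not leak across episode boundaries.
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
    return advantages


# Three unit rewards, zero value estimates, episode ends at the last step.
print(gae([1, 1, 1], [0, 0, 0], [0, 0, 0], [False, False, True],
          gamma=0.5, lam=1.0))
# [1.75, 1.5, 1.0]
```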

UGTC Dual-Stream Computation

\[ A_t^{\text{fast}} = \text{GAE}\!\left(\tau,\, V_{\text{fast}},\, \lambda_{\text{fast}} = 0.80\right) \] \[ A_t^{\text{slow}} = \text{GAE}\!\left(\tau,\, \bar{V}_{\text{slow}},\, \lambda_{\text{slow}} = 0.99\right) \]

where \(\bar{V}_{\text{slow}} = \frac{1}{M}\sum_{m=1}^{M} V^m_{\text{slow}}\) (ensemble mean, M = 3)

Uncertainty Gate

\[ \sigma(s) = \text{std}\!\left(V^1_{\text{slow}}(s),\, \ldots,\, V^M_{\text{slow}}(s)\right) \] \[ \hat{\sigma}(s) = \frac{\sigma(s)}{\sigma_{\text{EMA}} + \varepsilon}, \qquad \sigma_{\text{EMA}} \leftarrow \alpha \cdot \sigma_{\text{EMA}} + (1-\alpha)\cdot\mathbb{E}[\sigma(s)] \] \[ u(s) = \sigma\!\left(-\beta \cdot (\hat{\sigma}(s) - 1)\right) \]
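
Note that σ denotes both the ensemble standard deviation and, in the last equation, the logistic sigmoid. The sketch below walks through the three steps with the standard library only. It rests on several assumptions that are not stated on this page: the class name `UncertaintyGate`, the population standard deviation, ε = 1e-8, an initial σ_EMA of 1.0, using the batch mean as the estimate of E[σ(s)], and updating the EMA before normalizing.

```python
import math
import statistics


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


class UncertaintyGate:
    """Ensemble std -> EMA normalization -> sigmoid gate (illustrative)."""

    def __init__(self, beta=5.0, alpha=0.99, eps=1e-8, sigma_ema=1.0):
        self.beta = beta
        self.alpha = alpha
        self.eps = eps              # assumed value
        self.sigma_ema = sigma_ema  # assumed initial value

    def __call__(self, ensemble_values):
        # ensemble_values[m][i] holds V^m_slow(s_i) for member m, state i.
        sigmas = [statistics.pstdev(per_state)
                  for per_state in zip(*ensemble_values)]
        # The batch mean stands in for E[sigma(s)] in the EMA update.
        batch_mean = sum(sigmas) / len(sigmas)
        self.sigma_ema = (self.alpha * self.sigma_ema
                          + (1.0 - self.alpha) * batch_mean)
        gates = []
        for sigma in sigmas:
            sigma_hat = sigma / (self.sigma_ema + self.eps)
            gates.append(sigmoid(-self.beta * (sigma_hat - 1.0)))
        return gates


# Three ensemble members, two states: full agreement on the first state,
# strong disagreement on the second.
gate = UncertaintyGate()
u = gate([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
print(u)  # first entry close to 1, second clearly lower
```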

Blended Advantage

\[ \boxed{A_t^{\text{UGTC}} = u(s_t) \cdot A_t^{\text{slow}} + (1 - u(s_t)) \cdot A_t^{\text{fast}}} \]
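
The blend is a per-timestep convex combination of the two streams. A small worked sketch, assuming the slow advantages, fast advantages, and gate values have already been computed as plain lists (the function name `blend_advantages` is hypothetical):

```python
def blend_advantages(adv_slow, adv_fast, gates):
    """A^UGTC_t = u_t * A^slow_t + (1 - u_t) * A^fast_t for each timestep."""
    return [u * a_slow + (1.0 - u) * a_fast
            for a_slow, a_fast, u in zip(adv_slow, adv_fast, gates)]


# Gate values 1, 0.5 and 0 recover the slow estimate, the midpoint,
# and the fast estimate respectively.
print(blend_advantages([2.0, 2.0, 2.0], [1.0, 1.0, 1.0], [1.0, 0.5, 0.0]))
# [2.0, 1.5, 1.0]
```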

Fixed Hyperparameters

| Parameter | Symbol | Value | Description |
|---|---|---|---|
| Fast λ | \(\lambda_{\text{fast}}\) | 0.80 | GAE lambda for fast critic (low variance) |
| Slow λ | \(\lambda_{\text{slow}}\) | 0.99 | GAE lambda for slow ensemble (low bias) |
| Ensemble size | M | 3 | Number of slow critic heads |
| Gate temperature | β | 5.0 | Sigmoid sharpness |
| EMA momentum | α | 0.99 | Running uncertainty normalization |
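
One way to keep these values fixed in code is an immutable configuration object. This is a sketch only: `UGTCConfig` and its field names are hypothetical and are not taken from the UGTC package, while the default values are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UGTCConfig:
    """Fixed UGTC hyperparameters (hypothetical container)."""
    lambda_fast: float = 0.80   # GAE lambda for the fast critic
    lambda_slow: float = 0.99   # GAE lambda for the slow ensemble
    ensemble_size: int = 3      # M, number of slow critic heads
    beta: float = 5.0           # gate temperature
    ema_momentum: float = 0.99  # alpha, running uncertainty normalization


cfg = UGTCConfig()
print(cfg)
```

Marking the dataclass as frozen makes accidental per-task changes raise an error, which matches the intent of using one setting everywhere.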

RL Algorithm Integrations

UGTC-PPO

On-policy

A^UGTC replaces standard GAE in the clipped surrogate objective. All UGTC critics are trained via the same value-regression pipeline.

UGTC-TD3

Off-policy

UGTC provides baseline correction for the actor: L = -(Q_min + η·A^UGTC). Twin Q-networks and delayed policy updates are preserved.
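
The arithmetic of this actor loss can be illustrated on a small batch. The sketch below is plain Python with no gradients, so it shows only how the terms combine. The function name `ugtc_td3_actor_loss` is hypothetical, averaging over the batch is an assumption, and the value of η used here is a placeholder because this page does not specify one.

```python
def ugtc_td3_actor_loss(q_min, adv_ugtc, eta):
    """Batch mean of -(Q_min + eta * A^UGTC); arithmetic illustration only."""
    terms = [-(q + eta * a) for q, a in zip(q_min, adv_ugtc)]
    return sum(terms) / len(terms)


# Placeholder eta = 0.5; each sample contributes -2.0 to the mean.
print(ugtc_td3_actor_loss([1.0, 3.0], [2.0, -2.0], eta=0.5))
# -2.0
```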

UGTC-SAC

Off-policy

V^UGTC replaces the implicit value baseline in the entropy-regularized actor loss. Automatic tuning of the entropy temperature (SAC's α, distinct from the EMA momentum α above) is unchanged.

UGTC-DDPG

Extension

Proposed extension following the TD3 integration logic. It is not benchmarked in the paper and is labeled as an implementation assumption.

Quick Start

Installation

bash
git clone https://github.com/ethosoftai/ugtc.git
cd ugtc
pip install -e .

Minimal Usage

python
from ugtc import UGTCModule

# Create UGTC module (obs_dim=11 for Hopper-v4)
ugtc = UGTCModule(obs_dim=11)

# Replace standard GAE in your PPO update:
advantages = ugtc.compute_advantages(
    obs=obs,            # (T, obs_dim)
    next_obs=next_obs,  # (T, obs_dim)
    rewards=rewards,    # (T,)
    dones=dones,        # (T,)
    gamma=0.99,
)

# Same as before: normalize and use in clipped surrogate
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

Run an Example

bash
# UGTC-PPO on CartPole-v1 (no MuJoCo needed)
python examples/ugtc_ppo_cartpole.py

# UGTC-PPO on Hopper-v4 (requires MuJoCo)
python examples/ugtc_ppo_mujoco.py --env Hopper-v4

# UGTC-TD3 on Pendulum-v1
python examples/ugtc_td3_pendulum.py

Citation

@misc{dalar2026ugtc,
  author    = {Dalar, Yağız Ekrem},
  title     = {{UGTC}: Uncertainty-Gated Temporal Credit},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19715116},
  url       = {https://doi.org/10.5281/zenodo.19715116},
  note      = {Accepted at Ulysseus Young Explorers in Science (UYES) Journal.
               Journal DOI forthcoming.}
}