Reinforcement Learning for Portfolio Optimization

The Promise of RL in Finance

Traditional portfolio optimization — Markowitz mean-variance, Black-Litterman, risk parity — relies on assumptions that rarely hold in practice: normally distributed returns, stationary covariance matrices, and a known return distribution. Reinforcement learning offers a fundamentally different approach: let an agent interact with a market environment and learn an allocation policy directly from experience, without explicitly modeling the return distribution.

After a year of experimenting with deep RL for portfolio management, I’m cautiously optimistic about the approach, but with significant caveats. The sim-to-real gap in financial RL is enormous, and most academic results don’t survive contact with production constraints.

Formulating the MDP

The first step is framing portfolio optimization as a Markov Decision Process:

State Space: At each timestep t, the agent observes a feature vector containing:

Current portfolio weights (asset allocations)
Rolling window of log-returns for each asset (e.g., last 60 bars)
Technical indicators: RSI, volatility, momentum scores
Cross-correlation matrix of recent asset returns
Available cash and current portfolio value
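
As a rough illustration of how such a feature vector might be assembled, here is a minimal sketch in plain Python. The function names, the dict-of-price-lists input, and the choice of rolling volatility as the only indicator are assumptions for the example; RSI, momentum, and the correlation matrix are left out to keep it short. It also assumes the weights are given in the same (sorted) asset order as the price history.

```python
import math
from statistics import pstdev

def log_returns(prices):
    """Log-returns between consecutive prices."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]

def build_observation(weights, price_history, cash, portfolio_value, window=60):
    """Flatten weights, recent log-returns, volatility, and cash into one vector.

    price_history: dict mapping asset name -> list of prices (oldest first).
    Hypothetical layout: [weights | per-asset (returns, volatility) | cash fraction].
    """
    obs = list(weights)
    for asset in sorted(price_history):
        rets = log_returns(price_history[asset])[-window:]
        obs.extend(rets)
        obs.append(pstdev(rets))  # rolling volatility as one example indicator
    obs.append(cash / portfolio_value)  # cash as a fraction of portfolio value
    return obs
```

Expressing cash as a fraction of portfolio value, rather than in currency units, keeps the inputs on a similar scale as the account grows or shrinks.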

Action Space: The agent outputs a vector of target portfolio weights w = [w₁, w₂, ..., wₙ] where ∑wᵢ = 1 and wᵢ ≥ 0 (long-only) or wᵢ ∈ [-1, 1] (long-short). I use continuous action spaces with a softmax output layer for long-only and a tanh + normalization scheme for long-short.
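
The two output mappings can be sketched as follows. The softmax is standard. For the long-short case the text does not say which normalization is used, so this sketch assumes one common choice: divide the tanh outputs by the sum of their absolute values, which fixes gross exposure at 1 (note that this is not the same constraint as ∑wᵢ = 1). Function names are illustrative.

```python
import math

def long_only_weights(logits):
    """Softmax: non-negative weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def long_short_weights(raw, eps=1e-8):
    """tanh squashes each output into (-1, 1); dividing by the sum of
    absolute values (an assumed normalization) sets gross exposure to 1."""
    squashed = [math.tanh(x) for x in raw]
    gross = sum(abs(s) for s in squashed)
    return [s / (gross + eps) for s in squashed]
```

For example, `long_short_weights([2.0, -1.0, 0.5])` yields a long position in the first asset, a short in the second, and absolute weights that sum to roughly 1.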

Reward Function: This is where most RL portfolio projects succeed or fail. The naive approach — using raw portfolio return as the reward — leads to agents that maximize return with no regard for risk. They load up on the most volatile asset and ride momentum until they inevitably blow up.

Better reward formulations include:

Differential Sharpe Ratio — An online approximation of the Sharpe ratio that can be computed incrementally at each timestep. This naturally balances return and risk.
Risk-adjusted return — r_t - λ * σ²_t where λ controls risk aversion and σ²_t is the rolling portfolio variance.
Sortino-inspired — Penalize only downside deviation, which aligns better with how most investors actually think about risk.
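
The second and third formulations are simple enough to sketch directly. This is an illustration under assumptions, not a reference implementation: `returns` is taken to be the recent window of per-step portfolio returns, the default `lam` is arbitrary, and the Sortino-style version reuses the same penalty form with a downside semi-variance in place of the variance.

```python
from statistics import pvariance

def risk_adjusted_reward(returns, lam=0.5):
    """r_t - lam * sigma_t^2, with sigma_t^2 the variance of the recent window."""
    return returns[-1] - lam * pvariance(returns)

def sortino_style_reward(returns, lam=0.5, target=0.0):
    """Same shape, but only returns below `target` contribute to the penalty."""
    downside = [min(r - target, 0.0) ** 2 for r in returns]
    return returns[-1] - lam * sum(downside) / len(downside)
```

With a window that contains only gains, the Sortino-style penalty is zero and the reward is just the latest return, which is the behavior that distinguishes it from the variance penalty.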

I’ve found the differential Sharpe ratio to be the most stable training signal, though it requires careful tuning of the lookback window (in the usual formulation, the decay rate of its exponential moving averages).
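
For reference, a minimal sketch of the differential Sharpe ratio in its usual incremental form (following Moody and Saffell), where exponential moving estimates of the first and second moments of returns are updated at each step. The class name, the default `eta`, and the zero reward returned while the variance estimate is still degenerate are choices made for this example.

```python
class DifferentialSharpe:
    """Incremental Sharpe-style reward from moving estimates of E[R] and E[R^2]."""

    def __init__(self, eta=0.01, eps=1e-12):
        self.eta = eta   # adaptation rate; roughly 1 / lookback
        self.eps = eps   # guards against a zero variance estimate
        self.a = 0.0     # moving estimate of E[R]
        self.b = 0.0     # moving estimate of E[R^2]

    def update(self, r):
        """Return the reward for return r, then update the moving estimates."""
        delta_a = r - self.a
        delta_b = r * r - self.b
        var = self.b - self.a ** 2
        if var > self.eps:
            reward = (self.b * delta_a - 0.5 * self.a * delta_b) / var ** 1.5
        else:
            reward = 0.0  # not enough history yet
        self.a += self.eta * delta_a
        self.b += self.eta * delta_b
        return reward
```

A smaller `eta` corresponds to a longer effective lookback: the reward becomes smoother but reacts more slowly to changes in the return stream.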

PPO vs SAC: Practical Comparison

I’ve extensively tested both Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) for portfolio allocation:

PPO is more stable during training and less sensitive to hyperparameters. It converges to reasonable policies quickly but tends to find conservative solutions — the agent learns to allocate heavily to low-volatility assets and holds relatively static portfolios. For risk-averse mandates, this is actually desirable.

SAC produces more dynamic allocation policies thanks to its entropy regularization, which encourages exploration. The resulting portfolios rebalance more frequently and capture short-term opportunities. However, SAC is significantly harder to tune — the temperature parameter α has an outsized impact on behavior, and the wrong value leads to either random noise or degenerate static policies.

In my experiments on a 10-asset crypto portfolio (BTC, ETH, SOL, and 7 mid-cap alts), SAC achieved a higher out-of-sample Sharpe ratio (1.4 vs 1.1 for PPO), but with substantially higher turnover and therefore higher transaction costs. After accounting for realistic costs of 15 bps per rebalance, the gap between the two narrowed to near zero.
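
To see why turnover eats into the edge, here is a small sketch of cost netting. It assumes the 15 bps is charged proportionally on traded notional (turnover measured as the sum of absolute weight changes); the exact cost model behind the numbers above may differ, and the function names are illustrative.

```python
def turnover(old_weights, new_weights):
    """Sum of absolute weight changes; 2.0 means the whole book was swapped."""
    return sum(abs(n - o) for o, n in zip(old_weights, new_weights))

def net_return(gross_return, old_weights, new_weights, cost_bps=15.0):
    """Gross period return minus a proportional cost on traded notional."""
    return gross_return - (cost_bps / 10_000) * turnover(old_weights, new_weights)
```

Under this model, shifting 20% of the book from one asset to another (turnover of 0.4) costs 6 bps for that step. A policy that does this every bar needs a correspondingly larger gross edge than one that mostly holds.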

The Sim-to-Real Gap

This is the elephant in the room for financial RL. The environment the agent trains in is a simulation — historical price data replayed as if the agent’s actions had no market impact. In reality:

Market impact: A large rebalance moves prices against you. The agent’s learned policy assumes infinite liquidity, which doesn’t exist.
Regime shifts: The training environment is a fixed historical sample, so the agent only ever sees the regimes that sample contains, while real markets undergo fundamental regime changes. An agent trained on 2020-2022 data has never experienced a low-vol grinding market.
Execution latency: The agent assumes instant execution at observed prices. In practice, the latency between observation, decision, and execution can span seconds to minutes.
Non-stationarity: The reward and transition distributions shift over time, violating the stationarity that standard MDP formulations assume. Cross-asset correlations spike during crises, volatility clusters, and momentum factors cycle between profitable and unprofitable regimes.
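
The first and third gaps can be partly modeled inside the backtest itself. The sketch below is a deliberately crude illustration: it fills an order a fixed number of bars after the decision and applies a linear price impact proportional to the order's share of bar volume. The linear form, the `impact_coeff` default, and all names are assumptions for the example, not calibrated values.

```python
def fill_price(prices, t, trade_notional, bar_volume_notional,
               delay_bars=1, impact_coeff=0.1):
    """Price the agent actually gets for a decision made on bar t.

    delay_bars models latency: the order fills at a later bar's price.
    impact_coeff scales a linear impact term; positive trade_notional is a buy,
    so buys fill higher and sells fill lower than the observed price.
    """
    fill_bar = min(t + delay_bars, len(prices) - 1)
    participation = trade_notional / bar_volume_notional
    return prices[fill_bar] * (1.0 + impact_coeff * participation)
```

Even a rough model like this changes what the agent learns, because large, frequent rebalances stop being free.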

Practical Mitigations

To narrow the sim-to-real gap, I use several techniques:

Domain randomization — Add noise to historical prices, randomly shift timestamps, and inject synthetic volatility spikes during training. This produces more robust policies.
Transaction cost annealing — Start training with zero costs and gradually increase to realistic levels. This helps the agent first learn what to trade, then learn to trade efficiently.
Ensemble policies — Train multiple agents on different time periods and use a meta-policy to blend their outputs.
Conservative deployment — In live trading, I cap maximum position sizes at 50% of what the agent recommends and implement hard drawdown stops independent of the agent’s policy.
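
The mitigations above can each be sketched in a few lines. These are simplified stand-ins under stated assumptions: the cost schedule is a linear ramp, the price noise is multiplicative Gaussian, the meta-policy is replaced by fixed blend weights, and the 20% drawdown threshold is a placeholder. All names and defaults are hypothetical.

```python
import random

def annealed_cost_bps(step, total_steps, final_bps=15.0, warmup_frac=0.5):
    """Linear ramp from 0 to final_bps over the first warmup_frac of training."""
    ramp_steps = max(1, int(total_steps * warmup_frac))
    return final_bps * min(1.0, step / ramp_steps)

def jitter_prices(prices, noise_std=0.001, rng=random):
    """Multiplicative Gaussian noise on each price (one form of domain randomization)."""
    return [p * (1.0 + rng.gauss(0.0, noise_std)) for p in prices]

def blend_policies(weight_vectors, blend):
    """Convex combination of several agents' target weights (fixed blend,
    standing in for a learned meta-policy)."""
    n = len(weight_vectors[0])
    return [sum(b * w[i] for b, w in zip(blend, weight_vectors)) for i in range(n)]

def cap_positions(weights, scale=0.5):
    """Scale every recommended position down; the remainder stays in cash."""
    return [w * scale for w in weights]

def drawdown_stop(equity_curve, max_drawdown=0.2):
    """True once equity has fallen max_drawdown below its running peak."""
    peak = max(equity_curve)
    return (peak - equity_curve[-1]) / peak >= max_drawdown
```

The drawdown check is deliberately kept outside the policy: it reads only the equity curve, so it still works when the agent's own judgment is what has gone wrong.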

Financial RL is a genuinely exciting research direction, but the gap between academic benchmarks and production profitability remains wide. The agents that work best in practice are modest in their ambitions — targeting Sharpe ratios of 1.0-1.5 with strict risk controls, rather than trying to be the next Medallion Fund.