AWS Cloud Operations · RL Environment & Training Pipeline

Cloud agents fail in production not because they don't know the commands, but because state drifts, services hiccup, and reward signals get gamed. We built an environment that simulates all three: 120+ AWS tasks under chaos and drift, an 8-layer anti-reward-hacking stack, and an adversarial curriculum that targets the agent's own weak spots. After SFT → GRPO training on a single GPU with 8 parallel rollouts, format compliance hit 100%, exact match jumped from 39% to 89%, and intermediate-tier success climbed from 81% to 87%.

About

Learn AWS by doing.

An OpenEnv-compatible RL environment where agents execute real AWS CLI commands against a vendored MiniStack simulator that responds with production-equivalent JSON. 120+ tasks across 5 tiers (warmup → expert) with adaptive selection, mastery tracking, spaced repetition, chaos injection, and drift-detection scenarios, every feature designed to keep the reward signal honest and prevent the agent from gaming it. Trained end-to-end with a 1,500-row synthetic SFT dataset and TRL GRPO with 8-way parallel rollouts on a single GPU.

120+ Tasks
5 Difficulty Tiers
34 AWS Services
8 Parallel Rollouts
+50 pp Exact-match Δ (SFT)
100% Format Compliance (post-SFT)
S3 EC2 DynamoDB Lambda SQS SNS IAM RDS API Gateway CloudFormation CloudWatch Kinesis SES Step Functions Secrets Manager ELBv2 Route53 Glue Athena EFS + 14 more
Tasks
Warmup
25 tasks

List resources — single read-only commands

  • Run one AWS CLI command to list or describe a resource type
  • S3 buckets, EC2 instances, DynamoDB tables, Lambda functions, RDS, EBS volumes
  • Graded by command_match, which checks the operation + service pair (see the sketch after this list)
  • No setup required, no state mutations
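A minimal sketch of how a command_match-style check could work; the function name, argument names, and parsing below are illustrative assumptions, not the environment's actual API.

def command_match(command: str, expected_service: str, expected_operation: str) -> float:
    """Reward 1.0 when the CLI call hits the expected service + operation pair."""
    tokens = command.strip().split()
    # A well-formed call looks like: aws <service> <operation> [args...]
    if len(tokens) < 3 or tokens[0] != "aws":
        return 0.0
    return 1.0 if (tokens[1], tokens[2]) == (expected_service, expected_operation) else 0.0

# Example: a warmup task asking the agent to list S3 buckets.
assert command_match("aws s3api list-buckets", "s3api", "list-buckets") == 1.0
assert command_match("aws ec2 describe-instances", "s3api", "list-buckets") == 0.0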
Beginner
25 tasks

Create single resources with verification

  • Create an S3 bucket, DynamoDB table, SQS queue, or Lambda function
  • Graded by resource_creation, which verifies the exact resource exists in the AWS Infrastructure Simulator (sketched after this list)
  • Introduces resource name validation — "my-bucket-2" won't satisfy a check for "my-bucket"
  • First tier where idempotency bonus (+0.02) can be earned
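A sketch of what resource_creation grading could look like, assuming the simulator exposes its state as a queryable dict; all names are illustrative, and only the +0.02 bonus value comes from the tier description.

def resource_creation(sim_state: dict, resource_type: str, expected_name: str,
                      retry_was_clean: bool) -> float:
    # Exact-name validation: "my-bucket-2" must not satisfy "my-bucket".
    existing = sim_state.get(resource_type, set())   # e.g. {"s3_bucket": {"my-bucket"}}
    if expected_name not in existing:
        return 0.0
    reward = 1.0
    if retry_was_clean:
        reward += 0.02   # idempotency bonus: a safe re-run earns a small extra
    return reward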
Intermediate
25 tasks

Multi-step workflows — create, configure, connect

  • Ordered sequences: create a bucket then enable versioning, create a table then add an item
  • Graded by multi_step, which validates that each step was completed in order (sketched after this list)
  • Chaos injection begins at 10% probability — resources may be silently mutated mid-episode
  • Rollback penalty (-0.1) starts to matter with multi-step create/delete patterns
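A sketch of multi_step grading under the same assumptions: walk the episode's verified-step log, credit only steps completed in the required order, then apply the rollback penalty. Names are illustrative; the −0.1 penalty is from the tier description.

def multi_step(completed: list[str], required: list[str], rollbacks: int) -> float:
    idx = 0
    for step in completed:
        if idx < len(required) and step == required[idx]:
            idx += 1                       # credit only in-order completions
    reward = idx / len(required)           # fraction of ordered steps done
    reward -= 0.1 * rollbacks              # rollback penalty from the tier spec
    return max(0.0, reward)

# "Create a bucket, then enable versioning" done in order scores 1.0;
# the same two steps reversed score only 0.5 (the create still counts).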
Advanced
25 tasks

Cross-service architectures spanning multiple AWS services

  • Wire Lambda to SQS, configure API Gateway with integrations, build event-driven pipelines
  • Graded by multi_step + services — all required services must be configured
  • Chaos injection escalates to 20% probability — DynamoDB throughput, Lambda configs may change
  • Hints cost more: 3 hints leave only 61% of max reward (0.85³ decay; worked example below)
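The hint decay is plain multiplicative damping. A worked example of the 0.85-per-hint schedule:

# Each hint multiplies the maximum achievable reward by 0.85.
for hints in range(4):
    print(hints, round(0.85 ** hints, 3))
# 0 -> 1.0, 1 -> 0.85, 2 -> 0.722, 3 -> 0.614 (~61% of max, as above)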
Expert
24 tasks + 9 drift

SRE incidents & drift detection — diagnose and fix

  • Fix overly permissive S3 policies, replace broad IAM inline policies, repair broken infra
  • Graded by state_checks — actual CLI commands run against MiniStack at grading time
  • Chaos injection at 30% probability — maximum perturbation frequency
  • 9 drift detection tasks — correct infra is provisioned, then 2–3 random mutations applied from a pool
  • Agent must audit environment, discover which resources drifted, and fix only those
  • Drift is randomized per episode, preventing memorization of fix sequences (mutation sketch below)
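A sketch of how per-episode drift could be injected, assuming mutations are drawn from a fixed pool and applied to the simulator's state dict; the pool entries here are illustrative stand-ins for the real ones.

import random

MUTATION_POOL = [
    ("s3_bucket_policy", {"Principal": "*"}),            # bucket made public
    ("dynamodb_throughput", {"ReadCapacityUnits": 1}),   # table throttled
    ("lambda_timeout", {"Timeout": 1}),                  # function will time out
    ("iam_inline_policy", {"Action": "*"}),              # over-broad permissions
]

def inject_drift(sim_state: dict, rng: random.Random) -> list[tuple]:
    # 2-3 random mutations per episode, applied silently after correct
    # infra is provisioned; the returned list is visible only to the grader.
    mutations = rng.sample(MUTATION_POOL, k=rng.randint(2, 3))
    for key, bad_value in mutations:
        sim_state[key] = bad_value
    return mutations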
Features

Curriculum & Training

Adaptive learning system that tracks mastery and selects optimal tasks.

  • Progressive Difficulty: 5 tiers from warmup to expert SRE
  • Mastery Tracking: per-task graduation with sustained performance
  • Spaced Repetition: graduated tasks resurface at increasing intervals
  • Priority Selection: novelty, weakness, and recency scoring (sketched below)
  • Tier Progression: standard promotion and fast-track system
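A sketch of how the three priority signals could combine into one task score; the weights and normalization are assumptions for illustration.

def task_priority(attempts: int, successes: int, steps_since_seen: int) -> float:
    novelty = 1.0 if attempts == 0 else 0.0                     # never attempted
    weakness = 1.0 - successes / attempts if attempts else 0.0  # low success rate
    recency = min(steps_since_seen / 100.0, 1.0)                # long unseen
    return 0.4 * novelty + 0.4 * weakness + 0.2 * recency

The curriculum would then sample the highest-priority non-graduated tasks, with graduated ones re-entering via the spaced-repetition schedule.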

Reward Shaping

Dense reward signals that encourage operational discipline and real progress.

  • Rollback Penalty & Idempotency Bonus: operational discipline rewards
  • 📈 Shaped Reward System: progress bonus, failure penalty, clamped rewards (sketched below)
  • Multi-Strategy Grading: 5 grading strategies across tiers
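A sketch of how these components could assemble into one shaped, clamped reward. Only the rollback penalty (−0.1), idempotency bonus (+0.02), and hint decay (0.85 per hint) come from the tier descriptions; the other coefficients and the clamp range are assumptions.

def shaped_reward(progress_delta: float, failed: bool, rollbacks: int,
                  idempotent_retry: bool, hints_used: int) -> float:
    r = progress_delta                      # progress bonus: only new progress pays
    r -= 0.1 * rollbacks                    # rollback penalty
    r += 0.02 if idempotent_retry else 0.0  # idempotency bonus
    r -= 0.2 if failed else 0.0             # failure penalty (assumed magnitude)
    r *= 0.85 ** hints_used                 # progressive hint decay
    return max(-1.0, min(1.0, r))           # clamped reward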

Resilience & Adaptability

Features that test agent robustness under unpredictable conditions.

  • 💡 Progressive Hint System: 3-level hints with reward decay
  • Chaos Injection Engine: silent mid-episode perturbations
  • 🔍 Drift Detection Tasks: randomized config drift per episode

Security Posture Audit

Tests reasoning about configuration state — working but insecure infrastructure the agent must analyze and harden.

  • 🔒 Public S3 Bucket Lockdown: detect and fix overly permissive bucket policies (audit sketch below)
  • 🛡 IAM Least Privilege: replace wildcard policies with scoped permissions
  • 🔐 Secrets in Lambda Environment: move plaintext credentials to Secrets Manager
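For the S3 lockdown task, the audit step amounts to parsing the bucket policy and flagging Allow statements open to everyone. A minimal sketch; the policy shape follows standard IAM JSON, but the check itself is illustrative, not the task's actual grader.

import json

def is_public(policy_json: str) -> bool:
    policy = json.loads(policy_json)
    for stmt in policy.get("Statement", []):
        # "Principal": "*" (or {"AWS": "*"}) on an Allow statement means
        # anyone on the internet can perform the listed actions.
        if stmt.get("Effect") == "Allow" and stmt.get("Principal") in ("*", {"AWS": "*"}):
            return True
    return False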

Anti-Reward-Hacking

8 defense layers that prevent the agent from gaming the reward system.

  • 🔎 Ground-Truth Verification: MiniStack queries for 20+ services
  • 🛡 Command Allowlisting: only aws CLI commands allowed
  • 🚫 Deduplication: no reward for repeated commands
  • 👁 Grader Invisibility: verification commands hidden from the agent
  • 🔍 No Verification Reward: read-only commands earn zero progress
  • Monotonic Progress: progress can only increase, never be re-earned (sketched with deduplication below)
  • 🎯 Resource Name Validation: exact name match required
  • State Checks: verify final state, not command history
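A combined sketch of the deduplication and monotonic-progress layers, assuming the environment keeps per-episode state like the class below; the per-milestone credit value is illustrative.

class ProgressTracker:
    def __init__(self) -> None:
        self.seen_commands: set[str] = set()
        self.paid_milestones: set[str] = set()

    def credit(self, command: str, milestones_now_met: set[str]) -> float:
        if command in self.seen_commands:
            return 0.0                      # dedup: repeated commands earn nothing
        self.seen_commands.add(command)
        new = milestones_now_met - self.paid_milestones
        self.paid_milestones |= new         # monotonic: each milestone pays once
        return 0.25 * len(new)              # illustrative per-milestone credit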
Results

SFT → GRPO Training Pipeline

Two-stage training on unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit, the base model picked from an 11-model benchmark on 27 held-out prompts. Stage 1: LoRA SFT on 1,500 synthetic trajectories spanning 5 trajectory shapes. Stage 2: TRL GRPO with multi-turn rollouts, group-relative advantages, a KL penalty to the SFT reference, and Optuna search over an 8-dimensional hyperparameter space.
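The group-relative advantage is GRPO's core trick: with G=8 rollouts of the same task, each rollout's reward is standardized against its own group, so no learned value function is needed. A minimal sketch (TRL's GRPOTrainer computes this internally when given a reward function):

import torch

rewards = torch.tensor([0.9, 0.4, 0.4, 1.0, 0.0, 0.7, 0.4, 0.9])   # G=8 rollouts
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
# Rollouts above the group mean push their tokens up, those below push down;
# the KL term to the SFT reference keeps the policy from drifting too far.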

1,500 SFT Train Rows
G=8 Rollouts / Step
200 Final GRPO Steps
11 Models Benchmarked
+66.7 pp Format Δ (SFT)
+50.0 pp Exact-match Δ (SFT)

Base-Model Selection

11 chat models × 27 held-out prompts. Qwen2.5-Coder-3B-Instruct wins on every metric that matters: 41% exact match, 63% operation match, 3.1 s/call (3× faster than the 4B runner-up).

Top 4 Candidate Models: exact match, operation match, and latency, head-to-head on 27 held-out prompts.

Base vs SFT — Eval Delta

After running the SFT pipeline end-to-end, format compliance is now perfect and exact-match jumped from 39% to 89%.

Metric            Base     Post-SFT   Δ
Format            33.3%    100.0%     +66.7 pp
Exact match       38.9%    88.9%      +50.0 pp
Service match     77.8%    88.9%      +11.1 pp
Operation match   61.1%    88.9%      +27.8 pp
Latency           2.03 s   1.40 s     −0.63 s
Base vs SFT Eval Metrics: per-metric comparison on the held-out prompt set.
Dataset Comparison: per-row scores, base vs SFT, on the SFT validation set.
Live RL Env Comparison: per-episode rewards on the live MiniStack-backed environment.

SFT Training Curves & Optuna

Best SFT trial (out of 6): lora_r=16, lora_alpha=16, dropout=0.0058, lr=4.03e-4, warmup=0.1.

SFT Loss Curve: train and validation loss across the SFT run.
Optuna Parameter Importances: which hyperparameters mattered most for the SFT objective.
Optuna History: best objective value over the 6-trial TPE search.

GRPO — Live Multi-step Env Eval

After 35 GRPO steps on top of the SFT adapter (best Optuna config: lr=1.6e-5, β=0.0021, T=0.99), the model was re-evaluated end-to-end on 100+ episodes.

Metric                    Base + SFT   + GRPO   Δ
Overall success           86.8%        86.2%    −0.5 pp
Beginner                  96.2%        100.0%   +3.8 pp
Intermediate              81.0%        87.0%    +6.0 pp
Expert                    22.2%        22.2%    flat
Drift repair              22.2%        22.2%    flat
Destructive-action fail   15.1%        14.7%    −0.4 pp

Honest reading: the 35-step GRPO run preserves the SFT gains and modestly improves the middle tiers, but does not crack the expert-tier bottleneck. Longer runs and more curriculum exposure to expert tasks are next.

SFT vs GRPO Metrics Grid: side-by-side eval across all multi-step metrics.
SFT vs GRPO by Tier: where GRPO actually moves the needle (and where it doesn't).
Qualitative Rollouts: one sample episode per tier, post-GRPO.

GRPO Training Curves

Per-step training signals from the final 35-step GRPO run, plus the 4-trial Optuna search that picked the final config.

GRPO Env Reward: mean reward across G=8 rollouts at each training step.
Per-Tier Reward Curve: how each curriculum tier responds to GRPO updates.
Final Per-Step Signals: reward, KL, loss, and policy ratio across the final run.
GRPO Optuna Trials: reward trajectories across 4 Optuna trials.
GRPO Param Importances: which knobs moved the GRPO objective the most.
GRPO Optuna History: best objective value over the 4-trial search.
API

WebSocket

import asyncio
import json

import websockets

async def main():
    async with websockets.connect("wss://sizzing-aws-rl-env.hf.space/ws") as ws:
        # Reset environment
        await ws.send(json.dumps({"type": "reset"}))
        obs = json.loads(await ws.recv())

        # Execute a command
        await ws.send(json.dumps({"type": "step", "data": {"command": "aws s3 ls"}}))
        obs = json.loads(await ws.recv())

if __name__ == "__main__":
    asyncio.run(main())

Python Client

import asyncio

from aws_rl_env import AwsRlEnv, AwsRlAction

async def main():
    async with AwsRlEnv.from_env("sizzing/aws-rl-env") as env:
        result = await env.step(AwsRlAction(command="aws s3 ls"))

asyncio.run(main())
Play

An interactive playground rounds out the environment: click New Episode and the curriculum assigns a task matching your skill level. The panel tracks state, tier, episode progress, cumulative reward, step and hint counts, chaos status, a per-command log with rewards, and the live simulated AWS infrastructure state.