AWS Cloud Operations · RL Environment & Training Pipeline

Cloud agents fail in production not because they don't know the commands, but because state drifts, services hiccup, and reward signals get gamed. We built an environment that simulates all three: 120+ AWS tasks under chaos and drift, an 8-layer anti-reward-hacking stack, and an adversarial curriculum that targets the agent's own weak spots. After SFT → GRPO training on a single GPU with 8 parallel rollouts, format compliance hit 100%, exact match jumped from 39% to 89%, and intermediate-tier success climbed from 81% to 87%.

About

Learn AWS by doing.

An OpenEnv-compatible RL environment where agents execute real AWS CLI commands against a vendored MiniStack simulator that responds with production-equivalent JSON. 120+ tasks across 5 tiers (warmup → expert) with adaptive selection, mastery tracking, spaced repetition, chaos injection, and drift-detection scenarios, every feature designed to keep the reward signal honest and prevent the agent from gaming it. Trained end-to-end with a 1,500-row synthetic SFT dataset and TRL GRPO with 8-way parallel rollouts on a single GPU.

120+ Tasks
5 Difficulty Tiers
34 AWS Services
8 Parallel Rollouts
+50 pp Exact-match Δ (SFT)
100% Format Compliance (post-SFT)
S3 EC2 DynamoDB Lambda SQS SNS IAM RDS API Gateway CloudFormation CloudWatch Kinesis SES Step Functions Secrets Manager ELBv2 Route53 Glue Athena EFS + 14 more
Tasks
Warmup
25 tasks

List resources — single read-only commands

  • Run one AWS CLI command to list or describe a resource type
  • S3 buckets, EC2 instances, DynamoDB tables, Lambda functions, RDS, EBS volumes
  • Graded by command_match, which checks the operation + service pair (see the sketch after this list)
  • No setup required, no state mutations
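A minimal sketch of how a command_match-style check could work; the function name, argument names, and parsing below are illustrative assumptions, not the environment's actual API.

def command_match(command: str, expected_service: str, expected_operation: str) -> float:
    """Reward 1.0 when the CLI call hits the expected service + operation pair."""
    tokens = command.strip().split()
    # A well-formed call looks like: aws <service> <operation> [args...]
    if len(tokens) < 3 or tokens[0] != "aws":
        return 0.0
    return 1.0 if (tokens[1], tokens[2]) == (expected_service, expected_operation) else 0.0

# Example: a warmup task asking the agent to list S3 buckets.
assert command_match("aws s3api list-buckets", "s3api", "list-buckets") == 1.0
assert command_match("aws ec2 describe-instances", "s3api", "list-buckets") == 0.0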
Beginner
25 tasks

Create single resources with verification

  • Create an S3 bucket, DynamoDB table, SQS queue, or Lambda function
  • Graded by resource_creation, which verifies the exact resource exists in the AWS Infrastructure Simulator (sketched after this list)
  • Introduces resource name validation — "my-bucket-2" won't satisfy a check for "my-bucket"
  • First tier where idempotency bonus (+0.02) can be earned
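A sketch of what resource_creation grading could look like, assuming the simulator exposes its state as a queryable dict; all names are illustrative, and only the +0.02 bonus value comes from the tier description.

def resource_creation(sim_state: dict, resource_type: str, expected_name: str,
                      retry_was_clean: bool) -> float:
    # Exact-name validation: "my-bucket-2" must not satisfy "my-bucket".
    existing = sim_state.get(resource_type, set())   # e.g. {"s3_bucket": {"my-bucket"}}
    if expected_name not in existing:
        return 0.0
    reward = 1.0
    if retry_was_clean:
        reward += 0.02   # idempotency bonus: a safe re-run earns a small extra
    return reward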
Intermediate
25 tasks

Multi-step workflows — create, configure, connect

  • Ordered sequences: create a bucket then enable versioning, create a table then add an item
  • Graded by multi_step, which validates that each step was completed in order (sketched after this list)
  • Chaos injection begins at 10% probability — resources may be silently mutated mid-episode
  • Rollback penalty (-0.1) starts to matter with multi-step create/delete patterns
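A sketch of multi_step grading under the same assumptions: walk the episode's verified-step log, credit only steps completed in the required order, then apply the rollback penalty. Names are illustrative; the −0.1 penalty is from the tier description.

def multi_step(completed: list[str], required: list[str], rollbacks: int) -> float:
    idx = 0
    for step in completed:
        if idx < len(required) and step == required[idx]:
            idx += 1                       # credit only in-order completions
    reward = idx / len(required)           # fraction of ordered steps done
    reward -= 0.1 * rollbacks              # rollback penalty from the tier spec
    return max(0.0, reward)

# "Create a bucket, then enable versioning" done in order scores 1.0;
# the same two steps reversed score only 0.5 (the create still counts).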
Advanced
25 tasks

Cross-service architectures spanning multiple AWS services

  • Wire Lambda to SQS, configure API Gateway with integrations, build event-driven pipelines
  • Graded by multi_step + services — all required services must be configured
  • Chaos injection escalates to 20% probability — DynamoDB throughput, Lambda configs may change
  • Hints cost more: 3 hints leave only 61% of max reward (0.85³ decay; worked example below)
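The hint decay is plain multiplicative damping. A worked example of the 0.85-per-hint schedule:

# Each hint multiplies the maximum achievable reward by 0.85.
for hints in range(4):
    print(hints, round(0.85 ** hints, 3))
# 0 -> 1.0, 1 -> 0.85, 2 -> 0.722, 3 -> 0.614 (~61% of max, as above)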
Expert
24 tasks + 9 drift

SRE incidents & drift detection — diagnose and fix

  • Fix overly permissive S3 policies, replace broad IAM inline policies, repair broken infra
  • Graded by state_checks — actual CLI commands run against MiniStack at grading time
  • Chaos injection at 30% probability — maximum perturbation frequency
  • 9 drift detection tasks — correct infra is provisioned, then 2–3 random mutations applied from a pool
  • Agent must audit environment, discover which resources drifted, and fix only those
  • Drift is randomized per episode, preventing memorization of fix sequences (mutation sketch below)
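A sketch of how per-episode drift could be injected, assuming mutations are drawn from a fixed pool and applied to the simulator's state dict; the pool entries here are illustrative stand-ins for the real ones.

import random

MUTATION_POOL = [
    ("s3_bucket_policy", {"Principal": "*"}),            # bucket made public
    ("dynamodb_throughput", {"ReadCapacityUnits": 1}),   # table throttled
    ("lambda_timeout", {"Timeout": 1}),                  # function will time out
    ("iam_inline_policy", {"Action": "*"}),              # over-broad permissions
]

def inject_drift(sim_state: dict, rng: random.Random) -> list[tuple]:
    # 2-3 random mutations per episode, applied silently after correct
    # infra is provisioned; the returned list is visible only to the grader.
    mutations = rng.sample(MUTATION_POOL, k=rng.randint(2, 3))
    for key, bad_value in mutations:
        sim_state[key] = bad_value
    return mutations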
Features

Curriculum & Training

Adaptive learning system that tracks mastery and selects optimal tasks.

  • Progressive Difficulty: 5 tiers from warmup to expert SRE
  • Mastery Tracking: per-task graduation with sustained performance
  • Spaced Repetition: graduated tasks resurface at increasing intervals
  • Priority Selection: novelty, weakness, and recency scoring (sketched below)
  • Tier Progression: standard promotion and fast-track system
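A sketch of how the three priority signals could combine into one task score; the weights and normalization are assumptions for illustration.

def task_priority(attempts: int, successes: int, steps_since_seen: int) -> float:
    novelty = 1.0 if attempts == 0 else 0.0                     # never attempted
    weakness = 1.0 - successes / attempts if attempts else 0.0  # low success rate
    recency = min(steps_since_seen / 100.0, 1.0)                # long unseen
    return 0.4 * novelty + 0.4 * weakness + 0.2 * recency

The curriculum would then sample the highest-priority non-graduated tasks, with graduated ones re-entering via the spaced-repetition schedule.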

Reward Shaping

Dense reward signals that encourage operational discipline and real progress.

  • Rollback Penalty & Idempotency Bonus: operational discipline rewards
  • 📈 Shaped Reward System: progress bonus, failure penalty, clamped rewards (sketched below)
  • Multi-Strategy Grading: 5 grading strategies across tiers
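A sketch of how these components could assemble into one shaped, clamped reward. Only the rollback penalty (−0.1), idempotency bonus (+0.02), and hint decay (0.85 per hint) come from the tier descriptions; the other coefficients and the clamp range are assumptions.

def shaped_reward(progress_delta: float, failed: bool, rollbacks: int,
                  idempotent_retry: bool, hints_used: int) -> float:
    r = progress_delta                      # progress bonus: only new progress pays
    r -= 0.1 * rollbacks                    # rollback penalty
    r += 0.02 if idempotent_retry else 0.0  # idempotency bonus
    r -= 0.2 if failed else 0.0             # failure penalty (assumed magnitude)
    r *= 0.85 ** hints_used                 # progressive hint decay
    return max(-1.0, min(1.0, r))           # clamped reward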

Resilience & Adaptability

Features that test agent robustness under unpredictable conditions.

  • 💡 Progressive Hint System: 3-level hints with reward decay
  • Chaos Injection Engine: silent mid-episode perturbations
  • 🔍 Drift Detection Tasks: randomized config drift per episode

Security Posture Audit

Tests reasoning about configuration state — working but insecure infrastructure the agent must analyze and harden.

  • 🔒 Public S3 Bucket Lockdown: detect and fix overly permissive bucket policies (audit sketch below)
  • 🛡 IAM Least Privilege: replace wildcard policies with scoped permissions
  • 🔐 Secrets in Lambda Environment: move plaintext credentials to Secrets Manager
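For the S3 lockdown task, the audit step amounts to parsing the bucket policy and flagging Allow statements open to everyone. A minimal sketch; the policy shape follows standard IAM JSON, but the check itself is illustrative, not the task's actual grader.

import json

def is_public(policy_json: str) -> bool:
    policy = json.loads(policy_json)
    for stmt in policy.get("Statement", []):
        # "Principal": "*" (or {"AWS": "*"}) on an Allow statement means
        # anyone on the internet can perform the listed actions.
        if stmt.get("Effect") == "Allow" and stmt.get("Principal") in ("*", {"AWS": "*"}):
            return True
    return False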

Anti-Reward-Hacking

8 defense layers that prevent the agent from gaming the reward system.

  • 🔎 Ground-Truth Verification: MiniStack queries for 20+ services
  • 🛡 Command Allowlisting: only aws CLI commands allowed
  • 🚫 Deduplication: no reward for repeated commands
  • 👁 Grader Invisibility: verification commands hidden from the agent
  • 🔍 No Verification Reward: read-only commands earn zero progress
  • Monotonic Progress: progress can only increase, never be re-earned (sketched with deduplication below)
  • 🎯 Resource Name Validation: exact name match required
  • State Checks: verify final state, not command history
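A combined sketch of the deduplication and monotonic-progress layers, assuming the environment keeps per-episode state like the class below; the per-milestone credit value is illustrative.

class ProgressTracker:
    def __init__(self) -> None:
        self.seen_commands: set[str] = set()
        self.paid_milestones: set[str] = set()

    def credit(self, command: str, milestones_now_met: set[str]) -> float:
        if command in self.seen_commands:
            return 0.0                      # dedup: repeated commands earn nothing
        self.seen_commands.add(command)
        new = milestones_now_met - self.paid_milestones
        self.paid_milestones |= new         # monotonic: each milestone pays once
        return 0.25 * len(new)              # illustrative per-milestone credit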
Results

SFT → GRPO Training Pipeline

Two-stage training on unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit, the base model picked from an 11-model benchmark on 27 held-out prompts. Stage 1: LoRA SFT on 1,500 synthetic trajectories spanning 5 trajectory shapes. Stage 2: TRL GRPO with multi-turn rollouts, group-relative advantages, a KL penalty to the SFT reference, and Optuna search over an 8-dimensional hyperparameter space.
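The group-relative advantage is GRPO's core trick: with G=8 rollouts of the same task, each rollout's reward is standardized against its own group, so no learned value function is needed. A minimal sketch (TRL's GRPOTrainer computes this internally when given a reward function):

import torch

rewards = torch.tensor([0.9, 0.4, 0.4, 1.0, 0.0, 0.7, 0.4, 0.9])   # G=8 rollouts
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
# Rollouts above the group mean push their tokens up, those below push down;
# the KL term to the SFT reference keeps the policy from drifting too far.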

1,500 SFT Train Rows
G=8 Rollouts / Step
200 Final GRPO Steps
11 Models Benchmarked
+66.7 pp Format Δ (SFT)
+50.0 pp Exact-match Δ (SFT)

Base-Model Selection

11 chat models × 27 held-out prompts. Qwen2.5-Coder-3B-Instruct wins on every metric that matters: 41% exact match, 63% operation match, 3.1 s/call (3× faster than the 4B runner-up).

Top 4 Candidate Models: exact match, operation match, and latency, head-to-head on 27 held-out prompts.

Base vs SFT — Eval Delta

After running the SFT pipeline end-to-end, format compliance is now perfect and exact-match jumped from 39% to 89%.

Metric            Base     Post-SFT   Δ
Format            33.3%    100.0%     +66.7 pp
Exact match       38.9%    88.9%      +50.0 pp
Service match     77.8%    88.9%      +11.1 pp
Operation match   61.1%    88.9%      +27.8 pp
Latency           2.03 s   1.40 s     −0.63 s
Base vs SFT Eval Metrics: per-metric comparison on the held-out prompt set.
Dataset Comparison: per-row scores, base vs SFT, on the SFT validation set.
Live RL Env Comparison: per-episode rewards on the live MiniStack-backed environment.

SFT Training Curves & Optuna

Best SFT trial (out of 6): lora_r=16, lora_alpha=16, dropout=0.0058, lr=4.03e-4, warmup=0.1.

SFT Loss Curve: train and validation loss across the SFT run.
Optuna Parameter Importances: which hyperparameters mattered most for the SFT objective.
Optuna History: best objective value over the 6-trial TPE search.

GRPO — Live Multi-step Env Eval

After 35 GRPO steps on top of the SFT adapter (best Optuna config: lr=1.6e-5, β=0.0021, T=0.99), the model was re-evaluated end-to-end on 100+ episodes.

Metric                    Base + SFT   + GRPO   Δ
Overall success           86.8%        86.2%    −0.5 pp
Beginner                  96.2%        100.0%   +3.8 pp
Intermediate              81.0%        87.0%    +6.0 pp
Expert                    22.2%        22.2%    flat
Drift repair              22.2%        22.2%    flat
Destructive-action fail   15.1%        14.7%    −0.4 pp

Honest reading: the 35-step GRPO run preserves the SFT gains and modestly improves the middle tiers, but does not crack the expert-tier bottleneck. Longer runs and more curriculum exposure to expert tasks are next.

SFT vs GRPO Metrics Grid: side-by-side eval across all multi-step metrics.
SFT vs GRPO by Tier: where GRPO actually moves the needle (and where it doesn't).
Qualitative Rollouts: one sample episode per tier, post-GRPO.

GRPO Training Curves

Per-step training signals from the final 35-step GRPO run, plus the 4-trial Optuna search that picked the final config.

GRPO Env Reward: mean reward across G=8 rollouts at each training step.
Per-Tier Reward Curve: how each curriculum tier responds to GRPO updates.
Final Per-Step Signals: reward, KL, loss, and policy ratio across the final run.
GRPO Optuna Trials: reward trajectories across 4 Optuna trials.
GRPO Param Importances: which knobs moved the GRPO objective the most.
GRPO Optuna History: best objective value over the 4-trial search.
API

WebSocket

import asyncio
import json

import websockets

async def main():
    async with websockets.connect("wss://sizzing-aws-rl-env.hf.space/ws") as ws:
        # Reset environment
        await ws.send(json.dumps({"type": "reset"}))
        obs = json.loads(await ws.recv())

        # Execute a command
        await ws.send(json.dumps({"type": "step", "data": {"command": "aws s3 ls"}}))
        obs = json.loads(await ws.recv())

if __name__ == "__main__":
    asyncio.run(main())

Python Client

import asyncio

from aws_rl_env import AwsRlEnv, AwsRlAction

async def main():
    async with AwsRlEnv.from_env("sizzing/aws-rl-env") as env:
        result = await env.step(AwsRlAction(command="aws s3 ls"))

asyncio.run(main())
Play

An interactive playground rounds out the environment: click New Episode and the curriculum assigns a task matching your skill level. The panel tracks state, tier, episode progress, cumulative reward, step and hint counts, chaos status, a per-command log with rewards, and the live simulated AWS infrastructure state.