OpenEnv is an open-source framework from Meta and Hugging Face for creating standardized, isolated, and reusable environments for training and deploying AI agents, especially for Reinforcement Learning (RL) and agentic workflows. It offers a unified Gymnasium-style API, containerized execution (Docker), and a central hub on Hugging Face for sharing these environments. Unlike traditional frameworks that focus primarily on games and simulated environments, OpenEnv bridges the gap between research and production by providing a standardized interface for building, deploying, and evaluating AI agents across diverse domains.
As large language models increasingly act as tool-using agents (issuing API calls, manipulating external systems, and executing multi-step workflows), the quality of the environments they interact with becomes critical. These environments define what agents can observe, what actions they can take, and how reliably their behavior can be trained and evaluated.
The original OpenAI Gym (now Gymnasium) established the foundational pattern for RL environments:
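That pattern is a tight reset/step loop. For reference, here it is with the current Gymnasium API (the original Gym returned a single `done` flag rather than `terminated`/`truncated`):

```python
import gymnasium as gym

# Classic RL loop: reset once, then repeatedly act and observe
env = gym.make("CartPole-v1")
obs, info = env.reset()
for _ in range(100):
    action = env.action_space.sample()  # random policy as a placeholder
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```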
While revolutionary for its time, OpenAI Gym was primarily designed for:
OpenEnv introduces several paradigm shifts. This example contrasts traditional stateless HTTP-based environment interactions with OpenEnv’s persistent WebSocket sessions, where environment state is maintained across multiple agent actions.
1. WebSocket-Based Persistent Sessions
# OLD: Stateless HTTP (OpenAI Gym style)
response = requests.post("/step", json={"action": action})
observation = response.json()["observation"]
# NEW: Persistent WebSocket connections (OpenEnv)
with EnvClient(base_url="ws://localhost:8004") as client:
    result = client.reset()  # Initialize session
    for _ in range(100):
        result = client.step(action)  # Maintain state
    # Session state preserved across interactions

Benefits:
2. Production-First Design
OpenEnv environments are containerized microservices with:
3. Pydantic-Based Type Safety
This example illustrates how OpenEnv uses Pydantic models to enforce structured, validated agent actions, reducing runtime errors and improving tool reliability:
# OLD: Dataclasses with manual validation
from dataclasses import dataclass

@dataclass
class Action:
    command: str
    params: dict

# NEW: Pydantic models with automatic validation
from typing import Any, Dict
from pydantic import BaseModel, Field, validator

class Action(BaseModel):
    command: str = Field(..., description="Command to execute")
    params: Dict[str, Any] = Field(
        default_factory=dict,
        description="Command parameters"
    )

    @validator('command')
    def validate_command(cls, v):
        allowed = ['create', 'update', 'delete']
        if v not in allowed:
            raise ValueError(f"Command must be one of {allowed}")
        return v
Benefits:
4. Factory Pattern for Concurrency
This example demonstrates how OpenEnv creates a new, isolated environment instance for each client session, enabling safe concurrency and multi-tenant usage:
# OLD: Shared environment instance (race conditions!)
env = MyEnvironment()
app = create_fastapi_app(env, Action, Observation)
# NEW: Factory creates isolated instances per session
def create_environment():
    return MyEnvironment(config=load_config())

app = create_app(
    create_environment,  # Factory function
    Action,
    Observation,
    env_name="my_env"
)
# Each WebSocket connection gets its own environment!

Benefits:

OpenEnv supports environments across diverse application domains:
Use Case: Web navigation, form filling, UI testing, data extraction
Example Environment:
class BrowserAction(Action):
    action_type: Literal["click", "type", "navigate", "scroll"]
    selector: Optional[str] = Field(None, description="CSS selector")
    text: Optional[str] = Field(None, description="Text to type")
    url: Optional[str] = Field(None, description="URL to navigate to")

class BrowserObservation(Observation):
    html: str = Field(..., description="Current page HTML")
    screenshot: Optional[str] = Field(None, description="Base64 screenshot")
    url: str = Field(..., description="Current URL")
    success: bool = Field(..., description="Action succeeded")
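As a usage sketch (assuming the `EnvClient` from the earlier WebSocket example), an agent filling a login form might step the environment like this:

```python
with EnvClient(base_url="ws://localhost:8004") as client:
    client.reset()
    client.step(BrowserAction(action_type="navigate", url="https://example.com/login"))
    client.step(BrowserAction(action_type="type", selector="#username", text="alice"))
    result = client.step(BrowserAction(action_type="click", selector="#submit"))
    print(result.observation.url, result.observation.success)
```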
Real-World Applications:
Use Case: Meeting scheduling, ACL management, multi-calendar coordination, permission gating, multi-user state tracking
Example Environment:
class CalendarAction(Action):
    action_type: Literal["ListToolsAction", "ToolCallAction"]
    tool_name: Optional[str] = Field(None, description="MCP tool name")
    arguments: Dict[str, Any] = Field(default_factory=dict)

class CalendarObservation(Observation):
    success: bool
    tools_list: Optional[List[Dict[str, Any]]] = None
    tool_result: Optional[Any] = None
    error_message: Optional[str] = None

Real-World Applications:
Use Case: Software development agents, bug fixing, code review
Example Environment:
class CodeAction(Action):
    action_type: Literal["read_file", "write_file", "run_tests", "git_commit"]
    file_path: Optional[str] = None
    content: Optional[str] = None
    commit_message: Optional[str] = None

class CodeObservation(Observation):
    file_content: Optional[str] = None
    test_results: Optional[Dict[str, Any]] = None
    git_status: Optional[str] = None
    success: bool

Real-World Applications:
Use Case: Game-playing agents, multi-agent competition
Benefits over Traditional Gym:
Use Case: Algorithmic trading, portfolio optimization
Production Features:
Use Case: NLP agents, interactive fiction, conversational AI
Example:
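A text-game environment following the same Action/Observation pattern as the domains above might look like this (hypothetical sketch; `Action`, `Observation`, and `Field` are the same base classes and imports as in the earlier examples):

```python
class TextGameAction(Action):
    utterance: str = Field(..., description="Command sent to the game, e.g. 'open door'")

class TextGameObservation(Observation):
    narration: str = Field(..., description="The game's textual response")
    score: float = Field(0.0, description="Cumulative score")
    game_over: bool = Field(False, description="Whether the episode has ended")
```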
OpenEnv maintains the classic RL loop but enhances it:
from abc import ABC, abstractmethod

class Environment(ABC):
    """Abstract base for all OpenEnv environments."""

    @abstractmethod
    def reset(self) -> Observation:
        """Initialize new episode, return initial observation."""
        pass

    @abstractmethod
    def step(self, action: Action) -> Observation:
        """Execute action, return observation (with reward/done)."""
        pass

    @property
    @abstractmethod
    def state(self) -> State:
        """Return current environment state (episode_id, step_count)."""
        pass
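For concreteness, a toy subclass satisfying this contract might look like the following (hypothetical; the `Observation` and `State` constructors are assumptions based on the fields mentioned above):

```python
from uuid import uuid4

class CounterEnvironment(Environment):
    """Toy environment: counts steps and reports success on every action."""

    def __init__(self):
        self._state = State(episode_id=str(uuid4()), step_count=0)

    def reset(self) -> Observation:
        self._state = State(episode_id=str(uuid4()), step_count=0)
        return Observation(success=True)

    def step(self, action: Action) -> Observation:
        self._state.step_count += 1
        return Observation(success=True)

    @property
    def state(self) -> State:
        return self._state
```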
Key Enhancements:

OpenEnv environments can expose MCP tools for agent interaction. This example shows how an agent discovers available MCP tools and invokes a specific tool as part of a multi-step workflow:
# Calendar Gym exposes 25+ MCP tools
tools = [
    "calendars_list",   # List user's calendars
    "calendars_get",    # Get calendar details
    "events_list",      # List events
    "events_insert",    # Create event
    "events_update",    # Update event
    "acl_list",         # List access control rules
    "acl_insert",       # Add calendar permission
    # ... 18 more tools
]

MCP Benefits:
Example Tool Call:
# Agent discovers tools
result = client.step(Action(action_type="ListToolsAction"))
print(result.observation.tools_list)  # All 25 calendar tools

# Agent calls specific tool
result = client.step(Action(
    action_type="ToolCallAction",
    tool_name="events_insert",
    arguments={
        "calendarId": "primary",
        "summary": "Team Standup",
        "start": {"dateTime": "2026-01-10T09:00:00Z"},
        "end": {"dateTime": "2026-01-10T09:30:00Z"}
    }
))

OpenEnv uses WebSocket for client-server communication. This example illustrates the JSON message structure exchanged between an agent and an OpenEnv environment during a single interaction step:
Connection Flow:
1. Client connects: ws://localhost:8004/ws
2. Server creates isolated environment instance
3. Client sends: {"action": "reset"}
4. Server responds: {"observation": {...}, "reward": 0, "done": false}
5. Client sends: {"action": {"type": "tool_call", "tool": "..."}}
6. Server responds: {"observation": {...}, "reward": 1, "done": false}
7. ... (multiple steps)
8. Client disconnects: Environment instance cleaned up
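The same flow can be driven without any client wrapper. A minimal sketch using the `websockets` library (message shapes follow the format shown below):

```python
import asyncio
import json
import websockets

async def run_episode():
    # Connect; the server creates an isolated environment for this socket
    async with websockets.connect("ws://localhost:8004/ws") as ws:
        await ws.send(json.dumps({"action": "reset"}))
        print(json.loads(await ws.recv()))  # initial observation

        await ws.send(json.dumps({
            "action": {"action_type": "ToolCallAction",
                       "tool_name": "calendars_list", "arguments": {}}
        }))
        print(json.loads(await ws.recv()))  # step result

asyncio.run(run_episode())
```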
Message Format (JSON):
// Request
{
  "action": {
    "action_type": "ToolCallAction",
    "tool_name": "calendars_list",
    "arguments": {}
  }
}

// Response
{
  "observation": {
    "success": true,
    "tool_result": [...],
    "error_message": null
  },
  "reward": 1.0,
  "done": false,
  "state": {
    "episode_id": "abc-123",
    "step_count": 5
  }
}

Advantages over HTTP:
Each OpenEnv environment is a self-contained Docker container:
# Multi-stage build for efficiency
FROM python:3.11-slim AS builder
WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN pip install uv && uv sync --frozen

FROM python:3.11-slim
COPY --from=builder /app/.venv /app/.venv
COPY . /app
ENV PATH="/app/.venv/bin:$PATH"
HEALTHCHECK CMD curl -f http://localhost:8000/health
CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0"]
Benefits:
OpenEnv provides a CLI for environment lifecycle:
# Initialize new environment from template
openenv init my_new_env --output-dir ./envs

# Build Docker image
openenv build

# Validate environment follows standards
openenv validate --verbose

# Push to Hugging Face Hub
openenv push --org my-org --token $HF_TOKEN

# Deploy to Kubernetes
openenv deploy --replicas 3 --namespace production
Production environments include:
Example Metrics:
# Automatically collected
openenv_steps_total{env="calendar", status="success"} 1523
openenv_step_duration_seconds{env="calendar", quantile="0.95"} 0.23
openenv_active_sessions{env="calendar"} 47
openenv_errors_total{env="calendar", type="validation"} 12
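Instrumentation of this kind can be built on the standard Prometheus Python client. A sketch of what emitting the metrics above might look like (metric names mirror the samples; this is not a confirmed OpenEnv internal):

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

STEPS = Counter("openenv_steps_total", "Environment steps", ["env", "status"])
STEP_LATENCY = Histogram("openenv_step_duration_seconds", "Step latency", ["env"])
ACTIVE_SESSIONS = Gauge("openenv_active_sessions", "Open WebSocket sessions", ["env"])

start_http_server(9090)  # expose /metrics for Prometheus to scrape

ACTIVE_SESSIONS.labels(env="calendar").inc()       # on WebSocket connect
with STEP_LATENCY.labels(env="calendar").time():   # around each env.step()
    STEPS.labels(env="calendar", status="success").inc()
ACTIVE_SESSIONS.labels(env="calendar").dec()       # on disconnect
```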
| Feature | OpenAI Gym | OpenEnv |
|---|---|---|
| Transport | HTTP (stateless) | WebSocket (stateful) |
| Deployment | Local script | Docker microservice |
| Type Safety | Basic types | Pydantic validation |
| Concurrency | Shared instance | Isolated instances |
| Tool Calling | Not supported | MCP integration |
| Production | Research-only | Production-ready |
| Monitoring | Manual | Built-in metrics |
| Scaling | Single machine | Kubernetes-native |
| Domains | Games, robotics | Browsers, code, calendars, finance, etc. |
OpenEnv solves the “RL deployment gap” where:
What OpenEnv achieves:
Impact Areas:
To demonstrate how OpenEnv operates in practice and to contribute meaningful benchmarks to the ecosystem, Turing developed the Calendar Gym: a production-grade environment that captures the complexity of real-world scheduling, permissions, and multi-agent coordination. The following section details why calendars were chosen, how the environment was designed, and what it reveals about the strengths and limitations of today’s tool-using agents.
Turing selected calendar management as its flagship OpenEnv contribution for several strategic reasons:
Calendar systems exhibit challenging properties that make them well suited to RL research:
Multi-Agent Coordination
State Space Complexity
Partial Observability
Turing has pioneered tool-using agents through SWE-bench, where agents:
Calendar Gym extends this expertise:
Calendar tasks provide objective verification:
# Example verifier: "Alice should have writer access to Bob's project calendar"
verifier = {
    "verifier_type": "database_state",
    "validation_config": {
        "query": """
            SELECT COUNT(*) FROM acls
            WHERE calendar_id='bob-projects'
              AND user_id='alice_manager'
              AND role IN ('writer', 'owner')
        """,
        "expected_value": 1,
        "comparison_type": "equals"
    }
}

This enables:
1. Multi-Tenancy with Database Isolation
Each agent session gets its own isolated database. The example below shows how a session's `database_id` maps to a dedicated database instance, ensuring reproducibility and preventing cross-session interference:
class MCPEnvironment(Environment):
    def __init__(self, database_id: str, auth_token: Optional[str] = None):
        self.database_id = database_id
        self.session_manager = get_session_manager()
        # Each database_id → separate SQLite file
        self.db_engine = self.session_manager.get_engine(database_id)

Why this matters:
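`get_session_manager` is not shown in the excerpt; conceptually it hands out one engine per `database_id`, roughly like this (hypothetical sketch using SQLAlchemy):

```python
from sqlalchemy import create_engine
from sqlalchemy.engine import Engine

class SessionManager:
    """Hypothetical sketch: one SQLite file (and engine) per database_id."""

    def __init__(self, data_dir: str = "/data"):
        self.data_dir = data_dir
        self._engines: dict[str, Engine] = {}

    def get_engine(self, database_id: str) -> Engine:
        # Lazily create and cache an engine for this session's database
        if database_id not in self._engines:
            self._engines[database_id] = create_engine(
                f"sqlite:///{self.data_dir}/{database_id}.db"
            )
        return self._engines[database_id]
```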
2. Dual Protocol Support: OpenEnv + MCP
The Calendar Gym implements two protocols. This example highlights how the Calendar Gym supports both the OpenEnv interaction protocol and the MCP tool-calling protocol, enabling compatibility with a wide range of agent frameworks:
# OpenEnv protocol (step/reset)
@app.post("/step")
async def step(action: MCPAction) -> Dict[str, Any]:
    observation = env.step(action)
    return {
        "observation": observation.model_dump(),
        "reward": calculate_reward(observation),
        "done": is_task_complete(observation)
    }

# MCP protocol (JSON-RPC 2.0)
@app.post("/mcp")
async def mcp_endpoint(request: Request) -> Dict[str, Any]:
    body = await request.json()
    if body["method"] == "tools/list":
        return {"result": {"tools": list_all_tools()}}
    elif body["method"] == "tools/call":
        tool_name = body["params"]["name"]
        result = execute_tool(tool_name, body["params"]["arguments"])
        return {"result": result}

Benefits:
3. Header-Based Authentication and Context
Multi-user scenarios require per-request authentication:
@app.post("/step")
async def step(
request: Request,
action: MCPAction
) -> Dict[str, Any]:
# Extract headers
access_token = request.headers.get("x-access-token")
database_id = request.headers.get("x-database-id")
# Set context for this request
env.set_request_context(
database_id=database_id,
access_token=access_token
)
# Execute action with proper permissions
observation = env.step(action)
return {"observation": observation.model_dump()}4. Comprehensive Tool Suite (25+ Operations)
The Calendar Gym exposes the full Google Calendar API v3 via MCP tools:
Installation
# Clone repository
git clone https://github.com/your-org/rl-gym.git
cd rl-gym/calendar

# Install dependencies (using uv for speed)
pip install uv
uv sync

# Or traditional pip
pip install -r requirements.txt
Running Locally
# Start server
uvicorn main:app --reload --port 8004

# Test health check
curl http://localhost:8004/health

# View API docs
open http://localhost:8004/docs
Running with Docker
# Build and run
docker compose build --no-cache calendar
docker compose up -d calendar

# Check logs
docker logs calendar-service -f

# Test
curl http://localhost:8010/health
Basic Agent Interaction
from openenv_wrapper.client import MCPEnvClient
from openenv_wrapper.data_models import MCPAction
# Connect to environment
with MCPEnvClient(base_url="http://localhost:8004") as client:
    # Reset environment (initializes database with sample data)
    result = client.reset()
    print(f"Reset successful: {result.observation.success}")

    # Discover available tools
    result = client.step(MCPAction(action_type="ListToolsAction"))
    print(f"Available tools: {len(result.observation.tools_list)}")

    # List Alice's calendars
    result = client.step(MCPAction(
        action_type="ToolCallAction",
        tool_name="calendars_list",
        arguments={}
    ))
    calendars = result.observation.tool_result["items"]
    print(f"Alice has {len(calendars)} calendars")

    # Create a new event
    result = client.step(MCPAction(
        action_type="ToolCallAction",
        tool_name="events_insert",
        arguments={
            "calendarId": "primary",
            "summary": "AI Research Sync",
            "start": {"dateTime": "2026-01-15T14:00:00Z"},
            "end": {"dateTime": "2026-01-15T15:00:00Z"}
        }
    ))
    print(f"Event created: {result.observation.success}")

The Calendar Gym supports flexible reward shaping for RL training. This example shows how rewards are computed based on agent success, efficiency, and tool usage to guide learning and evaluation:
Built-in Reward Components
def calculate_reward(observation: MCPObservation) -> float:
    """Calculate reward based on observation."""
    reward = 0.0

    # Success/failure
    if observation.success:
        reward += 1.0
    else:
        reward -= 0.5

    # Efficiency (fewer steps = better)
    if observation.step_count < 5:
        reward += 0.2

    # Tool usage (prefer specific tools over generic)
    if observation.tool_used in ["acl_patch", "events_update"]:
        reward += 0.1  # Prefer targeted modifications
    elif observation.tool_used in ["acl_insert", "events_insert"]:
        reward -= 0.1  # Penalize creating new entities unnecessarily

    return reward

Custom Reward Functions
For research, you can define domain-specific rewards:
# Example: Reward minimal ACL changes
def minimal_acl_changes_reward(observation: MCPObservation, initial_state: Dict) -> float:
    if observation.tool_used.startswith("acl_"):
        # Count ACL modifications
        current_acls = query_acl_count()
        initial_acls = initial_state["acl_count"]

        # Penalize creating new ACLs
        if current_acls > initial_acls:
            return -0.5

        # Reward modifying existing ACLs
        return 0.5
    return 0.0

Verifiers are SQL-based checks that validate agent behavior:
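A `database_state` verifier like the one under "Verifier Structure" below can be executed by a small harness; a sketch assuming the per-session databases are SQLite files:

```python
import sqlite3

def run_verifier(verifier: dict, db_path: str) -> bool:
    """Run a database_state verifier's query and compare against the expected value."""
    config = verifier["validation_config"]
    with sqlite3.connect(db_path) as conn:
        value = conn.execute(config["query"]).fetchone()[0]
    if config["comparison_type"] == "equals":
        return value == config["expected_value"]
    raise ValueError(f"Unsupported comparison: {config['comparison_type']}")
```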
Verifier Structure
{
  "verifier_type": "database_state",
  "name": "Alice_Has_Writer_Access",
  "description": "Alice must have writer or owner role on Bob's project calendar",
  "validation_config": {
    "query": """
      SELECT COUNT(*) AS count
      FROM acls
      WHERE calendar_id='bob-projects'
        AND user_id='alice_manager'
        AND role IN ('writer', 'owner')
    """,
    "expected_value": 1,
    "comparison_type": "equals"
  }
}

Although the following insights are derived from rigorous benchmarking within the Calendar Gym, they reflect challenges inherent to real-world tool-using agents. The failure modes and performance bottlenecks observed here extend beyond scheduling tasks, offering actionable lessons for the design, prompting, and evaluation of agentic systems in complex production environments.
Through extensive evaluation on the Calendar Gym, we’ve identified several critical insights about tool-using agents:
Observation: Agents excel at single-tool calls but struggle with multi-step workflows.
Why this matters:
Observation: Agents often fail when identifiers are ambiguous (e.g., “Bob’s calendar” vs. “bob-development” vs. “bob-personal”).
Benchmark Result:
Scenario: "Grant Alice access to Bob's project calendar"
- With explicit ID: 89% success
- With natural language description: 41% success
Agent Failure Modes:
Lesson: Agents need explicit lookup → validate → use patterns.
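In Calendar Gym terms, that pattern might look like the following sketch (reusing `client` and `MCPAction` from the quick start; the `acl_insert` argument shape mirrors the Google Calendar API v3 and is an assumption here):

```python
# 1. Lookup: resolve "Bob's project calendar" to a concrete ID
result = client.step(MCPAction(
    action_type="ToolCallAction", tool_name="calendars_list", arguments={}
))
matches = [c for c in result.observation.tool_result["items"]
           if "project" in c["summary"].lower()]

# 2. Validate: refuse to act on ambiguous or missing matches
assert len(matches) == 1, f"Expected exactly one calendar, found {len(matches)}"

# 3. Use: act on the validated identifier only
client.step(MCPAction(
    action_type="ToolCallAction", tool_name="acl_insert",
    arguments={"calendarId": matches[0]["id"],
               "role": "writer",
               "scope": {"type": "user", "value": "alice_manager"}}
))
```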
Observation: Even when agents select the correct tool, they often provide malformed arguments.
Error Breakdown (from 500 failed tool calls):
Common Argument Errors:
Lesson: Argument validation and example-driven prompting are essential.
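One mitigation is to validate tool arguments against a schema before the call reaches the environment. A sketch with Pydantic v2 (hypothetical models mirroring the `events_insert` arguments used earlier):

```python
from pydantic import BaseModel, Field, ValidationError

class EventTime(BaseModel):
    dateTime: str = Field(..., pattern=r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$")

class EventsInsertArgs(BaseModel):
    calendarId: str
    summary: str
    start: EventTime
    end: EventTime

raw_args = {
    "calendarId": "primary",
    "summary": "AI Research Sync",
    "start": {"dateTime": "2026-01-15T14:00"},  # malformed: missing seconds and Z
    "end": {"dateTime": "2026-01-15T15:00:00Z"},
}

try:
    args = EventsInsertArgs(**raw_args)
except ValidationError as err:
    # Surface the schema error to the agent instead of executing the call
    print(err)
```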
Partner with Turing to fine-tune, validate, and deploy models that learn continuously.