TEACHING MACHINES THROUGH GAMES

Simulation environments, datasets, and evals to train and test agents — for LLM/VLM and robotics teams.

Reinforcement learning environments to improve models

Long-horizon, physics-based, resource allocation, and other cognitive tasks
Skills that transfer to real-world tasks like math, reasoning, and tool use
Augmented by human interaction data from our consumer platform

Benchmarks that measure real-world performance

Evals for cognition and reasoning in embodied systems
Multi-agent scenarios that test persuasion, deception, and coordination
Dynamic environments that stress adaptation and exploration

See Benchmark

Environments

Task suites for long-horizon planning, real-time control, and generalization.

Endless

Flappy Bird

An LLM-friendly version of the viral mobile hit, with text-scaffolded physics. Play is turn-based: each turn the agent receives its current position, velocity, and the upcoming pipe gaps, then chooses TAP or no action. The physics constants are stated explicitly so the model can project trajectories ahead.
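
As a rough illustration of projecting a trajectory from explicit constants, here is a minimal sketch. The constant values, the `Bird` type, and the function names are all assumptions made for this example; the environment's actual physics and state format are not specified here.

```python
from dataclasses import dataclass

# Illustrative constants only; the real environment's values may differ.
GRAVITY = 1.0        # added to vertical velocity each turn (down is positive)
TAP_VELOCITY = -3.0  # vertical velocity immediately after a TAP (upward)


@dataclass
class Bird:
    y: float   # vertical position; larger values are lower on screen
    vy: float  # vertical velocity


def step(bird: Bird, tap: bool) -> Bird:
    """Advance one turn: a TAP resets velocity, otherwise gravity accelerates the fall."""
    vy = TAP_VELOCITY if tap else bird.vy + GRAVITY
    return Bird(y=bird.y + vy, vy=vy)


def project(bird: Bird, actions: list[bool]) -> list[float]:
    """Return the bird's y position after each planned action."""
    ys = []
    for tap in actions:
        bird = step(bird, tap)
        ys.append(bird.y)
    return ys


def plan_is_safe(bird, actions, gap_top, gap_bottom, turns_to_pipe):
    """Check that the bird sits inside the gap on the turn it reaches the pipe."""
    y_at_pipe = project(bird, actions)[turns_to_pipe - 1]
    return gap_top < y_at_pipe < gap_bottom
```

For example, `project(Bird(y=10, vy=0), [False, False, True])` falls for two turns and then taps, so a model can compare candidate plans before committing to a move.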

Spatial

Snake

Classic Snake on an 8×8 grid, but the agent plans a full path to each apple rather than a single move. The challenge is reasoning about when body segments will clear before the head arrives. An HP system discourages guessing, so reliable pathfinding beats trial and error.
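
One way to reason about segments clearing is to search over whole snake states, so a cell the tail has vacated becomes available again. The sketch below is an assumption-laden illustration: the coordinate convention, move names, and `plan_path` function are hypothetical, and it ignores growth and the HP system. The state space grows quickly with snake length, so this suits short snakes only.

```python
from collections import deque

SIZE = 8  # 8x8 grid, as described above
# Move names and the (x, y) convention are assumptions for this sketch.
MOVES = {"UP": (0, -1), "DOWN": (0, 1), "LEFT": (-1, 0), "RIGHT": (1, 0)}


def plan_path(body, apple):
    """Breadth-first search over full snake states.

    body: list of (x, y) cells, head first.
    Returns the shortest list of move names that brings the head to the
    apple without hitting a wall or the body, or None if none exists.
    """
    start = tuple(body)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        snake, path = queue.popleft()
        head = snake[0]
        if head == apple:
            return path
        for name, (dx, dy) in MOVES.items():
            nxt = (head[0] + dx, head[1] + dy)
            if not (0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE):
                continue  # would leave the grid
            # The tail cell is vacated during this move, so only the
            # remaining segments can block the head.
            if nxt in snake[:-1]:
                continue
            new_snake = (nxt,) + snake[:-1]
            if new_snake not in seen:
                seen.add(new_snake)
                queue.append((new_snake, path + [name]))
    return None
```

With the body `[(1, 0), (1, 1), (0, 1), (0, 0)]` and an apple at `(0, 1)`, the target cell is occupied at the start but has cleared by the time the head arrives via `(0, 0)`.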

Probabilistic

Blackjack

The classic card game, text-scaffolded for LLMs. The agent sees its hand total and the dealer's face-up card, then chooses HIT or STICK. The dealer plays by a fixed rule (hit on 16 or below). Pure probabilistic reasoning: when to push your luck and when to stand pat.
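
To make the probabilistic trade-off concrete, the sketch below computes the chance of busting on a single HIT. It assumes an infinite deck (every rank equally likely) and a hard total with the ace counted as 1; the environment's actual deck and ace handling are not specified here, and the function names are hypothetical.

```python
from fractions import Fraction

# Assumption: infinite deck, so each of the 13 ranks has probability 1/13.
# The ace is counted as 1, and 10, J, Q, K all count as 10.
CARD_VALUES = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]


def bust_probability(hard_total: int) -> Fraction:
    """Probability that one more card pushes a hard total above 21."""
    busting = sum(1 for value in CARD_VALUES if hard_total + value > 21)
    return Fraction(busting, len(CARD_VALUES))


def dealer_hits(total: int) -> bool:
    """The fixed dealer rule described above: hit on 16 or below."""
    return total <= 16
```

Under these assumptions, hitting on a hard 12 busts with probability 4/13, while hitting on a hard 16 busts with probability 8/13.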

Route Optimization

Taxi

A 5×5 city grid with walls and four pickup/dropoff spots. Agent drives to the passenger, picks them up, navigates around walls, and drops them at the destination. Multi-step logistics where planning the full route matters.
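
Planning the full route can be sketched as two shortest-path searches, one to the passenger and one to the destination. Everything below is an assumption for illustration: walls are modeled as blocked edges between adjacent cells, and the action names, coordinates, and functions are hypothetical rather than the environment's interface.

```python
from collections import deque

SIZE = 5  # 5x5 grid, as described above
# Action names and the (x, y) convention are assumptions for this sketch.
MOVES = {"NORTH": (0, -1), "SOUTH": (0, 1), "WEST": (-1, 0), "EAST": (1, 0)}


def shortest_route(start, goal, walls):
    """Breadth-first search on the grid.

    walls: set of frozenset({cell_a, cell_b}) pairs marking edges between
    adjacent cells that the taxi cannot cross.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        cell, path = queue.popleft()
        if cell == goal:
            return path
        for name, (dx, dy) in MOVES.items():
            nxt = (cell[0] + dx, cell[1] + dy)
            if not (0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE):
                continue  # off the grid
            if frozenset((cell, nxt)) in walls or nxt in seen:
                continue  # blocked by a wall or already explored
            seen.add(nxt)
            queue.append((nxt, path + [name]))
    return None


def plan_trip(taxi, passenger, destination, walls):
    """Drive to the passenger, pick up, drive to the destination, drop off."""
    to_passenger = shortest_route(taxi, passenger, walls)
    to_destination = shortest_route(passenger, destination, walls)
    if to_passenger is None or to_destination is None:
        return None
    return to_passenger + ["PICKUP"] + to_destination + ["DROPOFF"]
```

A wall between `(0, 0)` and `(1, 0)` forces a three-move detour to reach a passenger one cell away, which is the kind of case where committing to a full route pays off.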

Long Horizon

Catan

1v1 Settlers of Catan, first to 7 Victory Points wins. Agent manages resources, trades, builds settlements and roads. Long-horizon strategy where early positioning shapes late-game options.

view all

Research

From Game Replays to Generalization

We ran RL on a small LLM across interactive game environments to study how skills learned in games transfer out of distribution. This work is grounded in our thesis that minds are scaffolded by the environments they act in, and that intelligence lies in the loop between an agent and its world. We created environments inspired by classic arcade games and traditional RL benchmarks, then evaluated the trained models both in-game and on downstream math reasoning benchmarks.

See blogpost

Resources

4Wall Platform

TALES