Because Mom Said So

Priors in Evolutionary Agents

With this project, we explored whether LLM agents can compress what they know into a single paragraph and pass it down across generations.

An LLM agent exploring a partially observable world faces a tradeoff. Its context window is finite, the world is noisy, and most of what it sees is designed to mislead. Under a strict 750-token memory budget, the agent must decide what to keep and what to forget, a form of bounded rationality in the sense of Simon (1955). The interesting question is what happens when that compression becomes productive, i.e., when an agent distills its experience into a paragraph and hands it to a successor who has never seen the environment before. We built a system where LLM agents navigate a POMDP over random geometric graphs, reproduce by compressing their knowledge into natural-language priors (max 150 words), and pass those priors to their children. The prior is the genome.

I. The Environment

The task is a POMDP tuple $(\mathcal{S}, \mathcal{A}, T, R, \Omega, O)$ over a random geometric graph $G = (V, E)$. We scatter $n$ nodes uniformly in $[0,1]^2$ and connect any pair within radius $r$:

$$E = \{(i,j) : \|p_i - p_j\|_2 \leq r\}, \quad r = 0.35, \; n = 20$$

A subset of $k = 5$ nodes are designated as doors, themed with colors and shapes (red arched door, blue narrow door, ...) and chosen to have near-average degree so they are neither trivial nor dead ends. One door is the goal, placed at least $\delta_{\min}$ hops from the agent's start via BFS.
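The construction above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the project's code: the start node, $\delta_{\min} = 3$, and the fallback when no door is far enough are all assumptions.

```python
from collections import deque
import numpy as np

rng = np.random.default_rng(0)
n, r, k = 20, 0.35, 5

# Scatter n nodes uniformly in the unit square; connect pairs within radius r.
pos = rng.random((n, 2))
adj = [[] for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        if np.linalg.norm(pos[i] - pos[j]) <= r:
            adj[i].append(j)
            adj[j].append(i)

def bfs_hops(src):
    """Hop distance from src to every reachable node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

# Doors: the k nodes whose degree is closest to the mean degree.
mean_deg = sum(len(a) for a in adj) / n
doors = sorted(range(n), key=lambda v: abs(len(adj[v]) - mean_deg))[:k]

# Goal: a door at least delta_min hops from the start (else the farthest door).
start, delta_min = 0, 3
dist = bfs_hops(start)
candidates = [d for d in doors if dist.get(d, 0) >= delta_min]
goal = candidates[0] if candidates else max(doors, key=lambda d: dist.get(d, 0))
```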

Each door emits $h$ hints and $d$ distractors drawn from five signal types: spatial, color, relational, narrative, and pattern. Hints are internally consistent: they agree on the goal's region and color. Distractors name wrong regions and wrong colors, and sometimes contradict each other. The agent receives signals without labels; the only way to distinguish truth from noise is to notice that agreement implies reliability.

The agent observes its current node, 1-hop neighbors, and a random subset of unlabeled signals from nearby doors. Actions are free-text strings parsed against the neighbor list. The reward then becomes:

R(s,a)={+1if a=enter(d) and d=d1if a=enter(d) and dd0otherwiseR(s, a) = \begin{cases} +1 & \text{if } a = \texttt{enter}(d) \text{ and } d = d^* \\ -1 & \text{if } a = \texttt{enter}(d) \text{ and } d \neq d^* \\ 0 & \text{otherwise} \end{cases}

Active Cloaking

An optional adversarial layer (inspired by my work with Professor Mucha and the paper by DeGiovanni & Guevara Vasquez (2025)) mathematically suppresses signals near the goal. Given goal node $g$, we define an inner cloaking region $\Omega$ and boundary annulus $\partial\Omega$. The uncloaked signal potential $\mathbf{u}_{\text{ref}}$ solves the Dirichlet problem on the graph Laplacian $L = D - A$:

$$L\,\mathbf{u}_{\text{ref}} = \mathbf{0}, \quad u_{\text{ref}}(g) = 1, \quad u_{\text{ref}}(b) = 0 \;\;\forall b \in \mathcal{B}$$

where $\mathcal{B}$ is the domain boundary. To cloak the goal, we build a modified Laplacian using the Dirichlet-to-Neumann (Schur complement) operator, which disconnects $\Omega$ from the exterior:

$$L_{\text{cloak}} = L_{\mathcal{E}\mathcal{E}} - L_{\mathcal{E}\Omega}\,L_{\Omega\Omega}^{-1}\,L_{\Omega\mathcal{E}}$$

Per-node signal visibility becomes the attenuated ratio $v(i) = \text{clip}\left(\frac{u_{\text{cloak}}(i)}{u_{\text{ref}}(i)},\; 0.05,\; 1.0\right)$. At runtime, hints at low-visibility nodes are probabilistically flipped into random distractors. Agents far from the goal are effectively blinded until they penetrate the cloak.
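A toy version of this computation on a 7-node path graph, using dense NumPy for readability. Several details are assumptions: we take $\Omega = \{g, 1\}$, and we assume the cloaked potential drops the goal's Dirichlet condition (it is eliminated along with $\Omega$), so the exterior potential collapses and the 0.05 clip floor supplies the residual visibility. The actual construction may differ.

```python
import numpy as np

# Path graph 0-1-2-3-4-5-6; goal at node 0, domain boundary at node 6.
A = np.zeros((7, 7))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A          # graph Laplacian L = D - A

goal = 0
omega = [0, 1]                           # inner cloaking region around the goal
ext = [i for i in range(7) if i not in omega]

def dirichlet(Lmat, nodes, fixed):
    """Solve Lmat u = 0 over `nodes` with Dirichlet data `fixed` = {node: value}."""
    idx = {v: i for i, v in enumerate(nodes)}
    free = [idx[v] for v in nodes if v not in fixed]
    pinned = [idx[v] for v in nodes if v in fixed]
    vals = np.array([fixed[v] for v in nodes if v in fixed])
    u = np.zeros(len(nodes))
    u[pinned] = vals
    u[free] = np.linalg.solve(Lmat[np.ix_(free, free)],
                              -Lmat[np.ix_(free, pinned)] @ vals)
    return {v: u[idx[v]] for v in nodes}

# Reference potential: the goal radiates u = 1 against a grounded boundary.
u_ref = dirichlet(L, list(range(7)), {goal: 1.0, 6: 0.0})

# Cloaked exterior operator: Schur complement eliminating the Omega block.
Lee = L[np.ix_(ext, ext)]
Leo = L[np.ix_(ext, omega)]
Loo = L[np.ix_(omega, omega)]
L_cloak = Lee - Leo @ np.linalg.inv(Loo) @ Leo.T

# With the goal's condition gone, the exterior problem is source-free;
# the clip floor keeps 5% residual visibility at every exterior node.
u_cloak = dirichlet(L_cloak, ext, {6: 0.0})
vis = {i: float(np.clip(u_cloak[i] / u_ref[i], 0.05, 1.0))
       for i in ext if u_ref[i] > 0}
```

On this path graph `u_ref` decays linearly from the goal, while every exterior node's visibility sits at the 0.05 floor, which is exactly the "blinded until you penetrate the cloak" regime described above.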

II. Agent Architecture

Each agent wraps a two-tier LLM setup: an expensive reasoning model (GPT-4.1-mini) handles decisions, prior compression, and convention proposals, while a cheap utility model (Gemini 2.0 Flash) handles context summarization, evidence extraction, and question formulation. Every call goes through retry logic with exponential backoff.

Context Management

Agents carry a rolling buffer of $(\text{observation}, \text{action}, \text{reasoning})$ entries. Token usage is estimated as $|\text{text}|/4$. When utilization exceeds 80% of the $C = 750$ token budget, the oldest half of the buffer is compressed by the utility model into a 2–3 sentence summary emphasizing transferable heuristics. This creates a recency gradient: recent experience stays detailed, while older experience is progressively abstracted.
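The budget logic reduces to a few lines. This is a minimal sketch with a stubbed summarizer standing in for the Gemini call; all names are illustrative:

```python
CAPACITY = 750          # token budget C
THRESHOLD = 0.8         # compress once utilization exceeds 80%

def est_tokens(text: str) -> int:
    """Crude |text| / 4 token estimate."""
    return len(text) // 4

def summarize(entries: list[str]) -> str:
    """Stand-in for the utility-model summarization call."""
    return "SUMMARY: " + " | ".join(e[:20] for e in entries)

def maybe_compress(buffer: list[str]) -> list[str]:
    """Replace the oldest half of the buffer with a summary when over budget."""
    used = sum(est_tokens(e) for e in buffer)
    if used <= THRESHOLD * CAPACITY:
        return buffer
    half = len(buffer) // 2
    return [summarize(buffer[:half])] + buffer[half:]
```

Repeated compression is what produces the recency gradient: each pass folds older summaries into still-coarser summaries while fresh entries stay verbatim.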

Bayesian Belief Tracking

When belief tracking is enabled, the agent maintains a categorical posterior $\mathbf{b} \in \Delta^{k-1}$ over door identities. At each step, the utility model extracts evidence tuples $(d_j, \rho_j)$ from observed signals. The multiplicative update rule is

$$b_i \leftarrow \begin{cases} b_i \cdot (1 + \rho_j) & \text{if } i = j \;\text{(supported door)} \\ b_i \cdot (1 - 0.3\,\rho_j) & \text{otherwise} \end{cases}$$

followed by renormalization $\mathbf{b} \leftarrow \mathbf{b} / \|\mathbf{b}\|_1$. Entropy, the MAP door, and belief on the true goal are tracked as per-step trajectories and injected into the agent's system prompt.
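The update rule above can be sketched directly, assuming evidence strength $\rho_j \in [0, 1]$ (the function name is ours):

```python
import numpy as np

def update_belief(b: np.ndarray, j: int, rho: float) -> np.ndarray:
    """Boost door j by (1 + rho), damp the rest by (1 - 0.3 rho), renormalize."""
    b = b.copy()
    mask = np.arange(len(b)) == j
    b[mask] *= 1 + rho
    b[~mask] *= 1 - 0.3 * rho
    return b / b.sum()
```

Because the damping factor $(1 - 0.3\rho_j)$ is gentler than the boost, a single strong signal shifts but does not collapse the posterior; agreement across several signals is what drives the belief toward one door.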

III. Reproduction & Fertility

At reproduction, the reasoning model compresses the agent's last 12 context entries into a prior of at most 150 words covering signal reliability, navigation strategy, and doors to avoid. The child inherits this prior via its system prompt and spawns at the same start node with an empty buffer, so performance differences come from inherited knowledge.

Three reproduction triggers are available: periodic (every $\tau$ interactions), on-success (upon finding the goal), and novelty-based (when the agent's own experience has changed enough to be worth distilling). The novelty trigger computes the Jaccard distance between the older and recent halves of the context buffer:

$$J_{\text{novelty}} = 1 - \frac{|W_{\text{old}} \cap W_{\text{new}}|}{|W_{\text{old}} \cup W_{\text{new}}|}$$

where $W_{\text{old}}$ and $W_{\text{new}}$ are the word sets from the first and second halves of the context. Reproduction fires when $J_{\text{novelty}} \geq \theta$ (default 0.7). The intuition: high novelty means the agent has accumulated genuinely new information worth passing on.

We define fertility as the mean number of reproductive events per agent under a given strategy. The central question of Experiment I is whether content-aware triggers (novelty) outperform fixed schedules (every $\tau$), especially when transferring to unseen environments.

IV. Results

Main Evaluation (250 trials)

The grandchild phenomenon. On the hardest graph instance (shortest path = 9 hops), the no-prior baseline exhausted the 300-step budget without finding the goal. The oracle with full state information took 98 steps, cycling between spatially close but topologically distant nodes. A third-generation agent solved it in 5 steps because its inherited prior encoded experiential route knowledge ("Nodes 6, 20, 26, 32 form a reliable left-side corridor") that shortcut the topology in ways raw coordinates couldn't. The result replicated across 5 additional high-difficulty seeds.
| Condition | Success | Median Steps | Mean Steps |
|---|---|---|---|
| Prior inheritance | 93% | 12.0 | 31.9 |
| Oracle (full state) | 97% | 8.0 | 16.2 |
| No prior (blank) | 78% | 34.5 | 45.1 |
| Random prior | 84% | 28.0 | 41.3 |
| Random walk | 81% | 71.0 | 81.6 |

Fertility Ablation

Fixed-interval strategies overfit: every-30 achieves the lowest step count on the training graph (18.7) but degrades by +180% on the harder graph, while novelty θ = 0.7 degrades by only +37%. The novelty prior is environment-robust because its trigger fires on information content rather than on a schedule.

| Condition | Steps | Births | Steps (hard) | Δ% |
|---|---|---|---|---|
| every 3 | 43.0 | 11.3 | | |
| every 7 | 22.7 | 5.0 | | |
| every 15 | 21.3 | 4.0 | 48.7 | +129% |
| every 30 | 18.7 | 4.3 | 52.3 | +180% |
| success only | 24.0 | 4.0 | | |
| novelty θ=0.7 | 28.0 | 4.0 | 38.4 | +37% |

Emergent Conventions

Agents spontaneously developed stable naming conventions across generations. In one lineage, gen-0 wrote full sentences, gen-1 compressed to imperative rules, and gen-2 produced terse shorthand ("Red arched door, lower-left. Trust red, ignore yellow."). Linguistic drift increased with environment complexity: parent-child Jaccard similarity dropped from 0.338 on small graphs to 0.246 on large ones, consistent with cultural transmission theory (Henrich 2015). In one trial, a gen-2 child autonomously overrode its parent's incorrect prior ("URGENT: inherited target is WRONG. DISCARD."), showing iterated learning dynamics in which compression bottlenecks naturally filter inaccurate information (Kirby et al. 2008).

V. Experiment Suite

| Exp | Question | Conditions | Trials |
|---|---|---|---|
| A | Do priors help at all? | inherited vs. no-prior | 250 |
| C | Do agents invent stable shorthand? | small / medium / large graphs | 250 |
| E | Does a shared skill library beat individual inheritance? | no-lib, prior-only, prior+library | 250 |
| I | What is the optimal reproduction frequency? | fixed intervals, success-only, novelty thresholds | 250 |
| H | Can agents beat mathematically-hidden goals? | cloaked / uncloaked / cross-transfer | future |

VI. Stack

| Component | Tool |
|---|---|
| LLM access | langchain-dartmouth |
| Reasoning model | GPT-4.1-mini |
| Utility model | Gemini 2.0 Flash |
| Graph math | NumPy, SciPy sparse |
| Cloaking | graph Laplacian, Schur complement |
| Dependency management | uv |
| Output | JSON + text transcripts |