Because Mom Said So

Priors in Evolutionary Agents

With this project, we explored whether LLM agents can compress what they know into a single paragraph and pass it down across generations.

An LLM agent exploring a partially observable world faces a tradeoff. Its context window is finite, the world is noisy, and most of what it sees is designed to mislead. Under a strict 750-token memory budget, the agent must decide what to keep and what to forget, a form of bounded rationality in the sense of Simon (1955). The interesting question is what happens when that compression becomes productive, i.e., when an agent distills its experience into a paragraph and hands it to a successor who has never seen the environment before. We built a system where LLM agents navigate a POMDP over random geometric graphs, reproduce by compressing their knowledge into natural-language priors (max 150 words), and pass those priors to their children. The prior is the genome.

I. The Environment

The task is a POMDP tuple $(\mathcal{S}, \mathcal{A}, T, R, \Omega, O)$ over a random geometric graph $G = (V, E)$. We scatter $n$ nodes uniformly in $[0,1]^2$ and connect any pair within radius $r$:

$$E = \{(i,j) : \|p_i - p_j\|_2 \leq r\}, \quad r = 0.35, \; n = 20$$

A subset of $k = 5$ nodes are designated as doors, themed with colors and shapes (red arched door, blue narrow door, ...) and chosen to have near-average degree so they are neither trivial nor dead ends. One door is the goal, placed at least $\delta_{\min}$ hops from the agent's start via BFS.
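The construction above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the project's code: the start node, $\delta_{\min} = 3$, and the fallback when no door is far enough are all assumptions.

```python
from collections import deque
import numpy as np

rng = np.random.default_rng(0)
n, r, k = 20, 0.35, 5

# Scatter n nodes uniformly in the unit square; connect pairs within radius r.
pos = rng.random((n, 2))
adj = [[] for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        if np.linalg.norm(pos[i] - pos[j]) <= r:
            adj[i].append(j)
            adj[j].append(i)

def bfs_hops(src):
    """Hop distance from src to every reachable node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

# Doors: the k nodes whose degree is closest to the mean degree.
mean_deg = sum(len(a) for a in adj) / n
doors = sorted(range(n), key=lambda v: abs(len(adj[v]) - mean_deg))[:k]

# Goal: a door at least delta_min hops from the start (else the farthest door).
start, delta_min = 0, 3
dist = bfs_hops(start)
candidates = [d for d in doors if dist.get(d, 0) >= delta_min]
goal = candidates[0] if candidates else max(doors, key=lambda d: dist.get(d, 0))
```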

Each door emits $h$ hints and $d$ distractors drawn from five signal types: spatial, color, relational, narrative, and pattern. Hints are internally consistent: they agree on the goal's region and color. Distractors name wrong regions and wrong colors, and sometimes contradict each other. The agent receives signals without labels; the only way to distinguish truth from noise is to notice that agreement implies reliability.

The agent observes its current node, 1-hop neighbors, and a random subset of unlabeled signals from nearby doors. Actions are free-text strings parsed against the neighbor list. The reward then becomes:

R(s,a)={+1if a=enter(d) and d=d1if a=enter(d) and dd0otherwiseR(s, a) = \begin{cases} +1 & \text{if } a = \texttt{enter}(d) \text{ and } d = d^* \\ -1 & \text{if } a = \texttt{enter}(d) \text{ and } d \neq d^* \\ 0 & \text{otherwise} \end{cases}

Active Cloaking

An optional adversarial layer (inspired by my work with Professor Mucha and the paper by DeGiovanni & Guevara Vasquez (2025)) mathematically suppresses signals near the goal. Given goal node $g$, we define an inner cloaking region $\Omega$ and boundary annulus $\partial\Omega$. The uncloaked signal potential $\mathbf{u}_{\text{ref}}$ solves the Dirichlet problem on the graph Laplacian $L = D - A$:

$$L\,\mathbf{u}_{\text{ref}} = \mathbf{0}, \quad u_{\text{ref}}(g) = 1, \quad u_{\text{ref}}(b) = 0 \;\;\forall b \in \mathcal{B}$$

where $\mathcal{B}$ is the domain boundary. To cloak the goal, we build a modified Laplacian using the Dirichlet-to-Neumann (Schur complement) operator, which disconnects $\Omega$ from the exterior:

$$L_{\text{cloak}} = L_{\mathcal{E}\mathcal{E}} - L_{\mathcal{E}\Omega}\,L_{\Omega\Omega}^{-1}\,L_{\Omega\mathcal{E}}$$

Per-node signal visibility becomes the attenuated ratio $v(i) = \text{clip}\left(\frac{u_{\text{cloak}}(i)}{u_{\text{ref}}(i)},\; 0.05,\; 1.0\right)$. At runtime, hints at low-visibility nodes are probabilistically flipped into random distractors. Agents far from the goal are effectively blinded until they penetrate the cloak.
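A toy version of this computation on a 7-node path graph, using dense NumPy for readability. Several details are assumptions: we take $\Omega = \{g, 1\}$, and we assume the cloaked potential drops the goal's Dirichlet condition (it is eliminated along with $\Omega$), so the exterior potential collapses and the 0.05 clip floor supplies the residual visibility. The actual construction may differ.

```python
import numpy as np

# Path graph 0-1-2-3-4-5-6; goal at node 0, domain boundary at node 6.
A = np.zeros((7, 7))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A          # graph Laplacian L = D - A

goal = 0
omega = [0, 1]                           # inner cloaking region around the goal
ext = [i for i in range(7) if i not in omega]

def dirichlet(Lmat, nodes, fixed):
    """Solve Lmat u = 0 over `nodes` with Dirichlet data `fixed` = {node: value}."""
    idx = {v: i for i, v in enumerate(nodes)}
    free = [idx[v] for v in nodes if v not in fixed]
    pinned = [idx[v] for v in nodes if v in fixed]
    vals = np.array([fixed[v] for v in nodes if v in fixed])
    u = np.zeros(len(nodes))
    u[pinned] = vals
    u[free] = np.linalg.solve(Lmat[np.ix_(free, free)],
                              -Lmat[np.ix_(free, pinned)] @ vals)
    return {v: u[idx[v]] for v in nodes}

# Reference potential: the goal radiates u = 1 against a grounded boundary.
u_ref = dirichlet(L, list(range(7)), {goal: 1.0, 6: 0.0})

# Cloaked exterior operator: Schur complement eliminating the Omega block.
Lee = L[np.ix_(ext, ext)]
Leo = L[np.ix_(ext, omega)]
Loo = L[np.ix_(omega, omega)]
L_cloak = Lee - Leo @ np.linalg.inv(Loo) @ Leo.T

# With the goal's condition gone, the exterior problem is source-free;
# the clip floor keeps 5% residual visibility at every exterior node.
u_cloak = dirichlet(L_cloak, ext, {6: 0.0})
vis = {i: float(np.clip(u_cloak[i] / u_ref[i], 0.05, 1.0))
       for i in ext if u_ref[i] > 0}
```

On this path graph `u_ref` decays linearly from the goal, while every exterior node's visibility sits at the 0.05 floor, which is exactly the "blinded until you penetrate the cloak" regime described above.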

II. Agent Architecture

Each agent wraps a two-tier LLM setup: an expensive reasoning model (GPT-4.1-mini) handles decisions, prior compression, and convention proposals, while a cheap utility model (Gemini 2.0 Flash) handles context summarization, evidence extraction, and question formulation. Every call goes through retry logic with exponential backoff.

Context Management

Agents carry a rolling buffer of $(\text{observation}, \text{action}, \text{reasoning})$ entries. Token usage is estimated as $|\text{text}|/4$. When utilization exceeds 80% of the $C = 750$ token budget, the oldest half of the buffer is compressed by the utility model into a 2–3 sentence summary emphasizing transferable heuristics. This creates a recency gradient: recent experience stays detailed, while older experience is progressively abstracted.
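The budget logic reduces to a few lines. This is a minimal sketch with a stubbed summarizer standing in for the Gemini call; all names are illustrative:

```python
CAPACITY = 750          # token budget C
THRESHOLD = 0.8         # compress once utilization exceeds 80%

def est_tokens(text: str) -> int:
    """Crude |text| / 4 token estimate."""
    return len(text) // 4

def summarize(entries: list[str]) -> str:
    """Stand-in for the utility-model summarization call."""
    return "SUMMARY: " + " | ".join(e[:20] for e in entries)

def maybe_compress(buffer: list[str]) -> list[str]:
    """Replace the oldest half of the buffer with a summary when over budget."""
    used = sum(est_tokens(e) for e in buffer)
    if used <= THRESHOLD * CAPACITY:
        return buffer
    half = len(buffer) // 2
    return [summarize(buffer[:half])] + buffer[half:]
```

Repeated compression is what produces the recency gradient: each pass folds older summaries into still-coarser summaries while fresh entries stay verbatim.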

Bayesian Belief Tracking

When belief tracking is enabled, the agent maintains a categorical posterior $\mathbf{b} \in \Delta^{k-1}$ over door identities. At each step, the utility model extracts evidence tuples $(d_j, \rho_j)$ from observed signals. The multiplicative update rule is

$$b_i \leftarrow \begin{cases} b_i \cdot (1 + \rho_j) & \text{if } i = j \;\text{(supported door)} \\ b_i \cdot (1 - 0.3\,\rho_j) & \text{otherwise} \end{cases}$$

followed by renormalization $\mathbf{b} \leftarrow \mathbf{b} / \|\mathbf{b}\|_1$. Entropy, the MAP door, and belief on the true goal are tracked as per-step trajectories and injected into the agent's system prompt.
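The update rule above can be sketched directly, assuming evidence strength $\rho_j \in [0, 1]$ (the function name is ours):

```python
import numpy as np

def update_belief(b: np.ndarray, j: int, rho: float) -> np.ndarray:
    """Boost door j by (1 + rho), damp the rest by (1 - 0.3 rho), renormalize."""
    b = b.copy()
    mask = np.arange(len(b)) == j
    b[mask] *= 1 + rho
    b[~mask] *= 1 - 0.3 * rho
    return b / b.sum()
```

Because the damping factor $(1 - 0.3\rho_j)$ is gentler than the boost, a single strong signal shifts but does not collapse the posterior; agreement across several signals is what drives the belief toward one door.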

III. Reproduction & Fertility

At reproduction, the reasoning model compresses the agent's last 12 context entries into a prior of at most 150 words covering signal reliability, navigation strategy, and doors to avoid. The child inherits this prior via its system prompt and spawns at the same start node with an empty buffer, so performance differences come from inherited knowledge.

Three reproduction triggers are available: periodic (every $\tau$ interactions), on-success (upon finding the goal), and novelty-based (when the agent's own experience has changed enough to be worth distilling). The novelty trigger computes the Jaccard distance between the older and recent halves of the context buffer:

$$J_{\text{novelty}} = 1 - \frac{|W_{\text{old}} \cap W_{\text{new}}|}{|W_{\text{old}} \cup W_{\text{new}}|}$$

where $W_{\text{old}}$ and $W_{\text{new}}$ are the word sets from the first and second halves of the context. Reproduction fires when $J_{\text{novelty}} \geq \theta$ (default 0.7). The intuition: high novelty means the agent has accumulated genuinely new information worth passing on.

We define fertility as the mean number of reproductive events per agent under a given strategy. The central question of Experiment I is whether content-aware triggers (novelty) outperform fixed schedules (every $\tau$), especially when transferring to unseen environments.

IV. Results

Main Evaluation (250 trials)

The grandchild phenomenon. On the hardest graph instance (shortest path = 9 hops), the no-prior baseline exhausted the 300-step budget without finding the goal. The oracle with full state information took 98 steps, cycling between spatially close but topologically distant nodes. A third-generation agent solved it in 5 steps because its inherited prior encoded experiential route knowledge ("Nodes 6, 20, 26, 32 form a reliable left-side corridor") that shortcut the topology in ways raw coordinates couldn't. The result replicated across 5 additional high-difficulty seeds.
| Condition | Success | Median Steps | Mean Steps |
|---|---|---|---|
| Prior inheritance | 93% | 12.0 | 31.9 |
| Oracle (full state) | 97% | 8.0 | 16.2 |
| No prior (blank) | 78% | 34.5 | 45.1 |
| Random prior | 84% | 28.0 | 41.3 |
| Random walk | 81% | 71.0 | 81.6 |

Fertility Ablation

Fixed-interval strategies overfit: every-30 achieves the lowest step count on the training graph (18.7) but degrades by +180% on the harder graph, while novelty θ = 0.7 degrades by only +37%. The novelty prior is environment-robust because its trigger fires on information content rather than on a schedule.

| Condition | Steps | Births | Steps (hard) | Δ% |
|---|---|---|---|---|
| every 3 | 43.0 | 11.3 | | |
| every 7 | 22.7 | 5.0 | | |
| every 15 | 21.3 | 4.0 | 48.7 | +129% |
| every 30 | 18.7 | 4.3 | 52.3 | +180% |
| success only | 24.0 | 4.0 | | |
| novelty θ=0.7 | 28.0 | 4.0 | 38.4 | +37% |

Emergent Conventions

Agents spontaneously developed stable naming conventions across generations. In one lineage, gen-0 wrote full sentences, gen-1 compressed to imperative rules, and gen-2 produced terse shorthand ("Red arched door, lower-left. Trust red, ignore yellow."). Linguistic drift increased with environment complexity: parent-child Jaccard similarity dropped from 0.338 on small graphs to 0.246 on large ones, consistent with cultural transmission theory (Henrich 2015). In one trial, a gen-2 child autonomously overrode its parent's incorrect prior ("URGENT: inherited target is WRONG. DISCARD."), showing iterated learning dynamics in which compression bottlenecks naturally filter inaccurate information (Kirby et al. 2008).

V. Experiment Suite

| Exp | Question | Conditions | Trials |
|---|---|---|---|
| A | Do priors help at all? | inherited vs. no-prior | 250 |
| C | Do agents invent stable shorthand? | small / medium / large graphs | 250 |
| E | Does a shared skill library beat individual inheritance? | no-lib, prior-only, prior+library | 250 |
| I | What is the optimal reproduction frequency? | fixed intervals, success-only, novelty thresholds | 250 |
| H | Can agents beat mathematically-hidden goals? | cloaked / uncloaked / cross-transfer | future |

VI. Stack

| Component | Tool |
|---|---|
| LLM access | langchain-dartmouth |
| Reasoning model | GPT-4.1-mini |
| Utility model | Gemini 2.0 Flash |
| Graph math | NumPy, SciPy sparse |
| Cloaking | graph Laplacian, Schur complement |
| Dependency management | uv |
| Output | JSON + text transcripts |