
A user opens the File menu. Save is not where it was yesterday. Five seconds of scanning later, they find it at position 7, near Export. The application reorganized the menu to "optimize for recent usage patterns." The user is now slower and annoyed. This is the central tension in adaptive user interfaces, and one reason they remain rare in practice. Personalization can improve efficiency, but reorganization destroys the spatial memory users depend on. Findlater & McGrenere (2004) showed that static menus frequently outperform adaptive ones because relearning costs exceed efficiency gains.
Existing evaluation methods are either inaccessible (model-based reinforcement learning requires formal HCI modeling expertise) or slow (human studies involve recruitment, compensation, and weeks of setup). We explored a third option: prompt LLM agents to embody users in specific cognitive states (fresh, fatigued, distracted) and have them predict how a human would respond to a menu that just rearranged itself. We then validated those predictions against actual humans.
I. Adaptive Menu System
We implemented a web-based menu interface with 11 items drawn from standard application menus (File, Edit, Document operations), organized into semantic groups that participants could learn during training. Adaptation occurs after every 20 selections (one block), which simulates session-based policies common in real deployments.
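The block-based trigger can be sketched in a few lines. This is an illustrative reconstruction: `AdaptiveMenu`, `record_selection`, and `adapt` are assumed names, not the study's actual code.

```python
# Illustrative sketch of a session-based adaptation trigger; class and
# method names are assumptions, not the study's implementation.
BLOCK_SIZE = 20  # selections per block, matching the described policy

class AdaptiveMenu:
    def __init__(self, items):
        self.items = list(items)
        self.selections = 0
        self.adaptations = 0

    def record_selection(self, item):
        """Log a selection; reorganize once per completed block."""
        self.selections += 1
        if self.selections % BLOCK_SIZE == 0:
            self.adapt()

    def adapt(self):
        # Placeholder: the clustering pipeline described below runs here.
        self.adaptations += 1

menu = AdaptiveMenu(["New", "Open", "Save"])
for _ in range(40):
    menu.record_selection("Save")
print(menu.adaptations)  # two completed blocks -> two adaptations
```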
Three Radicality Levels
| Level | Description |
|---|---|
| Mild | Visual emphasis only (10% font size increase, bold weight on frequent items), at most 2 position changes, groups preserved exactly. Example: Save moves from position 3 to 4. |
| Moderate | 3–5 position changes allowed. Groups can reorder entirely, but items stay within their original group. Example: the Edit Operations group moves above File Operations; Save shifts from position 3 to 5 within its group. |
| Radical | Unlimited position changes. Groups are fully restructured by the clustering algorithm. Items cross semantic boundaries. Example: a new "Frequent" group is created from items pulled from three different original groups. |
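The three levels can be checked programmatically. In this hedged sketch, the function names and the greater-than-5 cutoff for "radical" are illustrative assumptions derived from the table above.

```python
# Sketch: classify a proposed layout against the three radicality levels.
# Thresholds mirror the table: <=2 moves mild, 3-5 moderate, beyond that
# (or any group-boundary crossing) radical.
def count_position_changes(old, new):
    """Number of items whose absolute menu position changed."""
    return sum(1 for i, item in enumerate(old) if new.index(item) != i)

def radicality(old, new, old_groups, new_groups):
    """Return 'mild', 'moderate', or 'radical' for a reorganization."""
    moves = count_position_changes(old, new)
    crossed = any(old_groups[item] != new_groups[item] for item in old)
    if crossed or moves > 5:
        return "radical"
    if moves > 2:
        return "moderate"
    return "mild"

old = ["New", "Open", "Save", "Undo", "Redo"]
groups = {"New": "File", "Open": "File", "Save": "File",
          "Undo": "Edit", "Redo": "Edit"}

# Swapping two neighbors stays mild; crossing a group boundary is radical.
print(radicality(old, ["New", "Save", "Open", "Undo", "Redo"], groups, groups))
```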
Semantic Clustering
Menu reorganization comes from a multi-signal clustering algorithm that combines three complementary inputs into a weighted distance matrix, then applies k-means:
Temporal co-occurrence. From usage logs, we build a co-occurrence matrix $C$ whose entry $C_{ij}$ counts how often items $i$ and $j$ are selected within the same session window. This captures the behavioral signal: items used together should stay together.
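A minimal sketch of the co-occurrence signal, assuming usage logs are lists of item-label sequences, one per session window (the data shape is an assumption):

```python
# Build the co-occurrence counts: C[(a, b)] is the number of session
# windows in which both a and b were selected.
import itertools
from collections import Counter

def cooccurrence(sessions):
    counts = Counter()
    for window in sessions:
        # Count each unordered pair once per window.
        for a, b in itertools.combinations(sorted(set(window)), 2):
            counts[(a, b)] += 1
    return counts

logs = [["Save", "Save As", "Export"],
        ["Save", "Export"],
        ["Undo", "Redo"]]
C = cooccurrence(logs)
print(C[("Export", "Save")])  # Save and Export co-occur in 2 windows
```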
Word embeddings. Semantic similarity between item labels via cosine distance on embeddings. This catches relationships that usage data sometimes misses (Save and Save As are semantically close even if one is rarely used).
Designer constraints. Hard-coded associations from domain knowledge (e.g., Undo/Redo must stay grouped). These override data-driven signals where expert knowledge is definitive.
The combined distance matrix is a weighted sum of the three signals:

$$D = w_1 D_{\text{temporal}} + w_2 D_{\text{semantic}} + w_3 D_{\text{designer}}$$
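The combination step is straightforward to sketch, assuming each signal has already been converted to a symmetric distance matrix over the same item ordering (the weights below are illustrative, not the tuned values):

```python
# Combine three distance matrices into one weighted matrix for clustering.
# Weights w1-w3 are illustrative placeholders.
import numpy as np

def combined_distance(d_temporal, d_semantic, d_designer,
                      w1=0.5, w2=0.3, w3=0.2):
    D = w1 * d_temporal + w2 * d_semantic + w3 * d_designer
    return (D + D.T) / 2  # enforce symmetry before clustering

n = 3
d_t = np.ones((n, n)) - np.eye(n)  # every pair temporally distant
d_s = np.zeros((n, n))             # semantically identical
d_c = np.zeros((n, n))             # no designer constraints
D = combined_distance(d_t, d_s, d_c)
print(D[0, 1])  # 0.5: only the temporal signal contributes here
```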
An MCTS-based optimizer then searches over possible menu layouts, constrained by the radicality level, to find reorganizations that balance predicted efficiency gains against disruption cost.
II. LLM Agent Architecture
Each agent wraps GPT-4 with a `CognitiveState` dataclass that modulates attention, memory, strategy selection, and workload perception. The key modeling decisions come from Bailly et al. (2014) and Cockburn et al. (2007).
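A hedged sketch of the `CognitiveState` dataclass named above; the field names and the three preset value sets are illustrative reconstructions, not the actual parameters.

```python
# Illustrative reconstruction of the CognitiveState dataclass; fields and
# preset values are assumptions consistent with the states described below.
from dataclasses import dataclass

@dataclass
class CognitiveState:
    attention: float           # 0-1, stability of visual attention
    memory_strength: float     # 0-1, spatial-memory encoding quality
    reading_multiplier: float  # >1 slows per-item reading (fatigue)

FRESH = CognitiveState(attention=0.9, memory_strength=0.9,
                       reading_multiplier=1.0)
FATIGUED = CognitiveState(attention=0.6, memory_strength=0.5,
                          reading_multiplier=1.4)
DISTRACTED = CognitiveState(attention=0.4, memory_strength=0.3,
                            reading_multiplier=1.2)
```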
Cognitive States
| State | Description |
|---|---|
| Fresh | Stable attention, strong spatial memory, tendency toward foraging search. Represents best-case interaction. |
| Fatigued | Reduced attention with decay over time, weaker memory encoding, heavy reliance on serial (top-to-bottom) search. Reading time increases by a fatigue multiplier. |
| Distracted | Fragmented attention, rapid forgetting between searches, needs larger/more obvious targets. Highest variance in performance. |
Search Strategies
Following Bailly et al.'s models of menu search, each agent selects among four strategies:
Serial search. Linear top-to-bottom scanning. Selection time grows roughly linearly with the target's position $i$:

$$T_{\text{serial}}(i) = \sum_{j=1}^{i} \frac{t_{\text{inspect}}}{A_j}$$

where $t_{\text{inspect}}$ is the cautious inspection cost and $A_j$ is the ACT-R activation for item $j$, which increases with expertise (well-learned items are inspected faster).
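A numeric sketch of this serial model, assuming per-item inspection time is the inspection cost divided by that item's ACT-R activation (all parameter values are illustrative):

```python
# Sum per-item inspection costs up to the target's position; higher
# activation means faster inspection of that item.
def serial_time(position, activations, t_inspect=0.25):
    """Expected time (s) to reach the item at 1-indexed `position`."""
    return sum(t_inspect / activations[j] for j in range(position))

# With uniform activation, time grows linearly with position.
novice = [1.0] * 10
print(round(serial_time(4, novice), 2))  # 1.0
```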
Foraging search. The agent identifies the semantic group likely to contain the target, scans to that group header, then searches within the group. Faster than serial when groupings are stable, catastrophic when groups are restructured.
Recall search. Direct access based on spatial memory. Near-instant when the item hasn't moved; triggers a full rescan or strategy switch when it has. This is where adaptation inflicts its sharpest cost.
Random search. Worst-case baseline: items are checked in arbitrary order.
Strategy selection is probabilistic, weighted by cognitive state. Fresh agents use recall 85% of the time in mild conditions but collapse to serial 65% of the time under radical changes. Distracted agents default to random search 57% of the time in radical conditions.
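The sampling step can be sketched directly from those figures. The quoted probabilities (85%, 65%, 57%) come from the text; how the remaining probability mass is split across the other strategies is an assumption here.

```python
# State- and radicality-conditioned strategy sampling. The leading weight
# in each row is quoted in the text; the rest of each distribution is an
# illustrative assumption.
import random

WEIGHTS = {
    ("fresh", "mild"):
        {"recall": 0.85, "foraging": 0.10, "serial": 0.04, "random": 0.01},
    ("fresh", "radical"):
        {"serial": 0.65, "foraging": 0.20, "recall": 0.10, "random": 0.05},
    ("distracted", "radical"):
        {"random": 0.57, "serial": 0.30, "foraging": 0.08, "recall": 0.05},
}

def pick_strategy(state, radicality, rng=random):
    w = WEIGHTS[(state, radicality)]
    return rng.choices(list(w), weights=list(w.values()), k=1)[0]

random.seed(0)
print(pick_strategy("fresh", "mild"))
```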
NASA-TLX Modeling
Each agent also predicts subjective workload (mental demand, effort, frustration) calibrated against the six NASA-TLX dimensions. This gives a proxy for the qualitative experience that pure timing data misses.
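One simple way to aggregate the agent's predictions is the unweighted "raw TLX" score, the average of the six dimension ratings. Using raw TLX here (rather than the pairwise-weighted original procedure) is an assumption for illustration.

```python
# Raw-TLX aggregation: unweighted mean over the six dimensions, each
# rated 0-100. Dimension names follow the standard NASA-TLX instrument.
TLX_DIMS = ["mental", "physical", "temporal",
            "performance", "effort", "frustration"]

def raw_tlx(ratings):
    """Average the six 0-100 dimension ratings into one workload score."""
    assert set(ratings) == set(TLX_DIMS)
    return sum(ratings.values()) / len(TLX_DIMS)

print(raw_tlx({d: 50 for d in TLX_DIMS}))  # 50.0
```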
III. Evaluation
We evaluated the approach in two stages: agent simulation at scale, then a human study to validate the agents' predictions.
Agent Simulation (2,700 trials)
We ran 9 conditions (3 radicality levels × 3 cognitive states), 100 trials each, 3 task types per trial. Tasks covered item location, multi-step workflows, and immediate recall after adaptation. Each trial: 20 tasks on the baseline menu (Block 1), adaptation, 20 tasks on the new layout (Block 2), 20 more tasks (Block 3).
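The trial grid is easy to enumerate as a sanity check on the 2,700 total (condition and state labels are taken from the text):

```python
# Enumerate the full simulation grid: 3 radicality levels x 3 cognitive
# states x 100 trials x 3 task types = 2,700.
import itertools

LEVELS = ["mild", "moderate", "radical"]
STATES = ["fresh", "fatigued", "distracted"]

conditions = list(itertools.product(LEVELS, STATES))
total = len(conditions) * 100 * 3  # trials per condition x task types
print(len(conditions), total)  # 9 2700
```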
Human Study (720 trials)
12 Dartmouth students (6F, 6M, ages 19–22), between-subjects design with 4 participants per radicality condition. Sessions averaged 22 minutes. Adaptation occurred unannounced between Block 1 and Block 2. We collected task times, errors, click positions, NASA-TLX scores, 13-item subjective ratings, and semi-structured interview transcripts coded with Cohen's κ for inter-rater agreement.
IV. Results
Performance Impact
The spatial memory cliff. Humans can update a few position changes through active maintenance in working memory (moderate condition). When changes exceed this capacity (radical condition), users must rebuild their entire mental model from scratch. 8 of 12 participants reported spatial memory violations; all 4 radical participants used language like "weird," "infuriating," and "lost."
The radicality threshold is sharp. Mild adaptations (at most 2 position changes) caused minimal disruption; participants recovered within about 12 trials. Moderate adaptations (3–5 changes) required several dozen trials for partial recovery. Radical adaptations (unrestricted changes with regrouping) doubled task time and never converged to baseline within our 60-trial window. Learning curve models predicted over 100 trials for radical users to approach full recovery.
| Condition | Block 2 Slowdown | Recovery by Block 3 | Acceptance |
|---|---|---|---|
| Mild | +14.6% | 92% | 100% |
| Moderate | +44.3% | 72% | 50% |
| Radical | +105.6% | 55% | 0% |
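The 100-plus-trial estimate comes from extrapolating a fitted learning curve. A common choice is a power law, $T(n) = a \cdot n^{-b}$; the functional form and every parameter value in this sketch are illustrative assumptions, not the fitted values.

```python
# Extrapolate a power-law learning curve T(n) = a * n^(-b) and find the
# first trial at which predicted time falls within 5% of baseline.
def trials_to_recover(a, b, baseline, tolerance=0.05):
    """Smallest n with a * n**-b <= baseline * (1 + tolerance)."""
    n = 1
    while a * n ** (-b) > baseline * (1 + tolerance) and n < 10_000:
        n += 1
    return n

# A large initial slowdown with a shallow learning rate (small b) pushes
# recovery well beyond a 60-trial window.
print(trials_to_recover(a=6.0, b=0.15, baseline=3.0))
```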
Agent–Human Alignment
Agent predictions fell within one standard deviation of human performance for mild and moderate conditions. For radical adaptations, agents captured the direction and rough magnitude of performance degradation but systematically underestimated disruption. Three failure modes:
Emotional responses. Agents predicted performance costs but not frustration, trust breakdown, or anxiety about future changes. All of these appeared in interviews.
Abandonment behavior. One participant (P11) expressed desire to stop using the menu entirely and switch to keyboard shortcuts. Agents always complete tasks within the menu.
Individual differences. One radical participant explicitly linked difficulty to ADHD. Agents model aggregate behavior; they don't capture individual variation.
Cognitive State Effects
Agent simulations showed that fatigued and distracted agents experienced steeper performance drops than fresh agents across all radicality levels. Distracted agents nearly doubled their completion time from mild to radical. Although we didn't manipulate cognitive state in the human study, pilot observations and interview data were consistent with this: the same participant who used efficient foraging strategies at 9am relied on serial scanning at 4pm after a long day.
V. Design Guidelines
From the combined agent and human data, we derived conservative recommendations for adaptive interface designers:
| Category | Guideline |
|---|---|
| Safe zone | Visual emphasis up to 15%, at most 3 position changes, reorder groups but preserve internal structure, limit major changes to roughly once per week. |
| Danger zone | Crossing semantic boundaries without permission, reorganizing more than 30% of items at once, silent or repeated changes. |
| Required controls | Adaptation should be opt-in, visually indicated, and reversible. Users should be able to override any grouping and adjust adaptation intensity. |