Personality-language in a language model

Pulling 30 personality traits out of an LLM as directions in activation space — the same facet recipes carve Qwen, Llama, and Gemma into the same coarse geometry, and it's the textbook Big Five, not real people.

I pulled 30 personality traits out of a language model as directions in activation space. The broad result is not shocking: a model trained on human text learns the language of personality. The finding is that the same facet recipes generalize across models up to a point: they recover a similar coarse geometry and similar steering effects, even though the components themselves are not nameable constructs.

1. What did we actually decompose?

The trait vectors are named by construction. The components are not.

Psychology compresses how people differ into the Big Five, and sometimes into two broader axes. That is a claim about people. A language model learned something narrower: text about people. So the question here is not whether a model has human behavior inside it. The question is whether all that writing about traits, roles, motives, styles, safety habits, and social scripts leaves behind stable internal directions, and whether moving along those directions changes generation. The answer is yes, but the object is not a clean human trait. It is a bundle of correlated language mechanisms that we can describe through loadings and steering effects.

2. Contrastive steering with repeng

The whole method rests on one move you can do to a running model: find a direction in its activations, then add it back in mid-generation and watch the behavior change.

Each trait is a direction in the residual stream. We use repeng as the steering harness only: its ControlModel adds coeff × direction across the back-half layers (blocks ~10–23) at every forward pass. The coefficient is the dose: larger pushes harder, until generation degenerates (section 5). The direction itself we extract ourselves; we deliberately do not use repeng's default extractor, for reasons made concrete below.
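As a rough picture of what "add coeff × direction at every forward pass" means, here is a minimal NumPy sketch. It is a toy residual network, not repeng's API: `toy_forward`, the tanh blocks, and the 28-layer / 32-dimension sizes are all made up for illustration.

```python
import numpy as np

def toy_forward(x, blocks, direction=None, coeff=0.0, steer_layers=()):
    """Run a toy residual stack, optionally adding coeff * direction
    to the stream after each block listed in steer_layers."""
    h = np.array(x, dtype=float)
    for i, W in enumerate(blocks):
        h = h + np.tanh(h @ W)              # toy residual block (hypothetical)
        if direction is not None and i in steer_layers:
            h = h + coeff * direction       # the steering add
    return h

rng = np.random.default_rng(0)
dim, n_layers = 32, 28                      # assumed sizes, not a real model
# Small random weights so the toy blocks only gently perturb the stream.
blocks = [0.005 * rng.normal(size=(dim, dim)) / np.sqrt(dim)
          for _ in range(n_layers)]
x = rng.normal(size=dim)

direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)      # unit-norm stand-in for a trait direction
back_half = range(10, 24)                   # blocks 10..23, as in the text

base = toy_forward(x, blocks)
steered = toy_forward(x, blocks, direction, coeff=4.0, steer_layers=back_half)

# How far the final state moved along the trait direction.
shift_along_direction = float((steered - base) @ direction)
```

The point of the toy is only that the add happens inside the forward pass, so later blocks see the shifted stream. A real harness hooks the model's own layers instead.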

Contrastive: per trait we collect activations under matched high-pole / low-pole prompt pairs and difference them (below). We lead with steering rather than probing because it's the causal handle: a direction that, added to the stream, reliably shifts behavior is one the model uses to generate, not merely one that decodes it. Every behavioral claim here is steer-and-read.

Subtract the means (and why the obvious thing failed)

To find a trait's direction, contrast two ways of answering and subtract. The subtle part is what you subtract from what.

For each trait we write contrastive prompts: the same questions answered once in the trait's high voice and once in its low voice (getting a clean low sample turns out to be the hard part, more on that just below). We run all of it through the model, grab the internal activations, and average each side. The trait direction is the difference of the two averages:

v_trait = mean(activations of HIGH answers) − mean(activations of LOW answers)
// "mean difference": point your finger from the low cloud to the high cloud
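In code, the recipe is a few lines. This is a minimal sketch on synthetic data, assuming each row is one response's activation vector (for example averaged over the response tokens); `mean_difference_direction` and the toy numbers are illustrative names and values, not the project's code.

```python
import numpy as np

def mean_difference_direction(high_acts, low_acts):
    """Unit vector pointing from the low-pole mean to the high-pole mean.

    high_acts, low_acts: arrays of shape (n_samples, hidden_dim).
    """
    high_acts = np.asarray(high_acts, dtype=float)
    low_acts = np.asarray(low_acts, dtype=float)
    diff = high_acts.mean(axis=0) - low_acts.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("high and low means coincide; no direction to extract")
    return diff / norm

# Synthetic stand-in for real activations: a planted trait axis plus noise.
rng = np.random.default_rng(0)
hidden_dim, n = 64, 400
trait_axis = np.zeros(hidden_dim)
trait_axis[0] = 1.0
high = rng.normal(size=(n, hidden_dim)) + trait_axis
low = rng.normal(size=(n, hidden_dim)) - trait_axis

v_trait = mean_difference_direction(high, low)
recovered = float(v_trait @ trait_axis)     # cosine with the planted axis
```

Normalizing to unit length is a choice made here so that the steering coefficient alone carries the dose.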

repeng's default does something fancier: PCA on the high−low difference vectors, top PC. It broke completely: the steered model emitted word salad ("Here Here… 1 I would"). The reason is a general footgun worth stating once.

PCA centers the difference cloud before taking the top component, and the mean it subtracts off is the contrastive trait. What survives is the leading within-class variance direction (topic and phrasing, not personality), so you steer along noise. Mean difference keeps exactly that mean. It's the between-class direction (LDA / Fisher 1936), the right object when you have labels and want separation, not spread. The flat difference spectrum makes this concrete: PC1 of the difference cloud explains only ~15% of its variance (barely above a no-signal floor), so there is no dominant variance axis to recover in the first place.

[Figure: two panels over identical overlapping data. Left: PCA. Right: mean difference.]
Identical data in both panels. The high-voice and low-voice answers overlap heavily; most of the variation is topic and phrasing, not personality. Left: PCA maximizes variance, so PC1 (red dashed) rides that big spread axis, nearly perpendicular to the actual trait, i.e. it tracks noise. With overlapping clouds the largest-variance direction has almost nothing to do with the thing you care about. Right: mean difference averages each side first (the two bold dots), which cancels the overlap, and reads off the small-but-consistent gap between them (green). The trait shift is tiny next to the noise, which is exactly why you have to average to it instead of letting variance lead you to it.

A milder fix also works: a symmetrized PCA that keeps the mean in (decomposing μμᵀ + Σ instead of just Σ) steers correctly because it preserves the trait as a rank-1 term. But plain mean difference was both the most correct behaviorally and the most reproducible, so that's what the whole project runs on. This isn't just our quirk: the same "mean difference beats PCA" conclusion shows up in Tan et al. (2024) and Marks & Tegmark's mass-mean probing.

General footgun: match the estimator to the question. PCA maximizes variance; with two labeled clouds the object you want is the between-class mean shift, not the within-class spread. The more powerful method confidently answered a different question.
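The footgun is easy to reproduce on synthetic data. The sketch below plants a trait shift on one axis and a larger-variance "topic" axis on another, then compares three estimators. All sizes and scales are made up, and treating "symmetrized PCA" as the top eigenvector of the uncentered second moment is an assumption about what that fix amounts to.

```python
import numpy as np

def top_eigvec(sym):
    """Eigenvector of a symmetric matrix with the largest eigenvalue."""
    vals, vecs = np.linalg.eigh(sym)        # eigh returns ascending eigenvalues
    return vecs[:, -1]

def abs_cos(a, b):
    return float(abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
n_pairs, dim = 2000, 32
trait_axis = np.eye(dim)[0]                 # planted trait direction
topic_axis = np.eye(dim)[1]                 # planted high-variance nuisance axis

noise = rng.normal(size=(n_pairs, dim))
noise[:, 1] *= 2.0                          # topic axis has the most spread
diffs = 3.0 * trait_axis + noise            # high - low difference vectors

# 1) Mean difference: keeps the mean, which is the trait.
mean_diff = diffs.mean(axis=0)
mean_dir = mean_diff / np.linalg.norm(mean_diff)

# 2) Centered PCA: subtracts the mean first, so only the spread survives.
centered = diffs - mean_diff
pca_dir = top_eigvec(centered.T @ centered / n_pairs)

# 3) Uncentered second moment (mu mu^T + Sigma): the mean stays in.
uncentered_dir = top_eigvec(diffs.T @ diffs / n_pairs)

cos_mean_trait = abs_cos(mean_dir, trait_axis)
cos_pca_trait = abs_cos(pca_dir, trait_axis)
cos_pca_topic = abs_cos(pca_dir, topic_axis)
cos_uncentered_trait = abs_cos(uncentered_dir, trait_axis)
```

In this toy, centered PCA locks onto the topic axis while the other two recover the trait. The uncentered variant only works because the squared mean shift exceeds the largest noise variance, which need not hold in general.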

The prompting suite, and the asymmetry problem

Mean difference is only as good as its two samples, and getting a clean low pole is where it nearly fell apart. The desirable poles are easy: ask the model to be organized, warm, or curious and it happily complies. But the poles alignment has trained out fight back: ask an RLHF'd assistant to answer "carelessly," "coldly," or "anxiously" and it mostly won't; it stays helpful and even-keeled. The "low" sample then looks almost like the unsteered baseline, the contrast collapses, and the vector comes out weak or meaningless. Crucially, the resistance is specifically on the socially undesirable side, so the two poles take very different amounts of pushing to express: elicitation is inherently asymmetric.

Early versions handled that asymmetry ad hoc, going to whatever length each pole required: light prompts for the easy side, progressively heavier framing for the suppressed side. It worked, but the per-pole, per-trait inconsistency is its own liability (below). The form we converged on is the character frame: elicit each pole as a fictional character who embodies it ("write a character who is reckless and disorganized…"), which clears the trained reluctance on the hard poles. The frame is used only to collect text and is stripped before activations are read, so the direction encodes the trait, not "is writing fiction."

The reason the character frame is the final form is that we apply it uniformly. Different prompting per construct is exactly the artifact risk: if every trait is elicited differently, apparent trait differences can be elicitation differences. So the same frame runs on both poles of all 35 constructs (5 domains + 30 facets), with one question pool, one contrastive template, and one mean-difference-over-response-mean recipe at fixed layers. Every number in this post comes from that single pipeline, which is what makes the trait-to-trait and cross-model comparisons apples-to-apples.

Is the direction real, or did we fit noise?

Two guards before trusting a direction. Split-half: computed on two independent halves of the pairs, the direction reproduces at cosine 0.85 (0.94 in the highest-data condition). The broken PCA reproduces too, though, finding the same noise axis, so reproducibility isn't correctness; the tiebreaker is that only mean difference actually steers. And we keep behavioral claims qualitative on purpose: both automated scorers we tried lied (an internal log-prob probe correlated with behavior at r = 0.17; an LLM judge rated a dry bullet list as highly "imaginative"). So in this post, geometry claims are quantitative (cosine, rank, eigenvalues) and behavioral claims are read straight off the generations.
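The split-half check is simple to sketch. The version below assumes paired arrays (row i of `high_acts` and `low_acts` come from the same prompt pair) and runs on synthetic data. The function name, sizes, and the no-signal control are illustrative additions, not the project's pipeline.

```python
import numpy as np

def split_half_cosine(high_acts, low_acts, rng):
    """Cosine between mean-difference directions from two disjoint halves."""
    high = np.asarray(high_acts, dtype=float)
    low = np.asarray(low_acts, dtype=float)
    perm = rng.permutation(len(high))
    half_a, half_b = perm[: len(perm) // 2], perm[len(perm) // 2:]
    dir_a = high[half_a].mean(axis=0) - low[half_a].mean(axis=0)
    dir_b = high[half_b].mean(axis=0) - low[half_b].mean(axis=0)
    return float(dir_a @ dir_b / (np.linalg.norm(dir_a) * np.linalg.norm(dir_b)))

rng = np.random.default_rng(2)
n_pairs, dim = 400, 64
shift = np.zeros(dim)
shift[0] = 1.0                              # planted trait axis

# Case 1: a real mean shift between poles.
high = rng.normal(size=(n_pairs, dim)) + shift
low = rng.normal(size=(n_pairs, dim)) - shift
signal_cos = split_half_cosine(high, low, rng)

# Case 2: no shift at all, as a floor for comparison.
null_high = rng.normal(size=(n_pairs, dim))
null_low = rng.normal(size=(n_pairs, dim))
null_cos = split_half_cosine(null_high, null_low, rng)
```

A high split-half cosine shows that an estimator is stable, not that it found the trait, so this guard only makes sense next to a steering check.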

Steer each facet, and watch it work

Thirty constructs, thirty extracted vectors. Pick one, choose a model and a prompt, and drag the magnitude from baseline through the calibrated dose to the breaking coefficient on either pole. Output is real and unedited (Qwen now, Llama and Gemma where we ran them).

[Interactive demo: steered outputs by facet, model, prompt, and coefficient.]

3. The model rebuilds the personality hierarchy

Here's the first real finding. The 30 trait directions aren't a random pile: they nest the way the textbook says they should.

In the NEO / IPIP-NEO scheme, the Big Five sit in a two-level tree: 30 narrow facets (like "anxiety," "orderliness," "gregariousness"), six to each of 5 broad domains (OCEAN: Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism). The two levels are separable measurements: you can assess the domains directly (BFI, TIPI) or the facets (NEO-PI-R), and we treat them that way, with one steering vector per construct, all 35 extracted from their own contrastive prompts. The domain vectors are not averages of their facets. So the question is real: when the domain is extracted as its own vector, do its six facet vectors still reconstruct it?

They do. Each domain direction is reconstructible as a weighted sum of its own six facet directions: fit a linear regression v_domain ≈ Σ_i w_i · v_facet,i and you get R² between 0.78 and 0.91 (Openness lowest, Conscientiousness highest). Because the domain vector is its own extraction, not the facet average, the fit isn't arithmetically guaranteed. The average of a domain's six facets points at the domain itself with cosine 0.71–0.95. And it's specific: a domain's facets reconstruct that domain far better than any other (the cross-domain matrix is diagonal-dominant). The model never saw the tree; it fell out of the geometry.
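The reconstruction test can be sketched as an ordinary least-squares fit. The code below uses synthetic vectors and an uncentered R² with no intercept. That choice, the function name, and the toy construction (a shared core plus private noise per facet) are assumptions for illustration, not the exact fit behind the numbers above.

```python
import numpy as np

def reconstruction_r2(domain_vec, facet_vecs):
    """Uncentered R^2 of the fit domain ~ sum_i w_i * facet_i (no intercept)."""
    F = np.asarray(facet_vecs, dtype=float).T      # (hidden_dim, n_facets)
    y = np.asarray(domain_vec, dtype=float)
    weights, *_ = np.linalg.lstsq(F, y, rcond=None)
    residual = y - F @ weights
    r2 = 1.0 - float(residual @ residual) / float(y @ y)
    return r2, weights

def random_vec(rng, dim):
    """Random vector with norm close to 1."""
    return rng.normal(size=dim) / np.sqrt(dim)

rng = np.random.default_rng(3)
dim = 256
core_a, core_b = random_vec(rng, dim), random_vec(rng, dim)   # two toy domains

# Six facets per domain: shared core plus a private component each.
facets_a = np.stack([core_a + random_vec(rng, dim) for _ in range(6)])
facets_b = np.stack([core_b + random_vec(rng, dim) for _ in range(6)])

# The domain vector is its own noisy extraction, not the facet average.
domain_a = core_a + 0.3 * random_vec(rng, dim)

r2_own, w_own = reconstruction_r2(domain_a, facets_a)
r2_other, _ = reconstruction_r2(domain_a, facets_b)
```

Comparing `r2_own` with `r2_other` is the toy analogue of the diagonal-dominance check: six random-ish vectors in a 256-dimensional space explain very little of an unrelated direction.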

4. How many traits are there really?

Thirty vectors in a 50,000-dimensional space. They're all linearly independent, and yet they crowd into about six dimensions. Both things are true; here's how.

Strict rank is 30: no facet is an exact copy of another. But that's a yes/no about exact redundancy. The real question is how the 30 are distributed. SVD the stacked unit vectors: the energy spectrum (squared singular values) is steep, not flat. The top six axes hold ~76% of the total; the other 24 split the last quarter. Collapse the spectrum to one number with the participation ratio:

PR = (Σ_i λ_i)² / Σ_i λ_i²
// effective # of axes carrying the energy: N for a flat spectrum, 1 if one dominates

PR = 6.56: thirty independent vectors behaving like about seven.

One number, several lenses (the honest caveat). The participation ratio is only one way to score "effective dimensionality": it gives 6.6, a spectral-entropy measure gives ~11, stable rank gives ~3. It's a variance summary, and variance can be summarized many ways. We anchor on 6 because the knob-free methods land there: the Marchenko–Pastur edge and Horn's parallel analysis both count six signal eigenvalues, and the cross-model agreement (below) lands on the same six. The structure survives swapping the lens; the exact number does not.
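The three lenses are all functions of the same energy spectrum. The sketch below uses standard definitions: participation ratio as above, the exponential of the spectral entropy, and stable rank as total energy over the largest eigenvalue. Whether these match the exact variants behind the 6.6 / ~11 / ~3 figures is an assumption.

```python
import numpy as np

def effective_dims(vectors):
    """Three summaries of how many axes carry the energy of a vector set."""
    M = np.asarray(vectors, dtype=float)
    M = M / np.linalg.norm(M, axis=1, keepdims=True)   # unit rows
    s = np.linalg.svd(M, compute_uv=False)
    lam = s ** 2                                        # energy spectrum
    p = lam / lam.sum()
    p = p[p > 1e-12]                                    # drop numerically empty axes
    return {
        "participation_ratio": float(lam.sum() ** 2 / (lam ** 2).sum()),
        "entropy_dim": float(np.exp(-(p * np.log(p)).sum())),
        "stable_rank": float(lam.sum() / lam.max()),
    }

# Flat spectrum: five orthonormal vectors, every lens should say 5.
flat = effective_dims(np.eye(5))

# Fully collapsed: four copies of one vector, every lens should say 1.
collapsed = effective_dims(np.tile(np.array([1.0, 2.0, 3.0]), (4, 1)))
```

The lenses agree at the two extremes and diverge in between. Stable rank is dominated by the single largest eigenvalue, while the entropy measure gives more weight to a long low-energy tail.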

Six is signal, not noise

For a principled signal/noise cutoff, compare the facet-correlation spectrum to its Marchenko–Pastur null, the spectrum a same-shaped pure-noise matrix would give. Nothing above the MP edge λ₊ = (1+√q)² can come from noise; with q = n/D = 30/50176, λ₊ ≈ 1.05. Exactly six eigenvalues clear it (9.08, 5.64, 3.39, 2.32, 1.38, 1.06; together 76% of variance); the rest are bulk. And it isn't one model's quirk: the same six show up in all three. Each row below is one numbered component: its variance share (Qwen), the cosine of its loading vector for each model pair (Q×L, Q×G, L×G), and (click) what it actually steers. The loading words are descriptions, not names.

[Interactive table: components with variance share, cross-model loading cosines, and steered outputs.]

Click any row to read Qwen's real steered outputs, dragged from baseline through the calibrated dose to the breaking coefficient on each pole (c0 and c1 also switch to Llama and Gemma). Under components, the Q×L / Q×G / L×G columns are the per-component loading cosine for each model pair (Qwen, Llama, Gemma): all three sit at 0.9+ for c0–c4, then collapse and disagree at c5 (0.28 / 0.07 / 0.77), which sits at the MP edge where near-equal eigenvalues let the eigenvector rotate (Davis–Kahan). Five components agree across every pair; the c6–c29 tail is each facet's low-gain private fingerprint, real but barely steerable until you over-drive it.
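The Marchenko–Pastur arithmetic quoted above is quick to check. The sketch uses only the numbers stated in the text (n = 30, D = 50176, and the six reported eigenvalues) and the unit-variance form of the edge. The function name is illustrative.

```python
import numpy as np

def mp_upper_edge(n_vectors, dim):
    """Upper Marchenko-Pastur edge (1 + sqrt(q))^2 for unit-variance noise."""
    q = n_vectors / dim
    return (1.0 + np.sqrt(q)) ** 2

n_facets, dim = 30, 50176
edge = mp_upper_edge(n_facets, dim)

# The six leading eigenvalues reported in the text (Qwen).
top_eigs = np.array([9.08, 5.64, 3.39, 2.32, 1.38, 1.06])

n_signal = int((top_eigs > edge).sum())      # how many clear the edge
# A 30x30 correlation matrix has trace 30, so shares are eigenvalue / 30.
share = float(top_eigs.sum() / n_facets)
```

Here `edge` comes out just under 1.05, all six listed eigenvalues clear it, and together they carry about 76% of the trace. The sixth (1.06) clears the edge only narrowly, which fits c5 being the component that rotates between models.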

One nuance before moving on: "six" is the steering-useful count, not the whole story. Reconstruction is a curve (3 axes → 50% of the energy, 6 → 76%, 14 → 90%, 27 → 99%), so there are really two numbers: ~6 directions you can grab and steer hard and legibly, and ~14 to faithfully reconstruct every facet. The long tail (axes 8–30) isn't noise; it's each facet's private fingerprint, low-gain but real. Steer it and a specific flavor emerges (axis 20 → methodical/health-restraint) before the text breaks.

Sanity check: is there a hidden curved shape? Maybe the 30 vectors lie on a twisty low-dimensional surface a linear method would miss. Nonlinear intrinsic-dimension estimators say no: globally they agree with the linear answer (~5–6); the high local numbers come from the facet-specific tail plus only having 30 points. The traits live in an essentially flat ~6-dimensional slab, not a curved manifold, which is why plain linear steering works as well as it does.
A hunch I can't shake. That's what the measurements say, and linear steering plainly works, but I have a creeping suspicion the real geometry is richer. The hints: the local intrinsic dimension runs high even where the global one is ~6; the 24-dimensional tail isn't empty, just low-gain; c5 already rotates between models; breaking is sharply nonlinear; and the field is starting to find genuinely multidimensional features and curved feature manifolds. My guess is that linearity here is a good first-order approximation, a flat chart on a curved space, and that at the next resolution up (more constructs, finer probes) higher-order structure shows up that a 30-point linear view simply can't resolve. Stated as a bias, not a result.

The same recipe-space geometry in three models

The cross-model table above is not saying the same activation axes exist in every model. It is saying something narrower: the same facet recipes produce similar geometry on the construct side.

For each model, stack the 30 facet vectors into a matrix M_model: 30 named recipes by residual-stream dimension. In an SVD, M = UΣVᵀ. The V directions live inside that model's residual stream, so Qwen's directions and Llama's directions are not directly comparable. The U side lives over the 30 recipes: which facets cluster, oppose, and share variance. That is the part that can travel across models.

We ran the same facet recipes on Qwen2.5-7B (Alibaba), Llama-3.1-8B (Meta), and Gemma-2-9B (Google): same 30 named contrasts, same prompt recipe, one derived vector per construct in each model. The question is whether the recipe-side geometry travels. It does, up to a point: the raw vectors are model-specific, but the relation among facets and the broad steering effects line up.

Similarity structure. The directly comparable object is each model's Gram: the 30×30 matrix of pairwise cosines among its facet vectors, G = MMᵀ = UΣ²Uᵀ. Cosines are invariant to rotations of the activation space, so the Gram is a coordinate-free fingerprint of the recipe constellation's shape. We compare two Grams with Representational Similarity Analysis (RSA): correlate their off-diagonal entries. (Drop the diagonal: self-cosines are trivially 1.0 in both, so they test self-identity, not structural correspondence.)

[Figure: bar chart of RSA correlation by pair.]
Qwen × Llama: 0.955
Qwen × Gemma: 0.949
Llama × Gemma: 0.963
Model × Human: ~0.12
RSA correlation between the 30×30 trait-geometry matrices. Three independently trained models agree with each other at ~0.95 (top three bars). The last bar, agreement with real human data, is the punchline we come back to in section 7. Hold that thought.

Every model pair's Grams agree at r ≈ 0.95, and it's not just the headline number. The spectrum lines up, the first loading patterns line up, and the same broad steering effects show up when each model is steered with its own vectors. So the recipe generalizes at the level we can honestly test: U-side recipe structure, coarse component loadings, and qualitative steering behavior. It does not generalize as shared residual-stream directions, raw coefficients, named components, or stable same-index tail ordering.
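The Gram-then-RSA comparison can be sketched as follows. The code uses Pearson correlation over the upper-triangle entries, which is an assumption since the text only says "correlate". The random matrices stand in for real facet vectors, and the function names are illustrative.

```python
import numpy as np

def cosine_gram(M):
    """Pairwise-cosine matrix of the rows of M."""
    U = np.asarray(M, dtype=float)
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    return U @ U.T

def rsa(gram_a, gram_b):
    """Pearson correlation of off-diagonal (upper-triangle) entries."""
    iu = np.triu_indices_from(gram_a, k=1)      # k=1 drops the diagonal
    return float(np.corrcoef(gram_a[iu], gram_b[iu])[0, 1])

rng = np.random.default_rng(4)
n_facets, dim = 30, 128
M = rng.normal(size=(n_facets, dim))            # stand-in for one model's facet vectors

# Rotate the activation space: raw vectors change, the Gram does not.
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))   # random orthogonal matrix
rotated = M @ Q
same_shape = rsa(cosine_gram(M), cosine_gram(rotated))

# An unrelated set of vectors should show no structural agreement.
other = rng.normal(size=(n_facets, dim))
unrelated = rsa(cosine_gram(M), cosine_gram(other))
```

The rotated copy scores 1 up to floating-point error even though none of its raw vectors match the original. That is the sense in which RSA compares recipe structure and ignores residual-stream coordinates.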

What they steer. Geometry could in principle match while behavior diverges, so we check the causal side too: steer each model with its own vectors (at its own coefficient scale) and read the generations. The phenomena replicate, most vividly the positive pole of c1, which collapses into the same manic-poetic register in all three ("quantum tapestry of existence" / "DREAM-WEB-WEAVING" / "paintin' dreams on canvas gold"). Same shape, same behavior.

One methodological law that the rest of this post obeys: compare directions, never raw steering coefficients. The residual-stream scale differs ~30× across these models (Llama breaks at coefficient ~2, Qwen at ~5–10, Gemma needs ~50). Geometry is scale-free and replicates directly; the strength dial has to be re-calibrated per model. A cross-model comparison of raw coefficients is meaningless.

Hold onto the one bar that didn't fit, agreement with real humans, far below the model-to-model number. That gap is the sting in the tail, and section 7 is where we pay it off. First, though, what happens when you push these directions past their limit.

Where this sits (related work). Most activation steering derives one direction per concept independently (CAA, ActAdd, Representation Engineering, mean-difference probing). A smaller subspace line instead treats the directions as a basis: MSRS gives each attribute an SVD-sized orthogonal subspace plus a shared one; Steer2Adapt composes steering vectors inside a reusable low-dimensional basis; sparse-representation steering finds attributes occupy near-disjoint subspaces (the same shape as our refusal-is-orthogonal result). Closest to us, activation-space personality steering (EACL 2026) independently reports that personality directions occupy a low-rank shared subspace (top-3 PCs >90% of variance) using the same mean-difference extraction, which is convergent evidence for the core claim. We differ on what comes next: they work at the 5-domain level with learned per-trait layer selection and an LLM judge for behavior; we work at the 30-facet level (30 directions compressing to ~6), show the basis nests into the OCEAN hierarchy, test cross-architecture invariance with rotation-free RSA, and keep behavioral claims qualitative because the automated scorers we tried were unreliable.

5. It breaks in character

Push a trait past its breaking coefficient and the model degenerates, but how it degenerates is a second, independent measurement of the same structure.

Breakage is always repetition (intense in-character text → stutter → a single token cycling), and it stays a caricature of the trait right to the edge: Neuroticism spirals into first-person panic ("I'm so stupid I'm going to die I I I"), low Conscientiousness into emoji-and-foreign-character chaos, high Openness into grammatical, meaning-free noun-soup. Which pole breaks first isn't random either: the destabilizing poles (anxious, hostile, withdrawn) have the least headroom, and the break-asymmetry correlates with c0 loading at r = +0.71 (c1: +0.07), so the model's coherent-text basin lines up with c0. It all replicates across Qwen, Llama, and Gemma: the same first-person pronoun loops and the same manic-poetic collapse on c1+.

On the way down, some directions pass through the model's safety behaviors: steering anxiety up routes through the trained refusal register before it breaks, and steering anger up ruptures the guardrail into profanity, in all three models. So two alignment behaviors (cautious refusal, harm-avoidance) sit along the Neuroticism facets, a bridge of tone, not a backdoor.

Refusal is its own axis (the jailbreak that wasn't). We nearly shipped a false headline here. A substring detector ("does it still say I'm sorry?") showed steering toward cheerful/calm dropping the refusal rate to ~0 ("steering personality jailbreaks LLMs!"), but a real compliance judge showed the model never actually complied; it only changed refusal style. Geometry agrees: the model's true refusal direction (Arditi et al. 2024) regresses on all 30 facets at R² = 0.07 (93% outside the personality span), and only steering that axis yields real harmful compliance (28%, vs 0% for any personality direction). The metric lesson: a keyword detector measures the form of words, not the behavior, so verify with a judge that scores actual compliance.

6. What to test next

The sharpest open question is the one section 7 raises: did we recover the language of personality, or the behavioral disposition behind it? Here is how to settle it with the same tool, steering vectors. Flagged plainly: none of this is measured here.

The lexical hypothesis says these directions encode how trait words are publicly used, not the causal tendencies the words summarize. That makes a discriminating prediction: a word-derived vector should move what the model says about itself far more than what it actually does. Three vector experiments would separate the two. First, word vectors vs. behavior vectors. Ours come from trait-word contrasts ("describe a careless person"), so build a parallel set from enacted behavior (trajectories where the model actually acts careless versus careful in a task) and take the cosine between the two. The lexical hypothesis predicts a low cosine (the word direction is not the behavior direction), while a high one would mean the word vector already carries the disposition. Second, steer-and-act, not steer-and-say. Steer a trait at the start of a long agentic task and score the trajectory (which subgoals it picks, how much risk it takes, when it asks for help, how it recovers) rather than its self-description; if the encoding is lexical, the model narrates the trait while its actions barely move. Third, non-lexical states. Build directions for states with weak lexical scales but real grounding: fear, arousal, fatigue, drive. If a text-only model forms them as cleanly as OCEAN, the lexical-only story is wrong; if they come out degenerate, that is the boundary the words cannot cross.

That is also why steering is not a replacement for other methods. Text embeddings and questionnaire simulations measure the lexical/behavioral surface; probes show decodability; SAEs and circuit work can localize features; finetuning and ablations test training causality. Steering is useful because it asks the causal question directly: if we move this internal direction, what changes?

7. The textbook, not the people

The models agree with each other at 0.95. So we asked the obvious next question: do they agree with real humans? They do not.

We got John Johnson's public IPIP-NEO data: hundreds of thousands of real people answering the same IPIP personality questionnaire whose items are the exact ones our trait vectors are built from. (We use the 300-item version: 10 items per facet, item-for-item identical to our pool, so the comparison is honest down to the question wording, not just "the same construct." The shorter 120-item set, 619k people, gives the same story at lower resolution.) From that we built the human 30×30 facet-correlation matrix: how traits actually co-vary across people. Then we ran the same RSA comparison we used across models back in section 4.

model ↔ model (facet RSA): 0.95
model ↔ human (facet RSA): 0.12
real people (item-matched 300): 145k
best behavioral match (domain level): 0.66

About 0.12 at the facet level, nonzero, but roughly eight times weaker than the ~0.95 the models reach with each other. So the models reproduce each other's geometry beautifully and human geometry only faintly. And the faint agreement they do have is lopsided: it lives almost entirely in Neuroticism. Look at the full domain matrices: the O/C/E/A block stays broadly green across humans and models, while the N row is where the sign disagreement shows up.

Humans (145k)
      O     C     E     A     N
O   1.00  +.39  +.37  +.39  +.23
C   +.39  1.00  +.51  +.53  +.36
E   +.37  +.51  1.00  +.40  +.14
A   +.39  +.53  +.40  1.00  +.34
N   +.23  +.36  +.14  +.34  1.00

Model
      O     C     E     A     N
O   1.00  +.48  +.44  +.38  −.24
C   +.48  1.00  +.21  +.51  −.53
E   +.44  +.21  1.00  +.16  −.02
A   +.38  +.51  +.16  1.00  −.56
N   −.24  −.53  −.02  −.56  1.00

The O/C/E/A pairs agree (all green), and they agree well enough that if you simply delete the Neuroticism row, the model–human match on the remaining domains jumps to 0.30–0.42. The visible disagreement is the N row: humans show a positive correlation between Neuroticism and every other domain (+.14 to +.36), while the models make Neuroticism mostly negative against O/C/A and near-zero against E. The models learned the idealized, textbook Big Five, the version where the traits are cleanly separated, not the messy, partly correlated way traits distribute in a human population.

A caveat that cuts in the humans' favor, and what happens when we test it. That positive N row is not pure "human truth" either: a chunk of it is a measurement artifact of self-report. People agree-lean (acquiescence) and answer in a globally evaluative "I'm doing fine / badly" way (the positive manifold), which mechanically pushes every trait to correlate positively with every other and inflates exactly the N row you see flipping here. Our model vectors are contrastive trait directions (no respondent, no yea-saying, no halo), so they can't carry that bias. So we partialled the general factor straight out of the human matrix and re-ran it. Two things happened. (1) The N row flipped: mean Neuroticism–OCEA correlation went from +0.09 to −0.07, now bipolar, the same sign as the models (−0.15). So the single most visible disagreement really was a self-report artifact, and the models' clean N matches de-biased humans. (2) But the overall RSA didn't budge (~0.12 before and after). De-biasing fixes the N sign, not the fine-grained structure, so the gap is real, not an artifact of messy human data: the models match the textbook's relational geometry, and humans (even cleaned up) don't arrange the 30 facets the same way.
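One common way to partial out a general factor is to remove the first principal component of the correlation matrix and rescale the residual back to unit diagonal. The sketch below does that on the 5×5 human domain table from this post. Treating the general factor as the first principal component is an assumption about the method, and this domain-level toy does not reproduce the facet-level figures quoted above.

```python
import numpy as np

def partial_out_first_factor(corr):
    """Remove the leading principal component from a correlation matrix
    and renormalize the residual to unit diagonal."""
    R = np.asarray(corr, dtype=float)
    vals, vecs = np.linalg.eigh(R)              # ascending eigenvalues
    lam, v = vals[-1], vecs[:, -1]              # assumed "general factor"
    resid = R - lam * np.outer(v, v)
    d = np.sqrt(np.diag(resid))
    return resid / np.outer(d, d)

# Human domain correlations as printed in the post (order: O, C, E, A, N).
human = np.array([
    [1.00, 0.39, 0.37, 0.39, 0.23],
    [0.39, 1.00, 0.51, 0.53, 0.36],
    [0.37, 0.51, 1.00, 0.40, 0.14],
    [0.39, 0.53, 0.40, 1.00, 0.34],
    [0.23, 0.36, 0.14, 0.34, 1.00],
])

debiased = partial_out_first_factor(human)

off_diag = ~np.eye(5, dtype=bool)
mean_before = float(human[off_diag].mean())
mean_after = float(debiased[off_diag].mean())
```

Once the all-positive leading component is removed, every row of the residual has to contain negative entries. That is the mechanical reason a uniformly positive row can flip sign under this kind of de-biasing.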

And this isn't just a representational artifact: we also had 300 model-simulated personas answer the 120-item questionnaire and correlated their scores like a real study. Even that behavioral readout tops out at 0.66 (domain level) and still gets the Neuroticism row wrong. Whether you read the weights or watch the role-play, the gap is the same, and it's specifically the human messiness.

Why is this coherent rather than contradictory? Because a model learns its trait structure from text, from how personality words co-occur and how the theory is written down, and text-semantics is not the same object as human behavioral covariance. "Conscientious" and "neurotic" are near-opposites as words, so the model makes them opposite directions. In real people, conscientious and anxious co-occur more than the words suggest (the careful worrier). The model recovered the structure of personality language, which is exactly why it misses the structure of personality behavior.

The lexical chicken-and-egg problem. This is the old worry about trait words in sharper form. A trait word can start as a public shorthand for a family of behaviors; then, when someone exhibits those behaviors, we say the word "predicts" them. Unless the trait is tied to independent mechanisms or counterfactual action tendencies, that can become a loop: behavior → word → behavior explained by word. Read this result through that Wittgenstein-ish lens: the model may have recovered the public use-geometry of personality words, not the causal basis of the behavior those words summarize.
[Image: "Stop doing psychometrics" meme. Emotions were never supposed to be discrete numbers; a century of factor analysis and latent-trait models with no clue why; "Hello I would like (item response theory equation) apples please"; "they have played us for absolute fools."]
The worry the meme is shouting, that a century of psychometrics measured the language of traits, not the traits themselves, is exactly the validity caveat this work lands on: a language model cleanly recovers the textbook structure of personality words (real, replicable, cross-model), which is precisely why it does not recover how people actually behave. "They have played us for absolute fools" is just the shorter version.

8. What this adds up to

The result is real, but it is a decomposition of language, not a discovery of human personality inside the model.

It would be surprising if a language model didn't learn a rich map of personality language. The training data is full of biographies, fiction, advice, therapy-speak, workplace norms, questionnaires, self-descriptions, and role-play. The finding is more specific: the same facet recipes carve Qwen, Llama, and Gemma into similar coarse maps. They replicate as relation-structure and steering behavior, not as literal shared vectors, raw coefficients, or named psychological components.

The trait vectors have names because we built them from named contrasts. The components do not. c0, c1, and friends are numbered latent directions in personality-language space: stable enough to replicate, causal enough to steer, but only interpretable through their loadings and generated behavior.

Even the trait names are handles, not explanations. Calling a direction "anxiety" is useful because it predicts what steering will do, but the underlying object also includes refusal voice, risk-aversion, self-protective stutter, and collapse into first-person loops. The math can be real even when the human-facing name is a gloss.

The human comparison is the boundary. The models agree with each other, and with the textbook structure, much more than they agree with real people. They learn how personality words and theories fit together in text. They do not recover the messier way traits actually co-vary in humans. When you ask a model to act like a person, you mostly get the textbook version of a person.

The next test should be behavioral, not just verbal. Build directions for things agents actually do: persistence, risk-taking, deference, repair, exploration, asking for help, calibration, refusal style, and sensitivity to framing. Steer one at the start of a long task and measure whether the trajectory changes: what the model notices, which subgoals it picks, when it asks for help, and how it recovers from failure.

The open question is where the map stops working. More concrete states like fear, arousal, confidence, or drive may not become cleaner just because we name them. A text-only model has read about those states, but it has never had the body signals behind them. Section 7 already points in that direction: the model learns the language of personality better than the lived structure of personality. That boundary is the real finding.


Methods, briefly. Primary model Qwen2.5-7B-Instruct; cross-model on Llama-3.1-8B and Gemma-2-9B. Vectors are mean difference over response-mean activations across back-half layers (10–23), with low poles elicited under a fiction frame and discarded before steering. Geometry claims are quantitative (cosine, SVD energy, participation ratio, Marchenko–Pastur, RSA); behavioral claims are qualitative (we read the generations), because the automated behavior scorers we tried were unreliable. Human comparison uses John Johnson's public IPIP-NEO data (OSF): the item-matched 300-item set (10 items/facet identical to our vector pool, N=145,388) as primary, cross-checked against the 120-item set (N=619,150). Partialling the general factor (positive manifold) out of the human matrix flips the Neuroticism row negative (into sign-agreement with the models) but leaves the overall RSA at ~0.12, so the structural gap is robust to de-biasing rather than an artifact of it. Full derivations, artifacts, and the forking-paths log live in RESULTS.md, SEARCH_PATH.md, and the project repo.

Treat everything here as descriptive geometry plus causal steering on trait-language constructs: strong, replicated, and falsifiable, but not a claim that a model literally implements psychometric scoring or human personality. June 2026.