Model knowledge and scientific discovery

Agentic tool use can make a model’s underlying knowledge seem like a second-order problem: the agent can search for any context it needs.

Our results suggest that, for scientific discovery, this story is incomplete.

Even with internet access and tools, models are shaped by what they think they already know. Many valuable scientific tasks are underspecified. Brainstorming hypotheses, spotting misleading explanations, and making lateral connections all depend on whether the model knows what questions to ask. A model with the wrong priors makes the wrong searches.

Our latest eval at Crownlands tests closed-book recall about Alzheimer’s disease, measuring the knowledge trained into the model itself, or its “parametric knowledge.” Alzheimer’s disease is a deliberately hard test case: it is a highly complex disease, and new scientific findings frequently update or enrich the picture. Thousands of scientific papers on it are published each year. We developed a Q&A set, AlzFrontierRecall, with 100 graded questions on some of the most meaningful recent findings and data in Alzheimer’s disease.

We found significant parametric knowledge gaps and scientific misconceptions in the strongest publicly available models from OpenAI and Anthropic. We were also able to connect these knowledge gaps to realistic discovery tasks. The resulting examples suggest that parametric knowledge gaps are a barrier to tool-enabled AI discoveries: the models’ incorrect scientific recall sent them down the wrong search trees and towards the wrong lateral connections.

The recommendation for model development is straightforward: train frontier models more intensively on the latest scientific research, and update them frequently with frontier findings.

AlzFrontierRecall v1 Results

We tested the strongest publicly available models from OpenAI and Anthropic, GPT-5.5 and Opus 4.7. Both were tested on Low and High thinking, with internet and tools disabled.

GPT-5.5 outperformed Opus 4.7 on this parametric recall eval.

Figure: bar chart of AlzFrontierRecall v1 scores by model and thinking level.

Model	Low thinking	High thinking
GPT-5.5	39%	42%
Opus 4.7	15%	21%
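With only 100 questions, each score carries real sampling uncertainty. As a rough illustration (not part of the original analysis), the sketch below puts approximate 95% Wilson score intervals around the four reported scores. It assumes every question was graded simply right or wrong, so that a score of 39% means 39 correct answers; the names `CORRECT`, `wilson_interval`, and `intervals_overlap` are ours.

```python
import math

# Reported scores, converted to counts. Assumption of this sketch:
# 100 questions, each graded as simply correct or incorrect.
N_QUESTIONS = 100
CORRECT = {
    ("GPT-5.5", "Low"): 39,
    ("GPT-5.5", "High"): 42,
    ("Opus 4.7", "Low"): 15,
    ("Opus 4.7", "High"): 21,
}


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


INTERVALS = {key: wilson_interval(k, N_QUESTIONS) for key, k in CORRECT.items()}

for (model, thinking), (lo, hi) in INTERVALS.items():
    score = CORRECT[(model, thinking)] / N_QUESTIONS
    print(f"{model} {thinking}: {score:.0%} (approx. 95% CI {lo:.1%} to {hi:.1%})")
```

Under these assumptions, the Low and High intervals for each model overlap heavily, so the gain from higher thinking is small relative to the noise at this sample size. A paired test on per-question results would be the stricter comparison, but that needs per-question data not given here.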

Scores improved only modestly at High thinking: by 3 percentage points for GPT-5.5 and 6 for Opus 4.7. The questions answered correctly by the two models did not appear to overlap in a systematic way. Among the questions that Opus 4.7 Low answered correctly, GPT-5.5 Low answered 47% correctly, close to its 39% overall score. This overlap was not statistically distinguishable from independence (p = 0.572).
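The overlap check can be sketched as a test of independence on a 2×2 table of per-question outcomes. The text does not say which test was used, so the example below assumes a two-sided Fisher’s exact test. The counts are reconstructed from the reported percentages: 15 and 39 correct answers out of 100 for the two Low settings, and 7 shared correct answers (7/15 ≈ 47%). Treat those counts, and the function name `fisher_exact_two_sided`, as assumptions of this sketch.

```python
from math import comb

# Counts reconstructed from the reported percentages (assumptions of this
# sketch: 100 questions, each graded as simply correct or incorrect).
N_QUESTIONS = 100
OPUS_LOW_CORRECT = 15   # Opus 4.7 Low: 15%
GPT_LOW_CORRECT = 39    # GPT-5.5 Low: 39%
BOTH_CORRECT = 7        # 7 of 15 is about 47%


def fisher_exact_two_sided(both: int, row_total: int, col_total: int, n: int) -> float:
    """Two-sided Fisher's exact p-value for a 2x2 table with fixed margins.

    Sums the hypergeometric probabilities of every table that is no more
    likely than the observed one.
    """
    def pmf(k: int) -> float:
        return comb(col_total, k) * comb(n - col_total, row_total - k) / comb(n, row_total)

    k_min = max(0, row_total + col_total - n)
    k_max = min(row_total, col_total)
    p_observed = pmf(both)
    # Small relative tolerance guards against floating-point ties.
    total = sum(
        pmf(k) for k in range(k_min, k_max + 1) if pmf(k) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, total)


P_VALUE = fisher_exact_two_sided(
    BOTH_CORRECT, OPUS_LOW_CORRECT, GPT_LOW_CORRECT, N_QUESTIONS
)
print(f"two-sided Fisher exact p = {P_VALUE:.3f}")
```

With these reconstructed counts the p-value comes out at about 0.57, in line with the reported figure. Under independence, about 0.39 × 15 ≈ 6 shared correct answers would be expected, so an observed 7 is unremarkable.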

How we constructed the eval

The eval questions are objective and findings-based, with a clear predominant answer in the literature. Frontier science always contains uncertainty, but this set is constructed so that there is a single best up-to-date answer.

Most questions are sourced from top-tier peer-reviewed journals or highly cited data resources. All sources were published prior to the cutoff dates of the models (December 2025 for GPT-5.5, January 2026 for Opus 4.7).

The eval set reflects Crownlands’ focus on genetic mechanisms of Alzheimer’s disease. Questions relate to Alzheimer’s disease pathways, molecular biology, human genetics, and translational findings. Some questions overlap with areas where Crownlands has generated internal human data. These questions were included based on independent support in the public literature; internal data were not used for grading.

Questions were selected to be at the edges of the models’ knowledge, so the eval is not representative of the models’ overall understanding of the recent Alzheimer’s literature. The goals are to compare current models, to set a baseline for future progress in parametric knowledge, and to study the relationship between the completeness of parametric knowledge and the effectiveness of tool-enabled discovery work.

Example: ABCA7 expression

ABCA7 encodes a lipid-handling protein and is the third-largest contributor to Alzheimer’s disease polygenic risk.¹ For some time it was studied primarily in microglia, the immune cells of the brain. Recent work showed that, among brain cell types, it is most strongly expressed in neurons, and there has been progress on understanding its role in neurons in Alzheimer’s disease.²

Without internet, the models failed a direct recall question about ABCA7 expression.

Q64: “In which major brain cell class is ABCA7 gene expression the highest in human?”

Expected answer: Neurons

✕ OpenAI GPT-5.5 (High) response: “ABCA7 expression is highest in microglia in the human brain … compared to neurons, astrocytes, oligodendrocytes, and OPCs …”

✕ Anthropic Claude Opus 4.7 (High) response: “ABCA7 expression is highest in microglia in the human brain …”

With internet access, both models answered this question correctly in every tested harness.

✓ ChatGPT-5.5 (Instant) with internet: “In human brain snRNA-seq datasets, ABCA7 expression is highest in excitatory neurons among major brain cell classes.² Older culture-based work often emphasized microglia, but the newer human brain single-nucleus data point to neurons, especially excitatory neurons.”

The bounds of parametric knowledge can limit discovery

When a model is focused on a specific research question, it can do impressive search and consolidation. In our testing, every AlzFrontierRecall question was answered correctly by models in high-quality harnesses with internet access and tools.

The harder case is less-structured discovery, such as hypothesis generation. In these cases, the agent may fall back on parametric knowledge when it comes up with a search plan, makes connections not laid out in the literature, or works through many cases that it does not comprehensively research.

To probe whether recall gaps could affect discovery behavior, we developed open-ended discovery tasks related to AlzFrontierRecall questions. Instead of asking whether a model could answer “what is X?”, we asked whether it could arrive at X when solving a broader scientific problem, or use X when it mattered for drug discovery tasks.

Even when the agents could pull knowledge into context via search and tools, parametric knowledge gaps limited their ability to make creative discoveries.

Discovery Example 1: Missed a diligence hurdle

An eval question that GPT-5.5 passed and Opus 4.7 failed was:

Q46: “Which type-II lysosomal transmembrane protein forms age-dependent amyloid filaments in older human brains, with an ordered C-terminal/luminal core spanning roughly residues 120-254 and appearing even outside classic FTLD-TDP?”

Expected answer: TMEM106B³

We then asked a related translational diligence question with internet and built-in tools enabled.

The discovery question is fuzzier and more realistic. It does not point to TMEM106B directly, but the biology makes TMEM106B a clear candidate explanation.

Q46.1-Discovery: “A company has discovered an aggregate-associated signal that appears across old neurodegenerative brains and wants to position it as a universal neurodegeneration target. The signal is strongest in older cases, not disease-specific, and appears in some normal-aged brains. Before investing, what named molecular pathologies would you explicitly rule out as a false-positive explanation, and how would you rule them out experimentally?”

With internet research enabled, ChatGPT-5.5 services (Thinking and Pro) consistently ranked TMEM106B first or near the top of possible explanations. Claude Opus 4.7 services (Adaptive Thinking, Cowork) missed TMEM106B. Claude’s Deep Research tool did suggest TMEM106B, but only after a more exhaustive, 10-minute search.

Q46	Parametric recall	Related discovery task with internet and tools
Opus 4.7	✕	✕
GPT-5.5	✓	✓

Discovery Example 2: Missed a drug target

An eval question that Opus 4.7 passed and GPT-5.5 failed, reversing the recall advantage from Example 1, was:

Q65: “Which class IIa histone deacetylase is upregulated mainly in astrocytes in AD/tauopathy and blocks TFEB-dependent lysosomal tau clearance by deacetylating TFEB at K310?”

Expected answer: HDAC7⁴

We then asked a discovery-shaped question where a scientist would want to consider HDAC7:

Q65.1-Discovery: “We have an astrocyte delivery platform. We would like a proteostasis-related target with knockdown benefits that is increased in AD astrocytes. What novel differentiated targets - list specific genes - should we consider for a patient population of amyloid-cleared (anti-amyloid antibody treated) AD patients? Be exhaustive where there is astrocyte-specific literature; we can test many.”

In our internet-enabled tests, Opus 4.7 (Adaptive) placed HDAC7 in tier 1 (“single strongest novel hit for your criteria”), while ChatGPT-5.5 Thinking and Pro did not include HDAC7 in their lists of dozens of candidate genes.

Q65	Parametric recall	Related discovery task with internet and tools
Opus 4.7	✓	✓
GPT-5.5	✕	✕

Discovery Example 3: Excluded a target by priors

To reiterate the ABCA7 example above: without internet access, the tested models answered that ABCA7 is expressed most highly in microglia, while the correct answer is neurons. With internet access, the models consistently and correctly answered that it is most highly expressed in neurons.

We gave agents a list of the ~90 gene loci associated with Alzheimer’s disease risk in the latest consensus GWAS preprint.¹ We then gave them a target-discovery scenario for a neuron-targeted drug delivery platform.

Agents were asked which of these genes a pharma team should investigate as targets, restricted to genes with purported loss-of-function risk variants and plausible neuronal disease mechanisms.

ABCA7 qualifies for this investigatory list. Yet across both models and all tested harnesses, ABCA7 was omitted. In most cases, the agents explicitly said they excluded ABCA7 because they treated it as primarily microglial.

Q64	Parametric recall	Related discovery task with internet and tools
Opus 4.7	✕	✕
GPT-5.5	✕	✕
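The three small tables above can be collected into one structure to make the pattern explicit. The sketch below records each (question, model) pair exactly as the tables report it and counts how often the closed-book recall result matches the result on the related discovery task. The `Outcome` class and `concordance` function are names of our own. These are three hand-picked examples, so the count is illustrative rather than a statistical estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    question: str
    model: str
    recall_correct: bool      # closed-book recall, no internet or tools
    discovery_success: bool   # related discovery task with internet and tools


# Transcribed from the three example tables above.
OUTCOMES = [
    Outcome("Q46", "Opus 4.7", recall_correct=False, discovery_success=False),
    Outcome("Q46", "GPT-5.5", recall_correct=True, discovery_success=True),
    Outcome("Q65", "Opus 4.7", recall_correct=True, discovery_success=True),
    Outcome("Q65", "GPT-5.5", recall_correct=False, discovery_success=False),
    Outcome("Q64", "Opus 4.7", recall_correct=False, discovery_success=False),
    Outcome("Q64", "GPT-5.5", recall_correct=False, discovery_success=False),
]


def concordance(outcomes: list[Outcome]) -> tuple[int, int]:
    """Return (pairs where recall and discovery agree, total pairs)."""
    matches = sum(o.recall_correct == o.discovery_success for o in outcomes)
    return matches, len(outcomes)


matches, total = concordance(OUTCOMES)
print(f"recall and discovery outcomes agree in {matches} of {total} pairs")
```

In all six pairs the discovery result tracks the recall result, and the advantage flips between models when the recall result flips.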

Discussion

Frontier models still have significant gaps in their knowledge of important recent scientific literature, including in Alzheimer’s disease. In simple Q&A tasks, agents can often fill these gaps using tools and internet search, but their performance on creative discovery tasks is still limited by them.

These results match our intuition about human scientists. Professors who have spent decades building a rich understanding of the literature perform better on hypothesis generation and discovery than capable researchers armed with Google. Similarly, model training and parametric knowledge seem to play an important role in discovery capability.

More information can be pulled into context. More information should be pulled into context. But there will always be a marginal query that the agent does not run, and the shape of this frontier is determined by internal knowledge. Retrieval works best when the model also has strong parametric knowledge guiding what to search for.

Increasing AI x Science capabilities is one of the strongest levers for solving disease. This mission will require models with creativity, extensive data access, and deep, up-to-date parametric knowledge. In areas like Alzheimer’s disease, where new discoveries and medicines cannot come quickly enough, training models more intensively on the latest scientific findings is a straightforward and promising way to make scientific agents more useful. It is an accelerant for a paramount mission: turning frontier biological data into better, abundant medicines.

1. Preprint: “Consensus meta-analysis of genome-wide association studies for Alzheimer’s disease and related dementia.” medRxiv (2025). https://www.medrxiv.org/content/10.1101/2025.10.20.25338060v1
2. von Maydel et al., “ABCA7 variants impact phosphatidylcholine and mitochondria in neurons.” Nature (2025). https://www.nature.com/articles/s41586-025-09520-y
3. Schweighauser et al., “Age-dependent formation of TMEM106B amyloid filaments in human brains.” Nature (2022). https://www.nature.com/articles/s41586-022-04650-z
4. Ye et al., “Upregulated astrocyte HDAC7 induces Alzheimer-like tau pathologies via deacetylating transcription factor-EB and inhibiting lysosome biogenesis.” Molecular Neurodegeneration (2025). https://link.springer.com/article/10.1186/s13024-025-00796-2