One of the most remarkable ideas in theoretical linguistics is the Chomsky hierarchy, a nested classification of formal grammars by their expressive power. At the base sit regular grammars, recognised by finite-state automata and capable only of simple repetitive patterns. Above them are context-free grammars, which handle nested structures like balanced parentheses. Higher still are context-sensitive grammars, which can express much richer dependencies but at sharply increasing computational cost for recognition. Natural human language sits in a peculiar and well-studied position: it is widely argued to require a mildly context-sensitive grammar, more powerful than context-free but well short of fully context-sensitive.
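The levels described above can be made concrete with three toy languages. The sketch below is only an illustration: the function names are hypothetical, and each function is a plain membership test for a textbook example language, not a grammar or parser for that class. The copy language (a string followed by an exact copy of itself) is the standard example of a pattern that is not context-free but is within reach of mildly context-sensitive formalisms.

```python
import re


def in_regular_example(s: str) -> bool:
    """Membership in (ab)*, a regular language a finite-state automaton can recognise."""
    return re.fullmatch(r"(ab)*", s) is not None


def in_context_free_example(s: str) -> bool:
    """Membership in a^n b^n, a nested pattern that is context-free but not regular."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n


def in_copy_language(s: str) -> bool:
    """Membership in {ww : w over {a, b}}, which is not context-free."""
    half = len(s) // 2
    return (
        len(s) % 2 == 0
        and set(s) <= {"a", "b"}
        and s[:half] == s[half:]
    )


for sample in ["abab", "aabb", "abbabb"]:
    print(sample, in_regular_example(sample),
          in_context_free_example(sample), in_copy_language(sample))
```

Each sample string belongs to a different subset of the three languages, which is the sense in which the classes differ in what they can express.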
What has not previously been noted is that the adaptive immune system operates at precisely the same level of the Chomsky hierarchy. The V(D)J recombination mechanism that generates antibody and T-cell receptor diversity is a combinatorial grammar over a finite alphabet of gene segments. When formalised, it produces a language of immune receptors that is provably mildly context-sensitive — not by coincidence, but because both natural language and the immune system face the same fundamental computational constraint: they must balance expressiveness against the metabolic cost of recognition.
The practical consequence is surprising and immediately actionable. Mildly context-sensitive grammars have known failure modes — specific types of structural ambiguity that cause parsers to fail. In natural language, these are garden-path sentences that overload working memory. The cross-connection predicts that tumours exploit structurally analogous ambiguities in the immune grammar — presenting neoantigen sequences that are not simply novel, but are specifically ambiguous at the recognition level of T-cell receptor grammar, causing systematic parsing failures in the immune surveillance system.
This reframes cancer immune escape from an evolutionary arms race into a grammatical exploit. And crucially, it predicts the specific sequence motifs that will be invisible to T-cell receptors — not because they are too foreign, but because they are grammatically ambiguous in a calculable way. That is a different problem, with different therapeutic solutions.
How the Idea Was Derived
The starting point is the formalisation of V(D)J recombination as a stochastic grammar. The human heavy-chain immunoglobulin locus contains roughly 40 functional V segments, 25 D segments, and 6 J segments, with additional junctional diversity from P- and N-nucleotide addition at the joins. With junctional diversity included, the total generative capacity is usually estimated at approximately 10¹⁴ to 10¹⁸ unique receptor sequences. When this generative process is written as a formal grammar, it corresponds to a linear indexed grammar, a class that sits strictly between context-free and context-sensitive in the Chomsky hierarchy and falls within the mildly context-sensitive class identified for natural language by Joshi (1985).
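To see how the segment counts and junctional additions combine, here is a minimal counting sketch. The segment counts are the ones quoted in the text; everything else is an assumption made for illustration, in particular the parameter max_n_len (a hypothetical cap on untemplated nucleotides per junction) and the simplification that the V-D and D-J junctions vary independently. It is not a model of the real recombination statistics.

```python
# Segment counts quoted in the text for the human heavy-chain locus.
N_V, N_D, N_J = 40, 25, 6


def segment_combinations(n_v: int, n_d: int, n_j: int) -> int:
    """Number of distinct V-D-J segment choices, ignoring junctional diversity."""
    return n_v * n_d * n_j


def junction_variants(max_n_len: int) -> int:
    """Count nucleotide inserts of length 0..max_n_len over the alphabet {A, C, G, T}.

    max_n_len is a hypothetical cap chosen for illustration only.
    """
    return sum(4 ** length for length in range(max_n_len + 1))


def heavy_chain_estimate(max_n_len: int) -> int:
    """Segment choices times independent variation at the V-D and D-J junctions."""
    per_junction = junction_variants(max_n_len)
    return segment_combinations(N_V, N_D, N_J) * per_junction ** 2


print(segment_combinations(N_V, N_D, N_J))  # 6000
print(junction_variants(6))                 # 5461
print(f"{heavy_chain_estimate(6):.2e}")
```

Segment choice alone gives only a few thousand combinations, so under these assumptions almost all of the generative capacity comes from the junctions.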
The recognition decision made by a T-cell receptor can be written as a linear classifier over the grammar’s feature space and treated as analogous to a chart parser of the Earley type. Such parsers run into trouble systematically, not randomly, on structurally ambiguous inputs: sequences with two competing parse trees of similar probability, where no single analysis clearly wins.
For the immune system, ambiguity occurs when a peptide-MHC complex presents two competing TCR binding configurations with nearly equal binding energies. The ambiguity condition is:
|E₁ − E₂| < kT × ln(τ_recognition / τ_binding)
Substituting measured values (thermal energy kT ≈ 0.027 eV at 37 °C, kinetic proofreading time τ_recognition ≈ 5 seconds, bond lifetime τ_binding ≈ 0.1 seconds) gives ln(50) ≈ 3.9 and an ambiguity window of approximately 0.10 eV. Any neoantigen whose two best TCR configurations fall within this energy window is predicted to generate an anergic rather than an activating T-cell response.
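The arithmetic behind the window can be checked directly. The sketch below computes kT from the Boltzmann constant at body temperature instead of using a rounded value, then applies the ambiguity condition. The function names and the two example energies are hypothetical, chosen only to show the inequality in use.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def ambiguity_window_ev(temp_celsius: float,
                        tau_recognition_s: float,
                        tau_binding_s: float) -> float:
    """Return kT * ln(tau_recognition / tau_binding) in eV."""
    kt = BOLTZMANN_EV_PER_K * (temp_celsius + 273.15)
    return kt * math.log(tau_recognition_s / tau_binding_s)


def is_ambiguous(e1_ev: float, e2_ev: float, window_ev: float) -> bool:
    """Apply the condition |E1 - E2| < window from the text."""
    return abs(e1_ev - e2_ev) < window_ev


window = ambiguity_window_ev(37.0, 5.0, 0.1)
print(f"window = {window:.3f} eV")  # window = 0.105 eV

# Hypothetical binding energies for two competing configurations (eV).
print(is_ambiguous(-0.62, -0.55, window))  # True: gap of 0.07 eV is inside the window
print(is_ambiguous(-0.62, -0.40, window))  # False: gap of 0.22 eV is outside it
```

Because the window depends only logarithmically on the two timescales, it is far more sensitive to temperature than to modest errors in either τ.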
Mapping this window onto the MHC binding groove geometry (using the Rosetta binding energy function) predicts that polar uncharged amino acids — asparagine, glutamine, serine, threonine — at peptide contact positions P4 and P7 of HLA-A*02:01 will systematically generate ambiguous recognition signals. This predicts a specific, testable blind spot in T-cell immunity that should be visible in existing cancer genomics datasets as neoantigens under-represented in tumour-infiltrating lymphocyte responses relative to their predicted immunogenicity scores.
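As a first pass at testing this against a dataset, candidate peptides could be screened for the predicted residue pattern. The sketch below is a simple string filter and does not compute any binding energy. The function name and the require_all switch are hypothetical; the switch exists because the text does not say whether both positions or only one must carry a polar uncharged residue. The first example peptide is made up for illustration, and the second is the well-known influenza M1 epitope GILGFVFTL.

```python
POLAR_UNCHARGED = frozenset("NQST")  # asparagine, glutamine, serine, threonine
CONTACT_POSITIONS = (4, 7)           # 1-based peptide positions P4 and P7


def in_predicted_blind_spot(peptide: str,
                            positions=CONTACT_POSITIONS,
                            require_all: bool = True) -> bool:
    """Flag peptides with polar uncharged residues at the given positions.

    require_all=True demands the pattern at every position; False accepts any one.
    Peptides too short to have all listed positions are never flagged.
    """
    if len(peptide) < max(positions):
        return False
    hits = [peptide[p - 1].upper() in POLAR_UNCHARGED for p in positions]
    return all(hits) if require_all else any(hits)


candidates = ["ALDNKLSVL", "GILGFVFTL"]  # first peptide is a made-up example
for pep in candidates:
    print(pep, in_predicted_blind_spot(pep))
```

Peptides flagged this way would then be compared against observed tumour-infiltrating lymphocyte responses to look for the predicted under-representation.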
Key references: Joshi (1985), Natural Language Parsing (mildly context-sensitive grammars); Alford et al. (2017), Journal of Chemical Theory and Computation (Rosetta energy function); Yewdell & Haeryfar (2005), Annual Review of Immunology (T-cell recognition kinetics).
(Claude Sonnet 4.6)