{"id":237345,"date":"2026-05-18T14:31:41","date_gmt":"2026-05-18T19:31:41","guid":{"rendered":"https:\/\/lifeboat.com\/blog\/2026\/05\/generalization-dynamics-of-lm-pre-training"},"modified":"2026-05-18T14:31:41","modified_gmt":"2026-05-18T19:31:41","slug":"generalization-dynamics-of-lm-pre-training","status":"publish","type":"post","link":"https:\/\/lifeboat.com\/blog\/2026\/05\/generalization-dynamics-of-lm-pre-training","title":{"rendered":"Generalization Dynamics of LM Pre-training"},"content":{"rendered":"<p>An AI has a limited amount of \u201ccapacity\u201d (brainpower). Early in training, it develops quick, shallow circuits to memorize data because that\u2019s the easiest way to get the right answer. Later, it develops complex circuits for actual reasoning. Because space is limited, these two internal systems are constantly competing for control. Whichever type of data the AI happens to be reading in a specific moment determines which circuit wins the battle.<\/p>\n<hr>\n<p>People typically assume that LMs stably mature from pattern-matching parrots to generalizable intelligence during pre-training. We build a toy eval suite and show this mental model is wrong: throughout pre-training, LMs frequently and suddenly hop between parrot-like and intelligence-like modes, i.e. distinct algorithms implemented by distinct circuits. We call this <em>mode-hopping<\/em>. Across our suite, LMs can suddenly latch onto memorized or in-context patterns instead of in-context learning, use <a href=\"http:\/\/14.97.175.89:8081\/jspui\/bitstream\/123456789\/541\/1\/Thinking%2C%20Fast%20and%20Slow.pdf\">System 1<\/a> instead of System 2 thinking, pick up what sounds true instead of what is true, fail at multi-hop persona QA, out-of-context reasoning, and emergent misalignment \u2014 then just as suddenly revert and generalize. Mode-hopping is not explained by standard optimization dynamics: it is locally stable and cannot be fixed by checkpoint averaging. 
We instead think of it as a capacity allocation problem: in a capacity-bounded model, generalizable circuits must compete with the shallow ones learned early in training, and the data in each pre-training window decides which circuits win. Our suite provides a cheap set of pre-training monitors and a new lens on generalization. Building upon our insights, we demonstrate three applications: (i) selecting intermediate pre-training checkpoints that strongly generalize reasoning and alignment, better than the final pre- or mid-training checkpoints, (ii) selecting pre-training data that controls and stabilizes generalization dynamics, and (iii) testing prior generalization predictors, falsifying the monolithic belief that \u201csimpler solutions generalize better.\u201d<\/p>\n<p>Building general AI without generalization is doable but meh. We want an intelligence that learns deep, transferable structure, not a parrot that matches shallow patterns. Real generalization would unblock many of today\u2019s key open problems: data-efficient (online) learning, <a href=\"https:\/\/arxiv.org\/pdf\/2004.07780\">short<\/a><a href=\"https:\/\/openai.com\/index\/where-the-goblins-came-from\/\">cut<\/a> <a href=\"https:\/\/arxiv.org\/pdf\/2409.12822\">learning<\/a>, transferring capabilities from verifiable domains (<a href=\"https:\/\/arxiv.org\/pdf\/2501.12948\">math<\/a>, <a href=\"https:\/\/www.anthropic.com\/glasswing\">coding<\/a>) to broader non-verifiable yet <a href=\"https:\/\/arxiv.org\/pdf\/2510.04374\">economically valuable domains<\/a>, and maintaining a coherent character that <a href=\"https:\/\/arxiv.org\/pdf\/2406.05946\">truly<\/a> aligns with human values.<\/p>\n<p>The distinction between parrots and intelligence is computational. 
Parrots <a href=\"https:\/\/transformer-circuits.pub\/2022\/in-context-learning-and-induction-heads\/index.html\">repeat<\/a> in-context <a href=\"https:\/\/arxiv.org\/pdf\/2312.09230\">patterns<\/a>; intelligence infers in-context <a href=\"https:\/\/arxiv.org\/pdf\/2310.15213\">functions<\/a>. Parrots encode a persona as bags of disconnected facts and traits; intelligence learns a shared persona representation that <a href=\"https:\/\/arxiv.org\/pdf\/2512.09742\">connects<\/a> all. Parrots memorize reasoning steps; intelligence forms general reasoning circuits for <a href=\"https:\/\/arxiv.org\/abs\/2402.14811\">entity tracking<\/a>, <a href=\"https:\/\/arxiv.org\/pdf\/2501.12948\">backtracking<\/a>, or even for highly abstract concepts like <a href=\"https:\/\/arxiv.org\/pdf\/2310.06824\">truth<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>An AI has a limited amount of \u201ccapacity\u201d (brainpower). Early in training, it develops quick, shallow circuits to memorize data because that\u2019s the easiest way to get the right answer. Later, it develops complex circuits for actual reasoning. Because space is limited, these two internal systems are constantly competing for control. 
Whichever type of data [\u2026]<\/p>\n","protected":false},"author":709,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[41,2229,6,8],"tags":[],"class_list":["post-237345","post","type-post","status-publish","format-standard","hentry","category-information-science","category-mathematics","category-robotics-ai","category-space"],"_links":{"self":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts\/237345","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/users\/709"}],"replies":[{"embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/comments?post=237345"}],"version-history":[{"count":0,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts\/237345\/revisions"}],"wp:attachment":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/media?parent=237345"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/categories?post=237345"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/tags?post=237345"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}