{"id":192890,"date":"2024-07-13T09:24:00","date_gmt":"2024-07-13T14:24:00","guid":{"rendered":"https:\/\/lifeboat.com\/blog\/2024\/07\/learning-to-express-reward-prediction-error-like-dopaminergic-activity-requires-plastic-representations-of-time"},"modified":"2024-07-13T09:24:00","modified_gmt":"2024-07-13T14:24:00","slug":"learning-to-express-reward-prediction-error-like-dopaminergic-activity-requires-plastic-representations-of-time","status":"publish","type":"post","link":"https:\/\/lifeboat.com\/blog\/2024\/07\/learning-to-express-reward-prediction-error-like-dopaminergic-activity-requires-plastic-representations-of-time","title":{"rendered":"Learning to express reward prediction error-like dopaminergic activity requires plastic representations of time"},"content":{"rendered":"<p><a class=\"aligncenter blog-photo\" href=\"https:\/\/lifeboat.com\/blog.images\/learning-to-express-reward-prediction-error-like-dopaminergic-activity-requires-plastic-representations-of-time2.jpg\"><\/a><\/p>\n<p>One of the variables in TD algorithms is called reward prediction error (RPE), which is the difference between the sum of the actual reward and the discounted predicted reward at the next state, and the predicted reward at the current state. TD learning theory gained traction in neuroscience once it was demonstrated that firing patterns of dopaminergic neurons in the ventral tegmental area (VTA) during reinforcement learning resemble RPE<sup><a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 5\" title=\"Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. 
Science 275, 1593&ndash;1599 (1997).\" href=\"https:\/\/www.nature.com\/articles\/s41467-024-50205-3#ref-CR5\" id=\"ref-link-section-d100461977e534\">5<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 9\" title=\"A model of how the basal Ganglia generate and use neural signals that predict reinforcement. in Models of Information Processing in the Basal Ganglia (eds. Houk, J. C., Davis, J. L. & Beiser, D. G.) (The MIT Press, 1994). https:\/\/doi.org\/10.7551\/mitpress\/4708.003.0020.\" href=\"https:\/\/www.nature.com\/articles\/s41467-024-50205-3#ref-CR9\" id=\"ref-link-section-d100461977e537\">9<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 10\" title=\"Montague, P. R., Dayan, P. & Sejnowski, T. J. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J. Neurosci. 16, 1936&ndash;1947 (1996).\" href=\"https:\/\/www.nature.com\/articles\/s41467-024-50205-3#ref-CR10\" id=\"ref-link-section-d100461977e540\">10<\/a><\/sup>.<\/p>\n<p>Implementations of TD using computer algorithms are straightforward, but are more complex when they are mapped onto plausible neural machinery<sup><a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Sutton, R. S. Learning to predict by the methods of temporal differences. Mach. Learn. 3, 9&ndash;44 (1988).\" href=\"https:\/\/www.nature.com\/articles\/s41467-024-50205-3#ref-CR11\" id=\"ref-link-section-d100461977e548\">11<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Ludvig, E. A., Sutton, R. S. & Kehoe, E. J. Evaluating the TD model of classical conditioning. Learn. Behav. 
40, 305&ndash;319 (2012).\" href=\"https:\/\/www.nature.com\/articles\/s41467-024-50205-3#ref-CR12\" id=\"ref-link-section-d100461977e548_1\">12<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 13\" title=\"Namboodiri, V. M. K. How do real animals account for the passage of time during associative learning? Behav. Neurosci. 136, 383&ndash;391 (2022).\" href=\"https:\/\/www.nature.com\/articles\/s41467-024-50205-3#ref-CR13\" id=\"ref-link-section-d100461977e551\">13<\/a><\/sup>. Current implementations of neural TD assume a set of temporal basis functions<sup><a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 13\" title=\"Namboodiri, V. M. K. How do real animals account for the passage of time during associative learning? Behav. Neurosci. 136, 383&ndash;391 (2022).\" href=\"https:\/\/www.nature.com\/articles\/s41467-024-50205-3#ref-CR13\" id=\"ref-link-section-d100461977e555\">13<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 14\" title=\"Ludvig, E. A., Sutton, R. S. & Kehoe, E. J. Stimulus representation and the timing of reward-prediction errors in models of the Dopamine system. Neural Comput 20, 3034&ndash;3054 (2008).\" href=\"https:\/\/www.nature.com\/articles\/s41467-024-50205-3#ref-CR14\" id=\"ref-link-section-d100461977e558\">14<\/a><\/sup>, which are activated by external cues. 
For this assumption to hold, each possible external cue must activate a separate set of basis functions, and these basis functions must tile all possible learnable intervals between stimulus and reward.<\/p>\n<p>In this paper, we argue that these assumptions do not scale and are therefore implausible at a fundamental conceptual level, and demonstrate that some predictions of such algorithms are inconsistent with various established experimental results. Instead, we propose that the temporal basis functions used by the brain are themselves learned. We call this theoretical framework <b>F<\/b>lexibly <b>L<\/b>earned <b>E<\/b>rrors in E<b>x<\/b>pected Reward, or FLEX for short. We also propose a biophysically plausible implementation of FLEX as a proof-of-concept model. We show that key predictions of this model are consistent with actual experimental results but are inconsistent with some key predictions of TD theory.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of the variables in TD algorithms is called reward prediction error (RPE), which is the difference between the sum of the actual reward and the discounted predicted reward at the next state, and the predicted reward at the current state. 
TD learning theory gained traction in neuroscience once it was demonstrated that firing patterns of dopaminergic [\u2026]<\/p>\n","protected":false},"author":661,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1523,41,47],"tags":[],"class_list":["post-192890","post","type-post","status-publish","format-standard","hentry","category-computing","category-information-science","category-neuroscience"],"_links":{"self":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts\/192890","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/users\/661"}],"replies":[{"embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/comments?post=192890"}],"version-history":[{"count":0,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts\/192890\/revisions"}],"wp:attachment":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/media?parent=192890"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/categories?post=192890"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/tags?post=192890"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}