Reinforcement learning 2: action selection. Peter Dayan (thanks to Nathaniel Daw)
Slide 2 Global plan. Reinforcement learning I (Wednesday): prediction; classical conditioning; dopamine. Reinforcement learning II: dynamic programming; action selection; sequential sensory decisions; vigor; Pavlovian misbehavior. Chapter 9 of Theoretical Neuroscience
Slide 3 Learning and Inference. Learning: predict; control. ∆ weight ∝ (learning rate) x (error) x (stimulus). dopamine: phasic prediction error for future reward; serotonin: phasic prediction error for future punishment; acetylcholine: expected uncertainty boosts learning; norepinephrine: unexpected uncertainty boosts learning
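The weight update on this slide can be sketched as a delta rule. This is a minimal illustration, assuming a single stimulus and a linear prediction v = w·u; the function name, the learning rate, and the trial sequence are hypothetical choices, not taken from the slides.

```python
def delta_rule(rewards, stimuli, learning_rate=0.1, w=0.0):
    """Apply delta-w = (learning rate) x (error) x (stimulus), trial by trial."""
    history = []
    for r, u in zip(rewards, stimuli):
        prediction = w * u            # linear prediction from the stimulus
        error = r - prediction        # prediction error on this trial
        w += learning_rate * error * u
        history.append(w)
    return history

# Assumed schedule: stimulus always present (u = 1) and always rewarded (r = 1).
weights = delta_rule([1.0] * 100, [1.0] * 100)
```

Under this schedule the weight climbs toward the reward magnitude, with the step size shrinking as the error shrinks.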
Slide 4 Action Selection. Evolutionary specification. Immediate reinforcement: leg flexion; Thorndike puzzle box; pigeon, rat, human matching. Delayed reinforcement: these tasks; mazes; chess. Bandler; Blanchard
Slide 5 Immediate Reinforcement. stochastic policy: based on action values:
Slide 6 Indirect Actor. use RW rule: switch every 100 trials
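An indirect actor of this kind can be sketched as a softmax policy over action values that are learned with the Rescorla-Wagner (RW) rule. This is a minimal sketch, assuming a two-armed task whose reward probabilities swap every 100 trials; the probabilities, the inverse temperature `beta`, and the learning rate `epsilon` are illustrative assumptions.

```python
import math
import random

def softmax_choice(q, beta, rng):
    """Sample an action index with probability proportional to exp(beta * q)."""
    weights = [math.exp(beta * value) for value in q]
    threshold = rng.random() * sum(weights)
    cumulative = 0.0
    for action, weight in enumerate(weights):
        cumulative += weight
        if threshold < cumulative:
            return action
    return len(q) - 1

def run_indirect_actor(n_trials=400, beta=3.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    reward_probs = [0.8, 0.2]         # assumed values, swapped every 100 trials
    q = [0.0, 0.0]                    # action values
    choices = []
    for t in range(n_trials):
        if t > 0 and t % 100 == 0:
            reward_probs.reverse()
        a = softmax_choice(q, beta, rng)
        r = 1.0 if rng.random() < reward_probs[a] else 0.0
        q[a] += epsilon * (r - q[a])  # RW update of the chosen action's value
        choices.append(a)
    return choices, q

choices, q = run_indirect_actor()
```

Only the chosen action's value is updated, so the policy tracks a switch only after the old favorite has been sampled and found wanting.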
Slide 7 Direct Actor
Slide 8 Direct Actor
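A direct actor adjusts action propensities without first learning action values. The sketch below assumes one common form of the update, m_b ← m_b + ε(δ_ab − P[b])(r − r̄), with a softmax policy over two actions; the baseline `r_bar`, the reward probabilities, and the function names are assumptions made for illustration.

```python
import math
import random

def action_probs(m, beta=1.0):
    """Softmax over action propensities m."""
    peak = max(m)
    weights = [math.exp(beta * (x - peak)) for x in m]
    total = sum(weights)
    return [w / total for w in weights]

def direct_actor_step(m, reward_probs, r_bar, epsilon, rng):
    """One trial of a two-action direct actor; updates m in place."""
    p = action_probs(m)
    a = 0 if rng.random() < p[0] else 1
    r = 1.0 if rng.random() < reward_probs[a] else 0.0
    for b in range(len(m)):
        indicator = 1.0 if b == a else 0.0
        # Raise the chosen action's propensity if r beats the baseline,
        # and lower the other's by the matching amount.
        m[b] += epsilon * (indicator - p[b]) * (r - r_bar)
    return a, r

rng = random.Random(0)
m = [0.0, 0.0]
a, r = direct_actor_step(m, [0.8, 0.2], r_bar=0.5, epsilon=0.1, rng=rng)
```

Because the factors (δ_ab − P[b]) sum to zero over b, each update only redistributes propensity between the actions.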
Slide 9 Could we Tell? correlate past rewards, actions with present choice. indirect actor (separate clocks): direct actor (single clock):
Slide 10 Matching: Concurrent VI-VI. Lau, Glimcher, Corrado, Sugrue, Newsome
Slide 11 Matching. income not return; approximately exponential in r; alternation choice kernel
Slide 12 Action at a (Temporal) Distance. learning an appropriate action at u=1: depends on the actions at u=2 and u=3; gains no immediate feedback. idea: use prediction as surrogate feedback
Slide 13 Action Selection. start with policy: evaluate it: improve it: 0.025 - 0.175 - 0.125 thus choose L more frequently than R
Slide 14 Policy. value is too pessimistic; action is better than average
Slide 15 actor/critic. m1 m2 m3 mn. dopamine signals to both motivational & motor striatum appear, surprisingly, the same. suggestion: training both values & policies
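The idea of using a prediction as surrogate feedback can be sketched as an actor-critic in which a single temporal-difference error trains both the state values and the action propensities. This is a minimal sketch on a hypothetical two-level task: the reward values, learning rates, and all names are assumptions, chosen only so that the action at u=1 matters through what follows at u=2 and u=3.

```python
import math
import random

# Hypothetical task: from u=1, L leads to u=2 and R leads to u=3 (no reward);
# the action taken at u=2 or u=3 ends the trial with the assumed rewards below.
NEXT_STATE = {1: {"L": 2, "R": 3}}
REWARD = {2: {"L": 0.0, "R": 5.0}, 3: {"L": 2.0, "R": 0.0}}
ACTIONS = ("L", "R")

def policy(m_u, beta=1.0):
    """Softmax over the propensities of one state."""
    peak = max(m_u.values())
    weights = {a: math.exp(beta * (m_u[a] - peak)) for a in ACTIONS}
    total = sum(weights.values())
    return {a: weights[a] / total for a in ACTIONS}

def run_actor_critic(n_trials=2000, eps_critic=0.2, eps_actor=0.1, seed=0):
    rng = random.Random(seed)
    v = {1: 0.0, 2: 0.0, 3: 0.0}                        # critic: state values
    m = {u: {a: 0.0 for a in ACTIONS} for u in v}       # actor: propensities
    for _ in range(n_trials):
        u = 1
        while u is not None:
            p = policy(m[u])
            a = "L" if rng.random() < p["L"] else "R"
            if u == 1:
                r, u_next = 0.0, NEXT_STATE[1][a]
            else:
                r, u_next = REWARD[u][a], None
            v_next = v[u_next] if u_next is not None else 0.0
            delta = r + v_next - v[u]                   # TD error
            v[u] += eps_critic * delta                  # critic update
            for b in ACTIONS:                           # actor update, same delta
                indicator = 1.0 if b == a else 0.0
                m[u][b] += eps_actor * (indicator - p[b]) * delta
            u = u_next
    return v, m

v, m = run_actor_critic()
```

At u=1 no reward is ever delivered, so the only teaching signal for that choice is the difference between the learned values of the successor states and the current estimate for u=1.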
Slide 16 Variants: SARSA. Morris et al, 2006
Slide 17 Variants: Q learning. Roesch et al, 2007
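The two variants differ only in the target used for the temporal-difference update. The following sketch states the standard textbook forms; the function names, the discount `gamma`, and the learning rate `epsilon` are illustrative assumptions rather than anything specified on the slides.

```python
def sarsa_target(r, q_next, a_next, gamma=1.0):
    """On-policy target: value of the action actually taken in the next state."""
    return r + gamma * q_next[a_next]

def q_learning_target(r, q_next, gamma=1.0):
    """Off-policy target: value of the best action available in the next state."""
    return r + gamma * max(q_next.values())

def td_update(q_sa, target, epsilon=0.1):
    """Move the current estimate a fraction epsilon toward the target."""
    return q_sa + epsilon * (target - q_sa)

q_next = {"L": 1.0, "R": 3.0}
sarsa = sarsa_target(0.5, q_next, "L")      # uses Q(next state, L)
qlearn = q_learning_target(0.5, q_next)     # uses max over next actions
```

The two targets coincide whenever the next action taken is also the greedy one.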
Slide 18 Summary. prediction learning: Bellman evaluation. actor-critic: asynchronous policy iteration. indirect method (Q learning): asynchronous value iteration
Slide 19 Sensory Decisions as Optimal Stopping. consider listening to: decision: choose, or sample
Slide 20 Optimal Stopping. equivalent of state u=1 is: and states u=2,3 is:
Slide 21 Transition Probabilities
Slide 22 Evidence Accumulation. Gold & Shadlen, 2007
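The choose-or-sample decision can be illustrated with a sequential probability ratio test, one standard way of framing evidence accumulation as optimal stopping. This is a sketch under assumed conditions: two hypotheses with Gaussian samples of mean +mu or -mu, and symmetric thresholds set by a target error rate. None of these parameters come from the slides.

```python
import math
import random

def sprt(sample, mu=0.5, sigma=1.0, error_rate=0.05, max_samples=10_000):
    """Keep sampling until the log-likelihood ratio crosses a threshold.

    Returns the choice ("+" or "-") and the number of samples taken.
    """
    threshold = math.log((1 - error_rate) / error_rate)
    llr = 0.0
    for n in range(1, max_samples + 1):
        x = sample()
        # Log-likelihood ratio of N(+mu, sigma^2) against N(-mu, sigma^2).
        llr += 2.0 * mu * x / sigma ** 2
        if llr >= threshold:
            return "+", n
        if llr <= -threshold:
            return "-", n
    return ("+" if llr >= 0 else "-"), max_samples

rng = random.Random(0)
decision, n_samples = sprt(lambda: rng.gauss(0.5, 1.0))
```

Tightening `error_rate` raises the threshold, trading longer sampling for fewer wrong choices.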
Slide 23 Current Topics. Vigor & tonic dopamine. Priors over decision problems (LH). Pavlovian-instrumental interactions: impulsivity; behavioral inhibition; framing. Model-based, model-free and episodic control. Exploration vs exploitation. Game theoretic interactions (inequity aversion)
Slide 24 Vigor. Two components to choice: what: lever pressing; direction to run; meal to choose. when / how fast / how vigorous. free operant tasks; real-valued DP
Slide 25 The model. [Figure: states S0, S1, S2 along a time axis with latencies τ1, τ2; choose (action, τ) = (LP, τ1), then (action, τ) = (LP, τ2); actions LP, NP, Other; each step has Costs (vigor cost, unit cost) and Rewards (P_R, U_R)] how fast? goal
Slide 26 The model. [Figure: states S0, S1, S2 along a time axis with latencies τ1, τ2, Costs and Rewards; choose (action, τ) = (LP, τ1), then (action, τ) = (LP, τ2)] Goal: choose actions and latencies to maximize the average rate of return (rewards minus costs per time). ARL
Slide 27 Average Reward RL. Compute differential values of actions. ρ = average rewards minus costs, per unit time. Differential value of taking action L with latency τ when in state x: Q_{L,τ}(x) = Rewards - Costs + Future Returns. steady state behavior (not learning dynamics). (Extension of Schwartz 1993)
Slide 28 Average Reward Cost/benefit Tradeoffs. 1. Which action to take? Choose action with largest expected reward minus cost. 2. How fast to perform it? slow: less costly (vigor cost); slow: delays (all) rewards; net rate of rewards = cost of delay (opportunity cost of time). Choose rate that balances vigor and opportunity costs. explains faster (irrelevant) actions under hunger, etc. masochism
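The balance between vigor cost and opportunity cost can be made concrete under one assumed cost form. Suppose, purely for illustration, that acting with latency τ costs C_v/τ in vigor, while waiting τ forgoes τρ in average reward. The latency-dependent cost C_v/τ + τρ is then minimized at τ* = sqrt(C_v/ρ). The hyperbolic vigor cost and the names below are assumptions, not read off the slides.

```python
import math

def latency_cost(tau, vigor_cost, reward_rate):
    """Vigor cost of acting at latency tau plus the opportunity cost of the wait."""
    return vigor_cost / tau + tau * reward_rate

def optimal_latency(vigor_cost, reward_rate):
    """Latency minimizing latency_cost under the assumed hyperbolic vigor cost."""
    if vigor_cost <= 0 or reward_rate <= 0:
        raise ValueError("vigor_cost and reward_rate must be positive")
    return math.sqrt(vigor_cost / reward_rate)

slow = optimal_latency(vigor_cost=4.0, reward_rate=1.0)   # lower average reward rate
fast = optimal_latency(vigor_cost=4.0, reward_rate=4.0)   # higher average reward rate
```

In this sketch a higher average reward rate shortens the optimal latency for every action, whether or not that action earns the reward, which is one way to read the faster irrelevant actions under hunger.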
Slide 29 Optimal response rates. [Figure: two panels, "Experimental data" (Niv, Dayan, Joel, unpublished) and "Model simulation"; x-axis: seconds since reinforcement; 1st nose poke marked]
Slide 30 Effects of motivation (in the model). RR25. [Figure: mean latency for LP and Other under low utility and high utility; energizing effect]
Slide 31 Effects of motivation (in the model). RR25. [Figure: response rate/minute vs. seconds from reinforcement (directing effect); mean latency for LP and Other under low utility and high utility (energizing effect); U_R 50%]
Slide 32 Relation to Dopamine. Phasic dopamine firing = reward prediction error. What about tonic dopamine? less / more
Slide 33 Tonic dopamine = Average reward rate. [Figure: # LPs in 30 minutes (0 to 2500) at x-axis ticks 1, 4, 16, 64, Control vs. DA depleted; data from Aberman and Salamone 1999 alongside Model simulation] explains pharmacological manipulations; dopamine control of vigor through BG pathways. NB. phasic signal RPE for choice/value learning. eating time confound; context/state dependence (motivation & drugs?); less switching = perseveration
Slide 34 Tonic dopamine hypothesis. ♫ $ ♫ $ ♫ $ ♫ $ ♫ … also explains effects of phasic dopamine on response times. [Figure: firing rate vs. reaction time; Satoh and Kimura 2003; Ljungberg, Apicella and Schultz 1992]
Slide 35 Pavlovian & Instrumental Conditioning. Pavlovian: learning values and predictions using TD error. Instrumental: learning actions: by reinforcement (leg flexion); by (TD) critic (actually different forms: goal directed & habitual)
Slide 36 Pavlovian-Instrumental Interactions. synergistic: conditioned reinforcement; Pavlovian-instrumental transfer (Pavlovian cue predicts the instrumental outcome); behavioral inhibition to avoid aversive outcomes. neutral: Pavlovian-instrumental transfer (Pavlovian cue predicts outcome with same motivational valence). opponent: Pavlovian-instrumental transfer (Pavlovian cue predicts opposite motivational valence); negative automaintenance
Slide 37 -ve Automaintenance in Autoshaping. simple choice task: N: nogo gives reward r=1; G: go gives reward r=0. learn three quantities: average value; Q value for N; Q value for G. instrumental propensity is
Slide 38 -ve Automaintenance in Autoshaping. Pavlovian action assert: Pavlovian impetus towards G is v(t). weight Pavlovian and instrumental advantages by ω: competitive reliability of Pavlov. new propensities. new action choice
Slide 39 -ve Automaintenance in Autoshaping. basic -ve automaintenance effect (μ=5). lines are theoretical asymptotes. equilibrium probabilities of action
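The equilibrium probability of going can be sketched numerically under one assumed form of the model. Assume instrumental advantages q − v weighted by (1 − ω), a Pavlovian impetus ω·v added to G only, and a logistic choice rule with gain μ. At equilibrium q_N = 1, q_G = 0 and v = 1 − p(G), so p(G) must satisfy p = σ(μ[ω(1 − p) − (1 − ω)]). These propensity equations are an assumption made for illustration, and the solver below simply finds the fixed point by bisection.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def equilibrium_p_go(omega, mu=5.0, tol=1e-10):
    """Fixed point of p = sigmoid(mu * (omega * v - (1 - omega))) with v = 1 - p."""
    def excess(p):
        v = 1.0 - p                                   # only nogo is rewarded
        return sigmoid(mu * (omega * v - (1.0 - omega))) - p
    lo, hi = 0.0, 1.0                                 # excess(0) > 0 > excess(1)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

curve = [equilibrium_p_go(omega) for omega in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

With ω = 0 the choice is purely instrumental and go is rare; as ω grows, the Pavlovian impetus keeps go responses alive even though they earn nothing.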
Slide 40 Impulsivity & Hyperbolic Discounting. humans (and animals) show impulsivity in: diets; addiction; spending, … intertemporal conflict between short and long term choices. often explained via hyperbolic discount functions. alternative is Pavlovian imperative to an immediate reinforcer. framing, trolley dilemmas, etc
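The intertemporal conflict attributed to hyperbolic discounting can be shown with a small preference-reversal example. This sketch uses the usual hyperbolic form V = A / (1 + kD); the amounts, delays, and discount parameter k are hypothetical numbers chosen only to make the reversal visible.

```python
def hyperbolic_value(amount, delay, k=0.5):
    """Discounted value of a reward of size `amount` delivered after `delay`."""
    return amount / (1.0 + k * delay)

def prefers_larger_later(front_delay, small=50.0, large=100.0, gap=10.0, k=0.5):
    """Compare a small reward at front_delay with a large one at front_delay + gap."""
    soon = hyperbolic_value(small, front_delay, k)
    late = hyperbolic_value(large, front_delay + gap, k)
    return late > soon

far_in_advance = prefers_larger_later(front_delay=20.0)   # both options distant
at_the_moment = prefers_larger_later(front_delay=0.0)     # small reward is immediate
```

Seen from a distance the larger, later reward wins, but once the smaller reward becomes immediate the preference flips. An exponential discount function would keep the ratio of the two values fixed and so could not produce this reversal.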
Slide 41 Kalman Filter. Markov random walk (or OU process); no punctate changes; additive model of combination; forward inference
Slide 42 Kalman Posterior ^ ε
Slide 43 Assumed Density KF. Rescorla-Wagner error correction; competitive allocation of learning; P&H, M
Slide 44 Blocking. forward blocking: error correction. backward blocking: -ve off-diag
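The role of the negative off-diagonal covariance in backward blocking can be sketched with a two-stimulus Kalman filter. This is a minimal sketch, assuming an additive model r = w·x + noise with random-walk drift on the weights; the noise variances, trial counts, and function names are illustrative assumptions.

```python
def kalman_step(w, sigma, x, r, obs_var=0.1, drift_var=0.01):
    """One Kalman filter update of the weight mean w and covariance sigma."""
    n = len(w)
    # Random-walk drift inflates the prior covariance on the diagonal.
    prior = [[sigma[i][j] + (drift_var if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    sx = [sum(prior[i][j] * x[j] for j in range(n)) for i in range(n)]
    s = sum(x[i] * sx[i] for i in range(n)) + obs_var     # predicted variance of r
    gain = [value / s for value in sx]                    # per-stimulus learning rate
    error = r - sum(w[i] * x[i] for i in range(n))        # prediction error
    new_w = [w[i] + gain[i] * error for i in range(n)]
    new_sigma = [[prior[i][j] - gain[i] * sx[j] for j in range(n)]
                 for i in range(n)]
    return new_w, new_sigma

# Stimulus order is [L, T]. Phase 1: L+T -> r. Phase 2: L -> r.
w, sigma = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
for _ in range(20):
    w, sigma = kalman_step(w, sigma, [1.0, 1.0], 1.0)
w_after_compound, sigma_after_compound = list(w), [row[:] for row in sigma]
for _ in range(20):
    w, sigma = kalman_step(w, sigma, [1.0, 0.0], 1.0)
```

Compound training leaves the two weights anticorrelated, so when L alone later proves sufficient the gain for the absent T is negative and its weight falls without T ever being presented.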
Slide 45 Mackintosh vs P&H. under diagonal approximation: for slow learning, effect like Mackintosh E
Slide 46 Summary. Kalman filter models many standard conditioning paradigms; elements of RW, Mackintosh, P&H. but: downwards unblocking; negative patterning L → r; T → r; L+T → ·; recency vs primacy (Kruschke). predictor competition; stimulus/correlation rerepresentation (Daw)
Slide 47 Uncertainty (Yu). expected uncertainty - ignorance: amygdala, cholinergic basal forebrain for conditioning; ?basal forebrain for top-down attentional allocation. unexpected uncertainty - `set' change: noradrenergic locus coeruleus. part opponent; part synergistic interaction
Slide 48 Experimental Data. ACh & NE have similar physiological effects: suppress recurrent & feedback processing (e.g. Kimura et al, 1995; Kobayashi et al, 2000); enhance thalamocortical transmission (e.g. Gil et al, 1997); boost experience-dependent plasticity (e.g. Bear & Singer, 1986; Kilgard & Merzenich, 1998). ACh & NE have distinct behavioral effects: ACh boosts learning to stimuli with uncertain consequences (e.g. Bucci, Holland, & Gallagher, 1998); NE boosts learning upon encountering global changes in the environment (e.g. Devauges & Sara, 1990)
Slide 49 Model Schematics. [Diagram: context; expected uncertainty (ACh) and unexpected uncertainty (NE); top-down processing; cortical processing (prediction, learning, ...); bottom-up processing; sensory inputs]
Slide 50 Attention Example 1: Posner's Task. [Figure: trial sequence cue (0.1s), delay (0.2-0.5s), target (0.15s), response; cue with high validity or low validity; stimulus location; sensory input] (Phillips, McAlonan, Robb, & Brown, 2000). attentional selection for (statistically) optimal processing, above and beyond the traditional view of resource constraint. generalize to the case that cue identity changes with no notice
Slide 51 Formal Framework. ACh; NE. variability in identity of relevant cue; variability in quality of relevant cue. cues: vestibular, visual, ... target: stimulus location, exit direction... avoid representing full uncertainty. Sensory Information
Slide 52 [Figure: validity effect vs. concentration for nicotine and scopolamine]