Reinforcement learning 2: action selection

Presentation Transcript

Slide 1

Reinforcement learning 2: action selection. Peter Dayan (thanks to Nathaniel Daw)

Slide 2

Global plan. Reinforcement learning I (Wednesday): prediction; classical conditioning; dopamine. Reinforcement learning II: dynamic programming; action selection; sequential sensory decisions; vigor; Pavlovian misbehavior. Chapter 9 of Theoretical Neuroscience.

Slide 3

Learning and Inference. Learning: predict; control. ∆ weight ∝ (learning rate) x (error) x (stimulus). Dopamine: phasic prediction error for future reward. Serotonin: phasic prediction error for future punishment. Acetylcholine: expected uncertainty boosts learning. Norepinephrine: unexpected uncertainty boosts learning.
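
The learning rule on this slide (weight change proportional to learning rate x error x stimulus) can be sketched as a delta rule. This is a minimal illustration, not the presenter's code: the function name `delta_rule_update`, the linear prediction, and the learning rate of 0.1 are all assumptions made for the example.

```python
def delta_rule_update(weights, stimulus, reward, learning_rate=0.1):
    """One delta-rule step: w <- w + learning_rate * error * stimulus.

    Assumes (for illustration) a linear prediction v = sum_i w_i * u_i.
    """
    prediction = sum(w * u for w, u in zip(weights, stimulus))
    error = reward - prediction  # prediction error
    return [w + learning_rate * error * u for w, u in zip(weights, stimulus)]


# Hypothetical usage: stimulus 1 is always present and followed by reward 1;
# stimulus 2 is never present, so its weight never changes.
weights = [0.0, 0.0]
for _ in range(200):
    weights = delta_rule_update(weights, [1.0, 0.0], reward=1.0)
```

Only the weight of the stimulus that is actually present is updated, and it approaches the reward it predicts.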

Slide 4

Action Selection. Evolutionary specification. Immediate reinforcement: leg flexion; Thorndike puzzle box; pigeon, rat, human matching. Delayed reinforcement: these tasks; mazes; chess. Bandler; Blanchard.

Slide 5

Immediate Reinforcement. Stochastic policy: based on action values:

Slide 6

Indirect Actor. Use the RW (Rescorla-Wagner) rule: switch every 100 trials
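
An indirect actor of this kind can be sketched as follows: action values are learned with a Rescorla-Wagner update and choices are drawn from a softmax over those values. The slide's equations are not in the transcript, so everything specific here is an assumption for illustration: the two reward magnitudes (1 and 2), the learning rate `epsilon`, the softmax parameter `beta`, and the function names.

```python
import math
import random


def softmax_choice_prob(m_a, m_b, beta):
    """Probability of choosing action a over b: sigma(beta * (m_a - m_b))."""
    return 1.0 / (1.0 + math.exp(-beta * (m_a - m_b)))


def run_indirect_actor(n_trials=400, switch_every=100, epsilon=0.1,
                       beta=1.0, seed=0):
    rng = random.Random(seed)
    rewards = [1.0, 2.0]  # hypothetical reward for each of the two actions
    m = [0.0, 0.0]        # learned action values
    history = []
    for t in range(n_trials):
        if t > 0 and t % switch_every == 0:
            rewards.reverse()  # contingencies switch every 100 trials
        p_first = softmax_choice_prob(m[0], m[1], beta)
        action = 0 if rng.random() < p_first else 1
        reward = rewards[action]
        # Rescorla-Wagner update, applied only to the chosen action's value
        m[action] += epsilon * (reward - m[action])
        history.append((action, reward))
    return m, history


m, history = run_indirect_actor()
```

Larger `beta` makes choice more deterministic (more exploitation); smaller `beta` keeps sampling both actions, which helps tracking after each switch.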

Slide 7

Direct Actor

Slide 8

Direct Actor
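
The direct actor slides carry no text in this transcript. As a hedged sketch, the usual textbook form of the direct actor adjusts action propensities directly, in proportion to how much the obtained reward exceeds a baseline. The update below, m_k <- m_k + epsilon * (delta_{k,chosen} - P[k]) * (r - baseline), is that standard form and is assumed here rather than taken from the slides; the function names are illustrative.

```python
import math


def softmax(m, beta=1.0):
    """Softmax choice probabilities over propensities m."""
    z = [math.exp(beta * x) for x in m]
    total = sum(z)
    return [x / total for x in z]


def direct_actor_update(m, chosen, reward, baseline, epsilon=0.1, beta=1.0):
    """Return updated propensities after one trial of the direct actor."""
    p = softmax(m, beta)
    return [
        m_k + epsilon * ((1.0 if k == chosen else 0.0) - p[k]) * (reward - baseline)
        for k, m_k in enumerate(m)
    ]


# Hypothetical trial: action 0 was chosen and paid 2.0 against a baseline of 1.0.
new_m = direct_actor_update([0.0, 0.0], chosen=0, reward=2.0, baseline=1.0)
```

Because the factors (delta - P) sum to zero across actions, the update only shifts propensities relative to one another.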

Slide 9

Could we Tell? Correlate past rewards, actions with present choice. Indirect actor (separate clocks): direct actor (single clock):

Slide 10

Matching: Concurrent VI-VI Lau, Glimcher, Corrado, Sugrue, Newsome

Slide 11

Matching: income not return; approximately exponential in r; alternation choice kernel

Slide 12

Action at a (Temporal) Distance. Learning an appropriate action at u=1: depends on the actions at u=2 and u=3; gains no immediate feedback. Idea: use prediction as surrogate feedback

Slide 13

Action Selection. Start with policy: evaluate it: improve it: 0.025 - 0.175 - 0.125. Thus choose L more frequently than R
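
The evaluate-then-improve step can be illustrated on a small three-state maze. The transcript does not give the task's rewards, so the values below are hypothetical: from u=1, L leads to u=2 and R leads to u=3 with no immediate reward, and the final choices pay the amounts in `REWARDS`. The sketch evaluates a uniform random policy and then uses the prediction error at u=1 as surrogate feedback.

```python
# Hypothetical terminal rewards for each action at u=2 and u=3.
REWARDS = {2: {"L": 0.0, "R": 5.0}, 3: {"L": 2.0, "R": 0.0}}
NEXT_STATE = {"L": 2, "R": 3}  # where each action at u=1 leads


def evaluate_uniform_policy():
    """State values when L and R are chosen with probability 0.5 everywhere."""
    v = {u: sum(r.values()) / len(r) for u, r in REWARDS.items()}
    v[1] = 0.5 * v[2] + 0.5 * v[3]
    return v


def td_errors_at_start(v):
    """Prediction error for each action at u=1: delta = 0 + v(u') - v(1)."""
    return {a: 0.0 + v[NEXT_STATE[a]] - v[1] for a in NEXT_STATE}


v = evaluate_uniform_policy()
deltas = td_errors_at_start(v)
```

With these assumed rewards the error is positive for L and negative for R at u=1, so an actor trained on it would come to choose L more often than R.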

Slide 14

Policy. Value is too pessimistic; action is better than average

Slide 15

Actor/critic. m1, m2, m3, ..., mn. Dopamine signals to both motivational & motor striatum appear, surprisingly, the same. Suggestion: training both values & policies

Slide 16

Variants: SARSA Morris et al, 2006

Slide 17

Variants: Q learning Roesch et al, 2007
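
The two variants differ only in the target used for the temporal difference update. The sketch below shows the standard textbook targets, not anything recovered from the slides; the function names, `gamma`, and `epsilon` are illustrative assumptions.

```python
def sarsa_target(reward, q_next, next_action, gamma=1.0):
    """SARSA (on-policy): bootstrap from the action actually taken next."""
    return reward + gamma * q_next[next_action]


def q_learning_target(reward, q_next, gamma=1.0):
    """Q learning (off-policy): bootstrap from the best available next action."""
    return reward + gamma * max(q_next.values())


def td_update(q_value, target, epsilon=0.1):
    """Move the current estimate a fraction epsilon towards the target."""
    return q_value + epsilon * (target - q_value)


# Hypothetical next-state action values.
q_next = {"L": 1.0, "R": 3.0}
on_policy = sarsa_target(0.5, q_next, next_action="L")
off_policy = q_learning_target(0.5, q_next)
```

When the next action taken is not the greedy one, the two targets differ, which is the basis for trying to tell the variants apart in neural recordings.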

Slide 18

Summary. Prediction learning: Bellman evaluation. Actor-critic: asynchronous policy iteration. Indirect method (Q learning): asynchronous value iteration

Slide 19

Sensory Decisions as Optimal Stopping. Consider listening to: decision: choose, or sample

Slide 20

Optimal Stopping. Equivalent of state u=1 is: and of states u=2,3 is:

Slide 21

Transition Probabilities

Slide 22

Evidence Accumulation Gold & Shadlen, 2007

Slide 23

Current Topics. Vigor & tonic dopamine. Priors over decision problems (LH). Pavlovian-instrumental interactions: impulsivity; behavioral inhibition; framing. Model-based, model-free and episodic control. Exploration vs exploitation. Game theoretic interactions (inequity aversion)

Slide 24

Vigor. Two components to choice. What: lever pressing; direction to run; meal to choose. When/how fast/how vigorous: free operant tasks; real-valued DP

Slide 25

The model. How fast? [Diagram: states S0, S1, S2; actions LP, NP, Other; choose (action, τ) = (LP, τ1), then choose (action, τ) = (LP, τ2), each over time τ1, τ2 with costs and rewards. Costs: vigor cost, unit cost. Rewards: P_R, U_R.] Goal

Slide 26

The model. [Diagram: states S0, S1, S2; choose (action, τ) = (LP, τ1), then choose (action, τ) = (LP, τ2), each over time τ1, τ2 with costs and rewards.] Goal: choose actions and latencies to maximize the average rate of return (rewards minus costs per time). ARL

Slide 27

Average Reward RL. Compute differential values of actions. Q_{L,τ}(x) = Rewards - Costs + Future Returns: the differential value of taking action L with latency τ when in state x. ρ = average rewards minus costs, per unit time. Steady state behavior (not learning dynamics). (Extension of Schwartz 1993)

Slide 28

Average Reward Cost/benefit Tradeoffs. 1. Which action to take? Choose action with largest expected reward minus cost. 2. How fast to perform it? Slow → less costly (vigor cost); slow → delays (all) rewards. Net rate of rewards = cost of delay (opportunity cost of time). Choose rate that balances vigor and opportunity costs. Explains faster (irrelevant) actions under hunger, etc. Masochism
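
The balance between vigor cost and opportunity cost can be made concrete under one assumed functional form. Suppose, purely for illustration, that acting with latency τ costs `vigor_cost / tau` and that waiting τ forgoes `rho * tau` of average reward. Neither expression appears in the transcript; both are assumptions of this sketch, as are the function names. The total is then minimized at τ* = sqrt(vigor_cost / rho).

```python
import math


def total_cost(tau, vigor_cost, rho):
    """Assumed cost of latency tau: vigor cost plus opportunity cost of time."""
    return vigor_cost / tau + rho * tau


def optimal_latency(vigor_cost, rho):
    """Latency minimizing total_cost, from d/dtau = -vigor_cost/tau**2 + rho = 0."""
    return math.sqrt(vigor_cost / rho)


# Hypothetical numbers: a higher average reward rate rho shortens the latency.
slow = optimal_latency(vigor_cost=4.0, rho=1.0)
fast = optimal_latency(vigor_cost=4.0, rho=4.0)
```

Under this assumption a higher average reward rate (for example under hunger) speeds up every action, including ones that do not themselves lead to reward.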

Slide 29

Optimal response rates. [Figure: 1st nose poke vs. seconds since reinforcement. Panels: experimental data (Niv, Dayan, Joel, unpublished); model simulation.]

Slide 30

Effects of motivation (in the model). RR25. [Figure: mean latency for LP and Other, low utility vs. high utility; energizing effect.]

Slide 31

Effects of motivation (in the model). RR25. [Figure: response rate/minute vs. seconds from reinforcement, low utility vs. high utility: 1. directing effect; 2. energizing effect. U_R 50%. Mean latency for LP and Other.]

Slide 32

Relation to Dopamine. Phasic dopamine firing = reward prediction error. What about tonic dopamine? [Figure labels: less, more.]

Slide 33

Tonic dopamine = Average reward rate. [Figure: # LPs in 30 minutes (axis 500 to 2500) at 1, 4, 16, 64, control vs. DA depleted. Panels: Aberman and Salamone 1999; model simulation.] Explains pharmacological manipulations; dopamine control of vigor through BG pathways. NB. phasic signal RPE for choice/value learning. Eating time confound; context/state dependence (motivation & drugs?); less switching = perseveration

Slide 34

Tonic dopamine hypothesis. ♫ $ ♫ $ ♫ $ ♫ $ ♫ … Also explains effects of phasic dopamine on response times. [Figure: firing rate vs. reaction time; Satoh and Kimura 2003; Ljungberg, Apicella and Schultz 1992.]

Slide 35

Pavlovian & Instrumental Conditioning. Pavlovian: learning values and predictions using TD error. Instrumental: learning actions, by reinforcement (leg flexion) or by (TD) critic (actually different forms: goal directed & habitual)

Slide 36

Pavlovian-Instrumental Interactions. Synergistic: conditioned reinforcement; Pavlovian-instrumental transfer (Pavlovian cue predicts the instrumental outcome); behavioral inhibition to avoid aversive outcomes. Neutral: Pavlovian-instrumental transfer (Pavlovian cue predicts outcome with same motivational valence). Opponent: Pavlovian-instrumental transfer (Pavlovian cue predicts opposite motivational valence); negative automaintenance

Slide 37

-ve Automaintenance in Autoshaping. Simple choice task. N: nogo gives reward r=1. G: go gives reward r=0. Learn three quantities: average value; Q value for N; Q value for G. Instrumental propensity is:

Slide 38

-ve Automaintenance in Autoshaping. Pavlovian action. Assert: Pavlovian impetus towards G is v(t). Weight Pavlovian and instrumental advantages by ω, the competitive reliability of Pavlov. New propensities: new action choice:

Slide 39

-ve Automaintenance in Autoshaping. Basic -ve automaintenance effect (μ=5). Lines are theoretical asymptotes: equilibrium probabilities of action
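
The equilibrium probability of going can be sketched under an assumed form of the combined propensities, since the slides' equations are missing from the transcript. Assume the instrumental advantage of an action is its Q value minus the average value v, that the Pavlovian impetus v is added to G only, and that the two are weighted by (1 - ω) and ω before a sigmoid with slope μ. The function names and this specific weighting are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def equilibrium_go_probability(omega, mu=5.0, tol=1e-12):
    """Fixed point of p(G) when values have converged (assumed model form)."""

    def implied_p_go(p_go):
        v = 1.0 - p_go        # average value: only N (r=1) is rewarded
        q_n, q_g = 1.0, 0.0   # converged Q values for nogo and go
        m_n = (1.0 - omega) * (q_n - v)
        m_g = (1.0 - omega) * (q_g - v) + omega * v
        return sigmoid(mu * (m_g - m_n))

    # implied_p_go decreases in p_go, so bisection finds the unique fixed point.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if implied_p_go(mid) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


p_go_by_omega = {w: equilibrium_go_probability(w) for w in (0.0, 0.2, 0.5, 0.8)}
```

In this sketch a purely instrumental agent (ω = 0) almost never goes, while increasing ω raises the equilibrium probability of the unrewarded go response, which is the qualitative signature of negative automaintenance.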

Slide 40

Impulsivity & Hyperbolic Discounting. Humans (and animals) show impulsivity in: diets; addiction; spending, … Intertemporal conflict between short and long term choices. Often explained via hyperbolic discount functions. Alternative is Pavlovian imperative to an immediate reinforcer. Framing, trolley dilemmas, etc.

Slide 41

Kalman Filter. Markov random walk (or OU process); no punctate changes; additive model of combination; forward inference

Slide 42

Kalman Posterior

Slide 43

Assumed Density KF. Rescorla-Wagner error correction; competitive allocation of learning (P&H, M)

Slide 44

Blocking. Forward blocking: error correction. Backward blocking: -ve off-diag
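
Backward blocking through a negative off-diagonal covariance term can be sketched with a two-stimulus Kalman filter. The sketch assumes weights that follow a random walk and a reward equal to the sum of the weights of the stimuli present plus noise; the noise variances, trial counts, and function names are illustrative assumptions.

```python
def kalman_step(w, sigma, x, r, obs_var=0.5, drift_var=0.01):
    """One Kalman filter update of mean weights w and covariance sigma."""
    n = len(w)
    # Random-walk drift adds variance on the diagonal.
    sigma = [[sigma[i][j] + (drift_var if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    sx = [sum(sigma[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = sum(x[i] * sx[i] for i in range(n)) + obs_var
    gain = [s / denom for s in sx]
    error = r - sum(w[i] * x[i] for i in range(n))  # prediction error
    w = [w[i] + gain[i] * error for i in range(n)]
    sigma = [[sigma[i][j] - gain[i] * sx[j] for j in range(n)] for i in range(n)]
    return w, sigma


def backward_blocking_demo(n_trials=20):
    """Phase 1: L+T -> r. Phase 2: L -> r. Returns w_T before/after, off-diag."""
    w, sigma = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n_trials):
        w, sigma = kalman_step(w, sigma, [1.0, 1.0], 1.0)
    w_t_before, off_diag = w[1], sigma[0][1]
    for _ in range(n_trials):
        w, sigma = kalman_step(w, sigma, [1.0, 0.0], 1.0)
    return w_t_before, w[1], off_diag


w_t_before, w_t_after, off_diag = backward_blocking_demo()
```

Compound training leaves the two weights anticorrelated, so later evidence that L alone predicts the reward lowers the estimate for T even though T is never presented again.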

Slide 45

Mackintosh vs P&H. Under diagonal approximation: for slow learning, effect like Mackintosh

Slide 46

Summary. Kalman filter models many standard conditioning paradigms; elements of RW, Mackintosh, P&H. But: downwards unblocking; negative patterning L → r; T → r; L+T → ·; recency vs primacy (Kruschke). Predictor competition; stimulus/correlation rerepresentation (Daw)

Slide 47

Uncertainty (Yu). Expected uncertainty - ignorance: amygdala, cholinergic basal forebrain for conditioning; ?basal forebrain for top-down attentional allocation. Unexpected uncertainty - `set' change: noradrenergic locus coeruleus. Part opponent; part synergistic interaction

Slide 48

Experimental Data. ACh & NE have similar physiological effects: suppress recurrent & feedback processing (e.g. Kimura et al, 1995; Kobayashi et al, 2000); enhance thalamocortical transmission (e.g. Gil et al, 1997); boost experience-dependent plasticity (e.g. Bear & Singer, 1986; Kilgard & Merzenich, 1998). ACh & NE have distinct behavioral effects: ACh boosts learning to stimuli with uncertain consequences (e.g. Bucci, Holland, & Gallagher, 1998); NE boosts learning upon encountering global changes in the environment (e.g. Devauges & Sara, 1990)

Slide 49

Model Schematics. [Diagram: context; expected uncertainty; unexpected uncertainty; top-down processing; NE; ACh; cortical processing (prediction, learning, ...); bottom-up processing; sensory inputs.]

Slide 50

Attention. Example 1: Posner's Task (Phillips, McAlonan, Robb, & Brown, 2000). [Figure: cue (high validity, low validity); stimulus location; target; sensory input; response; timings 0.1s, 0.2-0.5s, 0.15s.] Attentional selection for (statistically) optimal processing, above and beyond the traditional view of resource constraint. Generalize to the case that cue identity changes with no notice

Slide 51

Formal Framework. ACh, NE. Variability in identity of relevant cue; variability in quality of relevant cue. Cues: vestibular, visual, ... Target: stimulus location, exit direction... Avoid representing full uncertainty. Sensory Information

Slide 52

[Figure: nicotine; scopolamine; validity effect vs. concentration.]