Synopsis of part I: prediction and RL

Presentation Transcript

Slide 1

Outline of part I: prediction and RL
Prediction is important for action selection
The problem: prediction of future reward
The algorithm: temporal difference learning
Neural implementation: dopamine-dependent learning in BG
An exact computational model of learning allows one to look in the brain for "hidden variables" postulated by the model
Precise (normative!) theory for generation of dopamine firing patterns
Explains anticipatory dopaminergic responding, second order conditioning
Compelling account for the role of dopamine in classical conditioning: prediction error acts as the signal driving learning in prediction areas

Slide 2

Prediction error hypothesis of dopamine: measured firing rate vs. model prediction error; at end of trial: δt = rt − Vt (just like Rescorla-Wagner). Bayer & Glimcher (2005)
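The prediction-error rule on this slide fits in a few lines of code. A minimal sketch (Python), assuming a single trial-level value V and a free learning-rate parameter alpha, neither of which is fixed by the slide:

```python
# Rescorla-Wagner-style update: delta_t = r_t - V_t drives learning.
# alpha (learning rate) is an assumed free parameter, not given on the slide.
def rw_update(V, r, alpha=0.1):
    delta = r - V          # prediction error at end of trial
    V = V + alpha * delta  # value moves toward the delivered reward
    return V, delta
```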

Slide 3

Global plan: Reinforcement learning I: prediction; classical conditioning; dopamine. Reinforcement learning II: dynamic programming; action selection; Pavlovian misbehaviour; vigour. Chapter 9 of Theoretical Neuroscience

Slide 4

Action Selection
Evolutionary specification
Immediate reinforcement: leg flexion; Thorndike puzzle box; pigeon; rat; human matching
Delayed reinforcement: these tasks; mazes; chess
Bandler; Blanchard

Slide 5

Immediate Reinforcement: stochastic policy, based on action values:

Slide 6

Indirect Actor: use the RW rule; switch every 100 trials
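A minimal sketch of the indirect actor described here, assuming a two-armed bandit whose reward probabilities reverse every 100 trials and an illustrative softmax temperature beta and learning rate alpha (none of these constants are given on the slide):

```python
import numpy as np

# Indirect actor: action values updated by the RW rule, choices by softmax.
rng = np.random.default_rng(0)
m = np.zeros(2)                 # action values
alpha, beta = 0.1, 2.0
p_reward = np.array([0.8, 0.2]) # contingencies switch every 100 trials

for t in range(400):
    if t > 0 and t % 100 == 0:
        p_reward = p_reward[::-1]            # reverse the contingencies
    p = np.exp(beta * m) / np.exp(beta * m).sum()
    a = rng.choice(2, p=p)
    r = float(rng.random() < p_reward[a])
    m[a] += alpha * (r - m[a])               # RW update of the chosen action only
```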

Slide 7

Direct Actor

Slide 8

Direct Actor

Slide 9

Could we Tell? Correlate past rewards and actions with present choice. Indirect actor (separate clocks): Direct actor (single clock):

Slide 10

Matching: Concurrent VI-VI Lau, Glimcher, Corrado, Sugrue, Newsome

Slide 11

Matching: income, not return; approximately exponential in r; alternation choice kernel
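One hedged way to read "income, not return" as code: allocate choices in proportion to the income (rewards earned per unit time) from each option, rather than the return per response. This fractional form is an illustrative assumption, not taken from the slide:

```python
# Matching of income: choice fractions proportional to income per option.
def matching_allocation(income_L, income_R):
    total = income_L + income_R
    return income_L / total, income_R / total
```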

Slide 12

Action at a (Temporal) Distance: learning an appropriate action at u=1 depends on the actions at u=2 and u=3, and gains no immediate feedback. Idea: use prediction as surrogate feedback

Slide 13

Action Selection: start with a policy; evaluate it; improve it. 0.025 - 0.175 - 0.125: therefore choose R more frequently than L, C
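The evaluate-then-improve loop on this slide is policy iteration. A minimal sketch, assuming a tabular MDP given by transition tensor P (shape actions x states x states), a state reward vector R, and an illustrative discount factor gamma, none of which appear on the slide:

```python
import numpy as np

# Policy iteration: evaluate the current policy, then improve it greedily.
def policy_iteration(P, R, gamma=0.9):
    n_states = P.shape[1]
    pi = np.zeros(n_states, dtype=int)
    while True:
        # evaluate: solve (I - gamma * P_pi) V = R for the current policy
        P_pi = P[pi, np.arange(n_states), :]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # improve: act greedily with respect to V
        Q = R[None, :] + gamma * P @ V        # shape (n_actions, n_states)
        new_pi = Q.argmax(axis=0)
        if np.array_equal(new_pi, pi):
            return pi, V
        pi = new_pi
```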

Slide 14

Policy: value is too pessimistic; action is better than average

Slide 15

Actor/critic: m1, m2, m3, …, mn. Dopamine signals to both motivational & motor striatum appear, surprisingly, the same. Suggestion: training both values & policies
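A minimal actor-critic sketch of the suggestion on this slide: one TD error trains both the critic's values and the actor's propensities. The environment object `env` (with reset()/step()) and all step sizes are illustrative assumptions:

```python
import numpy as np

# Actor-critic: the same TD error delta updates values V and propensities m.
def actor_critic(env, n_states, n_actions, episodes=500,
                 alpha_v=0.1, alpha_m=0.1, gamma=0.95):
    rng = np.random.default_rng(0)
    V = np.zeros(n_states)
    m = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            ex = np.exp(m[s] - m[s].max())
            a = rng.choice(n_actions, p=ex / ex.sum())
            s_next, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
            V[s] += alpha_v * delta          # critic update
            m[s, a] += alpha_m * delta       # actor update (same TD error)
            s = s_next
    return V, m
```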

Slide 16

Variants: SARSA Morris et al, 2006
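For reference, the SARSA variant in code (a sketch with assumed step size and discount; terminal-state handling omitted): the on-policy update bootstraps from the action actually taken next.

```python
# SARSA: Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```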

Slide 17

Variants: Q learning Roesch et al, 2007
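And the Q-learning variant for comparison (same assumptions as the SARSA sketch): the off-policy update bootstraps from the best available next action rather than the one taken.

```python
# Q-learning: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```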

Slide 18

Summary: prediction learning; Bellman evaluation; actor-critic; asynchronous policy iteration; indirect method (Q learning); asynchronous value iteration

Slide 19

Direct/Indirect Pathways. Direct: D1: GO; learn from DA increase. Indirect: D2: noGO; learn from DA decrease. Hyperdirect (STN): delay actions given strongly attractive choices. Frank
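A hedged sketch of the learning rule implied here, in the spirit of Frank's GO/noGO pathways: GO weights learn from dopamine increases (positive prediction errors), noGO weights from decreases. The separate learning rates and the exact form are assumptions for illustration, not the published model:

```python
# GO (D1, direct) strengthens after positive delta; noGO (D2, indirect)
# strengthens after negative delta. alpha_g, alpha_n are assumed rates.
def go_nogo_update(go, nogo, delta, alpha_g=0.1, alpha_n=0.1):
    if delta > 0:
        go += alpha_g * delta       # DA increase trains the GO pathway
    else:
        nogo += alpha_n * (-delta)  # DA decrease trains the noGO pathway
    return go, nogo
```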

Slide 20

Frank: DARPP-32: D1 effect; DRD2: D2 effect

Slide 21

Current Topics
Vigour & tonic dopamine
Priors over decision problems (LH)
Pavlovian-instrumental interactions: impulsivity; behavioural inhibition; framing
Model-based, model-free and episodic control
Exploration versus exploitation
Game theoretic interactions (inequity aversion)

Slide 22

Vigour. Two components to choice: what (lever pressing; direction to run; meal to choose) and when/how fast/how vigorously. Free operant tasks; real-valued DP

Slide 23

The model (figure): states S0, S1, S2; actions LP (lever press), NP (nose poke), Other; choose (action, τ) = (LP, τ1) or (LP, τ2); vigour cost and unit cost; reward PR, UR; costs and rewards accrue over the latency (τ1 or τ2 time). How fast? Goal:

Slide 24

The model (figure): states S0, S1, S2; choose (action, τ) = (LP, τ1) or (LP, τ2); costs and rewards accrue over the latency. Goal: choose actions and latencies to maximize the average rate of return (rewards minus costs, per unit time). ARL

Slide 25

Average Reward RL: compute differential values of actions; ρ = average rewards minus costs, per unit time. Differential value of taking action L with latency τ when in state x: QL,τ(x) = Rewards − Costs + Future Returns. Steady-state behaviour (not learning dynamics). (Extension of Schwartz 1993)
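A minimal sketch of the differential value on this slide, assuming an illustrative decomposition of "Costs" into a unit cost and a vigour cost that scales as 1/τ, and a placeholder V_next for the future-return term (the slide does not fix a parameterization):

```python
# Differential value of taking action L with latency tau in state x,
# under average-reward RL with average reward rate rho.
def differential_Q(reward, unit_cost, vigour_cost, tau, rho, V_next):
    # rewards minus costs, minus the opportunity cost of the time spent,
    # plus the differential value of the successor state
    return reward - unit_cost - vigour_cost / tau - rho * tau + V_next
```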

Slide 26

Average Reward Cost/Benefit Tradeoffs. 1. Which action to take? Choose the action with the largest expected reward minus cost. 2. How fast to perform it? Slower is less costly (vigour cost), but slower delays (all) rewards: net rate of rewards = cost of delay (opportunity cost of time). Choose the rate that balances vigour and opportunity costs. Explains faster (irrelevant) actions under hunger, etc.; masochism
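One way to make the balance concrete: if the only latency-dependent terms are a vigour cost Cv/τ and an opportunity cost ρτ (an assumption consistent with the model sketched above), setting the derivative of their sum to zero gives the optimal latency, so a higher average reward rate ρ (e.g. under hunger) means faster responding:

```python
import numpy as np

# Minimize Cv/tau + rho*tau over tau: d/dtau = -Cv/tau**2 + rho = 0,
# hence tau* = sqrt(Cv / rho). Illustrative, assuming this two-term cost.
def optimal_latency(Cv, rho):
    return np.sqrt(Cv / rho)
```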

Slide 27

Optimal response rates (figure): experimental data vs. model simulation; response rate as a function of seconds since reinforcement, aligned to the 1st nose poke. Niv, Dayan, Joel, unpublished

Slide 28

Other effects of motivation (in the model), RR25 (figure): energizing effect; mean latency of LP and Other under low vs. high utility

Slide 29

Other effects of motivation (in the model), RR25 (figure): energizing effect (mean latency of LP and Other, low vs. high utility, UR 50%) and directing effect (response rate per minute as a function of seconds from reinforcement)

Slide 30

Relation to Dopamine: phasic dopamine firing = reward prediction error. What about tonic dopamine (less vs. more)?

Slide 31

Tonic dopamine = average reward rate (figure: # LPs in 30 minutes, control vs. DA-depleted, ratio requirements 1, 4, 16, 64; data from Aberman and Salamone 1999 alongside model simulation). Explains pharmacological manipulations; dopamine control of vigour through BG pathways. NB: the phasic signal remains the RPE for choice/value learning. Eating-time puzzle; context/state dependence (motivation & drugs?); less switching = perseveration

Slide 32

Tonic dopamine hypothesis (♫ $ ♫ $ ♫ $ ♫ $ ♫ …): also explains effects of phasic dopamine on reaction times (firing rate vs. reaction time). Satoh and Kimura 2003; Ljungberg, Apicella and Schultz 1992

Slide 33

Three Decision Makers: tree search; position evaluation; situation memory

Slide 34

Multiple Systems in RL
Model-based RL: build a forward model of the task and its outcomes; search in the forward model (online DP); optimal use of information; computationally ruinous
Cache-based RL: learn Q values, which summarize future worth; computationally trivial; bootstrap-based, so statistically inefficient
Learn both, and select according to uncertainty
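A minimal sketch of the selection principle in the last line above: use whichever system's value estimate is currently less uncertain. The uncertainty estimates themselves (e.g. posterior variances) are assumed inputs; the slide only states the principle, not the arbitration rule.

```python
# Uncertainty-based arbitration between model-based and cached values.
def select_value(q_model_based, var_model_based, q_cached, var_cached):
    return q_model_based if var_model_based < var_cached else q_cached
```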

Slide 35

Two Systems:

Slide 36

Behavioral Effects

Slide 37

Effects of Learning: distributional value iteration (Bayesian Q learning); fixed additional uncertainty per step

Slide 38

One Outcome: a shallow tree implies goal-directed control wins

Slide 39

Pavlovian & Instrumental Conditioning. Pavlovian: learning values and predictions using the TD error. Instrumental: learning actions, by reinforcement (leg flexion) or by the (TD) critic (actually two distinct forms: goal-directed & habitual)

Slide 40

Pavlovian-Instrumental Interactions
Synergistic: conditioned reinforcement; Pavlovian-instrumental transfer, where the Pavlovian cue predicts the instrumental outcome; behavioural inhibition to avoid aversive outcomes
Neutral: Pavlovian-instrumental transfer, where the Pavlovian cue predicts an outcome with the same motivational valence
Opponent: Pavlovian-instrumental transfer, where the Pavlovian cue predicts the opposite motivational valence; negative automaintenance

Slide 41

-ve Automaintenance in Autoshaping: a simple choice task. N: nogo gives reward r=1; G: go gives reward r=0. Learn three quantities: the average value, the Q value for N, and the Q value for G. The instrumental propensity is

Slide 42

-ve Automaintenance in Autoshaping: Pavlovian action. Assert: the Pavlovian impulse towards G is v(t). Weight the Pavlovian and instrumental advantages by ω, the competitive reliability of Pavlov. New propensities; new action choice
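A hedged sketch of the ω-weighting described on this slide, assuming a linear mixture of instrumental propensities with the Pavlovian impulse toward G and softmax action selection; the exact functional form is not spelled out in the transcript:

```python
import numpy as np

# Combine instrumental propensities (m_G, m_N) with the Pavlovian impulse v
# toward 'go', weighted by omega (the assumed reliability of Pavlov).
def action_probs(m_G, m_N, v, omega, beta=1.0):
    new_G = (1 - omega) * m_G + omega * v   # Pavlovian impulse boosts 'go'
    new_N = (1 - omega) * m_N               # 'nogo' gets no Pavlovian boost
    logits = beta * np.array([new_G, new_N])
    p = np.exp(logits - logits.max())
    return p / p.sum()                      # [P(go), P(nogo)]
```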

Slide 43

-ve Automaintenance in Autoshaping: the basic -ve automaintenance effect (μ=5); lines are theoretical asymptotes; equilibrium probabilities of action

Slide 44

Sensory Decisions as Optimal Stopping: consider listening to: decision: choose, or sample

Slide 45

Optimal Stopping: the equivalent of state u=1 is … and of states u=2, 3 is …

Slide 46

Transition Probabilities

Slide 47

Computational Neuromodulation
Dopamine: phasic: prediction error for reward; tonic: average reward (vigour)
Serotonin: phasic: prediction error for punishment?
Acetylcholine: expected uncertainty?
Norepinephrine: unexpected uncertainty; neural interrupt?

Slide 48

Ethology: optimality; appropriateness
Psychology: classical/operant conditioning
Computation: dynamic programming; Kalman filtering
Algorithm: TD/delta rules; simple weights
Conditioning: prediction of important events; control in the light of those predictions
Neurobiology: neuromodulators; amygdala; OFC; nucleus accumbens; dorsal striatum

Slide 49

Markov Decision Process: a class of stylized tasks with states, actions & rewards. At each timestep t the world takes on state st and delivers reward rt, and the agent chooses an action at

Slide 50

Markov Decision Process
World: You are in state 34. Your immediate reward is 3. You have 3 actions.
Robot: I'll take action 2.
World: You are in state 77. Your immediate reward is -7. You have 2 actions.
Robot: I'll take action 1.
World: You're in state 34 (again). Your immediate reward is 3. You have 3 actions.

Slide 51

Markov Decision Process: a stochastic process defined by a reward function, rt ~ P(rt | st), and a transition function, st+1 ~ P(st+1 | st, at)

Slide 52

Markov Decision Process: a stochastic process defined by a reward function, rt ~ P(rt | st), and a transition function, st+1 ~ P(st+1 | st, at). Markov property: the future is conditionally independent of the past, given st
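A minimal sketch of this definition as a sampling loop. The 3-state, 2-action reward means and transition tables below are illustrative stand-ins, not taken from the slides:

```python
import numpy as np

# MDP sketch: reward r_t ~ P(r_t | s_t), transition s_{t+1} ~ P(s_{t+1} | s_t, a_t).
rng = np.random.default_rng(0)
R_mean = np.array([3.0, -7.0, 1.0])            # E[r | s], illustrative
P = np.array([                                  # P[s, a, s'], illustrative
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]],
    [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]],
])

s = 0
for t in range(5):
    r = rng.normal(R_mean[s], 1.0)             # reward depends only on s_t
    a = rng.integers(2)                        # a random agent, for illustration
    s = rng.choice(3, p=P[s, a])               # Markov transition given (s_t, a_t)
```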

Slide 53

The optimal policy. Definition: a policy such that at every state, its expected v
