Outline of part I: prediction and RL
Prediction is vital for action selection
The problem: prediction of future reward
The algorithm: temporal difference learning
Neural implementation: dopamine-dependent learning in the BG
A precise computational model of learning allows one to look in the brain for "hidden variables" postulated by the model
Precise (normative!) theory for the generation of dopamine firing patterns
Explains anticipatory dopaminergic responding and second-order conditioning
Compelling account of the role of dopamine in classical conditioning: the prediction error acts as the signal driving learning in prediction areas

measured firing rate reflects prediction error
prediction-error hypothesis of dopamine; at the end of the trial: δ_t = r_t − V_t (just like R-W)
Bayer & Glimcher (2005)
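The end-of-trial rule can be sketched as a delta-rule update in which δ_t = r_t − V_t drives learning of V (a minimal sketch; the learning rate and reward probability here are illustrative assumptions):

```python
import random

def rw_update(V, r, alpha=0.1):
    """One Rescorla-Wagner step: delta_t = r_t - V_t, then V <- V + alpha * delta."""
    delta = r - V              # prediction error at the end of the trial
    return V + alpha * delta, delta

# Illustrative simulation: reward delivered with probability 0.75 per trial
random.seed(0)
V = 0.0
for t in range(1000):
    r = 1.0 if random.random() < 0.75 else 0.0
    V, delta = rw_update(V, r)

# V hovers around the expected reward (~0.75) once learning has converged
print(round(V, 2))
```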

Global plan
Reinforcement learning I: prediction; classical conditioning; dopamine
Reinforcement learning II: dynamic programming; action selection; Pavlovian misbehavior; vigor
Chapter 9 of Theoretical Neuroscience

Action Selection
Evolutionary specification
Immediate reinforcement: leg flexion; Thorndike puzzle box; pigeon; rat; human matching
Delayed reinforcement: these tasks; mazes; chess
Bandler; Blanchard

Immediate Reinforcement
stochastic policy: based on action values:

Indirect Actor
use the RW rule; switch every 100 trials
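A minimal sketch of such an indirect actor: action values for a two-armed task are learned with the RW rule, choices are drawn from a softmax over those values, and the reward probabilities switch every 100 trials (all parameter values here are illustrative assumptions):

```python
import math, random

def softmax_choice(qL, qR, beta=3.0):
    """Choose L or R with probability proportional to exp(beta * Q)."""
    pL = 1.0 / (1.0 + math.exp(-beta * (qL - qR)))
    return 'L' if random.random() < pL else 'R'

random.seed(1)
Q = {'L': 0.0, 'R': 0.0}          # learned action values
p = {'L': 0.8, 'R': 0.2}          # true reward probabilities
alpha = 0.2
for t in range(400):
    if t > 0 and t % 100 == 0:
        p['L'], p['R'] = p['R'], p['L']   # contingencies switch every 100 trials
    a = softmax_choice(Q['L'], Q['R'])
    r = 1.0 if random.random() < p[a] else 0.0
    Q[a] += alpha * (r - Q[a])            # RW rule on the chosen action only

print(Q)
```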

Direct Actor

Direct Actor

Could We Tell?
correlate past rewards and actions with present choice
indirect actor (separate clocks):
direct actor (single clock):

Matching: Concurrent VI-VI
Lau, Glimcher, Corrado, Sugrue, Newsome

Matching
income, not return
approximately exponential in r
alternation choice kernel

Action at a (Temporal) Distance
learning an appropriate action at u=1:
depends on the actions at u=2 and u=3
gains no immediate feedback
idea: use prediction as surrogate feedback

Action Selection
start with a policy: evaluate it: improve it:
0.025 - 0.175 - 0.125
thus choose R more frequently than L and C

Policy
the value is too pessimistic
the action is better than average

actor/critic
m_1 m_2 m_3 … m_n
dopamine signals to both the motivational & motor striatum appear, surprisingly, the same
suggestion: training both values & policies
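The suggestion that one TD error trains both values and policies is the actor/critic scheme; a minimal sketch on a three-state chain in which the same δ updates the critic's values V and the actor's propensities m (the task layout, rewards and step sizes are illustrative assumptions):

```python
import math, random

random.seed(2)
n_states, actions = 3, ('L', 'R')
V = [0.0] * n_states                                      # critic: state values
m = [{a: 0.0 for a in actions} for _ in range(n_states)]  # actor: propensities

def policy(s, beta=1.0):
    """Softmax over the actor's propensities."""
    z = {a: math.exp(beta * m[s][a]) for a in actions}
    return 'L' if random.random() < z['L'] / sum(z.values()) else 'R'

alpha_v, alpha_m, gamma = 0.1, 0.1, 1.0
for episode in range(500):
    s = 0
    while s is not None:
        a = policy(s)
        # taking R in the last state yields reward 1 and ends the episode
        if s == n_states - 1 and a == 'R':
            r, s2 = 1.0, None
        else:
            r, s2 = 0.0, (min(s + 1, n_states - 1) if a == 'R' else max(s - 1, 0))
        delta = r + (gamma * V[s2] if s2 is not None else 0.0) - V[s]
        V[s] += alpha_v * delta      # critic update
        m[s][a] += alpha_m * delta   # actor update: driven by the same TD error
        s = s2

print([round(v, 2) for v in V])
```

After training, the propensity toward R dominates in the rewarded state, illustrating how one prediction-error signal can shape both values and policies.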

Variants: SARSA Morris et al, 2006

Variants: Q learning Roesch et al, 2007

Summary
prediction learning: Bellman evaluation
actor-critic: asynchronous policy iteration
indirect method (Q learning): asynchronous value iteration
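The two variants above differ only in their bootstrap target: SARSA backs up the value of the next action actually taken (on-policy), Q-learning the best available one (off-policy). A minimal sketch (the two-state table and parameter values are illustrative assumptions):

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    """SARSA: bootstrap on the action actually taken next (on-policy)."""
    target = r + gamma * Q[s2][a2]
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.9):
    """Q-learning: bootstrap on the best next action (off-policy)."""
    target = r + gamma * max(Q[s2].values())
    Q[s][a] += alpha * (target - Q[s][a])

# Illustrative table: in state 1, action R is worth more than action L
Q = {0: {'L': 0.0, 'R': 0.0}, 1: {'L': 0.5, 'R': 1.0}}
sarsa_update(Q, 0, 'L', 0.0, 1, 'L')   # target uses Q[1]['L'] = 0.5
q_learning_update(Q, 0, 'R', 0.0, 1)   # target uses max over Q[1] = 1.0
print(Q[0])                            # Q-learning's update is larger here
```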

Direct/Indirect Pathways
direct: D1: GO; learns from DA increases
indirect: D2: NOGO; learns from DA decreases
hyperdirect (STN): delays actions given strongly attractive choices
Frank

Frank
DARPP-32: D1 effect
DRD2: D2 effect

Current Topics
Vigor & tonic dopamine
Priors over decision problems (LH)
Pavlovian-instrumental interactions: impulsivity; behavioral inhibition; framing
Model-based, model-free and episodic control
Exploration vs. exploitation
Game-theoretic interactions (inequity aversion)

Vigor
Two components to choice:
what: lever pressing; direction to run; meal to choose
when/how fast/how vigorously
free-operant tasks
real-valued DP

The model (figure): states S_0, S_1, S_2; actions LP (lever press), NP (nose poke) and Other, each chosen together with a latency, e.g. (action, τ) = (LP, τ_1) or (LP, τ_2); each choice incurs a vigor cost and a unit cost against rewards (P_R, U_R); how fast? goal:

The model (figure: states S_0, S_1, S_2; costs and rewards; choose (action, τ) = (LP, τ_1) or (LP, τ_2))
Goal: choose actions and latencies to maximize the average rate of return (rewards minus costs, per unit time): ARL

Average Reward RL
compute differential values of actions: the differential value of taking action L with latency τ when in state x is
Q_{L,τ}(x) = Rewards − Costs + Future Returns
ρ = average rewards minus costs, per unit time
steady-state behavior (not learning dynamics)
(extension of Schwartz 1993)

Average Reward Cost/Benefit Tradeoffs
1. Which action to take? choose the action with the largest expected reward minus cost
2. How fast to perform it? slower is cheaper (lower vigor cost), but slower delays (all) rewards: the net rate of rewards sets the cost of delay (the opportunity cost of time)
choose the rate that balances vigor and opportunity costs
explains faster (even irrelevant) actions under hunger, etc.
masochism
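One way to make the tradeoff concrete, under the common assumption that acting with latency τ costs C_v/τ in vigor while waiting forgoes reward at the average rate ρ: the total C_v/τ + ρτ is minimized at τ* = sqrt(C_v/ρ), so a higher reward rate (e.g. under hunger) mandates faster responding. A sketch (the cost form and the numerical values are assumptions, in the spirit of the model above):

```python
import math

def optimal_latency(c_vigor, rho):
    """Minimize c_vigor/tau + rho*tau over tau > 0: tau* = sqrt(c_vigor/rho)."""
    return math.sqrt(c_vigor / rho)

# Raising the average reward rate rho shortens the optimal latency
slow = optimal_latency(1.0, 0.5)   # low reward rate
fast = optimal_latency(1.0, 2.0)   # high reward rate (e.g. hunger)
print(slow, fast)
```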

Optimal response rates (figure): 1st nose poke, response rate vs. seconds since reinforcement; experimental data and model simulation; Niv, Dayan, Joel, unpublished

Other effects of motivation (in the model) (figure, RR25 schedule): energizing effect; mean LP latency for low vs. high utility

Other effects of motivation (in the model) (figure, RR25 schedule): directing and energizing effects; response rate per minute vs. seconds from reinforcement for low vs. high utility; mean LP latency; U_R 50%

Relation to Dopamine
Phasic dopamine firing = reward prediction error
What about tonic dopamine?

Tonic dopamine = average reward rate
(figure: # LPs in 30 minutes, control vs. DA-depleted; Aberman and Salamone 1999; model simulation)
explains pharmacological manipulations
dopamine control of vigor through BG pathways
NB: the phasic signal remains the RPE for choice/value learning
eating-time confound
context/state dependence (motivation & drugs?)
less switching = perseveration

Tonic dopamine hypothesis
♫ $ ♫ $ ♫ $ ♫ $ ♫ …
also explains effects of phasic dopamine on reaction times (firing rate vs. reaction time)
Satoh and Kimura 2003; Ljungberg, Apicella and Schultz 1992

Three Decision Makers
tree search
position evaluation
situation memory

Multiple Systems in RL
model-based RL: build a forward model of the task and its outcomes; search in the forward model (online DP); optimal use of information; computationally ruinous
cache-based RL: learn Q values, which summarize future worth; computationally trivial; bootstrap-based, so statistically inefficient
learn both; select according to uncertainty
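The model-based route, online DP in a forward model, can be sketched as value iteration over an explicit transition/reward model (the tiny two-state model, its rewards and the discount factor are illustrative assumptions):

```python
# Forward model: P[s][a] = list of (prob, next_state); R[s][a] = expected reward
P = {0: {'stay': [(1.0, 0)], 'go': [(0.9, 1), (0.1, 0)]},
     1: {'stay': [(1.0, 1)], 'go': [(1.0, 0)]}}
R = {0: {'stay': 0.0, 'go': 0.0}, 1: {'stay': 1.0, 'go': 0.0}}
gamma = 0.9

V = {s: 0.0 for s in P}
for sweep in range(100):                 # online DP: sweep to convergence
    V = {s: max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in P[s])
         for s in P}

# A cache-based system would instead store Q[s][a] directly from samples,
# trading this computation for statistically inefficient bootstrapping.
print({s: round(v, 2) for s, v in V.items()})
```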

Two Systems:

Behavioral Effects

Effects of Learning
distributional value iteration (Bayesian Q learning)
fixed additional uncertainty per step

One Outcome
a shallow tree implies that goal-directed control wins

Pavlovian & Instrumental Conditioning
Pavlovian: learning values and predictions using the TD error
Instrumental: learning actions:
by reinforcement (leg flexion)
by the (TD) critic
(actually two distinct structures: goal-directed & habitual)

Pavlovian-Instrumental Interactions
synergistic: conditioned reinforcement; Pavlovian-instrumental transfer, where the Pavlovian cue predicts the instrumental outcome; behavioral inhibition to avoid aversive outcomes
neutral: Pavlovian-instrumental transfer where the Pavlovian cue predicts an outcome with the same motivational valence
opponent: Pavlovian-instrumental transfer where the Pavlovian cue predicts the opposite motivational valence; negative automaintenance

−ve Automaintenance in Autoshaping
simple choice task:
N: nogo gives reward r=1
G: go gives reward r=0
learn three quantities: the average value, the Q value for N, the Q value for G
the instrumental propensity is

−ve Automaintenance in Autoshaping
Pavlovian action: assert that the Pavlovian impetus towards G is v(t)
weight the Pavlovian and instrumental advantages by ω, the competitive reliability of Pavlov
new propensities
new action choice
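A sketch of the competitive weighting: instrumental propensities and the Pavlovian impetus v toward G are mixed with weight ω before the choice probabilities are computed (the softmax form and all parameter values are illustrative assumptions):

```python
import math

def p_go(q_go, q_nogo, v, omega, beta=1.0):
    """Mix instrumental propensities with a Pavlovian impetus toward 'go'."""
    go = (1 - omega) * q_go + omega * v    # Pavlovian bias pushes 'go' only
    nogo = (1 - omega) * q_nogo
    z_go, z_nogo = math.exp(beta * go), math.exp(beta * nogo)
    return z_go / (z_go + z_nogo)

# Instrumentally, nogo is better (r=1 vs r=0), but a strong Pavlovian
# prediction v drags choice toward go: the -ve automaintenance effect.
p_pure = p_go(q_go=0.0, q_nogo=1.0, v=1.0, omega=0.0)  # pure instrumental
p_pav = p_go(q_go=0.0, q_nogo=1.0, v=1.0, omega=0.8)   # Pavlov dominates
print(p_pure, p_pav)
```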

−ve Automaintenance in Autoshaping
basic −ve automaintenance effect (μ = 5)
lines are theoretical asymptotes
equilibrium probabilities of action

Sensory Decisions as Optimal Stopping
consider listening to:
decision: choose, or sample again

Optimal Stopping
the equivalent of state u=1 is
and of states u=2,3 is

Transition Probabilities

Computational Neuromodulation
dopamine: phasic, prediction error for reward; tonic, average reward (vigor)
serotonin: phasic, prediction error for punishment?
acetylcholine: expected uncertainty?
norepinephrine: unexpected uncertainty; neural interrupt?

Ethology: optimality; appropriateness
Psychology: classical/operant conditioning
Computation: dynamic programming; Kalman filtering
Algorithm: TD/delta rules; simple weights
Conditioning: prediction of important events; control in the light of those predictions
Neurobiology: neuromodulators; amygdala; OFC; nucleus accumbens; dorsal striatum

Markov Decision Process
a class of conditioning tasks with states, actions & rewards
at each timestep t the world takes on state s_t and delivers reward r_t, and the agent chooses an action a_t

Markov Decision Process
World: You are in state 34. Your immediate reward is 3. You have 3 actions.
Robot: I'll take action 2.
World: You are in state 77. Your immediate reward is −7. You have 2 actions.
Robot: I'll take action 1.
World: You're in state 34 (again). Your immediate reward is 3. You have 3 actions.

Markov Decision Process
Stochastic process defined by:
reward function: r_t ~ P(r_t | s_t)
transition function: s_{t+1} ~ P(s_{t+1} | s_t, a_t)

Markov Decision Process
Stochastic process defined by:
reward function: r_t ~ P(r_t | s_t)
transition function: s_{t+1} ~ P(s_{t+1} | s_t, a_t)
Markov property: the future is conditionally independent of the past, given s_t
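These two distributions are all that is needed to simulate the world/robot dialogue above; a minimal sketch with made-up tables (the states, rewards and transition probabilities are illustrative assumptions, and the reward is treated as deterministic given the state for simplicity):

```python
import random

# Transition function P(s' | s, a) as (prob, next_state) lists,
# and a deterministic reward per state (matching the dialogue's numbers)
P = {(34, 1): [(0.5, 34), (0.5, 77)],
     (34, 2): [(1.0, 77)],
     (77, 1): [(1.0, 34)]}
R = {34: 3.0, 77: -7.0}

def step(s, a, rng):
    """Sample r_t from the reward function and s_{t+1} ~ P(s' | s_t, a_t)."""
    r = R[s]
    probs, states = zip(*[(p, s2) for p, s2 in P[(s, a)]])
    s2 = rng.choices(states, weights=probs)[0]
    return r, s2

rng = random.Random(0)
s, total = 34, 0.0
for t in range(4):
    a = 1 if (s, 1) in P else 2   # trivial fixed policy for the sketch
    r, s2 = step(s, a, rng)
    total += r
    s = s2
print(total)
```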

The Optimal Policy
Definition: a policy such that at every state, its expected value is at least as great as that of any other policy
