# Introduction to Machine Learning


### Presentation Transcript

Slide 1

Introduction to Machine Learning. Manik Varma, Microsoft Research India. http://research.microsoft.com/~manik manik@microsoft.com

Slide 2

Binary Classification. Is this person Madhubala or not? Is this person male or female? Is this person beautiful or not?

Slide 3

Multi-Class Classification. Is this person Madhubala, Lalu or Rakhi Sawant? Is this person happy, sad, angry or confused?

Slide 4

Ordinal Regression. Is this person very beautiful, beautiful, average or ugly?

Slide 5

Regression. How beautiful is this person on a continuous scale of 1 to 10? 9.99?

Slide 6

Ranking. Rank these people in decreasing order of attractiveness.

Slide 7

Multi-Label Classification. Tag this image with the set of relevant labels from {female, Madhubala, beautiful, IITD faculty}.

Slide 8

Are These Problems Distinct?
Can regression solve all of these problems?
- Binary classification – predict $p(y=1|\mathbf{x})$
- Multi-class classification – predict $p(y=k|\mathbf{x})$
- Ordinal regression – predict $p(y=k|\mathbf{x})$
- Ranking – predict and sort by relevance
- Multi-label classification – predict $p(\mathbf{y} \in \{\pm 1\}^k|\mathbf{x})$

Learning from experience and data:
- In what form can the training data be acquired?
- What is known a priori?
- Complexity of training
- Complexity of prediction

Slide 9

In This Course
- Supervised learning
  - Classification
    - Generative methods: nearest neighbour, Naïve Bayes
    - Discriminative methods: logistic regression
    - Discriminant methods: support vector machines
  - Regression, ranking, feature selection, etc.
- Unsupervised learning
- Semi-supervised learning
- Reinforcement learning

Slide 10

Learning from Noisy Data
- Noise and uncertainty
- Unknown generative model $y = f(\mathbf{x})$
- Noise in measuring data and in feature extraction
- Noise in labels
- Nuisance variables
- Missing data
- Finite training set size

Slide 11

Under- and Over-Fitting

Slide 12

Probability Theory
- Non-negativity and unit measure: $0 \le p(y)$, $p(\Omega) = 1$, $p(\varnothing) = 0$
- Conditional probability: $p(y|\mathbf{x})$, with $p(\mathbf{x}, y) = p(y|\mathbf{x})\, p(\mathbf{x}) = p(\mathbf{x}|y)\, p(y)$
- Bayes' theorem: $p(y|\mathbf{x}) = p(\mathbf{x}|y)\, p(y)/p(\mathbf{x})$
- Marginalization: $p(\mathbf{x}) = \int_y p(\mathbf{x}, y)\, dy$
- Independence: $p(\mathbf{x}_1, \mathbf{x}_2) = p(\mathbf{x}_1)\, p(\mathbf{x}_2) \Rightarrow p(\mathbf{x}_1|\mathbf{x}_2) = p(\mathbf{x}_1)$

Chris Bishop, "Pattern Recognition and Machine Learning"
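The rules on this slide can be checked numerically on a small discrete distribution. The following sketch uses a hypothetical 2×2 joint table (the values are illustrative, not from the slides) and verifies that marginalization and Bayes' theorem give consistent answers:

```python
# Hypothetical joint distribution p(x, y) over x, y in {0, 1}; values sum to 1.
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def marginal_x(x):
    # Marginalization: p(x) = sum_y p(x, y)
    return sum(p for (xi, yi), p in joint.items() if xi == x)

def marginal_y(y):
    return sum(p for (xi, yi), p in joint.items() if yi == y)

def conditional_y_given_x(y, x):
    # Conditional probability: p(y | x) = p(x, y) / p(x)
    return joint[(x, y)] / marginal_x(x)

def bayes_y_given_x(y, x):
    # Bayes' theorem: p(y | x) = p(x | y) p(y) / p(x)
    p_x_given_y = joint[(x, y)] / marginal_y(y)
    return p_x_given_y * marginal_y(y) / marginal_x(x)
```

Both routes to $p(y|x)$ agree, and the marginals of the table sum to one, as the unit-measure axiom requires.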

Slide 13

p ( x |  ,  ) = exp( - ( x –  ) 2/2  2 )/( 2   2 ) ½ The Univariate Gaussian Density - 3  - 2  - 1   1  2  3 

Slide 14

p ( x |  ,  ) = exp( - ½ ( x –  ) t  - 1 ( x –  )/( 2 ) D/2 |  | ½ The Multivariate Gaussian Density

Slide 15

p (  | a , b ) =  a (1 –  ) b-1  ( a + b )/ ( a )  ( b ) The Beta Density

Slide 16

Probability Distribution Functions
- Bernoulli: a single trial with probability of success $\theta$; $n \in \{0,1\}$, $\theta \in [0,1]$, $p(n|\theta) = \theta^n (1-\theta)^{1-n}$
- Binomial: $N$ iid Bernoulli trials with $n$ successes; $n \in \{0,1,\dots,N\}$, $\theta \in [0,1]$, $p(n|N,\theta) = \binom{N}{n}\, \theta^n (1-\theta)^{N-n}$
- Multinomial: $N$ iid trials in which outcome $k$ occurs $n_k$ times; $n_k \in \{0,1,\dots,N\}$, $\sum_k n_k = N$, $\theta_k \in [0,1]$, $\sum_k \theta_k = 1$, $p(\mathbf{n}|N,\boldsymbol{\theta}) = N! \prod_k \theta_k^{n_k}/n_k!$
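These three mass functions translate directly to code. A sketch using only the standard library (function names are my own):

```python
import math

def bernoulli_pmf(n, theta):
    # p(n | theta) = theta^n (1 - theta)^{1-n}, n in {0, 1}
    return theta ** n * (1 - theta) ** (1 - n)

def binomial_pmf(n, N, theta):
    # p(n | N, theta) = C(N, n) theta^n (1 - theta)^{N-n}
    return math.comb(N, n) * theta ** n * (1 - theta) ** (N - n)

def multinomial_pmf(counts, thetas):
    # p(n | N, theta) = N! * prod_k theta_k^{n_k} / n_k!
    N = sum(counts)
    p = math.factorial(N)
    for n_k, t_k in zip(counts, thetas):
        p *= t_k ** n_k / math.factorial(n_k)
    return p
```

The Binomial with $N = 1$ reduces to the Bernoulli, and the Multinomial with two outcomes reduces to the Binomial.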

Slide 17

A Toy Example. We do not know whether a coin is fair or not. We are told that heads occurred $n$ times in $N$ coin flips. We are asked to predict whether the next coin flip will result in a head or a tail. Let $y$ be a binary random variable such that $y = 1$ represents the event that the next coin flip will be a head and $y = 0$ that it will be a tail. We should predict heads if $p(y=1|n,N) > p(y=0|n,N)$.

Slide 18

The Maximum Likelihood Approach. Let $p(y=1|n,N) = \theta$ and $p(y=0|n,N) = 1-\theta$, so we should predict heads if $\theta > \tfrac{1}{2}$. How should we estimate $\theta$? Assuming that the observed coin flips followed a Binomial distribution, we can choose the value of $\theta$ that maximizes the likelihood of observing the data:
$\theta_{ML} = \arg\max_\theta p(n|\theta) = \arg\max_\theta \binom{N}{n}\theta^n(1-\theta)^{N-n} = \arg\max_\theta\, n\log\theta + (N-n)\log(1-\theta) = n/N$
We should predict heads if $n > \tfrac{1}{2}N$.
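The closed-form maximizer $n/N$ can be verified against a brute-force grid search over the log-likelihood. A small sketch with illustrative counts $n = 7$, $N = 10$:

```python
import math

def log_likelihood(theta, n, N):
    # log p(n | theta) up to constants: n log(theta) + (N - n) log(1 - theta)
    return n * math.log(theta) + (N - n) * math.log(1 - theta)

n, N = 7, 10
theta_ml = n / N  # closed-form maximum likelihood estimate

# grid search confirms n/N maximizes the log-likelihood
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda t: log_likelihood(t, n, N))
```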

Slide 19

The Maximum A Posteriori Approach. We should choose the value of $\theta$ that maximizes the posterior probability of $\theta$ conditioned on the data. We assume a Binomial likelihood $p(n|\theta) = \binom{N}{n}\theta^n(1-\theta)^{N-n}$ and a Beta prior $p(\theta|a,b) = \theta^{a-1}(1-\theta)^{b-1}\,\Gamma(a+b)/(\Gamma(a)\,\Gamma(b))$:
$\theta_{MAP} = \arg\max_\theta p(\theta|n,a,b) = \arg\max_\theta p(n|\theta)\, p(\theta|a,b) = \arg\max_\theta \theta^{n+a-1}(1-\theta)^{N-n+b-1} = (n+a-1)/(N+a+b-2)$
as if we had observed an extra $a-1$ heads and $b-1$ tails. We should predict heads if $n > \tfrac{1}{2}(N+b-a)$.
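The MAP estimate is one line of code. The sketch below (names are my own) also checks that a uniform Beta(1, 1) prior recovers the ML estimate, and packages the resulting decision rule:

```python
def theta_map(n, N, a, b):
    # MAP estimate under a Beta(a, b) prior: (n + a - 1) / (N + a + b - 2)
    return (n + a - 1) / (N + a + b - 2)

# a uniform Beta(1, 1) prior adds no pseudo-counts, recovering n / N
assert theta_map(7, 10, 1, 1) == 7 / 10

def predict_heads_map(n, N, a, b):
    # predict heads iff theta_MAP > 1/2, i.e. iff n > (N + b - a) / 2
    return theta_map(n, N, a, b) > 0.5
```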

Slide 20

The Bayesian Approach. We should marginalize over $\theta$:
$p(y=1|n,a,b) = \int_\theta p(y=1|\theta)\, p(\theta|a,b,n)\, d\theta = \int_\theta \theta\, p(\theta|a,b,n)\, d\theta = \int_\theta \theta\, \mathrm{Beta}(\theta|a+n, b+N-n)\, d\theta = (n+a)/(N+a+b)$
as if we had observed an extra $a$ heads and $b$ tails. We should predict heads if $n > \tfrac{1}{2}(N+b-a)$. The Bayesian and MAP predictions coincide in this case. In the large-data limit, both the Bayesian and the MAP prediction agree with the ML prediction ($n > \tfrac{1}{2}N$).
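The claim that the Bayesian and MAP decisions coincide can be checked exhaustively for a given prior. A sketch with illustrative hyperparameters $a = 3$, $b = 2$:

```python
def p_heads_bayes(n, N, a, b):
    # Bayesian posterior predictive under a Beta(a, b) prior: (n + a) / (N + a + b)
    return (n + a) / (N + a + b)

def theta_map(n, N, a, b):
    # MAP estimate from the previous slide: (n + a - 1) / (N + a + b - 2)
    return (n + a - 1) / (N + a + b - 2)

# both rules predict heads iff n > (N + b - a) / 2, so the decisions agree
a, b, N = 3.0, 2.0, 10
decisions_agree = all(
    (p_heads_bayes(n, N, a, b) > 0.5) == (theta_map(n, N, a, b) > 0.5)
    for n in range(N + 1)
)
```

Note the estimates themselves differ (pseudo-counts of $a, b$ versus $a-1, b-1$); only the threshold decision coincides.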

Slide 21

Classification

Slide 22

Binary Classification

Slide 23

Approaches to Classification
- Memorization: cannot deal with previously unseen data; the cost of acquiring large-scale annotated data may be high.
- Rule-based expert systems: dependent on the competence of the expert; complex problems lead to a proliferation of rules, exceptions, exceptions to exceptions, etc.; rules might not transfer to similar problems.
- Learning from training data and prior knowledge: focuses on generalization to novel data.

Slide 24

Notation
- Training data: a set of $N$ labelled examples of the form $(\mathbf{x}_i, y_i)$
- Feature vector: $\mathbf{x} \in \mathbb{R}^D$, $X = [\mathbf{x}_1\, \mathbf{x}_2 \dots \mathbf{x}_N]$
- Label: $y \in \{\pm 1\}$, $\mathbf{y} = [y_1, y_2, \dots, y_N]^t$, $Y = \mathrm{diag}(\mathbf{y})$
Example – gender identification: $(\mathbf{x}_1, y_1 = +1)$, $(\mathbf{x}_2, y_2 = +1)$, $(\mathbf{x}_3, y_3 = +1)$, $(\mathbf{x}_4, y_4 = -1)$ [the $\mathbf{x}_i$ are face images in the slide]
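The notation maps directly onto NumPy arrays. A sketch with hypothetical toy data ($N = 4$ examples, $D = 3$ features; the values are illustrative):

```python
import numpy as np

# hypothetical feature vectors x_i in R^3
x1, x2, x3, x4 = (np.array(v, dtype=float) for v in
                  ([1, 0, 2], [0, 1, 1], [2, 2, 0], [1, 1, 1]))

X = np.column_stack([x1, x2, x3, x4])  # D x N matrix of feature vectors
y = np.array([+1, +1, +1, -1])         # labels in {+1, -1}
Y = np.diag(y)                         # Y = diag(y)
```

Stacking labels into the diagonal matrix $Y$ is convenient later: products like $YX^t$ attach each example's sign to its feature vector in one matrix operation.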

Slide 25

Binary Classification

Slide 26

Binary Classification. [Diagram: a separating hyperplane $\mathbf{w}^t\mathbf{x} + b = 0$ with parameters $\theta = [\mathbf{w}; b]$]

Slide 27

Bayes' Decision Rule: $p(y=+1|\mathbf{x}) > p(y=-1|\mathbf{x})\ ?\ y = +1 : y = -1$, or equivalently $p(y=+1|\mathbf{x}) > \tfrac{1}{2}\ ?\ y = +1 : y = -1$.

Slide 28

Issues to Think About
- Bayesian versus MAP versus ML: Should we pick only one function to explain the data? If yes, should this be the function that explains the data best? What about prior knowledge?
- Generative versus discriminative: Can we learn from "positive" data alone? Should we model the data distribution? Are there any missing variables? Do we care only about the final decision?

Slide 29

Bayesian Approach
$p(y|\mathbf{x},X,Y) = \int_f p(y,f|\mathbf{x},X,Y)\, df = \int_f p(y|f,\mathbf{x},X,Y)\, p(f|\mathbf{x},X,Y)\, df = \int_f p(y|f,\mathbf{x})\, p(f|X,Y)\, df$
This integral is often intractable. To solve it we can:
- choose the distributions so that the solution is analytical (conjugate priors)
- approximate the true distribution $p(f|X,Y)$ by a simpler distribution (variational methods)
- sample from $p(f|X,Y)$ (MCMC)
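For the earlier coin example all three strategies can be compared against a closed form: the Beta prior is conjugate, so the predictive integral evaluates to $(n+a)/(N+a+b)$, and a simple sampling approximation (a minimal stand-in for the MCMC idea, using direct posterior samples rather than a Markov chain) should agree with it:

```python
import random

n, N, a, b = 7, 10, 2.0, 2.0
random.seed(0)

# sample theta ~ Beta(a + n, b + N - n) and average p(y=1 | theta) = theta
samples = [random.betavariate(a + n, b + N - n) for _ in range(200_000)]
mc_estimate = sum(samples) / len(samples)

exact = (n + a) / (N + a + b)  # conjugate-prior closed form
```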

Slide 30

Maximum A Posteriori (MAP)
$p(y|\mathbf{x},X,Y) = \int_f p(y|f,\mathbf{x})\, p(f|X,Y)\, df = p(y|f_{MAP},\mathbf{x})$ when $p(f|X,Y) = \delta(f - f_{MAP})$
The more training data there is, the better $p(f|X,Y)$ approximates a delta function. We can then make predictions using a single function, $f_{MAP}$, and our focus shifts to estimating $f_{MAP}$.

Slide 31

MAP & Maximum Likelihood (ML)
$f_{MAP} = \arg\max_f p(f|X,Y) = \arg\max_f p(X,Y|f)\, p(f)/p(X,Y) = \arg\max_f p(X,Y|f)\, p(f)$
$f_{ML} = \arg\max_f p(X,Y|f)$ (maximum likelihood)
Maximum likelihood holds if there is a lot of training data, so that $p(X,Y|f) \gg p(f)$, or if there is no prior knowledge, so that $p(f)$ is uniform (improper).

Slide 32

IID Data
$f_{ML} = \arg\max_f p(X,Y|f) = \arg\max_f \prod_i p(\mathbf{x}_i, y_i|f)$
The independent and identically distributed assumption holds only if we know everything about the joint distribution of the features and labels. In particular, $p(X,Y) \ne \prod_i p(\mathbf{x}_i, y_i)$.

Slide 33

Generative Methods – Naïve Bayes

Slide 34

 MAP = argmax  p (  )  I p ( x i , y i |  ) = argmax  p (  x ) p (  y )  I p ( x i , y i |  ) = argmax  p (  x ) p (  y )  I p ( x i | y i ,  ) p ( y i |  ) = argmax  p (  x ) p (  y )  I p ( x i | y i ,  ) p ( y i |  ) = [argmax x p (  x )  I p ( x i | y i ,  x )] * [argmax y p (  y )  I p ( y i |  y )]  x and  y can be unraveled for freely The parameters of every class decouple and can be illuminated for autonomously Generative Methods

Slide 35

Generative Methods (continued). The parameters of each class also decouple and can be solved for independently.

Slide 36

 MAP = [argmax x p (  x )  I p ( x i | y i ,  x )] * [argmax y p (  y )  I p ( y i |  x )] Naïve Bayes presumptions Independent Gaussian elements p ( x i | y i ,  x ) =  j p ( x ij | y i ,  x ) p ( x ij | y i =  1,  x ) = N ( x ij |  j  1 ,  i ) Improper uniform priors (no earlier information) p (  x ) = p (  y ) = const Bernoulli names p ( y i = + 1|  y ) =  , p ( y i = - 1|  y ) = 1- Generative M