
Introduction to Machine Learning
Manik Varma, Microsoft Research India
http://research.microsoft.com/~manik
manik@microsoft.com

Binary Classification
Is this person Madhubala or not? Is this person male or female? Is this person beautiful or not?

Multi-Class Classification
Is this person Madhubala, Lalu or Rakhi Sawant? Is this person happy, sad, angry or bewildered?

Ordinal Regression
Is this person very beautiful, beautiful, ordinary or ugly?

Regression
How beautiful is this person on a continuous scale of 1 to 10? 9.99?

Ranking
Rank these people in decreasing order of attractiveness.

Multi-Label Classification
Tag this image with the set of relevant labels from {female, Madhubala, beautiful, IITD faculty}

Are These Problems Distinct?
Can regression solve all these problems?
  Binary classification: predict $p(y = 1 \mid x)$
  Multi-class classification: predict $p(y = k \mid x)$
  Ordinal regression: predict $p(y = k \mid x)$
  Ranking: predict and sort by relevance
  Multi-label classification: predict $p(y \in \{\pm 1\}^k \mid x)$
Learning from experience and data
  In what form can the training data be obtained?
  What is known a priori?
Complexity of training
Complexity of prediction

In This Course
Supervised learning
  Classification
    Generative methods: Nearest neighbor, Naïve Bayes
    Discriminative methods: Logistic Regression
    Discriminant methods: Support Vector Machines
  Regression, Ranking, Feature Selection, etc.
Unsupervised learning
Semi-supervised learning
Reinforcement learning

Learning from Noisy Data
Noise and uncertainty
  Unknown generative model $Y = f(X)$
  Noise in measuring input and feature extraction
  Noise in labels
  Nuisance variables
  Missing data
  Finite training set size

Under and Over Fitting

Probability Theory
Non-negativity and unit measure: $0 \le p(y)$, $p(\Omega) = 1$, $p(\emptyset) = 0$
Conditional probability $p(y \mid x)$: $p(x, y) = p(y \mid x)\, p(x) = p(x \mid y)\, p(y)$
Bayes' Theorem: $p(y \mid x) = p(x \mid y)\, p(y) / p(x)$
Marginalization: $p(x) = \int_y p(x, y)\, dy$
Independence: $p(x_1, x_2) = p(x_1)\, p(x_2) \Leftrightarrow p(x_1 \mid x_2) = p(x_1)$
Reference: Chris Bishop, "Pattern Recognition & Machine Learning"
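The identities on this slide can be checked numerically on a small discrete example. The sketch below is only an illustration: the joint table `joint` and all function names are made up for this purpose and do not come from the slides.

```python
# Hypothetical joint distribution p(x, y) over x in {0, 1} and y in {0, 1}.
# The numbers are invented for illustration; they only need to sum to 1.
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}


def marginal_x(x):
    """Marginalization: p(x) = sum over y of p(x, y)."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(y):
    """Marginalization: p(y) = sum over x of p(x, y)."""
    return sum(p for (_, yi), p in joint.items() if yi == y)


def cond_y_given_x(y, x):
    """Conditional probability: p(y | x) = p(x, y) / p(x)."""
    return joint[(x, y)] / marginal_x(x)


def cond_x_given_y(x, y):
    """Conditional probability: p(x | y) = p(x, y) / p(y)."""
    return joint[(x, y)] / marginal_y(y)


def bayes_y_given_x(y, x):
    """Bayes' theorem: p(y | x) = p(x | y) p(y) / p(x)."""
    return cond_x_given_y(x, y) * marginal_y(y) / marginal_x(x)
```

Both routes to $p(y \mid x)$, the direct definition and Bayes' theorem, should agree up to floating point error.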

The Univariate Gaussian Density
$p(x \mid \mu, \sigma) = \exp\left(-(x - \mu)^2 / 2\sigma^2\right) / (2\pi\sigma^2)^{1/2}$
(Plot of the density, x-axis from -3 to 3.)
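As a quick sanity check of the univariate density, the following sketch evaluates it and integrates it numerically. The helper `integrate` is a plain midpoint rule written for this illustration, not something taken from the slides.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)


def integrate(f, lo, hi, steps=10000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h


# The density should integrate to (approximately) 1 over a wide enough range.
area = integrate(lambda x: gaussian_pdf(x, 0.0, 1.0), -8.0, 8.0)
```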

The Multivariate Gaussian Density
$p(x \mid \mu, \Sigma) = \exp\left(-\tfrac{1}{2} (x - \mu)^t \Sigma^{-1} (x - \mu)\right) / \left((2\pi)^{D/2} |\Sigma|^{1/2}\right)$

The Beta Density
$p(\theta \mid a, b) = \theta^{a-1} (1 - \theta)^{b-1}\, \Gamma(a + b) / \left(\Gamma(a)\, \Gamma(b)\right)$

Probability Distribution Functions
Bernoulli: single trial with probability of success $\theta$
  $n \in \{0, 1\}$, $\theta \in [0, 1]$
  $p(n \mid \theta) = \theta^n (1 - \theta)^{1-n}$
Binomial: $N$ iid Bernoulli trials with $n$ successes
  $n \in \{0, 1, \ldots, N\}$, $\theta \in [0, 1]$
  $p(n \mid N, \theta) = \binom{N}{n} \theta^n (1 - \theta)^{N-n}$
Multinomial: $N$ iid trials, outcome $k$ occurs $n_k$ times
  $n_k \in \{0, 1, \ldots, N\}$, $\sum_k n_k = N$, $\theta_k \in [0, 1]$, $\sum_k \theta_k = 1$
  $p(n \mid N, \theta) = N! \prod_k \theta_k^{n_k} / n_k!$
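The three mass functions can be written down directly from their formulas. This is a minimal sketch using only the Python standard library; the function names are chosen here for illustration.

```python
import math


def bernoulli_pmf(n, theta):
    """p(n | theta) = theta^n (1 - theta)^(1 - n), for n in {0, 1}."""
    return theta ** n * (1 - theta) ** (1 - n)


def binomial_pmf(n, N, theta):
    """p(n | N, theta) = C(N, n) theta^n (1 - theta)^(N - n)."""
    return math.comb(N, n) * theta ** n * (1 - theta) ** (N - n)


def multinomial_pmf(counts, thetas):
    """p(n | N, theta) = N! * prod_k theta_k^(n_k) / n_k!, with N = sum of counts."""
    N = sum(counts)
    p = float(math.factorial(N))
    for n_k, theta_k in zip(counts, thetas):
        p *= theta_k ** n_k / math.factorial(n_k)
    return p
```

With two outcomes the multinomial reduces to the binomial, so `multinomial_pmf([3, 7], [0.3, 0.7])` and `binomial_pmf(3, 10, 0.3)` should agree up to rounding.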

A Toy Example
We don't know whether a coin is fair or not. We are told that heads occurred $n$ times in $N$ coin flips.
We are asked to predict whether the next coin flip will result in a head or a tail.
Let $y$ be a binary random variable such that $y = 1$ represents the event that the next coin flip will be a head and $y = 0$ that it will be a tail.
We should predict heads if $p(y = 1 \mid n, N) > p(y = 0 \mid n, N)$.

The Maximum Likelihood Approach
Let $p(y = 1 \mid n, N) = \theta$ and $p(y = 0 \mid n, N) = 1 - \theta$, so we should predict heads if $\theta > 1/2$.
How should we estimate $\theta$?
Assuming that the observed coin flips followed a Binomial distribution, we could choose the value of $\theta$ that maximizes the likelihood of observing the data:
$\theta_{ML} = \arg\max_\theta p(n \mid \theta) = \arg\max_\theta \binom{N}{n} \theta^n (1 - \theta)^{N-n} = \arg\max_\theta\, n \log\theta + (N - n) \log(1 - \theta) = n/N$
We should predict heads if $n > \tfrac{1}{2} N$.
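The closed form $n/N$ can be compared against a brute-force search over the log-likelihood. The sketch below is an illustration only; the grid search `grid_argmax` is a helper invented here, not part of the slides.

```python
import math


def log_likelihood(theta, n, N):
    """Binomial log-likelihood up to the constant log C(N, n)."""
    return n * math.log(theta) + (N - n) * math.log(1 - theta)


def theta_ml(n, N):
    """Closed-form maximum likelihood estimate from the slide: n / N."""
    return n / N


def grid_argmax(n, N, steps=1000):
    """Brute-force maximizer over a grid in the open interval (0, 1)."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda t: log_likelihood(t, n, N))


# Example: 7 heads in 10 flips, so we would predict heads (0.7 > 1/2).
estimate = theta_ml(7, 10)
numeric = grid_argmax(7, 10)
```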

The Maximum A Posteriori Approach
We should choose the value of $\theta$ that maximizes the posterior probability of $\theta$ conditioned on the data.
We assume a
  Binomial likelihood: $p(n \mid \theta) = \binom{N}{n} \theta^n (1 - \theta)^{N-n}$
  Beta prior: $p(\theta \mid a, b) = \theta^{a-1} (1 - \theta)^{b-1}\, \Gamma(a + b) / \left(\Gamma(a)\, \Gamma(b)\right)$
$\theta_{MAP} = \arg\max_\theta p(\theta \mid n, a, b) = \arg\max_\theta p(n \mid \theta)\, p(\theta \mid a, b) = \arg\max_\theta \theta^n (1 - \theta)^{N-n}\, \theta^{a-1} (1 - \theta)^{b-1} = (n + a - 1)/(N + a + b - 2)$
This is as if we had seen an extra $a - 1$ heads and $b - 1$ tails.
We should predict heads if $n > \tfrac{1}{2}(N + b - a)$.
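A small sketch of the MAP estimate and the resulting decision. It assumes $a, b \ge 1$ and $N + a + b > 2$ so that the formula is well defined; the function names are illustrative.

```python
def theta_map(n, N, a, b):
    """MAP estimate under a Binomial likelihood and a Beta(a, b) prior.

    Assumes a >= 1, b >= 1 and N + a + b > 2 so the mode is well defined.
    """
    return (n + a - 1) / (N + a + b - 2)


def predict_heads_map(n, N, a, b):
    """Predict heads when the MAP estimate exceeds 1/2."""
    return theta_map(n, N, a, b) > 0.5


# With a uniform prior (a = b = 1) the MAP estimate equals the ML estimate n / N.
uniform_prior_estimate = theta_map(7, 10, 1, 1)
# A Beta(2, 2) prior pulls the estimate towards 1/2: (7 + 1) / (10 + 2) = 2/3.
shrunk_estimate = theta_map(7, 10, 2, 2)
```

The decision `predict_heads_map` should match the threshold $n > \tfrac{1}{2}(N + b - a)$ stated on the slide.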

The Bayesian Approach
We should marginalize over $\theta$:
$p(y = 1 \mid n, a, b) = \int p(y = 1 \mid n, \theta)\, p(\theta \mid a, b, n)\, d\theta = \int \theta\, p(\theta \mid a, b, n)\, d\theta = \int \theta\, \mathrm{Beta}(\theta \mid a + n, b + N - n)\, d\theta = (n + a)/(N + a + b)$
This is as if we had seen an extra $a$ heads and $b$ tails.
We should predict heads if $n > \tfrac{1}{2}(N + b - a)$.
The Bayesian and MAP predictions coincide in this case.
In the large data limit, both the Bayesian and MAP predictions coincide with the ML prediction ($n > \tfrac{1}{2} N$).
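The three estimates for the coin example can be put side by side to see how they differ on small data and agree on large data. This is a sketch with illustrative function names and made-up counts.

```python
def predict_ml(n, N):
    """Maximum likelihood estimate of p(heads): n / N."""
    return n / N


def predict_map(n, N, a, b):
    """MAP estimate under a Beta(a, b) prior: (n + a - 1) / (N + a + b - 2)."""
    return (n + a - 1) / (N + a + b - 2)


def predict_bayes(n, N, a, b):
    """Bayesian predictive probability of heads: (n + a) / (N + a + b)."""
    return (n + a) / (N + a + b)


# Small data set (3 heads in 4 flips) with a Beta(2, 2) prior: the estimates differ.
small = (predict_ml(3, 4), predict_map(3, 4, 2, 2), predict_bayes(3, 4, 2, 2))

# Large data set with the same proportion of heads: the estimates nearly agree.
large = (predict_ml(7500, 10000),
         predict_map(7500, 10000, 2, 2),
         predict_bayes(7500, 10000, 2, 2))
```

For the small data set the values are 0.75, about 0.667 and 0.625; for the large one all three lie within 0.001 of 0.75.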

Classification

Binary Classification

Approaches to Classification
Memorization
  Cannot deal with previously unseen data
  Large scale annotated data acquisition cost might be high
Rule based expert system
  Dependent on the competence of the expert
  Complex problems lead to a proliferation of rules, exceptions, exceptions to exceptions, etc.
  Rules might not transfer to similar problems
Learning from training data and prior knowledge
  Focuses on generalization to novel data

Notation
Training Data
  Set of $N$ labeled examples of the form $(x_i, y_i)$
  Feature vector: $x \in \mathbb{R}^D$, $X = [x_1\ x_2\ \ldots\ x_N]$
  Label: $y \in \{\pm 1\}$, $\mathbf{y} = [y_1, y_2, \ldots, y_N]^t$, $Y = \mathrm{diag}(\mathbf{y})$
Example: Gender Identification
  ($x_1$ = [image], $y_1 = +1$) ($x_2$ = [image], $y_2 = +1$) ($x_3$ = [image], $y_3 = +1$) ($x_4$ = [image], $y_4 = -1$)

Binary Classification

Binary Classification
Decision boundary $w^t x + b = 0$, with parameters $\theta = [w; b]$

Bayes' Decision Rule
$p(y = +1 \mid x) > p(y = -1 \mid x)$ ? $y = +1$ : $y = -1$
$\Leftrightarrow$ $p(y = +1 \mid x) > 1/2$ ? $y = +1$ : $y = -1$
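The two forms of the rule are equivalent because the two posterior probabilities sum to one. A minimal sketch, with function names chosen for illustration:

```python
def bayes_decision(p_pos):
    """Return +1 if p(y = +1 | x) > 1/2, otherwise -1."""
    return 1 if p_pos > 0.5 else -1


def bayes_decision_two_sided(p_pos, p_neg):
    """Return +1 if p(y = +1 | x) > p(y = -1 | x), otherwise -1."""
    return 1 if p_pos > p_neg else -1
```

In both forms a tie at exactly 1/2 falls to $y = -1$, matching the strict inequality on the slide.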

Issues to Think About
Bayesian versus MAP versus ML
  Should we choose just one function to explain the data?
  If yes, should this be the function that explains the data the best?
  What about prior knowledge?
Generative versus Discriminative
  Can we learn from "positive" data alone?
  Should we model the data distribution?
  Are there any missing variables?
  Do we just care about the final decision?

Bayesian Approach
$p(y \mid x, X, Y) = \int_f p(y, f \mid x, X, Y)\, df = \int_f p(y \mid f, x, X, Y)\, p(f \mid x, X, Y)\, df = \int_f p(y \mid f, x)\, p(f \mid X, Y)\, df$
This integral is often intractable. To solve it we can
  Choose the distributions so that the solution is analytic (conjugate priors)
  Approximate the true distribution $p(f \mid X, Y)$ by a simpler distribution (variational methods)
  Sample from $p(f \mid X, Y)$ (MCMC)
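For the coin example earlier in the deck, the conjugate Beta prior gives the integral in closed form, which makes it a convenient test case for the sampling idea. The sketch below uses plain Monte Carlo sampling from the known Beta posterior via the standard library, not MCMC; it is an illustration under that stated simplification, with made-up counts and function names.

```python
import random


def bayes_predictive_exact(n, N, a, b):
    """Analytic predictive probability of heads with a conjugate Beta(a, b) prior."""
    return (n + a) / (N + a + b)


def bayes_predictive_mc(n, N, a, b, num_samples=20000, seed=0):
    """Monte Carlo estimate of the same integral.

    Draws theta from the posterior Beta(a + n, b + N - n) and averages
    p(y = 1 | theta) = theta over the samples.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        theta = rng.betavariate(a + n, b + N - n)
        total += theta
    return total / num_samples


exact = bayes_predictive_exact(3, 4, 2, 2)    # 5 / 8 = 0.625
approx = bayes_predictive_mc(3, 4, 2, 2)      # close to 0.625, up to sampling noise
```

When the posterior cannot be sampled directly, an MCMC method would replace the `betavariate` call, but the averaging step stays the same.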

Maximum A Posteriori (MAP)
$p(y \mid x, X, Y) = \int_f p(y \mid f, x)\, p(f \mid X, Y)\, df = p(y \mid f_{MAP}, x)$ when $p(f \mid X, Y) = \delta(f - f_{MAP})$
The more training data there is, the better $p(f \mid X, Y)$ approximates a delta function.
We can make predictions using a single function, $f_{MAP}$, and our focus shifts to estimating $f_{MAP}$.

MAP & Maximum Likelihood (ML)
$f_{MAP} = \arg\max_f p(f \mid X, Y) = \arg\max_f p(X, Y \mid f)\, p(f) / p(X, Y) = \arg\max_f p(X, Y \mid f)\, p(f)$
$f_{ML} = \arg\max_f p(X, Y \mid f)$ (Maximum Likelihood)
Maximum Likelihood holds if
  There is a lot of training data so that $p(X, Y \mid f) \gg p(f)$
  Or if there is no prior knowledge so that $p(f)$ is uniform (improper)

IID Data
$f_{ML} = \arg\max_f p(X, Y \mid f) = \arg\max_f \prod_i p(x_i, y_i \mid f)$
The independent and identically distributed assumption holds only if we know everything about the joint distribution of the features and labels.
In particular, $p(X, Y) \ne \prod_i p(x_i, y_i)$

Generative Methods Naïve Bayes

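Generative Methods
$\theta_{MAP} = \arg\max_\theta p(\theta) \prod_i p(x_i, y_i \mid \theta)$
$= \arg\max_{\theta_x, \theta_y} p(\theta_x)\, p(\theta_y) \prod_i p(x_i, y_i \mid \theta)$
$= \arg\max_{\theta_x, \theta_y} p(\theta_x)\, p(\theta_y) \prod_i p(x_i \mid y_i, \theta)\, p(y_i \mid \theta)$
$= \arg\max_{\theta_x, \theta_y} p(\theta_x)\, p(\theta_y) \prod_i p(x_i \mid y_i, \theta_x)\, p(y_i \mid \theta_y)$
$= [\arg\max_{\theta_x} p(\theta_x) \prod_i p(x_i \mid y_i, \theta_x)]$ * $[\arg\max_{\theta_y} p(\theta_y) \prod_i p(y_i \mid \theta_y)]$
$\theta_x$ and $\theta_y$ can be solved for independently.
The parameters of each class decouple and can be solved for independently.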

Generative Methods
The parameters of each class decouple and can be solved for independently.
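The decoupling can be made concrete with a small Gaussian Naïve Bayes sketch. It assumes independent Gaussian features within each class, Bernoulli labels and uniform priors on the parameters, so that the MAP estimate reduces to maximum likelihood. The per-class, per-feature variance used here is an assumption of this sketch, as are the function names and the toy data; both classes must be present and no feature may be constant within a class.

```python
import math


def fit_naive_bayes(X, y):
    """Fit class prior and per-class Gaussian parameters by maximum likelihood.

    X is a list of feature lists, y a list of labels in {+1, -1}.
    Each class is fitted using only its own rows, which is the decoupling.
    """
    params = {}
    for label in (+1, -1):
        rows = [x for x, yi in zip(X, y) if yi == label]
        D = len(rows[0])
        means = [sum(r[j] for r in rows) / len(rows) for j in range(D)]
        variances = [sum((r[j] - means[j]) ** 2 for r in rows) / len(rows)
                     for j in range(D)]
        params[label] = (means, variances)
    theta = sum(1 for yi in y if yi == +1) / len(y)  # estimate of p(y = +1)
    return theta, params


def log_gaussian(x, mu, var):
    """Log of the univariate Gaussian density with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def predict(x, theta, params):
    """Return the label with the larger log joint probability log p(x, y)."""
    scores = {}
    for label, prior in ((+1, theta), (-1, 1 - theta)):
        means, variances = params[label]
        scores[label] = math.log(prior) + sum(
            log_gaussian(xj, m, v) for xj, m, v in zip(x, means, variances))
    return +1 if scores[+1] > scores[-1] else -1


# Toy data, invented for illustration: two well separated classes in 2-D.
X_train = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]]
y_train = [+1, +1, +1, -1, -1, -1]
theta, params = fit_naive_bayes(X_train, y_train)
```

Calling `predict([1.1, 1.9], theta, params)` classifies a point near the first cluster, and `predict([3.1, 0.4], theta, params)` one near the second.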

$\theta_{MAP} = [\arg\max_{\theta_x} p(\theta_x) \prod_i p(x_i \mid y_i, \theta_x)]$ * $[\arg\max_{\theta_y} p(\theta_y) \prod_i p(y_i \mid \theta_y)]$
Naïve Bayes assumptions
  Independent Gaussian features: $p(x_i \mid y_i, \theta_x) = \prod_j p(x_{ij} \mid y_i, \theta_x)$, $p(x_{ij} \mid y_i = \pm 1, \theta_x) = N(x_{ij} \mid \mu_j^{\pm 1}, \sigma_i)$
  Improper uniform priors (no prior knowledge): $p(\theta_x) = p(\theta_y) = \text{const}$
  Bernoulli labels: $p(y_i = +1 \mid \theta_y) = \theta$, $p(y_i = -1 \mid \theta_y) = 1 - \theta$
Generative M
