# Learning and Vision: Discriminative Models

Part I: Fundamentals. Part II: Algorithms and Applications: Support Vector Machines (face and pedestrian detection); AdaBoost (faces); Building Fast Classifiers (trading off speed for accuracy).

### Presentation Transcript

Slide 1

Learning and Vision: Discriminative Models. Chris Bishop and Paul Viola

Slide 2

Part II: Algorithms and Applications

- Part I: Fundamentals
- Part II: Algorithms and Applications
  - Support Vector Machines: face and pedestrian detection
  - AdaBoost: faces
  - Building Fast Classifiers: trading off speed for accuracy… Face and object detection
  - Memory Based Learning: Simard, Moghaddam

Slide 3

History Lesson

- 1950s: Perceptrons are cool. Very simple learning rule, can learn "complex" concepts. Generalized perceptrons are better, but have too many weights.
- 1960s: Perceptrons stink (M+P). Some simple concepts require an exponential # of features. Can't possibly learn that, right?
- 1980s: MLPs are cool (R+M/PDP). Sort of simple learning rule, can learn anything (?). Create just the features you need.
- 1990: MLPs stink. Hard to train: slow, local minima.
- 1996: Perceptrons are cool.

Slide 4

Why did we need multi-layer perceptrons? Problems like this seem to require very complex non-linearities. Minsky and Papert showed that an exponential number of features is necessary to solve generic problems.

Slide 5

Why an exponential number of features? 14th order??? 120 features. N=21, k=5 -> 65,000 features.
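The slide's two counts are consistent with counting every monomial of total degree at most k in N input variables, which is the binomial coefficient C(N + k, k). The sketch below assumes that reading; the function name `num_monomials` is illustrative, and the choice of two inputs for the 14th-order case is an inference from the number 120, not something the transcript states.

```python
from math import comb


def num_monomials(n_inputs: int, max_order: int) -> int:
    """Count monomials of total degree <= max_order in n_inputs variables.

    The constant term is included in the count.
    """
    return comb(n_inputs + max_order, max_order)


# Assumed reading of the slide: a 14th-order polynomial in 2 inputs.
print(num_monomials(2, 14))   # 120
# N = 21 inputs, order k = 5: about 65,000 features.
print(num_monomials(21, 5))   # 65780
```

The count grows quickly in both the number of inputs and the order, which is the practical obstacle to building generalized perceptrons with explicit polynomial features.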

Slide 6

MLPs vs. Perceptrons

- MLPs are hard to train: takes a long time (unpredictably long); can converge to poor minima.
- MLPs are hard to understand: what are they really doing?
- Perceptrons are easy to train: a type of linear programming; polynomial time; one minimum, which is global.
- Generalized perceptrons are easier to understand: polynomial functions.

Slide 7

Perceptron Training is Linear Programming. Polynomial time in the number of variables and in the number of constraints. What about linearly inseparable data?

Slide 8

Support Vector Machines: Rebirth of Perceptrons

- How to train effectively? Linear Programming (… later quadratic programming), though on-line works great too.
- How to get so many features inexpensively?!? Kernel Trick.
- How to generalize with so many features? VC dimension. (Or is it regularization?)

Slide 9

Lemma 1: Weight vectors are simple. The weight vector lives in a sub-space spanned by the examples… Dimensionality is determined by the number of examples, not the complexity of the space.

Slide 10

Lemma 2: Only need to compare examples
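The two lemmas can be illustrated with a perceptron written in dual form: the weight vector is never stored, only a mistake count per training example, and every score is computed by comparing examples to each other. This is a minimal sketch under stated assumptions: the function names, the toy data, and the trick of appending a constant 1 to each input in place of a bias term are illustrative choices, not taken from the slides.

```python
def dot(a, b):
    """Plain inner product; any kernel function could be passed instead."""
    return sum(ai * bi for ai, bi in zip(a, b))


def train_dual_perceptron(xs, ys, kernel=dot, epochs=100):
    """Return per-example mistake counts alpha.

    The implied weight vector is sum(alpha[i] * ys[i] * xs[i]), so it lies
    in the span of the examples (Lemma 1).
    """
    alpha = [0] * len(xs)
    for _ in range(epochs):
        mistakes = 0
        for j, (x, y) in enumerate(zip(xs, ys)):
            # The score only needs kernel(x_i, x): comparisons of examples (Lemma 2).
            score = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, ys, xs))
            if y * score <= 0:
                alpha[j] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha


def predict(alpha, xs, ys, x, kernel=dot):
    score = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, ys, xs))
    return 1 if score > 0 else -1


# Toy data; the trailing 1.0 in each input plays the role of a bias term.
xs = [(2.0, 2.0, 1.0), (3.0, 1.0, 1.0), (-1.0, -1.0, 1.0), (-2.0, 0.5, 1.0)]
ys = [1, 1, -1, -1]
alpha = train_dual_perceptron(xs, ys)
print([predict(alpha, xs, ys, x) for x in xs])
```

Because the inputs appear only inside `kernel`, swapping in a non-linear kernel changes the feature space without changing the training loop.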

Slide 11

Simple Kernels yield Complex Features
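As a small check of this claim, the degree-2 polynomial kernel on two-dimensional inputs equals an inner product in a six-dimensional feature space. The kernel form `(1 + x·z)^2` and the explicit feature map below are the standard textbook pair, used here as an assumption since the slide's own formulas are not in the transcript.

```python
from math import isclose, sqrt


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel on 2-D inputs: (1 + x.z)^2."""
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2


def poly2_features(x):
    """Explicit features whose inner product reproduces poly2_kernel."""
    x1, x2 = x
    r2 = sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)


x, z = (0.5, -1.5), (2.0, 3.0)
explicit = sum(a * b for a, b in zip(poly2_features(x), poly2_features(z)))
print(poly2_kernel(x, z), explicit)
```

The kernel costs one inner product and one squaring regardless of how many explicit features it stands for, which is what makes very large feature sets affordable.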

Slide 12

But Kernel Perceptrons Can Generalize Poorly

Slide 13

Perceptron Rebirth: Generalization. Too many features … Occam is unhappy. Perhaps we should encourage smoothness? [Figure label: "Smoother"]

Slide 14

Linear Program is not unique. The linear program can return any multiple of the correct weight vector... Slack variables & weight prior: force the solution toward zero.

Slide 15

Definition of the Margin. Geometric margin: the gap between negatives and positives, measured perpendicular to a hyperplane. Classifier margin: [equation not recovered]
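The slide's formulas are not in the transcript, so the sketch below uses one common convention as an assumption: the classifier margin is the smallest value of `y * (w.x + b)` over the examples, and the geometric margin divides that by the length of `w`, giving the perpendicular distance from the hyperplane to the closest example. All names are illustrative.

```python
from math import isclose, sqrt


def classifier_margin(w, b, xs, ys):
    """Smallest signed score y * (w.x + b); it scales with w."""
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
               for x, y in zip(xs, ys))


def geometric_margin(w, b, xs, ys):
    """Perpendicular distance from the hyperplane to the closest example."""
    norm = sqrt(sum(wi * wi for wi in w))
    return classifier_margin(w, b, xs, ys) / norm


xs = [(3.0, 4.0), (-1.0, -2.0)]
ys = [1, -1]
w, b = (3.0, 4.0), 0.0
w10 = tuple(10.0 * wi for wi in w)

print(classifier_margin(w, b, xs, ys), geometric_margin(w, b, xs, ys))
print(classifier_margin(w10, b, xs, ys), geometric_margin(w10, b, xs, ys))
```

Multiplying `w` by 10 multiplies the classifier margin by 10 but leaves the geometric margin unchanged, which is why a constraint on the score alone does not pin down a unique weight vector.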

Slide 16

Require non-zero margin. One constraint allows solutions with zero margin; the stricter one enforces a non-zero margin between the examples and the decision boundary.

Slide 17

Constrained Optimization. Find the smoothest function that separates the data. Quadratic Programming (similar to Linear Programming). Single minimum. Polynomial-time algorithm.
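For reference, the standard hard-margin SVM problem has this shape. It is given here as an assumption about what the slide showed, since the transcript preserves no equations; the symbols $w$ (weights), $b$ (offset), $x_i$ (examples), and $y_i \in \{-1, +1\}$ (labels) are the usual textbook ones.

```latex
\begin{aligned}
\min_{w,\,b}\quad & \tfrac{1}{2}\,\lVert w \rVert^{2} \\
\text{subject to}\quad & y_i\,(w^{\top} x_i + b) \ge 1, \qquad i = 1, \dots, n
\end{aligned}
```

The objective is quadratic and the constraints are linear, which is what makes this a quadratic program with a single minimum. Requiring the score to be at least 1 rather than at least 0 is the step that rules out zero-margin solutions.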

Slide 18

Constrained Optimization 2

Slide 19

SVM: examples

Slide 20

SVM: Key Ideas

- Augment inputs with a very large feature set (polynomials, etc.)
- Use the Kernel Trick(TM) to do this efficiently
- Enforce/encourage smoothness with a weight penalty
- Introduce margin
- Find the best solution using Quadratic Programming

Slide 21

SVM: Zip Code recognition. Data dimension: 256. Feature space: 4th order, roughly 100,000,000 dimensions.

Slide 22

The Classical Face Detection Process. Scan from the smallest scale to larger scales: 50,000 locations/scales.

Slide 23

Classifier is Learned from Labeled Data

- Training data: 5000 faces, all frontal; 10^8 non-faces. Faces are normalized for scale and translation.
- Many variations: across individuals, illumination, pose (rotation both in plane and out).

Slide 24

Key Properties of Face Detection

- Each image contains 10 - 50 thousand locations/scales
- Faces are rare: 0 - 50 per image
- 1000 times as many non-faces as faces
- Extremely small # of false positives: 10^-6

Slide 25

Sung and Poggio

Slide 26

Rowley, Baluja & Kanade: First Fast System, Low Res to Hi

Slide 27

Osuna, Freund, and Girosi

Slide 28

Support Vectors

Slide 29

P, O, & G: First Pedestrian Work

Slide 30

On to AdaBoost

- Given a set of weak classifiers, none much better than random
- Iteratively combine classifiers: form a linear combination
- Training error converges to 0 quickly
- Test error is related to training margin

Slide 31

AdaBoost (Freund & Schapire). Weak classifier 1; weights increased; weak classifier 2; weak classifier 3. The final classifier is a linear combination of the weak classifiers.
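The sequence on this slide (train a weak classifier, increase the weights of the examples it gets wrong, repeat, then combine linearly) can be sketched as discrete AdaBoost with threshold "stumps" on one-dimensional data. The stump form, the toy data set, and all function names are assumptions made for illustration; the slides do not specify them.

```python
from math import exp, log


def stump_predict(x, threshold, polarity):
    """Weak classifier: predicts polarity at or above the threshold."""
    return polarity if x >= threshold else -polarity


def best_stump(xs, ys, weights):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        alpha = 0.5 * log((1.0 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified examples get heavier, correct ones lighter.
        weights = [w * exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def strong_predict(ensemble, x):
    """Final classifier: sign of the weighted vote of the weak classifiers."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
print([strong_predict(ensemble, x) for x in xs])
```

No single stump separates this data, but on this toy set three boosted stumps classify every training example correctly.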

Slide 32

Slide 33

AdaBoost: Super Efficient Feature Selector. Features = weak classifiers. Each round selects the optimal feature given the previously selected features and the exponential loss.

Slide 34

Boosted Face Detection: Image Features. "Rectangle filters", similar to Haar wavelets (Papageorgiou, et al.). Unique binary features.

Slide 37

Feature Selection

For each round of boosting:

- Evaluate each rectangle filter on each example
- Sort examples by filter values
- Select best threshold for each filter (min Z)
- Select best filter/threshold (= Feature)
- Reweight examples

M filters, T thresholds, N examples, L learning time:

- O( MT L(MTN) ): Naïve Wrapper Method
- O( MN ): AdaBoost feature selector
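The "sort examples, then select the best threshold" step can be done in a single pass over the sorted filter values by keeping running sums of the positive and negative weight seen so far. The sketch below makes two labeled assumptions: it minimizes weighted classification error rather than the Z criterion named on the slide, and the function name and return format are illustrative.

```python
def best_threshold(values, labels, weights):
    """Return (error, threshold, polarity) with the lowest weighted error.

    The implied classifier predicts `polarity` when value > threshold and
    `-polarity` otherwise. Labels are +1 or -1.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    total_pos = sum(w for y, w in zip(labels, weights) if y == 1)
    total_neg = sum(w for y, w in zip(labels, weights) if y == -1)

    # Threshold below every value: everything is classified as `polarity`.
    best = min((total_neg, float("-inf"), 1), (total_pos, float("-inf"), -1))

    pos_below = neg_below = 0.0
    for rank, i in enumerate(order):
        if labels[i] == 1:
            pos_below += weights[i]
        else:
            neg_below += weights[i]
        is_last = rank + 1 == len(order)
        if not is_last and values[order[rank + 1]] == values[i]:
            continue  # cannot split between equal filter values
        err_pos = pos_below + (total_neg - neg_below)  # above threshold -> +1
        err_neg = neg_below + (total_pos - pos_below)  # above threshold -> -1
        best = min(best, (err_pos, values[i], 1), (err_neg, values[i], -1))
    return best


values = [0.2, 0.9, 0.4, 0.7]   # one filter's response on four examples
labels = [-1, 1, -1, 1]
weights = [0.25, 0.25, 0.25, 0.25]
print(best_threshold(values, labels, weights))
```

After the sort, each filter costs one linear scan per boosting round, and the sort order itself does not depend on the weights, so it can be computed once per filter and reused across rounds.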

Slide 38

Example Classifier for Face Detection. A classifier with 200 rectangle features was learned using AdaBoost: 95% correct detection on the test set with 1 in 14084 false positives. Not quite competitive... [Figure: ROC curve for the 200-feature classifier]

Slide 39

Building Fast Classifiers. Given a nested set of classifier hypothesis classes: Computational Risk Minimization.

[Figure: ROC curve, % Detection (50 to 100) against % False Pos (0 to 50); annotation: "vs false neg determined by".]

[Figure: cascade. IMAGE SUB-WINDOW -> Classifier 1 -> (T) Classifier 2 -> (T) Classifier 3 -> FACE; an F at any stage gives NON-FACE.]

Slide 40

Other Fast Classification Work: Simard; Rowley (Faces); Fleuret & Geman (Faces)

Slide 41

Cascaded Classifier

- A 1 feature classifier achieves 100% detection rate and about 50% false positive rate.
- A 5 feature classifier achieves 100% detection rate and 40% false positive rate (20% cumulative), using data from the previous stage.
- A 20 feature classifier achieves 100% detection rate with 10% false positive rate (2% cumulative).

[Figure: cascade. IMAGE SUB-WINDOW -> 1 Feature (50%) -> 5 Features (20%) -> 20 Features (2%) -> FACE; an F at any stage gives NON-FACE.]
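The cumulative figures on this slide are products of the per-stage false positive rates, and the same numbers give a rough estimate of how many features an average non-face window costs. The sketch below assumes that each stage's rate applies to the windows passed on by the previous stage and that essentially all windows are non-faces; the function names are illustrative.

```python
from math import isclose


def cumulative_rates(stage_rates):
    """Running product of per-stage false positive rates."""
    rates, running = [], 1.0
    for rate in stage_rates:
        running *= rate
        rates.append(running)
    return rates


def expected_features(stage_sizes, stage_rates):
    """Average number of features evaluated per non-face window."""
    reach, total = 1.0, 0.0
    for size, rate in zip(stage_sizes, stage_rates):
        total += reach * size   # only windows that reach a stage pay for it
        reach *= rate
    return total


stage_sizes = [1, 5, 20]            # features per stage, from the slide
stage_rates = [0.50, 0.40, 0.10]    # per-stage false positive rates

cumulative = cumulative_rates(stage_rates)
average_cost = expected_features(stage_sizes, stage_rates)
print(cumulative)     # approximately [0.5, 0.2, 0.02]
print(average_cost)   # approximately 7.5, versus 26 if every stage always ran
```

Most of the saving comes from the first stage: half of all windows are rejected after a single feature.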

Slide 42

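Comparison to Other Systems (detection rate in %, by number of false detections)

| Detector | 10 | 31 | 50 | 65 | 78 | 95 | 110 | 167 | 422 |
|---|---|---|---|---|---|---|---|---|---|
| Viola-Jones | 78.3 | 85.2 | 88.8 | 90.0 | 90.1 | 90.8 | 91.1 | 91.8 | 93.7 |

The other detectors report rates at only some of these false-detection counts, and the transcript does not preserve which columns the values belong to: Rowley-Baluja-Kanade 83.2, 86.0, 89.2, 90.1, 89.9; Schneiderman-Kanade 94.4; Roth-Yang-Ahuja (94.8).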

Slide 43

Output of Face Detector on Test Images

Slide 44

Solving other "Face" Tasks: Profile Detection; Facial Feature Localization; Demographic Analysis

Slide 45

Feature Localization. Surprising properties of our framework:

- The cost of detection is not a function of image size, just the number of features
- Learning automatically focuses attention on key regions

Conclusion: the "feature" detector can include a large contextual region around the feature.

Slide 46

Feature Localization Features. Learned features reflect the task.

Slide 47

Profile Detection

Slide 48

More Results

Slide 49

Profile Features

Slide 50

Thanks to Andrew Moore. One-Nearest Neighbor… One nearest neighbor for fitting is described shortly… Similar to Join The Dots, with two pros and one con.

- PRO: It is easy to implement with multivariate inputs.
- CON: It no longer interpolates locally.
- PRO: An excellent introduction to instance-based learning…

Slide 51

Thanks to Andrew Moore. 1-Nearest Neighbor is an example of… instance-based learning. A function approximator that has been around since about 1910. To make a prediction, search the database (x1 y1, x2 y2, x3 y3, …, xn yn) for similar datapoints, and fit with the local points.

Four things make a memory-based learner:

- A distance metric
- How many nearby neighbors to look at?
- A weighting function (optional)
- How to fit with the local points?

Slide 52

Thanks to Andrew Moore. Nearest Neighbor

Four things make a memory-based learner:

- A distance metric: Euclidean
- How many nearby neighbors to look at? One
- A weighting function (optional): unused
- How to fit with the local points? Just predict the same output as the nearest neighbor.
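With those four choices fixed, the learner is only a few lines: store the database, and at prediction time return the output of the closest stored input. This is a minimal sketch; the function name and the toy database are illustrative.

```python
from math import dist


def nearest_neighbor_predict(database, query):
    """Return the output paired with the input closest to `query`.

    `database` is a list of (x, y) pairs, where x is a tuple of numbers.
    Distance is Euclidean, one neighbor is used, and there is no weighting.
    """
    _, y = min(database, key=lambda pair: dist(pair[0], query))
    return y


database = [((0.0, 0.0), "a"), ((1.0, 1.0), "b"), ((5.0, 5.0), "c")]
print(nearest_neighbor_predict(database, (0.9, 0.8)))
print(nearest_neighbor_predict(database, (4.0, 4.0)))
```

There is no training step at all; the whole cost is paid at query time, as a scan over the database.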

Slide 53

Thanks to Andrew Moore. Multivariate Distance Metrics. Suppose the input vectors x1, x2, …, xN are two dimensional: x1 = (x11, x12), x2 = (x21, x22), …, xN = (xN1, xN2). One can draw the nearest-neighbor regions in input space. The relative scalings in the distance metric affect the region shapes.
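A small example shows how much the scaling matters: with the same two stored points and the same query, changing the per-dimension scale factors changes which point is nearest. The scale values and function names here are hypothetical, chosen only to make the effect visible.

```python
from math import sqrt


def scaled_distance(a, b, scales):
    """Euclidean distance after multiplying each coordinate difference by a scale."""
    return sqrt(sum((s * (ai - bi)) ** 2 for ai, bi, s in zip(a, b, scales)))


def nearest(points, query, scales):
    return min(points, key=lambda p: scaled_distance(p, query, scales))


points = [(1.0, 0.0), (0.0, 2.0)]
query = (0.0, 0.0)

print(nearest(points, query, scales=(1.0, 1.0)))  # (1.0, 0.0)
print(nearest(points, query, scales=(3.0, 1.0)))  # (0.0, 2.0)
```

Stretching the first dimension by 3 pushes the first point from distance 1 to distance 3, so the second point (distance 2) becomes the nearest neighbor.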

Slide 54

Thanks to Andrew Moore. Euclidean Distance Metric: [equation not recovered]. Or equivalently, [equation not recovered], where [equation not recovered]. Other metrics… Mahalanobis, rank-based, correlation-based (Stanfill+Waltz, Maes' Ringo system…)
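The slide's equations are missing from the transcript. A common way to write a scaled Euclidean metric and its equivalent matrix form is shown below as an assumption, not as a recovery of the slide; the symbols $\sigma_i$ (per-dimension scale factors) and $\Sigma$ (the diagonal matrix built from them) are the conventional ones.

```latex
D(x, x') = \sqrt{\sum_{i} \sigma_i^{2}\,(x_i - x'_i)^{2}}
\quad\Longleftrightarrow\quad
D(x, x') = \sqrt{(x - x')^{\top}\,\Sigma\,(x - x')},
\qquad
\Sigma = \operatorname{diag}\!\left(\sigma_1^{2}, \dots, \sigma_N^{2}\right)
```

Allowing $\Sigma$ to be a full symmetric positive-definite matrix rather than a diagonal one gives the Mahalanobis distance listed among the other metrics.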

Slide 55

Thanks to Andrew Moore Notable Distance Metrics

Slide 56

Simard: Tangent Distance

Slide 57

Simard: Tangent Distance

Slide 58

Slide 59