Learning and Vision: Discriminative Models Chris Bishop and Paul Viola
Slide 2: Outline. Part I: Fundamentals. Part II: Algorithms and Applications: Support Vector Machines (face and pedestrian detection); AdaBoost (faces, building fast classifiers, trading off speed for accuracy, face and object detection); Memory-Based Learning (Simard, Moghaddam).
Slide 3: History Lesson. 1950's: Perceptrons are cool: a very simple learning rule that can learn "complex" concepts; generalized perceptrons are better, but have too many weights. 1960's: Perceptrons stink (M+P): some simple concepts require an exponential number of features; can't possibly learn that, right? 1980's: MLPs are cool (R+M / PDP): a fairly simple learning rule that can learn anything (?); create only the features you need. 1990: MLPs stink: hard to train, slow, local minima. 1996: Perceptrons are cool.
Slide 4: Why did we need multi-layer perceptrons? Problems like this appear to require very complex non-linearities. Minsky and Papert showed that an exponential number of features is necessary to solve generic problems.
Slide 5: 14th order??? 120 features. Why an exponential number of features? N = 21, k = 5 -> 65,000 features.
Slide 6: MLPs versus Perceptrons. MLPs are hard to train: training takes a long time (unpredictably long) and can converge to poor minima. MLPs are hard to understand: what are they really doing? Perceptrons are easy to train: a form of linear programming, polynomial time, one minimum which is global. Generalized perceptrons are easier to understand: polynomial functions.
Slide 7: What about linearly separable data? Perceptron training is linear programming: polynomial time in the number of variables and in the number of constraints.
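As a concrete illustration (an addition, not from the slides), here is a minimal sketch of the classic perceptron update rule; the function name and the simple epoch loop are assumptions for illustration.

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Classic perceptron rule: y in {-1, +1}, X has shape (n_examples, n_features).
    Converges to a separating hyperplane when the data are linearly separable."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi                    # nudge the hyperplane toward the example
                b += yi
                errors += 1
        if errors == 0:                         # every example classified correctly
            break
    return w, b
```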
Slide 8: Support Vector Machines: the rebirth of perceptrons. How to train effectively: linear programming (and later quadratic programming), though on-line training works great as well. How to get so many features affordably?!? The kernel trick. How to generalize with so many features? VC dimension (or is it regularization?).
Slide 9: Lemma 1: Weight vectors are simple. The weight vector lives in a sub-space spanned by the examples... its dimensionality is determined by the number of examples, not by the complexity of the space.
Slide 10: Lemma 2: We only need to consider the examples.
Slide 11: Simple Kernels Yield Complex Features
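A small numerical illustration of this point (an addition, not from the slides): a degree-2 polynomial kernel computes exactly the inner product of an explicit 6-dimensional feature expansion, without ever building the expanded features.

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """K(x, z) = (x.z + 1)^d: inner product in the expanded polynomial feature space."""
    return (np.dot(x, z) + 1.0) ** d

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (6 features)."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2)*x1, np.sqrt(2)*x2, x1*x1, np.sqrt(2)*x1*x2, x2*x2])

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
# Same value either way, but the kernel never touches the expanded space.
assert np.isclose(poly_kernel(x, z), np.dot(phi(x), phi(z)))
```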
Slide 12: But Kernel Perceptrons Can Generalize Poorly
Slide 13: Perceptron Rebirth: Generalization. Too many features... Occam is unhappy. Perhaps we should encourage smoothness? Smoother.
Slide 14: The linear program is not unique. The linear program can return any multiple of the correct weight vector... Slack variables and a weight prior force the solution toward zero.
Slide 15: Definition of the Margin. Geometric margin: the gap between negatives and positives, measured perpendicular to the hyperplane. Classifier margin.
Slide 16: Requiring a non-zero margin. The plain constraint allows solutions with zero margin; the stricter constraint enforces a non-zero margin between the examples and the decision boundary.
Slide 17: Constrained Optimization. Find the smoothest function that separates the data. Quadratic programming (like linear programming): a single minimum and a polynomial-time algorithm.
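For reference, the standard constrained-optimization form this slide alludes to, reconstructed rather than copied from the slide: maximize the margin by minimizing the weight norm subject to every example clearing the decision boundary by at least 1; slack variables give the soft-margin variant mentioned on the earlier slide.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2
\quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1,\dots,n
```

```latex
\text{(soft margin)}\quad
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w \rVert^2 + C\sum_i \xi_i
\quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0
```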
Slide 18: Constrained Optimization 2
Slide 19: SVM: Examples
Slide 20: SVM: Key Ideas. Augment the inputs with a very large feature set (polynomials, etc.). Use the Kernel Trick(TM) to do this efficiently. Enforce/encourage smoothness with a weight penalty. Introduce a margin. Find the best solution using quadratic programming.
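A hedged end-to-end illustration using scikit-learn (an assumption on my part; the tutorial itself does not use this library): a polynomial-kernel SVM ties the key ideas together, with the QP solved internally and only the support vectors retained.

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(kernel="poly", degree=4, C=1.0)   # degree-4 polynomial features via the kernel trick
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
print("support vectors per class:", clf.n_support_)   # only the examples near the margin matter
```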
Slide 21: SVM: Zip Code Recognition. Data dimension: 256. Feature space: 4th order, roughly 100,000,000 dimensions.
Slide 22: The Classical Face Detection Process: 50,000 locations/scales, scanned from the smallest scale up to larger scales.
Slide 23: The Classifier is Learned from Labeled Data. Training data: 5,000 faces (all frontal) and 10^8 non-faces. Faces are normalized for scale and translation. Many variations: across individuals, illumination, and pose (rotation both in plane and out of plane).
Slide 24: Key Properties of Face Detection. Each image contains 10-50 thousand locations/scales. Faces are rare: 0-50 per image, so there are roughly 1,000 times as many non-faces as faces. An extremely small number of false positives is required: about 10^-6.
Slide 25: Sung and Poggio
Slide 26: Rowley, Baluja & Kanade: the first fast system, going from low resolution to high.
Slide 27: Osuna, Freund, and Girosi
Slide 28: Support Vectors
Slide 29: P, O, & G: First Pedestrian Work
Slide 30: On to AdaBoost. Given a set of weak classifiers, none much better than random, iteratively combine them to form a linear combination. The training error converges to 0 rapidly, and the test error is related to the training margin.
Slide 31: Freund & Schapire's AdaBoost. Weak classifier 1; weights increased; weak classifier 2; weak classifier 3; the final classifier is a linear combination of the weak classifiers.
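A minimal sketch of discrete AdaBoost consistent with the slide (an illustration, not the authors' code); `weak_learners` is assumed to be a list of functions mapping inputs to predictions in {-1, +1}.

```python
import numpy as np

def adaboost(X, y, weak_learners, T=50):
    """Discrete AdaBoost (Freund & Schapire). y in {-1, +1}.
    Returns the chosen weak classifiers and their linear-combination weights."""
    n = len(y)
    D = np.full(n, 1.0 / n)                       # example weights
    chosen, alphas = [], []
    for _ in range(T):
        # pick the weak classifier with the lowest weighted error under the current weights
        errs = [np.sum(D * (h(X) != y)) for h in weak_learners]
        best = int(np.argmin(errs))
        err = max(errs[best], 1e-12)
        if err >= 0.5:                            # no weak learner beats random guessing
            break
        alpha = 0.5 * np.log((1 - err) / err)
        preds = weak_learners[best](X)
        D *= np.exp(-alpha * y * preds)           # increase the weight of misclassified examples
        D /= D.sum()
        chosen.append(weak_learners[best])
        alphas.append(alpha)
    return chosen, alphas

def predict(X, chosen, alphas):
    """Final classifier: sign of a weighted linear combination of the weak classifiers."""
    return np.sign(sum(a * h(X) for a, h in zip(alphas, chosen)))
```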
Slide 32: AdaBoost Properties
Slide 33: AdaBoost: A Super-Efficient Feature Selector. Features = weak classifiers. Each round chooses the optimal feature given the previously selected features and the exponential loss.
Slide 34: Boosted Face Detection: Image Features. "Rectangle filters", similar to the Haar wavelets of Papageorgiou et al. Unique binary features.
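A sketch of how rectangle filters can be evaluated in constant time via an integral image (a standard construction; the helper names are illustrative, not from the slides).

```python
import numpy as np

def integral_image(img):
    """Summed-area table, padded so ii[y, x] = sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def box_sum(ii, top, left, h, w):
    """Sum of the h-by-w box with top-left corner (top, left): four table lookups."""
    return ii[top + h, left + w] - ii[top, left + w] - ii[top + h, left] + ii[top, left]

def two_rect_feature(ii, top, left, h, w):
    """Horizontal two-rectangle filter: left half minus right half (Haar-like)."""
    half = w // 2
    return box_sum(ii, top, left, h, half) - box_sum(ii, top, left + half, h, w - half)
```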
Slide 37: Feature Selection. For each round of boosting: evaluate every rectangle filter on every example; sort the examples by filter value; select the best threshold for each filter (min Z); select the best filter/threshold pair (= feature); reweight the examples. With M filters, T thresholds, N examples, and learning time L: the naïve wrapper method is O(MT L(MTN)); the AdaBoost feature selector is O(MN). A sketch of the threshold search appears below.
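A sketch of the per-filter threshold search described above (an illustration under stated assumptions, not the authors' code): the AdaBoost weights `D` are assumed to sum to one, and ties in filter values are ignored for simplicity. One pass over the examples, sorted by this filter's response, finds the minimum-weighted-error threshold and polarity.

```python
import numpy as np

def best_stump_for_filter(values, y, D):
    """values: one filter's responses on all examples; y in {-1, +1}; D: AdaBoost weights.
    Returns (weighted error, threshold, polarity) of the best decision stump."""
    order = np.argsort(values)
    v, ys, w = values[order], y[order], D[order]
    total = w.sum()
    # start with the threshold below every value: polarity +1 predicts +1 everywhere
    err = np.sum(w[ys < 0])                       # all negatives are wrong
    best = (min(err, total - err), v[0] - 1.0, +1 if err <= total - err else -1)
    for i in range(len(v)):
        # raise the threshold past example i, so it is now predicted -1 (polarity +1)
        err += w[i] if ys[i] > 0 else -w[i]
        for e, p in ((err, +1), (total - err, -1)):   # flipping polarity flips the error
            if e < best[0]:
                best = (e, v[i], p)
    return best
```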
Slide 38: Example Classifier for Face Detection. A classifier with 200 rectangle features was learned using AdaBoost: 95% correct detection on the test set with 1 false positive in 14,084. Not quite competitive... ROC curve for the 200-feature classifier.
Slide 39: Building Fast Classifiers. Given a fixed family of classifier hypothesis classes: Computational Risk Minimization. [Figures: an ROC trade-off of % false positives (0-50) versus % detection (50-100), with false positives traded against false negatives by the stage threshold, and a cascade diagram in which each image sub-window passes Classifier 1, 2, 3, ... toward FACE, with a NON-FACE exit at every stage.]
Slide 40: Other Fast Classification Work: Simard; Rowley (faces); Fleuret & Geman (faces).
Slide 41: Cascaded Classifier. A 1-feature classifier achieves a 100% detection rate with about a 50% false positive rate. A 5-feature classifier achieves a 100% detection rate with a 40% false positive rate (20% cumulative), using the data that passed the previous stage. A 20-feature classifier achieves a 100% detection rate with a 10% false positive rate (2% cumulative). [Diagram: each image sub-window passes through the 1-feature, 5-feature, and 20-feature stages (50% -> 20% -> 2%) toward FACE, with a NON-FACE exit at each stage.]
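A minimal sketch of cascade evaluation (illustrative names; each stage's score function and threshold would come from the boosted classifiers above):

```python
def cascade_classify(window, stages):
    """Attentional cascade: `stages` is a list of (score_fn, threshold) pairs ordered from
    cheapest to most expensive. A window is rejected the moment any stage says non-face,
    so almost all of the ~50,000 sub-windows are discarded by the first few features."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False            # NON-FACE: stop immediately
    return True                     # survived every stage: FACE
```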
Slide 42: Comparison to Other Systems. Detection rate (%) for a given number of false detections:
Detector                 10     31     50     65     78     95     110    167    422
Viola-Jones              78.3   85.2   88.8   90.0   90.1   90.8   91.1   91.8   93.7
Rowley-Baluja-Kanade     83.2   86.0   -      -      -      89.2   -      90.1   89.9
Schneiderman-Kanade      -      -      -      94.4   -      -      -      -      -
Roth-Yang-Ahuja          -      -      -      -      (94.8) -      -      -      -
Slide 43: Output of the Face Detector on Test Images
Slide 44: Solving Other "Face" Tasks: Profile Detection, Facial Feature Localization, Demographic Analysis
Slide 45: Feature Localization. Surprising properties of our framework: the cost of detection is not a function of image size, only of the number of features, and learning automatically focuses attention on key regions. Conclusion: the "feature" detector can include a large contextual region around the feature.
Slide 46: Feature Localization Features. The learned features reflect the task.
Slide 47: Profile Detection
Slide 48: More Results
Slide 49: Profile Features
Slide 50: One-Nearest Neighbor (thanks to Andrew Moore). One-nearest-neighbor fitting is described shortly... It is similar to join-the-dots, with two pros and one con. PRO: it is easy to implement with multivariate inputs. CON: it no longer interpolates locally. PRO: it is a great introduction to instance-based learning...
Slide 51: 1-Nearest Neighbor is an example of... instance-based learning (thanks to Andrew Moore). Four things make a memory-based learner: a distance metric; how many nearby neighbors to look at; a weighting function (optional); and how to fit with the local points. The training set is stored as pairs (x1, y1), (x2, y2), ..., (xn, yn). It is a function approximator that has been around since about 1910: to make a prediction, search the database for similar datapoints and fit with the local points.
Slide 52: Nearest Neighbor (thanks to Andrew Moore). The four things that make a memory-based learner: distance metric: Euclidean; how many nearby neighbors to look at: one; weighting function (optional): unused; how to fit with the local points: just predict the same output as the nearest neighbor.
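The same four choices written out as a minimal sketch (illustrative code, not Andrew Moore's):

```python
import numpy as np

def one_nearest_neighbor(query, X_train, y_train):
    """1-NN instantiation of a memory-based learner: Euclidean distance metric,
    one neighbor, no weighting, and the local fit is simply the neighbor's output."""
    dists = np.linalg.norm(X_train - query, axis=1)
    return y_train[np.argmin(dists)]
```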
Slide 53: Multivariate Distance Metrics (thanks to Andrew Moore). Suppose the input vectors x1, x2, ..., xN are two-dimensional: x1 = (x11, x12), x2 = (x21, x22), ..., xN = (xN1, xN2). One can draw the nearest-neighbor regions in input space. The relative scalings in the distance metric affect the shapes of these regions.
Slide 54: Euclidean Distance Metric (thanks to Andrew Moore). Other metrics... Mahalanobis, rank-based, correlation-based (Stanfill & Waltz, Maes' Ringo system...). The scaled Euclidean form and its equivalent matrix form are given below.
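The slide's equations did not survive extraction; the following is the standard scaled Euclidean metric and its equivalent matrix form, reconstructed rather than copied. The Mahalanobis distance replaces the diagonal scaling matrix with the inverse covariance of the data.

```latex
D(x, x') \;=\; \sqrt{\sum_i \sigma_i^2\,(x_i - x'_i)^2}
\;=\; \sqrt{(x - x')^{\top}\,\Sigma\,(x - x')},
\qquad \Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_d^2)
```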
Slide 55: Notable Distance Metrics (thanks to Andrew Moore)
Slide 56: Simard: Tangent Distance
Slide 57: Simard: Tangent Distance
Slide 58: FERET Photobook, Moghaddam & Pentland (1995) (thanks to Baback Moghaddam)
Slide 59: Eigenfaces and Normalized Eigenfaces, Moghaddam & Pentland (1995) (thanks to Baback Moghaddam)
Slide 60: Euclidean (Standard) "Eigenfaces", Turk & Pentland (1992), Moghaddam & Pentland (1995) (thanks to Baback Moghaddam): projects all the training faces onto a single universal eigenspace.
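A minimal sketch of the eigenface projection (standard PCA, not Moghaddam & Pentland's code; the function names are illustrative): training faces are centered, the top-k principal directions in pixel space are kept, and each face is then represented by its k projection coefficients.

```python
import numpy as np

def fit_eigenfaces(faces, k=50):
    """faces: (n_images, n_pixels) matrix of normalized training faces.
    Returns the mean face and the top-k eigenvectors ("eigenfaces") of the data."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    # SVD of the centered data: rows of Vt are the principal directions in pixel space
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return mean, Vt[:k]

def project(face, mean, eigenfaces):
    """Coefficients of a face in the eigenspace; faces are compared via these k numbers."""
    return eigenfaces @ (face - mean)
```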