Data Mining (with many slides due to Gehrke, Garofalakis, Rastogi)


Presentation Transcript

Slide 1

Data Mining (with many slides due to Gehrke, Garofalakis, Rastogi) Raghu Ramakrishnan, Yahoo! Research; University of Wisconsin–Madison (on leave)

Slide 2

Introduction

Slide 3

Definition Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. Valid: the patterns hold in general. Novel: we did not know the pattern beforehand. Useful: we can devise actions from the patterns. Understandable: we can interpret and comprehend the patterns.

Slide 4

Case Study: Bank Business objective: sell more home equity loans. Current models: customers with college-age children use home equity loans to pay for tuition; customers with variable income use home equity loans to even out their stream of income. Data: a large data warehouse that consolidates data from 42 operational data sources.

Slide 5

Case Study: Bank (Contd.) Select the subset of customer records who have received the home equity loan offer: customers who declined, and customers who signed up.

Slide 6

Case Study: Bank (Contd.) Find rules to predict whether a customer will respond to the home equity loan offer: IF (Salary < 40k) and (numChildren > 0) and (ageChild1 > 18 and ageChild1 < 22) THEN YES …
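A rule like the one above is just a boolean predicate over a customer record. A minimal sketch in Python, where the record layout and field names (`salary`, `num_children`, `age_child1`) are assumptions made for illustration:

```python
# Hypothetical encoding of the slide's IF-THEN rule as a Python predicate.
# Field names are assumed; they do not come from the original bank dataset.
def likely_responder(customer: dict) -> bool:
    """Return True if the rule predicts the customer responds to the offer."""
    return (
        customer["salary"] < 40_000
        and customer["num_children"] > 0
        and 18 < customer["age_child1"] < 22
    )

# Fabricated example records, for illustration only.
print(likely_responder({"salary": 35_000, "num_children": 2, "age_child1": 20}))  # True
print(likely_responder({"salary": 55_000, "num_children": 1, "age_child1": 20}))  # False
```

A rule-based classifier is simply an ordered list of such predicates, tried in sequence until one fires.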

Slide 7

Case Study: Bank (Contd.) Group customers into clusters and investigate the clusters (Group 1, Group 2, Group 3, Group 4).

Slide 8

Case Study: Bank (Contd.) Evaluate the results: many "uninteresting" clusters, but one interesting cluster! Customers with both business and personal accounts show an unusually high percentage of likely respondents.

Slide 9

Case Study: Bank (Contd.) Action: a new marketing campaign. Result: the acceptance rate for home equity offers more than doubled.

Slide 10

Example Application: Fraud Detection Industries: health care, retail, credit card services, telecom, B2B relationships. Approach: use historical data to build models of fraudulent behavior, then deploy the models to identify fraudulent instances.

Slide 11

Fraud Detection (Contd.) Examples: Auto insurance: detect rings of people who stage accidents to collect insurance money. Medical insurance: detect fraudulent claims. Money laundering: detect suspicious monetary transactions (US Treasury's Financial Crimes Enforcement Network). Telecom industry: find calling patterns that deviate from a norm (origin and destination of the call, duration, time of day, day of week).

Slide 12

Other Example Applications CPG: promotion analysis. Retail: category management. Telecom: call usage analysis, churn. Healthcare: claims analysis, fraud detection. Transportation/Distribution: logistics management. Financial services: credit analysis, fraud detection. Data service providers: value-added data analysis.

Slide 13

What is a Data Mining Model? A data mining model is a description of a certain aspect of a dataset. It produces output values for an assigned set of inputs. Examples: clustering, linear regression models, classification models, frequent itemsets and association rules, support vector machines.

Slide 14

Data Mining Methods

Slide 15

Overview Several well-studied tasks: classification, clustering, frequent patterns. Many techniques have been proposed for each. The focus in the database and data mining community: scalability, managing the process, exploratory analysis.

Slide 16

Classification Goal: learn a function that assigns a record to one of several predefined classes. Requirements on the model: high accuracy; understandable by humans, interpretable; fast construction for very large training databases.

Slide 17

Classification Example application: telemarketing

Slide 18

Classification (Contd.) Decision trees are one approach to classification. Other approaches include: linear discriminant analysis, k-nearest neighbor methods, logistic regression, neural networks, support vector machines.

Slide 19

Classification Example Training database: two predictor attributes, Age and Car-type (Sport, Minivan, and Truck). Age is ordered; Car-type is a categorical attribute. The class label indicates whether the person bought the product. The dependent attribute is categorical.

Slide 20

Classification Problem If Y is categorical, the problem is a classification problem, and we use C instead of Y. |dom(C)| = J, the number of classes. C is the class label; d is called a classifier. Let r be a record randomly drawn from P. Define the misclassification rate of d: RT(d,P) = P(d(r.X1, …, r.Xk) != r.C). Problem definition: given a dataset D that is a random sample from probability distribution P, find a classifier d such that RT(d,P) is minimized.
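On a finite sample D the misclassification rate is estimated as the fraction of records the classifier gets wrong. A minimal sketch, with a toy classifier and made-up records (the record layout is an assumption, not from the slides):

```python
# Empirical estimate of RT(d, P): the fraction of records where the
# classifier's prediction d(r) disagrees with the true class label r.C.
def misclassification_rate(classifier, records, labels):
    errors = sum(1 for r, c in zip(records, labels) if classifier(r) != c)
    return errors / len(records)

# Toy classifier: predict "YES" when age < 30 (record layout assumed).
d = lambda r: "YES" if r["age"] < 30 else "NO"
X = [{"age": 25}, {"age": 40}, {"age": 28}, {"age": 50}]
C = ["YES", "NO", "NO", "NO"]
print(misclassification_rate(d, X, C))  # 0.25 -- one of four records is wrong
```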

Slide 21

Regression Problem If Y is numerical, the problem is a regression problem. Y is called the dependent variable; d is called a regression function. Let r be a record randomly drawn from P. Define the mean squared error of d: RT(d,P) = E(r.Y − d(r.X1, …, r.Xk))^2. Problem definition: given a dataset D that is a random sample from probability distribution P, find a regression function d such that RT(d,P) is minimized.
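The empirical analogue of the expectation above is the mean squared error over a sample. A minimal sketch, with fabricated data and an assumed record layout:

```python
# Empirical mean squared error of a regression function d on a sample,
# estimating RT(d, P) = E(r.Y - d(r.X1, ..., r.Xk))^2.
def mean_squared_error(d, records, targets):
    return sum((y - d(r)) ** 2 for r, y in zip(records, targets)) / len(records)

# Toy regression function: spending predicted as 10 * age
# (coefficient and field names are made up for illustration).
d = lambda r: 10.0 * r["age"]
X = [{"age": 20}, {"age": 30}]
Y = [210.0, 290.0]
print(mean_squared_error(d, X, Y))  # ((210-200)^2 + (290-300)^2) / 2 = 100.0
```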

Slide 22

Regression Example training database: two predictor attributes, Age and Car-type (Sport, Minivan, and Truck). Spent indicates how much the person spent during a recent visit to the web site. The dependent attribute is numerical.

Slide 23

Decision Trees

Slide 24

What are Decision Trees? [Figure: a decision tree. The root splits on Age. Records with Age < 30 go to a Car Type node: Minivan is labeled YES; Sports and Truck are labeled NO. Records with Age >= 30 are labeled YES.]

Slide 25

Decision Trees A decision tree T encodes d (a classifier or regression function) in the form of a tree. A node t in T without children is called a leaf node. Otherwise t is called an internal node.

Slide 26

Internal Nodes Each internal node has an associated splitting predicate. The most common are binary predicates. Example predicates: Age <= 20; Profession in {student, teacher}; 5000*Age + 3*Salary – 10000 > 0.
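Each splitting predicate maps a record to true or false, sending it to the left or right child. The slide's three example predicates, sketched as Python functions over a record whose field names (`age`, `profession`, `salary`) are assumptions:

```python
# The slide's example splitting predicates as boolean functions of a record.
# Field names are assumed for illustration.
splits = {
    "age<=20":       lambda r: r["age"] <= 20,
    "profession_in": lambda r: r["profession"] in {"student", "teacher"},
    "linear_split":  lambda r: 5000 * r["age"] + 3 * r["salary"] - 10000 > 0,
}

r = {"age": 19, "profession": "student", "salary": 1000}
print([name for name, p in splits.items() if p(r)])  # all three predicates hold
```

The third predicate is an oblique (linear combination) split; most tree builders restrict themselves to single-attribute splits because the search space is far smaller.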

Slide 27

Leaf Nodes Consider a leaf node t. Classification problem: node t is labeled with one class label c in dom(C). Regression problem: two choices. Piecewise constant model: t is labeled with a constant y in dom(Y). Piecewise linear model: t is labeled with a linear model Y = y_t + Σ a_i X_i.

Slide 28

Example Encoded classifier: If (age < 30 and carType = Minivan) then YES. If (age < 30 and (carType = Sports or carType = Truck)) then NO. If (age >= 30) then YES. [Figure: the corresponding decision tree, splitting on Age and then Car Type.]
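The encoded classifier above translates directly into code; each root-to-leaf path becomes one branch:

```python
# The slide's decision tree written as a plain Python function.
def classify(age: int, car_type: str) -> str:
    if age < 30:
        # Left subtree: split on Car Type.
        return "YES" if car_type == "Minivan" else "NO"
    # Right subtree: Age >= 30 is a leaf labeled YES.
    return "YES"

print(classify(25, "Minivan"))  # YES
print(classify(25, "Sports"))   # NO
print(classify(45, "Truck"))    # YES
```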

Slide 29

Issues in Tree Construction Three algorithmic components: the split selection method, the pruning method, and the data access method.

Slide 30

Top-Down Tree Construction BuildTree(Node n, Training database D, Split Selection Method S): [(1) Apply S to D to find the splitting criterion] (1a) for each predictor attribute X (1b) call S.findSplit(AVC-set of X) (1c) endfor (1d) S.chooseBest(); (2) if (n is not a leaf node) … Examples of S: C4.5, CART, CHAID, FACT, ID3, GID3, QUEST, etc.
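The BuildTree skeleton above can be made concrete. A minimal sketch for a single numeric attribute, using the gini index as the split quality measure; real systems (C4.5, CART, etc.) add pruning, multi-attribute search, and scalable data access, none of which is shown here:

```python
# Minimal top-down tree construction: greedily pick the best binary split
# "x <= c" by weighted gini impurity, then recurse on each partition.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(xs, ys, min_size=2):
    # Stop when the node is pure or too small: emit a majority-class leaf.
    if len(set(ys)) == 1 or len(ys) < min_size:
        return {"leaf": Counter(ys).most_common(1)[0][0]}
    best_score, best_c = None, None
    for c in sorted(set(xs))[:-1]:        # candidate split points "x <= c"
        left = [y for x, y in zip(xs, ys) if x <= c]
        right = [y for x, y in zip(xs, ys) if x > c]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best_score is None or score < best_score:
            best_score, best_c = score, c
    if best_c is None:
        return {"leaf": Counter(ys).most_common(1)[0][0]}
    return {
        "split": best_c,
        "left": build_tree([x for x in xs if x <= best_c],
                           [y for x, y in zip(xs, ys) if x <= best_c], min_size),
        "right": build_tree([x for x in xs if x > best_c],
                            [y for x, y in zip(xs, ys) if x > best_c], min_size),
    }

ages = [22, 25, 28, 35, 40, 50]
bought = ["NO", "NO", "NO", "YES", "YES", "YES"]
tree = build_tree(ages, bought)
print(tree["split"])  # 28 -- "age <= 28" perfectly separates the two classes
```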

Slide 31

Split Selection Method Numerical attribute: find a split point that separates the (two) classes. [Figure: records sorted by Age, with a candidate split point between 30 and 35 separating the Yes class from the No class.]

Slide 32

Split Selection Method (Contd.) Categorical attributes: how to group the values? For the domain {Sport, Truck, Minivan}, the possible binary groupings are: (Sport, Truck) – (Minivan); (Sport) – (Truck, Minivan); (Sport, Minivan) – (Truck).
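For a categorical domain of size m there are 2^(m-1) − 1 nontrivial binary groupings, which is why categorical split search gets expensive for high-cardinality attributes. A sketch that enumerates them (fixing one value on the left side to avoid counting mirror-image splits twice):

```python
# Enumerate all binary groupings of a categorical domain, as on the slide.
from itertools import combinations

def binary_splits(domain):
    domain = sorted(domain)
    anchor = domain[0]                 # pin one value to skip mirror duplicates
    rest = domain[1:]
    for k in range(len(rest) + 1):
        for combo in combinations(rest, k):
            left = {anchor, *combo}
            right = set(domain) - left
            if right:                  # skip the trivial "everything left" split
                yield (left, right)

for left, right in binary_splits({"Sport", "Truck", "Minivan"}):
    print(sorted(left), "-", sorted(right))
```

For {Sport, Truck, Minivan} this yields exactly the three groupings listed on the slide (2^2 − 1 = 3).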

Slide 33

Impurity-based Split Selection Methods A split selection method has two parts: the search space of possible splitting criteria (example: all splits of the form "age <= c"), and a quality assessment of a splitting criterion. We need to quantify the quality of a split: an impurity function. Example impurity functions: entropy, gini index, chi-square index.

Slide 34

Data Access Method Goal: scalable decision tree construction, using the entire training database.

Slide 35

AVC-Sets [Figure: a training database and the AVC-sets derived from it.]
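An AVC-set (Attribute-Value, Class label) for a predictor attribute is the table of counts per (attribute value, class) pair at a node. Those counts are all an impurity-based split selection method needs, and they are usually far smaller than the node's raw training partition. A sketch with fabricated records (field names assumed):

```python
# Build the AVC-set of one predictor attribute: counts per
# (attribute value, class label) pair.
from collections import Counter

def avc_set(records, attribute, class_field="class"):
    return Counter((r[attribute], r[class_field]) for r in records)

training = [  # toy records; field names are assumptions
    {"car": "Minivan", "class": "YES"},
    {"car": "Minivan", "class": "YES"},
    {"car": "Sport",   "class": "NO"},
    {"car": "Truck",   "class": "NO"},
    {"car": "Sport",   "class": "YES"},
]
avc = avc_set(training, "car")
print(avc[("Minivan", "YES")])  # 2
print(avc[("Sport", "NO")])     # 1
```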

Slide 36

Motivation for Data Access Methods [Figure: the training database partitioned on Age into a left partition (Age < 30) and a right partition (Age >= 30).] In principle, one scan over the training database per node. Can we do better?

Slide 37

RainForest Algorithms: RF-Hybrid First scan: build the AVC-sets for the root. [Figure: the database on disk; the AVC-sets in main memory.]

Slide 38

RainForest Algorithms: RF-Hybrid Second scan: build the AVC-sets for the children of the root (split: Age < 30). [Figure: the database on disk; the children's AVC-sets in main memory.]

Slide 39

RainForest Algorithms: RF-Hybrid Third scan: as we expand the tree (splits Age < 30, Sal < 20k, Car == S), we run out of memory and need to "spill" partitions to disk, then recursively read and process them later. [Figure: Partitions 1–4 written to disk; the remaining AVC-sets in main memory.]

Slide 40

RainForest Algorithms: RF-Hybrid Further optimization: while writing partitions, concurrently build the AVC-groups of as many nodes as possible in memory. This should remind you of hybrid hash join! [Figure: Partitions 1–4 on disk alongside the in-memory AVC-groups.]

Slide 41

CLUSTERING

Slide 42

Problem Given points in a multidimensional space, group them into a small number of clusters, using some measure of "closeness". E.g., cluster documents by topic; e.g., cluster users by similar interests.

Slide 43

Clustering Output: (k) groups of records called clusters, such that the records within a group are more similar to each other than to records in other groups; representative points for each cluster; a labeling of each record with its cluster number; other descriptions of each cluster. This is unsupervised learning: no record labels are given to learn from. Usage: exploratory data mining; as a preprocessing step (e.g., outlier detection).

Slide 44

Clustering (Contd.) Requirements: we need to define "similarity" between records. Important: use the "right" similarity (distance) function. Scale or normalize all attributes (example: seconds, hours, days). Assign different weights to reflect the importance of each attribute. Choose an appropriate measure (e.g., L1, L2).
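The points above can be sketched concretely: L1 and L2 distances, plus min-max scaling so that attributes on wildly different scales (seconds vs. days, as in the slide's example) contribute comparably. The data values are fabricated for illustration:

```python
# L1 and L2 distances between records, with min-max scaling of attributes.
def scale(points):
    """Min-max scale each dimension of a list of tuples into [0, 1]."""
    dims = len(points[0])
    lo = [min(p[i] for p in points) for i in range(dims)]
    hi = [max(p[i] for p in points) for i in range(dims)]
    return [tuple((p[i] - lo[i]) / (hi[i] - lo[i]) for i in range(dims))
            for p in points]

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

print(l1((0.0, 0.0), (3.0, 4.0)))  # 7.0
print(l2((0.0, 0.0), (3.0, 4.0)))  # 5.0

# Without scaling, the seconds-valued attribute would dominate any distance.
scaled = scale([(1, 86400), (2, 3600), (3, 7200)])
print(scaled[0])  # (0.0, 1.0) -- both attributes now lie in [0, 1]
```

Per-attribute weights would slot in as multipliers inside `l1`/`l2`; the right weights are a modeling decision, not something the data dictates.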

Slide 45

Approaches Centroid-based: assume we have k clusters …
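The transcript is cut off here, but the classic centroid-based method is k-means: assume k clusters, each summarized by its centroid, and alternate between assigning points to the nearest centroid and recomputing centroids. A minimal one-dimensional sketch (the slide does not specify this algorithm; it is offered as the standard example of the centroid-based approach):

```python
# Minimal 1-D k-means: alternate assignment and centroid update.
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid (L1 distance in 1-D).
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its assigned points;
        # keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

pts = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
print(kmeans(pts, centroids=[0.0, 5.0]))  # approximately [1.0, 10.0]
```

Like all centroid-based methods, this converges to a local optimum that depends on the initial centroids, which is why practical implementations use multiple restarts or careful seeding.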
