

DSCI 4520/5240 (DATA MINING) DSCI 4520/5240 Lecture 9 Clustering Analysis Some slide material taken from: Marakas 2003, Han & Kamber, Olson & Shi, SAS Education

Objectives: Overview of K-Means unsupervised clustering. Measures of similarity and distance. Clustering using SAS Enterprise Miner.

Introduction to Clustering. Cluster: a collection of data objects, with high similarity among objects in the same cluster and dissimilarity among objects in different clusters. Clustering is an unsupervised classification technique: there are no predefined classes. Typical applications of clustering: as a stand-alone analysis, to gain insight into the data; as a preprocessing step for other predictive models.

Unsupervised Classification. Training data: case 1: inputs, ?; case 2: inputs, ?; case 3: inputs, ?; case 4: inputs, ?; case 5: inputs, ?. Training data with cluster assignments: case 1: inputs, cluster 1; case 2: inputs, cluster 3; case 3: inputs, cluster 2; case 4: inputs, cluster 1; case 5: inputs, cluster 2. Each panel is shown with a new case. Unsupervised classification (clustering) has an UNKNOWN TARGET.

Clustering Applications. Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs. Land use: identification of areas of similar land use in an earth observation database. Insurance: identifying groups of motor insurance policy holders with a high average claim cost. City planning: identifying groups of houses according to their house type, value, and geographical location. Earthquake studies: observed earthquake epicenters should be clustered along continental faults.

Types of data in cluster analysis: interval-scaled variables; binary variables; nominal, ordinal, and ratio variables; variables of mixed types.

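Interval-valued variables. Standardize the data. Calculate the mean absolute deviation: $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$, where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$. Calculate the standardized measurement (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$. Using the mean absolute deviation is more robust than using the standard deviation.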
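The standardization step can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture; the function name `standardize` and the sample values are assumptions made for the example.

```python
def standardize(values):
    """Return z-scores computed with the mean absolute deviation.

    Hypothetical helper for illustration: `values` holds one
    interval-scaled variable measured on n objects.
    """
    n = len(values)
    m = sum(values) / n                      # mean of the variable
    s = sum(abs(x - m) for x in values) / n  # mean absolute deviation
    if s == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m) / s for x in values]


# Illustrative values only: mean is 4, mean absolute deviation is 4/3.
print(standardize([2.0, 4.0, 6.0]))
```

Because the deviations are not squared, a single extreme value inflates the scale less than it would with the standard deviation.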

Similarity and Dissimilarity Between Objects. Distances are normally used to measure the similarity or dissimilarity between two data objects. Some popular ones include the Minkowski distance: $d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q\right)^{1/q}$, where $i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \dots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer. For q = 1, we get the MANHATTAN DISTANCE. For q = 2, we get the EUCLIDEAN DISTANCE.

Similarity and Dissimilarity Between Objects. If q = 1, d is the Manhattan distance. For two points $(U_1, V_1)$ and $(U_2, V_2)$: $L_1 = |U_1 - U_2| + |V_1 - V_2|$.

Similarity and Dissimilarity Between Objects (Cont.). If q = 2, d is the Euclidean distance. For two points $(U_1, V_1)$ and $(U_2, V_2)$: $L_2 = \left((U_1 - U_2)^2 + (V_1 - V_2)^2\right)^{1/2}$. Properties: $d(i,j) \ge 0$; $d(i,i) = 0$; $d(i,j) = d(j,i)$; $d(i,j) \le d(i,k) + d(k,j)$.
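As a rough sketch, the Minkowski family of distances can be written as a single Python function, with the Manhattan and Euclidean distances as the q = 1 and q = 2 cases. The function name `minkowski` and the sample points are assumptions made for illustration.

```python
def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional points.

    Illustrative helper: q = 1 gives the Manhattan distance,
    q = 2 gives the Euclidean distance.
    """
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical points (U1, V1) and (U2, V2).
p1, p2 = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(p1, p2, 1)   # |0 - 3| + |0 - 4|
euclidean = minkowski(p1, p2, 2)   # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of points, which the example values show.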

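Binary Variables. A contingency table for binary data (Object i against Object j), where a counts the variables equal to 1 for both objects, b those equal to 1 for i and 0 for j, c those equal to 0 for i and 1 for j, and d those equal to 0 for both. Simple matching coefficient (invariant, if the binary variable is symmetric): $d(i,j) = \frac{b + c}{a + b + c + d}$. Jaccard coefficient (noninvariant if the binary variable is asymmetric): $d(i,j) = \frac{b + c}{a + b + c}$.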
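The two coefficients can be sketched in Python as follows. The slide's formulas did not survive in this text, so the definitions below are the usual textbook ones and should be read as an assumption: a, b, c, d are the contingency counts (both 1; 1 then 0; 0 then 1; both 0). The function names and the sample vectors are hypothetical.

```python
def contingency(i, j):
    """Count the four cells of the 2x2 contingency table for two 0/1 vectors."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    return a, b, c, d


def simple_matching_dissimilarity(i, j):
    """Symmetric case: 0-0 matches count as agreement."""
    a, b, c, d = contingency(i, j)
    return (b + c) / (a + b + c + d)


def jaccard_dissimilarity(i, j):
    """Asymmetric case: 0-0 matches are ignored."""
    a, b, c, _ = contingency(i, j)
    total = a + b + c
    return (b + c) / total if total else 0.0


# Illustrative objects: a = 2, b = 0, c = 1, d = 3.
obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]
print(simple_matching_dissimilarity(obj_i, obj_j))  # (0 + 1) / 6
print(jaccard_dissimilarity(obj_i, obj_j))          # (0 + 1) / 3
```

Dropping d from the denominator is what makes the second measure suitable for asymmetric variables, where a shared 0 carries little information.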

Dissimilarity between Binary Variables. Example: gender is a symmetric attribute; the remaining attributes are asymmetric binary. Let the values Y and P be set to 1, and the value N be set to 0.

Nominal Variables. A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green. Method 1: simple matching, $d(i,j) = \frac{p - m}{p}$, where m is the # of matches and p is the total # of variables. Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states.
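Both methods are easy to sketch in Python. The dissimilarity (p - m) / p used below is the standard simple-matching form and is an assumption here, since the slide's formula is not present in this text; the function names and attribute values are hypothetical.

```python
def nominal_dissimilarity(i, j):
    """Method 1: simple matching over p nominal variables."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of variables")
    p = len(i)
    m = sum(1 for x, y in zip(i, j) if x == y)  # number of matches
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one new binary variable per nominal state."""
    return [1 if value == s else 0 for s in states]


# Illustrative objects described by colour, size and shape.
print(nominal_dissimilarity(["red", "small", "round"],
                            ["red", "large", "round"]))   # 1 mismatch out of 3
print(one_hot("blue", ["red", "yellow", "blue", "green"]))
```

After the one-hot step, the binary-variable measures can be applied to the encoded vectors.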

Ordinal Variables. An ordinal variable can be discrete or continuous; order is important, e.g., rank. It can be treated like an interval-scaled variable: replace $x_{if}$ by its rank $r_{if} \in \{1, \dots, M_f\}$; map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$; compute the dissimilarity using methods for interval-scaled variables.
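A minimal Python sketch of the rank mapping follows. It assumes the usual form z = (r - 1) / (M - 1), where r is the rank of a value and M is the number of ordered states; the function name and the level labels are hypothetical.

```python
def ordinal_to_unit_interval(value, ordered_levels):
    """Map an ordinal value onto [0, 1] using its rank.

    `ordered_levels` lists the M states from lowest to highest.
    """
    m_states = len(ordered_levels)
    if m_states < 2:
        raise ValueError("need at least two ordered states")
    rank = ordered_levels.index(value) + 1   # ranks start at 1
    return (rank - 1) / (m_states - 1)


levels = ["low", "medium", "high"]           # illustrative ordering
print([ordinal_to_unit_interval(v, levels) for v in levels])
```

The mapped values can then be fed to any interval-scaled distance, such as the Manhattan or Euclidean distance.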

Similarity between documents: The Vector Space Model (VSM). Document (text) classification is made possible through calculations of VSM-based document similarities. A similar similarity metric is used by web search engines to calculate the similarity between query texts and retrieved documents. Every document is represented as a vector of its index terms. The cosine of the angle between vectors determines relevance: $\cos\theta = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}$. [Figure: vector space with axes Term i and Term j, showing the origin of the vector space, the document projection horizon, and similar documents.]
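Cosine similarity between two term vectors can be sketched as below. The term counts are made-up illustrative values, and `cosine_similarity` is a hypothetical helper name rather than anything defined in the lecture.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical term counts over the same three index terms.
doc_a = [1, 2, 0]
doc_b = [2, 4, 0]   # same direction as doc_a, twice the length
doc_c = [0, 0, 3]   # shares no terms with doc_a
print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Because only the angle matters, a long document and a short one with the same term proportions are treated as equally relevant.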

The K-Means Clustering Method. Given k, the k-means algorithm is implemented in the following steps (Olson & Shi, p. 75): (1) Select the desired number of clusters k. (2) Select k initial observations as seeds. (3) Calculate average cluster values (cluster centroids) over each variable (for the initial iteration, this will simply be the initial seed observations). (4) Assign each of the other training observations to the cluster with the nearest centroid. (5) Recalculate cluster centroids (averages) based on the assignments from step 4. (6) Iterate between steps 4 and 5; stop when there are no more new assignments.
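The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the SAS Enterprise Miner implementation: the first k observations are taken as seeds, Euclidean distance is used, an empty cluster keeps its previous centroid, and the names `kmeans` and `max_iter` are hypothetical.

```python
import math


def kmeans(points, k, max_iter=100):
    """Cluster `points` (tuples of numbers) into k groups.

    Returns (assignments, centroids). Assumption: the first k points
    serve as the initial seeds.
    """
    centroids = [tuple(p) for p in points[:k]]   # seeds double as first centroids
    assignments = None
    for _ in range(max_iter):
        # Assign every observation to the cluster with the nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:       # no new assignments: stop
            break
        assignments = new_assignments
        # Recalculate each centroid as the average of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:                          # empty cluster keeps its centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments, centroids


# Illustrative data: two well-separated groups.
data = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
labels, centers = kmeans(data, k=2)
print(labels)
print(centers)
```

A different choice of seeds can lead to a different final partition, so in practice the procedure is often run from several starting points.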

The K-Means Clustering Method Example

Comments on the K-Means Method. Strength: relatively efficient, O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally, k, t << n. Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms. Weakness: applicable only when the mean is defined, so what about categorical data? Need to specify k, the number of clusters, in advance. Unable to handle noisy data and outliers. Not suitable for discovering clusters with non-convex shapes.
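The sensitivity to outliers comes from the centroid being a mean. The small sketch below, with made-up values, shows how one extreme observation drags the mean of a one-dimensional cluster; the median is shown only for contrast and is not part of the method described here.

```python
from statistics import mean, median

# Hypothetical one-dimensional cluster, then the same cluster with one outlier.
values = [1.0, 2.0, 3.0, 4.0]
with_outlier = values + [100.0]

centroid_clean = mean(values)          # 10 / 4
centroid_noisy = mean(with_outlier)    # 110 / 5
median_noisy = median(with_outlier)    # middle value of the sorted list

print(centroid_clean, centroid_noisy, median_noisy)
```

With the outlier included, the centroid sits far from every ordinary member of the cluster, which in turn distorts the nearest-centroid assignments.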

Clustering in SAS Enterprise Miner (EM)

The Scenario. The goal is to segment potential customers based on geographic and demographic attributes. Known attributes include such things as age, income, marital status, gender, and home ownership.

PROSPECT: The scenario. A catalog company periodically purchases lists of prospects from outside sources. They want to design a test mailing to evaluate the potential response rates for several different products. Based on their experience, they know that customer preference for their products depends on geographic and demographic factors. Consequently, they want to segment the prospects into groups that are similar to each other with respect to these attributes. After the prospects have been segmented, a random sample of prospects within each segment will be mailed one of several offers. The results of the test campaign will allow the analyst to evaluate the potential profit for each segment.

PROSPECT data set
