Clustering: Slides for Textbook, Chapter 8

What is Cluster Analysis? Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Cluster analysis: grouping a set of data objects into clusters. Clustering is unsupervised classification: there are no predefined classes. Typical applications: as a stand-alone tool to get insight into data distribution; as a preprocessing step

Presentation Transcript

Slide 1

Clustering: Slides for Textbook, Chapter 8. © Jiawei Han and Micheline Kamber, Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Canada. http://www.cs.sfu.ca Han: Clustering

Slide 2

General Applications of Clustering
- Pattern recognition
- Spatial data analysis
  - create thematic maps in GIS by clustering feature spaces
  - detect spatial clusters and explain them in spatial data mining
- Image processing
- Economic science (especially market research)
- WWW
  - document classification
  - cluster Weblog data to discover groups of similar access patterns

Slide 3

Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults

Slide 4

What Is Good Clustering?
- A good clustering method will produce high-quality clusters with
  - high intra-class similarity
  - low inter-class similarity
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Slide 5

Requirements of Clustering in Data Mining
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Able to deal with noise and outliers
- Insensitive to order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

Slide 6

Data Structures for Clustering
- Data matrix (two modes): n objects by p variables
- Dissimilarity matrix (one mode): n by n table of pairwise dissimilarities d(i, j)
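
The relationship between the two structures can be sketched in a few lines. This is only an illustration: the function name `dissimilarity_matrix`, the `manhattan` helper, and the sample rows are assumptions made for the example, not part of the slides. The result is stored as a lower triangle because d(i, j) = d(j, i) and d(i, i) = 0.

```python
def manhattan(x, y):
    """City-block distance between two equal-length numeric rows."""
    return sum(abs(a - b) for a, b in zip(x, y))


def dissimilarity_matrix(data, dist):
    """Build the lower-triangular dissimilarity matrix from a data matrix.

    `data` is a list of n rows (objects), each with p values (variables).
    Row i of the result holds d(i, 0), ..., d(i, i).
    """
    n = len(data)
    return [[dist(data[i], data[j]) for j in range(i + 1)] for i in range(n)]


# Hypothetical data matrix: 3 objects, 2 variables.
data = [[0, 0], [1, 1], [2, 0]]
for row in dissimilarity_matrix(data, manhattan):
    print(row)
```

Any distance function with the same two-argument shape can be passed in place of `manhattan`.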

Slide 7

Measure the Quality of Clustering
- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, which is typically metric: d(i, j)
- There is a separate "quality" function that measures the "goodness" of a cluster.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define "similar enough" or "good enough": the answer is typically highly subjective.

Slide 8

Types of Data in Clustering Analysis
- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types

Slide 9

Interval-Valued Variables
- Standardize data
  - Calculate the mean absolute deviation: s_f = (1/n)(|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|), where m_f = (1/n)(x_1f + x_2f + ... + x_nf)
  - Calculate the standardized measurement (z-score): z_if = (x_if - m_f) / s_f
- Using mean absolute deviation is more robust than using standard deviation
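
As a small worked sketch of this standardization, the function below divides each centered value by the mean absolute deviation rather than the standard deviation. The function name `standardize` and the sample values are illustrative assumptions.

```python
def standardize(values):
    """Return z-scores of one variable, using the mean absolute deviation.

    m_f is the mean of the variable and s_f is the average of |x - m_f|.
    """
    n = len(values)
    m_f = sum(values) / n
    s_f = sum(abs(x - m_f) for x in values) / n
    if s_f == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m_f) / s_f for x in values]


# Hypothetical measurements of a single interval-scaled variable.
print(standardize([2.0, 4.0, 6.0, 8.0]))
```

Because deviations are not squared, a single extreme value inflates s_f less than it would inflate the standard deviation, which is the robustness the slide refers to.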

Slide 10

Similarity and Dissimilarity Between Objects
- Distances are normally used to measure the similarity or dissimilarity between two data objects
- Some popular ones include the Minkowski distance: d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q), where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer
- If q = 1, d is the Manhattan distance: d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

Slide 11

Similarity and Dissimilarity Between Objects (Cont.)
- If q = 2, d is the Euclidean distance: d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2)
- Properties
  - d(i, j) ≥ 0
  - d(i, i) = 0
  - d(i, j) = d(j, i)
  - d(i, j) ≤ d(i, k) + d(k, j)
- One can also use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures.
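
The Minkowski family can be illustrated with one short function, where q = 1 gives the Manhattan distance and q = 2 the Euclidean distance. The name `minkowski` and the sample points are assumptions for this sketch.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    if q <= 0:
        raise ValueError("q must be positive")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical 2-dimensional objects.
i, j, k = (0, 0), (3, 4), (3, 0)
print(minkowski(i, j, 1))  # Manhattan distance
print(minkowski(i, j, 2))  # Euclidean distance
# Triangle inequality for this particular triple of points.
print(minkowski(i, j, 2) <= minkowski(i, k, 2) + minkowski(k, j, 2))
```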

Slide 12

Binary Variables
- A contingency table for binary data:

                  Object j
                  1      0      sum
  Object i   1    a      b      a+b
             0    c      d      c+d
           sum    a+c    b+d    p

- Simple matching coefficient (invariant, if the binary variable is symmetric): d(i, j) = (b + c) / (a + b + c + d)
- Jaccard coefficient (noninvariant if the binary variable is asymmetric): d(i, j) = (b + c) / (a + b + c)

Slide 13

Dissimilarity between Binary Variables
- Example
  - gender is a symmetric attribute
  - the remaining attributes are asymmetric binary
  - let the values Y and P be set to 1, and the value N be set to 0
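
A minimal sketch of both binary coefficients follows. The counts a, b, c, d use the usual contingency-table meaning (a: both 1; b: 1 for object i and 0 for j; c: 0 for i and 1 for j; d: both 0). The two records are hypothetical and are not the data from the slide's example table; the function names are also assumptions.

```python
def encode(values):
    """Map 'Y' and 'P' to 1 and 'N' to 0, following the slide's convention."""
    return [1 if v in ("Y", "P") else 0 for v in values]


def contingency(x, y):
    """Return the counts (a, b, c, d) for two 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d


def simple_matching(x, y):
    """Dissimilarity for symmetric binary variables: (b + c) / (a + b + c + d)."""
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)


def jaccard(x, y):
    """Dissimilarity for asymmetric binary variables: (b + c) / (a + b + c)."""
    a, b, c, _ = contingency(x, y)
    total = a + b + c
    return (b + c) / total if total else 0.0


# Hypothetical asymmetric attributes of two records.
rec_1 = encode(["Y", "N", "P", "N", "N", "N"])
rec_2 = encode(["Y", "N", "P", "N", "P", "N"])
print(simple_matching(rec_1, rec_2))
print(jaccard(rec_1, rec_2))
```

The Jaccard version ignores the negative matches counted by d, which is why it suits asymmetric attributes where a shared 0 carries little information.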

Slide 14

Nominal Variables
- A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
- Method 1: simple matching
  - m: # of matches, p: total # of variables
  - d(i, j) = (p - m) / p
- Method 2: use a large number of binary variables
  - create a new binary variable for each of the M nominal states
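
Both methods for nominal variables are short enough to sketch directly. The function names and sample objects below are assumptions for illustration only.

```python
def nominal_dissimilarity(x, y):
    """Method 1: simple matching, d = (p - m) / p."""
    p = len(x)                                  # total number of variables
    m = sum(1 for u, v in zip(x, y) if u == v)  # number of matches
    return (p - m) / p


def to_binary(value, states):
    """Method 2: one binary variable per nominal state."""
    return [1 if value == s else 0 for s in states]


# Hypothetical objects described by three nominal variables.
print(nominal_dissimilarity(["red", "small", "round"], ["red", "large", "round"]))
print(to_binary("blue", ["red", "yellow", "blue", "green"]))
```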

Slide 15

Ordinal Variables
- An ordinal variable can be discrete or continuous
- Order is important, e.g., rank
- Can be treated like interval-scaled variables
  - replace x_if by its rank r_if in {1, ..., M_f}
  - map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by z_if = (r_if - 1) / (M_f - 1)
  - compute the dissimilarity using methods for interval-scaled variables
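
The rank mapping can be sketched as follows, assuming the ordered list of states is known in advance. The function name and the medal example are hypothetical.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Replace each ordinal value by z = (r - 1) / (M - 1), r being its rank."""
    m_f = len(ordered_states)
    if m_f < 2:
        raise ValueError("need at least two ordered states")
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (m_f - 1) for v in values]


# Hypothetical ordinal variable with three states, lowest first.
print(ordinal_to_unit_interval(["bronze", "gold", "silver"],
                               ["bronze", "silver", "gold"]))
```

The resulting values lie in [0, 1] and can then be fed to any interval-scaled distance.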

Slide 16

Ratio-Scaled Variables
- Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(-Bt)
- Methods:
  - treat them like interval-scaled variables: not a good choice! (why?)
  - apply a logarithmic transformation: y_if = log(x_if)
  - treat them as continuous ordinal data and treat their rank as interval-scaled
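
One way to see the effect of the logarithmic transformation is to apply it to values that grow as Ae^(Bt): the raw gaps between consecutive values keep widening, while the transformed gaps are all equal to B. The constants A = 2 and B = 0.5 and the function name are assumptions chosen for this sketch.

```python
import math


def log_transform(values):
    """Apply y = log(x) to a ratio-scaled (strictly positive) variable."""
    if any(v <= 0 for v in values):
        raise ValueError("ratio-scaled values must be positive")
    return [math.log(v) for v in values]


# Hypothetical measurements following A * e^(B * t) with A = 2, B = 0.5.
raw = [2.0 * math.exp(0.5 * t) for t in range(5)]
logged = log_transform(raw)

raw_gaps = [raw[t + 1] - raw[t] for t in range(4)]
log_gaps = [logged[t + 1] - logged[t] for t in range(4)]
print(raw_gaps)  # gaps grow with t on the original scale
print(log_gaps)  # gaps are (approximately) constant after the transform
```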

Slide 17

Variables of Mixed Types
- A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.
- One may use a weighted formula to combine their effects: d(i, j) = (sum over f of delta_ij(f) * d_ij(f)) / (sum over f of delta_ij(f))
  - f is binary or nominal: d_ij(f) = 0 if x_if = x_jf, or d_ij(f) = 1 otherwise
  - f is interval-based: use the normalized distance
  - f is ordinal or ratio-scaled: compute ranks r_if and z_if = (r_if - 1) / (M_f - 1), and treat z_if as interval-scaled
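
The weighted combination can be sketched as below, under several stated assumptions: the indicator for a variable is 0 when either value is missing (`None`) or when an asymmetric binary variable is 0 for both objects, and 1 otherwise; the "normalized distance" for an interval variable is taken to be the absolute difference divided by that variable's range; and ordinal or ratio-scaled variables are assumed to have been converted to [0, 1] beforehand and passed as interval variables with range 1. All names are hypothetical.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Combine per-variable dissimilarities into one value in [0, 1].

    kinds[f] is 'nominal', 'symmetric', 'asymmetric', or 'interval'.
    ranges maps the index of each interval variable to (max - min).
    """
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        xf, yf = x[f], y[f]
        if xf is None or yf is None:
            continue  # indicator is 0: a value is missing
        if kind == "asymmetric" and xf == 0 and yf == 0:
            continue  # indicator is 0: negative match on asymmetric binary
        if kind in ("nominal", "symmetric", "asymmetric"):
            d = 0.0 if xf == yf else 1.0
        elif kind == "interval":
            d = abs(xf - yf) / ranges[f]
        else:
            raise ValueError(f"unknown variable kind: {kind}")
        numerator += d
        denominator += 1.0
    if denominator == 0:
        raise ValueError("no comparable variables")
    return numerator / denominator


# Hypothetical objects: colour, a symmetric flag, an asymmetric flag, a measurement.
kinds = ["nominal", "symmetric", "asymmetric", "interval"]
obj_i = ["red", 1, 0, 10.0]
obj_j = ["blue", 1, 0, 15.0]
print(mixed_dissimilarity(obj_i, obj_j, kinds, {3: 20.0}))
```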

Slide 18

Major Clustering Approaches
- Partitioning algorithms: construct various partitions and then evaluate them by some criterion
- Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
- Density-based: based on connectivity and density functions
- Grid-based: based on a multiple-level granularity structure
- Model-based: a model is hypothesized for each of the clusters and the idea is to find the best fit of that model to each other

Slide 19

Partitioning Algorithms: Basic Concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters
- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimal: exhaustively enumerate all partitions
  - Heuristic methods: k-means and k-medoids algorithms
  - k-means (MacQueen'67): each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster

Slide 20

The K-Means Clustering Method
- Given k, the k-means algorithm is implemented in 4 steps:
  1. Partition objects into k nonempty subsets.
  2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
  3. Assign each object to the cluster with the nearest seed point.
  4. Go back to Step 2; stop when there are no more new assignments.
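
The four steps can be sketched in plain Python. This is an illustrative version, not a reference implementation: the initial partition is a simple round-robin split (an assumption, since the slide does not say how to partition), squared Euclidean distance is used to find the nearest seed point, and a cluster that becomes empty keeps its previous centroid. The names `kmeans` and `squared_distance` are hypothetical.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans(points, k, max_iter=100):
    """Return (labels, centroids) for a list of equal-length numeric tuples."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    dim = len(points[0])
    # Step 1: partition the objects into k nonempty subsets (round-robin).
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid (mean point) of each cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its old centroid
                centroids[c] = tuple(
                    sum(p[d] for p in members) / len(members) for d in range(dim)
                )
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop when no assignment changes; otherwise repeat from Step 2.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical 2-D data with two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

A different initial partition can lead to a different final clustering, since the procedure only converges to a local optimum.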

Slide 21

The K-Means Clustering Method: Example
[Figure omitted]

Slide 22

Comments on the K-Means Method
- Strength
  - Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
  - Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
- Weakness
  - Applicable only when the mean is defined; then what about categorical data?
  - Need to specify k, the number of clusters, in advance
  - Unable to handle noisy data and outliers
  - Not suitable for discovering clusters with non-convex shapes

Slide 23

Variations of the K-Means Method
- A few variants of k-means differ in
  - selection of the initial k means
  - dissimilarity calculations
  - strategies to calculate cluster means
- Handling categorical data: k-modes (Huang'98)
  - replacing means of clusters with modes
  - using new dissimilarity measures to deal with categorical objects
  - using a frequency-based method to update modes of clusters
- A mixture of categorical and numerical data: k-prototype method

Slide 24

The K-Medoids Clustering Method
- Find representative objects, called medoids, in clusters
- PAM (Partitioning Around Medoids, 1987)
  - starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
  - PAM works effectively for small data sets, but does not scale well for large data sets
- CLARA (Kaufmann & Rousseeuw, 1990)
- CLARANS (Ng & Han, 1994): randomized sampling
- Focusing + spatial data structure (Ester et al., 1995)

Slide 25

PAM (Partitioning Around Medoids) (1987)
- PAM (Kaufman and Rousseeuw, 1987), built in Splus
- Use real objects to represent the clusters
  1. Select k representative objects arbitrarily
  2. For each pair of non-selected object h and selected object i, calculate the total swapping cost TC_ih
  3. For each pair of i and h, if TC_ih < 0, i is replaced by h. Then assign each non-selected object to the most similar representative object
  4. Repeat steps 2-3 until there is no change
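
A compact sketch of the swap loop follows. It makes some simplifying assumptions that the slide does not specify: the first k objects serve as the arbitrary initial medoids, a swap is accepted as soon as its cost is negative, and TC_ih is computed as the difference between the total distance to the nearest medoid after and before the swap rather than by summing per-object contributions. The names `total_cost` and `pam` are hypothetical.

```python
import math


def total_cost(points, medoids, dist):
    """Sum over all objects of the distance to their nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k, dist=math.dist):
    """Return (medoid indices, labels) found by repeated improving swaps."""
    medoids = list(range(k))  # step 1: arbitrary initial representatives
    improved = True
    while improved:           # step 4: repeat until no swap helps
        improved = False
        for pos in range(k):              # selected object i = medoids[pos]
            for h in range(len(points)):  # non-selected object h
                if h in medoids:
                    continue
                candidate = medoids[:pos] + [h] + medoids[pos + 1:]
                # Step 2: total swapping cost of replacing i by h.
                tc_ih = (total_cost(points, candidate, dist)
                         - total_cost(points, medoids, dist))
                # Step 3: accept the swap only if it lowers the total cost.
                if tc_ih < 0:
                    medoids = candidate
                    improved = True
    # Assign each object to its most similar (nearest) representative.
    labels = [
        min(range(k), key=lambda c: dist(p, points[medoids[c]]))
        for p in points
    ]
    return medoids, labels


# Hypothetical 2-D data with two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
medoids, labels = pam(points, k=2)
print(medoids)
print(labels)
```

Recomputing the full cost for every candidate swap keeps the sketch short but is expensive, which is consistent with the remark that PAM does not scale well to large data sets.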

Slide 26

PAM Clustering: Total swapping cost TC_ih = Σ_j C_jih
[Figure omitted: diagrams with objects labeled i, h, j, and t]

Slide 27

Step 0 Step 1 Step 2 Step 3 Step 4 agglomer
