Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data

Technical report and R code can be found at www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm and www.genetics.ucla.edu/labs/horvath/kidneypaper/RCC.htm ...

Presentation Transcript

Slide 1

Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data. Steve Horvath, Biostatistics and Human Genetics, University of California, Los Angeles

Slide 2

Contents
Tissue microarray data
Random forest (RF) predictors
Understanding RF clustering: Shi, T. and Horvath, S. (2006) "Unsupervised learning with random forest predictors" J. Comp. Graph. Stat.
Applications to tissue microarray data: Shi et al (2004) "Tumor Profiling of Renal Cell Carcinoma Tissue Microarray Data" Modern Pathology; Seligson DB et al (2005) "Global histone modification patterns predict risk of prostate cancer recurrence" Nature

Slide 3

Acknowledgments
Former students & postdocs for TMA: Tao Shi, PhD; Tuyen Hoang, PhD; Yunda Huang, PhD; Xueli Liu, PhD
UCLA Tissue Microarray Core: David Seligson, MD; Aarno Palotie, MD; Arie Belldegrun, MD; Robert Figlin, MD; Lee Goodglick, MD; David Chia, MD; Siavash Kurdistani, MD

Slide 4

Tissue Microarray Data

Slide 5

Tissue Microarray vs. DNA Microarray

Slide 6

Tissue array section: ~700 tissue samples; 0.6 mm; 0.2 mm

Slide 7

Ki-67 expression in kidney cancer: high grade vs. low grade. Message: brown staining is related to tumor grade.

Slide 8

Multiple measurements per patient: several spots per tumor sample and several "scores" per spot. Each patient (tumor sample) is usually represented by multiple spots: 3 tumor spots and 1 matched normal spot. Maximum intensity = MAX; percent of cells staining = POS. Spots have a spot grade: NL, 1, 2, …

Slide 9

Properties of TMA data: highly skewed, non-normal, semi-continuous. Often a good idea to model them as ordinal variables with many levels. Staining scores of the same markers are highly correlated.

Slide 10

Histograms of tumor marker expression scores, POS and MAX: percent of cells staining (POS) and maximum intensity (MAX), shown for EpCam, P53, and CA9.

Slide 11

Thresholding methods for tumor marker expressions. Since clinicians and pathologists prefer thresholding tumor marker expressions, it is natural to use statistical methods that are based on thresholding covariates, e.g. regression trees, survival trees, rpart, forest predictors, etc. Dichotomized marker expressions are often fitted in a Cox (or alternative) regression model. Danger: over-fitting due to optimal cut-off selection. Several thresholding methods and ways of adjusting for multiple comparisons are reviewed in Liu X, Minin V, Huang Y, Seligson DB, Horvath S (2004) Statistical Methods for Analyzing Tissue Microarray Data. J of Biopharmaceutical Statistics, Vol 14(3), 671-685.
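The over-fitting danger comes from scanning many candidate cutoffs and keeping the one that looks best. A minimal Python sketch of such an optimal cut-point search (the marker values and outcomes below are made up for illustration; the original work used R):

```python
def best_cutoff(marker, event):
    """Scan candidate cutoffs for dichotomizing a marker and return the
    cutoff that maximizes the absolute difference in event rates
    between the two resulting groups."""
    best, best_diff = None, -1.0
    for cut in sorted(set(marker))[:-1]:  # every split point except the max
        hi = [e for m, e in zip(marker, event) if m > cut]
        lo = [e for m, e in zip(marker, event) if m <= cut]
        diff = abs(sum(hi) / len(hi) - sum(lo) / len(lo))
        if diff > best_diff:
            best, best_diff = cut, diff
    return best, best_diff

# Hypothetical staining scores and binary outcomes
marker = [10, 20, 30, 40, 50, 60, 70, 80]
event  = [0,  0,  0,  1,  0,  1,  1,  1]
cut, diff = best_cutoff(marker, event)
```

Because the cutoff is chosen to maximize the observed group difference, its apparent significance is inflated, which is why the cited paper reviews adjustments for multiple comparisons.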

Slide 12

Tumor class discovery. Keywords: unsupervised learning, clustering

Slide 13

Tumor class discovery. Molecular tumor classes = clusters of patients with similar gene expression profiles. Main road for tumor class discovery: DNA microarrays, proteomics, etc., with unsupervised learning (clustering, multi-dimensional scaling plots). Tissue microarrays have been used for tumor marker validation (supervised learning, Cox regression, etc.). Challenge: show that tissue microarray data can be used in unsupervised learning to find tumor classes — the road less traveled.

Slide 14

Tumor class discovery using DNA microarray data. Tumor class discovery entails using an unsupervised learning algorithm (e.g. hierarchical or k-means clustering, etc.) to automatically group tumor samples based on their gene expression pattern. Bullinger et al. N Engl J Med. 2004

Slide 15

Clusters involving TMA data may have unconventional shapes: low-risk prostate cancer patients are colored in black. Scatter plot involving 2 'dependent' tumor markers; the remaining, less dependent markers are not shown. The low-risk cluster can be described using the following rule: marker H3K4 > 45% and H3K18 > 70%. The intuition is quite different from that of Euclidean-distance-based clusters.

Slide 16

Unconventional shape of a clinically meaningful patient cluster: 3-dimensional scatter plot along tumor markers (MARKER 1, MARKER 2); low-risk patients are colored in black.

Slide 17

How to cluster patients on the basis of tissue microarray data?

Slide 18

A dissimilarity measure is a fundamental input for tumor class discovery. Dissimilarities between tumor samples are used in clustering and other unsupervised learning techniques. Commonly used dissimilarity measures include the Euclidean distance and 1 − correlation.
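The two dissimilarities named above can be written down directly. A short Python sketch (with hypothetical marker profiles) shows both:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def one_minus_correlation(x, y):
    """1 - Pearson correlation: 0 for perfectly correlated profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)

# Hypothetical profiles: y is a scaled copy of x
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
```

Note that the two measures capture different notions: here y = 2x, so 1 − correlation is 0 (identical shape) while the Euclidean distance is large.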

Slide 19

Challenge: conventional dissimilarity measures that work for DNA microarray data may not be optimal for TMA data. Dissimilarity measures that are based on the intuition of multivariate normal distributions (clusters have elliptical shapes) may not be optimal. For tumor marker data, one may want to use a different intuition: clusters are described using thresholding rules involving dependent markers. It may be desirable to have a dissimilarity that is invariant under monotonic transformations of the tumor marker expressions.

Slide 20

We have found that a random forest (Breiman 2001) dissimilarity can work well in the unsupervised analysis of TMA data. Shi et al 2004, Seligson et al 2005. http://www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm

Slide 21

Kidney cancer: comparing PAM clusters that result from using the RF dissimilarity versus the Euclidean distance. Kaplan-Meier plots for groups defined by cross-tabulating patients according to their RF and Euclidean-distance cluster memberships. Message: in this application, RF clusters are more meaningful with respect to survival time.

Slide 22

The RF dissimilarity is determined by dependent tumor markers. The RF dissimilarity focuses on the most dependent markers (1, 2). In some applications it is good to focus on dependent markers, since they may constitute a disease pathway. The Euclidean distance focuses on the most varying marker (4). Patients sorted by cluster.

Slide 23

The RF cluster can be described using a thresholding rule involving the most dependent markers: low-risk patient if marker1 > cut1 & marker2 > cut2. This kind of thresholding rule can be used to make predictions on independent data sets. Validation on an independent data set.
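A rule of this form is trivial to apply to new patients. A small sketch (the cutoff values echo the H3K4 > 45% and H3K18 > 70% example from the earlier slide, but both the cutoffs and the patient data here are purely illustrative):

```python
def low_risk(marker1, marker2, cut1=45.0, cut2=70.0):
    """Apply a thresholding rule of the form
    marker1 > cut1 AND marker2 > cut2 (illustrative cutoffs)."""
    return marker1 > cut1 and marker2 > cut2

# Hypothetical (marker1, marker2) staining percentages for 3 patients
patients = [(50.0, 80.0), (30.0, 90.0), (60.0, 65.0)]
calls = [low_risk(m1, m2) for m1, m2 in patients]
```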

Slide 24

Random Forest Predictors. Breiman L. Random forests. Machine Learning 2001;45(1):5-32. http://stat-www.berkeley.edu/users/breiman/RandomForests/

Slide 25

Tree predictors are the basic unit of random forest predictors. Classification and Regression Trees (CART) by Leo Breiman, Jerry Friedman, Charles J. Stone, and Richard Olshen. RPART library in R software: Therneau TM, et al.

Slide 26

An example of CART. Goal: for patients admitted to the ER, predict who is at higher risk of heart attack. Training data set: no. of subjects = 215; outcome variable = high/low risk; 19 noninvasive clinical and lab variables were used as the predictors.

Slide 27

CART construction (root node: high 17%, low 83%).
Is BP > 91? No: (high 70%, low 30%) classified as high risk! Yes: (high 12%, low 88%), split again.
Is age <= 62.5? Yes: (high 2%, low 98%) classified as low risk! No: (high 23%, low 77%), split again.
Is ST present? Yes: (high 50%, low 50%) labeled high risk! No: (high 11%, low 89%) classified as low risk!

Slide 28

CART construction. Binary: split a parent node into two child nodes. Recursive: each child node can be treated as a parent node. Partitioning: the data set is partitioned into mutually exclusive subsets in each split.
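Each binary split is chosen by searching a variable for the cutoff that makes the two child nodes as pure as possible. A minimal Python sketch using the Gini impurity (the age/risk data are made up; CART implementations also support other splitting criteria):

```python
def gini(labels):
    """Gini impurity of a binary-labeled node: 2 p (1 - p)."""
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(x, y):
    """Exhaustively search one variable for the binary split that
    minimizes the weighted Gini impurity of the two child nodes."""
    best_cut, best_imp = None, float("inf")
    for cut in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= cut]
        right = [yi for xi, yi in zip(x, y) if xi > cut]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if imp < best_imp:
            best_cut, best_imp = cut, imp
    return best_cut, best_imp

age = [45, 50, 55, 60, 65, 70]   # hypothetical predictor
risk = [0, 0, 0, 1, 1, 1]        # 1 = high risk (made-up outcomes)
cut, imp = best_split(age, risk)
```

Applying `best_split` recursively to each child node, until nodes are small or pure, yields the full tree.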

Slide 29

RF Construction …

Slide 30

Random Forest (RF). An RF is a collection of tree predictors such that each tree depends on the values of an independently sampled random vector.

Slide 31

Prediction by majority voting. The forest consists of N trees. Class prediction: each tree votes for a class; the predicted class C for an observation x is the plurality winner, argmax_C Σ_k I[f_k(x, T_k) = C].
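Majority voting over tree predictions can be sketched in a few lines (the per-tree vote labels below are illustrative):

```python
from collections import Counter

def forest_predict(tree_votes):
    """Aggregate per-tree class votes by plurality (majority) vote:
    the predicted class is the one receiving the most votes."""
    return Counter(tree_votes).most_common(1)[0][0]

# Hypothetical votes from a forest of 5 trees for one observation
votes = ["high", "low", "high", "high", "low"]
prediction = forest_predict(votes)
```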

Slide 32

Random forest predictors give rise to a dissimilarity measure

Slide 33

Intrinsic similarity measure. Terminal tree nodes contain few observations. If case i and case j both land in the same terminal node, increase the similarity between i and j by 1. At the end of the run, divide by 2 × no. of trees. Dissimilarity = sqrt(1 − similarity)
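Turning terminal-node co-occurrence into the RF dissimilarity can be sketched as follows (here the proximity is normalized by the number of trees, a common simplification of the bookkeeping described above; the leaf assignments are toy values rather than output of a real forest):

```python
import math

def rf_dissimilarity(leaf_ids):
    """leaf_ids[t][i] = terminal node of observation i in tree t.
    Proximity(i, j) = fraction of trees in which i and j share a
    terminal node; dissimilarity(i, j) = sqrt(1 - proximity)."""
    n_trees = len(leaf_ids)
    n = len(leaf_ids[0])
    dis = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            prox = sum(1 for t in range(n_trees)
                       if leaf_ids[t][i] == leaf_ids[t][j]) / n_trees
            dis[i][j] = math.sqrt(1 - prox)
    return dis

# Three observations, two trees: obs 0 and 1 always share a leaf
leaves = [[0, 0, 1],
          [2, 2, 3]]
d = rf_dissimilarity(leaves)
```

Observations that repeatedly land in the same terminal node get dissimilarity near 0; observations that never do get dissimilarity 1.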

Slide 34

Age, BP, …: Patient 1: 50, 85, …; Patient 2: 45, 80, …; Patient 3: … Dropping these patients down the tree of Slide 27 (Is BP > 91? Is age <= 62.5? Is ST present?), patients 1 and 2 end up in the same terminal node, so the proximity between them is increased by 1.

Slide 35

Unsupervised problem as a supervised problem (RF implementation). Key idea (Breiman 2003): label the observed data as class 1; generate synthetic observations and label them as class 2; construct an RF predictor to distinguish class 1 from class 2; use the resulting dissimilarity measure in unsupervised analysis.

Slide 36

Two standard ways of generating synthetic covariates: independent sampling from each of the univariate distributions of the variables (Addcl1 = independent marginals); independent sampling from uniforms such that each uniform has range equal to the range of the corresponding variable (Addcl2). Figure: scatter plot of the original (black) and synthetic (red) data based on Addcl2 sampling, axes x1 and x2 ranging from 0.0 to 1.0.
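The two sampling schemes can be sketched as follows (a pure-Python illustration with a made-up data matrix; Addcl1 preserves each marginal distribution but destroys the dependence between variables, while Addcl2 ignores the marginals too):

```python
import random

random.seed(0)  # for a reproducible illustration

def addcl1(data):
    """Synthetic class 2, Addcl1: sample each variable independently
    from its empirical marginal distribution."""
    cols = list(zip(*data))
    return [tuple(random.choice(col) for col in cols) for _ in data]

def addcl2(data):
    """Synthetic class 2, Addcl2: sample each variable independently
    from a uniform over the range of the corresponding variable."""
    cols = list(zip(*data))
    return [tuple(random.uniform(min(c), max(c)) for c in cols)
            for _ in data]

observed = [(0.1, 0.9), (0.2, 0.8), (0.3, 0.7)]  # made-up marker data
synthetic1 = addcl1(observed)
synthetic2 = addcl2(observed)
# Class labels for the supervised RF: observed = 1, synthetic = 2
labels = [1] * len(observed) + [2] * len(synthetic1)
```

An RF trained to separate class 1 from class 2 then implicitly learns the dependence structure of the observed data, and its proximities give the dissimilarity.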

Slide 37

RF clustering: compute a distance matrix from the RF; distance matrix = sqrt(1 − similarity matrix); conduct partitioning-around-medoids (PAM) clustering analysis; input parameter = no. of clusters k.
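Given the dissimilarity matrix, the PAM step picks k medoids minimizing the total distance of every observation to its nearest medoid. A brute-force sketch (feasible only for tiny n; real PAM, e.g. pam() in the R cluster package, uses efficient build and swap steps instead):

```python
from itertools import combinations

def pam_exhaustive(dist, k):
    """Tiny PAM-style clustering on a precomputed dissimilarity matrix:
    brute-force the k medoids that minimize the total distance of each
    observation to its nearest medoid, then assign labels."""
    n = len(dist)
    best_meds, best_cost = None, float("inf")
    for meds in combinations(range(n), k):
        cost = sum(min(dist[i][m] for m in meds) for i in range(n))
        if cost < best_cost:
            best_meds, best_cost = meds, cost
    labels = [min(best_meds, key=lambda m: dist[i][m]) for i in range(n)]
    return best_meds, labels

# Toy 4x4 dissimilarity matrix with two obvious groups {0,1} and {2,3}
d = [[0.0, 0.1, 0.9, 0.9],
     [0.1, 0.0, 0.9, 0.9],
     [0.9, 0.9, 0.0, 0.1],
     [0.9, 0.9, 0.1, 0.0]]
meds, labels = pam_exhaustive(d, 2)
```

Because PAM consumes only the dissimilarity matrix, any dissimilarity — Euclidean, 1 − correlation, or the RF dissimilarity — can be plugged in unchanged.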

Slide 38

Understanding RF clustering (theoretical studies). Shi, T. and Horvath, S. (2005) "Unsupervised learning with random forest predictors" J. Comp. Graph. Stat.

Slide 39

Abstract: random forest dissimilarity. Intrinsic variable selection focuses on dependent variables. Depending on the application, this can be desirable. Resulting clusters can often be described using…