Dental Data Mining: Practical Issues and Potential Pitfalls


Presentation Transcript

Slide 1

Dental Data Mining: Practical Issues and Potential Pitfalls
Stuart A. Gansky
University of California, San Francisco
Center to Address Disparities in Children's Oral Health
Support: US DHHS/NIH/NIDCR U54 DE14251

Slide 2

What is Knowledge Discovery and Data Mining (KDD)?
"Semi-automatic discovery of patterns, associations, anomalies, and statistically significant structures in data" (MIT Tech Review, 2001)
Interface of:
  Artificial Intelligence, Machine Learning
  Computer Science, Engineering
  Statistics
Association for Computing Machinery Special Interest Group on Knowledge Discovery in Data and Data Mining (ACM SIGKDD; sponsors the KDD Cup)

Slide 3

Data Mining as Alchemy (Pb → Au)

Slide 4

Some Potential KDD Applications in Oral Health Research
  Large surveys (e.g., NHANES)
  Longitudinal studies (e.g., VA Aging Study)
  Disease registries (e.g., SEER)
  Digital diagnostics (radiographic & others)
  Molecular biology (e.g., PCR, microarrays)
  Health services research/claims data
  Provider and workforce databases

Slide 5

Supervised Learning
  Regression
  k nearest neighbor
  Trees (CART, MART, boosting, bagging)
  Random Forests
  Multivariate Adaptive Regression Splines (MARS)
  Neural Networks
  Support Vector Machines
Unsupervised Learning
  Hierarchical clustering
  k-means

Slide 6

KDD Steps
  Collect & Store: Sample, Merge, Warehouse
  Pre-Process: Clean, Impute, Transform, Standardize, Register
  Analyze: Supervised, Unsupervised, Visualize
  Validate: Internal (Split Sample, Cross-validate, Bootstrap), External
  Act: Intervene, Set Policy

Slide 7

Data Quality

Slide 8

Example: Caries
Predicting disease with traditional logistic regression may have modeling challenges: nonlinearity (ANN better) & interactions (CART better) (Kattan et al., Comp Biomed Res, '98)
Want to compare the performance of logistic regression with popular data mining techniques (tree and artificial neural network models) in dental caries data
CART in caries (Stewart & Stamm, JDR, '91)

Slide 9

Example study: child caries
Background: ~20% of children have ~80% of caries (tooth decay)
University of Rochester longitudinal study (Leverett et al., J Dent Res, 1993)
466 1st-2nd graders without caries at baseline
Saliva samples & exams at regular intervals
Goal: Predict 24-month caries incidence (output)

Slide 10

18-month Predictors (Inputs)
Salivary bacteria
  Mutans Streptococci (log10 CFU/ml)
  Lactobacilli (log10 CFU/ml)
Salivary chemistry
  Fluoride (ppm)
  Calcium (mmol/l)
  Phosphate (ppm)

Slide 11

Modeling Methods
  Logistic Regression
  Neural Networks
  Decision Trees

Slide 12

Logistic Regression Models
[Figure: schematic surface of logit(primary dentition caries) over log10 Mutans Streptococci and fluoride (F, ppm)]

Slide 13

Tree Models
[Figure: schematic surface of logit(primary dentition caries) over log10 Mutans Streptococci and fluoride (F, ppm)]

Slide 14

Artificial Neural Networks
[Figure: schematic surface of logit(primary dentition caries) over log10 Mutans Streptococci and fluoride (F, ppm)]

Slide 15

Artificial Neural Network (p-r-1)
[Diagram: inputs x1, x2, ..., xp connect through weights w_ij to a hidden layer of neurons h1, h2, ..., hr, which connect through weights w_j to the output y]
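The p-r-1 architecture on this slide can be sketched as a single forward pass. This is a minimal illustration, not the presenter's implementation: the tanh hidden activation follows the settings listed later in the deck, while the logistic output unit, the function name `forward`, and every weight value below are assumptions made for the example.

```python
import math

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a p-r-1 network.

    x        : list of p inputs
    w_hidden : r lists of p weights (w_ij on the slide)
    b_hidden : r hidden-unit biases
    w_out    : r output weights (w_j on the slide)
    b_out    : output bias
    Assumes tanh hidden units and a logistic output (an assumption).
    """
    hidden = [
        math.tanh(b + sum(w * xi for w, xi in zip(ws, x)))
        for ws, b in zip(w_hidden, b_hidden)
    ]
    z = b_out + sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-z))  # predicted probability

# Hypothetical weights for a 2-2-1 network (illustration only, not fitted values).
w_hidden = [[0.5, -0.3], [0.1, 0.8]]
b_hidden = [0.0, -0.2]
w_out = [1.2, -0.7]
b_out = 0.1
p_hat = forward([1.0, 2.0], w_hidden, b_hidden, w_out, b_out)
```

With all weights and biases set to zero the network returns 0.5, which is a quick sanity check that the output is on the probability scale.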

Slide 16

Common Mistakes with ANN (Schwarzer et al., Stat Med, 2000)
  Too many parameters for sample size
  No validation
  No model complexity penalty (e.g., Akaike Information Criterion (AIC))
  Incorrect misclassification estimation
  Implausible function
  Incorrectly described network complexity
  Inadequate statistical competitors
  Insufficiently compared to statistical competitors

Slide 17

Validation
  Split sample (70% training/30% validation)
  Validation sample gives unbiased misclassification estimates
  K-fold Cross Validation
  Mean squared error (Brier Score)
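A split-sample check and the Brier score named on this slide can be sketched in a few lines. The helper names `split_sample` and `brier_score`, the seed, and the toy rows are assumptions for illustration; the slides do not say how the split was actually drawn.

```python
import random

def split_sample(rows, train_frac=0.7, seed=0):
    """Randomly split rows into training and validation parts."""
    rng = random.Random(seed)       # fixed seed so the split is repeatable
    shuffled = rows[:]              # copy; leave the caller's list untouched
    rng.shuffle(shuffled)
    cut = round(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def brier_score(y_true, p_pred):
    """Mean squared error between 0/1 outcomes and predicted probabilities."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

# Toy usage with hypothetical subject IDs.
train, valid = split_sample(list(range(10)))
```

A model is fit on `train` only, and `brier_score` is then computed on the predictions for `valid`, so the error estimate is not flattered by the fitting step.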

Slide 18

Why Validate? Example: Overfitting in 2 Dimensions

Slide 19

Data

Slide 20

Linear Fit to Data

Slide 21

High Degree Polynomial Fit to Data

Slide 22

10-Fold Cross-validation

Slide 23

10-Fold Cross-validation

Slide 24

10-Fold Cross-validation
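The overfitting example on the preceding slides (a linear fit versus a high-degree polynomial fit) can be reproduced in miniature with 10-fold cross-validation. This is a sketch on simulated data, not the data shown in the slides: the true line, noise level, polynomial degree 8, and all function names are assumptions chosen for illustration.

```python
import random

def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest power first),
    solved from the normal equations by Gauss-Jordan elimination."""
    n = degree + 1
    # Augmented matrix [X'X | X'y] for the monomial basis 1, x, x^2, ...
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]          # partial pivoting
        for r in range(n):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]

def predict(coefs, x):
    return sum(c * x ** k for k, c in enumerate(coefs))

def mse(coefs, xs, ys):
    return sum((predict(coefs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def kfold_mse(xs, ys, degree, k=10, seed=0):
    """Mean squared prediction error over k held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]            # k disjoint folds
    total = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        coefs = polyfit([xs[i] for i in train], [ys[i] for i in train], degree)
        total += sum((predict(coefs, xs[i]) - ys[i]) ** 2 for i in held_out)
    return total / len(xs)

# Simulated data (assumption): a straight line plus Gaussian noise.
rng = random.Random(1)
xs = [-1 + 2 * i / 39 for i in range(40)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 0.5) for x in xs]

train_lin, train_hi = mse(polyfit(xs, ys, 1), xs, ys), mse(polyfit(xs, ys, 8), xs, ys)
cv_lin, cv_hi = kfold_mse(xs, ys, 1), kfold_mse(xs, ys, 8)
```

The training error of the degree-8 fit can never exceed that of the linear fit, because the linear model is nested inside it. The cross-validated errors `cv_lin` and `cv_hi` are the fairer comparison, and on data like these the flexible fit will often look worse once it is scored on held-out points.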

Slide 25

Caries Example Model Settings
Logit
  Stepwise selection
  Alpha=.05 to enter, alpha=.20 to stay
  AIC to judge additional predictors
Tree
  Splitting criterion: Gini index
  Pruning: Proportion correctly classified
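The Gini index used as the tree splitting criterion can be illustrated for a binary outcome. This is a minimal sketch of how one candidate threshold is scored; the function names and toy values are assumptions, and it does not reproduce the search or pruning used by the software behind the slides.

```python
def gini(labels):
    """Gini impurity of 0/1 labels: 1 - p^2 - (1-p)^2 = 2p(1-p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def split_gini(values, labels, threshold):
    """Size-weighted Gini impurity of the two child nodes
    produced by the rule value < threshold."""
    left = [y for v, y in zip(values, labels) if v < threshold]
    right = [y for v, y in zip(values, labels) if v >= threshold]
    n = len(labels)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

# Hypothetical predictor values and caries outcomes (illustration only).
values = [1.0, 2.0, 3.0, 4.0]
labels = [0, 0, 1, 1]
best = split_gini(values, labels, 2.5)   # perfectly separates the classes
```

A tree grower evaluates `split_gini` over many candidate thresholds and predictors and keeps the split with the lowest weighted impurity.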

Slide 26

ANN Settings
Artificial Neural Network (5-3-1 = 22 df)
  Multilayer perceptron
  5 preliminary runs
  Levenberg-Marquardt optimization
  No weight decay parameter
  Average error selection
  3 hidden nodes/neurons
  Activation function: hyperbolic tangent
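The "5-3-1 = 22 df" figure follows from counting one weight per connection plus one bias per non-input unit. A small sketch of that arithmetic, with the helper name `mlp_param_count` being an assumption for illustration:

```python
def mlp_param_count(layers):
    """Number of weights and biases in a fully connected network.

    layers lists the unit counts per layer, e.g. [5, 3, 1].
    Each unit in a layer receives one weight from every unit in the
    previous layer plus one bias term.
    """
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

df_5_3_1 = mlp_param_count([5, 3, 1])   # (5+1)*3 + (3+1)*1
```

The same count for 2 or 4 hidden nodes gives 15 or 29 parameters, which is one way to see how quickly network size grows relative to the sample size.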

Slide 27

ANN Sensitivity Analyses
Random seeds: 5 values
  No differences
Weight decay parameters: 0, .001, .005, .01, .25
  Only slight differences for .01 and .25
Hidden nodes/neurons: 2, 3, 4
  3 appears to be optimal

Slide 28

Tree Model
Overall primary caries: 15% (N=322 training, N=144 validation)
Legend: node prevalence > overall (15%) or node prevalence < overall (15%)
Node prevalences by split:
  log10 MS < 7.08: 15%
  log10 MS ≥ 7.08: 91%
  log10 LB < 3.05: 10%
  log10 LB ≥ 3.05: 23%
  F ≥ .110: 0%
  F < .110: 100%
  log10 MS < 3.91: 3%
  log10 MS ≥ 3.91: 14%
  F < .056: 22%
  F ≥ .056: 25%

Slide 29

Receiver Operating Characteristic (ROC) Curves

Slide 30

Cumulative Captured Response Curves

Slide 31

Lift Chart
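The quantities behind a cumulative captured response curve and a lift chart can be computed by ranking subjects on predicted risk. This is a sketch under assumptions: the function names and the toy outcomes and scores are hypothetical, and the slides do not show how their charts were produced.

```python
def _ranked_outcomes(y_true, scores):
    """Outcomes ordered from highest to lowest predicted score."""
    pairs = sorted(zip(scores, y_true), key=lambda t: -t[0])
    return [y for _, y in pairs]

def captured_response(y_true, scores, fraction):
    """Share of all events found in the top `fraction` of ranked subjects."""
    ranked = _ranked_outcomes(y_true, scores)
    top_n = max(1, round(fraction * len(ranked)))
    return sum(ranked[:top_n]) / sum(ranked)

def lift(y_true, scores, fraction):
    """Event rate in the top `fraction` divided by the overall event rate."""
    ranked = _ranked_outcomes(y_true, scores)
    top_n = max(1, round(fraction * len(ranked)))
    top_rate = sum(ranked[:top_n]) / top_n
    return top_rate / (sum(ranked) / len(ranked))

# Hypothetical outcomes (1 = caries) and model scores, illustration only.
y = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
```

In this toy example the top 20% of subjects contain half of the events, so the captured response is 0.5 and the lift is 2.5 times the overall rate.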

Slide 32

Logistic Regression

Predictor   Beta   Std Err   Odds Ratio   95% CI
log10 MS    .238   .072      1.27         1.10 to 1.46
log10 LB    .311   .070      1.36         1.19 to 1.57
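The odds ratios and intervals on this slide are consistent with exponentiating each coefficient and its Wald limits. The sketch below assumes a normal-approximation interval with z = 1.96, which the slide does not state explicitly; the function name is also an assumption.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and Wald confidence limits from a logistic coefficient.

    Assumes the interval is exp(beta +/- z * se); z = 1.96 for 95%.
    """
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))

or_ms, lo_ms, hi_ms = odds_ratio_ci(0.238, 0.072)   # log10 MS row
or_lb, lo_lb, hi_lb = odds_ratio_ci(0.311, 0.070)   # log10 LB row
```

Rounded to two decimals, these reproduce the tabulated values, so each one-unit increase in log10 MS or log10 LB multiplies the odds of caries by roughly 1.27 or 1.36 respectively.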

Slide 33

MARS – MS at 4 Times

Slide 34

Predicted Quintiles
[Figure: standardized LOGMS4 (vertical axis, -2 to 2) by rank for variable PR_ANN (quintiles 0 to 4)]

Slide 35

Predicted Quintiles
[Figure: standardized LOGLB4 (vertical axis, -2 to 2) by rank for variable PR_ANN (quintiles 0 to 4)]

Slide 36

5-fold CV Results

            Logit   Tree   ANN
RMS error   .365    .363   .362
AUC         .680    .553   .707
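The two summary measures in this table can be computed from held-out predictions as follows. This is a sketch with hypothetical function names and toy numbers; it does not reproduce the slide's values. The AUC is computed here as the proportion of event/non-event pairs ranked correctly, with ties counted as half.

```python
import math

def rms_error(y_true, p_pred):
    """Root mean squared error between 0/1 outcomes and predicted probabilities."""
    return math.sqrt(sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true))

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of
    event scores against non-event scores (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy outcomes and predicted probabilities (illustration only).
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The two measures answer different questions: RMS error reflects how close the predicted probabilities are to the outcomes, while AUC reflects only how well the model ranks subjects, which is why models with nearly identical RMS error can differ noticeably in AUC.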

Slide 37

Summary
  Data quality and study design are important
  Use multiple methods
  Be sure to validate
  Graphical displays help interpretation
  KDD methods may provide advantages over traditional statistical models in dental data

Slide 39

Prediction is only as good as the data and model
