CS 345A Data Mining Lecture 1

Presentation Transcript

Slide 1

CS 345A Data Mining Lecture 1 Introduction to Web Mining

Slide 2

What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns.

Slide 3

Web Mining v. Data Mining. Structure (or lack of it): textual information and linkage structure. Scale: data generated per day is comparable to the largest conventional data warehouses. Speed: often need to react to evolving usage patterns in real time (e.g., marketing).

Slide 4

Web Mining topics: Web graph analysis; Power Laws and The Long Tail; Structured data extraction; Web advertising; Systems Issues

Slide 5

Web Mining topics: Web graph analysis; Power Laws and The Long Tail; Structured data extraction; Web advertising; Systems Issues

Slide 6

Size of the Web. Number of pages: technically, infinite. Much duplication (30-40%). Best estimate of "interesting" static HTML pages comes from search engine claims. Until last year, Google claimed 8 billion(?), Yahoo claimed 20 billion. Google recently announced that their index contains 1 trillion pages. How to explain the discrepancy?

Slide 7

The web as a graph. Pages = nodes, hyperlinks = edges. Ignore content. Directed graph. High linkage: 10-20 links/page on average. Power-law degree distribution.
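The graph view on this slide can be made concrete with a small sketch. The page names and links below are made up for illustration, and `build_graph` and `degrees` are hypothetical helper names rather than anything from the lecture; the point is only that pages become nodes, hyperlinks become directed edges, and in-degree and out-degree fall out by counting.

```python
from collections import defaultdict

# Hypothetical toy link data: (source page, target page) pairs.
links = [
    ("a.html", "b.html"),
    ("a.html", "c.html"),
    ("b.html", "c.html"),
    ("c.html", "a.html"),
    ("d.html", "c.html"),
]

def build_graph(edges):
    """Return an adjacency list mapping each page to the set of pages it links to."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
        graph[dst]  # touch dst so pages with no out-links still appear as nodes
    return dict(graph)

def degrees(graph):
    """Return (in_degree, out_degree) dictionaries for a directed graph."""
    out_deg = {page: len(targets) for page, targets in graph.items()}
    in_deg = {page: 0 for page in graph}
    for targets in graph.values():
        for dst in targets:
            in_deg[dst] += 1
    return in_deg, out_deg

graph = build_graph(links)
in_deg, out_deg = degrees(graph)
avg_out = sum(out_deg.values()) / len(graph)  # average links per page
```

On a real crawl the same counts would be computed over billions of edges, which is why the later slides turn to systems issues.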

Slide 8

Structure of Web graph. Let's take a closer look at structure. Broder et al (2000) studied a crawl of 200M pages and other smaller crawls. Bow-tie structure. Not a "small world".

Slide 9

Bow-tie Structure Source: Broder et al, 2000

Slide 10

What can the graph tell us? Distinguish "important" pages from unimportant ones: Page rank. Discover communities of related pages: Hubs and Authorities. Detect web spam: Trust rank.
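As a rough illustration of how link structure alone can rank pages, here is a minimal power-iteration sketch of PageRank. The toy graph, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for the example, not values given in the slides, and this is not a description of how any search engine implements it.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on an adjacency-list graph.

    graph maps each page to a list of pages it links to; every page must
    appear as a key. damping=0.85 is a conventional choice, not a value
    taken from the lecture.
    """
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not graph[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            if graph[p]:
                # Each page passes an equal share of its rank to every page it links to.
                share = damping * rank[p] / len(graph[p])
                for q in graph[p]:
                    new_rank[q] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: "c" receives the most in-links.
toy_graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_graph)
most_important = max(ranks, key=ranks.get)
```

In this toy graph the page with the most in-links ends up with the highest score, and page "d", which nothing links to, keeps only the baseline share.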

Slide 11

Web Mining topics: Web graph analysis; Power Laws and The Long Tail; Structured data extraction; Web advertising; Systems Issues

Slide 12

Power-law degree distribution Source: Broder et al, 2000
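A power-law degree distribution means the number of pages with degree k falls off roughly as k raised to a negative exponent, so the counts lie near a straight line on a log-log plot. The sketch below estimates that exponent by a least-squares fit in log-log space. The function name, the synthetic data, and the exponent 2.1 are illustrative assumptions, and a log-log regression is only a rough estimator on real, noisy data.

```python
import math

def fit_power_law_exponent(degree_counts):
    """Least-squares slope of log(count) against log(degree).

    degree_counts maps a degree k >= 1 to the number of pages with that
    degree. If count is proportional to k ** (-alpha), the points lie on a
    line of slope -alpha in log-log space, so alpha is returned as -slope.
    """
    points = [(math.log(k), math.log(c))
              for k, c in degree_counts.items() if k > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    return -cov / var

# Synthetic counts generated from an exact power law with exponent 2.1
# (an illustrative value, not a figure quoted from the slides).
synthetic = {k: 1e6 * k ** -2.1 for k in range(1, 101)}
alpha = fit_power_law_exponent(synthetic)
```

Because the synthetic counts follow the law exactly, the fit recovers the exponent up to floating-point error; measured degree counts would scatter around the line, especially in the tail.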

Slide 13

Power-laws galore. Structure: in-degrees, out-degrees, number of pages per site. Usage patterns: number of visitors. Popularity: e.g., products, movies, music.

Slide 14

The Long Tail Source: Chris Anderson (2004)

Slide 15

The Long Tail. Shelf space is a scarce commodity for traditional retailers. Also: TV networks, movie theaters,… The web enables near-zero-cost dissemination of information about products. More choice necessitates better filters: recommendation engines (e.g., Amazon). How Into Thin Air made Touching the Void a hit.

Slide 16

Web Mining topics: Web graph analysis; Power Laws and The Long Tail; Structured data extraction; Web advertising; Systems Issues

Slide 17

Extracting Structured Data http://www.simplyhired.com

Slide 18

Extracting Structured Data http://www.fatlens.com

Slide 19

Web Mining topics: Web graph analysis; Power Laws and The Long Tail; Structured data extraction; Web advertising; Systems Issues

Slide 20

Ads vs. search results

Slide 21

Ads vs. search results. Search advertising is the revenue model: a multi-billion-dollar industry in which advertisers pay for clicks on their ads. Interesting problems: What ads to show for a search? If I'm an advertiser, which search terms should I bid on, and how much should I bid?

Slide 22

Web Mining topics: Web graph analysis; Power Laws and The Long Tail; Structured data extraction; Web advertising; Systems Issues

Slide 23

Two Approaches to Analyzing Data. Machine Learning approach: emphasizes sophisticated algorithms, e.g., Support Vector Machines; data sets tend to be small and fit in memory. Data Mining approach: emphasizes big data sets (e.g., in the terabytes); data cannot even fit on a single disk! Necessarily leads to simpler algorithms.

Slide 24

Philosophy. In many cases, adding more data leads to better results than improving algorithms: Netflix, Google search, Google ads. More on my blog: Datawocky (datawocky.com)

Slide 25

Systems architecture (diagram): CPU, Memory, Disk. Labels: Machine Learning, Statistics; "Traditional" Data Mining.

Slide 26

Very Large-Scale Data Mining (diagram): CPU, Mem, Disk … Cluster of commodity nodes

Slide 27

Systems Issues. Web data sets can be very large: tens to hundreds of terabytes. Cannot mine on a single server! Need large farms of servers. How to organize hardware/software to mine multi-terabyte data sets without breaking the bank!
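One programming model for spreading this kind of work across many servers is MapReduce, which the Project slide lists as supported by the course infrastructure. The sketch below imitates the model in a single Python process to count in-links per page. The function names, the toy partitions, and the in-process `shuffle` step are assumptions for illustration and do not reflect the API of any particular MapReduce system.

```python
from collections import defaultdict
from itertools import chain

def map_links(record):
    """Map step: emit (target_page, 1) for every out-link of one page."""
    _page, out_links = record
    return [(target, 1) for target in out_links]

def shuffle(pairs):
    """Group intermediate values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce step: sum the counts for one target page."""
    return key, sum(values)

# Hypothetical input split into two partitions, standing in for two servers.
partitions = [
    [("a", ["b", "c"]), ("b", ["c"])],
    [("c", ["a"]), ("d", ["c"])],
]

mapped = chain.from_iterable(
    map_links(record) for part in partitions for record in part
)
in_link_counts = dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())
```

Each map call looks at one record and each reduce call at one key, so both phases can run on separate machines; only the grouping step in between needs to move data across the cluster.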

Slide 28

Web Mining topics: Web graph analysis; Power Laws and The Long Tail; Structured data extraction; Web advertising; Systems Issues

Slide 29

Project. Lots of interesting project ideas; if you can't think of one, please come discuss with us. Infrastructure: Aster Data cluster on Amazon EC2; supports both MapReduce and SQL. Data: Netflix, ShareThis, Google, WebBase, TREC.
