Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites from User Activity Data


Presentation Transcript

Slide 1

Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites from User Activity Data
Misha Bilenko and Ryen White, presented by Matt Richardson, Microsoft Research

Slide 2

Search = Modeling User Behavior
Retrieval functions estimate relevance from the behavior of several user groups:
Page authors create page content: TF-IDF/BM25, query-is-page-title, …
Page authors create links: PageRank/HITS, query-matches-anchor-text, …
Searchers submit queries and click on results: clickthrough, query reformulations.
Most user behavior happens beyond search engines: viewing results and browsing past them.
What can we capture, and how can we use it?

Slide 3

Prior Work
Clickthrough/implicit feedback methods:
Learning ranking functions from clicks and query chains [Joachims '02, Xue et al. '04, Radlinski-Joachims '05 '06 '07]
Combining clickthrough with traditional IR features [Richardson et al. '06, Agichtein et al. '06]
Activity-based user models for personalization [Shen et al. '05, Tan et al. '06]
Modeling browsing behavior [Anderson et al. '01, Downey et al. '07, Pandit-Olston '07]

Slide 4

Search Trails
Begin with a search engine query.
Continue until a terminating event:
Another query
Visit to an unrelated site (social networks, webmail)
Timeout, browser homepage, browser closing
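The trail-segmentation rule above can be sketched in a few lines. This is a hypothetical illustration: the event schema ("query"/"visit"/"timeout") and the set of unrelated sites are assumptions for the example, not the authors' actual log format.

```python
# Hypothetical sketch of segmenting a browsing log into search trails.
# Event kinds and the UNRELATED site set are illustrative assumptions.

UNRELATED = {"facebook.com", "mail.example.com"}  # social networks, webmail

def extract_trails(events):
    """Split a chronological (kind, value) event stream into (query, [pages]) trails.

    A trail starts at a search-engine query and ends at a terminating event:
    another query, a visit to an unrelated site, or a timeout.
    """
    trails, query, pages = [], None, []
    for kind, value in events:
        if kind == "query":
            if query is not None:
                trails.append((query, pages))   # previous trail ends
            query, pages = value, []            # new trail begins
        elif kind == "timeout" or (kind == "visit" and value in UNRELATED):
            if query is not None:
                trails.append((query, pages))
            query, pages = None, []             # now outside any trail
        elif kind == "visit" and query is not None:
            pages.append(value)                 # page on the current trail
    if query is not None:
        trails.append((query, pages))           # trail open at end of log
    return trails
```

A visit that occurs with no active query (e.g. right after a timeout) is simply ignored, matching the definition that a trail must begin with a query.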

Slide 5

Trails vs. Click Logs
Trails capture dwell time: both attention share and site visit counts are recorded.
Trails represent user activity across many websites: browsing sequences surface "under-ranked" pages.
Click logs are less noisy: position bias is easy to control.

Slide 6

Predicting Relevance from Trails
Task: given a trails corpus D = {q_i → (d_i1, …, d_ik)}, predict relevant websites for a new query q.
Trails give us the good pages for each query…
…can't we just look up the pages for new queries?
Not directly: 50+% of queries are unique, and page visits are also extremely sparse.
Solutions:
Query sparsity: term-based matching, language modeling.
Pageview sparsity: smoothing (domain-level prediction).
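A toy example makes the sparsity problem concrete: an exact lookup table built from trails returns nothing for an unseen query, while backing off to shared query terms still recovers candidate sites. The miniature corpus below is invented for illustration.

```python
# Toy illustration of query sparsity: exact lookup fails on unseen
# queries, so we back off to term-based matching. Corpus is invented.

corpus = {
    "climbing carabiners": ["rei.com", "blackdiamond.com"],
    "carabiner reviews":   ["outdoorgearlab.com"],
}

def lookup_exact(query):
    """Return sites only if this exact query was seen in the trails."""
    return corpus.get(query, [])

def lookup_by_terms(query):
    """Return sites from every trail whose query shares at least one term."""
    terms = set(query.split())
    sites = []
    for q, ds in corpus.items():
        if terms & set(q.split()):
            sites.extend(ds)
    return sites
```

For the unseen query "aluminum carabiners", `lookup_exact` finds nothing, but `lookup_by_terms` matches on the shared term "carabiners" and returns the sites from the first trail.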

Slide 7

Model 1: Heuristic
Documents ≈ websites; contents ≈ queries preceding websites in trails.
Split queries into terms and compute frequencies; terms include unigrams, bigrams, and named entities.
Relevance scoring is analogous to BM25 (TF-IDF): query-term frequency (QF) and inverse query frequency (IQF) terms combine corpus statistics and website popularity.
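A minimal sketch of this idea: treat each site as a "document" whose "contents" are the terms of queries that preceded it in trails, then score with a TF-IDF-style product of query-term frequency (QF) and a smoothed inverse query frequency (IQF). The paper's actual weighting is BM25-style and includes popularity terms; this is only the simplest analog.

```python
import math
from collections import defaultdict

def build_index(trails):
    """trails: iterable of (query, [sites]). Returns site -> term -> QF count."""
    qf = defaultdict(lambda: defaultdict(int))
    for query, sites in trails:
        for term in query.split():
            for site in sites:
                qf[site][term] += 1
    return qf

def score(query, site, qf):
    """TF-IDF-style relevance of `site` for `query` (simplified analog)."""
    n_sites = len(qf)
    total = 0.0
    for term in query.split():
        tf = qf[site].get(term, 0)                       # QF: term freq for site
        df = sum(1 for s in qf if term in qf[s])         # sites whose queries contain term
        iqf = math.log((n_sites + 1) / (df + 1))         # smoothed IQF
        total += tf * iqf
    return total
```

Sites frequently preceded by a query's terms score higher, while terms that precede every site contribute little, mirroring the TF-IDF intuition.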

Slide 8

Model 2: Probabilistic
IR via language modeling [Zhai-Lafferty, Lavrenko].
The query-term distribution gives more mass to rare terms.
Term-site weights combine dwell time and visit counts.
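One way term-site weights might combine visit counts with dwell time is to let each visit contribute log(1 + dwell) rather than a flat count, so long, engaged visits count more than quick bounces. The log transform here is an assumption for illustration; the paper's weighting differs in detail but follows the same idea.

```python
import math
from collections import defaultdict

def term_site_weights(trails):
    """trails: iterable of (query, [(site, dwell_seconds)]).

    Each visit contributes log1p(dwell) to the (term, site) weight,
    blending visit counts with attention share (a sketch, not the
    paper's exact formula).
    """
    w = defaultdict(float)                    # (term, site) -> weight
    for query, visits in trails:
        for term in query.split():
            for site, dwell in visits:
                w[(term, site)] += math.log1p(dwell)
    return w
```

A site visited for two minutes thus outweighs one bounced off after a second, even though both register a single visit in a plain click log.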

Slide 9

Model 2: Probabilistic (cont.)
The basic probabilistic model is noisy: misspellings, synonyms, data sparsity.

Slide 10

Model 3: Random Walks
The basic probabilistic model is noisy: misspellings, synonyms, data sparsity.
Solution: random walk extension.
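The random-walk idea can be sketched as propagating probability mass over a bipartite term-site graph, so a site reachable only through related terms still receives mass even without direct co-occurrence. The two-hop walk length and uniform edge normalization below are illustrative assumptions, not the paper's exact construction.

```python
# Sketch of random-walk smoothing on a bipartite term-site graph.
# t2s / s2t map terms to weighted site edges and vice versa.

def normalize(row):
    """Turn raw edge weights into a probability row (no-op if empty)."""
    total = sum(row.values())
    return {k: v / total for k, v in row.items()} if total else row

def walk_scores(term, t2s, s2t):
    """Site distribution after a term -> site -> term -> site walk."""
    sites1 = normalize(t2s.get(term, {}))          # first hop to sites
    terms = {}
    for s, p in sites1.items():                    # hop back to terms
        for t, w in normalize(s2t.get(s, {})).items():
            terms[t] = terms.get(t, 0.0) + p * w
    sites2 = {}
    for t, p in terms.items():                     # second hop to sites
        for s, w in normalize(t2s.get(t, {})).items():
            sites2[s] = sites2.get(s, 0.0) + p * w
    return sites2
```

In the test below, "carabiner" has no direct edge to bd.com, but the walk reaches it through the synonym-like term "biner", which is exactly the smoothing the slide motivates.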

Slide 11

Evaluation
Train: 140+ million search trails (toolbar data).
Test: human-labeled relevance set, 33K queries.
q = [black diamond carabiners]

Slide 12

Evaluation (cont.)
Metric: NDCG (Normalized Discounted Cumulative Gain).
Preferable to MAP, Kendall's Tau, Spearman's, etc.:
Sensitive to top-ranked results.
Handles a variable number of results/target items.
Correlates well with user satisfaction [Bompada et al. '07].
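The metric can be made concrete with a short implementation. This uses the simple-gain formulation of NDCG (another common variant uses 2^rel − 1 as the gain; the transcript does not fix these details).

```python
import math

def dcg(rels):
    """Discounted cumulative gain of graded labels given in rank order."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))

def ndcg(rels, k=None):
    """NDCG@k: DCG of the ranking over DCG of the ideal reordering."""
    k = len(rels) if k is None else k
    best = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / best if best else 0.0
```

A perfect ranking scores 1.0 and any misordering scores less, and because the discount shrinks like 1/log2(rank + 1), swaps near the top cost far more than swaps near the bottom: exactly the top-rank sensitivity the slide claims.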

Slide 13

Evaluation (cont.)
Metric: NDCG (Normalized Discounted Cumulative Gain).
Perfect ranking vs. obtained ranking.

Slide 14

Results I: Domain Ranking
Predicting the correct ranking of domains for queries.

Slide 15

Results I: Domain Ranking (cont.)
Full trails vs. result clicks vs. "destinations".

Slide 16

Results I: Domain Ranking (cont.)
Scoring based on dwell times vs. visit counts.

Slide 17

Results I: Domain Ranking (cont.)
What's better than data? LOTS OF DATA! (NDCG@10)

Slide 18

Results II: Learning to Rank
Add Rel(q, d_i) as a feature to RankNet [Burges et al. '05].
Thousands of other features capture various content-, link- and clickthrough-based evidence.

Slide 19

Conclusions
Post-search browsing behavior (search trails) can be mined to extract users' implicit endorsement of relevant websites.
Trail-based relevance prediction provides a unique signal not captured by other (content, link, clickthrough) features.
Using full trails outperforms using only search result clicks or search trail destinations.
Probabilistic models incorporating random walks give the best accuracy by overcoming data sparsity and noise.

Slide 20

Model 3: Random Walks (cont.)

Slide 21

URLs vs. Sites
Website ≈ domain.
Sites: , Not sites: ,
Scoring: URL ranking; website ranking.