Using Bloom Filters to Refine Web Search Results


Presentation Transcript

Slide 1

Using Bloom Filters to Refine Web Search Results
Navendu Jain, Mike Dahlin (University of Texas at Austin)
Renu Tewari (IBM Almaden Research Center)
Department of Computer Sciences, UT Austin

Slide 2

Motivation
Google, Yahoo, MSN: a significant fraction of near-duplicates in the top search results.
Google "emacs manual" query: 7 of 20 results redundant; 3 identical sets; 4 similar to one document.
www.delorie.com/gnu/docs/emacs/emacs_toc.html
www.cs.utah.edu/dept/old/texinfo/emacs19/emacs_toc.html
www.dc.turkuamk.fi/docs/gnu/emacs/emacs_toc.html
www.linuxselfhelp.com/gnu/emacs/html section/emacs_toc.html
Similar results for the Yahoo, MSN, and A9 search engines.
29.2% of data is common across 150 million pages (Fetterly'03, Broder'97).

Slide 3

Problem Statement
Goal: filter near-duplicates in web search results.
Given a query's result set, identify pages that are either highly similar in content (and link structure), or contained in another page (inclusions with small changes).
Key constraints:
Low space overhead: use only a small amount of data per document.
Low time overhead (latency unnoticeable to the end user): perform fast comparison and matching of documents.

Slide 4

Our Contributions
A novel similarity detection technique using content-defined chunking and Bloom filters to refine web search results.
Satisfies the key requirements:
Compact representation: incurs only about 0.4% extra bytes per document.
Fast matching: 66 ms for the top 80 search results; document similarity computed via a bit-wise AND of the documents' feature representations.
Easy deployment: attached as a filter over any search engine's result set.

Slide 5

Talk Outline
Motivation
Our approach: System Overview; Bloom Filters for Similarity Testing
Experimental Evaluation
Related Work and Conclusions

Slide 6

System Overview
Applying similarity detection to web search engines.
Crawl time: the web crawler (1) fetches a page and indexes it, and (2) computes and stores per-page features.
Search time: the search engine (or the end user's browser) (1) retrieves the top results' meta-data for a given query, and (2) runs similarity testing to filter highly similar results, as sketched below.
[Diagram: documents feed feature extraction (feature set: small space, low complexity) and similarity testing (fast approximate comparison; a match is declared when feature overlap exceeds the similarity threshold).]
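To make the search-time step concrete, here is a minimal sketch (not the authors' implementation) of filtering a ranked result list; the result format, the pluggable similarity function, and the 0.75 threshold are assumptions made for illustration.

```python
def filter_results(results, similarity, threshold=0.75):
    """Drop results that are highly similar to an earlier-ranked result.

    `results` is a rank-ordered list of (url, bloom_filter) pairs whose
    Bloom filters were precomputed at crawl time; `similarity(a, b)` is a
    comparison function such as the bit-wise AND sketch on Slide 9.
    """
    kept = []
    for url, bf in results:
        # Keep a result only if it is not a near-duplicate of anything kept so far.
        if all(similarity(bf, kept_bf) < threshold for _, kept_bf in kept):
            kept.append((url, bf))
    return kept
```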

Slide 7

Feature Extraction and Similarity Testing (1): Content-defined Chunking
Divide a document into variable-sized blocks (called chunks), as sketched below.
Use Rabin fingerprints to compute the chunk boundaries.
The SHA-1 hash of each chunk is its feature representation.
[Diagram: an original document split into chunks 1-6; after data is inserted, the modified document yields chunks 1, 2', 3-6, so only the chunk containing the change differs.]
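As a rough illustration of content-defined chunking, the sketch below uses a simple polynomial rolling hash in place of a true Rabin fingerprint and declares a boundary when the hash is 0 modulo a divisor; the window size, divisor, minimum chunk size, and other constants are assumptions for this example, not parameters from the paper.

```python
import hashlib

def chunk_features(data: bytes, window=48, divisor=1024, min_chunk=256,
                   base=257, mod=(1 << 31) - 1):
    """Split `data` at content-defined boundaries and return the SHA-1
    digest of each chunk as its feature (simplified stand-in for Rabin)."""
    features, start, h = [], 0, 0
    power = pow(base, window - 1, mod)                 # weight of the outgoing byte
    for i, b in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * power) % mod   # drop the oldest byte
        h = (h * base + b) % mod                       # add the newest byte
        # Boundary: rolling hash hits the magic pattern and chunk is big enough.
        if h % divisor == 0 and i + 1 - start >= min_chunk:
            features.append(hashlib.sha1(data[start:i + 1]).hexdigest())
            start = i + 1
    if start < len(data):                              # trailing chunk
        features.append(hashlib.sha1(data[start:]).hexdigest())
    return features
```

Because boundaries depend only on local content, an insertion changes only the chunk it falls in (chunk 2 becoming 2' in the diagram), while the features of later chunks stay identical.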

Slide 8

Feature Extraction and Similarity Testing (2): Bloom Filter Generation
A Bloom filter is an approximate set representation: an array of m bits (initially 0) and k independent hash functions.
Supports Insert(x, S) and Member(y, S).
[Diagram: the SHA-1 chunk hashes of a document are inserted into the bit array via Insert(x, S), Insert(y, S), ...]
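A minimal Bloom filter sketch matching the slide's Insert/Member interface; the bit-array size m, the number of hash functions k, and deriving the k hashes by salting SHA-1 are illustrative assumptions.

```python
import hashlib

class BloomFilter:
    """Approximate set representation: m bits (initially 0), k hash functions."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)                 # one byte per bit, for clarity

    def _positions(self, item: bytes):
        # Derive k bit positions by salting SHA-1 with the hash-function index.
        for i in range(self.k):
            digest = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: bytes):               # Insert(x, S)
        for pos in self._positions(item):
            self.bits[pos] = 1

    def member(self, item: bytes) -> bool:       # Member(y, S); may false-positive
        return all(self.bits[pos] for pos in self._positions(item))
```

A document's filter would then be built by inserting each of its chunk features, e.g. `bf = BloomFilter()` followed by `bf.insert(f.encode())` for each feature from the chunking sketch above.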

Slide 9

Feature Extraction and Similarity Testing (3): Similarity Testing
Compare two documents by taking the bit-wise AND of their Bloom filters A and B.
[Diagram: A ∧ B; in the example, 75% of A's set bits are matched.]
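The comparison then reduces to counting overlapping set bits. The sketch below assumes the BloomFilter sketch from the previous slide and, as in the slide's 75% example, scores similarity as the fraction of A's set bits that are also set in B.

```python
def similarity(a, b):
    """Fraction of A's set bits also set in B (the bit-wise AND of A and B)."""
    matched = sum(1 for x, y in zip(a.bits, b.bits) if x and y)
    set_in_a = sum(1 for x in a.bits if x)
    return matched / set_in_a if set_in_a else 0.0

# Documents A and B are reported as near-duplicates when, for example,
# similarity(A, B) >= 0.75 (the similarity threshold).
```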

Slide 10

Proof-of-concept examples: differentiating between many similar documents.
IBM website (http://www.ibm.com) dataset: 20 MB (590 documents); /investor/corpgoverance/index.phtml compared against all pages; similar pages (same base URL): cgcoi.phtml (53%), cgblaws.phtml (69%).
CVS repository dataset: a technical documentation file (17 KB); extracted 20 consecutive versions from CVS: foo (original document), foo.1 (first modified version), ..., foo.19 (last version).

Slide 11

Talk Outline
Motivation
Our approach: System Overview; Bloom Filters for Similarity Testing
Experimental Evaluation
Related Work and Conclusions

Slide 12

Evaluation (1): Effect of the Degree of Similarity
"emacs manual" query on Google; 493 results retrieved using the GoogleAPI.
Fraction of duplicates: 88% (at 50% similarity), 31% (at 90% similarity).
Larger aliasing of higher-ranked documents: the initial result set is repeated more frequently in later results.
Similar results observed for other queries.
[Plot: percentage of duplicate documents vs. number of top search results retrieved (up to 500), with curves for the 50%, 60%, 70%, 80%, and 90% similarity thresholds.]

Slide 13

Evaluation (2): Effect of Search Query Popularity
[Plots: percentage of duplicate documents vs. number of top search results retrieved (up to 1000) for queries of varying popularity, including "jon stewart crossfire", "republican national convention", "hawking black hole bet", "day of the dead", "national hurricane center", "electoral college", "Olympics 2004 doping", "x prize spaceship", and "indian larry".]

Slide 14

Evaluation (3): Analyzing Response Times
Top-80 search results for the "emacs manual" query.
Offline computation time (pre-processed and stored): CDC chunking 80 × 0.3 ms; Bloom filter generation 80 × 14 ms.
Online matching time: bit-wise AND of two Bloom filters (4 µs); matching and clustering time 66 ms.
Total (offline + online): 1210 ms; online time only: 66 ms.

Slide 15

Selected Related Work
Most prior work is based on shingling (many variants).
Basic idea (Broder'97): divide a document into k-shingles (all k consecutive words/tokens); represent the document by its shingle set S; a large shingle-set intersection indicates near-duplicate documents; this reduces similarity detection to set intersection, as sketched below.
Differences from our method: document similarity is based on feature-set intersection; higher feature-set computational overhead; feature-set size depends on sampling (Min_s, Mod_m, etc.).
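For contrast with the chunk-based approach, a minimal sketch of the shingling baseline: represent each document by its set of k-word shingles and measure overlap by set intersection; k = 4 and whitespace tokenization are assumptions for this illustration.

```python
def shingles(text: str, k: int = 4) -> set:
    """All k consecutive word sequences (k-shingles) of a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(doc_a: str, doc_b: str, k: int = 4) -> float:
    """Shingle-set intersection similarity (Jaccard); high values
    indicate near-duplicate documents."""
    a, b = shingles(doc_a, k), shingles(doc_b, k)
    return len(a & b) / len(a | b) if (a or b) else 0.0
```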

Slide 16

Conclusions
Problem: highly similar matches in search results.
Popular search engines (Google, Yahoo, MSN) return a significant fraction of near-duplicates in their top results, which adversely affects search performance.
Our solution: a similarity detection technique using CDC and Bloom filters.
Incurs little meta-data overhead (0.4% extra bytes per document).
Performs fast similarity detection (bit-wise AND operations; on the order of ms).
Easily deployed as a filter over any search engine's results.

Slide 17

For more information: http://www.cs.utexas.edu/users/nav/
