Special Topics in Search Engines

Presentation Transcript

Slide 1

Special Topics in Search Engines
Result summaries. Anti-spamming. Duplicate elimination.

Slide 2

Results summaries

Slide 3

Summaries
Having ranked the documents matching a query, we wish to present a results list. Most commonly, this is the document title plus a short summary. The title is typically extracted automatically from document metadata. What about the summaries?

Slide 4

Summaries
Two basic kinds: static and dynamic. A static summary of a document is always the same, regardless of the query that hit the doc. Dynamic summaries are query-dependent: they attempt to explain why the document was retrieved for the query at hand.

Slide 5

Static summaries
In typical systems, the static summary is a subset of the document.
Simplest heuristic: the first 50 (or so; this can be varied) words of the document. The summary is cached at indexing time.
More sophisticated: extract from each document a set of "key" sentences. Simple NLP heuristics score each sentence, and the summary is made up of the top-scoring sentences.
Most sophisticated: NLP used to synthesize a summary. Seldom used in IR; cf. text summarization work.
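The two simpler heuristics can be sketched in a few lines of Python. This is only an illustration: the function names and the sentence score (average corpus-free term frequency within the document) are assumptions made here, not something the slides specify.

```python
import re
from collections import Counter


def prefix_summary(text, n_words=50):
    """Simplest heuristic: the first n_words words of the document."""
    return " ".join(text.split()[:n_words])


def key_sentence_summary(text, n_sentences=2):
    """Pick the top-scoring sentences and return them in document order.

    The score is a stand-in heuristic (an assumption for this sketch):
    the average in-document frequency of the words in the sentence.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z0-9]+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)
```

Either summary would be computed once and cached at indexing time, since neither depends on the query.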

Slide 6

Dynamic summaries
Present one or more "windows" within the document that contain several of the query terms: "KWIC" snippets (Keyword in Context presentation).
Generated in conjunction with scoring. If the query is found as a phrase, show the (or some) occurrences of the phrase in the doc. If not, show windows within the doc that contain multiple query terms.
The summary itself gives the entire content of the window (all terms, not only the query terms). How?
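A minimal sketch of the non-phrase case: slide a fixed-width window over the document tokens and keep the one covering the most distinct query terms. The function name, the fixed width, and the "first best window wins" tie-break are assumptions for illustration only.

```python
def best_window(tokens, query_terms, width=8):
    """Return the first width-token window with the most distinct query terms."""
    terms = {t.lower() for t in query_terms}
    best_start, best_hits = 0, -1
    # At least one window is examined even if the doc is shorter than width.
    for start in range(max(1, len(tokens) - width + 1)):
        window = tokens[start:start + width]
        hits = len(terms & {t.lower() for t in window})
        if hits > best_hits:
            best_start, best_hits = start, hits
    # The snippet shows every token in the window, not only the query terms.
    return " ".join(tokens[best_start:best_start + width])


doc = "we stayed at a lovely resort on maui last summer with friends".split()
print(best_window(doc, ["maui", "resort"], width=4))
```

A production system would also prefer windows that start and end on phrase boundaries, as discussed on the later slide about snippet quality.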

Slide 7

Generating dynamic summaries
If we have only a positional index, we cannot (easily) reconstruct the context surrounding hits.
If we cache the documents at index time, we can run the window through the cached copy, cueing to hits found in the positional index. E.g., the positional index says "the query is a phrase in position 4378", so we go to this position in the cached document and stream out the content.
Most often, we cache a fixed-size prefix of the doc.
Note: the cached copy can be outdated.
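The lookup described above can be sketched as follows, assuming the cache is simply a list of tokens and the positional index maps each term to its token offsets. Both structures and the function names are hypothetical simplifications.

```python
from collections import defaultdict


def build_positional_index(tokens):
    """Map each lowercased term to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok.lower()].append(pos)
    return index


def snippet_at(cached_tokens, position, before=3, after=5):
    """Stream out the cached content around a hit position.

    Returns None when the hit lies beyond the cached prefix, which can
    happen when only a fixed-size prefix of the doc was cached.
    """
    if position >= len(cached_tokens):
        return None
    start = max(0, position - before)
    end = min(len(cached_tokens), position + after + 1)
    return " ".join(cached_tokens[start:end])


tokens = "the quick brown fox jumps over the lazy dog".split()
index = build_positional_index(tokens)
cached_prefix = tokens[:5]  # only a prefix of the doc is cached
for pos in index["fox"]:
    print(snippet_at(cached_prefix, pos, before=2, after=2))
```

If the live document has changed since caching, the positions in the index and the cached text can disagree, which is the staleness caveat noted above.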

Slide 8

Dynamic summaries
Producing good dynamic summaries is a tricky optimization problem.
The real estate for the summary is normally small and fixed.
We want snippets short, so as to show as many KWIC matches as possible, and perhaps other things like the title.
We want snippets long enough to be useful.
We want linguistically well-formed snippets: users prefer snippets that contain complete phrases.
We want snippets that are maximally informative about the doc.
But users really like snippets, even if they complicate IR system design.

Slide 9


Slide 10

Adversarial IR (Spam)
Motives: commercial, political, religious, lobbies. Promotion funded by advertising budget.
Operators: contractors (Search Engine Optimizers) for lobbies and companies; web masters; hosting services.
Forum: Web Master World (www.webmasterworld.com), with search engine specific tricks and discussions about academic papers.

Slide 11

Search Engine Optimization I
Adversarial IR ("search engine wars")

Slide 12

Can you trust words on the page?
[Figure: screenshots of auctions.hitsoffice.com/ (pornographic content) and www.ebay.com/. Examples from July 2002.]

Slide 13

Simplest forms
Early engines relied on the density of terms. The top-ranked pages for the query maui resort were the ones containing the most occurrences of maui and resort.
SEOs responded with dense repetitions of chosen terms, e.g., maui resort maui resort maui resort.
Often, the repetitions would be in the same color as the background of the web page. The repeated terms got indexed by crawlers but were not visible to humans in browsers.
We can't trust the words on a web page for ranking.
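To see why density-based ranking is so easy to game, here is a toy scorer. The scoring formula (query-term occurrences divided by page length) is an assumed stand-in for "density of terms"; the slides do not give the formula early engines actually used.

```python
def term_density(page_text, query):
    """Fraction of the page's words that are query terms (toy ranking score)."""
    words = page_text.lower().split()
    if not words:
        return 0.0
    terms = query.lower().split()
    return sum(words.count(t) for t in terms) / len(words)


pages = {
    "honest": "book a quiet resort on maui with ocean views",
    "spam": "maui resort " * 20,  # hidden, repeated terms still get indexed
}
ranking = sorted(pages, key=lambda name: term_density(pages[name], "maui resort"),
                 reverse=True)
print(ranking)
```

The stuffed page scores 1.0 because every word is a query term, so it outranks the genuine page under this score.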

Slide 14

A few spam technologies
Cloaking: serve fake content to the search engine robot. DNS cloaking: switch IP address; impersonate.
Doorway pages: pages optimized for a single keyword that re-direct to the real target page.
Keyword spam: misleading meta-keywords, excessive repetition of a term, fake "anchor text"; hidden text with colors, CSS tricks, etc.
Link spamming: mutual admiration societies, hidden links, awards. Domain flooding: numerous domains that point or re-direct to a target page.
Robots: fake click stream; fake query stream; millions of submissions via Add-Url.

Slide 15

More spam techniques
Cloaking: serve fake content to the search engine spider. DNS cloaking: switch IP address; impersonate.
[Figure: cloaking flow. "Is this a search engine spider?" Y: serve SPAM. N: serve the real doc.]

Slide 16

Tutorial on Cloaking & Stealth Technology

Slide 17

Variants of keyword stuffing
Misleading meta-tags, excessive repetition.
Hidden text with colors, style sheet tricks, etc.
Meta-Tags = "… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …"

Slide 18

More spam techniques
Doorway pages: pages optimized for a single keyword that re-direct to the real target page.
Link spamming: mutual admiration societies, hidden links, awards (more on these later). Domain flooding: numerous domains that point or re-direct to a target page.
Robots: fake query stream (rank checking programs); "curve-fit" the ranking programs of search engines; millions of submissions via Add-Url.

Slide 19

The war against spam
Quality signals: prefer authoritative pages based on votes from authors (linkage signals) and votes from users (usage signals).
Policing of URL submissions: anti-robot test.
Limits on meta-keywords.
Robust link analysis: ignore statistically implausible linkage (or text); use link analysis to detect spammers (guilt by association).

Slide 20

The war against spam
Spam recognition by machine learning: training set based on known spam.
Family friendly filters: linguistic analysis, general classification techniques, etc. For images: flesh tone detectors, source text analysis, etc.
Editorial intervention: blacklists; top queries audited; complaints addressed.

Slide 21

Acid test
Which SEOs rank highly on the query search engine optimization?
Web search engines have policies on the SEO practices they tolerate or block. See pointers in Resources.
Adversarial IR: the unending (technical) battle between SEOs and web search engines. See for example http://airweb.cse.lehigh.edu/

Slide 22

Duplicate detection

Slide 23

Duplicate/Near-Duplicate Detection
Duplication: exact match with fingerprints.
Near-duplication: approximate match.
Overview: compute syntactic similarity with an edit-distance measure, then use a similarity threshold to detect near-duplicates. E.g., Similarity > 80% => documents are "near duplicates".
Near-duplication is not transitive, though it is sometimes used transitively.
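For the exact-duplicate case, a minimal sketch is to fingerprint each document and group documents whose fingerprints collide. The choice of SHA-1 over the raw text is an assumption made here for illustration; the slides do not name a fingerprint function.

```python
import hashlib
from collections import defaultdict


def exact_duplicate_groups(docs):
    """Group doc ids whose text has the same fingerprint.

    docs maps a doc id to its text. Only groups with more than one
    member are returned.
    """
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        fingerprint = hashlib.sha1(text.encode("utf-8")).hexdigest()
        groups[fingerprint].append(doc_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


docs = {"a": "same text", "b": "same text", "c": "other text"}
print(exact_duplicate_groups(docs))
```

A single changed character gives a different fingerprint, which is why near-duplicates need the approximate, similarity-based approach.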

Slide 24

Computing Similarity
Segments of a document (natural or artificial breakpoints) [Brin95].
Shingles (word k-grams) [Brin95, Brod98]. "a rose is a rose is a rose" => a_rose_is_a, rose_is_a_rose, is_a_rose_is.
Similarity measure between two docs (= sets of shingles): set intersection [Brod98], specifically Size_of_Intersection / Size_of_Union, i.e., the Jaccard measure.
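Shingling and the Jaccard measure are short enough to write out directly. The sketch below assumes whitespace tokenization and k = 4, which matches the rose example; the function names are illustrative.

```python
def shingles(text, k=4):
    """Return the set of word k-grams of text, joined with underscores."""
    words = text.lower().split()
    return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(s1, s2):
    """Size_of_Intersection / Size_of_Union for two sets."""
    union = s1 | s2
    if not union:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(s1 & s2) / len(union)


rose = shingles("a rose is a rose is a rose")
print(sorted(rose))
print(jaccard(rose, shingles("a rose is a rose")))
```

The eight-word rose sentence has five 4-grams but only three distinct ones, since the document is represented as a set.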

Slide 25

Shingles + Set Intersection
Computing the exact set intersection of shingles between all pairs of documents is expensive.
Approximate using a cleverly chosen subset of shingles from each (a sketch).
Estimate Jaccard from a short sketch: create a "sketch vector" (e.g., of size 200) for each document. Documents that share more than t (say 80%) of corresponding vector elements are similar.
For doc d, sketch_d[i] is computed as follows:
Let f map all shingles in the universe to 0..2^m.
Let p_i be a specific random permutation on 0..2^m.
Pick MIN p_i(f(s)) over all shingles s in d.
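A compact sketch of this construction follows. Two substitutions are assumptions made here and should be read as such: f is taken to be the first 64 bits of SHA-1 reduced modulo a prime, and each "random permutation" p_i is approximated by a random affine map x -> (a*x + b) mod PRIME, which is a bijection on 0..PRIME-1 but not a truly uniform random permutation.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # Mersenne prime; fingerprints live in 0..PRIME-1


def shingles(text, k=4):
    """Set of word k-grams, joined with underscores."""
    words = text.lower().split()
    return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprint(shingle):
    """f: map a shingle to an integer (assumed choice: truncated SHA-1)."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % PRIME


def make_permutations(n=200, seed=0):
    """n affine maps (a, b) standing in for random permutations p_1..p_n."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]


def sketch(shingle_set, perms):
    """sketch[i] = MIN p_i(f(s)) over all shingles s (set must be non-empty)."""
    prints = [fingerprint(s) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in prints) for a, b in perms]


def estimate_resemblance(sk1, sk2):
    """Fraction of corresponding sketch elements that agree."""
    return sum(1 for u, v in zip(sk1, sk2) if u == v) / len(sk1)
```

Both documents must be sketched with the same list of maps, otherwise corresponding elements are not comparable. Usage would look like `estimate_resemblance(sketch(shingles(d1), perms), sketch(shingles(d2), perms)) > 0.8` for the 80% threshold mentioned above.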

Slide 26

Shingling with sampling minima
Given two documents A1, A2, let S1 and S2 be their shingle sets.
Resemblance = |Intersection of S1 and S2| / |Union of S1 and S2|.
Let Alpha = min(p(S1)).
Let Beta = min(p(S2)).
Probability(Alpha = Beta) = Resemblance.

Slide 27

Computing Sketch[i] for Doc1
Start with 64-bit shingles. Permute on the number line with p_i. Pick the min value.
[Figure: Document 1's shingles placed on number lines running to 2^64.]

Slide 28

Test if Doc1.Sketch[i] = Doc2.Sketch[i]
[Figure: number lines running to 2^64 for Document 1 and Document 2, with minimum values A and B.]
Are these equal? Test for 200 random permutations: p_1, p_2, …, p_200.

Slide 29

However…
[Figure: number lines running to 2^64 for Document 1 and Document 2, with minimum values A and B.]
A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection).
This happens with probability Size_of_intersection / Size_of_union. Why?

Slide 30

Set Similarity
Set similarity (Jaccard measure): sim_J(C_i, C_j) = |C_i ∩ C_j| / |C_i ∪ C_j|.
View sets as columns of a matrix, with one row for each element in the universe. a_ij = 1 indicates presence of item i in set j.
Example: [Figure: 0/1 matrix with columns C_1 and C_2.] sim_J(C_1, C_2) = 2/5 = 0.4

Slide 31

Key Observation
For columns C_i, C_j, there are four types of rows:

      C_i  C_j
  A    1    1
  B    1    0
  C    0    1
  D    0    0

Overload notation: A = # of rows of type A (likewise B, C, D).
Claim: sim_J(C_i, C_j) = A / (A + B + C)

Slide 32

Min Hashing
Randomly permute rows.
h(C_i) = index of the first row with a 1 in column C_i.
Surprising property: P[h(C_i) = h(C_j)] = sim_J(C_i, C_j).
Why? Both are A / (A + B + C).
Look down columns C_i, C_j until the first non-type-D row: h(C_i) = h(C_j) if and only if that row is of type A.
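The property can be checked exactly on a small case by enumerating every row permutation rather than sampling. The two columns below are hypothetical, chosen for this sketch so that A = 2, B = 1, C = 2, D = 1; each column is assumed to contain at least one 1.

```python
from fractions import Fraction
from itertools import permutations


def min_hash(column, order):
    """Return the first row (in permuted order) where the column has a 1."""
    for row in order:
        if column[row] == 1:
            return row
    return None  # all-zero column; not expected in this sketch


def collision_probability(c1, c2):
    """Exact P[h(c1) = h(c2)] over all permutations of the rows."""
    total = agree = 0
    for order in permutations(range(len(c1))):
        total += 1
        if min_hash(c1, order) == min_hash(c2, order):
            agree += 1
    return Fraction(agree, total)


def jaccard(c1, c2):
    """A / (A + B + C) computed from the row types."""
    type_a = sum(1 for x, y in zip(c1, c2) if x == 1 and y == 1)
    non_d = sum(1 for x, y in zip(c1, c2) if x == 1 or y == 1)
    return Fraction(type_a, non_d)


C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]
print(collision_probability(C1, C2), jaccard(C1, C2))
```

Enumeration is only feasible for a handful of rows (6 rows means 720 permutations); in practice one samples a few hundred permutations, as in the sketch-vector construction.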

Slide 33

Mirror Detection
Mirroring is systematic replication of web pages across hosts. It is the single largest cause of duplication on the web.
Host1/a and Host2/b are mirrors iff for all (or most) paths p, whenever http://Host1/a/p exists, http://Host2/b/p exists as well with identical (or near-identical) content, and vice versa.
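The definition suggests a simple score: over all paths seen under either root, what fraction exist under both with the same content? The sketch below assumes each site is already available as a mapping from relative path to a content fingerprint, and it only handles identical content; the function name, the data layout, and the 0.8 cutoff for "most" are all illustrative assumptions.

```python
def mirror_score(site_a, site_b):
    """Fraction of paths (from either site) present on both with equal content.

    site_a and site_b map a relative path p to a content fingerprint.
    Taking the union of paths covers the "and vice versa" condition.
    """
    paths = set(site_a) | set(site_b)
    if not paths:
        return 0.0
    same = sum(1 for p in paths
               if p in site_a and p in site_b and site_a[p] == site_b[p])
    return same / len(paths)


def are_mirrors(site_a, site_b, threshold=0.8):
    """'For all (or most) paths': threshold is an assumed cutoff for 'most'."""
    return mirror_score(site_a, site_b) >= threshold


host1 = {"index.html": "f1", "about.html": "f2", "faq.html": "f3"}
host2 = {"index.html": "f1", "about.html": "f2", "faq.html": "f3"}
print(are_mirrors(host1, host2))
```

Supporting near-identical content would mean replacing the equality test with a resemblance estimate between the two pages.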

Slide 34

Mirror Detection example
http://www.elsevier.com/ and http://www.elsevier.nl/
Structural Classification of Proteins:
http://scop.mrc-lmb.cam.ac.uk/scop
http://scop.berkeley.edu/
http://scop.wehi.edu.au/scop
http://pdb.weizmann.ac.il/scop
http://scop.protres.ru/

Slide 35

Repackaged Mirrors
[Figure: screenshots of Auctions.lycos.com and Auctions.msn.com, Aug 2001.]