Identifying Sets of Related Words from the World Wide Web – Thesis Defense 06/09/2005


Presentation Transcript

Slide 1

Identifying Sets of Related Words from the World Wide Web. Thesis Defense, 06/09/2005. Pratheepan (Prath) Raveendranathan. Advisor: Ted Pedersen

Slide 2

Outline Introduction & Objective Methodology Experimental Results Conclusion Future Work Demo

Slide 3

Introduction: The objective of my thesis research is to use the World Wide Web as a source of information to identify sets of words that are related in meaning. Example: given two words {gun, pistol}, a possible set of related words would be {handgun, holster, shotgun, automatic rifle, weapon, ammunition, bullet, magazine}. Example: given the words {toyota, nissan, ford}, a possible set of related words would be {honda, gmc, chevy, mitsubishi}.

Slide 4

Examples Cont… Example: given two words {red, yellow}, a possible set of related words would be {white, black, blue, colors, green}. Example: given two words {George Bush, Bill Clinton}, a possible set of related words would be {Ronald Reagan, Jimmy Carter, White House, Presidents, USA, etc.}.

Slide 5

Application: Use sets of related words to classify the semantic orientation of reviews (Peter Turney). Use sets of related words to find the sentiment associated with a particular product (Rajiv Vaidyanathan and Praveen Agarwal).

Slide 6

Pros and Cons of using the Web. Pros: huge amounts of text; diverse text (encyclopedias, publications, commercial web pages); dynamic (constantly changing state). Cons: the Web poses a unique set of challenges; dynamic (constantly changing state): news sites, blogs; presence of redundant, noisy, or low-quality data: HTML tags, web language (home page, info, etc.).

Slide 7

Contributions: Developed an algorithm that predicts sets of related words using pattern matching techniques and frequency counts. Developed an algorithm that predicts sets of related words using a relatedness measure. Developed an algorithm that predicts sets of related words using a relatedness measure and an extension of the Log Likelihood score. Applied sets of related words to the problem of Sentiment Classification.

Slide 8

Outline Introduction & Objective Methodology Experimental Results Conclusion Future Work Demo

Slide 9

Interface to the Web – Google. Reasons for using Google: the research depends heavily on both the quantity and quality of the Web content; Google has a very effective ranking algorithm called PageRank which attempts to give more important or higher-quality web pages a higher ranking; the Google API is an interface which allows programmers to query more than 8 billion pages using the Google search engine (http://www.google.com/apis/).

Slide 10

Problems with the Google API: restricted to 1000 queries a day; 10 results per query; no "near" operator (proximity-based search); maximum of 1000 results. Alternative: the Yahoo API – 5000 queries a day (released recently), but no "near" operator either, and it cannot retrieve the number of hits. Note: Google was used only as a means of retrieving information from the Web.

Slide 11

Key Idea behind the Algorithms: Words that are related in meaning often tend to occur together. Example: "A Springfield, MA Chevrolet, Ford, Honda, Lexus, Mazda, Nissan, Saturn, Toyota car dealer with new and pre-owned vehicle sales and leasing."

Slide 12

Algorithm 1 Features: based on frequency; takes only single words as input; initial set of 2 words; frequency cutoff; ranked by frequency; smart stop list – the, if, me, why, you, etc. (non-content words); web stop list – web page, WWW, home page, personal, url, info, link, text, decoration, verdana, script, javascript.

Slide 13

Algorithm 1 – High-Level Description: 1. Create queries to Google based on the input terms. 2. Retrieve the top N web pages for each query and parse the retrieved web page content for each query. 3. Tokenize the web page content into a list of words and frequencies, discarding words that occur fewer than C times. 4. Find the words common to at least two of the sets of words; this set of intersecting words is the set of words related to the input terms. 5. Repeat the process for I iterations, using the set of related words from the previous iteration as input. (A sketch of the full loop follows below.)
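A minimal Python sketch of that loop, where build_queries, fetch_pages, tokenize_counts, and related_words are hypothetical helpers (each sketched under the corresponding step slide below), not code from the thesis. The parameter defaults mirror Trace 1.

```python
def algorithm1(terms, n_results=10, cutoff=15, iterations=2):
    """Hypothetical driver; each helper is sketched under its step slide.

    Defaults mirror Trace 1: 10 results, frequency cutoff 15, 2 iterations.
    """
    for _ in range(iterations):
        queries = build_queries(terms)               # Step 1
        word_sets = []
        for query in queries:
            text = fetch_pages(query, n_results)     # Step 2
            counts = tokenize_counts(text)           # Step 3
            # Keep only words occurring at least `cutoff` times.
            word_sets.append({w for w, c in counts.items() if c >= cutoff})
        terms = sorted(related_words(word_sets))     # Steps 4-5: feed back in
    return terms
```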

Slide 14

Algorithm 1 – Trace 1. Search terms: S1 = {pistol, gun}; frequency cutoff – 15; number of results (web pages) – 10; iterations – 2.

Slide 15

Algorithm 1 – Step 1: Create queries to Google based on permutations of the input terms: gun, pistol, gun AND pistol, pistol AND gun.
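The query set for n input terms appears to be each term alone plus every ordered pair joined with AND; a sketch of that generation step (a hypothetical helper, not code from the thesis):

```python
from itertools import permutations

def build_queries(terms):
    """Each term alone, plus every ordered pair joined with AND.

    For n terms this yields n + n*(n-1) = n**2 queries, which matches
    the 11**2 = 121 figure quoted for iteration 2 later in the talk.
    """
    return list(terms) + [f"{a} AND {b}" for a, b in permutations(terms, 2)]

# build_queries(["gun", "pistol"])
# -> ['gun', 'pistol', 'gun AND pistol', 'pistol AND gun']
```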

Slide 16

Algorithm 1 – Step 2: Issue each query to Google and retrieve the top 10 URLs for the query. For each URL, retrieve the web page content and parse the page for more links; traverse these links and retrieve the content of those pages as well. Repeat this process for each query.
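A sketch of the retrieval step using only the standard library; google_top_urls is a hypothetical stand-in for the Google API wrapper (the actual API calls are not shown on the slides), and only one layer of links is followed:

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href" and v]

def fetch_pages(query, n_results=10):
    """Fetch the top pages for a query plus one layer of linked pages."""
    chunks = []
    for url in google_top_urls(query, n_results):  # hypothetical API wrapper
        html = urlopen(url).read().decode("utf-8", errors="ignore")
        chunks.append(html)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link.startswith("http"):            # follow absolute links only
                chunks.append(urlopen(link).read().decode("utf-8", errors="ignore"))
    return " ".join(chunks)
```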

Slide 17

Trace 1 Cont… Web pages for the query gun

Slide 18

Trace 1 Cont… Web pages for the query pistol

Slide 19

Trace 1 Cont… Web pages for the query gun AND pistol

Slide 20

Trace 1 Cont… Web pages for the query pistol AND gun

Slide 21

Algorithm 1 – Step 3: Next, for the aggregate web page content retrieved for each query: remove HTML tags etc. and recover the text; remove stop words; tokenize the web page content into lists of words and frequencies. Note: this results in the following 4 sets of words, each set representing the words retrieved for one query.
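A sketch of the cleaning and counting step; the stop lists here are just excerpts of the lists from Slide 12, and the regex-based tag stripping is a deliberate simplification:

```python
import re
from collections import Counter

SMART_STOP = {"the", "if", "me", "why", "you"}                # excerpt of Slide 12's list
WEB_STOP = {"www", "url", "verdana", "script", "javascript"}  # excerpt

def tokenize_counts(html):
    """Strip markup, drop stop words, and count the remaining tokens."""
    text = re.sub(r"<[^>]+>", " ", html)          # crude HTML tag removal
    words = re.findall(r"[a-z]+", text.lower())   # tokenize
    words = [w for w in words if w not in SMART_STOP | WEB_STOP]
    return Counter(words)
```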

Slide 22

Words from the web pages after removing stop words

Slide 23

Algorithm 1 – Step 4: Find the words that are common to at least 2 of the sets (the sets for gun, pistol, gun AND pistol, and pistol AND gun). These intersecting words form the Related Set.
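The intersection step in Python; the toy sets in the comment are invented for illustration, not taken from the trace:

```python
from collections import Counter

def related_words(word_sets):
    """Words appearing in at least two of the per-query word sets."""
    membership = Counter(w for s in word_sets for w in s)
    return {w for w, n in membership.items() if n >= 2}

# Toy illustration (invented sets, one per query):
# related_words([{"rifle", "holster"}, {"rifle", "ammo"},
#                {"ammo", "trigger"}, {"holster"}])
# -> {'rifle', 'holster', 'ammo'}
```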

Slide 24

Related Set 1 – Iteration 1

Slide 25

Trace 1 Cont… Iteration 2: 11 input terms, so the search queries created are Rifle, Shooting, Guns, Cases, Airsoft, …, Shooting AND Guns, Guns AND Shooting, Guns AND Cases, etc. This results in 11² = 121 queries to Google! Note: as you can see, the number of queries to Google increases drastically.

Slide 26

Result Set 2 – {gun, pistol}

Slide 27

Algorithm 1 – {red, yellow}: Number of results – 10, Frequency cutoff – 15, Iterations – 1. Related words:

Slide 28

Problems with Algorithm 1: frequency-based ranking; number of input terms restricted to 2; input and output restricted to single words.

Slide 29

Algorithm 2 Features: based on frequency & relatedness score; can take single words or 2-word collocations as input; relatedness measure based on Jiang and Conrath; frequency cutoff and relatedness score cutoff; ranked by score; initial set can be more than 2 words; bigrams as output; smart stop list – the, if, me, why, you, etc.; web stop words + phrases – web page, WWW, home page, personal, url, info, link, text, decoration, verdana, script, javascript.

Slide 30

Algorithm 2 – High-Level Description: 1. Repeat the same steps as in Algorithm 1 to retrieve an initial set of related words (adding bigrams to the results as well). 2. For each word returned by Algorithm 1 as a related word, calculate the relatedness of the word to the input terms; discard any word or bigram with a relatedness score greater than the score cutoff; sort the remaining terms from most related to least related. 3. Repeat steps 1–2 for each iteration, using the set of words from the previous iteration as input. (A sketch of the filtering step follows below.)
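A sketch of the added filtering step. relatedness() is the measure defined on the next slide, where lower scores mean more related; averaging a candidate's score over the input terms is my assumption, since the slide does not say how scores against multiple input terms are combined.

```python
def algorithm2_filter(candidates, input_terms, score_cutoff=30):
    """Score each candidate and keep the ones close to the input terms.

    relatedness() is defined on the next slide; lower scores mean more
    related. Averaging over the input terms is an assumption - the
    slide does not state how multiple input terms are combined.
    """
    kept = {}
    for term in candidates:
        scores = [relatedness(term, seed) for seed in input_terms]
        score = sum(scores) / len(scores)
        if score <= score_cutoff:
            kept[term] = score
    return sorted(kept, key=kept.get)   # most related (lowest score) first
```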

Slide 31

Relatedness Measure (Distance Measure): relatedness(word1, word2) = log(hits(word1)) + log(hits(word2)) − 2 × log(hits(word1 word2)) (based on the measure by Jiang and Conrath). Example 1: hits(toyota) = 12,500,000, hits(ford) = 22,900,000, hits(toyota AND ford) = 50,000 → relatedness = 32.41. Example 2: hits(toyota) = 12,500,000, hits(ford) = 22,900,000, hits(toyota AND ford) = 150,000 → relatedness = 30.82.
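The measure transcribed directly. hits() is a hypothetical wrapper returning a search engine's hit count for a query, and the slide does not state the logarithm base, so base 2 is assumed here; Example 3 on the next slide evaluates to 0 in any base.

```python
import math

def hits(query):
    """Hypothetical: return the search engine's hit count for `query`."""
    raise NotImplementedError("wire up to the search API of your choice")

def relatedness(word1, word2):
    """log hits(w1) + log hits(w2) - 2 log hits(w1 AND w2).

    The slide does not state the log base (base 2 assumed here).
    Lower scores mean more closely related.
    """
    h1 = hits(word1)
    h2 = hits(word2)
    h12 = hits(f"{word1} AND {word2}")
    return math.log2(h1) + math.log2(h2) - 2 * math.log2(h12)

# Example 3 (next slide): hit counts of 1000, 1000, and 1000 give
# log(1000) + log(1000) - 2*log(1000) = 0 in any base.
```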

Slide 32

Relatedness Measure Cont… Example 3: hits(toyota) = 1000, hits(ford) = 1000, hits(toyota AND ford) = 1000 → relatedness(toyota, ford) = 0. As the measure approaches zero, the relatedness between the two terms increases.

Slide 33

Input Set – {gun, pistol}

Slide 34

Algorithm 2 – {red, yellow} Number of Results – 10 Frequency Cutoff - 10 Score Cutoff - 30 Iterations - 1

Slide 35

Problems with Algorithm 2: certain bigrams are not good collocations. For example, {sunny, cloudy} with number of results – 10, frequency cutoff – 15, bigram cutoff – 4, score cutoff – 30.

Slide 36

Algorithm 3 – High-Level Description: 1. Repeat the same steps as in Algorithm 1 to retrieve an initial set of related words (adding bigrams to the results as well). 2. For each term returned by Algorithm 1 as a related word: 2.1 If the term is a bigram, check whether it is a valid collocation; if it is, continue with step 2.2, otherwise remove the term from the set of related words. 2.2 Calculate the relatedness of the term to the input terms. 3. Discard any word or collocation with a relatedness score greater than the score cutoff. 4. Sort the remaining terms from most relevant to least relevant. (A sketch follows below.)
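A sketch of Algorithm 3's filtering stage, reusing the earlier relatedness sketch plus an is_valid_collocation check sketched after the bigram-verification slides below; the score averaging is again an assumption rather than something stated on the slide.

```python
def algorithm3_filter(candidates, input_terms, score_cutoff=30):
    """Algorithm 2's scoring step with a collocation check in front.

    is_valid_collocation() is the G^2-based test sketched after the
    bigram-verification slides below.
    """
    kept = {}
    for term in candidates:
        if " " in term and not is_valid_collocation(term):
            continue                     # step 2.1: drop invalid bigrams
        scores = [relatedness(term, seed) for seed in input_terms]
        score = sum(scores) / len(scores)
        if score <= score_cutoff:        # step 3: score cutoff
            kept[term] = score
    return sorted(kept, key=kept.get)    # step 4: most relevant first
```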

Slide 37

Verifying Bigrams: adapt the Log Likelihood (G²) score to web hit counts. Example, "New York": 4 queries to Google – "New *", "New York", "* York", "of the".

Slide 38

Expected Values (row total × column total / grand total, with row totals 621 and 5049, column totals 3560 and 2110, grand total 5670): (621 × 3560)/5670, (5049 × 3560)/5670, (621 × 2110)/5670, (5049 × 2110)/5670.

Slide 39

Identifying a "bad" collocation: a bigram is discarded if the observed value for the bigram is 0, or the observed value for the bigram is less than the expected value.
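Putting Slides 37–39 together: a sketch of the collocation test driven by the four web queries, where hits() is again the hypothetical hit-count wrapper from the relatedness sketch, and the expected count follows the (row total × column total) / grand total pattern of the worked example:

```python
def is_valid_collocation(bigram):
    """Collocation check driven by the four web queries from Slide 37.

    The bigram is kept only if it is observed at all and its observed
    count is at least its expected count (Slide 39's discard rule).
    """
    w1, w2 = bigram.split()
    n11 = hits(f'"{w1} {w2}"')   # observed count, e.g. "New York"
    n1p = hits(f'"{w1} *"')      # row total,      e.g. "New *"
    np1 = hits(f'"* {w2}"')      # column total,   e.g. "* York"
    npp = hits('"of the"')       # the slide's proxy for the grand total
    # Expected count = row total * column total / grand total,
    # the same arithmetic as Slide 38: (621 * 3560) / 5670, etc.
    expected = n1p * np1 / npp
    return n11 > 0 and n11 >= expected
```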

Slide 40

Example Bigrams

Slide 41

Outline Introduction & Objective Methodology Experimental Results & Evaluation Conclusion Future Work Demo

Slide 42

Evaluating Results: compare with Google Sets (http://labs.google.com/sets); human subject experiments – around 20 people expanded 2-word sets to what they felt was a set of related words.

Slide 43

F-measure, Precision and Recall
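For reference, the standard definitions, assuming the thesis scores a predicted set against a gold-standard set (Google Sets or a human-subject set) in the usual way:

```latex
P = \frac{|\mathrm{predicted} \cap \mathrm{gold}|}{|\mathrm{predicted}|}
\qquad
R = \frac{|\mathrm{predicted} \cap \mathrm{gold}|}{|\mathrm{gold}|}
\qquad
F = \frac{2PR}{P + R}
```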

Slide 44

Comparison of Algorithm 1 & 2

Slide 45

Algorithm 1 – {jordan, chicago}: number of results – 10, frequency cutoff – 15, iterations – 1. Precision = 0, Recall = 0, F-measure = 0.
