Opportunities in Natural Language Processing


Presentation Transcript

Slide 1

Opportunities in Natural Language Processing
Christopher Manning
Depts of Computer Science and Linguistics, Stanford University
http://nlp.stanford.edu/~manning/

Slide 2

Outline
Overview of the field
Why are language technologies needed?
What technologies are there?
What are interesting problems where NLP can and can't deliver progress?
NL/DB interface
Web search
Product info, email
Text categorization, clustering, IE
Finance, small devices, chat rooms
Question answering

Slide 3

What's the world's most used database?
Oracle? Excel? Perhaps Microsoft Word?
Does data only count as data when it's in columns?
But there's lots of other data: reports, spec sheets, customer feedback, plans, …
"The Unix philosophy"

Slide 4

"Databases" in 1992
Database systems (mostly relational) are the pervasive form of information technology, providing efficient access to structured, tabular data, primarily for governments and corporations: Oracle, Sybase, Informix, etc.
(Text) Information Retrieval systems are a small market dominated by a few large systems providing information to specialized markets (legal, news, medical, corporate info): Westlaw, Medline, Lexis/Nexis
Commercial NLP market basically nonexistent: mainly DARPA work

Slide 5

"Databases" in 2002
A lot of new things seem important: Internet, Web search, Portals, Peer-to-Peer, Agents, Collaborative Filtering, XML/Metadata, Data mining
Is everything the same, different, or just a mess?
There is more of everything, it's more distributed, and it's less structured.
Large textbases and information retrieval are a crucial component of modern information systems, and have a big impact on everyday people (web search, portals, email)

Slide 6

Linguistic data is ubiquitous
Most of the information in most companies, organizations, etc. is material in human languages (reports, customer email, web pages, discussion papers, text, sound, video) – not stuff in traditional databases
Estimates: 70%, 90% ?? [all depends how you measure]. Most of it.
Most of that information is now available in digital form:
Estimate for companies in 1998: about 60% [CAP Ventures/Fuji Xerox].
More like 90% now?

Slide 7

The problem
When people see text, they understand its meaning (by and large)
When computers see text, they get only character strings (and perhaps HTML tags)
We'd like computer agents to see meanings and be able to process text intelligently
These desires have led to many proposals for structured, semantically marked-up formats
But often human beings still resolutely make use of text in human languages
This problem isn't likely to just go away.

Slide 8

Why is Natural Language Understanding difficult?
The hidden structure of language is highly ambiguous
Structures for: Fed raises interest rates 0.5% in effort to control inflation (NYT headline 5/17/00)

Slide 9

Where are the ambiguities?

Slide 10

Translating user needs
User need -> User query -> Results
For an RDB, a lot of people know how to do this correctly, using SQL or a GUI tool
The answers coming out here will then be precisely what the user wanted

Slide 11

Translating user needs
User need -> User query -> Results
For meanings in text, no IR-style query gives one exactly what one wants; it only hints at it
The answers coming out may be roughly what was wanted, or can be refined
Sometimes!

Slide 12

Translating user needs
User need -> NLP query -> Results
For a deeper NLP analysis system, the system subtly translates the user's language
If the answers coming back aren't what was wanted, the user frequently has no idea how to fix the problem
Risky!

Slide 13

Aim: Practical applied NLP goals
Use language technology to add value to data by:
interpretation
transformation
value filtering
augmentation (providing metadata)
Two motivations:
The amount of information in textual form
Information integration needs NLP methods for coping with ambiguity and context

Slide 14

Knowledge Extraction Vision
Multi-dimensional Meta-data Extraction

Slide 15

Terms and technologies
Text processing
Stuff like TextPad (Emacs, BBEdit), Perl, grep. Blind to semantics and structure, but does what you tell it in a nice enough way. Still useful.
Information Retrieval (IR)
Implies that the computer will try to find documents which are relevant to a user while understanding nothing (big collections)
Intelligent Information Access (IIA)
Use of clever techniques to help users satisfy an information need (search or UI innovations)

Slide 16

Terms and technologies
Locating small stuff. Useful nuggets of information that a user wants:
Information Extraction (IE): Database filling
The relevant bits of text will be found, and the computer will understand enough to satisfy the user's communicative goals
Wrapper Generation (WG) [or Wrapper Induction]
Producing filters so agents can "reverse engineer" web pages intended for humans back to the underlying structured data
Question Answering (QA) – NL querying
Thesaurus/key phrase/terminology generation

Slide 17

Terms and technologies
Big stuff. Overviews of data:
Summarization
Of one document or a collection of related documents (cross-document summarization)
Categorization (documents)
Including text filtering and routing
Clustering (collections)
Text segmentation: subparts of big texts
Topic detection and tracking
Combines IE, categorization, segmentation

Slide 18

Terms and technologies
Digital libraries [text work has been unsexy?]
Text (Data) Mining (TDM)
Extracting nuggets from text. Opportunistic.
Unexpected connections that one can discover between bits of human recorded knowledge.
Natural Language Understanding (NLU)
Implies an attempt to completely understand the text …
Machine translation (MT), OCR, Speech recognition, etc.
Now available wherever software is sold!

Slide 19

Problems and approaches
find all web pages containing the word Liebermann
read the last 3 months of the NY Times and provide a summary of the campaign so far
Some places where I see less value
Some places where I see more value

Slide 20

Natural Language Interfaces to Databases
This was going to be the big application of NLP in the 1980s
> How many service calls did we receive from Europe last month?
I am listing the total service calls from Europe for November 2001.
The total for November 2001 was 1756.
It has recently been integrated into MS SQL Server (English Query)
Problems: needs largely hand-built custom semantic support (improved wizards in new version!)
GUIs more tangible and effective?
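As a rough sketch of what such an interface has to produce behind the scenes, the question on this slide can be read as a region filter plus a month filter over a table of calls. The table name `service_calls`, its columns, the sample rows, and the helper `count_calls` are all hypothetical, invented for illustration; nothing here reflects how English Query actually works. Note that the hard part, resolving "last month" to November 2001 and "service calls from Europe" to the right table and column, is done by hand in this sketch.

```python
import sqlite3

# Hypothetical schema and sample rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE service_calls ("
    "id INTEGER PRIMARY KEY, region TEXT, call_date TEXT)"
)
conn.executemany(
    "INSERT INTO service_calls (region, call_date) VALUES (?, ?)",
    [
        ("Europe", "2001-11-03"),
        ("Europe", "2001-11-17"),
        ("Asia", "2001-11-05"),
        ("Europe", "2001-10-29"),
    ],
)


def count_calls(region, year_month):
    """Count service calls for a region in a given 'YYYY-MM' month."""
    row = conn.execute(
        "SELECT COUNT(*) FROM service_calls "
        "WHERE region = ? AND substr(call_date, 1, 7) = ?",
        (region, year_month),
    ).fetchone()
    return row[0]


# "How many service calls did we receive from Europe last month?"
# with "last month" already resolved (by hand) to November 2001.
print(count_calls("Europe", "2001-11"))
```

The SQL itself is trivial; the custom semantic support the slide mentions is what maps the user's words onto `region` and `call_date` in the first place.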

Slide 21

NLP for IR/web search?
It's a no-brainer that NLP should be useful and used for web search (and IR in general):
Search for "Jaguar"
the computer should know or ask whether you're interested in big cats [scarce on the web], cars, or perhaps a molecule geometry and solvation energy package, or a package for fast network I/O in Java
Search for "Michael Jordan"
The basketballer or the machine learning guy?
Search for laptop, don't find notebook
Google doesn't stem:
Search for probabilistic model, and you don't match pages with probabilistic models.
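The stemming point can be illustrated with a minimal sketch that compares exact keyword matching with matching after a deliberately naive suffix stripper. The function names and the stripping rules are assumptions made for this example; this is not Google's matching logic and not a real stemming algorithm such as Porter's.

```python
def crude_stem(word):
    """Strip a few common English plural suffixes; deliberately naive."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


def tokens(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(query, page):
    """True if every query token appears verbatim in the page."""
    page_tokens = set(tokens(page))
    return all(t in page_tokens for t in tokens(query))


def stemmed_match(query, page):
    """True if every stemmed query token matches a stemmed page token."""
    page_stems = {crude_stem(t) for t in tokens(page)}
    return all(crude_stem(t) in page_stems for t in tokens(query))


page = "a survey of probabilistic models for retrieval"
print(exact_match("probabilistic model", page))
print(stemmed_match("probabilistic model", page))
```

The same crude rule also maps "news" to "new", which shows how stripping suffixes can merge unrelated words into one class.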

Slide 22

NLP for IR/web search?
Word sense disambiguation technology generally works well (like text categorization)
Synonyms can be found or listed
Lots of people have been into fixing this
e-Cyc had a beta version with Hotbot that disambiguated senses, and was going to go live in 2 months … 14 months ago
Lots of startups:
LingoMotors
iPhrase: "Traditional keyword search technology is hopelessly outdated"

Slide 23

NLP for IR/web search?
But in practice it's an idea that hasn't gotten much traction
Correctly finding linguistic base forms is straightforward, but produces little advantage over crude stemming, which only slightly over-conflates words into equivalence classes
Word sense disambiguation only helps on average in IR if it is over 90% accurate (Sanderson 1994), and that's about where we are
Syntactic phrases should help, but people have been able to get most of the mileage with "statistical phrases" – which have been aggressively integrated into systems recently

Slide 24

NLP for IR/web search?
People can easily scan among results (on their 21" monitor) … if you're above the fold
Much more progress has been made in link analysis, use of anchor text, etc.
Anchor text gives human-provided synonyms
Link or click stream analysis gives a form of pragmatics: what do people find correct or important (in a default context)
Focus on short, popular queries, news, etc.
Using human intelligence always beats artificial intelligence
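One way to picture anchor text acting as human-provided synonyms is an index keyed on the words people use when linking to a page, rather than on the page's own words. The link triples, the example.com-style URLs, and the helpers `build_anchor_index` and `lookup` below are hypothetical, a minimal sketch and not a description of any real search engine.

```python
from collections import defaultdict

# Hypothetical (source page, anchor text, target page) triples.
LINKS = [
    ("http://a.example/reviews", "cheap notebook computers",
     "http://shop.example/laptops"),
    ("http://b.example/blog", "laptop deals",
     "http://shop.example/laptops"),
    ("http://c.example/zoo", "big cats",
     "http://zoo.example/jaguar"),
]


def build_anchor_index(links):
    """Map each anchor-text term to the set of pages it points at."""
    index = defaultdict(set)
    for _source, anchor, target in links:
        for term in anchor.lower().split():
            index[term].add(target)
    return index


def lookup(index, term):
    """Return the pages that some link describes with this term."""
    return sorted(index.get(term.lower(), set()))


index = build_anchor_index(LINKS)
# The target page may only ever say "laptop", but a linking author
# called it a "notebook", so the query still finds it.
print(lookup(index, "notebook"))
```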

Slide 25

NLP for IR/web search?
Methods which make use of rich ontologies, etc., can work very well for intranet search within a customer's site (where anchor text, link, and click patterns are much less relevant)
But they don't really scale to the whole web
Moral: it's hard to beat keyword search for the task of general ad hoc document retrieval
Conclusion: one should move up the food chain to tasks where finer-grained understanding of meaning is required

Slide 27

Product data

Slide 28

Product information
C-net markets this information
How do they get most of it?
Phone calls
Typing.

Slide 29

Inconsistency: digital cameras
Image Capture Device: 1.68 million pixel 1/2-inch CCD sensor
Image Capture Device Total Pixels Approx. 3.34 million Effective Pixels Approx. 3.24 million
Image sensor Total Pixels: Approx. 2.11 million-pixel
Imaging sensor Total Pixels: Approx. 2.11 million 1,688 (H) x 1,248 (V)
CCD Total Pixels: Approx. 3,340,000 (2,140[H] x 1,560 [V] )
Effective Pixels: Approx. 3,240,000 (2,088 [H] x 1,550 [V] )
Recording Pixels: Approx. 3,145,000 (2,048 [H] x 1,536 [V] )
These all came off the same manufacturer's website!!
And this is a very technical domain. Try sofa beds.
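To make the inconsistency concrete, here is a minimal sketch of the kind of normalization an extraction system needs just to compare pixel counts across these strings. The two regular expressions and the function `pixel_count` are assumptions made for this example. They cover only the two notations seen on the slide ("3.34 million" and "3,340,000") and take the first number they find, so they would not cope with formats that do not appear here.

```python
import re

# "3.34 million", "2.11 million-pixel", "1.68 million pixel ..."
MILLION = re.compile(r"(\d+(?:\.\d+)?)\s*million", re.IGNORECASE)
# "3,340,000": digits grouped in thousands with commas
GROUPED = re.compile(r"\d{1,3}(?:,\d{3})+")


def pixel_count(spec):
    """Return the first pixel count found in a spec string, or None."""
    match = MILLION.search(spec)
    if match:
        return round(float(match.group(1)) * 1_000_000)
    match = GROUPED.search(spec)
    if match:
        return int(match.group(0).replace(",", ""))
    return None


specs = [
    "Image Capture Device: 1.68 million pixel 1/2-inch CCD sensor",
    "Image sensor Total Pixels: Approx. 2.11 million-pixel",
    "CCD Total Pixels: Approx. 3,340,000 (2,140[H] x 1,560 [V] )",
]
for spec in specs:
    print(pixel_count(spec), "<-", spec)
```

Even this handles only the number; deciding whether a figure is total, effective, or recording pixels still depends on the inconsistent labels around it.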

Slide 30

Product data/Com