Information Integration

Presentation Transcript

Slide 1

Information Integration

Slide 2

Data Integration Challenge
Find houses with 4 bedrooms priced under 300K
New employee
realestate.com, homeseekers.com, homes.com

Slide 3

Architecture of Data Integration Systems
Find houses with 4 bedrooms priced under 300K
mediated schema
source schema 1, source schema 2, source schema 3
wrappers (one per source)
homes.com, realestate.com, houses.com
Provide a uniform query interface

Slide 4

Architecture of Data Integration Systems
Find houses with 4 bedrooms priced under 300K

Mediated schema: price | city | numbeds | numbaths

price      location      beds   baths
$185,000   Urbana, IL    2      2
$270,000   Seattle, WA   3      -

price      location      beds
$185K      Urbana, IL    2
$299K      Kent, WA      3

wrapper
Raw listings (homeseekers.com):
$185,000 <em>Urbana, IL</em> 2 beds/2 baths Century 21
$270,000 <em>Seattle, WA</em> 3 beds REMAX realty

Involves many tasks
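The wrapper step on this slide can be illustrated with a minimal Python sketch. It assumes listings shaped like the two raw examples above; the function name `wrap_listing` and the regular expression are illustrative assumptions, not something the slides specify.

```python
import re

# Hypothetical pattern for listings shaped like the slide's examples, e.g.
#   $185,000 <em>Urbana, IL</em> 2 beds/2 baths Century 21
LISTING = re.compile(
    r"\$(?P<price>[\d,]+)\s*<em>(?P<location>[^<]+)</em>\s*"
    r"(?P<beds>\d+)\s*beds(?:/(?P<baths>\d+)\s*baths)?"
)

def wrap_listing(raw):
    """Turn one raw listing string into a tuple of the source schema."""
    m = LISTING.search(raw)
    if m is None:
        return None  # the listing does not follow the assumed layout
    return {
        "price": int(m.group("price").replace(",", "")),
        "location": m.group("location").strip(),
        "beds": int(m.group("beds")),
        # A missing bath count stays None (the "-" cell in the table).
        "baths": int(m.group("baths")) if m.group("baths") else None,
    }

rows = [
    wrap_listing("$185,000 <em>Urbana, IL</em> 2 beds/2 baths Century 21"),
    wrap_listing("$270,000 <em>Seattle, WA</em> 3 beds REMAX"),
]
```

A real wrapper would have to cope with far messier pages; the point here is only that it maps source-specific markup to structured tuples.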

Slide 5

Another Example
Uniform query capability across autonomous, heterogeneous data sources on LAN, WAN, or Internet

Slide 6

More Motivating Examples
An organization has on average 49 databases
  they can talk about the same topic, but use different vocabularies, different schemas
  how can we access them as if accessing a single db?
Many online bookstores
  amazon.com, barnes&noble.com, etc.
  how can we query them as if querying a single source?
Several CS websites in the US, in text format
  can we consolidate information about all of them and query them as if querying a giant relational database?

Slide 7

The General Problem
How can we access a set of heterogeneous, distributed, autonomous databases as if accessing a single database?
Arises in numerous contexts
  on the Web, at enterprises, military, scientific cooperation, bio-informatics domains, e-commerce, etc.
Currently very hot in both database research and industry

Slide 8

Current State of Affairs
Mostly ad-hoc programming: create a special solution for every case; pay consultants a lot of money.
Long-standing challenge in the DB community
AI/WWW communities are on board
Annual workshops, vision papers, ...
Companies: Informatica, many others, ...

Slide 9

A Brief Research History
Many early ad-hoc solutions
Converged into two approaches
  data warehousing vs. virtual DI systems
Semi-structured data, XML
Wrappers, information extraction
Other issues: query optimization, schema matching, ...
Current directions
  DI for specialized domains (e.g., bioinformatics)
  on-the-fly DI, entity-centric DI
  simplify integration tasks
New types of data sharing systems
  P2P systems, Semantic Web

Slide 10

Data warehousing vs. Virtual DI systems

Slide 11

Data Warehouse Architecture
OLAP / Decision support / Data cubes / Data mining
User queries
Relational database (warehouse)
Data extraction programs
Data cleaning/scrubbing
Data source, Data source, Data source

Slide 12

Data warehousing
Data warehousing: load all the data periodically into a warehouse.
  6-18 months lead time
Separates operational DBMS from decision support DBMS (not only a solution to data integration).
Performance is good; data may not be fresh.
Need to clean, scrub your data.

Slide 13

The Virtual Integration Architecture
Leave the data in the sources.
When a query comes in:
  Determine the relevant sources to the query
  Break down the query into sub-queries for the sources.
  Get the answers from the sources, and combine them appropriately.
Data is fresh.
Challenges: numerous
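The three query-time steps can be sketched in Python as follows. This is a toy mediator under stated assumptions: the sources are in-memory dictionaries rather than remote databases, the rows and the `bookstore.example` source are invented sample data, and the names `SOURCES`, `relevant_sources`, `sub_query` and `answer` are hypothetical.

```python
# Hypothetical source catalog: which attributes each source can answer,
# plus a few invented sample rows standing in for the remote data.
SOURCES = {
    "homes.com": {
        "attributes": {"price", "city", "numbeds"},
        "rows": [
            {"price": 250_000, "city": "Urbana, IL", "numbeds": 4},
            {"price": 350_000, "city": "Seattle, WA", "numbeds": 4},
        ],
    },
    "realestate.com": {
        "attributes": {"price", "city", "numbeds"},
        "rows": [
            {"price": 299_000, "city": "Kent, WA", "numbeds": 4},
            {"price": 185_000, "city": "Urbana, IL", "numbeds": 2},
        ],
    },
    "bookstore.example": {
        "attributes": {"title", "author", "price"},
        "rows": [{"title": "I, Robot", "author": "Isaac Asimov", "price": 12}],
    },
}

def relevant_sources(query_attrs):
    """Step 1: keep only sources that expose every attribute the query uses."""
    return [name for name, src in SOURCES.items()
            if query_attrs <= src["attributes"]]

def sub_query(source_name, predicate):
    """Step 2: the sub-query sent to one source (here, a local filter)."""
    return [row for row in SOURCES[source_name]["rows"] if predicate(row)]

def answer(query_attrs, predicate):
    """Step 3: collect the per-source answers and combine them."""
    results = []
    for name in relevant_sources(query_attrs):
        for row in sub_query(name, predicate):
            results.append({**row, "source": name})
    return results

# "Find houses with 4 bedrooms priced under 300K"
houses = answer({"price", "numbeds"},
                lambda r: r["numbeds"] == 4 and r["price"] < 300_000)
```

Because the data stays in the sources, every call to `answer` sees their current contents; the cost is that each query pays for source selection and the per-source round trips.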

Slide 14

Virtual Integration Architecture
User queries
Mediated schema
Mediator: reformulation engine, optimizer, execution engine
Data source catalog
Which data model?
wrapper
Data source, Data source, Data source
Sources can be: relational, hierarchical (IMS), structured files, web sites.

Slide 15

Architecture of (Virtual) Data Integration System
Find books written by Isaac Asimov & priced under $15
global query interface
query interface 1, query interface 2, query interface 3
amazon.com, bn.com, powell.com

Slide 16

A Brief History
Many early ad-hoc solutions
Converged into two approaches
  data warehousing vs. virtual DI systems
Semi-structured data, XML
Wrappers
Other issues: query optimization, schema matching, ...
Current directions
  DI for specialized domains (e.g., bioinformatics)
  on-the-fly DI, entity-centric DI
New types of data sharing systems
  P2P systems, Semantic Web

Slide 17

Semi-structured Data
What should be the underlying data model for DI contexts?
  the relational model is not an ideal choice
Developed the semi-structured data model
  started with the OEM (object exchange model)
Then XML came along
  It is now the most well-known semi-structured data model
  Generating much research in the DB community

Slide 18

HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
HTML is hard for applications

Slide 19

XML
<bibliography>
  <book>
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bibliography>
XML describes the content: easy for applications
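To illustrate why content-describing tags are easy for applications, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The document string is a single-book version of the slide's example with the full title filled in from the HTML slide; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

doc = """
<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
</bibliography>
"""

root = ET.fromstring(doc.strip())

# Each field is addressed by its tag name, not by its position on the page.
books = [
    {
        "title": book.findtext("title"),
        "authors": [a.text for a in book.findall("author")],
        "publisher": book.findtext("publisher"),
        "year": int(book.findtext("year")),
    }
    for book in root.findall("book")
]
```

Extracting the same fields from the HTML version would mean guessing that italic text is a title and that the text after a line break is a publisher and year.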

Slide 20

DTDs as Grammars
Same thing as:
  db ::= (book|publisher)*
  book ::= (title, author*, year?)
  title ::= string
  author ::= string
  year ::= string
  publisher ::= string
A DTD is an EBNF (Extended BNF) grammar
An XML tree is precisely a derivation tree
XML documents that have a DTD and conform to it are called valid
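The grammar view can be made concrete with a rough Python sketch: each content model becomes a regular expression over the sequence of child tag names, and a tree is accepted when every element's children match its rule. This is an illustration of the idea only, not a real DTD validator; `CONTENT_MODELS` and `is_valid` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# The slide's content models as regular expressions over a sequence of
# child tag names, each followed by one space.
CONTENT_MODELS = {
    "db": r"((book|publisher) )*",         # db   ::= (book|publisher)*
    "book": r"title (author )*(year )?",   # book ::= (title, author*, year?)
}

def is_valid(elem):
    """Check that elem and all its descendants follow the grammar."""
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        # title, author, year, publisher ::= string (no child elements)
        return len(elem) == 0
    children = "".join(child.tag + " " for child in elem)
    return (re.fullmatch(model, children) is not None
            and all(is_valid(child) for child in elem))

good = ET.fromstring(
    "<db><book><title>t</title><author>a</author><author>b</author>"
    "<year>1995</year></book><publisher>p</publisher></db>"
)
bad = ET.fromstring("<db><book><author>a</author><title>t</title></book></db>")

good_ok = is_valid(good)   # children of book follow title, author*, year?
bad_ok = is_valid(bad)     # author appears before title
```

For brevity the sketch treats unknown tags as string leaves; a validator would reject tags that the DTD does not declare.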

Slide 21

More on DTDs as Grammars
<!DOCTYPE paper [
  <!ELEMENT paper (section*)>
  <!ELEMENT section ((title, section*) | text)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT text (#PCDATA)>
]>
<paper>
  <section> <text> </text> </section>
  <section> <title> </title>
    <section> … </section>
    <section> … </section>
  </section>
</paper>
XML documents can be nested arbitrarily deep

Slide 22

XML for Representing Data
<persons>
  <row> <name>John</name> <phone>3634</phone> </row>
  <row> <name>Sue</name> <phone>6343</phone> </row>
  <row> <name>Dick</name> <phone>6363</phone> </row>
</persons>
XML tree:
persons
  row: name "John", phone 3634
  row: name "Sue", phone 6343
  row: name "Dick", phone 6363

Slide 23

XML vs. Data Models
XML is self-describing
Schema elements become part of the data
  Relational schema: persons(name,phone)
  In XML <persons>, <name>, <phone> are part of the data, and are repeated many times
Consequence: XML is much more flexible
XML = semistructured data

Slide 24

Semi-structured Data Explained
Missing attributes:
Repeated attributes:
<person> <name>John</name> <phone>1234</phone> </person>
<person> <name>Joe</name> </person>    <-- no phone!
<person> <name>Mary</name> <phone>2345</phone> <phone>3456</phone> </person>    <-- two phones!
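A short Python sketch shows what missing and repeated attributes mean once the data is forced into a flat persons(name, phone) relation. The enclosing `<persons>` element and the function name `to_rows` are assumptions added for the example.

```python
import xml.etree.ElementTree as ET

# The three person elements from the slide, wrapped in an assumed root.
doc = """
<persons>
  <person><name>John</name><phone>1234</phone></person>
  <person><name>Joe</name></person>
  <person><name>Mary</name><phone>2345</phone><phone>3456</phone></person>
</persons>
"""

def to_rows(root):
    """Flatten person elements into (name, phone) tuples."""
    rows = []
    for person in root.findall("person"):
        name = person.findtext("name")
        phones = [p.text for p in person.findall("phone")]
        if not phones:
            rows.append((name, None))                  # missing attribute -> NULL
        else:
            rows.extend((name, ph) for ph in phones)   # repeated -> one row each
    return rows

rows = to_rows(ET.fromstring(doc.strip()))
```

The XML holds all three persons without any special handling, while the flat relation needs a null for Joe and two rows for Mary.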

Slide 25

Semistructured Data Explained
Attributes with different types in different objects
Nested collections (no 1NF)
Heterogeneous collections:
  <db> contains both <book>s and <publisher>s
<person> <name> <first>John</first> <last>Smith</last> </name> <phone>1234</phone> </person>    <-- structured name!

Slide 26

XML Data vs. E/R, ODL, Relational
Q: is XML better or worse?
A: it serves different purposes
E/R, ODL, Relational models:
  For centralized processing, when we control the data
XML:
  Data sharing between different systems
  we do not have control over the entire data
  E.g. on the Web
Do NOT use XML to model your data! Use E/R, ODL, or relational.

Slide 27

Exporting Relational Data to XML
Product(pid, name, weight)
Company(cid, name, address)
Makes(pid, cid, price)
E/R diagram: product, makes, company

Slide 28

Export data grouped by companies
<db>
  <company>
    <name> GizmoWorks </name>
    <address> Tacoma </address>
    <product> <name> gizmo </name> <price> 19.99 </price> </product>
    <product> … </product>
    …
  </company>
  <company>
    <name> Bang </name>
    <address> Kirkland </address>
    <product> <name> gizmo </name> <price> 22.99 </price> </product>
    …
  </company>
  …
</db>
Redundant representation of products
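The grouped export can be sketched in Python over the Product, Company and Makes relations. The ids, the weight value, and the choice of a single product made by both companies are invented sample data; `export_by_company` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed sample tuples for the three relations.
product = {1: ("gizmo", 10)}                         # pid -> (name, weight)
company = {1: ("GizmoWorks", "Tacoma"),              # cid -> (name, address)
           2: ("Bang", "Kirkland")}
makes = [(1, 1, 19.99), (1, 2, 22.99)]               # (pid, cid, price)

def export_by_company():
    """Build a <db> tree with products nested under the company that makes them."""
    db = ET.Element("db")
    for cid, (cname, address) in company.items():
        c = ET.SubElement(db, "company")
        ET.SubElement(c, "name").text = cname
        ET.SubElement(c, "address").text = address
        for pid, made_by, price in makes:
            if made_by == cid:
                p = ET.SubElement(c, "product")
                ET.SubElement(p, "name").text = product[pid][0]
                ET.SubElement(p, "price").text = f"{price:.2f}"
    return db

xml_text = ET.tostring(export_by_company(), encoding="unicode")
```

The one product tuple ends up serialized twice, once under each company, which is the redundancy the slide points out.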

Slide 29

The DTD
<!ELEMENT db (company*)>
<!ELE
