Information and the Grid: From Databases to Global Knowledge Communities

Ian Foster, Argonne National Laboratory and University of Chicago
Keynote talk, 15th Intl Conf on Scientific and Statistical Database Management, Boston, July 11, 2003
Image credit: Electronic Visualization Lab, UIC

My Presentation
1) Data integration as a new opportunity
  - Driven by advances in technology & science
  - The need to discover, access, explore, analyze diverse distributed data sources
  - Grid technologies as a substrate for essential management functions
2) Science as collaborative workflow
  - The need to organize, archive, reuse, explain, and schedule scientific workflows
  - Virtual data as a unifying concept

It's Easy to Forget How Different 2003 is From 1993
- Enormous quantities of data: Petabytes
  - For an increasing number of communities, the gating step is not collection but analysis
- Ubiquitous Internet: 100+ million hosts
  - Collaboration & resource sharing the norm
- Ultra-high-speed networks: 10+ Gb/s
  - Global optical networks
- Enormous quantities of computing: 100+ Top/s
  - Moore's law gives us all supercomputers

Consequence: The Emergence of Global Knowledge Communities
- Teams organized around common goals
  - Communities: "Virtual organizations"
- With diverse membership & capabilities
  - Heterogeneity is a strength, not a weakness
- And geographic and political distribution
  - No location/organization possesses all required skills and resources
- Must adapt as a function of the situation
  - Adjust membership, reallocate responsibilities, renegotiate resources

The Emergence of Global Knowledge Communities

Global Knowledge Communities Often Driven by Data: E.g., Astronomy
- No. & sizes of data sets as of mid-2002, grouped by wavelength
- 12 waveband coverage of large areas of the sky
- Total about 200 TB data
- Largest catalogs near 1B objects
- Data and images courtesy Alex Szalay, Johns Hopkins

Data Integration as a Fundamental Challenge
- Many sources of data, services, computation
- Security & policy must underlie access & management decisions
- Discovery: registries (R) organize services of interest to a community
- Access: resource management (RM) is needed to ensure progress & arbitrate competing demands
- Data integration activities may require access to, & exploration/analysis of, data at many locations
- Exploration & analysis may involve complex, multi-step workflows
[Diagram labels: R, RM, Policy service, Security service]

Performance Requirements Demand Whole-System Management
- Assume
  - Remote data at 1 GB/s
  - 10 local bytes per remote byte
  - 100 operations per byte
- >1 GByte/s achievable today (FAST, 7 streams, LA to Geneva)
[Diagram labels: Local; Network; Remote data; Wide area link (end-to-end switched lambda?): 1 GB/s; Parallel I/O: 10 GB/s; Parallel computation: 1000 Gop/s]
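The figures on this slide follow from simple arithmetic, which the short sketch below reproduces. It assumes decimal units (1 GB = 10^9 bytes) and that the 100 operations per byte apply to the locally read bytes, since that reading yields the slide's 1000 Gop/s; the function name `required_rates` is illustrative and not from the talk.

```python
GB = 10**9  # decimal gigabyte, an assumption made for round numbers


def required_rates(remote_bytes_per_s, local_per_remote, ops_per_byte):
    """Return (local I/O rate in bytes/s, compute rate in ops/s)."""
    # Each remote byte is matched against this many local bytes.
    local_io = remote_bytes_per_s * local_per_remote
    # Assumption: operations are counted per local byte processed.
    compute = local_io * ops_per_byte
    return local_io, compute


local_io, compute = required_rates(1 * GB, 10, 100)
print(local_io // GB, "GB/s parallel I/O")
print(compute // GB, "Gop/s parallel computation")
```

A 1 GB/s wide-area stream therefore implies roughly 10 GB/s of local parallel I/O and about 1000 Gop/s of computation, which is why network, storage, and compute have to be provisioned together.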

Data Integration: Key Challenges
- Of course, familiar issues: data organization, schema definition/mediation, etc., etc.
- But also new challenges relating to dynamic, distributed communities
  - Establishment, negotiation, management, & evolution of multi-organizational federations
- And to the sheer number of resources, speed of networks, and volume of data
  - Coordination, management, provisioning, & monitoring of workflows & required resources

Enter Grid Technologies
- Infrastructure ("middleware") for establishing, managing, and evolving multi-organizational federations
  - Dynamic, autonomous, domain independent
  - On-demand, ubiquitous access to computing, data, and services
- Mechanisms for creating and managing workflow within such federations
  - New capabilities constructed dynamically and transparently from distributed services
  - Service-oriented, virtualization

The Emergence of Open Grid Standards
[Figure: timeline from 1990 to 2010 (ticks at 1990, 1995, 2000, 2005, 2010), rising with increased functionality and standardization]
- Custom solutions
- Globus Toolkit: Internet standards; de facto standard, single implementation
- Open Grid Services Arch: Web services, etc.; real standards, multiple implementations
- Managed shared virtual systems: computer science research

OGSA Structure
- A standard substrate: the Grid service
  - Standard interfaces and behaviors that address key distributed system issues: naming, service state, lifetime, notification
  - A Grid service is a Web service
- … supports standard service specifications
  - Agreement, data access & integration, workflow, security, policy, diagnostics, etc.
  - Target of current & planned GGF efforts
- … and arbitrary application-specific services based on these & other definitions

Open Grid Services Infrastructure
- Client introspection: What port types? What policy? What state?
- GridService (required)
- Other standard interfaces: factory, notification, collections
- Naming: Grid Service Handle, resolved to a Grid Service Reference via handle resolution
- Service data elements
- Lifetime management
  - Explicit destruction
  - Soft-state lifetime
- Data access
- Implementation
- Hosting environment/runtime ("C", J2EE, .NET, …)
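The two lifetime mechanisms named on this slide, explicit destruction and soft-state lifetime, can be illustrated with a small in-memory sketch. This is not the OGSI interface itself: the class `SoftStateService`, its method names, and the example handle are hypothetical stand-ins, and time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class SoftStateService:
    """Hypothetical service instance with a soft-state lifetime."""
    handle: str
    termination_time: float
    service_data: dict = field(default_factory=dict)
    destroyed: bool = False

    def find_service_data(self, name):
        # Introspection: look up a named service data element.
        return self.service_data.get(name)

    def request_termination_after(self, when):
        # Keepalive: move the termination time later, never earlier.
        self.termination_time = max(self.termination_time, when)
        return self.termination_time

    def destroy(self):
        # Explicit destruction, independent of the termination time.
        self.destroyed = True

    def is_alive(self, now):
        # Without keepalives the instance simply expires.
        return not self.destroyed and now < self.termination_time


svc = SoftStateService(
    handle="example-handle/transfer/1",  # made-up handle
    termination_time=60.0,
    service_data={"portTypes": ["GridService", "NotificationSource"]},
)
print(svc.find_service_data("portTypes"))
print(svc.is_alive(30.0), svc.is_alive(90.0))
svc.request_termination_after(120.0)
print(svc.is_alive(90.0))
```

The point of the soft-state design is that a client that crashes or loses interest needs no cleanup protocol: state it created is reclaimed once keepalives stop arriving.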

Open Grid Services Infrastructure (OGSI)
- GWD-R (draft-ggf-ogsi-gridservice-23), February 17, 2003
- Editors: S. Tuecke, ANL; K. Czajkowski, USC/ISI; I. Foster, ANL; J. Frey, IBM; S. Graham, IBM; C. Kesselman, USC/ISI; D. Snelling, Fujitsu Labs; P. Vanderbilt, NASA
- "The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration", Foster, Kesselman, Nick, Tuecke, 2002

Example: Reliable File Transfer Service
- Clients request and manage file transfer operations
- Clients query &/or subscribe to service data
- Interfaces: File Transfer, Grid Service, Notf'n Source, Policy
- Internal state exposed as service data elements: Pending, Performance, Policy, Faults
- Internal components: Fault Monitor, Perf. Monitor
- The service drives the data transfer operations

OGSA and Data Integration
- OGSI provides key enabling mechanisms for distributed data integration
  - Introspect on distributed system elements
  - Create and manage distributed state
- We need more than OGSI, of course, e.g.
  - WS-Agreement: negotiate agreements between service provider and consumer
  - OGSA-DAI: Data Access and Integration
  - WS-Management: service management
  - Security and policy

Infrastructure Architecture
[Figure: layered architecture, top to bottom]
- Data Intensive X-ology Researchers
- Data Intensive Applications for X-ology Research
- Simulation, Analysis & Integration Technology for X-ology
- Generic Virtual Data Access and Integration Layer: Job Submission, Brokering, Workflow, Structured Data Integration, Registry, Banking, Authorisation, Data Transport, Resource Usage, Transformation, Structured Data Access
- OGSA
- OGSI: Interface to Grid Infrastructure
- Compute, Data & Storage Resources (distributed); Structured Data: Relational, XML, Semi-structured
- Virtual Integration Architecture
Slide courtesy Malcolm Atkinson, UK eScience Center

Data as Service: OGSA Data Access & Integration
- Service-oriented treatment of data appears to have significant advantages
  - Leverage OGSI introspection, lifetime, etc.
  - Compatibility with Web services
- Standard service interfaces being defined
  - Service data: e.g., schema
  - Derive new data services from old (views)
  - Externalize to e.g. file/database format
  - Perform queries or other operations

Data Access & Integration Services
[Diagram components: Client, Registry, Factory, Grid Data Service (GDS), XML/Relational database. Legend: SOAP/HTTP, service creation, API interactions]
1a. Client sends request to Registry for sources of data about "x"
1b. Registry responds with Factory handle
2a. Client sends request to Factory for access to database
2b. Factory creates GridDataService to manage access
2c. Factory returns handle of GDS to client
3a. Client queries GDS with XPath, SQL, etc.
3b. GDS interacts with database
3c. Results of query returned to client as XML
Slide courtesy Malcolm Atkinson, UK eScience Center
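The three-phase interaction on this slide (look up a factory in a registry, have the factory create a data service, then query that service) can be mimicked locally. The sketch below is not OGSA-DAI: the classes `Registry`, `Factory`, and `GridDataService`, the table `objects`, and its rows are all hypothetical. It passes Python object references where the real system returns service handles over SOAP/HTTP, and it uses an in-memory SQLite database as the backing store.

```python
import sqlite3
import xml.etree.ElementTree as ET


class GridDataService:
    """Hypothetical data service that mediates access to one database."""

    def __init__(self, conn):
        self._conn = conn

    def query(self, sql, params=()):
        # 3b: interact with the database; 3c: return the results as XML.
        cur = self._conn.execute(sql, params)
        columns = [d[0] for d in cur.description]
        root = ET.Element("results")
        for row in cur:
            row_el = ET.SubElement(root, "row")
            for col, value in zip(columns, row):
                ET.SubElement(row_el, col).text = str(value)
        return ET.tostring(root, encoding="unicode")


class Factory:
    def __init__(self, connect):
        self._connect = connect

    def create_service(self):
        # 2b: create a data service instance; 2c: return it to the caller.
        return GridDataService(self._connect())


class Registry:
    def __init__(self):
        self._factories = {}

    def register(self, topic, factory):
        self._factories[topic] = factory

    def find(self, topic):
        # 1b: respond with the factory registered for this topic.
        return self._factories[topic]


def make_demo_db():
    # Made-up table and rows, purely for illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE objects (name TEXT, mag REAL)")
    conn.executemany("INSERT INTO objects VALUES (?, ?)",
                     [("a", 12.5), ("b", 17.0)])
    return conn


registry = Registry()
registry.register("x", Factory(make_demo_db))

factory = registry.find("x")        # steps 1a, 1b
gds = factory.create_service()      # steps 2a, 2b, 2c
xml_result = gds.query("SELECT name FROM objects WHERE mag < ?", (15,))  # 3a
print(xml_result)
```

Separating the factory from the data service means each client gets its own service instance, whose lifetime and state can be managed independently of the underlying database.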

Globus Toolkit v3 (GT3): Open Source OGSA Technology
- Implements and builds on OGSI interfaces
- Supports primary GT2 interfaces
  - Public key authentication
  - Scalable service discovery
  - Secure, reliable resource access
  - High-performance data movement (GridFTP)
- Numerous new services included or planned
  - SLA negotiation, service registry, community authorization, data access & integration, …
- Rapidly growing adoption and contributions
  - E.g., OGSA-DAI from U.K. eScience program

My Presentation
1) Data integration as a new opportunity
  - Driven by advances in technology & science
  - The need to discover, access, explore, analyze diverse distributed data sources
  - Grid technologies as a substrate for essential management functions
2) Science as collaborative workflow
  - The need to organize, archive, reuse, explain, & schedule scientific workflows
  - Virtual data as a unifying concept

Science as Workflow
- Data integration = the derivation of new data from old, via coordinated computation(s)
  - May be computationally demanding
  - The workflows used to achieve integration are often valuable artifacts in their own right
- Thus we must be concerned with how we
  - Build workflows
  - Share and reuse workflows
  - Explain workflows
  - Schedule workflows
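One way to make "explain" and "schedule" concrete is to record each derived dataset together with the program and inputs that produced it. The sketch below does this with a plain dictionary; the dataset and program names are invented for illustration, and the approach is a minimal stand-in rather than the system described in the talk. It relies on `graphlib`, which is in the standard library from Python 3.9.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical derivation records: output -> (program, input datasets).
derivations = {
    "calibrated": ("calibrate", ["raw"]),
    "catalog": ("extract", ["calibrated"]),
    "clusters": ("find_clusters", ["catalog", "calibrated"]),
}


def explain(target, derivations):
    """Return every dataset the target was derived from, transitively."""
    seen = set()
    stack = [target]
    while stack:
        dataset = stack.pop()
        if dataset in derivations:
            for source in derivations[dataset][1]:
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
    return seen


def schedule(target, derivations):
    """Return (dataset, program) steps in an order that respects inputs."""
    needed = explain(target, derivations) | {target}
    graph = {d: set(derivations[d][1]) for d in needed if d in derivations}
    order = TopologicalSorter(graph).static_order()
    # Primary data such as "raw" has no producing program, so skip it.
    return [(d, derivations[d][0]) for d in order if d in derivations]


print(sorted(explain("clusters", derivations)))
for dataset, program in schedule("clusters", derivations):
    print(program, "->", dataset)
```

Because the same records answer both "where did this come from?" and "what must run, in what order, to produce it?", the workflow description becomes a reusable artifact rather than a one-off script.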

Sloan Digital Sky Survey Production System

Virtual Data Concept
- Capture and manage information about relationships among
  - Data (of widely varying representations)
  - Programs (& their execution needs)
  - Computations (& execution environments)
- Apply this information to, e.g.
  - Discovery: Data and program discovery
  - Workflow: Structured paradigm for organizing, locating, specifying, & requesting data
  - Explanation: provenance
  - Planning and sc