Alexei Klimentov, Brookhaven National Laboratory

Presentation Transcript

Slide 1

XXII International Symposium on Nuclear Electronics and Computing, Varna, Sep 6-13, 2009
ATLAS Distributed Computing Model, Data Management, Production System, Distributed Analysis, Information System, Monitoring
Alexei Klimentov, Brookhaven National Laboratory

Slide 2

Introduction
The title that Vladimir gave me is impossible to cover in 20 minutes. I'll discuss the Distributed Computing components, but I am surely biased, as any Operations person is.

Slide 3

ATLAS Collaboration
6 Continents, 37 Countries, 169 Institutions, 2800 Physicists, 700 Students, >1000 Technical and support staff
Albany, Alberta, NIKHEF Amsterdam, Ankara, LAPP Annecy, Argonne NL, Arizona, UT Arlington, Athens, NTU Athens, Baku, IFAE Barcelona, Belgrade, Bergen, Berkeley LBL and UC, HU Berlin, Bern, Birmingham, Bogotá, Bologna, Bonn, Boston, Brandeis, Bratislava/SAS Kosice, Brookhaven NL, Buenos Aires, Bucharest, Cambridge, Carleton, Casablanca/Rabat, CERN, Chinese Cluster, Chicago, Chilean Cluster (Santiago+Valparaiso), Clermont-Ferrand, Columbia, NBI Copenhagen, Cosenza, AGH UST Cracow, IFJ PAN Cracow, DESY, Dortmund, TU Dresden, JINR Dubna, Duke, Frascati, Freiburg, Geneva, Genoa, Giessen, Glasgow, Göttingen, LPSC Grenoble, Technion Haifa, Hampton, Harvard, Heidelberg, Hiroshima, Hiroshima IT, Indiana, Innsbruck, Iowa SU, Irvine UC, Istanbul Bogazici, KEK, Kobe, Kyoto, Kyoto UE, Lancaster, UN La Plata, Lecce, Lisbon LIP, Liverpool, Ljubljana, QMW London, RHBNC London, UC London, Lund, UA Madrid, Mainz, Manchester, Mannheim, CPPM Marseille, Massachusetts, MIT, Melbourne, Michigan, Michigan SU, Milano, Minsk NAS, Minsk NCPHEP, Montreal, McGill Montreal, FIAN Moscow, ITEP Moscow, MEPhI Moscow, MSU Moscow, Munich LMU, MPI Munich, Nagasaki IAS, Nagoya, Naples, New Mexico, New York, Nijmegen, BINP Novosibirsk, Ohio SU, Okayama, Oklahoma, Oklahoma SU, Oregon, LAL Orsay, Osaka, Oslo, Oxford, Paris VI and VII, Pavia, Pennsylvania, Pisa, Pittsburgh, CAS Prague, CU Prague, TU Prague, IHEP Protvino, Regina, Ritsumeikan, UFRJ Rio de Janeiro, Rome I, Rome II, Rome III, Rutherford Appleton Laboratory, DAPNIA Saclay, Santa Cruz UC, Sheffield, Shinshu, Siegen, Simon Fraser Burnaby, SLAC, Southern Methodist Dallas, PNPI St.Petersburg, Stockholm, KTH Stockholm, Stony Brook, Sydney, AS Taipei, Tbilisi, Tel Aviv, Thessaloniki, Tokyo ICEPP, Tokyo MU, Toronto, TRIUMF, Tsukuba, Tufts, Udine/ICTP, Uppsala, Urbana UI, Valencia, UBC Vancouver, Victoria, Washington, Weizmann Rehovot, FH Wiener Neustadt, Wisconsin, Wuppertal, Yale, Yerevan

Slide 4

Necessity of Distributed Computing?
ATLAS will collect RAW data at 320 MB/s for 50k seconds/day and ~100 days/year. RAW data: 1.6 PB/year.
Processing (and re-processing) these events will require ~10k CPUs full time in the first year of data taking, and many more later on as data accumulate.
Reconstructed events will also be large, as people want to study detector performance as well as do physics analysis using the output data. ESD data: 1.0 PB/year, AOD data: 0.1 PB/year.
At least 10k CPUs are also needed for continuous simulation production of at least 30% of the real data rate and for analysis.
There is no way to concentrate all the required computing power and storage capacity at CERN. The LEP model will not scale to this level.
The idea of distributed computing, and later of the computing grid, became fashionable at the turn of the century and looked promising when applied to HEP experiments' computing needs.
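
The RAW volume quoted on this slide follows directly from the rate and the running time. A minimal check of the arithmetic, assuming decimal units (1 PB = 10^9 MB); the function and constant names are illustrative only:

```python
# Back-of-the-envelope check of the RAW data volume quoted on the slide.
RAW_RATE_MB_PER_S = 320         # from the slide
LIVE_SECONDS_PER_DAY = 50_000   # from the slide ("50k seconds/day")
RUN_DAYS_PER_YEAR = 100         # from the slide ("~100 days/year")
MB_PER_PB = 1_000_000_000       # assumption: decimal units, 1 PB = 10^9 MB


def yearly_volume_pb(rate_mb_per_s, seconds_per_day, days_per_year):
    """Yearly data volume in PB for a given rate and live time."""
    return rate_mb_per_s * seconds_per_day * days_per_year / MB_PER_PB


raw_pb = yearly_volume_pb(RAW_RATE_MB_PER_S, LIVE_SECONDS_PER_DAY, RUN_DAYS_PER_YEAR)
print(f"RAW: {raw_pb:.1f} PB/year")  # RAW: 1.6 PB/year
```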

Slide 5

Computing Model: Main Operations
Copy RAW data to CERN Castor Mass Storage System tape for archival
Copy RAW data to Tier-1s for storage and reprocessing
Run first-pass calibration/alignment (within 24 hrs)
Run first-pass reconstruction (within 48 hrs)
Distribute reconstruction output (ESDs, AODs & TAGs) to Tier-1s
Archive a fraction of RAW data
(Re)run calibration and alignment
Re-process data with better calib/align and/or algo
Distribute derived data to Tier-2s
Run HITS reconstruction and large-scale event selection and analysis jobs
Run MC simulation
Keep AOD and TAG for the analysis
Run analysis jobs (36 Tier-2s, ~80 sites)
Incomplete list of Data Formats:
ESD : Event Summary Data
AOD : Analysis Object Data
DPD : Derived Physics Data
TAG : event meta-data
Calibration Tier-2s: 5 sites in Europe and US
Tier-3s: contribute to MC simulation and user analysis, O(100) sites worldwide

Slide 6

ATLAS Grid Sites and Data Distribution
3 Grids, 10 Tier-1s, ~80 Tier-2(3)s
A Tier-1 and its associated Tier-ns form a cloud. ATLAS clouds have from 2 to 15 sites. We also have T1-T1 associations.
ATLAS Tier-1s Data Shares:
IN2P3 (MoU & CM): RAW, ESD 15%; AOD, DPD, TAG 100%
BNL (MoU & CM): RAW 24%; ESD, AOD, DPD, TAG 100%
FZK (MoU & CM): RAW, ESD 10%; AOD, DPD, TAG 100%
[Figure labels: Tier-0; Tier-1: ASGC, IN2P3, BNL, FZK; MWT2, SWT2, AGLT2, SLAC, NET2]
[Figure: Input Rates Estimation (Tier-1s): data export from CERN, reprocessed and MC data distribution]
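
As a rough illustration of how the RAW shares translate into Tier-1 input rates, the sketch below multiplies each share by the nominal 320 MB/s RAW rate quoted earlier in the talk. The shares are read from the flattened table on this slide, and that reading is an assumption; the slide's own "Input Rates Estimation" figure is not recoverable and may include other data types.

```python
# Nominal RAW rate out of CERN, from the "Necessity of Distributed Computing?" slide.
NOMINAL_RAW_RATE_MB_PER_S = 320

# RAW shares in percent; assumed reading of the flattened table on this slide.
raw_share_percent = {"IN2P3": 15, "BNL": 24, "FZK": 10}


def raw_input_rate(share_percent, total_rate=NOMINAL_RAW_RATE_MB_PER_S):
    """RAW input rate (MB/s) at a Tier-1 holding the given share of RAW data."""
    return share_percent * total_rate / 100


rates = {site: raw_input_rate(share) for site, share in raw_share_percent.items()}
for site, rate in rates.items():
    print(f"{site}: {rate:.1f} MB/s of RAW")
```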

Slide 7

Ubiquitous Wide Area Network Bandwidth
The first Computing TDRs assumed insufficient network bandwidth
The MONARC project proposed the multi-Tier model because of this
Today network bandwidth is our least problem
But we still have the Tier model in the LHC experiments
The network is not yet ideal in all parts of the world (last mile)
LHCOPN provides an excellent backbone for the Tier-0 and Tier-1s
Each LHC experiment has adopted it differently
K. Bos, "Status and Prospects of The LHC Experiments Computing", CHEP'09

Slide 8

Distributed Computing Components
The ATLAS Grid architecture is based on:
Distributed Data Management (DDM)
Distributed Production System (ProdSys, PanDA)
Distributed Analysis (DA), GANGA, PanDA
Monitoring
Grid Information System
Accounting
Networking
Databases

Slide 9

ATLAS Distributed Data Management. 1/2
The second generation of the ATLAS DDM system (DQ2)
DQ2 is built on top of Grid data transfer tools
Moved to a dataset-based approach
Dataset: an aggregation of files plus associated DDM metadata
The dataset is the unit of storage and replication
Automatic data transfer mechanisms using distributed site services: subscription system, notification system
Technicalities:
Global services: dataset repository, dataset location catalog (logical file names only, no global physical file catalog)
Local site services (Local File Catalog, LFC): provide the logical-to-physical file name mapping
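
The dataset, subscription and catalog concepts on this slide can be pictured with a toy model. This is a minimal sketch, not DQ2 code: all class, function, site and file names are hypothetical, and the "transfer" is only a catalog registration.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """An aggregation of files plus DDM metadata (the unit of replication)."""
    name: str
    lfns: list                       # logical file names only
    metadata: dict = field(default_factory=dict)


class LocalFileCatalog:
    """Per-site catalog: logical file name -> physical file name."""

    def __init__(self, site, storage_prefix):
        self.site = site
        self.storage_prefix = storage_prefix  # hypothetical storage URL prefix
        self._pfn_by_lfn = {}

    def register(self, lfn):
        self._pfn_by_lfn[lfn] = f"{self.storage_prefix}/{lfn}"

    def has(self, lfn):
        return lfn in self._pfn_by_lfn

    def resolve(self, lfn):
        return self._pfn_by_lfn[lfn]


def fulfil_subscription(dataset, destination, location_catalog):
    """Toy site service: bring every file of a subscribed dataset to a site."""
    for lfn in dataset.lfns:
        if not destination.has(lfn):
            destination.register(lfn)   # stands in for a real Grid transfer
    # The global location catalog records sites, not physical file names.
    location_catalog.setdefault(dataset.name, set()).add(destination.site)


dataset = Dataset("example.dataset", ["file_1.root", "file_2.root"])
location_catalog = {}
site_a = LocalFileCatalog("SITE_A", "srm://site-a.example/atlas")
fulfil_subscription(dataset, site_a, location_catalog)
print(site_a.resolve("file_1.root"))
print(location_catalog)
```

Keeping physical file names only in the per-site catalog mirrors the slide's point that the global services hold logical names and dataset locations only.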

Slide 10

ATLAS Distributed Data Management. 2/2
[Figure: Data export from CERN to Tiers, daily average MB/s, STEP09]
[Figure: Reprocessed datasets replication between Tier-1s, Δτ [hours] = T_last_file_transfer - T_subscription, vs. days of running]
One dataset wasn't replicated after 3 days
99% of data were transferred within 4 hours
Latency in reprocessing or site issue
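
The replication latency defined on this slide, Δτ = T_last_file_transfer - T_subscription, is straightforward to compute per dataset. The sketch below uses made-up timestamps purely for illustration; they are not the STEP09 measurements.

```python
from datetime import datetime, timedelta


def replication_latency_hours(t_subscription, t_last_file_transfer):
    """Delta-tau in hours: time from subscription to the last file transfer."""
    return (t_last_file_transfer - t_subscription).total_seconds() / 3600.0


def fraction_within(latencies_hours, limit_hours):
    """Fraction of datasets fully replicated within the given limit."""
    return sum(1 for x in latencies_hours if x <= limit_hours) / len(latencies_hours)


t0 = datetime(2009, 6, 1, 12, 0)
# (subscription time, last file transfer time); illustrative values only
samples = [
    (t0, t0 + timedelta(hours=1)),
    (t0, t0 + timedelta(hours=3, minutes=30)),
    (t0, t0 + timedelta(hours=2)),
    (t0, t0 + timedelta(days=3, hours=6)),   # a straggler, more than 3 days
]
latencies = [replication_latency_hours(sub, last) for sub, last in samples]
share_within_4h = fraction_within(latencies, 4.0)
print(latencies, share_within_4h)
```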

Slide 11

ATLAS Production System 1/2
Manages ATLAS simulation (full chain) and reprocessing jobs on the wLCG
Task request interface to define a related group of jobs
Input: DQ2 dataset(s) (with the exception of some event generation)
Output: DQ2 dataset(s) (the jobs are done only when the output is at the Tier-1)
Due to temporary site problems, jobs are allowed several attempts
Job definition and attempt state are stored in the Production Database (Oracle DB)
Jobs are supervised by the ATLAS Production System
Consists of many components:
DDM/DQ2 for data management
PanDA task request interface and job definitions
PanDA for job supervision
ATLAS Dashboard and PanDA monitor for monitoring
Grid middlewares
ATLAS software
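
Two rules from this slide, "jobs are allowed several attempts" and "jobs are done only when the output is at the Tier-1", can be sketched as a small retry loop. This is an illustrative toy, not PanDA or ProdSys code: the names, the state strings and the limit of three attempts are all assumptions.

```python
from dataclasses import dataclass, field

MAX_ATTEMPTS = 3  # assumption: the slide says "several attempts", no number given


@dataclass
class Job:
    job_id: int
    state: str = "defined"
    attempts: list = field(default_factory=list)  # one recorded state per attempt


def run_with_retries(job, execute, output_at_tier1, max_attempts=MAX_ATTEMPTS):
    """Retry a job until it runs AND its output has reached the Tier-1."""
    for attempt in range(1, max_attempts + 1):
        if not execute(job, attempt):
            job.attempts.append("failed")          # e.g. a temporary site problem
            continue
        if not output_at_tier1(job):
            job.attempts.append("output_missing")  # ran, but output not yet at Tier-1
            continue
        job.attempts.append("finished")
        job.state = "done"
        return job
    job.state = "failed"
    return job


def flaky_site(job, attempt):
    """Hypothetical executor that fails on the first attempt only."""
    return attempt >= 2


job = run_with_retries(Job(job_id=1), flaky_site, lambda j: True)
print(job.state, job.attempts)
```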

Slide 12

ATLAS Production System 2/2
Job brokering is done by the PanDA service (bamboo) according to data and site availability
[Diagram labels: Task request interface; Tasks input: DQ2 datasets; Task states; Tasks output: DQ2 datasets; Production Database: job definition, job states, metadata; 3 Grids / 10 Clouds / 90+ production sites; Monitor sites, tasks, jobs]
A. Read, Mar09
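
Brokering "according to data and site availability" can be pictured as a filter over candidate sites. The sketch below is a toy under stated assumptions: the site records, the replica map and the tie-break on the shortest queue are hypothetical and are not taken from PanDA.

```python
def broker(input_dataset, sites, replicas):
    """Pick an online site that holds the input dataset.

    Assumption: among eligible sites, the one with the fewest queued jobs wins.
    Returns the site name, or None when no site qualifies.
    """
    holders = replicas.get(input_dataset, set())
    candidates = [s for s in sites if s["online"] and s["name"] in holders]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s["queued"])["name"]


sites = [
    {"name": "SITE_A", "online": True, "queued": 120},
    {"name": "SITE_B", "online": False, "queued": 5},   # has the data but is offline
    {"name": "SITE_C", "online": True, "queued": 40},
    {"name": "SITE_D", "online": True, "queued": 1},    # online but lacks the data
]
replicas = {"example.dataset": {"SITE_A", "SITE_B", "SITE_C"}}

chosen = broker("example.dataset", sites, replicas)
print(chosen)  # SITE_C
```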

Slide 13

Data Processing Cycle
Data processing at CERN (Tier-0 processing)
First-pass processing of the primary event stream
The derived datasets (ESD, AOD, DPD, TAG) are distributed from the Tier-0 to the Tier-1s
RAW data (received from the Event Filter Farm) are exported within 24h. This is why first-pass processing can also be done by Tier-1s (though this facility was not used during LHC beam and cosmic ray runs)
Data reprocessing at Tier-1s
10 Tier-1 centers worldwide. Each takes a subset of RAW data (Tier-1 shares from 5% to 25%). ATLAS production facilities at CERN can be used in case of emergency.
Each Tier-1 reprocessed its share of RAW data. The derived datasets are distributed widely.
See P. Nevski's talk, NEC2009, LHC Computing

Slide 14

ATLAS Data Simulation and Reprocessing
[Figure: Running jobs, Sep08-Sep09; Reprocessing]
Production System in continuous operation
10 clouds use LFC as file catalog and PanDA as job executor
CPUs are underutilized on average; peak rate 33k jobs/day
ProdSys can produce 100 TB/week of MC
Average walltime efficiency is over 90%
The system does data simulation and data reprocessing

Slide 15

ATLAS Distributed Analysis
ATLAS jobs go to the data (J. Elmsheuser, Sep09)
Probably the most important area at this point
It depends on a functional data management and job management system
Two widely used distributed analysis tools (Ganga and pathena)
They capture the great majority of users
We expect the usage to grow substantially in the preparation for, and especially during, the 2009/10 run
Present/traditional use cases:
AOD/DPD analysis is clearly very important
But also runs over selected RAW (for detector debugging, studies, etc.)

Slide 16

ATLAS Grid Information System (AGIS)
The overall purpose of the ATLAS Grid Information System is to store and to expose static, dynamic and configuration parameters required by ATLAS Distributed Computing (ADC) applications.
AGIS is a database-oriented system.
The first AGIS proposal came from G. Poulard. The pioneering work of R. Pezoa and R.Roc