The Medical Information System - MedISys. eHealth 2009, Second International ICST Conference on Electronic Healthcare for the 21st century


Presentation Transcript

Slide 1

The Medical Information System - MedISys eHealth 2009 Second International ICST Conference on Electronic Healthcare for the 21st century September 23-25, 2009 - Istanbul, Turkey Erik van der Goot & the OPTIMA group ( OPensource Text Information Mining and Analysis ) European Commission – Joint Research Center (JRC) Institute for the Protection and Security of the Citizen (IPSC)

Slide 2

JRC - who

Slide 3

JRC - where

Slide 4

MedISys - Overview
Objective:
  Provide open source information gathering and analysis for surveillance and epidemiology
  Replace manual scanning of multiple newspapers and web portals
  Support national and international Public Health (PH) organisations in monitoring issues of Public Health concern (e.g. CBRN)
Functionality:
  Gather, filter, classify, extract and aggregate health-related information
  Monitor trends, detect breaking news
  Visualise analysis results
  Alert users
  Allows customised views
  In combination with the RNS tool, allows manual moderation

Slide 5

Background - History
  Based on JRC's Europe Media Monitor (EMM) technology (EMM live since 2002)
  On request/initiative of the EC's Directorate General for Health and Consumer Protection (DG SANCO)
  Password-protected service for Public Health bodies since 2005
  Public service since mid 2007 (limited functionality)

Slide 6

Background - Media Monitoring
EU Commission Media Monitoring (until 2001/2002)
  Traditional cut and paste for printed press only
  Monitoring of incoming news wires (e.g. Reuters, AFP)
  Simple keyword-based filtering of wires
  Manual selection of printed press items
  Human classification of items
Potential problems
  Not 'real time' for mainstream press: printed press typically once a day
  Limited coverage: not all media is printed
  Inaccurate and incomplete classification: subjective and limited number of categories
  Labour intensive and expensive: limited number of articles per analyst per day, requires topical knowledge and requires language knowledge

Slide 7

EMM History
New Challenges (as seen in 2002)
  Enlargement (+10 countries): more media, more languages
  More use of electronic publishing (media)
  Electronic distribution of results (web + mobile)
  Automatic alerting functions
New approach: EMM - a one-stop shop for Media Monitoring
  Facilitate (not replace) human Media Monitoring activities
  Extend monitoring beyond the traditional news wires (Internet). Improve coverage, number of languages, analysis.
  Apply automatic categorisation and analysis to all sources
  Provide new services like automatic email, SMS, mobile editions etc.
  Provide an editorial system to manage the information and produce newsletters etc.
Important: EMM is not Yet Another Internet Search Engine

Slide 8

EMM System Features
Automatic language recognition
  Based on continuously updated language-specific frequency tables
Automated information/entity extraction
  400,000 persons and organisations, based on a regularly updated list of entities, with many language-specific synonyms
Geotagging
  Based on a home-grown multilingual geo-data set, about 600,000 place name variants in most languages covered by EMM, mostly national, regional and provincial capitals
Enhanced Categorisation Engine
  Boolean combinations, proximity, wildcards
  Support for Arabic and similar (automatic noun prefix handling)
  Support for Chinese and similar (no whitespace)
Tonality/Sentiment
  Simple bag-of-words approach, ranging from very negative to very positive, corrected for long-term source bias; interesting for following reporting trends per category
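
The tonality item above (bag of words, corrected for long-term source bias) could be sketched roughly as follows. This is a minimal illustration, not EMM's implementation: the `POLARITY` lexicon, the averaging rule and the `SourceBias` helper are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical polarity lexicon; the real EMM word lists are not given in the slides.
POLARITY = {"outbreak": -2, "death": -2, "crisis": -1, "recovery": 1, "cure": 2}


def raw_tonality(text: str) -> float:
    """Average polarity of the lexicon words found in the text (0.0 if none)."""
    hits = [POLARITY[w] for w in text.lower().split() if w in POLARITY]
    return sum(hits) / len(hits) if hits else 0.0


class SourceBias:
    """Running mean of raw tonality per source, used as a long-term baseline."""

    def __init__(self) -> None:
        self._total = defaultdict(float)
        self._count = defaultdict(int)

    def update(self, source: str, score: float) -> None:
        self._total[source] += score
        self._count[source] += 1

    def baseline(self, source: str) -> float:
        n = self._count[source]
        return self._total[source] / n if n else 0.0


def corrected_tonality(text: str, source: str, bias: SourceBias) -> float:
    """Raw score minus the source's long-term mean (assumed correction rule)."""
    return raw_tonality(text) - bias.baseline(source)
```

With this correction, a source that is habitually negative only stands out when an article is more negative than its own average.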

Slide 9

… more features
Duplicate detection
Metadata classification
  Allows selection of articles based on any previously assigned metadata
Automated information linking
  Incremental topic-based clustering and story tracking, geolocation
  Incremental clustering every 10 minutes on the latest 4 hours' worth of news (Top Stories on front page)
Automatic detection of breaking news
  Cluster growth rate
  Flux of articles per category
Indexing
  Index full text and most metadata
Statistics/Trend analysis
  Quantitative analysis of reporting. Maintain simple count statistics.
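
As a rough illustration of breaking-news detection by cluster growth rate, the sketch below compares cluster sizes from two consecutive clustering runs. The function names and the `min_size` and `min_growth` thresholds are assumptions for the example; the slides do not state the actual rule.

```python
def growth_rate(previous_size: int, current_size: int) -> float:
    """Relative growth of a cluster between two consecutive clustering runs."""
    if previous_size <= 0:
        return float(current_size)  # new cluster: treat its size as its growth
    return (current_size - previous_size) / previous_size


def breaking_clusters(previous: dict, current: dict,
                      min_size: int = 5, min_growth: float = 1.0) -> list:
    """Return ids of clusters that are large enough and growing fast enough.

    `previous` and `current` map a cluster id to its article count.
    The thresholds are illustrative, not taken from the slides.
    """
    flagged = []
    for cluster_id, size in current.items():
        rate = growth_rate(previous.get(cluster_id, 0), size)
        if size >= min_size and rate >= min_growth:
            flagged.append(cluster_id)
    return sorted(flagged)
```

Requiring a minimum size as well as a minimum growth keeps tiny clusters (one article becoming two) from being flagged.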

Slide 10

… and more features
Event extraction
  Language-independent event grammars are used to parse clusters, with language-dependent resources filling the grammar slots
  Currently for 5 languages (en, fr, it, pt, ru); violent events, humanitarian events

Slide 11

Development timeline (2002, 2004, 2006, 2009)
  EMM/RNS
  Domain-specific application: MedISys
  Continuous development, new features
  NewsExplorer
  First version 2005
  EMM system upgrade
  Redesign based on EMM
  RNS upgrade

Slide 12

MedISys System Overview
  MedISys NewsBrief
  NewsDesk Service (a.k.a. RNS), Editorial Interface
  EMM Open Source Monitoring Engine

Slide 13

Problems to solve
Find relevant information
  Millions of new articles/websites/items/tweets published on the Internet every day
Deliver the information to the right user
  Allow for many (possibly overlapping) categories to address specific needs
Timely
  Right now if possible
In short: deliver targeted information to the right user in a timely manner

Slide 14

Approach
Wide coverage
  Many sources
  Local, regional, national and international coverage
  Many languages
  Multilinguality & cross-lingual information access
Fast coverage
  High-frequency monitoring of sites, some sites at frequent intervals
Overcome the information overload
  Categorisation, aggregation, duplicate identification, clustering
  Customisability of MedISys NewsBrief
  Search functions
  RNS tool for manual moderation and targeted dissemination

Slide 15

Input data
~2,200 sources (worldwide, but primary focus on Europe)
  ~4,000 HTML web pages + RSS feeds
  ~100 official medical sites
  ~20 commercial newswires
  Specialist pay-for sources (LexisMed)
24/7, near-continuous monitoring
~80,000 new articles/items per day
Converts dirty HTML with adverts, menus, HTML tags, 'related stories', etc. into a clean and standardised Unicode-encoded RSS format
Use RSS when available
Perform full content analysis
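
The HTML clean-up step above can be illustrated with a small sketch based on Python's standard `html.parser`. It covers only tag stripping and the skipping of a few elements that commonly hold menus or scripts; the `SKIP` set is an assumption, and the real conversion to RSS (including removal of adverts and 'related stories') is certainly more involved.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping elements that usually hold page chrome."""

    # Assumed list of elements to ignore; not taken from the slides.
    SKIP = {"script", "style", "nav", "aside", "footer"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def clean_html(html: str) -> str:
    """Return the visible text of an HTML fragment as a Unicode string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The cleaned text could then be wrapped in an RSS item; that step is omitted here.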

Slide 16

MedISys Screenshots

Slide 17

MedISys – Current subscribers and users include …
Supranational organisations
  Directorate General Health and Consumer Protection (SANCO)
  European Centre for Disease Control, Stockholm (ECDC)
  European Food Safety Authority (EFSA)
  World Health Organisation (WHO)
National Public Health organisations
  Swiss Federal Office of Public Health
  Icelandic Ministry of Health
  Spanish Ministry of Sanitation & Ministry of Health and Consumer Protection
  Institut de Veille Sanitaire (France)
  Global Public Health Intelligence Network (Canada)
  Danish Emergency Management Agency
  Italian Ministry of Health and Ministry of Defence
  Dutch Institute of Public Health & Food and Consumer Product Safety Authority
The (general?) public
  Currently ~1,000 visitors, ~37,000 hits per day on the public system

Slide 18

Locations mentioned in MedISys medical articles across languages
  English - French
  Spanish - Portuguese
  Italian - German
Importance of multilingual information gathering

Slide 19

Multilingual and cross-lingual analysis (1)
Example term variants: Influenza-A-Virus influenzavirus tipo A swine-origin influenza sjevernoameričk gripe pandemia influenzale mexicaanse griep мексиканск грипп североамериканск грипп pandemija svinjske sjevernoameričke gripe grippe nouvelle gripă porcină svinjski grip sikainfluenssa svininfluensa Schweineinfluenza Porzine Influenza Schweinegrippe flu porcina prasečí chřipka
Example name variants: Barack Obama (eu,yo) Barak Obama (az,wo) Барак Обама (ba,uk) باراك أوباما (ar) باراك اوباما (ar,fa) Барак Хуссейн Обама (ru) Baraque Obama (pt) バラク・オバマ (ja) บารัค โอบามา (th) Բարաք Օբամա (hy) ބަރަކް އޮބާމާ (dv) באראק אבאמא (yi) ברק אובאמה (he) 贝拉克·奥巴马 (zh) ބަރާކް އޮބާމާ (dv) بارک اوبامہ (ur)
Data processing layer:
  Detect 'known entities' across languages using a large multilingual set of name variants (updated daily)
  Geo-locate the articles using a large multilingual geo-database
  Apply text-based classification using multilingual category definitions
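
Detecting 'known entities' across languages from a table of name variants could look roughly like the sketch below. The table holds only a handful of spellings copied from this slide, the canonical labels are chosen for the example, and the naive substring match is an assumption; the slides do not describe the actual lookup method.

```python
import unicodedata


def normalise(text: str) -> str:
    """Unicode-normalise and case-fold so variants compare consistently."""
    return unicodedata.normalize("NFKC", text).casefold()


# Hypothetical miniature variant table: spelling -> canonical entity label.
_RAW_VARIANTS = {
    "Barack Obama": "Barack Obama",
    "Barak Obama": "Barack Obama",
    "Барак Обама": "Barack Obama",
    "Baraque Obama": "Barack Obama",
    "Schweinegrippe": "swine influenza",
    "svininfluensa": "swine influenza",
    "sikainfluenssa": "swine influenza",
    "prasečí chřipka": "swine influenza",
}
NAME_VARIANTS = {normalise(k): v for k, v in _RAW_VARIANTS.items()}


def find_entities(text: str) -> set:
    """Return canonical labels for every known variant found in the text."""
    haystack = normalise(text)
    return {entity for variant, entity in NAME_VARIANTS.items()
            if variant in haystack}
```

Because every variant maps to one canonical label, articles in different languages end up tagged with the same entity and can be aggregated together.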

Slide 20

Multilingual and cross-lingual analysis (2)
Data presentation layer:
  "Convenience" links to external Machine Translation programs, where available
  Display of various MedISys categories, and of persons and organisations found in the text
  Display on-line English translation of Chinese and Arabic

Slide 21

Aggregation of multilingual information
  Documents from all languages get classified by the same countries and categories.
  An increase in the number of media reports on any country-category combination is detected, independently of the reporting language.
  Graphs and alerts may show events not yet reported in your own language.

Slide 22

Detection using statistics
  Detect unusual flux of reporting for a specific country/category combination
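
One simple way to flag an unusual flux of reporting is to compare today's article count for a country/category pair with its recent history. The z-score rule and the default `threshold` below are assumptions for illustration; the slides do not say which statistic MedISys uses.

```python
from statistics import mean, stdev


def is_anomalous(history: list, today: int, threshold: float = 3.0) -> bool:
    """Flag today's count if it exceeds the historical mean by more than
    `threshold` standard deviations (an assumed rule, not MedISys's own)."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today > mu  # flat history: any increase stands out
    return (today - mu) / sigma > threshold
```

In use, `history` would hold the daily counts for one combination such as (Mexico, Influenza A), kept in a dictionary keyed by that pair.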

Slide 23

Recent example

Slide 24

News Clusters mostly about Category Sat. 02-05-2009, Influenza A

Slide 25

Categorized and Clustered News Sat. 02-05-2009, Influenza A

Slide 26

PULS Event detection
  Results from Helsinki University

Slide 27

Category definitions – Example: haemorrhagic fever
Terms (single or multi-word)
  Cumulative weights with threshold
  Case sensitive: upper-case characters in a pattern match only upper case in the text (useful for acronyms etc.)
Wildcards
  Single letters (_)
  Zero, one or more letters (%)
  Adjacent words (+)
Boolean combinations of term lists
  And, or, not
  Using proximity operator (within X words)
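
The wildcard and weighting rules above can be sketched by translating each term pattern into a regular expression and summing weights against a threshold. This is an illustrative reading of the slide, not the actual category engine: the example terms and weights are invented, counting every occurrence of a term is an assumption, and Boolean combinations and the proximity operator are left out.

```python
import re

LETTER = r"[^\W\d_]"  # any Unicode letter


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a term pattern: _ = one letter, % = zero or more letters,
    + = adjacent words. Lower-case letters match either case; upper-case
    letters match upper case only."""
    parts = []
    for ch in pattern:
        if ch == "_":
            parts.append(LETTER)
        elif ch == "%":
            parts.append(LETTER + "*")
        elif ch == "+":
            parts.append(r"\W+")
        elif ch.isalpha() and ch.islower():
            parts.append("[" + re.escape(ch) + re.escape(ch.upper()) + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b")


def category_score(text: str, weighted_terms: dict) -> int:
    """Sum each term's weight once per occurrence (assumed accumulation rule)."""
    return sum(weight * len(pattern_to_regex(term).findall(text))
               for term, weight in weighted_terms.items())


def matches_category(text: str, weighted_terms: dict, threshold: int) -> bool:
    return category_score(text, weighted_terms) >= threshold


# Hypothetical definition for the haemorrhagic fever example.
HAEMORRHAGIC_FEVER = {"h%emorrhagic+fever": 10, "ebola": 5, "WHO": 2}
```

Here `h%emorrhagic+fever` covers both the British and American spellings, and `WHO` matches the acronym but not the word "who".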

Slide 28

Customisability of MedISys
  Add more news sources or new categories, e.g