Information Stream Algorithms Intro, Sampling, Entropy

1466 days ago, 489 views
PowerPoint PPT Presentation
Information Stream Algorithms. 2. Layout. Prologue to Data StreamsMotivating cases and applicationsData Streaming modelsBasic tail boundsSampling from information streamsSampling to gauge entropy. . . Information Stream Algorithms. 3. Information is becoming speedier than our capacity to store or file itThere are 3 Billion Telephone Calls in US every day, 30 Billion messages day by day, 1 Billion SMS, IMs. Investigative information:

Presentation Transcript

Slide 1

Information Stream Algorithms Intro, Sampling, Entropy Graham Cormode

Slide 2

Outline Introduction to Data Streams Motivating illustrations and applications Data Streaming models Basic tail limits Sampling from information streams Sampling to gauge entropy Data Stream Algorithms

Slide 3

Data is Massive Data is becoming quicker than our capacity to store or file it There are 3 Billion Telephone Calls in US every day, 30 Billion messages day by day, 1 Billion SMS, IMs. Logical information : NASA's perception satellites produce billions of readings each every day. IP Network Traffic : up to 1 Billion bundles for every hour per switch. Each ISP has a huge number switches! Entire genome successions for some species now accessible: every megabytes to gigabytes in size Data Stream Algorithms

Slide 4

Massive Data Analysis Must dissect this monstrous information: Scientific research (screen condition, species) System administration (spot flaws, drops, disappointments) Customer look into (affiliation rules, new offers) For income security (telephone extortion, benefit manhandle) Else, why even measure this information? Information Stream Algorithms

Slide 5

Example: Network Data Networks are wellsprings of huge information: the metadata every hour per switch is gigabytes Fundamental issue of information stream examination: Too much data to store or transmit So prepare information as it arrives: one pass, little space: the information stream approach. Surmised answers to many inquiries are OK, if there are assurances of result quality Data Stream Algorithms

Slide 6

Network Operations Center (NOC) SNMP/RMON, NetFlow records Source Destination Duration Bytes Protocol 12 20K http 16 24K http 15 20K http 19 40K http 26 58K http 27 100K ftp 32 300K ftp 18 80K ftp Peer Converged IP/MPLS Core Enterprise Networks PSTN Broadband Internet Access DSL/Cable Networks Voice over IP FR, ATM, IP VPN IP Network Monitoring Application 24x7 IP parcel/stream information streams at system components Truly enormous streams touching base at fast rates AT&T/Sprint gather ~1 Terabyte of NetFlow information every day Often dispatched off-website to information distribution center for disconnected investigation Example NetFlow IP Session Data Stream Algorithms

Slide 7

Back-end Data Warehouse What are the top (most successive) 1000 (source, dest) sets seen throughout the most recent month? DBMS (Oracle, DB2) what number unmistakable (source, dest) sets have been seen by both R1 and R2 yet not R3? SELECT COUNT (R1.source, R2.dest) FROM R1, R2 WHERE R1.dest = R2.source Set-Expression Query SQL Join Query Network Monitoring Queries Extra multifaceted nature originates from constrained space and time Will present answers for these and different issues Off-line investigation – moderate, costly Network Operations Center (NOC) R3 Peer R1 R2 Enterprise Networks PSTN DSL/Cable Networks Data Stream Algorithms

Slide 8

Other Streaming Applications Sensor systems Monitor living space and ecological parameters Track many items, interruptions, drift examination… Utility Companies Monitor control matrix, client utilization designs and so on. Alarms and quick reaction if there should arise an occurrence of issues Data Stream Algorithms

Slide 9

Streams Defining Frequency Dbns. We will consider streams that characterize recurrence disseminations E.g. recurrence of bundles from source A to source B This basic setting catches large portions of the center algorithmic issues in information spilling what number particular (non-zero) values seen? What is the entropy of the recurrence dispersion? What (and where) are the most noteworthy frequencies? All the more for the most part, can consider streams that characterize multi-dimensional disseminations, charts, geometric information and so on. In any case, notwithstanding for recurrence dispersions, a few models are pertinent Data Stream Algorithms

Slide 10

Data Stream Models We demonstrate information streams as successions of straightforward tuples Complexity emerges from huge length of streams Arrivals just streams : Example: (x, 3), (y, 2), (x, 2) encodes the landing of 3 duplicates of thing x , 2 duplicates of y, then 2 duplicates of x . Could speak to eg. parcels on a system; control use Arrivals and flights : Example: (x, 3), (y,2), (x, - 2) encodes last condition of (x, 1), (y, 2). Could speak to fluctuating amounts, or measure contrasts between two conveyances x y x y Data Stream Algorithms

Slide 11

Approximation and Randomization Many things are difficult to register precisely over a stream Is the number of all things the same in two unique streams? Requires direct space to process precisely Approximation : discover an answer adjust inside some element Find an answer that is inside 10% of right outcome More by and large, a (1   ) consider estimation Randomization : permit a little likelihood of disappointment Answer is right, aside from with likelihood 1 in 10,000 More by and large, achievement likelihood (1- ) Approximation and Randomization : (  ,  ) - approximations Data Stream Algorithms

Slide 12

Probability dissemination Tail likelihood Markov: Chebyshev : Basic Tools: Tail Inequalities General limits on tail likelihood of an arbitrary variable (likelihood that an irregular variable goes astray a long way from its desire) Basic Inequalities : Let X be an irregular variable with desire  and difference Var[X] . At that point, for any  >0 Data Stream Algorithms

Slide 13

Tail Bounds Markov Inequality: For an irregular variable Y which takes just non-negative qualities. Pr[Y  k]  E(Y)/k (This will be < 1 just for k > E(Y) ) Chebyshev's Inequality: For an arbitrary variable Y: Pr[|Y-E(Y)|  k]  Var(Y)/k 2 Proof: Set X = (Y – E(Y)) 2 E(X) = E(Y 2 +E(Y) 2 –2YE(Y)) = E(Y 2 )+E(Y) 2 - 2E(Y) 2 = Var(Y) So: Pr[|Y-E(Y)|  k] = Pr[(Y – E(Y)) 2  k 2 ]. Utilizing Markov:  E(Y – E(Y)) 2/k 2 = Var(Y)/k 2 Data Stream Algorithms

Slide 14

Outline Introduction to Data Streams Motivating illustrations and applications Data Streaming models Basic tail limits Sampling from information streams Sampling to gauge entropy Data Stream Algorithms

Slide 15

Sampling From a Data Stream Fundamental prob: test m things consistently from stream Useful: inexact exorbitant calculation on little specimen Challenge : don't know to what extent stream is So when/how regularly to test? Two arrangements, apply to various circumstances: Reservoir inspecting (dates from 1980s?) Min-wise testing (dates from 1990s?) Data Stream Algorithms

Slide 16

Reservoir Sampling Sample first m things Choose to test the i 'th thing ( i>m ) with likelihood m/i If examined, haphazardly supplant a formerly tested thing Optimization : when i gets expansive, register which thing will be examined next, skirt mediating things. [Vitter 85] Data Stream Algorithms

Slide 17

1 i i+1 n-2 n-1  …  i i+1 i+2 n-1 n Reservoir Sampling - Analysis Analyze straightforward case: test estimate m = 1 Probability i 'th thing is the specimen from stream length n : Prob. i is inspected on entry  prob. i gets by to end = 1/n Case for m > 1 is comparative, simple to show uniform likelihood Drawbacks of repository inspecting: hard to parallelize Data Stream Algorithms

Slide 18

Min-wise Sampling For every thing, pick an irregular division in the vicinity of 0 and 1 Store item(s) with the littlest arbitrary label [Nath et al.'04] 0.391 0.908 0.291 0.555 0.619 0.273 Each thing has same possibility of slightest label, so uniform Can keep running on different streams independently, then union Data Stream Algorithms

Slide 19

Sampling Exercises What happens when every thing in the stream additionally has a weight appended, and we need to test in view of these weights? Sum up the repository inspecting calculation to draw a solitary example in the weighted case. Sum up store examining to test various weighted things, and demonstrate an illustration where it neglects to give an important answer. Examine issue: outline new spilling calculations for testing in the weighted case, and dissect their properties. Information Stream Algorithms

Slide 20

Outline Introduction to Data Streams Motivating illustrations and applications Data Streaming models Basic tail limits Sampling from information streams Sampling to gauge entropy Data Stream Algorithms

Slide 21

Application of Sampling: Entropy Given a long arrangement of characters S = <a 1 , a 2 , a 3 … a m > each a j  {1… n} Let f i = recurrence of i in the succession Compute the experimental entropy: H(S) = -  i f i/m log f i/m = -  i p i log p i Example : S = < a , b, a , b, c , a , d , a > p a = 1/2 , p b = 1/4, p c = 1/8 , p d = 1/8 H(S) = ½ + ¼  2 + 1/8  3 + 1/8  3 = 7/4 Entropy advanced for irregularity location in systems Data Stream Algorithms

Slide 22

Challenge Goal : surmised H(S) in space sublinear (poly-log) in m (stream length), n (letter set size) (  ,  ) approx: answer is (1 §  )H(S) w/prob 1- Easy on the off chance that we have O(n) space: figure every f i precisely More difficult if n is tremendous, m is colossal, and we have just a single disregard the contribution to arrange (The information stream display) Data Stream Algorithms

Slide 23

Sampling Based Algorithm Simple estimator: Randomly test a position j in the stream Count how often a j shows up accordingly = r Output X = - (r log (r/m) – (r-1) log((r-1)/m)) Claim: Estimator is unprejudiced – E[X] = H(S) Proof : prob of picking j = 1/m , whole telescopes accurately Variance of gauge is not very expansive – Var[X] = O(log 2 m) Observe that |X|