Presentation. Malathi Veeraraghavan, Professor, Charles L. Brown Dept. of Electrical and Computer Engineering, University of Virginia

Outline: Growing interest in data. Course: From Data to Knowledge. Summary.

"The data deluge". "Data, data everywhere", Economist special issue, Feb. 27-Mar. 5, 2010. Walmart databases alone are estimated at more than 2.5 petabytes (a petabyte is 1 million gigabytes): 2010 numbers. From businesses to governments, data collection and analysis is quickly becoming the next big thing. 2012: pieces of information effect in-the-world.html?pagewanted=all

"The data deluge". "A new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data." Hal Varian, Google's chief economist, notes that "Data are widely available; what is scarce is the ability to extract wisdom from them."

Business intelligence: Nestlé sells more than 100,000 products in 200 countries using 550,000 suppliers. Problem: it was not using its huge buying power effectively. It used SAP software and analyzed its data. For just one ingredient, vanilla, its American operation reduced the number of specifications and used fewer suppliers, saving $30M a year. Annual savings from such operational improvements: $1 billion. Economist special issue

Medical use: Dr. Carolyn McGregor from University of Ontario. Goal: spot fatal infections in premature babies. Monitors subtle changes in 7 streams of real-time data, such as heart rate, blood pressure, etc. ECG alone takes 1000 readings/second. Infections are detected before obvious symptoms emerge. The naked eye can't see it, but the computer can! Who programs these? Statistics experts. Another term: Evidence-Based Medicine. Economist special issue

Government use: An amendment to a 1986 law required firms to disclose the harmful chemicals they release. Once the public started tracking these numbers, American businesses reduced their emissions of the chemicals covered under the law by 40% by 2000. Economist special issue

Best-sellers: "Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart" by Ian Ayres. "Moneyball: The Art of Winning an Unfair Game" by Michael Lewis. "The Long Tail" by Chris Anderson. Malcolm Gladwell books: "Outliers". "Microtrends" by Mark Penn (elections). "Freakonomics" by S. Dubner and S. Levitt.

Moneyball example: 2002 season. The richest team, the NY Yankees, had a payroll of $126 million, while the Oakland A's had a payroll of less than a third of that, about $40 million, yet they had reached the playoffs three years in a row and took the Yankees close to elimination. How did they do it? Billy Beane, general manager of the Oakland A's, respected statistics. He hired Paul DePodesta, a Harvard graduate, who applied Bill James' formulas and selected players based on their statistics. Runs created = (Hits + Walks) × Total Bases / (At Bats + Walks). Jeremy Brown: the only player in the history of the SEC with 300 hits and 200 walks, but he was overweight. Scouts versus statisticians! The tendency of everybody to generalize wildly from his own experience. Most people think their own experience is typical!
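The runs-created formula on this slide is simple enough to compute directly. The following is a minimal sketch, not code from the book or the course: the function name `runs_created` and the season line passed to it are made up for illustration.

```python
def runs_created(hits: int, walks: int, total_bases: int, at_bats: int) -> float:
    """Basic runs-created estimate: (Hits + Walks) * Total Bases / (At Bats + Walks)."""
    plate_appearances = at_bats + walks
    if plate_appearances <= 0:
        raise ValueError("at_bats + walks must be positive")
    return (hits + walks) * total_bases / plate_appearances


# Hypothetical season line, invented purely for illustration (not a real player).
estimate = runs_created(hits=150, walks=80, total_bases=250, at_bats=500)
print(f"Estimated runs created: {estimate:.1f}")
```

Walks count in both the numerator and the denominator, which is why a patient hitter such as Jeremy Brown scores well on this measure even if scouts dislike how he looks.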

Malcolm Gladwell's "Outliers": hockey players story. Why Canadian hockey players born early in the year have a big advantage: the cutoff date was Jan. 1. ESPN conducted a small study of all the 2008 season NHL players who were born from 1980 to 1990. [Later disputed for 2011 players] Sure enough: many more were born early in the year than late. = merron/081208

Examples from "The Long Tail": Rhapsody, an online music store, which in Dec. 2005 had 1.5M tracks, reported that the number of downloads/month for even the 100,000th track was in the 1000s, while a Walmart store, the largest physical music retailer, stocks only 55,000 tracks. Rhapsody reports that 40% of its total sales came from Long Tail products, i.e., those not available in retail stores. Anderson gives several such examples, calling these companies Long-Tail aggregators: Google as the long-tail aggregator of advertising, eBay of goods, Amazon of books, Apple of music, Netflix of movies.

Experts versus intuition. Ian Ayres' book: "The future belongs to people like Wolfers who are comfortable with both intuition and numbers". Wolfers analyzed 44,000 college basketball games (over 16 years). Also see Jonah Lehrer's "How We Decide", another bestseller. Ian Ayres' book, page 220

What Wolfers plotted: density function of the number of games that beat the Las Vegas spread. Perfect normal bell curve! Just look at games with point spreads less than or equal to 12: perfect normal bell curve. Look at games with point spread > 12: 47% chance that the favored team beat the spread (53% failed to cover the spread). More than 20% of games fell in this category of games with >12 spreads. Is it point shaving? Look at the score five minutes before the end of the game: right on track to beat the spread 50% of the time! Certainly a stronger case for point shaving. Ian Ayres' book, page 216
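One rough way to judge whether the 47% figure could be chance is a z-score against a fair 50% baseline. The sketch below assumes about 8,800 large-spread games (20% of the 44,000 analyzed, the lower bound implied by "more than 20%"). That count and the helper name `proportion_z_score` are illustrative assumptions, not figures or methods taken from Ayres' book.

```python
import math


def proportion_z_score(observed: float, expected: float, n_games: int) -> float:
    """Number of standard deviations between an observed proportion and an
    expected one, using the binomial standard deviation under the expected rate."""
    if n_games <= 0:
        raise ValueError("n_games must be positive")
    sd = math.sqrt(expected * (1.0 - expected) / n_games)
    return (observed - expected) / sd


# ASSUMPTION: roughly 8,800 games had spreads above 12 (20% of 44,000).
ASSUMED_LARGE_SPREAD_GAMES = 8_800

z = proportion_z_score(observed=0.47, expected=0.50, n_games=ASSUMED_LARGE_SPREAD_GAMES)
print(f"z = {z:.2f} standard deviations from a fair 50%")
```

Under this assumed game count the gap is well beyond two standard deviations, so it would be hard to attribute to chance alone.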

2SD Rule: To understand variability. There is a 95% chance that a normally distributed variable will fall within two standard deviations (plus or minus) of its mean. Statistical significance is a simple, intuitive concept: there is less than a 5% chance that a normally distributed random variable will be more than two standard deviations away from the mean. Stanford Law School students knew that professors were required to give a 3.2 mean. They wanted to know whether the professor was a "spreader" or a "clumper"! Ian Ayres' book, page 221
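The 95% figure in the 2SD rule can be checked numerically. This is a small sketch using only the Python standard library; the helper names `normal_cdf` and `prob_within` are chosen here for illustration.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal N(0, 1)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_within(k: float) -> float:
    """P(|Z| <= k) for Z ~ N(0, 1): chance of landing within k standard deviations."""
    return normal_cdf(k) - normal_cdf(-k)


inside = prob_within(2.0)
print(f"P(within 2 SD)  = {inside:.4f}")
print(f"P(outside 2 SD) = {1.0 - inside:.4f}")
```

The exact value is slightly above 95%, which is why "more than two standard deviations away" has less than a 5% chance.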

"Margin of error": A news article says "Laverne is leading Shirley 51% to 49% with a margin of error of 2%" and so the race is a "statistical dead heat." Ayres declares this "baloney!" Why? Margin of error = 2SD, so the standard deviation is 1%. This means there is an 84% chance that Laverne leads in the polls (i.e., has more than 50% of the vote). Ian Ayres' book, page 224

P(X ≤ 1) = P(X ≥ -1) = 0.84, where X ~ N(0,1)
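The 84% figure follows from treating the margin of error as two standard deviations. The sketch below reproduces that arithmetic; the function name `prob_leading` and its parameters are illustrative, and it assumes the poll share is normally distributed around the reported 51%.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal N(0, 1)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_leading(share_pct: float, margin_of_error_pct: float,
                 threshold_pct: float = 50.0) -> float:
    """Chance that the true share exceeds the threshold, taking
    margin of error = 2 standard deviations (the 2SD rule)."""
    if margin_of_error_pct <= 0:
        raise ValueError("margin_of_error_pct must be positive")
    sd = margin_of_error_pct / 2.0
    z = (threshold_pct - share_pct) / sd  # here: (50 - 51) / 1 = -1
    return 1.0 - normal_cdf(z)            # P(X >= -1) for X ~ N(0, 1)


p = prob_leading(share_pct=51.0, margin_of_error_pct=2.0)
print(f"Chance Laverne is really ahead: {p:.2f}")
```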

Exercise: See if you can use the 2SD rule and just your intuition to derive a number for the standard deviation of adult male height. Estimate two things: mean and standard deviation. Ian Ayres' book, page 214
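One possible way to work the exercise is to guess a range that contains about 95% of adult men and read it as mean ± 2 SD. The range used below (64 to 76 inches) is a hypothetical guess for illustration only, not measured data and not the book's answer. The function name is also made up.

```python
def mean_and_sd_from_95_range(low: float, high: float) -> tuple[float, float]:
    """Treat [low, high] as mean +/- 2 SD: the midpoint is the mean and
    a quarter of the width is the standard deviation."""
    if high <= low:
        raise ValueError("high must be greater than low")
    mean = (low + high) / 2.0
    sd = (high - low) / 4.0
    return mean, sd


# HYPOTHETICAL intuition: about 95% of adult men are between 64 and 76 inches tall.
mean_in, sd_in = mean_and_sd_from_95_range(low=64.0, high=76.0)
print(f"Guessed mean: {mean_in} in, guessed SD: {sd_in} in")
```

Changing the guessed range changes the answer, which is the point of the exercise: intuition about the range gives a usable estimate of the spread.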

Technology trends enabling this data analysis: Cloud computing (Amazon, Google, Yahoo, Microsoft). Open source software: R programming language (NY Times article, Jan. 7, 2009). Hadoop allows ordinary PCs to analyze huge quantities of data that previously required supercomputers. Economist special issue

Technology or techniques? Moore's Law: processing power doubles about every two years. Super Crunching needs CPUs, but computing power has been available. More important: Kryder's Law: storage capacity of hard drives has been doubling about every two years. Named after Mark Kryder, chief technology officer of hard drive manufacturer Seagate. Ian Ayres' book, page 151

Three techniques. Regressions: error term ~ N(0, σ²). Randomization: run tests by treating different samples in different ways. Neural networks: the functional form is not assumed to be linear or anything specific. Ian Ayres' book
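To make the regression bullet concrete, here is a minimal sketch that simulates data from a straight line plus an N(0, σ²) error term and then recovers the line by ordinary least squares. The true intercept, slope, and σ are arbitrary values chosen for illustration, and `fit_line` is a hypothetical helper. The course itself uses R, where `lm` plays this role.

```python
import random


def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# ASSUMED values for the simulation; none of these come from the slides.
TRUE_INTERCEPT, TRUE_SLOPE, SIGMA = 2.0, 0.5, 1.0

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [rng.uniform(0.0, 10.0) for _ in range(1000)]
# Each observation is the line plus an error term drawn from N(0, SIGMA^2).
ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + rng.gauss(0.0, SIGMA) for x in xs]

intercept, slope = fit_line(xs, ys)
print(f"estimated intercept = {intercept:.3f}, estimated slope = {slope:.3f}")
```

With 1,000 points the estimates land close to the assumed true values. The scatter around the fitted line is what the error term describes.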

Course material: From Data to Knowledge. Focus on data sets, less on details of statistical techniques. Learn R programming through in-class R programs and assignments.

Summary: Importance of data analysis in every walk of life! How to extract the "story" hidden in the data set?