14.170: Programming for Economists 1/12/2009-1/16/2009 Melissa Dell Matt Notowidigdo Paul Schrimpf
Slide 2 Lecture 5, Large Data Sets in Stata + Numerical Precision
Slide 3 Overview
This lecture is part wrap-up, part "tips and tricks". Focus is on managing large data sets and on numerical precision. Numerical precision: introduction to binary representation; equilibrating matrices. Large data sets: how Stata represents data in memory; speeding up code; tips and tricks for large data sets.
Slide 4 Numerical precision
What the @&*%&$!^ is going on here?
local a = 0.7 + 0.1
local b = 0.8
display (`a' == `b')
local a = 0.75 + 0.05
local b = 0.8
display (`a' == `b')
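The same surprise can be reproduced outside Stata; it is a property of IEEE 754 double-precision arithmetic, not of Stata itself. A minimal Python illustration of the slide's two comparisons:

```python
# 0.7, 0.1, and 0.8 have no exact binary representation, so the
# rounded sum 0.7 + 0.1 need not land on the same double as 0.8.
a = 0.7 + 0.1
b = 0.8
print(a == b)        # False: a is stored as 0.7999999999999999

# 0.75 = 1/2 + 1/4 is exact in binary, and here the rounding of
# 0.05 happens to make the sum round exactly to the double for 0.8.
c = 0.75 + 0.05
print(c == b)        # True
```

The lesson carries over to Stata: never test floating-point results for exact equality; compare against a tolerance instead.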
Slide 5 Binary numbers
Computers store numbers in base 2 ("bits")
14₁₀ = 1110₂ (14 = 2 + 4 + 8)
170₁₀ = 10101010₂ (170 = 2 + 8 + 32 + 128)
How are decimals stored?
Slide 6 Binary numbers, con't
0.875₁₀ = 0.111₂ (0.875 = 0.5 + 0.25 + 0.125)
0.80₁₀ = 0.110011001100…₂
0.70₁₀ = 0.101100110011…₂
0.10₁₀ = 0.000110011001…₂
0.75₁₀ = 0.11₂
0.05₁₀ = 0.000011001100…₂
QUESTION: Is there a repeating decimal in base 10 that is not repeating in base 2?
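The expansions above can be generated mechanically: repeatedly double the fraction and peel off the digit that crosses the binary point. A small Python sketch (not part of the original slides), using exact rationals so the digits are not themselves polluted by rounding:

```python
from fractions import Fraction

def frac_to_binary(x, digits=12):
    """Return the first `digits` binary digits of a fraction in [0, 1)."""
    bits = []
    for _ in range(digits):
        x *= 2              # shift one binary place to the left
        bit = int(x)        # the digit that crossed the binary point
        bits.append(str(bit))
        x -= bit
    return "0." + "".join(bits)

print(frac_to_binary(Fraction(7, 8), 3))   # 0.111 (terminates)
print(frac_to_binary(Fraction(8, 10)))     # 0.110011001100 (repeats)
print(frac_to_binary(Fraction(1, 10)))     # 0.000110011001 (repeats)
```

Any fraction whose denominator (in lowest terms) contains a prime factor other than 2, such as 8/10 = 4/5, must repeat forever in base 2, which is exactly why 0.8 cannot be stored exactly.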
Slide 7 Precision issues in Mata
mata
A = (1e10, 2e10 \ 2e-10, 3e-10)
A
rank(A)
luinv(A)
A_inv = (-3e-10, 2e10 \ 2e-10, -1e10)
I = A * A_inv
I
end
Slide 8 Precision issues in Mata
Slide 9 Precision issues in Mata
mata
r = c = 0
A = (1e10, 2e10 \ 2e-10, 3e-10)
A
rank(A)
luinv(A, 1e-15)
_equilrc(A, r, c)
A
r
c
rank(A)
luinv(A)
c' :* luinv(A) :* r'
end
Slide 10 Precision issues in Mata
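The same badly scaled matrix trips up rank detection in other numerical environments, and the same fix, equilibrating (rescaling rows and/or columns toward unit size, which is what Mata's _equilrc() does), repairs it. A NumPy sketch of the idea, with the scaling done by hand:

```python
import numpy as np

A = np.array([[1e10, 2e10],
              [2e-10, 3e-10]])

# A is invertible (det = -1), but its singular values differ by
# ~20 orders of magnitude, so the default tolerance declares rank 1.
print(np.linalg.matrix_rank(A))       # 1

# Equilibrate: divide each row by its largest absolute entry.
r = np.abs(A).max(axis=1)             # row scale factors
A_eq = A / r[:, None]                 # well-scaled matrix
print(np.linalg.matrix_rank(A_eq))    # 2

# Undo the scaling to recover inv(A): since A = diag(r) @ A_eq,
# inv(A) = inv(A_eq) @ diag(1/r).
A_inv = np.linalg.inv(A_eq) / r[None, :]
print(A_inv)                          # ~ [[-3e-10, 2e10], [2e-10, -1e10]]
```

The point is the same as on the slide: the matrix is fine, but any rank test based on a single absolute tolerance cannot see that until the rows are brought to a common scale.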
Slide 11 Large data sets in Stata
Computer architecture overview: the CPU executes instructions; RAM (also called the "memory") stores frequently accessed data; DISK (the "hard drive") stores less frequently used data. RAM is accessed electronically; DISK is accessed mechanically (that is why you can HEAR it). Disk is therefore several orders of magnitude slower than RAM, so in Stata, if you ever need to go to disk, you're basically dead. Stata was not written to deal with data sets that are larger than the available RAM; it expects the data set to fit in memory. So when you type "set memory XXXm", make sure you are not setting the value larger than the available RAM (some operating systems won't let you, anyway). For >20-30 GB of data, Stata is not recommended; consider Matlab or SAS. [diagram: CPU ↔ RAM ↔ HARD DRIVE]
Slide 12 Large data sets in Stata, con't
Don't keep re-creating the same variables over and over. "save" can really help or really hurt; know when to use it and when to avoid it. Don't estimate parameters you don't care about. Lots of "if" and "in" qualifiers can slow things down. Create a "1% sample" to develop and test code (to prevent unexpected crashes after code has been running for hours).
Slide 13 Two-way fixed effects
clear
set seed 12345
set mem 2000m
set matsize 2000
set more off
set obs 5000
gen myn = _n
gen id = 1 + floor((_n - 1)/100)
sort id myn
by id: gen t = 1 + floor((_n - 1)/5)
gen x = invnormal(uniform())
gen fe = invnormal(uniform())
sort id t myn
by id t: replace fe = fe[1]
gen y = 2 + x + fe + 100 * invnormal(uniform())
reg y x
xi i.id i.t
reg y x _I*
summ t
gen idXt = id * (r(max) + 1) + t
areg y x, absorb(idXt)
Slide 14 Two-way fixed effects
Slide 15 Fixed effects with large data sets ~674 seconds
clear
set seed 12345
set mem 100m
set more off
set obs 500000
gen myn = _n
gen id = 1 + floor((_n - 1)/200)
sort id myn
by id: gen t = _n
gen x = invnormal(uniform())
gen id_fe = invnormal(uniform())
gen t_fe = invnormal(uniform())
by id: replace id_fe = id_fe[1]
sort t id
by t: replace t_fe = t_fe[1]
gen y = 2 + x + id_fe + t_fe + 100 * invnormal(uniform())
xi i.t
xtreg y x _It*, i(id) fe
Slide 16 Fixed effects with large data sets
Slide 17 Fixed effects with large data sets ~53 seconds
clear
set seed 12345
set mem 100m
set more off
set obs 500000
gen myn = _n
gen id = 1 + floor((_n - 1)/200)
sort id myn
by id: gen t = _n
gen x = invnormal(uniform())
gen id_fe = invnormal(uniform())
gen t_fe = invnormal(uniform())
by id: replace id_fe = id_fe[1]
sort t id
by t: replace t_fe = t_fe[1]
gen y = 2 + x + id_fe + t_fe + 100 * invnormal(uniform())
xtreg y, i(id) fe
predict y_resid, e
xtreg x, i(id) fe
predict x_resid, e
xtreg y_resid x_resid, i(t) fe
Slide 18 Fixed effects with large data sets
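The trick on slide 17 is the Frisch-Waugh-Lovell theorem: partial the id fixed effects out of y, x, and the time dummies, then run the short regression on the residuals. A NumPy sketch of the same idea on hypothetical simulated data (a small balanced panel, not the slide's 500,000 observations), checking that the two-step estimate matches the full dummy-variable regression:

```python
import numpy as np

rng = np.random.default_rng(12345)
n_id, n_t = 200, 5                         # 200 panels, 5 periods each
id_ = np.repeat(np.arange(n_id), n_t)
t = np.tile(np.arange(n_t), n_id)
x = rng.normal(size=n_id * n_t)
y = (2 + x
     + rng.normal(size=n_id)[id_]          # id fixed effect
     + rng.normal(size=n_t)[t]             # time fixed effect
     + rng.normal(size=n_id * n_t))        # idiosyncratic noise

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Long regression: x plus full id dummies and t dummies (one omitted).
D = np.column_stack([id_ == i for i in range(n_id)] +
                    [t == s for s in range(1, n_t)]).astype(float)
b_long = ols(np.column_stack([x, D]), y)[0]

# Two-step: sweep out id means (kills the id FEs on a balanced panel),
# then regress residualized y on residualized x and t dummies.
def demean_by_id(v):
    return v - np.bincount(id_, weights=v)[id_] / n_t

Dt_dm = np.column_stack([demean_by_id((t == s).astype(float))
                         for s in range(1, n_t)])
b_fwl = ols(np.column_stack([demean_by_id(x), Dt_dm]),
            demean_by_id(y))[0]

print(b_long, b_fwl)    # identical up to rounding
```

The payoff in the Stata version is that neither step ever builds the huge id-dummy design matrix, which is what makes the ~53-second run possible.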
Slide 19 Other tips and tricks when you have a large number of fixed effects in large data sets: use matrix algebra; Newton steps in parallel; "zigzag maximization" (Heckman-MaCurdy)
Slide 20 Matrix algebra ~162 seconds
clear
mata
rseed(14170)
N = 3000
rA = rnormal(5, 5, 0, 1)
rB = rnormal(5, N, 0, 1)
rC = rnormal(N, 5, 0, 1)
d = rnormal(1, N, 0, 1)
V = (rA, rB \ rC, diag(d))
V_inv = luinv(V)
V_inv[1..5, 1..5]
Slide 21 Matrix algebra <1 second
clear
mata
rseed(14170)
N = 3000
rA = rnormal(5, 5, 0, 1)
rB = rnormal(5, N, 0, 1)
rC = rnormal(N, 5, 0, 1)
d = rnormal(1, N, 0, 1)
V = (rA, rB \ rC, diag(d))
V_fast = luinv(rA - cross(rB', d :^ -1, rC))
V_fast
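The speedup comes from a partitioned-inverse (Schur complement) identity: when V = (A, B \ C, D) with D diagonal, the top-left block of V⁻¹ equals (A − B·D⁻¹·C)⁻¹, so only a 5×5 inverse and some cheap products are needed instead of an (N+5)×(N+5) inverse. A NumPy check on a smaller hypothetical example (N = 300 rather than 3000, and a diagonal bounded away from zero so the comparison stays well-conditioned):

```python
import numpy as np

rng = np.random.default_rng(14170)
N, k = 300, 5
A = rng.normal(size=(k, k))
B = rng.normal(size=(k, N))
C = rng.normal(size=(N, k))
d = rng.uniform(1.0, 2.0, size=N)   # diagonal block, kept away from 0

V = np.block([[A, B],
              [C, np.diag(d)]])

# Brute force: invert the whole (N+k) x (N+k) matrix.
slow = np.linalg.inv(V)[:k, :k]

# Schur complement: B @ diag(1/d) @ C is just B / d (column-wise),
# so only a k x k matrix is ever inverted.
fast = np.linalg.inv(A - (B / d) @ C)

print(np.allclose(slow, fast))      # True
```

This is the same algebra the Mata line performs: cross(rB', d :^ -1, rC) computes rB · diag(1/d) · rC without ever forming the N×N diagonal block.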
Slide 22 Fixed effects probit
Finkelstein, Luttmer, Notowidigdo (2008) run a fixed effects probit as a robustness check. What about the incidental parameters problem? (see Hahn and Newey, EMA, 2004) But what to do with >11,000 fixed effects! Can't de-mean within panel as you could with a linear probability model. The Stata/SE and Stata/MP matrix size limit is 11,000. Need some computational tricks.
Slide 23 Fixed effects probit
clear
set seed 12345
set matsize 2000
set obs 2000
gen id = 1 + floor((_n - 1)/4)
gen a = invnormal(uniform())
gen fe_raw = 0.5*invnorm(uniform()) + 2*a
bys id: egen fe = mean(fe_raw)
gen x = invnormal(uniform())
gen e = invnormal(uniform())
gen y = (1*x + fe > invnormal(uniform()) + a)
bys id: egen x_mean = mean(x)
gen x_demean = x - x_mean
probit y x
probit y x_demean
sort id y
by id: keep if y[1] != y[_N]
probit y x
xi i.id
probit y x _I*
Slide 24 Fixed effects probit
Slide 25 Fixed effects probit (slow)
clear
set more off
set mem 1000m
set seed 12345
set matsize 3000
set obs 12000
gen id = 1 + floor((_n - 1)/4)
gen a = invnormal(uniform())
gen fe_raw = 0.5*invnorm(uniform()) + 2*a
bys id: egen fe = mean(fe_raw)
gen x = invnormal(uniform())
gen e = invnormal(uniform())
gen y = (1*x + fe > invnormal(uniform()) + a)
sort id y
by id: keep if y[1] != y[_N]
xi i.id
probit y x _I*
Slide 26 Fixed effects probit (slow) ~40 minutes
Slide 27 Fixed effects probit (faster)
clear
set mem 1000m
set seed 12345
set matsize 3000
set obs 12000
gen id = 1 + floor((_n - 1)/4)
gen a = invnormal(uniform())
gen fe_raw = 0.5*invnorm(uniform()) + 2*a
bys id: egen fe = mean(fe_raw)
gen x = invnormal(uniform())
gen e = invnormal(uniform())
gen y = (1*x + fe > invnormal(uniform()) + a)
sort id y
by id: keep if y[1] != y[_N]
egen id_new = group(id)
summ id_new
local max = r(max)
gen fe_hat = 0
forvalues iter = 1/20 {
    probit y x, nocons offset(fe_hat)
    capture drop xb
    predict xb, xb nooffset
    forvalues i = 1/`max' {
        qui probit y if id_new == `i', offset(xb)
        qui replace fe_hat = _b[_cons] if id_new == `i'
    }
}
probit y x, noconstant offset(fe_hat)
Slide 28 Fixed effects probit (faster) ~8 minutes QUESTION: Why are the standard errors not the same?
Slide 29 Exercises
Speed up the fixed effects probit even more by updating the fixed effects in parallel. Fix the standard errors in the FE probit example.
SPONSORS