14.170: Programming for Economists 1/12/2009-1/16/2009 Melissa Dell Matt Notowidigdo Paul Schrimpf

Lecture 5, Large Data Sets in Stata + Numerical Precision

Overview
This lecture is part wrap-up lecture, part "tips and tricks"
Focus is on dealing with large data sets and on numerical precision
Numerical precision
    Introduction to binary representation
    Equilibrating matrices
Large data sets
    How Stata represents data in memory
    Speeding up code
    Tips and tricks for large data sets

Numerical precision
What the @&*%&$!^ is going on here?
    local a = 0.7 + 0.1
    local b = 0.8
    display (`a' == `b')
    local a = 0.75 + 0.05
    local b = 0.8
    display (`a' == `b')
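A minimal sketch of the same comparison done directly on expressions, followed by two common workarounds. It assumes only built-in Stata functions (`reldif()` and `float()`); the 1e-12 tolerance is an arbitrary value chosen for illustration and does not come from the lecture.

```stata
* Exact equality on doubles: 0.7 and 0.1 have no finite binary expansion,
* so their sum need not be the double that is closest to 0.8.
display (0.7 + 0.1 == 0.8)
display (0.75 + 0.05 == 0.8)

* Print the stored values with 20 decimal places to see the difference
display %22.20f (0.7 + 0.1)
display %22.20f (0.8)

* Workaround 1: compare with a relative tolerance (1e-12 is an assumption)
display (reldif(0.7 + 0.1, 0.8) < 1e-12)

* Workaround 2: round both sides to float precision before comparing
display (float(0.7 + 0.1) == float(0.8))
```

Which workaround fits depends on the application: a tolerance is the more general choice, while `float()` is mainly useful when one side of the comparison is a variable stored as float.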

Binary numbers
Computers store numbers in base 2 ("bits")
14 (base 10) = 1110 (base 2) (14 = 2 + 4 + 8)
170 (base 10) = 10101010 (base 2) (170 = 2 + 8 + 32 + 128)
How are decimals stored?

Binary numbers, con't
0.875 (base 10) = 0.111 (base 2) (0.875 = 0.5 + 0.25 + 0.125)
0.80 (base 10) = 0.110011001100... (base 2)
0.70 (base 10) = 0.101100110011... (base 2)
0.10 (base 10) = 0.000110011001... (base 2)
0.75 (base 10) = 0.11 (base 2)
0.05 (base 10) = 0.000011001100... (base 2)
QUESTION: Is there a repeating decimal in base 10 that is not repeating in base 2?
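The expansions above can be reproduced by repeated doubling: multiply the fraction by 2, record the integer part as the next binary digit, and continue with the remainder. The sketch below does this in Mata; `binfrac()` is a hypothetical helper written for this illustration, not a built-in function.

```stata
mata:
// binfrac() is a hypothetical helper: it returns the first k binary
// digits of a fraction x with 0 <= x < 1, as a string.
string scalar binfrac(real scalar x, real scalar k)
{
    real scalar   r, i
    string scalar s

    r = x
    s = "0."
    for (i = 1; i <= k; i++) {
        r = 2 * r                // shift one binary place to the left
        if (r >= 1) {
            s = s + "1"          // integer part is 1: emit 1, keep remainder
            r = r - 1
        }
        else {
            s = s + "0"
        }
    }
    return(s)
}

binfrac(0.875, 12)
binfrac(0.8, 12)
binfrac(0.1, 12)
end
```

The fraction 0.875 terminates after three digits, while 0.8 and 0.1 keep cycling, which is why they cannot be stored exactly in a finite number of bits.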

Precision issues in Mata
    mata
    A = (1e10, 2e10 \ 2e-10, 3e-10)
    A
    rank(A)
    luinv(A)
    A_inv = (-3e-10, 2e10 \ 2e-10, -1e10)
    I = A * A_inv
    I
    end

Precision issues in Mata
    mata
    r = c = 0
    A = (1e10, 2e10 \ 2e-10, 3e-10)
    A
    rank(A)
    luinv(A, 1e-15)
    _equilrc(A, r, c)
    A
    r
    c
    rank(A)
    luinv(A)
    c':*luinv(A):*r'
    end

Large data sets in Stata
Computer architecture overview
    CPU: executes instructions
    RAM (also called the "memory"): stores frequently accessed data
    DISK ("hard drive"): stores not-as-frequently used data
RAM is accessed electronically; DISK is accessed mechanically (that's why you can HEAR it). Thus DISK is several orders of magnitude slower than RAM.
In Stata, if you ever have to access the disk, you're pretty much dead. Stata was not written to deal with data sets that are larger than the available RAM. It expects the data set to fit in memory.
So when you type "set memory XXXm", make sure that you are not setting the value to be larger than the available RAM (some operating systems won't even let you, anyway).
For >20-30 GB of data, Stata is not recommended. Consider Matlab or SAS.
(Diagram: CPU, RAM, HARD DRIVE)
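Because the whole data set has to fit in RAM, the storage type of each variable matters: a byte takes 1 byte per observation, an int 2, a long or float 4, and a double 8. The sketch below is an illustration under assumptions (the variable names and the 500,000-observation size are made up) of how `describe` and `compress` can be used to check and shrink the memory footprint.

```stata
clear
set obs 500000

* Integers from 0 to 99 stored two ways (variable names are hypothetical)
gen double x_dbl = floor(100 * runiform())   // 8 bytes per observation
gen byte   x_byt = x_dbl                     // 1 byte per observation

* Report the size of the data set in memory
describe, short

* compress demotes each variable to the smallest type that loses nothing
compress
describe
```

Running `compress` before saving a large file is cheap, and it never changes the stored values.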

Large data sets in Stata, con't
Don't keep re-creating the same variables over and over again
"save" can really help or really hurt. Know when to use it and when to avoid it
Don't estimate parameters you don't care about
Lots of "if" and "in" commands could slow things down
Create a "1% sample" to develop and test code (to prevent unanticipated crashes after code has been running for hours)
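One possible way to build the "1% sample" for panel data is sketched below. The simulated data and the choice to sample whole panels (rather than individual observations) are assumptions made for this illustration; for a plain cross-section, the built-in `sample 1` command draws 1% of observations directly.

```stata
clear
set seed 12345
set obs 500000
gen id = 1 + floor((_n - 1)/200)
gen x  = rnormal()
gen y  = 2 + x + 100 * rnormal()

* Draw one uniform number per id and copy it to every row of that id,
* so that a retained panel is kept in full
bysort id: gen double u = runiform() if _n == 1
by id: replace u = u[1]
keep if u < 0.01
drop u

* Develop and debug on the small sample before running on the full data
reg y x
```

Keeping complete panels means that code relying on within-id operations behaves on the test sample the same way it will on the full data.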

Two-way fixed effects
    clear
    set seed 12345
    set mem 2000m
    set matsize 2000
    set more off
    set obs 5000
    gen myn = _n
    gen id = 1 + floor((_n - 1)/100)
    sort id myn
    by id: gen t = 1 + floor((_n - 1)/5)
    gen x = invnormal(uniform())
    gen fe = invnormal(uniform())
    sort id t myn
    by id t: replace fe = fe[1]
    gen y = 2 + x + fe + 100 * invnormal(uniform())
    reg y x
    xi i.id*i.t
    reg y x _I*
    summ t
    gen idXt = id * (r(max) + 1) + t
    areg y x, absorb(idXt)

Fixed Effects with large data sets (~674 seconds)
    clear
    set seed 12345
    set mem 100m
    set more off
    set obs 500000
    gen myn = _n
    gen id = 1 + floor((_n - 1)/200)
    sort id myn
    by id: gen t = _n
    gen x = invnormal(uniform())
    gen id_fe = invnormal(uniform())
    gen t_fe = invnormal(uniform())
    by id: replace id_fe = id_fe[1]
    sort t id
    by t: replace t_fe = t_fe[1]
    gen y = 2 + x + id_fe + t_fe + 100 * invnormal(uniform())
    xi i.t
    xtreg y x _It*, i(id) fe

Fixed Effects with large data sets (~53 seconds)
    clear
    set seed 12345
    set mem 100m
    set more off
    set obs 500000
    gen myn = _n
    gen id = 1 + floor((_n - 1)/200)
    sort id myn
    by id: gen t = _n
    gen x = invnormal(uniform())
    gen id_fe = invnormal(uniform())
    gen t_fe = invnormal(uniform())
    by id: replace id_fe = id_fe[1]
    sort t id
    by t: replace t_fe = t_fe[1]
    gen y = 2 + x + id_fe + t_fe + 100 * invnormal(uniform())
    xtreg y, i(id) fe
    predict y_resid, e
    xtreg x, i(id) fe
    predict x_resid, e
    xtreg y_resid x_resid, i(t) fe
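The faster version works because, in a balanced panel, sweeping out the id means from y and x and then absorbing t is the same two-way within transformation that the time dummies plus id fixed effects perform. The sketch below checks this on a small simulated panel. The data-generating process, the scalar names, and the use of `predict double` (to avoid float rounding in the residuals) are assumptions for this illustration.

```stata
clear
set seed 12345
set obs 2000
gen id = 1 + floor((_n - 1)/20)
bysort id: gen t = _n
gen x = rnormal() + id/50
gen y = 2 + x + id/50 + t/10 + rnormal()

* One step: time dummies plus id fixed effects
xi i.t
xtreg y x _It*, i(id) fe
scalar b_dummies = _b[x]

* Two steps: sweep out id means from y and x, then absorb t
xtreg y, i(id) fe
predict double y_resid, e
xtreg x, i(id) fe
predict double x_resid, e
xtreg y_resid x_resid, i(t) fe
scalar b_sweep = _b[x_resid]

* The two point estimates should agree up to rounding error
display b_dummies, b_sweep
assert reldif(b_dummies, b_sweep) < 1e-6
```

Only the point estimate is expected to match. The reported standard errors generally differ, because the second step does not count the degrees of freedom used up by the id means, and the equivalence relies on the panel being balanced.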

Other tips and tricks when you have a large number of fixed effects in large data sets
Use matrix algebra
Newton steps in parallel
"zig-zag maximization" (Heckman-McCurdy)

Matrix algebra (~162 seconds)
    clear
    mata
    rseed(14170)
    N = 3000
    rA = rnormal(5, 5, 0, 1)
    rB = rnormal(5, N, 0, 1)
    rC = rnormal(N, 5, 0, 1)
    d = rnormal(1, N, 0, 1)
    V = (rA, rB \ rC, diag(d))
    V_inv = luinv(V)
    V_inv[1..5,1..5]
    end

Matrix algebra (<1 second)
    clear
    mata
    rseed(14170)
    N = 3000
    rA = rnormal(5, 5, 0, 1)
    rB = rnormal(5, N, 0, 1)
    rC = rnormal(N, 5, 0, 1)
    d = rnormal(1, N, 0, 1)
    V = (rA, rB \ rC, diag(d))
    V_fast = luinv(rA - cross(rB', d:^-1, rC))
    V_fast
    end
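The speed-up comes from the partitioned-inverse formula: for V = (A, B \ C, D) with D invertible, the top-left block of the inverse of V equals the inverse of A - B*inv(D)*C. When D is diagonal, inv(D) is just the elementwise reciprocal of d, so only a 5 x 5 matrix has to be inverted. The sketch below compares the two routes on a smaller problem; N = 200 and the names `V_slow` and `V_fast` are choices made for this illustration.

```stata
mata:
rseed(14170)
N  = 200
rA = rnormal(5, 5, 0, 1)
rB = rnormal(5, N, 0, 1)
rC = rnormal(N, 5, 0, 1)
d  = rnormal(1, N, 0, 1)
V  = (rA, rB \ rC, diag(d))

// Brute force: invert the full (N+5) x (N+5) matrix, keep the top-left block
V_inv  = luinv(V)
V_slow = V_inv[1..5, 1..5]

// Partitioned inverse: cross(X, w, Z) computes X'*diag(w)*Z, so this
// inverts the 5 x 5 matrix rA - rB*diag(1:/d)*rC
V_fast = luinv(rA - cross(rB', 1:/d, rC))

// Maximum relative difference between the two answers
mreldif(V_slow, V_fast)
end
```

The reported difference should be small but not exactly zero, since the two routes accumulate rounding error differently.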

Fixed Effects probit
Finkelstein, Luttmer, Notowidigdo (2008) run Fixed Effects probit as a robustness check
What about the incidental parameters problem? (see Hahn and Newey, EMA, 2004)
But what to do with >11,000 fixed effects!
Can't de-mean within panel as you could with a linear probability model
Stata/SE and Stata/MP matrix size limit is 11,000
Need several computation tricks

Fixed Effects probit
    clear
    set seed 12345
    set matsize 2000
    set obs 2000
    gen id = 1+floor((_n - 1)/4)
    gen a = invnormal(uniform())
    gen fe_raw = 0.5*invnorm(uniform()) + 2*a
    bys id: egen fe = mean(fe_raw)
    gen x = invnormal(uniform())
    gen e = invnormal(uniform())
    gen y = (1*x + fe > invnormal(uniform()) + a)
    bys id: egen x_mean = mean(x)
    gen x_demean = x - x_mean
    probit y x
    probit y x_demean
    sort id y
    by id: keep if y[1] != y[_N]
    probit y x
    xi i.id
    probit y x _I*

Fixed Effects probit (slow)
    clear
    set more off
    set mem 1000m
    set seed 12345
    set matsize 3000
    set obs 12000
    gen id = 1+floor((_n - 1)/4)
    gen a = invnormal(uniform())
    gen fe_raw = 0.5*invnorm(uniform()) + 2*a
    bys id: egen fe = mean(fe_raw)
    gen x = invnormal(uniform())
    gen e = invnormal(uniform())
    gen y = (1*x + fe > invnormal(uniform()) + a)
    sort id y
    by id: keep if y[1] != y[_N]
    xi i.id
    probit y x _I*

Fixed Effects probit (slow) ~40 minutes

Fixed Effects probit (faster)
    clear
    set mem 1000m
    set seed 12345
    set matsize 3000
    set obs 12000
    gen id = 1+floor((_n - 1)/4)
    gen a = invnormal(uniform())
    gen fe_raw = 0.5*invnorm(uniform()) + 2*a
    bys id: egen fe = mean(fe_raw)
    gen x = invnormal(uniform())
    gen e = invnormal(uniform())
    gen y = (1*x + fe > invnormal(uniform()) + a)
    sort id y
    by id: keep if y[1] != y[_N]
    egen id_new = group(id)
    summ id_new
    local max = r(max)
    gen fe_hat = 0
    forvalues iter = 1/20 {
        probit y x, nocons offset(fe_hat)
        capture drop xb*
        predict xb, xb nooffset
        forvalues i = 1/`max' {
            qui probit y if id_new == `i', offset(xb)
            qui replace fe_hat = _b[_cons] if id_new == `i'
        }
    }
    probit y x, noconstant offset(fe_hat)

Fixed Effects probit (faster) ~8 minutes
QUESTION: Why are standard errors not the same?

Exercises
Speed up fixed effects probit even more by updating fixed effects in parallel
Fix standard errors in FE probit example