14.170: Programming for Economists


Presentation Transcript

Slide 1

14.170: Programming for Economists 1/12/2009-1/16/2009 Melissa Dell Matt Notowidigdo Paul Schrimpf

Slide 2

Lecture 5, Large Data Sets in Stata + Numerical Precision

Slide 3

Overview: This lecture is part wrap-up, part "tips and traps." The focus is on managing large data sets and on numerical precision. Numerical precision: introduction to binary representation; equilibrating matrices. Large data sets: how Stata represents data in memory; speeding up code; tips and traps for large data sets.

Slide 4

Numerical precision
What the @&*%&$!^ is going on here?

local a = 0.7 + 0.1
local b = 0.8
display (`a' == `b')

local a = 0.75 + 0.05
local b = 0.8
display (`a' == `b')
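The puzzle is not Stata-specific: any language that uses IEEE double precision gives the same answers. A minimal Python re-creation (my code, not the slides'):

```python
# 0.7 and 0.1 have no exact binary representation, so their rounded sum
# misses 0.8; 0.75 is exactly representable and the 0.05 rounding error
# happens to cancel on the way to 0.8.
a = 0.7 + 0.1
b = 0.8
print(a == b)             # False
print(repr(a))            # 0.7999999999999999
print(0.75 + 0.05 == b)   # True
```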

Slide 5

Binary numbers
Computers store numbers in base 2 ("bits")
14₁₀ = 1110₂ (14 = 2 + 4 + 8)
170₁₀ = 10101010₂ (170 = 2 + 8 + 32 + 128)
How are decimals stored?

Slide 6

Binary numbers, con't
0.875₁₀ = 0.111₂ (0.875 = 0.5 + 0.25 + 0.125)
0.80₁₀ = 0.110011001100…₂
0.70₁₀ = 0.101100110011…₂
0.10₁₀ = 0.000110011001…₂
0.75₁₀ = 0.11₂
0.05₁₀ = 0.000011001100…₂
QUESTION: Is there a repeating decimal in base 10 that is not repeating in base 2?
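The expansions above come from repeated doubling: each doubling shifts one binary digit past the point. A short Python sketch using exact rational arithmetic (the function name is mine, not the slides'):

```python
from fractions import Fraction

def binary_fraction(x, digits=12):
    """Expand a fraction in [0, 1) into base-2 digits by repeated doubling."""
    bits = []
    for _ in range(digits):
        x *= 2
        bit = int(x)          # 1 if the doubled value crossed 1, else 0
        bits.append(str(bit))
        x -= bit
    return "0." + "".join(bits)

print(binary_fraction(Fraction(7, 8)))    # 0.111000000000  (terminates)
print(binary_fraction(Fraction(1, 10)))   # 0.000110011001  (0011 repeats forever)
```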

Slide 7

Precision issues in Mata

mata
A = (1e10, 2e10 \ 2e-10, 3e-10)
A
rank(A)
luinv(A)
A_inv = (-3e-10, 2e10 \ 2e-10, -1e10)
I = A * A_inv
I
end

Slide 8

Precision issues in Mata

Slide 9

Precision issues in Mata

mata
r = c = 0
A = (1e10, 2e10 \ 2e-10, 3e-10)
A
rank(A)
luinv(A, 1e-15)
_equilrc(A, r, c)
A
r
c
rank(A)
luinv(A)
c' :* luinv(A) :* r'
end
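The same rescue works outside Mata: equilibrate (rescale rows and columns toward unit size), invert the well-scaled matrix, then undo the scaling. A numpy sketch of the idea — a hand-rolled stand-in for `_equilrc`, not its actual algorithm:

```python
import numpy as np

# A badly scaled matrix like the Mata example: mixing 1e10 and 1e-10 entries
# makes the default rank tolerance treat it as singular, even though its
# determinant is exactly -1.
A = np.array([[1e10, 2e10],
              [2e-10, 3e-10]])
print(np.linalg.matrix_rank(A))   # 1 -- numerically rank-deficient

# Equilibrate: scale rows, then columns, so entries are near 1.
r = 1.0 / np.abs(A).max(axis=1)                       # row scalings
C_scale = 1.0 / np.abs(A * r[:, None]).max(axis=0)    # column scalings
B = A * r[:, None] * C_scale[None, :]                 # B = R A C, well scaled
print(np.linalg.matrix_rank(B))   # 2

# Undo the scaling: A = R^-1 B C^-1, so inv(A) = C inv(B) R.
A_inv = (C_scale[:, None] * np.linalg.inv(B)) * r[None, :]
print(A_inv)   # close to [[-3e-10, 2e10], [2e-10, -1e10]]
```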

Slide 10

Precision issues in Mata

Slide 11

Large data sets in Stata
Computer architecture overview: the CPU executes instructions; RAM (also called "memory") stores frequently accessed data; DISK (the "hard drive") stores less frequently used data. RAM is accessed electronically; DISK is accessed mechanically (that is why you can HEAR it). DISK is therefore several orders of magnitude slower than RAM. In Stata, if you ever need to access the disk, you're basically dead. Stata was not written to deal with data sets that are larger than the available RAM; it expects the data set to fit in memory. So when you type "set memory XXXm", make sure that you are not setting the value larger than the available RAM (some operating systems won't let you, anyway). For >20-30 GB of data, Stata is not recommended; consider Matlab or SAS. (diagram: CPU, RAM, hard drive)

Slide 12

Large data sets in Stata, con't
Don't keep re-creating the same variables over and over. "save" can really help or really hurt: know when to use it and when to avoid it. Don't estimate parameters you don't care about. Lots of "if" and "in" commands can slow things down. Create a "1% sample" to develop and test code (to prevent unexpected crashes after code has been running for a long time).

Slide 13

Two-way fixed effects

clear
set seed 12345
set mem 2000m
set matsize 2000
set more off
set obs 5000
gen myn = _n
gen id = 1 + floor((_n - 1)/100)
sort id myn
by id: gen t = 1 + floor((_n - 1)/5)
gen x = invnormal(uniform())
gen fe = invnormal(uniform())
sort id t myn
by id t: replace fe = fe[1]
gen y = 2 + x + fe + 100 * invnormal(uniform())
reg y x
xi i.id i.t
reg y x _I*
summ t
gen idXt = id * (r(max) + 1) + t
areg y x, absorb(idXt)

Slide 14

Two-way fixed effects

Slide 15

Fixed Effects with large data sets ~674 seconds

clear
set seed 12345
set mem 100m
set more off
set obs 500000
gen myn = _n
gen id = 1 + floor((_n - 1)/200)
sort id myn
by id: gen t = _n
gen x = invnormal(uniform())
gen id_fe = invnormal(uniform())
gen t_fe = invnormal(uniform())
by id: replace id_fe = id_fe[1]
sort t id
by t: replace t_fe = t_fe[1]
gen y = 2 + x + id_fe + t_fe + 100 * invnormal(uniform())
xi i.t
xtreg y x _It*, i(id) fe

Slide 16

Fixed Effects with large data sets

Slide 17

Fixed Effects with large data sets ~53 seconds

clear
set seed 12345
set mem 100m
set more off
set obs 500000
gen myn = _n
gen id = 1 + floor((_n - 1)/200)
sort id myn
by id: gen t = _n
gen x = invnormal(uniform())
gen id_fe = invnormal(uniform())
gen t_fe = invnormal(uniform())
by id: replace id_fe = id_fe[1]
sort t id
by t: replace t_fe = t_fe[1]
gen y = 2 + x + id_fe + t_fe + 100 * invnormal(uniform())
xtreg y, i(id) fe
predict y_resid, e
xtreg x, i(id) fe
predict x_resid, e
xtreg y_resid x_resid, i(t) fe
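The speedup rests on the Frisch-Waugh-Lovell theorem: partial one set of fixed effects out of y and x, then run the smaller regression; in a balanced panel the slope matches the full two-way dummy regression exactly. A scaled-down numpy re-creation (sizes and all names are mine, not the slides'):

```python
import numpy as np

rng = np.random.default_rng(12345)
N, T = 50, 20                                  # balanced panel: N units, T periods
id_fe = np.repeat(rng.normal(size=N), T)       # unit fixed effects
t_fe = np.tile(rng.normal(size=T), N)          # time fixed effects
x = rng.normal(size=N * T)
y = 2 + x + id_fe + t_fe + rng.normal(size=N * T)

ids = np.repeat(np.arange(N), T)
ts = np.tile(np.arange(T), N)

# Slow route: regress y on x plus full unit dummies and T-1 time dummies.
D = np.column_stack([x,
                     (ids[:, None] == np.arange(N)).astype(float),
                     (ts[:, None] == np.arange(1, T)).astype(float)])
beta_full = np.linalg.lstsq(D, y, rcond=None)[0][0]

# Fast route: demean within unit, then within time, then one short regression.
def demean_by(v, groups):
    """Subtract each observation's group mean of v."""
    sums = np.bincount(groups, weights=v)
    counts = np.bincount(groups)
    return v - (sums / counts)[groups]

y_r = demean_by(demean_by(y, ids), ts)
x_r = demean_by(demean_by(x, ids), ts)
beta_fast = (x_r @ y_r) / (x_r @ x_r)

print(beta_full, beta_fast)   # agree, because the panel is balanced
```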

Slide 18

Fixed Effects with large data sets

Slide 19

Other tips and traps when you have a large number of fixed effects in large data sets: use matrix algebra; Newton steps in parallel; "zigzag maximization" (Heckman-MaCurdy)

Slide 20

Matrix algebra ~162 seconds

clear
mata
rseed(14170)
N = 3000
rA = rnormal(5, 5, 0, 1)
rB = rnormal(5, N, 0, 1)
rC = rnormal(N, 5, 0, 1)
d = rnormal(1, N, 0, 1)
V = (rA, rB \ rC, diag(d))
V_inv = luinv(V)
V_inv[1..5, 1..5]

Slide 21

Matrix algebra <1 second

clear
mata
rseed(14170)
N = 3000
rA = rnormal(5, 5, 0, 1)
rB = rnormal(5, N, 0, 1)
rC = rnormal(N, 5, 0, 1)
d = rnormal(1, N, 0, 1)
V = (rA, rB \ rC, diag(d))
V_fast = luinv(rA - cross(rB', d :^ -1, rC))
V_fast
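The fast version uses blockwise (partitioned) inversion: when the large lower-right block is diagonal, the upper-left corner of the inverse is just the inverse of the 5x5 Schur complement A − B D⁻¹ C, so the full (N+5)x(N+5) inversion is unnecessary. A numpy sketch with N scaled down (my code, not the slides'):

```python
import numpy as np

rng = np.random.default_rng(14170)
N = 300                                  # scaled down from the slides' 3000
rA = rng.normal(size=(5, 5))
rB = rng.normal(size=(5, N))
rC = rng.normal(size=(N, 5))
d = rng.normal(size=N)

# Slow: build and invert the full matrix, then read off the corner.
V = np.block([[rA, rB], [rC, np.diag(d)]])
corner_slow = np.linalg.inv(V)[:5, :5]

# Fast: invert only the 5x5 Schur complement A - B D^-1 C.
# rC / d[:, None] applies the diagonal inverse row by row.
corner_fast = np.linalg.inv(rA - rB @ (rC / d[:, None]))

print(corner_fast)   # matches corner_slow up to rounding
```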

Slide 22

Fixed Effects probit
Finkelstein, Luttmer, Notowidigdo (2008) run a fixed effects probit as a robustness check. What about the incidental parameters problem? (see Hahn and Newey, EMA, 2004) But what to do with >11,000 fixed effects! Can't de-mean within panel as you could with a linear probability model. The Stata/SE and Stata/MP matrix size limit is 11,000. Need some computational tricks.

Slide 23

Fixed Effects probit

clear
set seed 12345
set matsize 2000
set obs 2000
gen id = 1 + floor((_n - 1)/4)
gen a = invnormal(uniform())
gen fe_raw = 0.5*invnorm(uniform()) + 2*a
bys id: egen fe = mean(fe_raw)
gen x = invnormal(uniform())
gen e = invnormal(uniform())
gen y = (1*x + fe > invnormal(uniform()) + a)
bys id: egen x_mean = mean(x)
gen x_demean = x - x_mean
probit y x
probit y x_demean
sort id y
by id: keep if y[1] != y[_N]
probit y x
xi i.id
probit y x _I*

Slide 24

Fixed Effects probit

Slide 25

Fixed Effects probit (slow)

clear
set more off
set mem 1000m
set seed 12345
set matsize 3000
set obs 12000
gen id = 1 + floor((_n - 1)/4)
gen a = invnormal(uniform())
gen fe_raw = 0.5*invnorm(uniform()) + 2*a
bys id: egen fe = mean(fe_raw)
gen x = invnormal(uniform())
gen e = invnormal(uniform())
gen y = (1*x + fe > invnormal(uniform()) + a)
sort id y
by id: keep if y[1] != y[_N]
xi i.id
probit y x _I*

Slide 26

Fixed Effects probit (slow) ~40 minutes

Slide 27

Fixed Effects probit (faster)

clear
set mem 1000m
set seed 12345
set matsize 3000
set obs 12000
gen id = 1 + floor((_n - 1)/4)
gen a = invnormal(uniform())
gen fe_raw = 0.5*invnorm(uniform()) + 2*a
bys id: egen fe = mean(fe_raw)
gen x = invnormal(uniform())
gen e = invnormal(uniform())
gen y = (1*x + fe > invnormal(uniform()) + a)
sort id y
by id: keep if y[1] != y[_N]
egen id_new = group(id)
summ id_new
local max = r(max)
gen fe_hat = 0
forvalues iter = 1/20 {
    probit y x, nocons offset(fe_hat)
    capture drop xb
    predict xb, xb nooffset
    forvalues i = 1/`max' {
        qui probit y if id_new == `i', offset(xb)
        qui replace fe_hat = _b[_cons] if id_new == `i'
    }
}
probit y x, noconstant offset(fe_hat)
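This loop is a zigzag: hold the fixed effects as an offset while updating the slope, then hold the slope's fitted index as an offset while updating each group's intercept. The Python sketch below keeps that structure but takes one Fisher-scoring step per visit instead of running a full probit each time; sizes are toy and every name is mine, not the slides':

```python
import numpy as np
from math import erf, sqrt

norm_cdf = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))

def score_step(v, y, z):
    """One Fisher-scoring update for a single coefficient on regressor v,
    given the current linear index z. Returns the coefficient increment."""
    p = np.clip(norm_cdf(z), 1e-10, 1 - 1e-10)
    phi = np.exp(-0.5 * z * z) / sqrt(2.0 * np.pi)
    g = phi * (y - p) / (p * (1 - p))     # per-obs score w.r.t. the index
    w = phi * phi / (p * (1 - p))         # expected information weight
    return (v @ g) / (v @ (w * v))

rng = np.random.default_rng(12345)
G, T = 200, 8
ids = np.repeat(np.arange(G), T)
fe = np.repeat(rng.normal(size=G), T)
x = rng.normal(size=G * T)
y = (x + fe + rng.normal(size=G * T) > 0).astype(float)

# Drop groups with no variation in y (their FE estimate diverges),
# the analogue of "by id: keep if y[1] != y[_N]".
keep = np.repeat(np.bincount(ids, weights=y) % T != 0, T)
x, y, ids = x[keep], y[keep], ids[keep]
ids = np.unique(ids, return_inverse=True)[1]   # re-index the kept groups
G = ids.max() + 1

beta, a = 0.0, np.zeros(G)
for _ in range(30):
    beta += score_step(x, y, beta * x + a[ids])   # zig: update the slope
    z = beta * x + a[ids]
    for g in range(G):                            # zag: update each intercept
        m = ids == g
        a[g] += score_step(np.ones(m.sum()), y[m], z[m])

print(beta)   # near 1, up to incidental-parameters bias
```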

Slide 28

Fixed Effects probit (faster) ~8 minutes
QUESTION: Why are the standard errors not the same?

Slide 29

Exercises
Speed up the fixed effects probit even more by updating the fixed effects in parallel
Fix the standard errors in the FE probit example
