/*
HSR Method I - Week 1 intro to Stata and regression review
All the datasets from the DNM textbook can be found at
https://www.stata-press.com/data/heus.html
*/

cd "H:\Teaching\Methods 2020\lectures\Week 1 Overview and Stata\code"
log using "Week 1 Stata", text

*** Load data from DNM
use http://www.stata-press.com/data/heus/heus_mepssample, clear

// ---- Explore the data
desc

* Variables we will use today
lookfor ed
desc exp_tot age female race_*
codebook age
codebook exp_tot age female race_*

* Explore variables
sum exp_tot age female race_*
tabstat exp_tot age female race_*, stats(mean sd median min max)
tabstat exp_tot age female race_*, stats(mean sd median min max) columns(statistics)
hist exp_tot

// --- Expenditure by sex
tabstat exp_tot, by(female) stats(mean median sd min max) columns(statistics)

* Stata has macros
global tsopts stats(mean median sd min max) columns(statistics)
tabstat exp_tot, by(female) $tsopts

// ---- Create new variables
gen older65 = 0
replace older65 = 1 if age > 65 & age ~=.
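
* A quick check, sketched here as an addition rather than part of the
* original lecture code: Stata treats a numeric missing value (.) as larger
* than any number, so "age > 65" on its own is true when age is missing.
* The counts below show whether that matters in this sample.
count if missing(age)
count if age > 65 & !missing(age)
tab older65, missing
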
* Don't (missing ages are coded as 1)
generate older65_1 = (age > 65)
* Yes
generate older65_2 = (age > 65) if !missing(age)
sum older65*

* Quartiles
xtile ageq = age, n(4)
tabstat age, by(ageq) stats(mean sd median min max)

// ---- Expenditure by sex and race
tabstat exp_tot if race_bl==1, by(female) $tsopts
tabstat exp_tot if race_bl==0, by(female) $tsopts

* Same with a regression - no causal interpretation, a descriptive model
reg exp_tot i.female i.race_bl i.female#i.race_bl

* Stata stores estimation results in e() so we can use them later
ereturn list
* try return list for other non-estimation commands like summarize or tabulate
matrix list e(b)

* average for black female
di _b[_cons] + _b[1.female] + _b[1.race_bl] + _b[1.female#1.race_bl]
* average for black male
di _b[_cons] + _b[1.race_bl]
* average for white male
di _b[_cons]
* average for white female
di _b[_cons] + _b[1.female]

// ---- Properties and regression towards the mean
qui reg exp_tot i.female i.race_bl i.female#i.race_bl
predict yhat if e(sample)
* same as predict yhat if e(sample) == 1
predict resis, res
sum exp_tot resis yhat
corr resis female race_bl

// ---- Saturated model
* This is also a saturated model
* How many levels of age are there?
levelsof age
codebook age
* 68 unique values
reg exp_tot i.female##i.race_bl##i.age
ereturn list
* note that we estimated 621 parameters, one for each possible
* combination of values

// ---- Let's add age
reg exp_tot age i.female

* Better to interpret as changes in decades. Easy if linear
di _b[age] *10
gen aged = age/10
reg exp_tot aged i.female

* Is the effect of age on total expenditure non-linear?
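* A rough first look, added as a sketch and not part of the original
* lecture code: compare mean age and mean expenditure across the age
* quartiles (ageq) created above. If expenditure rises much faster between
* the older quartiles than between the younger ones, relative to the gaps
* in mean age, a curved relationship is plausible.
tabstat age exp_tot, by(ageq) stats(mean)
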
scatter exp_tot age, jitter(3) msize(small)
lowess exp_tot age
lowess exp_tot age if female == 1
reg exp_tot c.age##c.age i.female

* Let's compare models
qui reg exp_tot age i.female
est sto m1
qui reg exp_tot c.age##c.age i.female
est sto m2

* Same as
quietly {
    reg exp_tot age i.female
    est sto m1
    reg exp_tot c.age##c.age i.female
    est sto m2
}
est table m1 m2, star stats(N r2 r2_a bic)

* What is the effect of age?
qui reg exp_tot c.age##c.age i.female
matrix list e(b)
* Marginal effect at age 30: b_age + 2*b_age2*age
di _b[age] + 2*_b[c.age#c.age]*30
margins, dydx(age) at(age=(20 30 40 50 60 70 80 90)) vsquish

*** What about those standard errors?
gen logexp = log(exp_tot + 1)
reg logexp c.age##c.age i.female
est sto log
est table m1 m2 log, star stats(N r2 r2_a bic)

*** The better model: GLM
glm exp_tot c.age##c.age i.female, link(log) family(gamma) nolog
glm exp_tot c.age i.female, link(log) family(gamma) nolog
est sto glm
margins, dydx(*)

// ---- A model estimating factors affecting zero expenditures
gen zeroexp = 0
replace zeroexp = 1 if exp_tot == 0
sum zeroexp

logit zeroexp age female race_bl, nolog
logit zeroexp age female race_bl, nolog or
logit zeroexp age i.female i.race_bl, nolog or
margins, dydx(*)
logit zeroexp aged i.female i.race_bl, nolog or
margins, dydx(*)

// --- Linear probability models
* Of course, the wrong but helpful linear probability model does the job
reg zeroexp aged i.female i.race_bl, robust

// ---- Graphs
scatter exp_tot age, msize(vsmall)
graph export exp_age.png, replace

* Smoother. See Cameron and Trivedi, section 2.6.6.
lowess exp_tot age

* Better to save the results to make your own graph instead
lowess exp_tot age, gen(y_smooth) nograph

* Combine scatter and line graph
scatter exp_tot age, msize(vsmall) || line y_smooth age, sort color(red) ///
    legend(off)

* Hard to see the trend because outliers stretch the scale, so restrict the data
scatter exp_tot age if exp_tot <=150000, msize(vsmall) ///
    || line y_smooth age if exp_tot <=150000, sort color(red) legend(off)
graph export low.png, replace

log close