install.packages("AER")
install.packages("systemfit")
Predpoklad exogenity
\(E(y|X)=\beta_0+\beta_1E(x_1|X)+\beta_2E(x_2|X)+E(\epsilon|X)\)
Pokud bude \(E(X^T\epsilon)\neq0\), tak odhad bude zkreslený a nekonzistetní
specifikace endogenity (simultánní rovnice)
instrumentální promenná
panelová data - pokud je máme
Je dulezité rozlisovat
\(\textbf{Proxy variable}\) vs. \(\textbf{Instrumental variable}\)
Nejprve si predstavme, ze máme následující populacní regresní funkci:
\(\log(wage)=\beta_0+\beta_1educ+\beta_2ability+\epsilon\)
my “nevidíme” \(ability\), tedy promenná prejde do náhodné slozky \(\epsilon\)
Jenze \(Cov(ability,educ)\neq 0\)
Jedna z mozností je nahradit \(ability\) vhodnou promennou napr. \(IQ\), pak:
\(\log(wage)=\beta_0+\beta_1educ+\beta_2IQ+\epsilon\)
\(\textbf{IQ JE PROXY PROMENNÁ K ABILITY}\)
Co kdyz nemáme proxy promennou? Co nahradit \(educ\) vhodnou prommennou \(Z\), se specifickými vlastnostmi:
Pokud jsou splneny, tak promennou \(Z\) nazýváme instrumentální promennou.
Na IV se muzeme dívat jako na promennou, kterou transformujeme celou rovnici a “zbavíme se” endogenity.
\(Z'y=Z'X\beta+Z'\epsilon\)
\(Z'\epsilon=Z'y-Z'X\beta\)
\(plim_{n\to\infty}\dfrac{Z'\epsilon}{N}=plim_{n\to\infty}\dfrac{Z'y}{N}-plim_{n\to\infty}\dfrac{Z'X}{N}\beta\)
\(plim_{n\to\infty}\dfrac{Z'\epsilon}{N}=0\)
\(plim_{n\to\infty}\dfrac{Z'y}{N}=plim_{n\to\infty}\dfrac{Z'X}{N}\beta\)
\(b=(Z'X)^{-1}Z'y\)
Budeme pokracovat v predeslém príkladu. Vzdelání je endogenní promenná a odhad pomocí OLS povede k nekonzistentnímu odhadu.
\(\log(wage)=\beta_0+\beta_1educ+\epsilon\)
library(foreign) # balcek pro nahravani ruznych datovych formatu
## Warning: package 'foreign' was built under R version 3.2.5
getwd()
## [1] "C:/Users/Lukas/Desktop/Ekonometrie/AKM/NEW!!!/Cviceni/IV/IV"
data.IV=read.dta("C:/Users/Lukas/Documents/mroz.dta") # funkce pro nacteni souboru STATA
Nejprve provedeme OLS.
regrese_ols=lm(lwage~educ,data=data.IV)
summary(regrese_ols)
##
## Call:
## lm(formula = lwage ~ educ, data = data.IV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.10256 -0.31473 0.06434 0.40081 2.10029
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1852 0.1852 -1.000 0.318
## educ 0.1086 0.0144 7.545 2.76e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.68 on 426 degrees of freedom
## (325 observations deleted due to missingness)
## Multiple R-squared: 0.1179, Adjusted R-squared: 0.1158
## F-statistic: 56.93 on 1 and 426 DF, p-value: 2.761e-13
Ve druhé casti si sami napíseme odhad pomocí IV. Je to opet OLS, ale na upravená data.
\(b_{IV}=(Z'X)^{-1}Z'y\)
Rucní výpocet
data.2=na.omit(data.IV) #odstraníme NA
Z=cbind(1,data.2$fatheduc)
X=cbind(1,data.2$educ)
y=data.2$lwage
b_iv=solve(t(Z)%*%X)%*%t(Z)%*%y
b_iv
## [,1]
## [1,] 0.44110341
## [2,] 0.05917348
Nekdy máme více mozných IV. Otázkou je, kterou vybrat. K tomu nám pomuze 2SLS, dvoustupnová metoda nejmensích ctvercu.
Nejprve vsak pouzijeme 2SLS na puvodní príklad, prestoze to není nutné, jelikoz máme pouze 1 IV.
V prvním stupni odhadneme následující regresní rovnici. Tedy udeláme regresi IV na “problemovou” promennou.
\(educ=\alpha_0+\alpha_1fatheduc+u\)
regrese_iv1=lm(educ~fatheduc,data=data.IV) # first-stage regression
summary(regrese_iv1)
##
## Call:
## lm(formula = educ ~ fatheduc, data = data.IV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1881 -1.1881 0.2240 0.8119 6.3537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.79901 0.19854 49.36 <2e-16 ***
## fatheduc 0.28243 0.02089 13.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.046 on 751 degrees of freedom
## Multiple R-squared: 0.1958, Adjusted R-squared: 0.1947
## F-statistic: 182.8 on 1 and 751 DF, p-value: < 2.2e-16
fit_educ=regrese_iv1$fitted.values
Ve druhém stupni vyuzijeme vyrovnané hodnoty z prvního stupne do následující regrese:
\(\log(wage)=\beta_0+\beta_1\hat{educ}+\epsilon\)
regrese_iv2=lm(lwage~fit_educ,data=data.IV)
summary(regrese_iv2)
##
## Call:
## lm(formula = lwage ~ fit_educ, data = data.IV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2126 -0.3763 0.0563 0.4173 2.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.49368 0.43451 1.136 0.257
## fit_educ 0.05645 0.03510 1.608 0.109
##
## Residual standard error: 0.7219 on 426 degrees of freedom
## (325 observations deleted due to missingness)
## Multiple R-squared: 0.006034, Adjusted R-squared: 0.003701
## F-statistic: 2.586 on 1 and 426 DF, p-value: 0.1086
Jak jiz bylo receno, metodu 2SLS pouzíváme v prípade, kdy máme více IV. Budeme mít 2 IV a to \(motheduc\) a \(fatheduc\). V tomto prípade jiz musíme pouzít 2SLS.
Opet první stupen obsahuje následující regresi, ze které získáme vyrovnané hodnoty.
\(educ=\alpha_0+\alpha_1fatheduc+\alpha_2motheduc+u\)
regrese_iv3=lm(educ~fatheduc+motheduc,data=data.IV)
summary(regrese_iv3)
##
## Call:
## lm(formula = educ ~ fatheduc + motheduc, data = data.IV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4596 -1.3760 -0.0924 0.7241 6.9243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97566 0.22567 39.774 < 2e-16 ***
## fatheduc 0.18342 0.02471 7.422 3.14e-13 ***
## motheduc 0.18328 0.02622 6.991 6.04e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.984 on 750 degrees of freedom
## Multiple R-squared: 0.245, Adjusted R-squared: 0.243
## F-statistic: 121.7 on 2 and 750 DF, p-value: < 2.2e-16
fit_educ_2sls=regrese_iv3$fitted.values
Vyrovnané hodnoty pak pouzijeme ve druhém stupni:
\(\log(wage)=\beta_0+\beta_1\hat{educ}+\epsilon\)
regrese_iv4=lm(lwage~fit_educ_2sls,data=data.IV)
summary(regrese_iv2)
##
## Call:
## lm(formula = lwage ~ fit_educ, data = data.IV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2126 -0.3763 0.0563 0.4173 2.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.49368 0.43451 1.136 0.257
## fit_educ 0.05645 0.03510 1.608 0.109
##
## Residual standard error: 0.7219 on 426 degrees of freedom
## (325 observations deleted due to missingness)
## Multiple R-squared: 0.006034, Adjusted R-squared: 0.003701
## F-statistic: 2.586 on 1 and 426 DF, p-value: 0.1086
\(\textbf{POZOR}\) Standartní chyby jsou v prípade predchozího postupu zavádející! Musíme vyuzít jiný odhad rozptylu náhodné slozky.
Vyuzijeme package.
library(AER)
library(ivpack)
beta.IV=ivreg(lwage~educ| fatheduc+motheduc,data = data.IV) # za | dáváme vsechny exogenní promenné!
summary(beta.IV)
##
## Call:
## ivreg(formula = lwage ~ educ | fatheduc + motheduc, data = data.IV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.11009 -0.33930 0.04706 0.39745 2.06197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.55102 0.40858 1.349 0.178
## educ 0.05049 0.03217 1.570 0.117
##
## Residual standard error: 0.6929 on 426 degrees of freedom
## Multiple R-Squared: 0.08411, Adjusted R-squared: 0.08196
## Wald test: 2.464 on 1 and 426 DF, p-value: 0.1172
A otestujeme na prítomnost heteroskedasticity.
library(sandwich)
bptest(beta.IV)
##
## studentized Breusch-Pagan test
##
## data: beta.IV
## BP = 0.30859, df = 1, p-value = 0.5785
Nyní budeme odhadovat následující regresní funkci:
\(\log(wage)=\beta_0+\beta_1educ+\beta_2exper+\beta_3exper^2+\epsilon\)
První stupen odhadu obsahuje krome IV i vsechny exogenní promenné z regrese!!
\(educ=\alpha_0+\alpha_1fatheduc+\alpha_1motheduc+\gamma_1exper+\gamma_2exper^2+u\)
beta.IV2=ivreg(lwage~educ+exper+expersq|exper+expersq+fatheduc+motheduc,data = data.IV) # za | dáváme vsechny exogenní promenné!
summary(beta.IV2)
##
## Call:
## ivreg(formula = lwage ~ educ + exper + expersq | exper + expersq +
## fatheduc + motheduc, data = data.IV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0986 -0.3196 0.0551 0.3689 2.3493
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0481003 0.4003281 0.120 0.90442
## educ 0.0613966 0.0314367 1.953 0.05147 .
## exper 0.0441704 0.0134325 3.288 0.00109 **
## expersq -0.0008990 0.0004017 -2.238 0.02574 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6747 on 424 degrees of freedom
## Multiple R-Squared: 0.1357, Adjusted R-squared: 0.1296
## Wald test: 8.141 on 3 and 424 DF, p-value: 2.787e-05
bptest(beta.IV2)
##
## studentized Breusch-Pagan test
##
## data: beta.IV2
## BP = 11.709, df = 3, p-value = 0.008449
Nyní je prítomná heteroskedasticita a proto pouzijeme robustní odhad chyb.
robust.se(beta.IV2) #Robustní odhad pro IV. Je jiný nez v prípade robustních odhadu bez IV
## [1] "Robust Standard Errors"
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04810031 0.42778460 0.1124 0.910527
## educ 0.06139663 0.03318243 1.8503 0.064969 .
## exper 0.04417039 0.01547356 2.8546 0.004521 **
## expersq -0.00089897 0.00042807 -2.1001 0.036314 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pomocí IV muzeme testovat problém endogenity. Princip spocívá v tom, ze deláme úsudek o významnosti vzdálenosti mezi odhadem pomocí OLS a 2SLS. Pokud v regresi nebude problém s endogenitou, pak by nemel být rozdíl mezi odhady statisticky významný, jelikoz obe metody poskytují konzistentní odhad. Akorát metoda OLS bude poskytovat vydatnejsí odhad. Na druhou stranu, pokud nastane problém s endogenitou, rozdíl odhadu by mel být statisticky významný.
K tomuto úcelu pouzijeme Hausmanuv test. S tím se setkáte i v panelových datech.
Metoda | \(H_0:Exogenní\) | \(H_1:Endogenní\) |
---|---|---|
OLS | Konzistetní a vydatný | Nekonzistetní |
2SLS | Konzistetní, není vydatný | Konzistetní |
library(systemfit)
ols_model=systemfit(lwage~educ,data=data.IV,method="OLS")
iv_model=systemfit(lwage~educ,data=data.IV,method="2SLS",inst=~fatheduc+motheduc)
hausman.systemfit(iv_model,ols_model)
##
## Hausman specification test for consistency of the 3SLS estimation
##
## data: data.IV
## Hausman = 4.088, df = 2, p-value = 0.1295
Nemuzeme zamítnou nulovou hypotézu. Nepodarilo se nám tak prokázat endogenita.
Hausman test
\(h=(b_{2SLS}-b_{OLS})'[\hat{avar}(b_{2SLS})-\hat{avar}(b_{OLS})](b_{2SLS}-b_{OLS}) \sim \chi^2(k)\)
d= coef(beta.IV) - coef(regrese_ols)
var_diff = vcov(beta.IV) - vcov(regrese_ols)
hausman = as.vector(t(d) %*% solve(var_diff) %*% d)
pchisq(hausman, df = 2, lower.tail = FALSE)
## [1] 0.1295123
Pro robustní verzi bychom museli pouzít bootstrap, coz je mimo rámec kurzu.
\(log(med_expen)=\beta_0+\beta_1 insurance+\beta_2ill+\beta_3age+\beta_4 log(income)+\epsilon\)
Medical expenditure a health insurance jsou endogenní promenné.
IV jsou SS income ratio a firm multiple locations
data.health=read.csv("C:/Users/Lukas/Desktop/Ekonometrie/AKM/NEW!!!/Cviceni/IV/iv_health.csv")
attach(data.health)
health_ols=lm(log(medexpense)~healthinsu+illnesses+age+log(income))
bptest(health_ols)
##
## studentized Breusch-Pagan test
##
## data: health_ols
## BP = 203.69, df = 4, p-value < 2.2e-16
coeftest(health_ols, vcov = vcovHC(health_ols, type = "HC1"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.7801266 0.1540331 37.5252 < 2.2e-16 ***
## healthinsu 0.0749595 0.0259022 2.8939 0.003813 **
## illnesses 0.4406530 0.0093638 47.0594 < 2.2e-16 ***
## age -0.0025946 0.0019275 -1.3461 0.178304
## log(income) 0.0172363 0.0135807 1.2692 0.204405
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
health.IV=ivreg(log(medexpense)~healthinsu+illnesses+age+log(income)|ssiratio+illnesses+age+log(income)) # za | dáváme vsechny exogenní promenné!
summary(health.IV)
##
## Call:
## ivreg(formula = log(medexpense) ~ healthinsu + illnesses + age +
## log(income) | ssiratio + illnesses + age + log(income))
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7141 -0.7468 0.1288 0.8907 4.0895
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.589839 0.234676 28.081 < 2e-16 ***
## healthinsu -0.852201 0.198386 -4.296 1.76e-05 ***
## illnesses 0.448512 0.010293 43.575 < 2e-16 ***
## age -0.011797 0.002789 -4.230 2.36e-05 ***
## log(income) 0.097693 0.022464 4.349 1.38e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.313 on 10084 degrees of freedom
## Multiple R-Squared: 0.07094, Adjusted R-squared: 0.07058
## Wald test: 477.3 on 4 and 10084 DF, p-value: < 2.2e-16
bptest(health.IV)
##
## studentized Breusch-Pagan test
##
## data: health.IV
## BP = 203.69, df = 4, p-value < 2.2e-16
robust.se(health.IV)
## [1] "Robust Standard Errors"
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5898388 0.2453980 26.8537 < 2.2e-16 ***
## healthinsu -0.8522010 0.2113027 -4.0331 5.546e-05 ***
## illnesses 0.4485123 0.0100689 44.5442 < 2.2e-16 ***
## age -0.0117975 0.0029007 -4.0671 4.797e-05 ***
## log(income) 0.0976929 0.0233306 4.1873 2.847e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pridáme dalsí IV
health.IV2=ivreg(log(medexpense)~healthinsu+illnesses+age+log(income)|ssiratio+firmlocation+illnesses+age+log(income)) # za | dáváme vsechny exogenní promenné!
bptest(health.IV2)
##
## studentized Breusch-Pagan test
##
## data: health.IV2
## BP = 203.69, df = 4, p-value < 2.2e-16
robust.se(health.IV2)
## [1] "Robust Standard Errors"
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.6923868 0.2388114 28.0237 < 2.2e-16 ***
## healthinsu -0.9696236 0.1987108 -4.8796 1.079e-06 ***
## illnesses 0.4495077 0.0102219 43.9750 < 2.2e-16 ***
## age -0.0129630 0.0028378 -4.5680 4.983e-06 ***
## log(income) 0.1078825 0.0227967 4.7324 2.249e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hausman test
d= coef(health.IV2) - coef(health_ols)
var_diff = vcov(health.IV2) - vcov(health_ols)
hausman = as.vector(t(d) %*% solve(var_diff) %*% d)
pchisq(hausman, df = 2, lower.tail = FALSE)
## [1] 1.106622e-07
Nasím ukolem je zjistit, zdali má kourení vliv na váhu novorozence.
\(log(bwght)=\beta_0+\beta_1 packs+\epsilon\)
bwght je váha dítete v librách
packs je pocet vykourených balícku za den
We might worry that packs is correlated with other health factors or the availability of good prenatal care, so that packs and u might be correlated. A possible instrumental variable for packs is the average price of cigarettes in the state of residence, cigprice.
Podmínka exogenity \(Cov(price, \epsilon)=0\)
Podmínka relevance \(Cov(price, packs)\neq0\)
getwd()
## [1] "C:/Users/Lukas/Desktop/Ekonometrie/AKM/NEW!!!/Cviceni/IV/IV"
setwd("C:/Users/Lukas/Desktop")
load("bwght.RData")
Odhadneme puvodní model
ols=lm(lbwght~packs,data = data)
summary(ols)
##
## Call:
## lm(formula = lbwght ~ packs, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.63391 -0.08727 0.01926 0.12095 0.83272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.769404 0.005369 888.26 < 2e-16 ***
## packs -0.089813 0.016979 -5.29 1.42e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 1386 degrees of freedom
## Multiple R-squared: 0.01979, Adjusted R-squared: 0.01908
## F-statistic: 27.98 on 1 and 1386 DF, p-value: 1.422e-07
Otestujeme podmínku relevance
relevance=lm(packs~cigprice,data = data)
summary(relevance)
##
## Call:
## lm(formula = packs ~ cigprice, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.1106 -0.1061 -0.1032 -0.1015 2.4016
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0674257 0.1025384 0.658 0.511
## cigprice 0.0002829 0.0007830 0.361 0.718
##
## Residual standard error: 0.2987 on 1386 degrees of freedom
## Multiple R-squared: 9.417e-05, Adjusted R-squared: -0.0006273
## F-statistic: 0.1305 on 1 and 1386 DF, p-value: 0.7179
Vsimnete si zejména hodnot koeficientu determinace. Jedná se o Weak instrument. Co se vsak stane, pokud jej stejne pouzijeme pri odhadu 2SLS.
library(AER)
sls=ivreg(lbwght~packs|cigprice,data = data)
summary(sls)
##
## Call:
## ivreg(formula = lbwght ~ packs | cigprice, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4200 0.1368 0.3055 0.4194 1.1540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4481 0.9082 4.898 1.08e-06 ***
## packs 2.9887 8.6989 0.344 0.731
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9389 on 1386 degrees of freedom
## Multiple R-Squared: -23.23, Adjusted R-squared: -23.25
## Wald test: 0.118 on 1 and 1386 DF, p-value: 0.7312
Pokud bychom predpokládali, ze \(Cov(cigprice,\epsilon)=0\), tak díky weak instrumet dostaneme odhad parametru s “obrovskou” smerodatnou chybou a také znacne vychýlený odhad. Koeficient determinace je záprný, ale to nás nemusí trápit, jelikoz jsme si jiz rekli, ze pri IV jej nelze pouzívat a je nutno pouzít speciální odhad pro R2.