install.packages("AER")
install.packages("systemfit")

Predpoklad exogenity

\(E(y|X)=\beta_0+\beta_1E(x_1|X)+\beta_2E(x_2|X)+E(\epsilon|X)\)

Pokud bude \(E(X^T\epsilon)\neq0\), tak odhad bude zkreslený a nekonzistetní

Mozné prícinny

Jak resit tento problém

  1. specifikace endogenity (simultánní rovnice)

  2. instrumentální promenná

  3. panelová data - pokud je máme

Je dulezité rozlisovat

\(\textbf{Proxy variable}\) vs. \(\textbf{Instrumental variable}\)

Nejprve si predstavme, ze máme následující populacní regresní funkci:

\(\log(wage)=\beta_0+\beta_1educ+\beta_2ability+\epsilon\)

my “nevidíme” \(ability\), tedy promenná prejde do náhodné slozky \(\epsilon\)

Jenze \(Cov(ability,educ)\neq 0\)

Jedna z mozností je nahradit \(ability\) vhodnou promennou napr. \(IQ\), pak:

\(\log(wage)=\beta_0+\beta_1educ+\beta_2IQ+\epsilon\)

\(\textbf{IQ JE PROXY PROMENNÁ K ABILITY}\)

Metoda IV

Co kdyz nemáme proxy promennou? Co nahradit \(educ\) vhodnou prommennou \(Z\), se specifickými vlastnostmi:

Pokud jsou splneny, tak promennou \(Z\) nazýváme instrumentální promennou.

Na IV se muzeme dívat jako na promennou, kterou transformujeme celou rovnici a “zbavíme se” endogenity.

\(Z'y=Z'X\beta+Z'\epsilon\)

\(Z'\epsilon=Z'y-Z'X\beta\)

\(plim_{n\to\infty}\dfrac{Z'\epsilon}{N}=plim_{n\to\infty}\dfrac{Z'y}{N}-plim_{n\to\infty}\dfrac{Z'X}{N}\beta\)

\(plim_{n\to\infty}\dfrac{Z'\epsilon}{N}=0\)

\(plim_{n\to\infty}\dfrac{Z'y}{N}=plim_{n\to\infty}\dfrac{Z'X}{N}\beta\)

\(b=(Z'X)^{-1}Z'y\)

Príklad 1

Budeme pokracovat v predeslém príkladu. Vzdelání je endogenní promenná a odhad pomocí OLS povede k nekonzistentnímu odhadu.

\(\log(wage)=\beta_0+\beta_1educ+\epsilon\)

library(foreign)                 # balcek pro nahravani ruznych datovych formatu
## Warning: package 'foreign' was built under R version 3.2.5
getwd()
## [1] "C:/Users/Lukas/Desktop/Ekonometrie/AKM/NEW!!!/Cviceni/IV/IV"
data.IV=read.dta("C:/Users/Lukas/Documents/mroz.dta")   # funkce pro nacteni souboru STATA

Nejprve provedeme OLS.

regrese_ols=lm(lwage~educ,data=data.IV)
summary(regrese_ols)
## 
## Call:
## lm(formula = lwage ~ educ, data = data.IV)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.10256 -0.31473  0.06434  0.40081  2.10029 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.1852     0.1852  -1.000    0.318    
## educ          0.1086     0.0144   7.545 2.76e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.68 on 426 degrees of freedom
##   (325 observations deleted due to missingness)
## Multiple R-squared:  0.1179, Adjusted R-squared:  0.1158 
## F-statistic: 56.93 on 1 and 426 DF,  p-value: 2.761e-13

Ve druhé casti si sami napíseme odhad pomocí IV. Je to opet OLS, ale na upravená data.

\(b_{IV}=(Z'X)^{-1}Z'y\)

Rucní výpocet

data.2=na.omit(data.IV)   #odstraníme NA 
Z=cbind(1,data.2$fatheduc)
X=cbind(1,data.2$educ)
y=data.2$lwage

b_iv=solve(t(Z)%*%X)%*%t(Z)%*%y
b_iv
##            [,1]
## [1,] 0.44110341
## [2,] 0.05917348

2SLS

Nekdy máme více mozných IV. Otázkou je, kterou vybrat. K tomu nám pomuze 2SLS, dvoustupnová metoda nejmensích ctvercu.

Nejprve vsak pouzijeme 2SLS na puvodní príklad, prestoze to není nutné, jelikoz máme pouze 1 IV.

V prvním stupni odhadneme následující regresní rovnici. Tedy udeláme regresi IV na “problemovou” promennou.

\(educ=\alpha_0+\alpha_1fatheduc+u\)

regrese_iv1=lm(educ~fatheduc,data=data.IV)   # first-stage regression
summary(regrese_iv1)
## 
## Call:
## lm(formula = educ ~ fatheduc, data = data.IV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1881 -1.1881  0.2240  0.8119  6.3537 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.79901    0.19854   49.36   <2e-16 ***
## fatheduc     0.28243    0.02089   13.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.046 on 751 degrees of freedom
## Multiple R-squared:  0.1958, Adjusted R-squared:  0.1947 
## F-statistic: 182.8 on 1 and 751 DF,  p-value: < 2.2e-16
fit_educ=regrese_iv1$fitted.values

Ve druhém stupni vyuzijeme vyrovnané hodnoty z prvního stupne do následující regrese:

\(\log(wage)=\beta_0+\beta_1\hat{educ}+\epsilon\)

regrese_iv2=lm(lwage~fit_educ,data=data.IV)
summary(regrese_iv2)
## 
## Call:
## lm(formula = lwage ~ fit_educ, data = data.IV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2126 -0.3763  0.0563  0.4173  2.0604 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.49368    0.43451   1.136    0.257
## fit_educ     0.05645    0.03510   1.608    0.109
## 
## Residual standard error: 0.7219 on 426 degrees of freedom
##   (325 observations deleted due to missingness)
## Multiple R-squared:  0.006034,   Adjusted R-squared:  0.003701 
## F-statistic: 2.586 on 1 and 426 DF,  p-value: 0.1086

Jak jiz bylo receno, metodu 2SLS pouzíváme v prípade, kdy máme více IV. Budeme mít 2 IV a to \(motheduc\) a \(fatheduc\). V tomto prípade jiz musíme pouzít 2SLS.

Opet první stupen obsahuje následující regresi, ze které získáme vyrovnané hodnoty.

\(educ=\alpha_0+\alpha_1fatheduc+\alpha_2motheduc+u\)

regrese_iv3=lm(educ~fatheduc+motheduc,data=data.IV)
summary(regrese_iv3)
## 
## Call:
## lm(formula = educ ~ fatheduc + motheduc, data = data.IV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4596 -1.3760 -0.0924  0.7241  6.9243 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97566    0.22567  39.774  < 2e-16 ***
## fatheduc     0.18342    0.02471   7.422 3.14e-13 ***
## motheduc     0.18328    0.02622   6.991 6.04e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.984 on 750 degrees of freedom
## Multiple R-squared:  0.245,  Adjusted R-squared:  0.243 
## F-statistic: 121.7 on 2 and 750 DF,  p-value: < 2.2e-16
fit_educ_2sls=regrese_iv3$fitted.values

Vyrovnané hodnoty pak pouzijeme ve druhém stupni:

\(\log(wage)=\beta_0+\beta_1\hat{educ}+\epsilon\)

regrese_iv4=lm(lwage~fit_educ_2sls,data=data.IV)
summary(regrese_iv2)
## 
## Call:
## lm(formula = lwage ~ fit_educ, data = data.IV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2126 -0.3763  0.0563  0.4173  2.0604 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.49368    0.43451   1.136    0.257
## fit_educ     0.05645    0.03510   1.608    0.109
## 
## Residual standard error: 0.7219 on 426 degrees of freedom
##   (325 observations deleted due to missingness)
## Multiple R-squared:  0.006034,   Adjusted R-squared:  0.003701 
## F-statistic: 2.586 on 1 and 426 DF,  p-value: 0.1086

\(\textbf{POZOR}\) Standartní chyby jsou v prípade predchozího postupu zavádející! Musíme vyuzít jiný odhad rozptylu náhodné slozky.

Vyuzijeme package.

library(AER)
library(ivpack)
beta.IV=ivreg(lwage~educ| fatheduc+motheduc,data = data.IV)  # za | dáváme vsechny exogenní promenné!
summary(beta.IV)
## 
## Call:
## ivreg(formula = lwage ~ educ | fatheduc + motheduc, data = data.IV)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.11009 -0.33930  0.04706  0.39745  2.06197 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.55102    0.40858   1.349    0.178
## educ         0.05049    0.03217   1.570    0.117
## 
## Residual standard error: 0.6929 on 426 degrees of freedom
## Multiple R-Squared: 0.08411, Adjusted R-squared: 0.08196 
## Wald test: 2.464 on 1 and 426 DF,  p-value: 0.1172

A otestujeme na prítomnost heteroskedasticity.

library(sandwich)
bptest(beta.IV)
## 
##  studentized Breusch-Pagan test
## 
## data:  beta.IV
## BP = 0.30859, df = 1, p-value = 0.5785

Nyní budeme odhadovat následující regresní funkci:

\(\log(wage)=\beta_0+\beta_1educ+\beta_2exper+\beta_3exper^2+\epsilon\)

První stupen odhadu obsahuje krome IV i vsechny exogenní promenné z regrese!!

\(educ=\alpha_0+\alpha_1fatheduc+\alpha_1motheduc+\gamma_1exper+\gamma_2exper^2+u\)

beta.IV2=ivreg(lwage~educ+exper+expersq|exper+expersq+fatheduc+motheduc,data = data.IV)  # za | dáváme vsechny exogenní promenné!
summary(beta.IV2)
## 
## Call:
## ivreg(formula = lwage ~ educ + exper + expersq | exper + expersq + 
##     fatheduc + motheduc, data = data.IV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0986 -0.3196  0.0551  0.3689  2.3493 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.0481003  0.4003281   0.120  0.90442   
## educ         0.0613966  0.0314367   1.953  0.05147 . 
## exper        0.0441704  0.0134325   3.288  0.00109 **
## expersq     -0.0008990  0.0004017  -2.238  0.02574 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6747 on 424 degrees of freedom
## Multiple R-Squared: 0.1357,  Adjusted R-squared: 0.1296 
## Wald test: 8.141 on 3 and 424 DF,  p-value: 2.787e-05
bptest(beta.IV2)
## 
##  studentized Breusch-Pagan test
## 
## data:  beta.IV2
## BP = 11.709, df = 3, p-value = 0.008449

Nyní je prítomná heteroskedasticita a proto pouzijeme robustní odhad chyb.

robust.se(beta.IV2)  #Robustní odhad pro IV. Je jiný nez v prípade robustních odhadu bez IV
## [1] "Robust Standard Errors"
## 
## t test of coefficients:
## 
##                Estimate  Std. Error t value Pr(>|t|)   
## (Intercept)  0.04810031  0.42778460  0.1124 0.910527   
## educ         0.06139663  0.03318243  1.8503 0.064969 . 
## exper        0.04417039  0.01547356  2.8546 0.004521 **
## expersq     -0.00089897  0.00042807 -2.1001 0.036314 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Pro IV rezidua nepouzíváme R2, který muze být i záporný; nehodí se ani pro F -testy.

Testování endogenity

Pomocí IV muzeme testovat problém endogenity. Princip spocívá v tom, ze deláme úsudek o významnosti vzdálenosti mezi odhadem pomocí OLS a 2SLS. Pokud v regresi nebude problém s endogenitou, pak by nemel být rozdíl mezi odhady statisticky významný, jelikoz obe metody poskytují konzistentní odhad. Akorát metoda OLS bude poskytovat vydatnejsí odhad. Na druhou stranu, pokud nastane problém s endogenitou, rozdíl odhadu by mel být statisticky významný.

K tomuto úcelu pouzijeme Hausmanuv test. S tím se setkáte i v panelových datech.

Metoda \(H_0:Exogenní\) \(H_1:Endogenní\)
OLS Konzistetní a vydatný Nekonzistetní
2SLS Konzistetní, není vydatný Konzistetní
library(systemfit)
ols_model=systemfit(lwage~educ,data=data.IV,method="OLS")
iv_model=systemfit(lwage~educ,data=data.IV,method="2SLS",inst=~fatheduc+motheduc)
hausman.systemfit(iv_model,ols_model)
## 
##  Hausman specification test for consistency of the 3SLS estimation
## 
## data:  data.IV
## Hausman = 4.088, df = 2, p-value = 0.1295

Nemuzeme zamítnou nulovou hypotézu. Nepodarilo se nám tak prokázat endogenita.

Hausman test

\(h=(b_{2SLS}-b_{OLS})'[\hat{avar}(b_{2SLS})-\hat{avar}(b_{OLS})](b_{2SLS}-b_{OLS}) \sim \chi^2(k)\)

d= coef(beta.IV) - coef(regrese_ols)
var_diff = vcov(beta.IV) - vcov(regrese_ols)
hausman = as.vector(t(d) %*% solve(var_diff) %*% d)
pchisq(hausman, df = 2, lower.tail = FALSE)
## [1] 0.1295123

Pro robustní verzi bychom museli pouzít bootstrap, coz je mimo rámec kurzu.

Príklad 2

\(log(med_expen)=\beta_0+\beta_1 insurance+\beta_2ill+\beta_3age+\beta_4 log(income)+\epsilon\)

Medical expenditure a health insurance jsou endogenní promenné.

IV jsou SS income ratio a firm multiple locations

data.health=read.csv("C:/Users/Lukas/Desktop/Ekonometrie/AKM/NEW!!!/Cviceni/IV/iv_health.csv")
attach(data.health)
health_ols=lm(log(medexpense)~healthinsu+illnesses+age+log(income))
bptest(health_ols)
## 
##  studentized Breusch-Pagan test
## 
## data:  health_ols
## BP = 203.69, df = 4, p-value < 2.2e-16
coeftest(health_ols, vcov = vcovHC(health_ols, type = "HC1")) 
## 
## t test of coefficients:
## 
##               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  5.7801266  0.1540331 37.5252 < 2.2e-16 ***
## healthinsu   0.0749595  0.0259022  2.8939  0.003813 ** 
## illnesses    0.4406530  0.0093638 47.0594 < 2.2e-16 ***
## age         -0.0025946  0.0019275 -1.3461  0.178304    
## log(income)  0.0172363  0.0135807  1.2692  0.204405    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
health.IV=ivreg(log(medexpense)~healthinsu+illnesses+age+log(income)|ssiratio+illnesses+age+log(income))  # za | dáváme vsechny exogenní promenné!
summary(health.IV)
## 
## Call:
## ivreg(formula = log(medexpense) ~ healthinsu + illnesses + age + 
##     log(income) | ssiratio + illnesses + age + log(income))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7141 -0.7468  0.1288  0.8907  4.0895 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.589839   0.234676  28.081  < 2e-16 ***
## healthinsu  -0.852201   0.198386  -4.296 1.76e-05 ***
## illnesses    0.448512   0.010293  43.575  < 2e-16 ***
## age         -0.011797   0.002789  -4.230 2.36e-05 ***
## log(income)  0.097693   0.022464   4.349 1.38e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.313 on 10084 degrees of freedom
## Multiple R-Squared: 0.07094, Adjusted R-squared: 0.07058 
## Wald test: 477.3 on 4 and 10084 DF,  p-value: < 2.2e-16
bptest(health.IV)
## 
##  studentized Breusch-Pagan test
## 
## data:  health.IV
## BP = 203.69, df = 4, p-value < 2.2e-16
robust.se(health.IV)
## [1] "Robust Standard Errors"
## 
## t test of coefficients:
## 
##               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  6.5898388  0.2453980 26.8537 < 2.2e-16 ***
## healthinsu  -0.8522010  0.2113027 -4.0331 5.546e-05 ***
## illnesses    0.4485123  0.0100689 44.5442 < 2.2e-16 ***
## age         -0.0117975  0.0029007 -4.0671 4.797e-05 ***
## log(income)  0.0976929  0.0233306  4.1873 2.847e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pridáme dalsí IV

health.IV2=ivreg(log(medexpense)~healthinsu+illnesses+age+log(income)|ssiratio+firmlocation+illnesses+age+log(income))  # za | dáváme vsechny exogenní promenné!
bptest(health.IV2)
## 
##  studentized Breusch-Pagan test
## 
## data:  health.IV2
## BP = 203.69, df = 4, p-value < 2.2e-16
robust.se(health.IV2)
## [1] "Robust Standard Errors"
## 
## t test of coefficients:
## 
##               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  6.6923868  0.2388114 28.0237 < 2.2e-16 ***
## healthinsu  -0.9696236  0.1987108 -4.8796 1.079e-06 ***
## illnesses    0.4495077  0.0102219 43.9750 < 2.2e-16 ***
## age         -0.0129630  0.0028378 -4.5680 4.983e-06 ***
## log(income)  0.1078825  0.0227967  4.7324 2.249e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hausman test

d= coef(health.IV2) - coef(health_ols)
var_diff = vcov(health.IV2) - vcov(health_ols)
hausman = as.vector(t(d) %*% solve(var_diff) %*% d)
pchisq(hausman, df = 2, lower.tail = FALSE)
## [1] 1.106622e-07

Priklad

Nasím ukolem je zjistit, zdali má kourení vliv na váhu novorozence.

\(log(bwght)=\beta_0+\beta_1 packs+\epsilon\)

  • bwght je váha dítete v librách

  • packs je pocet vykourených balícku za den

We might worry that packs is correlated with other health factors or the availability of good prenatal care, so that packs and u might be correlated. A possible instrumental variable for packs is the average price of cigarettes in the state of residence, cigprice.

Podmínka exogenity \(Cov(price, \epsilon)=0\)

Podmínka relevance \(Cov(price, packs)\neq0\)

getwd()
## [1] "C:/Users/Lukas/Desktop/Ekonometrie/AKM/NEW!!!/Cviceni/IV/IV"
setwd("C:/Users/Lukas/Desktop")
load("bwght.RData")

Odhadneme puvodní model

ols=lm(lbwght~packs,data = data)
summary(ols)
## 
## Call:
## lm(formula = lbwght ~ packs, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.63391 -0.08727  0.01926  0.12095  0.83272 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.769404   0.005369  888.26  < 2e-16 ***
## packs       -0.089813   0.016979   -5.29 1.42e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 1386 degrees of freedom
## Multiple R-squared:  0.01979,    Adjusted R-squared:  0.01908 
## F-statistic: 27.98 on 1 and 1386 DF,  p-value: 1.422e-07

Otestujeme podmínku relevance

relevance=lm(packs~cigprice,data = data)
summary(relevance)
## 
## Call:
## lm(formula = packs ~ cigprice, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1106 -0.1061 -0.1032 -0.1015  2.4016 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0674257  0.1025384   0.658    0.511
## cigprice    0.0002829  0.0007830   0.361    0.718
## 
## Residual standard error: 0.2987 on 1386 degrees of freedom
## Multiple R-squared:  9.417e-05,  Adjusted R-squared:  -0.0006273 
## F-statistic: 0.1305 on 1 and 1386 DF,  p-value: 0.7179

Vsimnete si zejména hodnot koeficientu determinace. Jedná se o Weak instrument. Co se vsak stane, pokud jej stejne pouzijeme pri odhadu 2SLS.

library(AER)
sls=ivreg(lbwght~packs|cigprice,data = data)
summary(sls)
## 
## Call:
## ivreg(formula = lbwght ~ packs | cigprice, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4200  0.1368  0.3055  0.4194  1.1540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.4481     0.9082   4.898 1.08e-06 ***
## packs         2.9887     8.6989   0.344    0.731    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9389 on 1386 degrees of freedom
## Multiple R-Squared: -23.23,  Adjusted R-squared: -23.25 
## Wald test: 0.118 on 1 and 1386 DF,  p-value: 0.7312

Pokud bychom predpokládali, ze \(Cov(cigprice,\epsilon)=0\), tak díky weak instrumet dostaneme odhad parametru s “obrovskou” smerodatnou chybou a také znacne vychýlený odhad. Koeficient determinace je záprný, ale to nás nemusí trápit, jelikoz jsme si jiz rekli, ze pri IV jej nelze pouzívat a je nutno pouzít speciální odhad pro R2.

GitHub Logo

GitHub Logo