\(Var(\epsilon|X)=\sigma^2 I\) homoskedasticita
\(Var(\epsilon|X)=\sigma^2 \Omega\) heteroskedasticita
\(Var(\epsilon|X)=\sigma^2 h(X)\)
Pro jednodusí predstavu uvazujme prímkovou regresi
\(Var(\epsilon|x)=\sigma^2 h(x)\) kde \(h(x)\) predstavuje funkci “typ” heteroskedasticity. Rozptyl musí být vzdy nezáporný!
Otázkou je jak vypadá funkce \(h(x)\)
\(wage=\beta_0+\beta_1 IQ+\beta_2 expert+\beta_3 educ+\epsilon\)
set.seed(1000)
N=10000
IQ=rnorm(N,100,5)
expert=as.numeric(runif(N, min=0, max=30))
educ = as.numeric(runif(N, min=9, max=20))
sigma=2*(sqrt(educ))^0.5 # tvar podmíneného rozptylu, rozptyl závisí na jednom z regresoru
e4= rnorm(N, 0, sigma)
wage = -50 +10*IQ+ 50*educ+3*expert + e4
populace = data.frame(IQ,expert, educ, wage)
Jen pro zajímavost si odhadneme model.
regrese_hetero=lm(wage~IQ+educ+expert,data = populace) # provedeme odhad regresní rovnice, nyní nedeláme výber
summary(regrese_hetero)
##
## Call:
## lm(formula = wage ~ IQ + educ + expert, data = populace)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.6544 -2.6213 -0.0286 2.5754 14.8072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -50.044573 0.804438 -62.21 <2e-16 ***
## IQ 10.001225 0.007817 1279.47 <2e-16 ***
## educ 49.996864 0.012222 4090.56 <2e-16 ***
## expert 3.004730 0.004496 668.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.888 on 9996 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9995
## F-statistic: 6.304e+06 on 3 and 9996 DF, p-value: < 2.2e-16
Sestavíme si vlastní BP test.
residua=regrese_hetero$residuals
BP_test=lm(residua^2~IQ+educ+expert)
summary(BP_test)
##
## Call:
## lm(formula = residua^2 ~ IQ + educ + expert)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.804 -13.065 -8.149 4.837 259.727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.21890 4.49529 0.716 0.474
## IQ 0.04931 0.04368 1.129 0.259
## educ 0.51657 0.06830 7.563 4.28e-14 ***
## expert -0.03325 0.02512 -1.324 0.186
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.73 on 9996 degrees of freedom
## Multiple R-squared: 0.005977, Adjusted R-squared: 0.005678
## F-statistic: 20.03 on 3 and 9996 DF, p-value: 6.097e-13
Ale nyní si ukázeme prímo funkci pro BP test
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(regrese_hetero)
##
## studentized Breusch-Pagan test
##
## data: regrese_hetero
## BP = 59.767, df = 3, p-value = 6.591e-13
White test pro heteroskedasticitu
residua=regrese_hetero$residuals # ulozíme si residua z regrese1
residua_kvadrat=residua^2
y_hat=regrese_hetero$fitted.values
y_hat2=(regrese_hetero$fitted.values)^2
white=lm(residua_kvadrat~y_hat+y_hat2)
summary(white)
##
## Call:
## lm(formula = residua_kvadrat ~ y_hat + y_hat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.511 -13.123 -8.192 4.824 260.727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.246e+00 2.220e+01 -0.191 0.848
## y_hat 1.325e-02 2.598e-02 0.510 0.610
## y_hat2 -1.144e-06 7.548e-06 -0.152 0.880
##
## Residual standard error: 21.74 on 9997 degrees of freedom
## Multiple R-squared: 0.005237, Adjusted R-squared: 0.005038
## F-statistic: 26.32 on 2 and 9997 DF, p-value: 3.987e-12
Co tedy delat, kdyz zde máme heteroskedasticitu? Vyuzijeme robustní odhad chyb
install.packages("sandwich")
library(sandwich) # https://cran.r-project.org/web/packages/sandwich/sandwich.pdf
Sestavíme si vlastní t-test robustní vuci heteroskedasticite
kov_matice=vcovHC(regrese_hetero) # kovarianèní matice "béèek" Var(b)
rozptyl=diag(kov_matice) # nyní nás zajímají jen rozptyly, tedy diagonála
odhad_bet=regrese_hetero$coefficients # z regrese vybereme odhady parametrù
t_test=odhad_bet/sqrt(rozptyl) # sestrojíme t-test, s robustními odhady
t_test
## (Intercept) IQ educ expert
## -62.68211 1286.54002 4077.35134 667.21638
Výsledky porovnáme s funkcí prímo pro robustní odhad
coeftest(regrese_hetero, vcovHC)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -50.0445729 0.7983868 -62.682 < 2.2e-16 ***
## IQ 10.0012247 0.0077737 1286.540 < 2.2e-16 ***
## educ 49.9968640 0.0122621 4077.351 < 2.2e-16 ***
## expert 3.0047296 0.0045034 667.216 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pokud budete chtít pouzít jiný odhad matice \(B\) z prednásky, muzete:
coeftest(regrese_hetero, vcov = vcovHC(regrese_hetero, type = "HC1"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -50.0445729 0.7980810 -62.706 < 2.2e-16 ***
## IQ 10.0012247 0.0077707 1287.044 < 2.2e-16 ***
## educ 49.9968640 0.0122586 4078.500 < 2.2e-16 ***
## expert 3.0047296 0.0045021 667.401 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Robustní verze F-testu
\(H_0: \beta_1=\beta_2=\beta_3=0\)
\(H_1: non H_0\)
library(car)
linearHypothesis(regrese_hetero,c("IQ=0","educ=0","expert=0"),test="F",white.adjust="hc0")
## Linear hypothesis test
##
## Hypothesis:
## IQ = 0
## educ = 0
## expert = 0
##
## Model 1: restricted model
## Model 2: wage ~ IQ + educ + expert
##
## Note: Coefficient covariance matrix supplied.
##
## Res.Df Df F Pr(>F)
## 1 9999
## 2 9996 3 6310696 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Nyní si nasimulujeme príklad s výraznou heteroskedasticitou
set.seed(1000)
N=10000
IQ=rnorm(N,100,5)
expert=as.numeric(runif(N, min=0, max=30))
educ = as.numeric(runif(N, min=9, max=20))
sigma=2*(exp(educ))^0.5 # tvar podmíneného rozptylu, rozptyl závisí na jednom z regresoru
e4= rnorm(N, 0, sigma)
wage = -50 +10*IQ+ 50*educ+3*expert + e4
populace = data.frame(IQ,expert, educ, wage)
Zásadním predpokladem GSL odhadu je \(\bf{znalostost}\) tvaru heteroskedasticity!
\(Var(\epsilon|X)=4\times e^{educ}\)
\(h(X)=e^{educ}\)
\(\dfrac{wage}{\sqrt{e^{educ}}}=\dfrac{\beta_0}{\sqrt{e^{educ}}}+\beta_1 \dfrac{IQ}{\sqrt{e^{educ}}}+\beta_2 \dfrac{expert}{\sqrt{e^{educ}}}+\beta_3 \dfrac{educ}{\sqrt{e^{educ}}}+\dfrac{\epsilon}{\sqrt{e^{educ}}}\)
\(Var(\dfrac{\epsilon}{\sqrt{e^{educ}}})=\dfrac{1}{e^{educ}} \times Var(\epsilon) =\dfrac{1}{e^{educ}} \times 4\times e^{educ}=4\)
attach(populace)
## The following objects are masked _by_ .GlobalEnv:
##
## educ, expert, IQ, wage
wage_W=wage/sqrt(exp(educ))
konstanta=cbind(rep(1,length(educ)))/sqrt(exp(educ))
IQ_W=IQ/sqrt(exp(educ))
expert_W=expert/sqrt(exp(educ))
educ_W=educ/sqrt(exp(educ))
Vsimnete si, co nám zpusobila silná heterokedasticita.
regrese_hetero2=lm(wage_W~0+konstanta+IQ_W+educ_W+expert_W,data = populace)
summary(regrese_hetero2)
##
## Call:
## lm(formula = wage_W ~ 0 + konstanta + IQ_W + educ_W + expert_W,
## data = populace)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9503 -1.3372 -0.0001 1.3519 8.5191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## konstanta -163.7020 134.6521 -1.216 0.224
## IQ_W 10.0896 1.1994 8.412 < 2e-16 ***
## educ_W 59.7312 5.9634 10.016 < 2e-16 ***
## expert_W 4.1401 0.6918 5.985 2.24e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.001 on 9996 degrees of freedom
## Multiple R-squared: 0.8672, Adjusted R-squared: 0.8672
## F-statistic: 1.632e+04 on 4 and 9996 DF, p-value: < 2.2e-16
bptest(regrese_hetero2)
##
## studentized Breusch-Pagan test
##
## data: regrese_hetero2
## BP = 4.8225, df = 3, p-value = 0.1853
vaha=1/exp(educ) # Pozor na zadání váh!!
regrese_gls=lm(wage~IQ+educ+expert,weights = vaha)
summary(regrese_gls)
##
## Call:
## lm(formula = wage ~ IQ + educ + expert, weights = vaha)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -7.9503 -1.3372 -0.0001 1.3519 8.5191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -163.7020 134.6521 -1.216 0.224
## IQ 10.0896 1.1994 8.412 < 2e-16 ***
## educ 59.7312 5.9634 10.016 < 2e-16 ***
## expert 4.1401 0.6918 5.985 2.24e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.001 on 9996 degrees of freedom
## Multiple R-squared: 0.01995, Adjusted R-squared: 0.01966
## F-statistic: 67.84 on 3 and 9996 DF, p-value: < 2.2e-16
Co delat, kdyz nevíme jak vypadá h(X)?
\(\textbf{Castecným}\) resením je FGLS. V rámci FGLS se snazíme odhadnout tvar heteroskedasticity, tedy tvar funkce \(h(X)\). Mozností je nekonecno, proto muzeme pouzít napríklad predpoklad podle White:
\(Var(\epsilon|X)=\sigma^2 exp(\gamma_0+\gamma_1 IQ+\gamma_2 expert+\gamma_3 educ)\)
\(\epsilon^2=\sigma^2 exp(\gamma_0+\gamma_1 IQ+\gamma_2 expert+\gamma_3 educ)u\)
\(\log{(\epsilon^2)}=\alpha_0+\gamma_0+\gamma_1 IQ+\gamma_2 expert+\gamma_3 educ+u\)
regrese1=lm(wage~IQ+educ+expert)
residua=regrese1$resid
le2=log(residua^2)
\(\log{(\epsilon^2)}=\alpha_0+\gamma_0+\gamma_1 IQ+\gamma_2 expert+\gamma_3 educ+u\)
regrese2=lm(le2~IQ+educ+expert)
nafit_hodnoty=regrese2$fitted.values
h=exp(nafit_hodnoty)
vahy2=1/h
regrese3=lm(wage~IQ+educ+expert,weights = vahy2)
summary(regrese3)
##
## Call:
## lm(formula = wage ~ IQ + educ + expert, weights = vahy2)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -7.7423 -1.2626 -0.0009 1.2784 7.8133
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -169.496 136.844 -1.239 0.216
## IQ 10.136 1.223 8.291 < 2e-16 ***
## educ 59.915 6.025 9.944 < 2e-16 ***
## expert 4.092 0.705 5.805 6.64e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.892 on 9996 degrees of freedom
## Multiple R-squared: 0.0194, Adjusted R-squared: 0.01911
## F-statistic: 65.92 on 3 and 9996 DF, p-value: < 2.2e-16
bptest(regrese3)
##
## studentized Breusch-Pagan test
##
## data: regrese3
## BP = 1233.9, df = 3, p-value < 2.2e-16
Metoda FGLS vám neodstraní heteroskedasticitu! Neznáte její tvar, pouze vám pomuze získat asymptoticky vydatnejsí odhady parametru. Musíte proto vzdy pouzít i robustní odhad chyb!
library(sandwich)
coeftest(regrese3,vcovHC)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -169.49594 135.25464 -1.2532 0.2102
## IQ 10.13640 1.22567 8.2701 < 2.2e-16 ***
## educ 59.91537 6.01045 9.9685 < 2.2e-16 ***
## expert 4.09234 0.69982 5.8477 5.14e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Nejsem uchalák, toto je príklad jak z clánku, tak z knihy od Greena! V clanku autor zkoumal rozhodování o práci a volném casu. viz. Fairs (1977) Extramarital Affairs
affairs=read.table("affairs.txt",header = TRUE)
regrese_aff=lm(Y~Z1+Z2+Z3+Z4+Z5+Z6+Z7+Z8, data = affairs)
summary(regrese_aff)
##
## Call:
## lm(formula = Y ~ Z1 + Z2 + Z3 + Z4 + Z5 + Z6 + Z7 + Z8, data = affairs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0503 -1.7226 -0.7947 0.2101 12.7036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.87201 1.13750 5.162 3.34e-07 ***
## Z1 0.05409 0.30049 0.180 0.8572
## Z2 -0.05098 0.02262 -2.254 0.0246 *
## Z3 0.16947 0.04122 4.111 4.50e-05 ***
## Z4 -0.14262 0.35020 -0.407 0.6840
## Z5 -0.47761 0.11173 -4.275 2.23e-05 ***
## Z6 -0.01375 0.06414 -0.214 0.8303
## Z7 0.10492 0.08888 1.180 0.2383
## Z8 -0.71188 0.12001 -5.932 5.09e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.095 on 592 degrees of freedom
## Multiple R-squared: 0.1317, Adjusted R-squared: 0.12
## F-statistic: 11.23 on 8 and 592 DF, p-value: 7.472e-15
bptest(regrese_aff)
##
## studentized Breusch-Pagan test
##
## data: regrese_aff
## BP = 56.188, df = 8, p-value = 2.593e-09
residua=regrese_aff$resid
le2=log(residua^2)
regrese_aff2=lm(le2~Z1+Z2+Z3+Z4+Z5+Z6+Z7+Z8, data = affairs)
nafit_hodnoty=regrese_aff2$fitted.values
h=exp(nafit_hodnoty)
vahy2=1/h
regrese_aff3=lm(Y~Z1+Z2+Z3+Z4+Z5+Z6+Z7+Z8,data = affairs,weights = vahy2)
Nesmíme vsak zapomenout na robustní odhad chyb. FGLS neodstraní heteroskedasticitu, jen zmensí její dopad.
coeftest(regrese_aff3, vcov = vcovHC(regrese_aff3, type = "HC1"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.906047 1.204729 1.5821 0.114152
## Z1 0.085088 0.230422 0.3693 0.712057
## Z2 0.045144 0.049417 0.9135 0.361334
## Z3 0.028811 0.037210 0.7743 0.439064
## Z4 -0.355230 0.421631 -0.8425 0.399841
## Z5 -0.289497 0.082068 -3.5275 0.000452 ***
## Z6 -0.025546 0.059936 -0.4262 0.670100
## Z7 -0.010622 0.057912 -0.1834 0.854540
## Z8 -0.240043 0.145419 -1.6507 0.099329 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A porovnáme s modelem bez odhadu FGLS.
coeftest(regrese_aff, vcov = vcovHC(regrese3, type = "HC1"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.872 134.903 0.0435 0.9653