Heteroskedasticita

\(Var(\epsilon|X)=\sigma^2 I\) homoskedasticita

\(Var(\epsilon|X)=\sigma^2 \Omega\) heteroskedasticita

\(Var(\epsilon|X)=\sigma^2 h(X)\)

Pro jednodusí predstavu uvazujme prímkovou regresi

\(Var(\epsilon|x)=\sigma^2 h(x)\) kde \(h(x)\) predstavuje funkci “typ” heteroskedasticity. Rozptyl musí být vzdy nezáporný!

Otázkou je jak vypadá funkce \(h(x)\)

Model heteroskedasticita

\(wage=\beta_0+\beta_1 IQ+\beta_2 expert+\beta_3 educ+\epsilon\)

set.seed(100)
N=10000
IQ=rnorm(N,100,5)
expert=as.numeric(runif(N, min=0, max=30))
educ = as.numeric(runif(N, min=9, max=20))  
sigma=2*(exp(educ))^0.5             # tvar podmíneného rozptylu, rozptyl závisí na jednom z regresoru 
e4= rnorm(N, 0, sigma)  
wage = -50 +10*IQ+ 50*educ+3*expert + e4       
populace = data.frame(IQ,expert, educ, wage)  

Jen pro zajímavost si odhadneme model.

regrese_hetero=lm(wage~IQ+educ+expert,data = populace)   # provedeme odhad regresní rovnice, nyní nedeláme výber
summary(regrese_hetero)
## 
## Call:
## lm(formula = wage ~ IQ + educ + expert, data = populace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -126807   -1462    -181    1790   97841 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -2571.13    2680.08  -0.959   0.3374  
## IQ             56.01      26.07   2.149   0.0317 *
## educ          -82.85      40.88  -2.027   0.0427 *
## expert        -23.64      14.96  -1.580   0.1142  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12960 on 9996 degrees of freedom
## Multiple R-squared:  0.001115,   Adjusted R-squared:  0.0008156 
## F-statistic: 3.721 on 3 and 9996 DF,  p-value: 0.01091
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(regrese_hetero)
## 
##  studentized Breusch-Pagan test
## 
## data:  regrese_hetero
## BP = 1405.5, df = 3, p-value < 2.2e-16

GLS

Zásadním predpokladem GSL odhadu je \(\bf{znalostost}\) tvaru heteroskedasticity!

\(Var(\epsilon|X)=4\times e^{educ}\)

\(h(X)=e^{educ}\)

\(\dfrac{wage}{\sqrt{e^{educ}}}=\dfrac{\beta_0}{\sqrt{e^{educ}}}+\beta_1 \dfrac{IQ}{\sqrt{e^{educ}}}+\beta_2 \dfrac{expert}{\sqrt{e^{educ}}}+\beta_3 \dfrac{educ}{\sqrt{e^{educ}}}+\dfrac{\epsilon}{\sqrt{e^{educ}}}\)

\(Var(\dfrac{\epsilon}{\sqrt{e^{educ}}})=\dfrac{1}{e^{educ}} \times Var(\epsilon) =\dfrac{1}{e^{educ}} \times 4\times e^{educ}=4\)

attach(populace)
## The following objects are masked _by_ .GlobalEnv:
## 
##     educ, expert, IQ, wage
wage_W=wage/sqrt(exp(educ))
konstanta=cbind(rep(1,length(educ)))/sqrt(exp(educ))
IQ_W=IQ/sqrt(exp(educ))
expert_W=expert/sqrt(exp(educ))
educ_W=educ/sqrt(exp(educ))

Vsimnete si, co nám zpusobila silná heterokedasticita.

regrese_hetero2=lm(wage_W~0+konstanta+IQ_W+educ_W+expert_W,data = populace)   
summary(regrese_hetero2)
## 
## Call:
## lm(formula = wage_W ~ 0 + konstanta + IQ_W + educ_W + expert_W, 
##     data = populace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0748 -1.3715 -0.0315  1.3475  7.2975 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## konstanta  80.5596   136.4402   0.590  0.55491    
## IQ_W        8.5704     1.2012   7.135 1.04e-12 ***
## educ_W     51.7456     6.0206   8.595  < 2e-16 ***
## expert_W    2.6090     0.6935   3.762  0.00017 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.016 on 9996 degrees of freedom
## Multiple R-squared:  0.8604, Adjusted R-squared:  0.8604 
## F-statistic: 1.54e+04 on 4 and 9996 DF,  p-value: < 2.2e-16
bptest(regrese_hetero2)
## 
##  studentized Breusch-Pagan test
## 
## data:  regrese_hetero2
## BP = 1.3732, df = 3, p-value = 0.7118

GLS vs. FGLS

vaha=1/exp(educ)            # Pozor na zadání váh!!

regrese_gls=lm(wage~IQ+educ+expert,weights = vaha)
summary(regrese_gls)
## 
## Call:
## lm(formula = wage ~ IQ + educ + expert, weights = vaha)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0748 -1.3715 -0.0315  1.3475  7.2975 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  80.5596   136.4402   0.590  0.55491    
## IQ            8.5704     1.2012   7.135 1.04e-12 ***
## educ         51.7456     6.0206   8.595  < 2e-16 ***
## expert        2.6090     0.6935   3.762  0.00017 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.016 on 9996 degrees of freedom
## Multiple R-squared:  0.01332,    Adjusted R-squared:  0.01302 
## F-statistic: 44.98 on 3 and 9996 DF,  p-value: < 2.2e-16

Co delat, kdyz nevíme jak vypadá h(X)?

\(\textbf{Castecným}\) resením je FGLS. V rámci FGLS se snazíme odhadnout tvar heteroskedasticity, tedy tvar funkce \(h(X)\). Mozností je nekonecno, proto muzeme pouzít napríklad predpoklad podle White:

\(Var(\epsilon|X)=\sigma^2 exp(\gamma_0+\gamma_1 IQ+\gamma_2 expert+\gamma_3 educ)\)

\(\epsilon^2=\sigma^2 exp(\gamma_0+\gamma_1 IQ+\gamma_2 expert+\gamma_3 educ)u\)

\(\log{(\epsilon^2)}=\alpha_0+\gamma_0+\gamma_1 IQ+\gamma_2 expert+\gamma_3 educ+u\)

  1. Odhadneme regresní model a získáme residua
  2. Umocníme je a prevedeme na logaritmus.
  3. Provedeme regresi a získáme vyrovnané hodnoty \(\hat{\log{(\epsilon^2)}}\)
  4. Provedeme úpravu, kdy \(\hat{h}=exp(\hat{(\log{\epsilon^2)}})\) a získáme tak váhy.
regrese1=lm(wage~IQ+educ+expert)
residua=regrese1$resid
le2=log(residua^2)

\(\log{(\epsilon^2)}=\alpha_0+\gamma_0+\gamma_1 IQ+\gamma_2 expert+\gamma_3 educ+u\)

regrese2=lm(le2~IQ+educ+expert)
nafit_hodnoty=regrese2$fitted.values
h=exp(nafit_hodnoty)
vahy2=1/h

regrese3=lm(wage~IQ+educ+expert,weights = vahy2)
summary(regrese3)
## 
## Call:
## lm(formula = wage ~ IQ + educ + expert, weights = vahy2)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9032 -1.0808 -0.0222  1.0471  7.0005 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  62.8147   189.7879   0.331   0.7407    
## IQ            8.8137     1.7241   5.112 3.25e-07 ***
## educ         51.4228     7.4101   6.940 4.18e-12 ***
## expert        2.4024     0.9957   2.413   0.0159 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.678 on 9996 degrees of freedom
## Multiple R-squared:  0.007752,   Adjusted R-squared:  0.007455 
## F-statistic: 26.03 on 3 and 9996 DF,  p-value: < 2.2e-16
bptest(regrese3)
## 
##  studentized Breusch-Pagan test
## 
## data:  regrese3
## BP = 1405.5, df = 3, p-value < 2.2e-16

Metoda FGLS vám neodstraní heteroskedasticitu! Neznáte její tvar, pouze vám pomuze získat asymptoticky vydatnejsí odhady parametru. Musíte proto vzdy pouzít i robustní odhad chyb!

library(sandwich)
## Warning: package 'sandwich' was built under R version 3.2.5
coeftest(regrese3,vcovHC)
## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  62.81472  135.38014  0.4640 0.6426667    
## IQ            8.81368    1.18655  7.4280 1.193e-13 ***
## educ         51.42277    6.41880  8.0113 1.262e-15 ***
## expert        2.40235    0.71471  3.3613 0.0007786 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Príklad

Nejsem uchalák, toto je príklad jak z clánku, tak z knihy od Greena! V clanku autor zkoumal rozhodování o práci a volném casu. viz. Fairs (1977) Extramarital Affairs

affairs=read.table("affairs.txt",header = TRUE)
regrese_aff=lm(Y~Z1+Z2+Z3+Z4+Z5+Z6+Z7+Z8, data = affairs)
summary(regrese_aff)
## 
## Call:
## lm(formula = Y ~ Z1 + Z2 + Z3 + Z4 + Z5 + Z6 + Z7 + Z8, data = affairs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0503 -1.7226 -0.7947  0.2101 12.7036 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.87201    1.13750   5.162 3.34e-07 ***
## Z1           0.05409    0.30049   0.180   0.8572    
## Z2          -0.05098    0.02262  -2.254   0.0246 *  
## Z3           0.16947    0.04122   4.111 4.50e-05 ***
## Z4          -0.14262    0.35020  -0.407   0.6840    
## Z5          -0.47761    0.11173  -4.275 2.23e-05 ***
## Z6          -0.01375    0.06414  -0.214   0.8303    
## Z7           0.10492    0.08888   1.180   0.2383    
## Z8          -0.71188    0.12001  -5.932 5.09e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.095 on 592 degrees of freedom
## Multiple R-squared:  0.1317, Adjusted R-squared:   0.12 
## F-statistic: 11.23 on 8 and 592 DF,  p-value: 7.472e-15
bptest(regrese_aff)
## 
##  studentized Breusch-Pagan test
## 
## data:  regrese_aff
## BP = 56.188, df = 8, p-value = 2.593e-09
residua=regrese_aff$resid
le2=log(residua^2)
regrese_aff2=lm(le2~Z1+Z2+Z3+Z4+Z5+Z6+Z7+Z8, data = affairs)
nafit_hodnoty=regrese_aff2$fitted.values
h=exp(nafit_hodnoty)
vahy2=1/h

regrese_aff3=lm(Y~Z1+Z2+Z3+Z4+Z5+Z6+Z7+Z8,data = affairs,weights = vahy2)

Nesmíme vsak zapomenout na robustní odhad chyb. FGLS neodstraní heteroskedasticitu, jen zmensí její dopad.

coeftest(regrese_aff3, vcov = vcovHC(regrese_aff3, type = "HC1")) 
## 
## t test of coefficients:
## 
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.906047   1.204729  1.5821 0.114152    
## Z1           0.085088   0.230422  0.3693 0.712057    
## Z2           0.045144   0.049417  0.9135 0.361334    
## Z3           0.028811   0.037210  0.7743 0.439064    
## Z4          -0.355230   0.421631 -0.8425 0.399841    
## Z5          -0.289497   0.082068 -3.5275 0.000452 ***
## Z6          -0.025546   0.059936 -0.4262 0.670100    
## Z7          -0.010622   0.057912 -0.1834 0.854540    
## Z8          -0.240043   0.145419 -1.6507 0.099329 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A porovnáme s modelem bez odhadu FGLS.

coeftest(regrese_aff, vcov = vcovHC(regrese3, type = "HC1")) 
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    5.872    135.134  0.0435   0.9654