Heteroskedasticita

\(Var(\epsilon|X)=\sigma^2 I\) homoskedasticita

\(Var(\epsilon|X)=\sigma^2 \Omega\) heteroskedasticita

\(Var(\epsilon|X)=\sigma^2 h(X)\)

Pro jednodusí predstavu uvazujme prímkovou regresi

\(Var(\epsilon|x)=\sigma^2 h(x)\) kde \(h(x)\) predstavuje funkci “typ” heteroskedasticity. Rozptyl musí být vzdy nezáporný!

Otázkou je jak vypadá funkce \(h(x)\)

Model heteroskedasticita

\(wage=\beta_0+\beta_1 IQ+\beta_2 expert+\beta_3 educ+\epsilon\)

set.seed(1000)
N=10000
IQ=rnorm(N,100,5)
expert=as.numeric(runif(N, min=0, max=30))
educ = as.numeric(runif(N, min=9, max=20))  
sigma=2*(sqrt(educ))^0.5             # tvar podmíneného rozptylu, rozptyl závisí na jednom z regresoru 
e4= rnorm(N, 0, sigma)  
wage = -50 +10*IQ+ 50*educ+3*expert + e4       
populace = data.frame(IQ,expert, educ, wage)  

Jen pro zajímavost si odhadneme model.

regrese_hetero=lm(wage~IQ+educ+expert,data = populace)   # provedeme odhad regresní rovnice, nyní nedeláme výber
summary(regrese_hetero)
## 
## Call:
## lm(formula = wage ~ IQ + educ + expert, data = populace)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.6544  -2.6213  -0.0286   2.5754  14.8072 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -50.044573   0.804438  -62.21   <2e-16 ***
## IQ           10.001225   0.007817 1279.47   <2e-16 ***
## educ         49.996864   0.012222 4090.56   <2e-16 ***
## expert        3.004730   0.004496  668.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.888 on 9996 degrees of freedom
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9995 
## F-statistic: 6.304e+06 on 3 and 9996 DF,  p-value: < 2.2e-16

Sestavíme si vlastní BP test.

residua=regrese_hetero$residuals
BP_test=lm(residua^2~IQ+educ+expert)  
summary(BP_test)
## 
## Call:
## lm(formula = residua^2 ~ IQ + educ + expert)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.804 -13.065  -8.149   4.837 259.727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.21890    4.49529   0.716    0.474    
## IQ           0.04931    0.04368   1.129    0.259    
## educ         0.51657    0.06830   7.563 4.28e-14 ***
## expert      -0.03325    0.02512  -1.324    0.186    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.73 on 9996 degrees of freedom
## Multiple R-squared:  0.005977,   Adjusted R-squared:  0.005678 
## F-statistic: 20.03 on 3 and 9996 DF,  p-value: 6.097e-13

Ale nyní si ukázeme prímo funkci pro BP test

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(regrese_hetero)
## 
##  studentized Breusch-Pagan test
## 
## data:  regrese_hetero
## BP = 59.767, df = 3, p-value = 6.591e-13

White test pro heteroskedasticitu

residua=regrese_hetero$residuals     # ulozíme si residua z regrese1 
residua_kvadrat=residua^2
y_hat=regrese_hetero$fitted.values
y_hat2=(regrese_hetero$fitted.values)^2

white=lm(residua_kvadrat~y_hat+y_hat2)
summary(white)
## 
## Call:
## lm(formula = residua_kvadrat ~ y_hat + y_hat2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.511 -13.123  -8.192   4.824 260.727 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.246e+00  2.220e+01  -0.191    0.848
## y_hat        1.325e-02  2.598e-02   0.510    0.610
## y_hat2      -1.144e-06  7.548e-06  -0.152    0.880
## 
## Residual standard error: 21.74 on 9997 degrees of freedom
## Multiple R-squared:  0.005237,   Adjusted R-squared:  0.005038 
## F-statistic: 26.32 on 2 and 9997 DF,  p-value: 3.987e-12

Co tedy delat, kdyz zde máme heteroskedasticitu? Vyuzijeme robustní odhad chyb

install.packages("sandwich")
library(sandwich)  # https://cran.r-project.org/web/packages/sandwich/sandwich.pdf

Sestavíme si vlastní t-test robustní vuci heteroskedasticite

kov_matice=vcovHC(regrese_hetero)   # kovarianèní matice "béèek" Var(b)
rozptyl=diag(kov_matice)           # nyní nás zajímají jen rozptyly, tedy diagonála
odhad_bet=regrese_hetero$coefficients     # z regrese vybereme odhady parametrù
t_test=odhad_bet/sqrt(rozptyl)          # sestrojíme t-test, s robustními odhady 
t_test
## (Intercept)          IQ        educ      expert 
##   -62.68211  1286.54002  4077.35134   667.21638

Výsledky porovnáme s funkcí prímo pro robustní odhad

coeftest(regrese_hetero, vcovHC)
## 
## t test of coefficients:
## 
##                Estimate  Std. Error  t value  Pr(>|t|)    
## (Intercept) -50.0445729   0.7983868  -62.682 < 2.2e-16 ***
## IQ           10.0012247   0.0077737 1286.540 < 2.2e-16 ***
## educ         49.9968640   0.0122621 4077.351 < 2.2e-16 ***
## expert        3.0047296   0.0045034  667.216 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pokud budete chtít pouzít jiný odhad matice \(B\) z prednásky, muzete:

coeftest(regrese_hetero, vcov = vcovHC(regrese_hetero, type = "HC1")) 
## 
## t test of coefficients:
## 
##                Estimate  Std. Error  t value  Pr(>|t|)    
## (Intercept) -50.0445729   0.7980810  -62.706 < 2.2e-16 ***
## IQ           10.0012247   0.0077707 1287.044 < 2.2e-16 ***
## educ         49.9968640   0.0122586 4078.500 < 2.2e-16 ***
## expert        3.0047296   0.0045021  667.401 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Robustní verze F-testu

\(H_0: \beta_1=\beta_2=\beta_3=0\)

\(H_1: non H_0\)

library(car)
linearHypothesis(regrese_hetero,c("IQ=0","educ=0","expert=0"),test="F",white.adjust="hc0")
## Linear hypothesis test
## 
## Hypothesis:
## IQ = 0
## educ = 0
## expert = 0
## 
## Model 1: restricted model
## Model 2: wage ~ IQ + educ + expert
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df       F    Pr(>F)    
## 1   9999                         
## 2   9996  3 6310696 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

GLS

Nyní si nasimulujeme príklad s výraznou heteroskedasticitou

set.seed(1000)
N=10000
IQ=rnorm(N,100,5)
expert=as.numeric(runif(N, min=0, max=30))
educ = as.numeric(runif(N, min=9, max=20))  
sigma=2*(exp(educ))^0.5             # tvar podmíneného rozptylu, rozptyl závisí na jednom z regresoru 
e4= rnorm(N, 0, sigma)  
wage = -50 +10*IQ+ 50*educ+3*expert + e4       
populace = data.frame(IQ,expert, educ, wage)  

Zásadním predpokladem GSL odhadu je \(\bf{znalostost}\) tvaru heteroskedasticity!

\(Var(\epsilon|X)=4\times e^{educ}\)

\(h(X)=e^{educ}\)

\(\dfrac{wage}{\sqrt{e^{educ}}}=\dfrac{\beta_0}{\sqrt{e^{educ}}}+\beta_1 \dfrac{IQ}{\sqrt{e^{educ}}}+\beta_2 \dfrac{expert}{\sqrt{e^{educ}}}+\beta_3 \dfrac{educ}{\sqrt{e^{educ}}}+\dfrac{\epsilon}{\sqrt{e^{educ}}}\)

\(Var(\dfrac{\epsilon}{\sqrt{e^{educ}}})=\dfrac{1}{e^{educ}} \times Var(\epsilon) =\dfrac{1}{e^{educ}} \times 4\times e^{educ}=4\)

attach(populace)
## The following objects are masked _by_ .GlobalEnv:
## 
##     educ, expert, IQ, wage
wage_W=wage/sqrt(exp(educ))
konstanta=cbind(rep(1,length(educ)))/sqrt(exp(educ))
IQ_W=IQ/sqrt(exp(educ))
expert_W=expert/sqrt(exp(educ))
educ_W=educ/sqrt(exp(educ))

Vsimnete si, co nám zpusobila silná heterokedasticita.

regrese_hetero2=lm(wage_W~0+konstanta+IQ_W+educ_W+expert_W,data = populace)   
summary(regrese_hetero2)
## 
## Call:
## lm(formula = wage_W ~ 0 + konstanta + IQ_W + educ_W + expert_W, 
##     data = populace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9503 -1.3372 -0.0001  1.3519  8.5191 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## konstanta -163.7020   134.6521  -1.216    0.224    
## IQ_W        10.0896     1.1994   8.412  < 2e-16 ***
## educ_W      59.7312     5.9634  10.016  < 2e-16 ***
## expert_W     4.1401     0.6918   5.985 2.24e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.001 on 9996 degrees of freedom
## Multiple R-squared:  0.8672, Adjusted R-squared:  0.8672 
## F-statistic: 1.632e+04 on 4 and 9996 DF,  p-value: < 2.2e-16
bptest(regrese_hetero2)
## 
##  studentized Breusch-Pagan test
## 
## data:  regrese_hetero2
## BP = 4.8225, df = 3, p-value = 0.1853

GLS vs. FGLS

vaha=1/exp(educ)            # Pozor na zadání váh!!

regrese_gls=lm(wage~IQ+educ+expert,weights = vaha)
summary(regrese_gls)
## 
## Call:
## lm(formula = wage ~ IQ + educ + expert, weights = vaha)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9503 -1.3372 -0.0001  1.3519  8.5191 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -163.7020   134.6521  -1.216    0.224    
## IQ            10.0896     1.1994   8.412  < 2e-16 ***
## educ          59.7312     5.9634  10.016  < 2e-16 ***
## expert         4.1401     0.6918   5.985 2.24e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.001 on 9996 degrees of freedom
## Multiple R-squared:  0.01995,    Adjusted R-squared:  0.01966 
## F-statistic: 67.84 on 3 and 9996 DF,  p-value: < 2.2e-16

Co delat, kdyz nevíme jak vypadá h(X)?

\(\textbf{Castecným}\) resením je FGLS. V rámci FGLS se snazíme odhadnout tvar heteroskedasticity, tedy tvar funkce \(h(X)\). Mozností je nekonecno, proto muzeme pouzít napríklad predpoklad podle White:

\(Var(\epsilon|X)=\sigma^2 exp(\gamma_0+\gamma_1 IQ+\gamma_2 expert+\gamma_3 educ)\)

\(\epsilon^2=\sigma^2 exp(\gamma_0+\gamma_1 IQ+\gamma_2 expert+\gamma_3 educ)u\)

\(\log{(\epsilon^2)}=\alpha_0+\gamma_0+\gamma_1 IQ+\gamma_2 expert+\gamma_3 educ+u\)

  1. Odhadneme regresní model a získáme residua
  2. Umocníme je a prevedeme na logaritmus.
  3. Provedeme regresi a získáme vyrovnané hodnoty \(\hat{\log{(\epsilon^2)}}\)
  4. Provedeme úpravu, kdy \(\hat{h}=exp(\hat{(\log{\epsilon^2)}})\) a získáme tak váhy.
regrese1=lm(wage~IQ+educ+expert)
residua=regrese1$resid
le2=log(residua^2)

\(\log{(\epsilon^2)}=\alpha_0+\gamma_0+\gamma_1 IQ+\gamma_2 expert+\gamma_3 educ+u\)

regrese2=lm(le2~IQ+educ+expert)
nafit_hodnoty=regrese2$fitted.values
h=exp(nafit_hodnoty)
vahy2=1/h

regrese3=lm(wage~IQ+educ+expert,weights = vahy2)
summary(regrese3)
## 
## Call:
## lm(formula = wage ~ IQ + educ + expert, weights = vahy2)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7423 -1.2626 -0.0009  1.2784  7.8133 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -169.496    136.844  -1.239    0.216    
## IQ            10.136      1.223   8.291  < 2e-16 ***
## educ          59.915      6.025   9.944  < 2e-16 ***
## expert         4.092      0.705   5.805 6.64e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.892 on 9996 degrees of freedom
## Multiple R-squared:  0.0194, Adjusted R-squared:  0.01911 
## F-statistic: 65.92 on 3 and 9996 DF,  p-value: < 2.2e-16
bptest(regrese3)
## 
##  studentized Breusch-Pagan test
## 
## data:  regrese3
## BP = 1233.9, df = 3, p-value < 2.2e-16

Metoda FGLS vám neodstraní heteroskedasticitu! Neznáte její tvar, pouze vám pomuze získat asymptoticky vydatnejsí odhady parametru. Musíte proto vzdy pouzít i robustní odhad chyb!

library(sandwich)
coeftest(regrese3,vcovHC)
## 
## t test of coefficients:
## 
##               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) -169.49594  135.25464 -1.2532    0.2102    
## IQ            10.13640    1.22567  8.2701 < 2.2e-16 ***
## educ          59.91537    6.01045  9.9685 < 2.2e-16 ***
## expert         4.09234    0.69982  5.8477  5.14e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Príklad

Nejsem uchalák, toto je príklad jak z clánku, tak z knihy od Greena! V clanku autor zkoumal rozhodování o práci a volném casu. viz. Fairs (1977) Extramarital Affairs

affairs=read.table("affairs.txt",header = TRUE)
regrese_aff=lm(Y~Z1+Z2+Z3+Z4+Z5+Z6+Z7+Z8, data = affairs)
summary(regrese_aff)
## 
## Call:
## lm(formula = Y ~ Z1 + Z2 + Z3 + Z4 + Z5 + Z6 + Z7 + Z8, data = affairs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0503 -1.7226 -0.7947  0.2101 12.7036 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.87201    1.13750   5.162 3.34e-07 ***
## Z1           0.05409    0.30049   0.180   0.8572    
## Z2          -0.05098    0.02262  -2.254   0.0246 *  
## Z3           0.16947    0.04122   4.111 4.50e-05 ***
## Z4          -0.14262    0.35020  -0.407   0.6840    
## Z5          -0.47761    0.11173  -4.275 2.23e-05 ***
## Z6          -0.01375    0.06414  -0.214   0.8303    
## Z7           0.10492    0.08888   1.180   0.2383    
## Z8          -0.71188    0.12001  -5.932 5.09e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.095 on 592 degrees of freedom
## Multiple R-squared:  0.1317, Adjusted R-squared:   0.12 
## F-statistic: 11.23 on 8 and 592 DF,  p-value: 7.472e-15
bptest(regrese_aff)
## 
##  studentized Breusch-Pagan test
## 
## data:  regrese_aff
## BP = 56.188, df = 8, p-value = 2.593e-09
residua=regrese_aff$resid
le2=log(residua^2)
regrese_aff2=lm(le2~Z1+Z2+Z3+Z4+Z5+Z6+Z7+Z8, data = affairs)
nafit_hodnoty=regrese_aff2$fitted.values
h=exp(nafit_hodnoty)
vahy2=1/h

regrese_aff3=lm(Y~Z1+Z2+Z3+Z4+Z5+Z6+Z7+Z8,data = affairs,weights = vahy2)

Nesmíme vsak zapomenout na robustní odhad chyb. FGLS neodstraní heteroskedasticitu, jen zmensí její dopad.

coeftest(regrese_aff3, vcov = vcovHC(regrese_aff3, type = "HC1")) 
## 
## t test of coefficients:
## 
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.906047   1.204729  1.5821 0.114152    
## Z1           0.085088   0.230422  0.3693 0.712057    
## Z2           0.045144   0.049417  0.9135 0.361334    
## Z3           0.028811   0.037210  0.7743 0.439064    
## Z4          -0.355230   0.421631 -0.8425 0.399841    
## Z5          -0.289497   0.082068 -3.5275 0.000452 ***
## Z6          -0.025546   0.059936 -0.4262 0.670100    
## Z7          -0.010622   0.057912 -0.1834 0.854540    
## Z8          -0.240043   0.145419 -1.6507 0.099329 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A porovnáme s modelem bez odhadu FGLS.

coeftest(regrese_aff, vcov = vcovHC(regrese3, type = "HC1")) 
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    5.872    134.903  0.0435   0.9653