admin管理员组

文章数量:1391937

I've recently encountered a problem with predict() function for plm regression that uses AR(1) term in form of lag().

Here is an toy example: 3 countries, 3 periods, current population as a function a lagged population, the models are estimated with pooled OLS.

library(plm)

data <- data.frame(country = rep(c("A", "B", "C"), each = 3),
                  year = rep(c(1, 2, 3), 3),
                  population = c(100, 150, 200, 1000, 1200, 1250, 10, 20, 45)) |>
        pdata.frame(index = c('country', 'year'))

data$lag_population <- plm::lag(data$population, k = 1)

lm_model <- lm(population ~ lag_population, data)
plm_model <- plm(population ~ plm::lag(population, k = 1), data, model = 'pooling')
plm_model2 <- plm(population ~ lag_population, data, model = 'pooling')

All three models are identical:

> print(lm_model)

Call:
lm(formula = population ~ lag_population, data = data)

Coefficients:
   (Intercept)  lag_population  
        31.633           1.079  

+ print(plm_model)

Model Formula: population ~ plm::lag(population, k = 1)

Coefficients:
                (Intercept) plm::lag(population, k = 1) 
                    31.6332                      1.0787 

+ print(plm_model2)

Model Formula: population ~ lag_population

Coefficients:
   (Intercept) lag_population 
       31.6332         1.0787 


Yet, their forecasts for the same data are different:

  • lm() based forecast is exactly what is expected, with NA in first periods for each cross-section (b/c first periods don't have lagged observations)
  • plm() forecast that uses generated variable lag_population instead of the lag() function, is similar to that from lm(), yet, NAs are dropped
  • plm() forecast that uses lag() function is the most problematic: values are shifted 1 period back, plus some values are generated for last periods (the model is pooled, without fixed effects, so how are the forecasts generated for 1st periods, especially that fill.na = FALSE is used?)
predict_plm <- predict(plm_model, newdata = data, na.fill = FALSE)
predict_plm2 <- predict(plm_model2, newdata = data, na.fill = FALSE)
predict_lm <- predict(lm_model, newdata = data)

> print(predict_lm) 
       A-1        A-2        A-3        B-1        B-2        B-3        C-1        C-2        C-3 
        NA  139.50424  193.43973         NA 1110.34313 1326.08511         NA   42.42035   53.20745 
+ print(predict_plm2)
       A-2        A-3        B-2        B-3        C-2        C-3 
 139.50424  193.43973 1110.34313 1326.08511   42.42035   53.20745 
+ print(predict_plm)
       A-1        A-2        A-3        B-1        B-2        B-3        C-1        C-2        C-3 
 139.50424  193.43973  247.37522 1110.34313 1326.08511 1380.02060   42.42035   53.20745   80.17519 

Has anyone encountered a similar issue? How can I make the plm predictions look the same as those based on lm()?

I've recently encountered a problem with predict() function for plm regression that uses AR(1) term in form of lag().

Here is an toy example: 3 countries, 3 periods, current population as a function a lagged population, the models are estimated with pooled OLS.

library(plm)

data <- data.frame(country = rep(c("A", "B", "C"), each = 3),
                  year = rep(c(1, 2, 3), 3),
                  population = c(100, 150, 200, 1000, 1200, 1250, 10, 20, 45)) |>
        pdata.frame(index = c('country', 'year'))

data$lag_population <- plm::lag(data$population, k = 1)

lm_model <- lm(population ~ lag_population, data)
plm_model <- plm(population ~ plm::lag(population, k = 1), data, model = 'pooling')
plm_model2 <- plm(population ~ lag_population, data, model = 'pooling')

All three models are identical:

> print(lm_model)

Call:
lm(formula = population ~ lag_population, data = data)

Coefficients:
   (Intercept)  lag_population  
        31.633           1.079  

+ print(plm_model)

Model Formula: population ~ plm::lag(population, k = 1)

Coefficients:
                (Intercept) plm::lag(population, k = 1) 
                    31.6332                      1.0787 

+ print(plm_model2)

Model Formula: population ~ lag_population

Coefficients:
   (Intercept) lag_population 
       31.6332         1.0787 


Yet, their forecasts for the same data are different:

  • lm() based forecast is exactly what is expected, with NA in first periods for each cross-section (b/c first periods don't have lagged observations)
  • plm() forecast that uses generated variable lag_population instead of the lag() function, is similar to that from lm(), yet, NAs are dropped
  • plm() forecast that uses lag() function is the most problematic: values are shifted 1 period back, plus some values are generated for last periods (the model is pooled, without fixed effects, so how are the forecasts generated for 1st periods, especially that fill.na = FALSE is used?)
predict_plm <- predict(plm_model, newdata = data, na.fill = FALSE)
predict_plm2 <- predict(plm_model2, newdata = data, na.fill = FALSE)
predict_lm <- predict(lm_model, newdata = data)

> print(predict_lm) 
       A-1        A-2        A-3        B-1        B-2        B-3        C-1        C-2        C-3 
        NA  139.50424  193.43973         NA 1110.34313 1326.08511         NA   42.42035   53.20745 
+ print(predict_plm2)
       A-2        A-3        B-2        B-3        C-2        C-3 
 139.50424  193.43973 1110.34313 1326.08511   42.42035   53.20745 
+ print(predict_plm)
       A-1        A-2        A-3        B-1        B-2        B-3        C-1        C-2        C-3 
 139.50424  193.43973  247.37522 1110.34313 1326.08511 1380.02060   42.42035   53.20745   80.17519 

Has anyone encountered a similar issue? How can I make the plm predictions look the same as those based on lm()?

Share Improve this question asked Mar 13 at 22:22 user19957user19957 452 bronze badges 3
  • You could try fixest::feols(population ~ lag_population, data) |> predict(newdata=data) or fixest::feols(population ~ l(population), data, panel.id=~country + year) |> predict(newdata=data), respectively. – jay.sf Commented Mar 14 at 3:43
  • This is a helpful suggestion, thanks! I didn't realize that fixest can estimate pooling regression if one omits fixef argument. The issue with fixest is that, for my actual task I need to use combinations of lag() and diff() but, for some reason, fixest cannot handle formulas with l(d(x,1),1)... – user19957 Commented Mar 14 at 13:13
  • This is fixed already in the development version of plm, see github/ycroissant/plm – Helix123 Commented Mar 15 at 10:18
Add a comment  | 

1 Answer 1

Reset to default 0

EDIT: I see, this is fixed in the newest plm CRAN release 2.6-6

This is fixed already in the development version of plm, see https://github/ycroissant/plm/

Install the development version by, e.g.:

# install.packages("remotes") # remove '#' if pkg 'remotes' is not installed
remotes::install_github("ycroissant/plm")

For your example, it yields:

predict(plm_model, newdata = data)
##       A-2        A-3        B-2        B-3        C-2        C-3 
##  139.50424  193.43973 1110.34313 1326.08511   42.42035   53.20745 

Regards the case with NAs: currently, NA-padding (via na.action) is not supported. So, plm's predict corresponds to lm's predict with these arguments:

predict(lm_model, newdata = data, na.action = na.omit)
##       A-2        A-3        B-2        B-3        C-2        C-3 
## 139.50424  193.43973 1110.34313 1326.08511   42.42035   53.20745 

本文标签: rIssue with predictplm() when using lag() inside the regressionStack Overflow