I’m performing linear regression in Python with statsmodels. I have two categorical predictors:

  • sample: a factor with 8 levels
  • distractor: a factor with 2 levels

My goal is to determine the “absolute” beta (effect) for each level of each variable. When I fit the model with an intercept using treatment (dummy) coding (the default), statsmodels reports coefficients as differences relative to the reference (baseline) level. For example, consider the following output:

Intercept             5.076e-04
C(sample)[T.1]       -2.333e-18
C(sample)[T.2]       -1.558e-18
C(sample)[T.3]       -7.167e-19
C(sample)[T.4]       -1.402e-18
C(sample)[T.5]        7.694e-04
C(sample)[T.6]        5.478e-19
C(sample)[T.7]        4.516e-03
C(distractor)[T.9]   -1.015e-03

Here, the intercept is the predicted response when sample is at its reference level (level 0) and distractor is at its reference level (level 8). The coefficient for C(distractor)[T.9] is then the difference from distractor level 8. That means the “absolute” beta for distractor level 8 is just the intercept, and for distractor level 9 it is Intercept + C(distractor)[T.9] = 5.076e-04 - 1.015e-03 ≈ -5.074e-04.
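To check this reading of the coefficients, here is a minimal sketch that compares them against model predictions. The data is hypothetical (random responses on a balanced 8 x 2 design, not my real data), and the names fit, ref and pred are only for illustration. If the interpretation is right, the prediction at (sample 0, distractor 8) should equal the intercept, and the prediction at (sample 0, distractor 9) should equal the intercept plus the distractor contrast.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical balanced data: every sample level occurs with both distractor levels.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    [(str(s), d) for s in range(8) for d in ('8', '9')],
    columns=['sample', 'distractor'],
)
data['response'] = rng.normal(size=len(data))

fit = smf.ols('response ~ C(sample) + C(distractor)', data=data).fit()
b = fit.params

# Predict at the reference sample level for each distractor level.
ref = pd.DataFrame({'sample': ['0', '0'], 'distractor': ['8', '9']})
pred = fit.predict(ref)

print(np.isclose(pred.iloc[0], b['Intercept']))
print(np.isclose(pred.iloc[1], b['Intercept'] + b['C(distractor)[T.9]']))
```

Note that both comparisons hold sample at level 0, so "Intercept + C(distractor)[T.9]" is the prediction for distractor 9 at sample 0 specifically, not a sample-free value.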

My confusion is:

  1. Is it expected that the reference level for both factors is represented solely by the intercept (i.e. that the first level of all variables always has the same beta value)?
  2. How do I extract a full set of betas (i.e. 8 for sample and 2 for distractor) from the model?

I tried removing the intercept (using - 1 in the formula), but then statsmodels still dropped one dummy variable for distractor due to collinearity (even though distractor clearly has two levels when modeled alone, as shown by fitting response ~ C(distractor) - 1 which returns two coefficients). The two factors are independent.
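The behaviour described above can be reproduced with a small sketch. The data is again hypothetical (balanced, random responses), and the variable names are illustrative. It prints the parameter names of the no-intercept model, then builds a matrix with one dummy column for every level of both factors and checks its rank. My understanding, which may be incomplete, is that the 8 sample dummies already sum to a constant column, so a full pair of distractor dummies adds only one new independent direction.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical balanced data, as a stand-in for the real data set.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    [(str(s), d) for s in range(8) for d in ('8', '9')],
    columns=['sample', 'distractor'],
)
data['response'] = rng.normal(size=len(data))

# No-intercept model: sample gets one column per level, distractor does not.
no_int = smf.ols('response ~ C(sample) + C(distractor) - 1', data=data).fit()
param_names = list(no_int.params.index)
print(param_names)

# One dummy per level for BOTH factors: 8 + 2 = 10 columns.
full = pd.get_dummies(data[['sample', 'distractor']]).to_numpy(dtype=float)
n_columns = full.shape[1]
rank = np.linalg.matrix_rank(full)
print(n_columns, rank)  # the rank is one less than the column count
```

So 10 separate per-level parameters cannot all be estimated from an additive model of these two factors, with or without an intercept; only 9 independent quantities are available.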

What is the proper way to obtain “absolute” beta values for all levels? Is it correct to compute them by adding the intercept to the reported contrasts (using zero for the reference level)? If so, is there any cleaner method in statsmodels to directly return a parameter for each level?
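One candidate for a cleaner method, offered as a sketch rather than a confirmed answer: predict on a grid containing every combination of levels, then average the predictions over the other factor. This yields one value per level (8 for sample, 2 for distractor), but these are marginal predicted means, not separately estimated betas. The data is hypothetical and balanced, and grid, sample_means and distractor_means are names introduced only for this example.

```python
import itertools
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

levels_s = [str(i) for i in range(8)]
levels_d = ['8', '9']

# Hypothetical balanced data: two replicates per (sample, distractor) cell.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    [(s, d) for s in levels_s for d in levels_d for _ in range(2)],
    columns=['sample', 'distractor'],
)
data['response'] = rng.normal(size=len(data))

model = smf.ols('response ~ C(sample) + C(distractor)', data=data).fit()

# Predict every level combination, then average over the other factor.
grid = pd.DataFrame(
    list(itertools.product(levels_s, levels_d)),
    columns=['sample', 'distractor'],
)
grid['pred'] = model.predict(grid)
sample_means = grid.groupby('sample')['pred'].mean()
distractor_means = grid.groupby('distractor')['pred'].mean()

print(sample_means)
print(distractor_means)
```

Because this example design is balanced and the model is additive, the averaged predictions coincide with the raw per-level means of the response; with unbalanced data the two would generally differ.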

Example dummy code:

import pandas as pd
import statsmodels.formula.api as smf

# Create dummy data
data = pd.DataFrame({
    'response': [0.51, 0.52, 0.53, 0.54, 0.60, 0.61, 0.62, 0.63, 0.55, 0.56],
    'sample': ['0', '1', '2', '3', '4', '5', '6', '7', '0', '1'],  # 8 levels (as strings)
    'distractor': ['8', '8', '8', '8', '9', '9', '9', '9', '8', '9']  # 2 levels
})

# Model with intercept (default treatment coding)
model_with_int = smf.ols('response ~ C(sample) + C(distractor)', data=data).fit()
print("Model with intercept:")
print(model_with_int.params)
# Form of the output (the 0.000508 is from my real data above, not from this dummy data):
# Intercept             0.000508  (this is the effect at sample=0, distractor=8)
# C(sample)[T.1]       (difference between sample 1 and sample 0)
# ...
# C(distractor)[T.9]   (difference between distractor 9 and distractor 8)

# To get the "absolute" beta for each level:
# For sample:
#   Level 0 beta = Intercept
#   Level 1 beta = Intercept + C(sample)[T.1]
#   ... and so on.
# For distractor:
#   Level 8 beta = Intercept
#   Level 9 beta = Intercept + C(distractor)[T.9]

print("\nAbsolute beta values:")
abs_beta_sample = {}
abs_beta_distractor = {}
intercept = model_with_int.params['Intercept']

# For sample, assume reference level is '0'
abs_beta_sample['0'] = intercept
for lvl in ['1', '2', '3', '4', '5', '6', '7']:
    coef_name = f"C(sample)[T.{lvl}]"
    abs_beta_sample[lvl] = intercept + model_with_int.params.get(coef_name, 0)

# For distractor, assume reference level is '8'
abs_beta_distractor['8'] = intercept
abs_beta_distractor['9'] = intercept + model_with_int.params.get("C(distractor)[T.9]", 0)

print("Sample beta values:", abs_beta_sample)
print("Distractor beta values:", abs_beta_distractor)

I would appreciate any guidance on whether this is the correct approach or if there’s a better way to directly obtain the full set of betas from the model.