I have a training dataset with six features and I am using SequentialFeatureSelector to find an "optimal" subset of the features for a linear regression model. The following code returns three features, which I will call X1, X2, X3.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select='auto',
                                tol=0.05, direction='forward',
                                scoring='neg_root_mean_squared_error', cv=8)
sfs.fit_transform(X_train, y_train)
To check the results, I decided to run the same code using the subset of features X1, X2, X3 instead of X_train. I was expecting to see the features X1, X2, X3 returned again, but instead it was only the features X1, X2. Similarly, using these two features again in the same code returned only X1. It seems that the behavior of sfs is always to return a proper subset of the input features with at most n_features_in_ - 1 columns, but I cannot seem to find this information in the scikit-learn docs. Is this correct, and if so, what is the reasoning for not allowing sfs to return the full set of features?
I also checked to see whether backward selection would return the full feature set.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select='auto',
                                tol=1000, direction='backward',
                                scoring='neg_root_mean_squared_error', cv=8)
sfs.fit_transform(X_train, y_train)
I set the threshold tol to a large value in the hope that no removal would count as a satisfactory improvement over the full set of features of X_train. But instead of returning the six original features, it only returned five. The docs simply state:

If the score is not incremented by at least tol between two consecutive feature additions or removals, stop adding or removing.

So it seems that the full feature set is never considered during cross-validation, and the behavior of sfs is different at the very end of a forward selection and at the very beginning of a backward selection. If the full set of features outperforms any proper subset of the features, then don't we want sfs to be able to return that possibility? Is there a standard method to compare a selected proper subset of the features against the full set of features using cross-validation?
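For what it's worth, the manual comparison I have in mind looks like the sketch below: score the selected subset and the full set with cross_val_score under the same settings (with an integer cv, the unshuffled KFold splits are identical in both calls, so the means are directly comparable). Here selected_cols is just a placeholder for the positions of X1, X2, X3, and I am assuming X_train is a NumPy array. I am asking whether there is a standard or built-in way to do this.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# selected_cols is a placeholder for the column positions of X1, X2, X3
selected_cols = [0, 1, 2]
# assumes X_train is a NumPy array; use X_train.iloc[:, selected_cols] for a DataFrame
subset_score = cross_val_score(LinearRegression(), X_train[:, selected_cols], y_train,
                               scoring='neg_root_mean_squared_error', cv=8).mean()
full_score = cross_val_score(LinearRegression(), X_train, y_train,
                             scoring='neg_root_mean_squared_error', cv=8).mean()
print(subset_score, full_score)  # the larger (less negative) mean score wins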
1 Answer
Check the source code, lines 240-246, inside the method fit():
if self.n_features_to_select == "auto":
    if self.tol is not None:
        # With auto feature selection, `n_features_to_select_` will be updated
        # to `support_.sum()` after features are selected.
        self.n_features_to_select_ = n_features - 1
    else:
        self.n_features_to_select_ = n_features // 2
As can be seen, even in auto selection mode with a given tol, the maximum number of features that can be added is bounded by n_features - 1 for some reason (perhaps this is worth reporting as an issue on GitHub).
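To see the cap in action, here is a quick illustrative check on synthetic data (the tiny tol is an arbitrary choice, small enough that every addition counts as an improvement):

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_features=5, random_state=0)
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select='auto',
                                tol=1e-6, direction='forward',
                                scoring='neg_root_mean_squared_error', cv=8)
sfs.fit(X, y)
print(sfs.get_support().sum())  # 4 on this data, never 5: the full set is not even tried

Even though all five features of this make_regression dataset are informative, the selector cannot report more than n_features - 1 = 4 of them.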
We can override the implementation by defining a function get_best_new_feature_score() (modeled on the private method _get_best_new_feature_score() from the source code), as shown below:
import numpy as np
from sklearn.model_selection import cross_val_score

def get_best_new_feature_score(estimator, X, y, cv, current_mask, direction, scoring):
    # features not yet marked in current_mask are the candidates
    candidate_feature_indices = np.flatnonzero(~current_mask)
    scores = {}
    for feature_idx in candidate_feature_indices:
        candidate_mask = current_mask.copy()
        candidate_mask[feature_idx] = True
        if direction == "backward":
            # for backward selection the mask tracks removed features,
            # so score the complement (the features that remain)
            candidate_mask = ~candidate_mask
        X_new = X[:, candidate_mask]
        scores[feature_idx] = cross_val_score(
            estimator,
            X_new,
            y,
            cv=cv,
            scoring=scoring,
        ).mean()
    # return the candidate with the best mean CV score
    new_feature_idx = max(scores, key=lambda feature_idx: scores[feature_idx])
    return new_feature_idx, scores[new_feature_idx]
Now let's implement the auto (forward) selection using a regression dataset with 5 features: we add features one by one, reporting the improvement in score and stopping by comparing the improvement with the provided tol:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_features=5)  # data to be used
X.shape
# (100, 5)
lm = LinearRegression()  # model to be used

# now implement 'auto' feature selection (forward selection)
cur_mask = np.zeros(X.shape[1]).astype(bool)  # no feature selected initially
cv, direction, scoring = 8, 'forward', 'neg_root_mean_squared_error'
tol = 1  # if score improvement > tol, feature will be added in forward selection
old_score = -np.inf
ids, scores = [], []
for i in range(X.shape[1]):
    idx, new_score = get_best_new_feature_score(lm, X, y, current_mask=cur_mask,
                                                cv=cv, direction=direction, scoring=scoring)
    if (new_score - old_score) > tol:
        cur_mask[idx] = True
        ids.append(idx)
        scores.append(new_score)
        old_score = new_score
        print(f'feature {idx} added, CV score {new_score}, mask {cur_mask}')
    else:
        break  # no candidate improves the score by more than tol

# feature 3 added, CV score -90.66899644023539, mask [False False False True False]
# feature 1 added, CV score -59.21188041830155, mask [False True False True False]
# feature 2 added, CV score -16.709218665372905, mask [False True True True False]
# feature 4 added, CV score -3.1862116620446166, mask [False True True True True]
# feature 0 added, CV score -1.4011801838814216e-13, mask [ True True True True True]
Note that all five features are added: the full set is reachable. If tol is set to 10 instead, then only 4 features are added in forward selection. Similarly, with tol=20, only 3 features are added, as expected.
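The same helper also covers the backward case from the question. In the sketch below, removed_mask follows the helper's convention for direction='backward' (the mask passed in marks the features removed so far), and old_score is seeded with the full-set CV score, so that, unlike the built-in selector, the full feature set itself takes part in the comparison. Note that this rule (only remove a feature if doing so improves the score by more than tol) is stricter than scikit-learn's stopping rule, which tolerates small decreases:

# backward elimination: removed_mask marks the features dropped so far
removed_mask = np.zeros(X.shape[1]).astype(bool)  # nothing removed initially
direction, tol = 'backward', 1
old_score = cross_val_score(lm, X, y, cv=cv, scoring=scoring).mean()  # full-set baseline
for i in range(X.shape[1] - 1):
    idx, new_score = get_best_new_feature_score(lm, X, y, current_mask=removed_mask,
                                                cv=cv, direction=direction, scoring=scoring)
    if (new_score - old_score) > tol:  # drop a feature only if that improves the score by > tol
        removed_mask[idx] = True
        old_score = new_score
        print(f'feature {idx} removed, CV score {new_score}')
    else:
        break  # the current set (here, the full set) wins
print(f'kept features: {np.flatnonzero(~removed_mask)}')

On this data no removal improves the score, so the loop stops immediately and all five features are kept, which is exactly the outcome the built-in selector can never produce.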