I have a training dataset with six features and I am using SequentialFeatureSelector to find an "optimal" subset of the features for a linear regression model. The following code returns three features, which I will call X1, X2, X3.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select='auto',
                                tol=0.05, direction='forward',
                                scoring='neg_root_mean_squared_error', cv=8)
sfs.fit_transform(X_train, y_train)
To check the results, I decided to run the same code using the subset of features X1, X2, X3 instead of X_train. I was expecting to see the features X1, X2, X3 returned again, but instead it was only the features X1, X2. Similarly, using these two features again in the same code returned only X1. It seems that the behavior of sfs is always to return a proper subset of the input features with at most n_features_in_ - 1 columns, but I cannot seem to find this information in the scikit-learn docs. Is this correct, and if so, what is the reasoning for not allowing sfs to return the full set of features?
I also checked to see whether backward selection would return the full feature set.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select='auto',
                                tol=1000, direction='backward',
                                scoring='neg_root_mean_squared_error', cv=8)
sfs.fit_transform(X_train, y_train)
I set the threshold tol to a large value in the hope that no removal would count as a satisfactory improvement over the full set of features of X_train. But instead of returning the six original features, it only returned five. The docs simply state:

If the score is not incremented by at least tol between two consecutive feature additions or removals, stop adding or removing.

So it seems that the full feature set is never considered during cross-validation, and the behavior of sfs is different at the very end of a forward selection and at the very beginning of a backward selection. If the full set of features outperforms any proper subset of the features, then don't we want sfs to be able to return that possibility? Is there a standard method to compare a selected proper subset of the features against the full set of features using cross-validation?
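For what it's worth, the manual comparison I have in mind looks like the sketch below: score the selected subset and the full set with cross_val_score under the same settings (with an integer cv, the unshuffled KFold splits are identical in both calls, so the means are directly comparable). Here selected_cols is just a placeholder for the positions of X1, X2, X3, and I am assuming X_train is a NumPy array. I am asking whether there is a standard or built-in way to do this.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# selected_cols is a placeholder for the column positions of X1, X2, X3
selected_cols = [0, 1, 2]
# assumes X_train is a NumPy array; use X_train.iloc[:, selected_cols] for a DataFrame
subset_score = cross_val_score(LinearRegression(), X_train[:, selected_cols], y_train,
                               scoring='neg_root_mean_squared_error', cv=8).mean()
full_score = cross_val_score(LinearRegression(), X_train, y_train,
                             scoring='neg_root_mean_squared_error', cv=8).mean()
print(subset_score, full_score)  # the larger (less negative) mean score wins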
1 Answer
Check the source code, lines 240-246, inside the method fit():
if self.n_features_to_select == "auto":
    if self.tol is not None:
        # With auto feature selection, `n_features_to_select_` will be updated
        # to `support_.sum()` after features are selected.
        self.n_features_to_select_ = n_features - 1
    else:
        self.n_features_to_select_ = n_features // 2
As can be seen, even in auto selection mode with a given tol, the maximum number of features that can be added is bounded by n_features - 1 for some reason (perhaps this is worth reporting as an issue on GitHub).
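To see the cap in action, here is a quick illustrative check on synthetic data (the tiny tol is an arbitrary choice, small enough that every addition counts as an improvement):

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_features=5, random_state=0)
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select='auto',
                                tol=1e-6, direction='forward',
                                scoring='neg_root_mean_squared_error', cv=8)
sfs.fit(X, y)
print(sfs.get_support().sum())  # 4 on this data, never 5: the full set is not even tried

Even though all five features of this make_regression dataset are informative, the selector cannot report more than n_features - 1 = 4 of them.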
We can override the implementation by defining a function get_best_new_feature_score() (modeled on the private method _get_best_new_feature_score() from the source code), as shown below:
import numpy as np
from sklearn.model_selection import cross_val_score

def get_best_new_feature_score(estimator, X, y, cv, current_mask, direction, scoring):
    # features not yet marked in current_mask are the candidates
    candidate_feature_indices = np.flatnonzero(~current_mask)
    scores = {}
    for feature_idx in candidate_feature_indices:
        candidate_mask = current_mask.copy()
        candidate_mask[feature_idx] = True
        if direction == "backward":
            # for backward selection the mask tracks removed features,
            # so score the complement (the features that remain)
            candidate_mask = ~candidate_mask
        X_new = X[:, candidate_mask]
        scores[feature_idx] = cross_val_score(
            estimator,
            X_new,
            y,
            cv=cv,
            scoring=scoring,
        ).mean()
    # return the candidate with the best mean CV score
    new_feature_idx = max(scores, key=lambda feature_idx: scores[feature_idx])
    return new_feature_idx, scores[new_feature_idx]
Now let's implement the auto (forward) selection using a regression dataset with 5 features: we add features one by one, reporting the improvement in score and stopping by comparing the improvement with the provided tol:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_features=5)  # data to be used
X.shape
# (100, 5)
lm = LinearRegression()  # model to be used

# now implement 'auto' feature selection (forward selection)
cur_mask = np.zeros(X.shape[1]).astype(bool)  # no feature selected initially
cv, direction, scoring = 8, 'forward', 'neg_root_mean_squared_error'
tol = 1  # if score improvement > tol, feature will be added in forward selection
old_score = -np.inf
ids, scores = [], []
for i in range(X.shape[1]):
    idx, new_score = get_best_new_feature_score(lm, X, y, current_mask=cur_mask,
                                                cv=cv, direction=direction, scoring=scoring)
    if (new_score - old_score) > tol:
        cur_mask[idx] = True
        ids.append(idx)
        scores.append(new_score)
        old_score = new_score
        print(f'feature {idx} added, CV score {new_score}, mask {cur_mask}')
    else:
        break  # no candidate improves the score by more than tol

# feature 3 added, CV score -90.66899644023539, mask [False False False True False]
# feature 1 added, CV score -59.21188041830155, mask [False True False True False]
# feature 2 added, CV score -16.709218665372905, mask [False True True True False]
# feature 4 added, CV score -3.1862116620446166, mask [False True True True True]
# feature 0 added, CV score -1.4011801838814216e-13, mask [ True True True True True]
Note that all five features are added: the full set is reachable. If tol is set to 10 instead, then only 4 features are added in forward selection. Similarly, with tol=20, only 3 features are added, as expected.
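The same helper also covers the backward case from the question. In the sketch below, removed_mask follows the helper's convention for direction='backward' (the mask passed in marks the features removed so far), and old_score is seeded with the full-set CV score, so that, unlike the built-in selector, the full feature set itself takes part in the comparison. Note that this rule (only remove a feature if doing so improves the score by more than tol) is stricter than scikit-learn's stopping rule, which tolerates small decreases:

# backward elimination: removed_mask marks the features dropped so far
removed_mask = np.zeros(X.shape[1]).astype(bool)  # nothing removed initially
direction, tol = 'backward', 1
old_score = cross_val_score(lm, X, y, cv=cv, scoring=scoring).mean()  # full-set baseline
for i in range(X.shape[1] - 1):
    idx, new_score = get_best_new_feature_score(lm, X, y, current_mask=removed_mask,
                                                cv=cv, direction=direction, scoring=scoring)
    if (new_score - old_score) > tol:  # drop a feature only if that improves the score by > tol
        removed_mask[idx] = True
        old_score = new_score
        print(f'feature {idx} removed, CV score {new_score}')
    else:
        break  # the current set (here, the full set) wins
print(f'kept features: {np.flatnonzero(~removed_mask)}')

On this data no removal improves the score, so the loop stops immediately and all five features are kept, which is exactly the outcome the built-in selector can never produce.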