I'm training ML models (XGBoost and LightGBM) using Snowpark, but every run gives me different metric values (AUC, average precision), so I can never tell which model is actually the best.

I tried setting a global variable at the beginning of my notebook, random_seed = 42, and passing it to my undersampling function and to the initialization of my models:

    if model_type == 'xgboost':
        model = XGBClassifier(
            random_state=random_seed,
            input_cols=feature_cols,
            label_cols=target_col,
            output_cols=['PREDICTION'],
            passthrough_cols=['INDIVIDUAL_SK', 'DATE_MONTH'],
            **hyperparameters
        )

    elif model_type == 'lightgbm':
        model = LGBMClassifier(
            random_state=random_seed,
            input_cols=feature_cols,
            label_cols=target_col,
            output_cols=['PREDICTION'],
            passthrough_cols=['INDIVIDUAL_SK', 'DATE_MONTH'],
            **hyperparameters
        )

import snowflake.snowpark.functions as F
from snowflake.snowpark import Window

def undersample_majority_class(df):
    # Seniority in years, derived from the months since the first lead
    df_with_seniority = df.with_column("years_since", (F.col('TIME_SINCE_FIRST_LEAD') / 12).cast('int'))

    # Rank each individual's rows in a seeded random order
    df_with_random = df_with_seniority.with_column('random_order', F.random(seed=random_seed))
    window_spec = Window.partition_by("INDIVIDUAL_SK").order_by(F.col('random_order').asc())
    df_ranked = df_with_random.with_column("month_rank", F.row_number().over(window_spec))

    # Keep majority rows ranked 1 for individuals above 10 years, ranked 1 or 2 otherwise
    df_majority = df_ranked.filter(F.col("CONVERSION_INDICATOR") == 0)
    df_majority_sampled = df_majority.filter(
        ((F.col("years_since") > 10) & (F.col("month_rank") == 1))
        | ((F.col("years_since") <= 10) & (F.col("month_rank") <= 2))
    )

    # Drop the helper columns and add back all minority rows
    df_majority_sampled = df_majority_sampled.drop('years_since', 'month_rank', 'random_order')
    df_minority = df.filter(F.col("CONVERSION_INDICATOR") == 1)
    df_balanced = df_majority_sampled.union_all(df_minority)

    return df_balanced
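
For comparison, the selection rule in this function can be sketched in plain Python, outside Snowpark, to check whether the sampling step on its own is repeatable. This is only an illustration under several assumptions that are not in the code above: rows are plain dicts, the pair (INDIVIDUAL_SK, DATE_MONTH) identifies a row uniquely, floor division stands in for cast('int') (which may round differently in Snowflake), and the helper stable_order_key is a hypothetical replacement for F.random that derives the ordering from the row's own identifiers and the seed rather than from a random generator.

```python
import hashlib
from collections import defaultdict

RANDOM_SEED = 42  # same value as random_seed in the notebook


def stable_order_key(row, seed=RANDOM_SEED):
    """Hypothetical helper: a pseudo-random but repeatable sort key.

    The key depends only on the seed and the row's identifiers, so it does
    not change with the order in which rows are read.
    """
    payload = f"{seed}|{row['INDIVIDUAL_SK']}|{row['DATE_MONTH']}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def undersample_majority_class(rows, seed=RANDOM_SEED):
    """Plain-Python sketch of the undersampling rule for a list of dict rows."""
    # Group rows per individual (the PARTITION BY of the window)
    by_individual = defaultdict(list)
    for row in rows:
        by_individual[row["INDIVIDUAL_SK"]].append(row)

    majority_sampled = []
    for group in by_individual.values():
        # Rank all rows of the individual, as row_number() does over the window
        ranked = sorted(group, key=lambda r: stable_order_key(r, seed))
        for month_rank, row in enumerate(ranked, start=1):
            if row["CONVERSION_INDICATOR"] != 0:
                continue  # only majority rows are sampled here
            years_since = row["TIME_SINCE_FIRST_LEAD"] // 12  # assumption: floor
            max_rank = 1 if years_since > 10 else 2
            if month_rank <= max_rank:
                majority_sampled.append(row)

    # All minority rows are kept, as in the union_all step
    minority = [r for r in rows if r["CONVERSION_INDICATOR"] == 1]
    return majority_sampled + minority
```

With this kind of key, calling the function twice on the same rows, even in a different input order, selects the same rows. If the Snowpark version does not behave the same way between runs, the sampling step is a candidate source of the varying metrics.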

I don't know what else to change to make the results reproducible.
