MLflow nested runs not grouping together in GUI

I'm learning MLflow in Databricks using the tutorial https://docs.databricks.com/_extras/notebooks/source/mlflow/mlflow-end-to-end-example-uc.html. The tutorial includes using nested MLflow runs for hyperparameter optimization of XGBoost. A parent run is created via

with mlflow.start_run(run_name='xgboost_models'):
    best_params = fmin(
        fn=train_model, 
        space=search_space, 
        algo=tpe.suggest, 
        max_evals=96,
        trials=spark_trials,
    )

which invokes the model training process defined by

def train_model(params):
    mlflow.xgboost.autolog()
    with mlflow.start_run(nested=True):
        train = xgb.DMatrix(data=X_train, label=y_train)
        validation = xgb.DMatrix(data=X_val, label=y_val)
        # Additional training code here

The successful result is that on the Databricks default Experiments page (i.e., MLflow GUI pointing to default location), I see a run called xgboost_models that can be expanded to show a list of child runs where actual ML training was performed. The parent-child grouping as instructed by mlflow.start_run(nested=True) came out nicely.

Trouble comes when I decide that my runs should be logged to an Experiments location that I choose myself, instead of the default location in Databricks. First, I create the new location:

EXPERIMENT_NAME = '/Users/[email protected]/MLflow_experiments/dxxxx_minimal_MLflow'

# Get the experiment if it exists, or create a new one
experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)

if experiment is None:
    # The experiment does not exist, so create it
    experiment_id = mlflow.create_experiment(EXPERIMENT_NAME)
else:
    # The experiment exists, so take its ID
    experiment_id = experiment.experiment_id

This goes well, in the sense that if I execute a single unnested ML run via with mlflow.start_run(experiment_id=experiment_id, run_name='untuned_random_forest'), the new log for run untuned_random_forest shows up on the dxxxx_minimal_MLflow Experiments page.

It really gets weird when I try this with the hyperopt nested runs. If I modify the outer call to read

with mlflow.start_run(experiment_id=experiment_id, run_name='xgboost_models_2'):
    best_params = fmin(
        fn=train_model, 
        space=search_space, 
        algo=tpe.suggest, 
        max_evals=96,
        trials=spark_trials,
    )

and change nothing else, my new parent run xgboost_models_2 shows up on the dxxxx_minimal_MLflow experiment page with no children. And all the child runs show up back on the default experiment page with no parent -- which is pretty hideous!

Checking the details, it may be important to note that the child runs do have a Parent ID tag, and its value seems to be set correctly: it points to the run ID of the xgboost_models_2 parent run. This leads me to suspect that the nested argument to mlflow.start_run(nested=True) is doing its job, and that the GUI is simply failing to interpret the parent-child relationship correctly.
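The symptoms are consistent with an experiment page that only nests a child under a parent when both runs live in the same experiment. Here is a stand-alone sketch of that grouping logic in plain Python (the run records are hypothetical dicts, not MLflow objects), showing how a child with a correct parent tag but the wrong experiment_id ends up orphaned:

```python
def group_runs(runs, experiment_id):
    """Sketch of an experiment page: list the runs of ONE experiment and
    nest a run under its parent only if the parent is also on that page."""
    page = [r for r in runs if r["experiment_id"] == experiment_id]
    ids_on_page = {r["run_id"] for r in page}
    tree = {}     # parent run_id -> list of child run_ids
    orphans = []  # children whose parent is not on this page
    for r in page:
        parent = r.get("parent_run_id")
        if parent is None:
            tree.setdefault(r["run_id"], [])
        elif parent in ids_on_page:
            tree.setdefault(parent, []).append(r["run_id"])
        else:
            orphans.append(r["run_id"])
    return tree, orphans

runs = [
    {"run_id": "p1", "experiment_id": "exp_custom", "parent_run_id": None},
    # Child tagged with the right parent, but logged to the default experiment:
    {"run_id": "c1", "experiment_id": "exp_default", "parent_run_id": "p1"},
]

# Custom experiment page: the parent shows up with no children.
tree, _ = group_runs(runs, "exp_custom")      # -> {'p1': []}
# Default experiment page: the child shows up as an orphan.
_, orphans = group_runs(runs, "exp_default")  # -> ['c1']
```

So even with a correct Parent ID tag, a child logged to a different experiment would never appear under its parent on either page, which matches what I'm seeing.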

Questions:

  1. Anyone got a fix?
  2. Anyone able to clue me in about whether this is a general MLflow problem or just a Databricks problem?

Footnote: I've tried to fix this by shoving additional parameters into the child invocations of mlflow.start_run(), such as experiment_id and parent_run_id, but that makes no difference. Which seems reasonable, because as I noted above, the child runs already carry the correct Parent Run ID tag in the first place.

asked Jan 8 at 17:07 by David Kaufman

1 Answer

So, a solution.

By logging some extra parameters from the child runs, I determined that my MLflow environment (whose fault this is, I can't say) creates the child runs with a different experiment_id than that of the parent run, in total defiance of nested=True and in utter disregard for any experiment_id or parent_run_id parameters I pass into the child invocation of mlflow.start_run().

However, we can set the experiment globally at the point where we initially created/obtained the desired experiment_id, i.e., in the block that sets and uses EXPERIMENT_NAME. Just add the following line to the end of that block:

mlflow.set_experiment(experiment_id=experiment_id)

(But still, the failure of nested=True doesn't seem like a very nice thing.)
