I've trained a ResNet50 on a v2-8 TPU accelerator in Google Colab, feeding it 5000 images of shape (224, 224, 3). The images are normalized; there are no NaNs, no infs, and no class imbalance, and everything looks fine:

INPUT_SHAPE = (224, 224, 3)

with strategy.scope():
    base_model = ResNet50(weights='imagenet', include_top=False, input_shape=INPUT_SHAPE)
    base_model.trainable = False
    model = tf.keras.models.Sequential([
        base_model,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1024, activation='relu'),
        tf.keras.layers.Dense(6, activation='sigmoid') 
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    
model.fit(
    X_train, 
    y_train, 
    epochs=10, 
    validation_data=(X_val, y_val),
    batch_size=32
  )

When I train on TPU, accuracy and loss become NaN during training. When I switch to CPU, everything works fine.

Why is this happening, and how can I fix it?

I tried training the model on both TPU and CPU in Google Colab. I expected it to train without NaN loss or accuracy in both cases, especially since training works fine on the CPU, but on the TPU both metrics became NaN. I verified that the data is clean (no NaN, no infinity, no class imbalance) and that the model compilation and training setup are correct.
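Since training works on CPU but produces NaNs on TPU, one common culprit is numerical stability in the loss: a `sigmoid` output followed by `binary_crossentropy` can saturate and hit `log(0)`. Below is a hedged sketch of the same model rebuilt to emit raw logits and let the loss apply the sigmoid internally via `from_logits=True`, which is generally more stable. The explicit learning rate and the `weights` parameter (added so the model can be built without downloading ImageNet weights) are my own choices, not from the question:

```python
import tensorflow as tf
from tensorflow.keras.applications import ResNet50

INPUT_SHAPE = (224, 224, 3)

def build_model(weights='imagenet'):
    base_model = ResNet50(weights=weights, include_top=False,
                          input_shape=INPUT_SHAPE)
    base_model.trainable = False
    model = tf.keras.models.Sequential([
        base_model,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1024, activation='relu'),
        # Output raw logits instead of sigmoid probabilities; the loss
        # applies the sigmoid internally in a numerically stable way,
        # avoiding log(0) -> NaN at saturated activations.
        tf.keras.layers.Dense(6),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # illustrative
        loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
        # threshold=0.0 because predictions are logits, not probabilities
        metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)],
    )
    return model
```

This does not change the multi-label semantics of the original model; it only moves the sigmoid into the loss computation.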


Asked Mar 9 at 7:07 by احمد القيسي; edited Mar 9 at 11:47 by Christoph Rackwitz.
  • How did you initialize the TPU cluster and choose the distribution strategy? Did you try different batch sizes? How do you load the data? Maybe add the steps_per_execution parameter to model.compile(). – rehaqds Commented Mar 9 at 17:33
  • Another way to diagnose NaN loss: try single-GPU versus multi-GPU. It could be a bug on the library side, in which case you might get NaN loss with multi-GPU but correct behavior on a single GPU. If so, file an issue on the Keras GitHub. – Innat Commented Mar 9 at 18:19
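The first comment asks how the TPU cluster was initialized and suggests steps_per_execution. For reference, the standard Colab TPU setup looks roughly like the sketch below; the steps_per_execution value of 32 is illustrative, not a recommendation, and this fragment only runs on an actual TPU runtime:

```python
import tensorflow as tf

# Standard Colab TPU setup: TPUClusterResolver discovers the v2-8 runtime.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = ...  # build the model as in the question
    # steps_per_execution batches several training steps into one TPU
    # call, reducing host/device round-trips.
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'],
                  steps_per_execution=32)
```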

1 Answer


You can refer to the following GitHub issue:

https://github.com/tensorflow/tensorflow/issues/86953#event-16275455512

This seems to be a problem with Keras. I used the method described there to solve it before, so you can try it. But in recent days Colab's TPU runtime seems to have problems, and I can't connect to a TPU.
