ResNet50 on TPU calculates NaN for accuracy and loss, works fine on CPU in Google Colab
I've trained a ResNet50 on a v2-8 TPU accelerator using Google Colab, feeding it 5000 images of shape (224, 224, 3). The images are normalized, there are no NaNs, no infs, and no class imbalance, and everything looks fine:
INPUT_SHAPE = (224, 224, 3)

with strategy.scope():
    base_model = ResNet50(weights='imagenet', include_top=False, input_shape=INPUT_SHAPE)
    base_model.trainable = False

    model = tf.keras.models.Sequential([
        base_model,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1024, activation='relu'),
        tf.keras.layers.Dense(6, activation='sigmoid')
    ])

    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

model.fit(
    X_train,
    y_train,
    epochs=10,
    validation_data=(X_val, y_val),
    batch_size=32
)
When I train on the TPU, accuracy and loss become NaN during training. When I switch to the CPU, everything works fine.
Why is this happening, and how can I fix it?
I trained the model on both the TPU and the CPU in Google Colab. I expected it to train without issues such as NaN loss or accuracy, especially since training works fine on the CPU. However, on the TPU both accuracy and loss became NaN. I also verified that the data is clean, with no NaN, infinity, or imbalance issues, and that the model compilation and training setup are correct.
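As an aside, a minimal sketch for double-checking the inputs, assuming X_train and y_train are NumPy arrays as passed to fit() above (this snippet is illustrative, not part of the original question):

import numpy as np

# Both checks should pass if the data really contains no NaN or inf values.
assert np.all(np.isfinite(X_train)), "X_train contains NaN or inf"
assert np.all(np.isfinite(y_train)), "y_train contains NaN or inf"

# The input range should match the normalization you applied, e.g. [0, 1] or [-1, 1].
print(X_train.min(), X_train.max(), y_train.min(), y_train.max())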
Comments:
- How did you initialize the TPU cluster and choose the distribution strategy? Did you try different batch sizes? How do you load the data? Maybe add the parameter steps_per_execution to model.compile(). – rehaqds, Mar 9 at 17:33
- Another possibility for a NaN loss: try a single GPU versus multiple GPUs. It could be a bug on the library side, in which case you might get a NaN loss with multi-GPU while everything works properly on a single GPU. If so, file an issue on the Keras GitHub. – Innat, Mar 9 at 18:19
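For reference, a minimal sketch of the usual Colab TPU initialization plus the steps_per_execution suggestion from the comment above; the resolver setup and the value 32 are assumptions, and build_model() is a hypothetical stand-in for the model definition shown in the question:

import tensorflow as tf

# Typical Colab TPU setup (assumed; adjust to your runtime).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = build_model()  # hypothetical helper wrapping the Sequential model above
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'],
                  # Run several training steps per TPU dispatch, as suggested above.
                  steps_per_execution=32)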
1 Answer
You can refer to the following URL:
https://github.com/tensorflow/tensorflow/issues/86953#event-16275455512
This seems to be a problem with Keras. I used this method to solve it before, so you can try it. However, in recent days Colab's TPU seems to have had problems and I can't connect to a TPU.
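Separately from the linked issue, a commonly recommended way to make this particular head numerically more stable is to output logits and let the loss apply the sigmoid internally. This is a hedged sketch of that general technique, reusing the strategy and data from the question; it is not necessarily the fix described in the issue:

with strategy.scope():
    base_model = tf.keras.applications.ResNet50(
        weights='imagenet', include_top=False, input_shape=(224, 224, 3))
    base_model.trainable = False

    model = tf.keras.models.Sequential([
        base_model,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1024, activation='relu'),
        tf.keras.layers.Dense(6)  # raw logits, no sigmoid here
    ])

    model.compile(
        optimizer='adam',
        # from_logits=True applies the sigmoid inside the loss, which avoids
        # log(0) and is more stable than a separate sigmoid + binary_crossentropy.
        loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
        # With logits, a threshold of 0.0 corresponds to a probability of 0.5.
        metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)])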