I've trained a ResNet50 on a v2-8 TPU accelerator in Google Colab, feeding it 5000 images of shape (224, 224, 3). The images are normalized; there are no NaNs, no infs, and no class imbalance, and everything looks fine:

INPUT_SHAPE = (224, 224, 3)

with strategy.scope():
    base_model = ResNet50(weights='imagenet', include_top=False, input_shape=INPUT_SHAPE)
    base_model.trainable = False
    model = tf.keras.models.Sequential([
        base_model,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1024, activation='relu'),
        tf.keras.layers.Dense(6, activation='sigmoid') 
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    
model.fit(
    X_train, 
    y_train, 
    epochs=10, 
    validation_data=(X_val, y_val),
    batch_size=32
  )

When I train on TPU, accuracy and loss become NaN during training. When I switch to CPU, everything works fine.

Why is this happening, and how can I fix it?

I tried training the model on both TPU and CPU in Google Colab. I expected it to train without NaN loss or accuracy in both cases, especially since training works fine on the CPU, but on the TPU both metrics became NaN. I verified that the data is clean (no NaN, no infinity, no class imbalance) and that the model compilation and training setup are correct.
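Since training works on CPU but produces NaNs on TPU, one common culprit is numerical stability in the loss: a `sigmoid` output followed by `binary_crossentropy` can saturate and hit `log(0)`. Below is a hedged sketch of the same model rebuilt to emit raw logits and let the loss apply the sigmoid internally via `from_logits=True`, which is generally more stable. The explicit learning rate and the `weights` parameter (added so the model can be built without downloading ImageNet weights) are my own choices, not from the question:

```python
import tensorflow as tf
from tensorflow.keras.applications import ResNet50

INPUT_SHAPE = (224, 224, 3)

def build_model(weights='imagenet'):
    base_model = ResNet50(weights=weights, include_top=False,
                          input_shape=INPUT_SHAPE)
    base_model.trainable = False
    model = tf.keras.models.Sequential([
        base_model,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1024, activation='relu'),
        # Output raw logits instead of sigmoid probabilities; the loss
        # applies the sigmoid internally in a numerically stable way,
        # avoiding log(0) -> NaN at saturated activations.
        tf.keras.layers.Dense(6),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # illustrative
        loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
        # threshold=0.0 because predictions are logits, not probabilities
        metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)],
    )
    return model
```

This does not change the multi-label semantics of the original model; it only moves the sigmoid into the loss computation.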


Asked Mar 9 at 7:07 by احمد القيسي; edited Mar 9 at 11:47 by Christoph Rackwitz.
  • How did you initialize the TPU cluster and choose the distribution strategy? Did you try different batch sizes? How do you load the data? Maybe add the steps_per_execution parameter to model.compile(). – rehaqds Commented Mar 9 at 17:33
  • Another way to diagnose NaN loss: try single-GPU versus multi-GPU. It could be a bug on the library side, in which case you might get NaN loss with multi-GPU but correct behavior on a single GPU. If so, file an issue on the Keras GitHub. – Innat Commented Mar 9 at 18:19
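The first comment asks how the TPU cluster was initialized and suggests steps_per_execution. For reference, the standard Colab TPU setup looks roughly like the sketch below; the steps_per_execution value of 32 is illustrative, not a recommendation, and this fragment only runs on an actual TPU runtime:

```python
import tensorflow as tf

# Standard Colab TPU setup: TPUClusterResolver discovers the v2-8 runtime.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = ...  # build the model as in the question
    # steps_per_execution batches several training steps into one TPU
    # call, reducing host/device round-trips.
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'],
                  steps_per_execution=32)
```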

1 Answer


You can refer to the following GitHub issue:

https://github.com/tensorflow/tensorflow/issues/86953#event-16275455512

This seems to be a problem with Keras. I used the method described there to solve it before, so you can try it. But in recent days Colab's TPU runtime seems to have problems, and I can't connect to a TPU.
