I am working with two embeddings, text and image, both the last_hidden_state of transformer models (BERT and ViT), so their shapes are (batch, seq, embed_dim). I want to feed text information into the image features using a cross-attention mechanism, and I was wondering whether this code will give me what I need:
cross_attention = nn.MultiheadAttention(embed_dim=768, num_heads=12, dropout=0.1, batch_first=True)  # batch_first so inputs can stay (batch, seq, embed_dim)
attn_output, attn_output_weights = cross_attention(text_last, img_last, img_last)
I tried the provided code but I am not sure whether it is the right approach
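For reference, here is a minimal self-contained sketch of the direction I am unsure about (assumed hidden size 768 and made-up sequence lengths; `text_last` and `img_last` stand in for the BERT and ViT last_hidden_state tensors). If the goal is to inject *text* information into the *image* stream, the image features would act as the query and the text features as the key/value, so the output keeps the image sequence length while aggregating text content; note also that `nn.MultiheadAttention` expects `(seq, batch, dim)` unless `batch_first=True` is set:

```python
import torch
import torch.nn as nn

# Assumed dimensions: BERT/ViT hidden size 768, toy sequence lengths.
batch, text_len, img_len, dim = 2, 16, 197, 768

text_last = torch.randn(batch, text_len, dim)  # e.g. BERT last_hidden_state
img_last = torch.randn(batch, img_len, dim)    # e.g. ViT last_hidden_state

cross_attention = nn.MultiheadAttention(
    embed_dim=dim, num_heads=12, dropout=0.1,
    batch_first=True,  # inputs are (batch, seq, dim), not the default (seq, batch, dim)
)

# query = image, key/value = text  ->  text-conditioned image features
attn_output, attn_weights = cross_attention(img_last, text_last, text_last)

print(attn_output.shape)   # (2, 197, 768): one output vector per image token
print(attn_weights.shape)  # (2, 197, 16): per image token, attention over text tokens
```

With the arguments ordered as in my original snippet (`text_last` as query), the output would instead have the text sequence length, i.e. image information fed into the text stream rather than the other way around.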
Tags: deep learning, MultiModal Cross attention, Stack Overflow