r - Wrong variable comparison result when performing data.table merge of two table with duplicated keys - Stack Overflow

Updated: 2025-01-11

A colleague doing an analysis got some code from ChatGPT that gives wrong results, and I don't understand what it actually does.

Here is the example:

Consider a first table (drugs: each patient has an id and starts a drug at x):

library(data.table)
df1 <- data.table(id = rep(LETTERS[1:5],each = 3))
set.seed(125)
df1[,x := sample(1:10,.N,replace = T)]

        id     x
    <char> <int>
 1:      A    10
 2:      A     8
 3:      A     8
 4:      B     3
 5:      B     9

Consider a second (and main) table (hospital visits: the same patients, each with several hospital stays running from date y1 to date y2):

df2 <- data.table(id = rep(LETTERS[1:5],each = 2),y1 = c(2,4),y2 = c(6,8))
# unique identifier
df2[,eds_id := 1:.N]

        id    y1    y2 eds_id
    <char> <num> <num>  <int>
 1:      A     2     6      1
 2:      A     4     8      2
 3:      B     2     6      3
 4:      B     4     8      4

Now, for each hospital stay, I want to know whether any drug was prescribed to the patient during the stay, i.e. whether x lies between y1 and y2 for at least one of the patient's drugs.

I would do a non-equi join:

df2[df1,xinbetween_true := TRUE,on = .(id,y1 <= x, y2 >= x)]
df2[is.na(xinbetween_true),xinbetween_true := FALSE]

This works.
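As a sanity check, the non-equi update join can be compared with a slow per-stay loop. This is only a sketch: the x values are typed in by hand from the join output printed in the answer further down (so the result does not depend on the random number generator), and `brute` is a name introduced here for illustration.

```r
library(data.table)

# x values copied from the printed join output, not drawn with sample()
df1 <- data.table(id = rep(LETTERS[1:5], each = 3),
                  x  = as.integer(c(10, 8, 8, 3, 9, 9, 3, 4, 3, 10, 7, 5, 10, 7, 6)))
df2 <- data.table(id = rep(LETTERS[1:5], each = 2), y1 = c(2, 4), y2 = c(6, 8))
df2[, eds_id := 1:.N]

# non-equi update join from the question
df2[df1, xinbetween_true := TRUE, on = .(id, y1 <= x, y2 >= x)]
df2[is.na(xinbetween_true), xinbetween_true := FALSE]

# brute force: for each stay, test all x values of the same patient
brute <- vapply(seq_len(nrow(df2)), function(i) {
  xs <- df1$x[df1$id == df2$id[i]]
  any(xs >= df2$y1[i] & xs <= df2$y2[i])
}, logical(1))

# both approaches should agree on every stay
stopifnot(identical(df2$xinbetween_true, brute))
```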

ChatGPT came up with:

df2[df1,on = "id",xinbetween := x >= y1 & x <= y2]

This produces wrong answers:

df2[xinbetween_true != xinbetween]

       id    y1    y2 eds_id xinbetween xinbetween_true
   <char> <num> <num>  <int>     <lgcl>          <lgcl>
1:      B     2     6      3      FALSE            TRUE
2:      C     4     8      6      FALSE            TRUE

For these two stays, the ChatGPT code says FALSE, even though some of the matching df1 rows do satisfy the condition:

df2[df1,on = "id",allow.cartesian = T][xinbetween_true != xinbetween]


       id    y1    y2 eds_id xinbetween xinbetween_true     x
   <char> <num> <num>  <int>     <lgcl>          <lgcl> <int>
1:      B     2     6      3      FALSE            TRUE     3
2:      B     2     6      3      FALSE            TRUE     9
3:      B     2     6      3      FALSE            TRUE     9
4:      C     4     8      6      FALSE            TRUE     3
5:      C     4     8      6      FALSE            TRUE     4
6:      C     4     8      6      FALSE            TRUE     3

So here is my question:

What does the df2[df1,on = "id",xinbetween := x >= y1 & x <= y2] line do? It is not a proper non-equi join, but I don't understand what it does instead.

And in which cases can it be used?

Asked by denis

I don't know if you are asking why ChatGPT came up with that code. If so, there's no way we can answer. – Rui Barradas
No, I don't care much about that. My question is (see the last lines of my post): what does the df2[df1,on = "id",xinbetween := x >= y1 & x <= y2] script do? – denis
You can try adding print statements: df2[df1,on = "id", xinbetween := {print(data.table(id, x, y1, y2, x >= y1 & x <= y2)); x >= y1 & x <= y2}] – s_baldur
@s_baldur that's clever, thanks! – denis
You may consider editing the title to be more specific to the question, so others with the same question have an easier time finding it when searching. – jpsmith

1 Answer

It's important here that both data.tables have duplicated IDs. Thus, df2[df1, on = "id"] is a cartesian join:

df1[, rn := as.character(.I)]

df2[df1, on = "id", allow.cartesian = TRUE]
#        id    y1    y2 eds_id     x     rn
#    <char> <num> <num>  <int> <int> <char>
# 1:      A     2     6      1    10      1
# 2:      A     4     8      2    10      1
# 3:      A     2     6      1     8      2
# 4:      A     4     8      2     8      2
# 5:      A     2     6      1     8      3
# 6:      A     4     8      2     8      3
# 7:      B     2     6      3     3      4
# 8:      B     4     8      4     3      4
# 9:      B     2     6      3     9      5
#10:      B     4     8      4     9      5
#11:      B     2     6      3     9      6
#12:      B     4     8      4     9      6
#13:      C     2     6      5     3      7
#14:      C     4     8      6     3      7
#15:      C     2     6      5     4      8
#16:      C     4     8      6     4      8
#17:      C     2     6      5     3      9
#18:      C     4     8      6     3      9
#19:      D     2     6      7    10     10
#20:      D     4     8      8    10     10
#21:      D     2     6      7     7     11
#22:      D     4     8      8     7     11
#23:      D     2     6      7     5     12
#24:      D     4     8      8     5     12
#25:      E     2     6      9    10     13
#26:      E     4     8     10    10     13
#27:      E     2     6      9     7     14
#28:      E     4     8     10     7     14
#29:      E     2     6      9     6     15
#30:      E     4     8     10     6     15
#        id    y1    y2 eds_id     x     rn
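One possible way to turn this cartesian join into the per-stay flag the question asks for is to collapse it with any() by eds_id. The sketch below assumes that eds_id uniquely identifies a stay and that every patient in df2 also appears in df1 (a stay of a patient with no drug rows would be left as NA). `pairs` and `flags` are illustrative names, and the x values are copied from the output above rather than drawn with sample().

```r
library(data.table)

# x values copied from the printed join output
df1 <- data.table(id = rep(LETTERS[1:5], each = 3),
                  x  = as.integer(c(10, 8, 8, 3, 9, 9, 3, 4, 3, 10, 7, 5, 10, 7, 6)))
df2 <- data.table(id = rep(LETTERS[1:5], each = 2), y1 = c(2, 4), y2 = c(6, 8))
df2[, eds_id := 1:.N]

# cartesian join: every stay paired with every drug start of the same patient
pairs <- df2[df1, on = "id", allow.cartesian = TRUE]

# one flag per stay: TRUE if at least one pair satisfies the condition
flags <- pairs[, .(xinbetween = any(x >= y1 & x <= y2)), by = eds_id]

# write the flag back onto df2 using the unique stay identifier
df2[flags, xinbetween := i.xinbetween, on = "eds_id"]
```

This materialises all pairs, so on large tables the non-equi join from the question is likely the cheaper option.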

It is instructive to store the row number from df1 that matches (xinbetween_true) or that is used for the comparison (xinbetween), with a trailing "-" marking a comparison that evaluated to FALSE:

library(data.table)
df1 <- data.table(id = rep(LETTERS[1:5],each = 3))
set.seed(125)
df1[,x := sample(1:10,.N,replace = T)]

df2 <- data.table(id = rep(LETTERS[1:5],each = 2),y1 = c(2,4),y2 = c(6,8))
# unique identifier
df2[,eds_id := 1:.N]

df1[, rn := as.character(.I)]
df2[df1,xinbetween_true := rn,on = .(id,y1 <= x, y2 >= x)]
df2[df1,xinbetween := fifelse(x >= y1 & x <= y2, rn, paste0(rn, "-")), on = "id"]

#        id    y1    y2 eds_id xinbetween_true xinbetween
#    <char> <num> <num>  <int>          <char>     <char>
# 1:      A     2     6      1            <NA>         3-
# 2:      A     4     8      2               3          3
# 3:      B     2     6      3               4         6-
# 4:      B     4     8      4            <NA>         6-
# 5:      C     2     6      5               9          9
# 6:      C     4     8      6               8         9-
# 7:      D     2     6      7              12         12
# 8:      D     4     8      8              12         12
# 9:      E     2     6      9              15         15
#10:      E     4     8     10              15         15

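As you can see, the ChatGPT code uses only the last row from df1 with a matching ID: the comparison is evaluated for every matched pair, but each df2 row keeps just the value assigned last.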
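The overwrite can be reproduced on a single stay. In this minimal sketch, `stays`, `drugs`, `flag` and `flag_rev` are made-up names; the values are those of patient B's first stay (eds_id 3) from the tables above.

```r
library(data.table)

stays <- data.table(id = "B", y1 = 2, y2 = 6)       # one hospital stay
drugs <- data.table(id = "B", x = c(3L, 9L, 9L))    # three drug starts, same patient

# same shape as the ChatGPT line: equi join on id only, := in j
stays[drugs, on = "id", flag := x >= y1 & x <= y2]
stays$flag       # FALSE: value from the last matching drugs row (x = 9)

# reversing the row order of drugs makes x = 3 the last match
stays[drugs[3:1], on = "id", flag_rev := x >= y1 & x <= y2]
stays$flag_rev   # TRUE
```

If id were unique in the i table (at most one drug row per patient), each stay would be matched once and the ChatGPT-style line would give the intended result. With repeated ids it silently keeps one comparison per stay, and which one depends on row order.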
