admin管理员组文章数量:1123507
A collegue trying to do analysis came up with a code from chatgpt, doing something wrong, but that I don't understand.
Here is the example:
Let's consider a first table ( drugs: Patient have an id
, and start a drug at x
):
library(data.table)
df1 <- data.table(id = rep(LETTERS[1:5],each = 3))
set.seed(125)
df1[,x := sample(1:10,.N,replace = T)]
id x
<char> <int>
1: A 10
2: A 8
3: A 8
4: B 3
5: B 9
Let's consider a second (and main) table (hospital visits, same patients, several hospital stays between two dates y1
and y2
) :
df2 <- data.table(id = rep(LETTERS[1:5],each = 2),y1 = c(2,4),y2 = c(6,8))
# unique identifier
df2[,eds_id := 1:.N]
id y1 y2 eds_id
<char> <num> <num> <int>
1: A 2 6 1
2: A 4 8 2
3: B 2 6 3
4: B 4 8 4
Now I want, for each hospital stay, know if any drug was prescribed to the patient during the stay, aka x between y1 and y2, for any drug.
I would do non-equi merge:
df2[df1,xinbetween_true := TRUE,on = .(id,y1 <= x, y2 >= x)]
df2[is.na(xinbetween_true),xinbetween_true := FALSE]
Which work.
ChatGPT came up with:
df2[df1,on = "id",xinbetween := x >= y1 & x <= y2]
Which produce wrong answers:
df2[xinbetween_true != xinbetween]
id y1 y2 eds_id xinbetween xinbetween_true
<char> <num> <num> <int> <lgcl> <lgcl>
1: B 2 6 3 FALSE TRUE
2: C 4 8 6 FALSE TRUE
For these two entries, the ChatGPT script says no, when it actually has some of the df1 entries respecting the condition:
df2[df1,on = "id",allow.cartesian = T][xinbetween_true != xinbetween]
id y1 y2 eds_id xinbetween xinbetween_true x
<char> <num> <num> <int> <lgcl> <lgcl> <int>
1: B 2 6 3 FALSE TRUE 3
2: B 2 6 3 FALSE TRUE 9
3: B 2 6 3 FALSE TRUE 9
4: C 4 8 6 FALSE TRUE 3
5: C 4 8 6 FALSE TRUE 4
6: C 4 8 6 FALSE TRUE 3
So is here my question:
What does the df2[df1,on = "id",xinbetween := x >= y1 & x <= y2]
script do? It does not do a proper non-equi merge, but I don't get what it does.
And in what case can it be used?
A collegue trying to do analysis came up with a code from chatgpt, doing something wrong, but that I don't understand.
Here is the example:
Let's consider a first table ( drugs: Patient have an id
, and start a drug at x
):
library(data.table)
df1 <- data.table(id = rep(LETTERS[1:5],each = 3))
set.seed(125)
df1[,x := sample(1:10,.N,replace = T)]
id x
<char> <int>
1: A 10
2: A 8
3: A 8
4: B 3
5: B 9
Let's consider a second (and main) table (hospital visits, same patients, several hospital stays between two dates y1
and y2
) :
df2 <- data.table(id = rep(LETTERS[1:5],each = 2),y1 = c(2,4),y2 = c(6,8))
# unique identifier
df2[,eds_id := 1:.N]
id y1 y2 eds_id
<char> <num> <num> <int>
1: A 2 6 1
2: A 4 8 2
3: B 2 6 3
4: B 4 8 4
Now I want, for each hospital stay, know if any drug was prescribed to the patient during the stay, aka x between y1 and y2, for any drug.
I would do non-equi merge:
df2[df1,xinbetween_true := TRUE,on = .(id,y1 <= x, y2 >= x)]
df2[is.na(xinbetween_true),xinbetween_true := FALSE]
Which work.
ChatGPT came up with:
df2[df1,on = "id",xinbetween := x >= y1 & x <= y2]
Which produce wrong answers:
df2[xinbetween_true != xinbetween]
id y1 y2 eds_id xinbetween xinbetween_true
<char> <num> <num> <int> <lgcl> <lgcl>
1: B 2 6 3 FALSE TRUE
2: C 4 8 6 FALSE TRUE
For these two entries, the ChatGPT script says no, when it actually has some of the df1 entries respecting the condition:
df2[df1,on = "id",allow.cartesian = T][xinbetween_true != xinbetween]
id y1 y2 eds_id xinbetween xinbetween_true x
<char> <num> <num> <int> <lgcl> <lgcl> <int>
1: B 2 6 3 FALSE TRUE 3
2: B 2 6 3 FALSE TRUE 9
3: B 2 6 3 FALSE TRUE 9
4: C 4 8 6 FALSE TRUE 3
5: C 4 8 6 FALSE TRUE 4
6: C 4 8 6 FALSE TRUE 3
So is here my question:
What does the df2[df1,on = "id",xinbetween := x >= y1 & x <= y2]
script do? It does not do a proper non-equi merge, but I don't get what it does.
And in what case can it be used?
Share Improve this question edited 16 hours ago denis asked 18 hours ago denisdenis 5,6731 gold badge15 silver badges44 bronze badges 6 | Show 1 more comment1 Answer
Reset to default 3It's important here that both data.tables have duplicated IDs. Thus, df2[df1, on = "id"]
is a cartesian join:
df1[, rn := as.character(.I)]
df2[df1, on = "id", allow.cartesian = TRUE]
# id y1 y2 eds_id x rn
# <char> <num> <num> <int> <int> <char>
# 1: A 2 6 1 10 1
# 2: A 4 8 2 10 1
# 3: A 2 6 1 8 2
# 4: A 4 8 2 8 2
# 5: A 2 6 1 8 3
# 6: A 4 8 2 8 3
# 7: B 2 6 3 3 4
# 8: B 4 8 4 3 4
# 9: B 2 6 3 9 5
#10: B 4 8 4 9 5
#11: B 2 6 3 9 6
#12: B 4 8 4 9 6
#13: C 2 6 5 3 7
#14: C 4 8 6 3 7
#15: C 2 6 5 4 8
#16: C 4 8 6 4 8
#17: C 2 6 5 3 9
#18: C 4 8 6 3 9
#19: D 2 6 7 10 10
#20: D 4 8 8 10 10
#21: D 2 6 7 7 11
#22: D 4 8 8 7 11
#23: D 2 6 7 5 12
#24: D 4 8 8 5 12
#25: E 2 6 9 10 13
#26: E 4 8 10 10 13
#27: E 2 6 9 7 14
#28: E 4 8 10 7 14
#29: E 2 6 9 6 15
#30: E 4 8 10 6 15
# id y1 y2 eds_id x rn
It should be elucidating to store the row numbers from df1 that match/are used for the comparison:
library(data.table)
df1 <- data.table(id = rep(LETTERS[1:5],each = 3))
set.seed(125)
df1[,x := sample(1:10,.N,replace = T)]
df2 <- data.table(id = rep(LETTERS[1:5],each = 2),y1 = c(2,4),y2 = c(6,8))
# unique identifier
df2[,eds_id := 1:.N]
df1[, rn := as.character(.I)]
df2[df1,xinbetween_true := rn,on = .(id,y1 <= x, y2 >= x)]
df2[df1,xinbetween := fifelse(x >= y1 & x <= y2, rn, paste0(rn, "-")), on = "id"]
# id y1 y2 eds_id xinbetween_true xinbetween
# <char> <num> <num> <int> <char> <char>
# 1: A 2 6 1 <NA> 3-
# 2: A 4 8 2 3 3
# 3: B 2 6 3 4 6-
# 4: B 4 8 4 <NA> 6-
# 5: C 2 6 5 9 9
# 6: C 4 8 6 8 9-
# 7: D 2 6 7 12 12
# 8: D 4 8 8 12 12
# 9: E 2 6 9 15 15
#10: E 4 8 10 15 15
As you see, the ChatGPT code uses the last row from df1
with a matching ID.
本文标签:
版权声明:本文标题:r - Wrong variable comparison result when performing data.table merge of two table with duplicated keys - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1736573431a1944803.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
What does the df2[df1,on = "id",xinbetween := x >= y1 & x <= y2] script do?
– denis Commented 17 hours agodf2[df1,on = "id", xinbetween := {print(data.table(id, x, y1, y2, x >= y1 & x <= y2)); x >= y1 & x <= y2}]
– s_baldur Commented 16 hours ago