admin管理员组

文章数量:1123507

A collegue trying to do analysis came up with a code from chatgpt, doing something wrong, but that I don't understand.

Here is the example:

Let's consider a first table ( drugs: Patient have an id, and start a drug at x):

library(data.table)
df1 <- data.table(id = rep(LETTERS[1:5],each = 3))
set.seed(125)
df1[,x := sample(1:10,.N,replace = T)]

        id     x
    <char> <int>
 1:      A    10
 2:      A     8
 3:      A     8
 4:      B     3
 5:      B     9

Let's consider a second (and main) table (hospital visits, same patients, several hospital stays between two dates y1 and y2) :

df2 <- data.table(id = rep(LETTERS[1:5],each = 2),y1 = c(2,4),y2 = c(6,8))
# unique identifier
df2[,eds_id := 1:.N]

        id    y1    y2 eds_id
    <char> <num> <num>  <int>
 1:      A     2     6      1
 2:      A     4     8      2
 3:      B     2     6      3
 4:      B     4     8      4

Now I want, for each hospital stay, know if any drug was prescribed to the patient during the stay, aka x between y1 and y2, for any drug.

I would do non-equi merge:

df2[df1,xinbetween_true := TRUE,on = .(id,y1 <= x, y2 >= x)]
df2[is.na(xinbetween_true),xinbetween_true := FALSE]

Which work.

ChatGPT came up with:

df2[df1,on = "id",xinbetween := x >= y1 & x <= y2]

Which produce wrong answers:

df2[xinbetween_true != xinbetween]

       id    y1    y2 eds_id xinbetween xinbetween_true
   <char> <num> <num>  <int>     <lgcl>          <lgcl>
1:      B     2     6      3      FALSE            TRUE
2:      C     4     8      6      FALSE            TRUE

For these two entries, the ChatGPT script says no, when it actually has some of the df1 entries respecting the condition:

df2[df1,on = "id",allow.cartesian = T][xinbetween_true != xinbetween]


       id    y1    y2 eds_id xinbetween xinbetween_true     x
   <char> <num> <num>  <int>     <lgcl>          <lgcl> <int>
1:      B     2     6      3      FALSE            TRUE     3
2:      B     2     6      3      FALSE            TRUE     9
3:      B     2     6      3      FALSE            TRUE     9
4:      C     4     8      6      FALSE            TRUE     3
5:      C     4     8      6      FALSE            TRUE     4
6:      C     4     8      6      FALSE            TRUE     3

So is here my question:

What does the df2[df1,on = "id",xinbetween := x >= y1 & x <= y2] script do? It does not do a proper non-equi merge, but I don't get what it does.

And in what case can it be used?

A collegue trying to do analysis came up with a code from chatgpt, doing something wrong, but that I don't understand.

Here is the example:

Let's consider a first table ( drugs: Patient have an id, and start a drug at x):

library(data.table)
df1 <- data.table(id = rep(LETTERS[1:5],each = 3))
set.seed(125)
df1[,x := sample(1:10,.N,replace = T)]

        id     x
    <char> <int>
 1:      A    10
 2:      A     8
 3:      A     8
 4:      B     3
 5:      B     9

Let's consider a second (and main) table (hospital visits, same patients, several hospital stays between two dates y1 and y2) :

df2 <- data.table(id = rep(LETTERS[1:5],each = 2),y1 = c(2,4),y2 = c(6,8))
# unique identifier
df2[,eds_id := 1:.N]

        id    y1    y2 eds_id
    <char> <num> <num>  <int>
 1:      A     2     6      1
 2:      A     4     8      2
 3:      B     2     6      3
 4:      B     4     8      4

Now I want, for each hospital stay, know if any drug was prescribed to the patient during the stay, aka x between y1 and y2, for any drug.

I would do non-equi merge:

df2[df1,xinbetween_true := TRUE,on = .(id,y1 <= x, y2 >= x)]
df2[is.na(xinbetween_true),xinbetween_true := FALSE]

Which work.

ChatGPT came up with:

df2[df1,on = "id",xinbetween := x >= y1 & x <= y2]

Which produce wrong answers:

df2[xinbetween_true != xinbetween]

       id    y1    y2 eds_id xinbetween xinbetween_true
   <char> <num> <num>  <int>     <lgcl>          <lgcl>
1:      B     2     6      3      FALSE            TRUE
2:      C     4     8      6      FALSE            TRUE

For these two entries, the ChatGPT script says no, when it actually has some of the df1 entries respecting the condition:

df2[df1,on = "id",allow.cartesian = T][xinbetween_true != xinbetween]


       id    y1    y2 eds_id xinbetween xinbetween_true     x
   <char> <num> <num>  <int>     <lgcl>          <lgcl> <int>
1:      B     2     6      3      FALSE            TRUE     3
2:      B     2     6      3      FALSE            TRUE     9
3:      B     2     6      3      FALSE            TRUE     9
4:      C     4     8      6      FALSE            TRUE     3
5:      C     4     8      6      FALSE            TRUE     4
6:      C     4     8      6      FALSE            TRUE     3

So is here my question:

What does the df2[df1,on = "id",xinbetween := x >= y1 & x <= y2] script do? It does not do a proper non-equi merge, but I don't get what it does.

And in what case can it be used?

Share Improve this question edited 16 hours ago denis asked 18 hours ago denisdenis 5,6731 gold badge15 silver badges44 bronze badges 6
  • I don't know if you are asking why did ChatGPT came up with that code. But if yes, then there's no way we can answer. – Rui Barradas Commented 17 hours ago
  • No, I don't care much. My question is (see last lines of my post): What does the df2[df1,on = "id",xinbetween := x >= y1 & x <= y2] script do? – denis Commented 17 hours ago
  • 1 You can try adding print statements: df2[df1,on = "id", xinbetween := {print(data.table(id, x, y1, y2, x >= y1 & x <= y2)); x >= y1 & x <= y2}] – s_baldur Commented 16 hours ago
  • @s_baldur that's clever, thanks! – denis Commented 16 hours ago
  • 1 You may consider editing the title to be more specific to the question so others with the same question have an easier time finding it when searching – jpsmith Commented 16 hours ago
 |  Show 1 more comment

1 Answer 1

Reset to default 3

It's important here that both data.tables have duplicated IDs. Thus, df2[df1, on = "id"] is a cartesian join:

df1[, rn := as.character(.I)]

df2[df1, on = "id", allow.cartesian = TRUE]
#        id    y1    y2 eds_id     x     rn
#    <char> <num> <num>  <int> <int> <char>
# 1:      A     2     6      1    10      1
# 2:      A     4     8      2    10      1
# 3:      A     2     6      1     8      2
# 4:      A     4     8      2     8      2
# 5:      A     2     6      1     8      3
# 6:      A     4     8      2     8      3
# 7:      B     2     6      3     3      4
# 8:      B     4     8      4     3      4
# 9:      B     2     6      3     9      5
#10:      B     4     8      4     9      5
#11:      B     2     6      3     9      6
#12:      B     4     8      4     9      6
#13:      C     2     6      5     3      7
#14:      C     4     8      6     3      7
#15:      C     2     6      5     4      8
#16:      C     4     8      6     4      8
#17:      C     2     6      5     3      9
#18:      C     4     8      6     3      9
#19:      D     2     6      7    10     10
#20:      D     4     8      8    10     10
#21:      D     2     6      7     7     11
#22:      D     4     8      8     7     11
#23:      D     2     6      7     5     12
#24:      D     4     8      8     5     12
#25:      E     2     6      9    10     13
#26:      E     4     8     10    10     13
#27:      E     2     6      9     7     14
#28:      E     4     8     10     7     14
#29:      E     2     6      9     6     15
#30:      E     4     8     10     6     15
#        id    y1    y2 eds_id     x     rn

It should be elucidating to store the row numbers from df1 that match/are used for the comparison:

library(data.table)
df1 <- data.table(id = rep(LETTERS[1:5],each = 3))
set.seed(125)
df1[,x := sample(1:10,.N,replace = T)]

df2 <- data.table(id = rep(LETTERS[1:5],each = 2),y1 = c(2,4),y2 = c(6,8))
# unique identifier
df2[,eds_id := 1:.N]

df1[, rn := as.character(.I)]
df2[df1,xinbetween_true := rn,on = .(id,y1 <= x, y2 >= x)]
df2[df1,xinbetween := fifelse(x >= y1 & x <= y2, rn, paste0(rn, "-")), on = "id"]

#        id    y1    y2 eds_id xinbetween_true xinbetween
#    <char> <num> <num>  <int>          <char>     <char>
# 1:      A     2     6      1            <NA>         3-
# 2:      A     4     8      2               3          3
# 3:      B     2     6      3               4         6-
# 4:      B     4     8      4            <NA>         6-
# 5:      C     2     6      5               9          9
# 6:      C     4     8      6               8         9-
# 7:      D     2     6      7              12         12
# 8:      D     4     8      8              12         12
# 9:      E     2     6      9              15         15
#10:      E     4     8     10              15         15

As you see, the ChatGPT code uses the last row from df1 with a matching ID.

本文标签: