I am trying to read in a .DAT file that is supposed to be tab-delimited, but it is a bit messier than anticipated: some records are broken across several lines. The data should have five columns:
- the first is a numeric value of 1 or greater
- the second is a date formatted as "MM/DD/YYYY"
- the third contains time formatted as "HH:MM:SS" or "HH:MM"
- the fourth contains free text
- the fifth contains free text
c("12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT",
"UNKNOWN",
"CONTRAINDICATION, STOP",
"75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal",
"75\t07/01/2014\t15:01:02\tCT/MRI\tCT chest shows collapse...otherwise OKAY. ETT okay. ",
"CT neg. Left s w/1mm cyst.",
"75\t06/13/2018\t14:30\tCardiac cath\tNormal A, EF 66%, no AS/MR"
)
I pulled this piece of the file with readLines().
Ideally, the output would be:
V1 | V2 | V3 | V4 | V5 |
---|---|---|---|---|
12315 | 01/01/1999 | 22:31 | Other, specify - Test | BILE TUCT, UNKNOWN, CONTRAINDICATION, STOP |
75 | 01/28/2021 | 19:34 | ct | Bilateral (unable to cannulate), normal |
75 | 07/01/2014 | 15:01:02 | CT/MRI | CT chest shows collapse...otherwise OKAY. ETT okay. CT neg. Left s w/1mm cyst. |
75 | 06/13/2018 | 14:30 | Cardiac cath | Normal A, EF 66%, no AS/MR |
2 Answers
Assuming that tabs are not used in the free-text fields and that dat is the result of readLines() / readr::read_lines(), we could:
- create a frame from the lines;
- group on grepl("\t", lines) |> cumsum() to collect consecutive related lines (one with tabs, optionally followed by those without tabs);
- collapse each group's lines with the desired separator, ", ";
- pass the collapsed lines to readr::read_tsv().
library(dplyr, warn.conflicts = FALSE)
dat |>
tibble(lines = _) |>
group_by(record = grepl("\t", lines) |> cumsum()) |>
#> lines record
#> <chr> <int>
#> 1 "12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT" 1
#> 2 "UNKNOWN" 1
#> 3 "CONTRAINDICATION, STOP" 1
#> 4 "75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal" 2
#> 5 "75\t07/01/2014\t15:01:02\tCT/MRI\tCT chest shows collapse...otherwise… 3
#> 6 "CT neg. Left s w/1mm cyst." 3
#> 7 "75\t06/13/2018\t14:30\tCardiac cath\tNormal A, EF 66%, no AS/MR" 4
summarise(lines = paste0(lines, collapse = ", ")) |>
#> record lines
#> <int> <chr>
#> 1 1 "12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT, UNKNOWN, …
#> 2 2 "75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal"
#> 3 3 "75\t07/01/2014\t15:01:02\tCT/MRI\tCT chest shows collapse...otherwise…
#> 4 4 "75\t06/13/2018\t14:30\tCardiac cath\tNormal A, EF 66%, no AS/MR"
pull(lines) |>
I() |>
readr::read_tsv(col_names = FALSE)
#> # A tibble: 4 × 5
#> X1 X2 X3 X4 X5
#> <dbl> <chr> <time> <chr> <chr>
#> 1 12315 01/01/1999 22:31:00 Other, specify - Test BILE TUCT, UNKNOWN, CONTRAIND…
#> 2 75 01/28/2021 19:34:00 ct Bilateral (unable to cannulat…
#> 3 75 07/01/2014 15:01:02 CT/MRI CT chest shows collapse...oth…
#> 4 75 06/13/2018 14:30:00 Cardiac cath Normal A, EF 66%, no AS/MR
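In the result above the date column is still character. If real dates are needed, one option is to convert it afterwards with base R's as.Date(). The snippet below is only a sketch: res is a hypothetical stand-in for the parsed result, not the actual output object.

```r
# Hypothetical stand-in for the parsed result; only the date column matters here.
res <- data.frame(
  X1 = c(12315, 75),
  X2 = c("01/01/1999", "01/28/2021"),
  stringsAsFactors = FALSE
)

# Parse "MM/DD/YYYY" strings into Date values.
res$X2 <- as.Date(res$X2, format = "%m/%d/%Y")

str(res)
```

Strings that do not match the format become NA, so counting the NA values after conversion is a cheap check for malformed dates.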
Example data:
dat <- c("12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT",
"UNKNOWN",
"CONTRAINDICATION, STOP",
"75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal",
"75\t07/01/2014\t15:01:02\tCT/MRI\tCT chest shows collapse...otherwise OKAY. ETT okay. ",
"CT neg. Left s w/1mm cyst.",
"75\t06/13/2018\t14:30\tCardiac cath\tNormal A, EF 66%, no AS/MR"
)
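For anyone who would rather avoid the dplyr / readr dependency, the same grouping idea can be sketched in base R. This is an untuned sketch under the same assumption (no tabs inside the free-text fields); the names record, merged and res are made up for illustration, and only the first four sample lines are used to keep it short.

```r
# First four lines of the sample: one record broken over three lines, one intact.
dat <- c("12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT",
         "UNKNOWN",
         "CONTRAINDICATION, STOP",
         "75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal")

# A line containing a tab starts a new record; lines without tabs inherit
# the id of the record before them.
record <- cumsum(grepl("\t", dat, fixed = TRUE))

# Glue each record's lines back together with ", ".
merged <- unname(vapply(split(dat, record), paste, character(1), collapse = ", "))

# Parse the repaired lines; quote = "" keeps quotes in free text literal.
res <- read.delim(text = merged, header = FALSE, sep = "\t",
                  quote = "", stringsAsFactors = FALSE)
res
```

split() followed by vapply() plays the role of group_by() plus summarise() here, and read.delim() stands in for readr::read_tsv(), so the columns come back as V1 to V5 rather than X1 to X5.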
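You can use read.table like below, where s is the character vector of lines (dat in the example data above):
d <- read.table(text = s, sep = "\t", fill = TRUE)
d[!rowMeans(nchar(as.matrix(d)) == 0), ]
and you will obtain a table like the following. Note that rows with empty fields, i.e. the continuation lines, are dropped rather than merged into V5: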
V1 V2 V3 V4
1 12315 01/01/1999 22:31 Other, specify - Test
4 75 01/28/2021 19:34 ct
5 75 07/01/2014 15:01:02 CT/MRI
7 75 06/13/2018 14:30 Cardiac cath
V5
1 BILE TUCT
4 Bilateral (unable to cannulate), normal
5 CT chest shows collapse...otherwise OKAY. ETT okay.
7 Normal A, EF 66%, no AS/MR