data.table - Reading .DAT file with odd tab-delimited structure in r - Stack Overflow


Updated: 2025-02-030


I am trying to read in a .DAT file that is supposed to be tab-delimited, but it is messier than anticipated. The data should have five columns:

the first is a numeric value between 1 and infinity
the second is a date formatted as "MM/DD/YYYY"
the third contains time formatted as "HH:MM:SS" or "HH:MM"
the fourth contains free text
the fifth contains free text

c("12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT", 
"UNKNOWN", 
"CONTRAINDICATION, STOP", 
"75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal", 
"75\t07/01/2014\t15:01:02\tCT/MRI\tCT chest shows collapse...otherwise OKAY. ETT okay. ", 
"CT neg.  Left s w/1mm cyst.", 
"75\t06/13/2018\t14:30\tCardiac cath\tNormal A, EF 66%, no AS/MR"
)

I pulled a piece of the file from readLines.

Ideally, the output would be:

V1	V2	V3	V4	V5
12315	01/01/1999	22:31	Other, specify - Test	BILE TUCT, UNKNOWN, CONTRAINDICATION, STOP
75	01/28/2021	19:34	ct	Bilateral (unable to cannulate), normal
75	07/01/2014	15:01:02	CT/MRI	CT chest shows collapse...otherwise OKAY. ETT okay. CT neg. Left s w/1mm cyst.
75	06/13/2018	14:30	Cardiac cath	Normal A, EF 66%, no AS/MR

Asked Jan 22 at 4:11 by afleishman; edited Jan 22 at 7:11 by margusl.


2 Answers


Assuming that tabs are never used inside the free-text fields and that dat is the result of readLines() / readr::read_lines(), we could:

create a frame from lines,
use grepl("\t", lines) |> cumsum() for grouping, to collect consecutive related lines (one with tabs optionally followed by those without tabs)
and collapse lines with desired separator, ", ";
from there we can use readr::read_tsv()

library(dplyr, warn.conflicts = FALSE)

dat |> 
  tibble(lines = _) |> 
  group_by(record = grepl("\t", lines) |> cumsum()) |> 
  #>   lines                                                                   record
  #>   <chr>                                                                    <int>
  #> 1 "12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT"                 1
  #> 2 "UNKNOWN"                                                                    1
  #> 3 "CONTRAINDICATION, STOP"                                                     1
  #> 4 "75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal"         2
  #> 5 "75\t07/01/2014\t15:01:02\tCT/MRI\tCT chest shows collapse...otherwise…      3
  #> 6 "CT neg.  Left s w/1mm cyst."                                                3
  #> 7 "75\t06/13/2018\t14:30\tCardiac cath\tNormal A, EF 66%, no AS/MR"            4
  summarise(lines = paste0(lines, collapse = ", ")) |> 
  #>   record lines                                                                  
  #>    <int> <chr>                                                                  
  #> 1      1 "12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT, UNKNOWN, …
  #> 2      2 "75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal"   
  #> 3      3 "75\t07/01/2014\t15:01:02\tCT/MRI\tCT chest shows collapse...otherwise…
  #> 4      4 "75\t06/13/2018\t14:30\tCardiac cath\tNormal A, EF 66%, no AS/MR"
  pull(lines) |> 
  I() |> 
  readr::read_tsv(col_names = FALSE) 

#> # A tibble: 4 × 5
#>      X1 X2         X3       X4                    X5                            
#>   <dbl> <chr>      <time>   <chr>                 <chr>                         
#> 1 12315 01/01/1999 22:31:00 Other, specify - Test BILE TUCT, UNKNOWN, CONTRAIND…
#> 2    75 01/28/2021 19:34:00 ct                    Bilateral (unable to cannulat…
#> 3    75 07/01/2014 15:01:02 CT/MRI                CT chest shows collapse...oth…
#> 4    75 06/13/2018 14:30:00 Cardiac cath          Normal A, EF 66%, no AS/MR
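
The approach above relies on the assumption that tabs never appear inside the free-text fields. A minimal sketch of how that assumption could be checked before grouping is shown here, using the sample lines from the question; the names n_tabs and unexpected are only illustrative. Under the assumption, every line should contain either four tabs (a complete five-column record) or none (a continuation line).

```r
# Sample lines copied from the question, as returned by readLines()
dat <- c("12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT",
         "UNKNOWN",
         "CONTRAINDICATION, STOP",
         "75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal",
         "75\t07/01/2014\t15:01:02\tCT/MRI\tCT chest shows collapse...otherwise OKAY. ETT okay. ",
         "CT neg.  Left s w/1mm cyst.",
         "75\t06/13/2018\t14:30\tCardiac cath\tNormal A, EF 66%, no AS/MR")

# Count tab characters per line; gregexpr() returns -1 when there is no match
n_tabs <- vapply(gregexpr("\t", dat, fixed = TRUE),
                 function(m) sum(m > 0), integer(1))

# Lines with any other tab count would break the grouping and need a manual look
unexpected <- which(!n_tabs %in% c(0L, 4L))

# The first line must be a record line, otherwise it has nothing to attach to
first_is_record <- n_tabs[1] == 4L
```

If unexpected is non-empty on the real file, those line numbers point at rows where a tab probably does occur inside free text, and the grouping step would split them incorrectly.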

Example data:

dat <- c("12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT", 
         "UNKNOWN", 
         "CONTRAINDICATION, STOP", 
         "75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal", 
         "75\t07/01/2014\t15:01:02\tCT/MRI\tCT chest shows collapse...otherwise OKAY. ETT okay. ", 
         "CT neg.  Left s w/1mm cyst.", 
         "75\t06/13/2018\t14:30\tCardiac cath\tNormal A, EF 66%, no AS/MR"
)
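
For readers who prefer to avoid extra packages, the same grouping idea can be sketched in base R. This is a variant under the same assumption (no tabs inside free text), not the answer author's code; the names record, merged and res are illustrative. It also trims each line first, so the trailing space in the sample does not end up in front of the separator.

```r
# Sample lines copied from the question, as returned by readLines()
dat <- c("12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT",
         "UNKNOWN",
         "CONTRAINDICATION, STOP",
         "75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal",
         "75\t07/01/2014\t15:01:02\tCT/MRI\tCT chest shows collapse...otherwise OKAY. ETT okay. ",
         "CT neg.  Left s w/1mm cyst.",
         "75\t06/13/2018\t14:30\tCardiac cath\tNormal A, EF 66%, no AS/MR")

# A line containing a tab starts a new record; lines without one continue it
record <- cumsum(grepl("\t", dat, fixed = TRUE))

# Trim stray whitespace, then glue each record's lines together with ", "
merged <- vapply(split(trimws(dat), record), paste, character(1), collapse = ", ")

# Parse the merged lines; quoting is switched off so that quote characters
# in the free text are kept literally (read.delim does not treat # as a comment)
res <- read.delim(text = unname(merged), header = FALSE, sep = "\t", quote = "")
```

Here res has one row per record and columns V1 to V5, with the continuation lines appended to V5.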

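You can use read.table as shown below, where s is the character vector returned by readLines() (dat in the example data above). Note that the filter keeps only rows in which all five fields are non-empty, so the continuation lines are dropped rather than merged into the fifth column.

# fill = TRUE pads short lines with empty fields instead of raising an error
d <- read.table(text = s, sep = "\t", fill = TRUE)
# keep only the rows that have no empty field
d[!rowMeans(nchar(as.matrix(d)) == 0), ]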

and you will obtain a table like

     V1         V2       V3                    V4
1 12315 01/01/1999    22:31 Other, specify - Test
4    75 01/28/2021    19:34                    ct
5    75 07/01/2014 15:01:02                CT/MRI
7    75 06/13/2018    14:30          Cardiac cath
                                                    V5
1                                            BILE TUCT
4              Bilateral (unable to cannulate), normal
5 CT chest shows collapse...otherwise OKAY. ETT okay.
7                           Normal A, EF 66%, no AS/MR
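
In both results the date column is still text, and read.table leaves the time column as text as well. Given the formats stated in the question ("MM/DD/YYYY" and "HH:MM:SS" or "HH:MM"), a possible follow-up conversion might look like the sketch below. The small data frame only reproduces columns V2 and V3 of the sample, and the name short is illustrative.

```r
# Columns 2 and 3 as they appear in the sample output
d <- data.frame(V2 = c("01/01/1999", "01/28/2021", "07/01/2014", "06/13/2018"),
                V3 = c("22:31", "19:34", "15:01:02", "14:30"))

# Parse the date with an explicit format rather than relying on defaults
d$V2 <- as.Date(d$V2, format = "%m/%d/%Y")

# Pad HH:MM values with ":00" so every time has the HH:MM:SS shape
short <- grepl("^[0-9]{1,2}:[0-9]{2}$", d$V3)
d$V3[short] <- paste0(d$V3[short], ":00")
```

Values that do not match the date format come back as NA from as.Date(), which makes malformed rows easy to find with is.na().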
