I am trying to read in a .DAT file that is supposed to be tab-delimited, but it is messier than anticipated. The data should have five columns:

  1. a numeric value of 1 or greater
  2. a date formatted as "MM/DD/YYYY"
  3. a time formatted as "HH:MM:SS" or "HH:MM"
  4. free text
  5. free text

c("12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT", 
"UNKNOWN", 
"CONTRAINDICATION, STOP", 
"75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal", 
"75\t07/01/2014\t15:01:02\tCT/MRI\tCT chest shows collapse...otherwise OKAY. ETT okay. ", 
"CT neg.  Left s w/1mm cyst.", 
"75\t06/13/2018\t14:30\tCardiac cath\tNormal A, EF 66%, no AS/MR"
)

I pulled this piece of the file with readLines(). As the sample shows, some records are split across several lines.

Ideally, the output would be:

V1     V2          V3        V4                     V5
12315  01/01/1999  22:31     Other, specify - Test  BILE TUCT, UNKNOWN, CONTRAINDICATION, STOP
75     01/28/2021  19:34     ct                     Bilateral (unable to cannulate), normal
75     07/01/2014  15:01:02  CT/MRI                 CT chest shows collapse...otherwise OKAY. ETT okay. CT neg. Left s w/1mm cyst.
75     06/13/2018  14:30     Cardiac cath           Normal A, EF 66%, no AS/MR

Asked Jan 22 at 4:11 by afleishman; edited Jan 22 at 7:11 by margusl.

2 Answers

Assuming that tabs never occur inside the free-text fields and that dat is the result of readLines() / readr::read_lines(), we could:

  • create a frame from the lines,
  • use grepl("\t", lines) |> cumsum() as a grouping key, so that each line containing tabs is grouped with the tab-free lines that follow it,
  • collapse each group's lines with the desired separator, ", ",
  • and parse the result with readr::read_tsv().

The #> comments inside the pipeline show the intermediate result after each step. The _ placeholder requires R 4.2 or later.

library(dplyr, warn.conflicts = FALSE)

dat |> 
  tibble(lines = _) |> 
  group_by(record = grepl("\t", lines) |> cumsum()) |> 
  #>   lines                                                                   record
  #>   <chr>                                                                    <int>
  #> 1 "12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT"                 1
  #> 2 "UNKNOWN"                                                                    1
  #> 3 "CONTRAINDICATION, STOP"                                                     1
  #> 4 "75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal"         2
  #> 5 "75\t07/01/2014\t15:01:02\tCT/MRI\tCT chest shows collapse...otherwise…      3
  #> 6 "CT neg.  Left s w/1mm cyst."                                                3
  #> 7 "75\t06/13/2018\t14:30\tCardiac cath\tNormal A, EF 66%, no AS/MR"            4
  summarise(lines = paste0(lines, collapse = ", ")) |> 
  #>   record lines                                                                  
  #>    <int> <chr>                                                                  
  #> 1      1 "12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT, UNKNOWN, …
  #> 2      2 "75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal"   
  #> 3      3 "75\t07/01/2014\t15:01:02\tCT/MRI\tCT chest shows collapse...otherwise…
  #> 4      4 "75\t06/13/2018\t14:30\tCardiac cath\tNormal A, EF 66%, no AS/MR"
  pull(lines) |> 
  I() |> 
  readr::read_tsv(col_names = FALSE) 

#> # A tibble: 4 × 5
#>      X1 X2         X3       X4                    X5                            
#>   <dbl> <chr>      <time>   <chr>                 <chr>                         
#> 1 12315 01/01/1999 22:31:00 Other, specify - Test BILE TUCT, UNKNOWN, CONTRAIND…
#> 2    75 01/28/2021 19:34:00 ct                    Bilateral (unable to cannulat…
#> 3    75 07/01/2014 15:01:02 CT/MRI                CT chest shows collapse...oth…
#> 4    75 06/13/2018 14:30:00 Cardiac cath          Normal A, EF 66%, no AS/MR
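
The assumption stated at the top of this answer (no tabs inside free text) can be checked before parsing. A minimal sketch in base R, using three of the question's lines as the sample; the names n_tabs and suspect are only illustrative:

```r
# Sample lines from the question; in practice dat would come from readLines()
dat <- c(
  "12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT",
  "UNKNOWN",
  "75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal"
)

# Count the tab characters in each line
n_tabs <- lengths(regmatches(dat, gregexpr("\t", dat, fixed = TRUE)))

# A record start should have 4 tabs (5 fields); a continuation line has none.
# Any other count breaks the assumption and deserves a manual look.
suspect <- which(!n_tabs %in% c(0L, 4L))

table(n_tabs)   # how many lines have each tab count
dat[suspect]    # lines that fit neither pattern (none in this sample)
```

If suspect is non-empty for the real file, the grouping on grepl("\t", lines) would likely start a new record in the wrong place, so those lines would need to be repaired first.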

Example data:

dat <- c("12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT", 
         "UNKNOWN", 
         "CONTRAINDICATION, STOP", 
         "75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal", 
         "75\t07/01/2014\t15:01:02\tCT/MRI\tCT chest shows collapse...otherwise OKAY. ETT okay. ", 
         "CT neg.  Left s w/1mm cyst.", 
         "75\t06/13/2018\t14:30\tCardiac cath\tNormal A, EF 66%, no AS/MR"
)
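
If dplyr and readr are not available, the same grouping idea can be sketched in base R. This is only a sketch under the same assumption (no tabs inside free text). The quote = "" and comment.char = "" settings and the stripping of trailing spaces are my own additions, not part of the pipeline above:

```r
# Sample lines as in the question; in practice dat would come from readLines()
dat <- c(
  "12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT",
  "UNKNOWN",
  "CONTRAINDICATION, STOP",
  "75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal",
  "75\t07/01/2014\t15:01:02\tCT/MRI\tCT chest shows collapse...otherwise OKAY. ETT okay. ",
  "CT neg.  Left s w/1mm cyst.",
  "75\t06/13/2018\t14:30\tCardiac cath\tNormal A, EF 66%, no AS/MR"
)

# A line containing a tab starts a new record; tab-free lines continue it
record <- cumsum(grepl("\t", dat, fixed = TRUE))

# Strip trailing spaces only (not tabs, so empty trailing fields survive),
# then glue the lines of each record together with ", "
cleaned <- sub(" +$", "", dat)
merged <- vapply(split(cleaned, record), paste, character(1), collapse = ", ")

# Parse the repaired lines; quoting and comment handling are switched off
# because the free-text columns may contain quote characters or "#"
df <- read.delim(text = merged, header = FALSE, sep = "\t",
                 quote = "", comment.char = "", stringsAsFactors = FALSE)
df
```

Here the date and time columns come back as character; as.Date(df$V2, format = "%m/%d/%Y") would convert the date column if needed.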

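Alternatively, you can use read.table() with fill = TRUE and then keep only the rows that have no empty fields. Here s is the character vector of lines, e.g. the result of readLines():

d <- read.table(text = s, sep = "\t", fill = TRUE)
d[!rowMeans(nchar(as.matrix(d)) == 0), ]

This gives the table below. Note that the continuation lines are dropped rather than appended to V5: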

     V1         V2       V3                    V4
1 12315 01/01/1999    22:31 Other, specify - Test
4    75 01/28/2021    19:34                    ct
5    75 07/01/2014 15:01:02                CT/MRI
7    75 06/13/2018    14:30          Cardiac cath
                                                    V5
1                                            BILE TUCT
4              Bilateral (unable to cannulate), normal
5 CT chest shows collapse...otherwise OKAY. ETT okay.
7                           Normal A, EF 66%, no AS/MR
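
The row filter is fairly terse, so here is a sketch of a more explicit variant on the question's sample. Because the stray lines land in V1, that column is read as character, so the sketch also re-infers the column types after filtering. The quote = "" and comment.char = "" arguments, the type.convert() step, and the names complete and out are my additions and are not part of the two lines above:

```r
# Sample lines from the question; in practice s would come from readLines()
s <- c(
  "12315\t01/01/1999\t22:31\tOther, specify - Test\tBILE TUCT",
  "UNKNOWN",
  "CONTRAINDICATION, STOP",
  "75\t01/28/2021\t19:34\tct\tBilateral (unable to cannulate), normal",
  "75\t07/01/2014\t15:01:02\tCT/MRI\tCT chest shows collapse...otherwise OKAY. ETT okay. ",
  "CT neg.  Left s w/1mm cyst.",
  "75\t06/13/2018\t14:30\tCardiac cath\tNormal A, EF 66%, no AS/MR"
)

# fill = TRUE pads short lines with empty strings instead of failing
d <- read.table(text = s, sep = "\t", fill = TRUE,
                quote = "", comment.char = "")

# A row is a complete record when none of its fields is empty
complete <- rowSums(as.matrix(d) == "") == 0

# Keep the complete rows and let R re-infer types (V1 becomes integer again)
out <- type.convert(d[complete, ], as.is = TRUE)
str(out)
```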
