Reading large multi-part table from file and combining its parts into one tibble
I have a huge file (> 25 MB) with the following structure:
ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
14748
ITEM: BOX BOUNDS ss ss ss
-1.3314357502021994e+02 1.1517122459132779e+02
-1.3499049172495816e+02 1.5089532983410831e+02
-4.1105766276524565e+02 4.1825892946863183e+02
ITEM: ATOMS id x y z c_q[1] c_q[2] c_q[3] c_q[4] c_shape[1] c_shape[2] c_shape[3] mol q type
40 -48.7871 -80.7758 -383.704 0 0 0 0 0 0 0 1 1 2
41 -51.0027 -77.7928 -384.047 0 0 0 0 0 0 0 1 0 39
42 -52.2822 -74.6121 -381.792 0 0 0 0 0 0 0 1 1 22
43 -53.6295 -75.5845 -378.205 0 0 0 0 0 0 0 1 0 35
44 -57.267 -76.4807 -378.221 0 0 0 0 0 0 0 1 0 28
45 -57.4497 -77.4334 -381.768 0 0 0 0 0 0 0 1 0 37
46 -56.4146 -80.8904 -381.67 0 0 0 0 0 0 0 1 0 40
47 -58.4181 -80.8895 -378.383 0 0 0 0 0 0 0 1 0 21
48 -61.5076 -80.3974 -380.283 0 0 0 0 0 0 0 1 0 31
49 -60.9285 -83.196 -382.86 0 0 0 0 0 0 0 1 1 22
50 -59.8865 -85.4222 -379.651 0 0 0 0 0 0 0 1 -1 27
51 -63.3418 -84.4792 -378.177 0 0 0 0 0 0 0 1 0 30
52 -64.9464 -85.538 -381.729 0 0 0 0 0 0 0 1 1 22
53 -62.8596 -88.8378 -381.581 0 0 0 0 0 0 0 1 1 22
54 -63.7873 -90.0464 -377.83 0 0 0 0 0 0 0 1 0 39
55 -67.4433 -88.4767 -378.467 0 0 0 0 0 0 0 1 0 26
56 -67.8255 -91.1638 -381.32 0 0 0 0 0 0 0 1 1 32
57 -66.194 -94.1545 -379.992 0 0 0 0 0 0 0 1 0 36
58 -68.004 -96.1613 -376.969 0 0 0 0 0 0 0 1 0 37
59 -65.5268 -97.9105 -375.094 0 0 0 0 0 0 0 1 -1 27
60 -65.3757 -98.4001 -371.262 0 0 0 0 0 0 0 1 0 31
...
17627 49.6773 50.5787 392.185 0 0 0 0 0 0 0 1992 -1 42
17628 45.488 56.5193 406.41 0 0 0 0 0 0 0 1992 -1 42
There are further sections for subsequent time steps, and they all start the same way; e.g., this is the next one:
ITEM: TIMESTEP
100000
ITEM: NUMBER OF ATOMS
14748
ITEM: BOX BOUNDS ss ss ss
-1.4837772831937818e+02 1.3859288720579670e+02
-1.4339607509329937e+02 1.6977859136366405e+02
-4.8020962419356266e+02 5.1578309694876850e+02
ITEM: ATOMS id x y z c_q[1] c_q[2] c_q[3] c_q[4] c_shape[1] c_shape[2] c_shape[3] mol q type
40 -70.2744 -123.319 -430.593 0 0 0 0 0 0 0 1 1 2
41 -71.6868 -119.878 -432.341 0 0 0 0 0 0 0 1 0 39
42 -70.2556 -116.45 -430.652 0 0 0 0 0 0 0 1 1 22
43 -71.1458 -115.417 -427.107 0 0 0 0 0 0 0 1 0 35
44 -74.6494 -113.816 -426.925 0 0 0 0 0 0 0 1 0 28
45 -76.0197 -115.121 -429.907 0 0 0 0 0 0 0 1 0 37
46 -77.4939 -118.11 -428.852 0 0 0 0 0 0 0 1 0 40
47 -78.461 -116.067 -425.641 0 0 0 0 0 0 0 1 0 21
48 -80.6403 -113.891 -427.878 0 0 0 0 0 0 0 1 0 31
49 -82.5667 -116.737 -429.45 0 0 0 0 0 0 0 1 1 22
50 -82.8208 -118.067 -425.623 0 0 0 0 0 0 0 1 -1 27
51 -84.3263 -114.418 -424.728 0 0 0 0 0 0 0 1 0 30
52 -86.7863 -115.369 -427.978 0 0 0 0 0 0 0 1 1 22
53 -87.636 -118.593 -426.541 0 0 0 0 0 0 0 1 1 22
54 -88.4756 -117.592 -422.645 0 0 0 0 0 0 0 1 0 39
55 -90.0285 -114.297 -424.2 0 0 0 0 0 0 0 1 0 26
56 -92.7283 -116.552 -425.923 0 0 0 0 0 0 0 1 1 32
57 -93.5545 -119.262 -423.443 0 0 0 0 0 0 0 1 0 36
58 -95.5213 -118.131 -420.219 0 0 0 0 0 0 0 1 0 37
59 -94.5936 -120.806 -417.701 0 0 0 0 0 0 0 1 -1 27
60 -94.3178 -120.163 -413.806 0 0 0 0 0 0 0 1 0 31
...
I want to read the entire file into one big tibble with time_step as an additional column, i.e. with colnames: time_step, ATOMS_id, x, y, z, c_q[1], c_q[2], c_q[3], c_q[4], c_shape[1], c_shape[2], c_shape[3], mol, q, type. (The NUMBER OF ATOMS and BOX BOUNDS parts are not important here.)
I can do it "manually" with something like
library(dplyr)
library(stringr)

all_dumps <- tibble(
  condition = character(),
  timestep = numeric(),
  atom_id = numeric(),
  atom_type = numeric(),
  mol_id = numeric(),
  x = numeric(),
  y = numeric(),
  z = numeric()
)

raw_lines <- readLines('dna_temper_A.dump')
line_count_in_block <- 0  # reset whenever an "ITEM: TIMESTEP" line is hit

for (line in raw_lines) {
  if (line != "ITEM: TIMESTEP") {
    line_count_in_block <- line_count_in_block + 1
    if (line_count_in_block == 1) {
      time_step <- line
    } else if (line_count_in_block > 8) {
      line_split <- unlist(str_split(line, ' ')) %>% as.numeric()
      all_dumps <- add_row(
        all_dumps,
        timestep = as.numeric(time_step),
        atom_id = line_split[1],
        atom_type = line_split[14],
        mol_id = line_split[12],
        x = line_split[2],
        y = line_split[3],
        z = line_split[4]
      )
    }
  } else {
    line_count_in_block <- 0
  }
}
(keeping only those columns I really want to have later on.)
However, this is super slow/inefficient! Is there a better way of doing this (in true tidyr style)?
2 Answers
We assume that
- each group of lines starts with a line containing TIMESTEP,
- the next line contains its value, and
- the header and data start at the line containing "id" within the group.
g defines the groups, and by splits the lines into groups and applies the anonymous function shown to each group, creating a list of data.tables. rbindlist then makes a single data.table from them.
A data.table is a data.frame, but if you prefer that the result just be a data.frame, add the data.table = FALSE argument to fread.
library(data.table)
L <- readLines("mce1.dat")
# one group per TIMESTEP block
g <- cumsum(grepl("TIMESTEP", L))
dat <- rbindlist(by(L, g, function(x) {
  ix <- grep("id", x)                            # line holding the column header
  y <- c(x[ix], paste(x[2], tail(x, -ix)))       # prepend the timestep value (x[2]) to each data line
  y[1] <- sub("ITEM:.*ATOMS", "Timestep", y[1])  # turn the header into "Timestep id x y z ..."
  fread(text = y)
}))
dat
Using the file generated in the Note at the end, this code gives:
Timestep id x y z c_q[1] c_q[2] c_q[3] c_q[4] c_shape[1] c_shape[2] c_shape[3] mol q type
<int> <int> <num> <num> <num> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1: 0 40 -48.7871 -80.7758 -383.704 0 0 0 0 0 0 0 1 1 2
2: 0 41 -51.0027 -77.7928 -384.047 0 0 0 0 0 0 0 1 0 39
3: 0 42 -52.2822 -74.6121 -381.792 0 0 0 0 0 0 0 1 1 22
4: 100000 40 -70.2744 -123.3190 -430.593 0 0 0 0 0 0 0 1 1 2
5: 100000 41 -71.6868 -119.8780 -432.341 0 0 0 0 0 0 0 1 0 39
6: 100000 42 -70.2556 -116.4500 -430.652 0 0 0 0 0 0 0 1 1 22
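Since the question says that only a few of these columns are actually needed later on, a possible follow-up step is to subset the combined table. This is just a sketch using ordinary data.table column selection (my addition, not part of the answer above); fread() also has a select= argument if you prefer to drop columns already at read time.
# keep only the columns of interest (hypothetical follow-up, not in the original answer)
dat_small <- dat[, .(Timestep, id, x, y, z, mol, q, type)]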
Note
Generate mce1.dat test file.
Lines <- "ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
14748
ITEM: BOX BOUNDS ss ss ss
-1.3314357502021994e+02 1.1517122459132779e+02
-1.3499049172495816e+02 1.5089532983410831e+02
-4.1105766276524565e+02 4.1825892946863183e+02
ITEM: ATOMS id x y z c_q[1] c_q[2] c_q[3] c_q[4] c_shape[1] c_shape[2] c_shape[3] mol q type
40 -48.7871 -80.7758 -383.704 0 0 0 0 0 0 0 1 1 2
41 -51.0027 -77.7928 -384.047 0 0 0 0 0 0 0 1 0 39
42 -52.2822 -74.6121 -381.792 0 0 0 0 0 0 0 1 1 22
ITEM: TIMESTEP
100000
ITEM: NUMBER OF ATOMS
14748
ITEM: BOX BOUNDS ss ss ss
-1.4837772831937818e+02 1.3859288720579670e+02
-1.4339607509329937e+02 1.6977859136366405e+02
-4.8020962419356266e+02 5.1578309694876850e+02
ITEM: ATOMS id x y z c_q[1] c_q[2] c_q[3] c_q[4] c_shape[1] c_shape[2] c_shape[3] mol q type
40 -70.2744 -123.319 -430.593 0 0 0 0 0 0 0 1 1 2
41 -71.6868 -119.878 -432.341 0 0 0 0 0 0 0 1 0 39
42 -70.2556 -116.45 -430.652 0 0 0 0 0 0 0 1 1 22"
cat(Lines, file = "mce1.dat")
You can do it without loops by finding the indices of relevant landmarks, then reading in data all at once using those indices. I used base R, but you could easily swap in stringr, readr, and dplyr equivalents.
# readLines() already returns one element per line, so no further splitting is needed
lines <- readLines("dna_temper_A.dump")

### get landmark and data indices
loc_timestep <- grep("ITEM: TIMESTEP", lines)
loc_atoms <- grep("ITEM: ATOMS id x y z", lines)
loc_starts <- loc_atoms + 1
# assumes each table ends before the next "TIMESTEP",
# except the last table, which ends at the end of the file
loc_ends <- c(loc_timestep[-1] - 1, length(lines))
loc_tbl_lines <- sequence(loc_ends - loc_starts + 1, from = loc_starts)

### read in ATOMS data
tbl_lines <- lines[loc_tbl_lines] |>
  paste(collapse = "\n")
col_names <- gsub("ITEM: ATOMS ", "", lines[[loc_atoms[[1]]]])
col_names <- strsplit(col_names, "\\s+")[[1]]
dat <- read.table(text = tbl_lines, col.names = col_names)

### add TIMESTEPs
len_tbl <- loc_ends - loc_starts + 1
timestep <- lines[loc_timestep + 1] |>
  rep(len_tbl)
dat <- cbind(timestep, dat)
Result:
#> head(dat)
timestep id x y z c_q.1. c_q.2. c_q.3. c_q.4. c_shape.1.
1 0 40 -48.7871 -80.7758 -383.704 0 0 0 0 0
2 0 41 -51.0027 -77.7928 -384.047 0 0 0 0 0
3 0 42 -52.2822 -74.6121 -381.792 0 0 0 0 0
4 0 43 -53.6295 -75.5845 -378.205 0 0 0 0 0
5 0 44 -57.2670 -76.4807 -378.221 0 0 0 0 0
6 0 45 -57.4497 -77.4334 -381.768 0 0 0 0 0
c_shape.2. c_shape.3. mol q type
1 0 0 1 1 2
2 0 0 1 0 39
3 0 0 1 1 22
4 0 0 1 0 35
5 0 0 1 0 28
6 0 0 1 0 37
#> tail(dat)
timestep id x y z c_q.1. c_q.2. c_q.3. c_q.4.
39 100000 55 -90.0285 -114.297 -424.200 0 0 0 0
40 100000 56 -92.7283 -116.552 -425.923 0 0 0 0
41 100000 57 -93.5545 -119.262 -423.443 0 0 0 0
42 100000 58 -95.5213 -118.131 -420.219 0 0 0 0
43 100000 59 -94.5936 -120.806 -417.701 0 0 0 0
44 100000 60 -94.3178 -120.163 -413.806 0 0 0 0
c_shape.1. c_shape.2. c_shape.3. mol q type
39 0 0 0 1 0 26
40 0 0 0 1 1 32
41 0 0 0 1 0 36
42 0 0 0 1 0 37
43 0 0 0 1 -1 27
44 0 0 0 1 0 31
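For completeness, here is a rough tidyverse version of the same landmark-index idea, since the answer above notes that stringr, readr, and dplyr equivalents could be swapped in. This is my own untested sketch (it assumes readr >= 2.0 for the I() literal-data input and reasonably recent dplyr/purrr/stringr), not code from either answer:
library(dplyr)
library(purrr)
library(readr)
library(stringr)

# untested tidyverse sketch of the landmark approach used in the base R answer
lines <- read_lines("dna_temper_A.dump")

loc_timestep <- str_which(lines, "ITEM: TIMESTEP")
loc_atoms <- str_which(lines, "ITEM: ATOMS")
loc_ends <- c(loc_timestep[-1] - 1, length(lines))

# column names come from the first "ITEM: ATOMS" header line
col_names <- str_remove(lines[loc_atoms[1]], "^ITEM: ATOMS ") |>
  str_split(" ") |>
  unlist()

# read each block as whitespace-separated columns and prepend its time step
dat <- pmap(
  list(loc_atoms + 1, loc_ends, lines[loc_timestep + 1]),
  function(start, end, ts) {
    read_table(I(paste(lines[start:end], collapse = "\n")),
               col_names = col_names) |>
      mutate(time_step = as.numeric(ts), .before = 1)
  }
) |>
  bind_rows()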