admin管理员组文章数量:1410674
tldr:
I finally found the solution for my specific case thanks to @r2evans answer below:
When writing the table:
dplyr::mutate(x = sapply(.data$x, serialize, connection = NULL))
When reading the table:
dplyr::mutate(x = lapply(.data$x, \(x) htmltools::HTML(unserialize(x))))
I am trying to serialize a column in a dataframe prior to writing to a database (since the original data-type is not compatible with the database).
I then read the table back from the database and attempt to deserialize the column back to the original data, however I am unable to do this.
Reproducible example:
df <- tibble::tibble(x = 1:5, y = letters[1:5])
# A tibble: 5 × 2
x y
<int> <chr>
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
input <- df |> dplyr::mutate(x = lapply(x, serialize, NULL))
# A tibble: 5 × 2
x y
<list> <chr>
1 <raw [35]> a
2 <raw [35]> b
3 <raw [35]> c
4 <raw [35]> d
5 <raw [35]> e
conn <- duckdb::dbConnect(duckdb::duckdb(), ":memory:")
duckdb::dbWriteTable(conn, "df", input)
output <- DBI::dbGetQuery(conn, "SELECT * FROM df") |> tibble::as_tibble()
# A tibble: 5 × 2
x y
<list> <chr>
1 <raw [35]> a
2 <raw [35]> b
3 <raw [35]> c
4 <raw [35]> d
5 <raw [35]> e
The same (serialized) data is returned from DuckDB, however I cannot deserialize it:
output |> dplyr::mutate(x = lapply(x, unserialize))
# A tibble: 5 × 2
x y
<list> <chr>
1 <int [1]> a
2 <int [1]> b
3 <int [1]> c
4 <int [1]> d
5 <int [1]> e
Any help would be greatly appreciated!
Edit:
The actual data contained in the column that I serialize is essentially a html string, in fact it shows:
# A tibble: 5 × 2
x y
<list> <chr>
1 <html> a
2 <html> b
3 <html> c
4 <html> d
5 <html> e
After running dplyr::mutate(x = lapply(x, unserialize))
, I get back:
# A tibble: 5 × 2
x y
<list> <chr>
1 <raw> a
2 <raw> b
3 <raw> c
4 <raw> d
5 <raw> e
Further info:
This is the data before and after:
> before[1]
[[1]]
<a href='data:text/csv;base64,Zm9v' download='foo.csv'><button style='background-color:#67C2DC; color:white; border:none; padding:5px;font-weight:bold; cursor:pointer; border-radius:4px; font-size:12px;'>CSV</button></a>
> after[1]
[[1]]
[1] 58 0a 00 00 00 03 00 04 04 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 18 00 02 8c 7a 58 0a 00 00 00 03
...
[ reached getOption("max.print") -- omitted 166065 entries ]
> class(before)
[1] "list"
> class(before[1])
[1] "list"
> class(before[1][[1]])
[1] "html" "character"
> class(after)
[1] "list"
> class(after[1])
[1] "list"
> class(after[1][[1]])
[1] "raw"
I have tried all of the below suggestions with no luck. I consistently end up with <raw>
after deserialising.
Additionally:
> unserialize(after[1][[1]])
[1] 58 0a 00 00 00 03 00 04 04 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 18 00 02 8c 5b 58 0a 00 00 00 03
...
tldr:
I finally found the solution for my specific case thanks to @r2evans answer below:
When writing the table:
dplyr::mutate(x = sapply(.data$x, serialize, connection = NULL))
When reading the table:
dplyr::mutate(x = lapply(.data$x, \(x) htmltools::HTML(unserialize(x))))
I am trying to serialize a column in a dataframe prior to writing to a database (since the original data-type is not compatible with the database).
I then read the table back from the database and attempt to deserialize the column back to the original data, however I am unable to do this.
Reproducible example:
df <- tibble::tibble(x = 1:5, y = letters[1:5])
# A tibble: 5 × 2
x y
<int> <chr>
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
input <- df |> dplyr::mutate(x = lapply(x, serialize, NULL))
# A tibble: 5 × 2
x y
<list> <chr>
1 <raw [35]> a
2 <raw [35]> b
3 <raw [35]> c
4 <raw [35]> d
5 <raw [35]> e
conn <- duckdb::dbConnect(duckdb::duckdb(), ":memory:")
duckdb::dbWriteTable(conn, "df", input)
output <- DBI::dbGetQuery(conn, "SELECT * FROM df") |> tibble::as_tibble()
# A tibble: 5 × 2
x y
<list> <chr>
1 <raw [35]> a
2 <raw [35]> b
3 <raw [35]> c
4 <raw [35]> d
5 <raw [35]> e
The same (serialized) data is returned from DuckDB, however I cannot deserialize it:
output |> dplyr::mutate(x = lapply(x, unserialize))
# A tibble: 5 × 2
x y
<list> <chr>
1 <int [1]> a
2 <int [1]> b
3 <int [1]> c
4 <int [1]> d
5 <int [1]> e
Any help would be greatly appreciated!
Edit:
The actual data contained in the column that I serialize is essentially a html string, in fact it shows:
# A tibble: 5 × 2
x y
<list> <chr>
1 <html> a
2 <html> b
3 <html> c
4 <html> d
5 <html> e
After running dplyr::mutate(x = lapply(x, unserialize))
, I get back:
# A tibble: 5 × 2
x y
<list> <chr>
1 <raw> a
2 <raw> b
3 <raw> c
4 <raw> d
5 <raw> e
Further info:
This is the data before and after:
> before[1]
[[1]]
<a href='data:text/csv;base64,Zm9v' download='foo.csv'><button style='background-color:#67C2DC; color:white; border:none; padding:5px;font-weight:bold; cursor:pointer; border-radius:4px; font-size:12px;'>CSV</button></a>
> after[1]
[[1]]
[1] 58 0a 00 00 00 03 00 04 04 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 18 00 02 8c 7a 58 0a 00 00 00 03
...
[ reached getOption("max.print") -- omitted 166065 entries ]
> class(before)
[1] "list"
> class(before[1])
[1] "list"
> class(before[1][[1]])
[1] "html" "character"
> class(after)
[1] "list"
> class(after[1])
[1] "list"
> class(after[1][[1]])
[1] "raw"
I have tried all of the below suggestions with no luck. I consistently end up with <raw>
after deserialising.
Additionally:
> unserialize(after[1][[1]])
[1] 58 0a 00 00 00 03 00 04 04 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 18 00 02 8c 5b 58 0a 00 00 00 03
...
Share
edited Mar 7 at 9:42
FISR
asked Mar 6 at 19:29
FISRFISR
1651 silver badge9 bronze badges
5
|
1 Answer
Reset to default 3This is just about "list-columns". Usually frames' columns are vectors of simple scalars (perhaps an over-reduction). However, it is certainly possible to include more complex objects in each "cell" of a column, this is done via "list columns", where instead of a vector (in which all objects must be the same class), we use a list (in which each cell may be length-0, length-1, more complex objects, just about anything).
If you are certain that all data in output$x
reduces to a length-1 scalar, then you may be able to replace lapply
(which always returns a list
) with sapply
:
output |> dplyr::mutate(x = sapply(x, unserialize))
# # A tibble: 5 × 2
# x y
# <int> <chr>
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
This is not always safe, however. For a demonstration of when sapply
may not be desired,
now <- Sys.time()
now + 0
# [1] "2025-03-06 14:45:43 EST"
sapply(now, `+`, 0)
# [1] 1741290344
do.call(c, lapply(now, `+`, 0))
# [1] "2025-03-06 14:45:43 EST"
So if you are certain that all the data in output$x
reduces to a length-1 scalar and you may have attributes that sapply
can dump, do this:
output |> dplyr::mutate(x = do.call(c, lapply(x, unserialize)))
If any of output$x
may be length greater than 1, then consider tidyr::unnest
:
output |>
dplyr::mutate(x = lapply(x, unserialize)) |>
unnest(cols = x)
Note that if any of output$x
may be length-0, NULL
, or similar, then the rows that contain those empty x
values will be removed in the results. For a demonstration, see
set.seed(43)
output |>
dplyr::mutate(x = lapply(x, function(z) if (runif(1) < 0.5) unserialize(z))) |>
tidyr::unnest(cols = x)
# # A tibble: 3 × 2
# x y
# <int> <chr>
# 1 1 a
# 2 3 c
# 3 5 e
To better deal with those, I suggest something like this:
set.seed(43)
output |>
dplyr::mutate(
x = lapply(x, function(z) if (runif(1) < 0.5) unserialize(z)),
# additional step to convert length-0 to length-1 NA
x = lapply(x, function(z) if (is.null(z)) NA else z)
) |>
tidyr::unnest(cols = x)
# # A tibble: 5 × 2
# x y
# <int> <chr>
# 1 1 a
# 2 NA b
# 3 3 c
# 4 NA d
# 5 5 e
本文标签: dataframeSerializedeserialize a column with R and DuckDBStack Overflow
版权声明:本文标题:dataframe - Serializedeserialize a column with R and DuckDB - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1744953956a2634246.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
output$x
is now a list-column, because you explicitly told it to be a list by usinglist
. If you are certain that each ofoutput$x
will unserialize to an atomic object, either usesapply(x, unserialize)
or usedo.call(c, lapply(..))
. (The latter seems obfuscating but it is actually necessary sincesapply
can drop some classes, e.g.,POSIXt
.) – r2evans Commented Mar 6 at 19:42duckdb
, since if you checkidentical(input$x[1], output$x[1])
(the raw versions) you'll find it is identical. – r2evans Commented Mar 6 at 19:42df <- tibble::tibble(x = lapply(letters[1:5], htmltools::HTML), y = letters[1:5])
and repeating all of the steps before presents no difference in the end:identical(input$x, output$x)
is true, and it deserializes just fine. I think you need to find what is different about your list-column of"html"
objects. (My only thought is that the serialization within R changed some time ago, where pre-3 I think produced slightly different results. If you control both sides of the process with the same R version, then it's not this.) – r2evans Commented Mar 7 at 2:02