admin管理员组

文章数量:1410674

tldr:

I finally found the solution for my specific case thanks to @r2evans answer below:

When writing the table:

dplyr::mutate(x = sapply(.data$x, serialize, connection = NULL))

When reading the table:

dplyr::mutate(x = lapply(.data$x, \(x) htmltools::HTML(unserialize(x))))


I am trying to serialize a column in a dataframe prior to writing to a database (since the original data-type is not compatible with the database).

I then read the table back from the database and attempt to deserialize the column back to the original data, however I am unable to do this.

Reproducible example:

df <- tibble::tibble(x = 1:5, y = letters[1:5])

# A tibble: 5 × 2
      x y    
  <int> <chr>
1     1 a    
2     2 b    
3     3 c    
4     4 d    
5     5 e    
input <- df |> dplyr::mutate(x = lapply(x, serialize, NULL))

# A tibble: 5 × 2
  x          y    
  <list>     <chr>
1 <raw [35]> a    
2 <raw [35]> b    
3 <raw [35]> c    
4 <raw [35]> d    
5 <raw [35]> e  
conn <- duckdb::dbConnect(duckdb::duckdb(), ":memory:")

duckdb::dbWriteTable(conn, "df", input)

output <- DBI::dbGetQuery(conn, "SELECT * FROM df") |> tibble::as_tibble()

# A tibble: 5 × 2    
  x          y    
  <list>     <chr>
1 <raw [35]> a    
2 <raw [35]> b    
3 <raw [35]> c    
4 <raw [35]> d    
5 <raw [35]> e   

The same (serialized) data is returned from DuckDB, however I cannot deserialize it:

output |> dplyr::mutate(x = lapply(x, unserialize))

# A tibble: 5 × 2
  x         y    
  <list>    <chr>
1 <int [1]> a    
2 <int [1]> b    
3 <int [1]> c    
4 <int [1]> d    
5 <int [1]> e  

Any help would be greatly appreciated!

Edit:

The actual data contained in the column that I serialize is essentially a html string, in fact it shows:

# A tibble: 5 × 2    
  x      y    
  <list> <chr>
1 <html> a    
2 <html> b    
3 <html> c    
4 <html> d    
5 <html> e  

After running dplyr::mutate(x = lapply(x, unserialize)), I get back:

# A tibble: 5 × 2    
  x      y    
  <list> <chr>
1 <raw>  a    
2 <raw>  b    
3 <raw>  c    
4 <raw>  d     
5 <raw>  e  

Further info:

This is the data before and after:

> before[1]
[[1]]
<a href='data:text/csv;base64,Zm9v' download='foo.csv'><button style='background-color:#67C2DC; color:white; border:none; padding:5px;font-weight:bold; cursor:pointer; border-radius:4px; font-size:12px;'>CSV</button></a>

> after[1]
[[1]]
   [1] 58 0a 00 00 00 03 00 04 04 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 18 00 02 8c 7a 58 0a 00 00 00 03
...
 [ reached getOption("max.print") -- omitted 166065 entries ]

> class(before)
[1] "list"
> class(before[1])
[1] "list"
> class(before[1][[1]])
[1] "html"      "character"

> class(after)
[1] "list"
> class(after[1])
[1] "list"
> class(after[1][[1]])
[1] "raw"

I have tried all of the below suggestions with no luck. I consistently end up with <raw> after deserialising.

Additionally:

> unserialize(after[1][[1]])
   [1] 58 0a 00 00 00 03 00 04 04 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 18 00 02 8c 5b 58 0a 00 00 00 03
  ...

tldr:

I finally found the solution for my specific case thanks to @r2evans answer below:

When writing the table:

dplyr::mutate(x = sapply(.data$x, serialize, connection = NULL))

When reading the table:

dplyr::mutate(x = lapply(.data$x, \(x) htmltools::HTML(unserialize(x))))


I am trying to serialize a column in a dataframe prior to writing to a database (since the original data-type is not compatible with the database).

I then read the table back from the database and attempt to deserialize the column back to the original data, however I am unable to do this.

Reproducible example:

df <- tibble::tibble(x = 1:5, y = letters[1:5])

# A tibble: 5 × 2
      x y    
  <int> <chr>
1     1 a    
2     2 b    
3     3 c    
4     4 d    
5     5 e    
input <- df |> dplyr::mutate(x = lapply(x, serialize, NULL))

# A tibble: 5 × 2
  x          y    
  <list>     <chr>
1 <raw [35]> a    
2 <raw [35]> b    
3 <raw [35]> c    
4 <raw [35]> d    
5 <raw [35]> e  
conn <- duckdb::dbConnect(duckdb::duckdb(), ":memory:")

duckdb::dbWriteTable(conn, "df", input)

output <- DBI::dbGetQuery(conn, "SELECT * FROM df") |> tibble::as_tibble()

# A tibble: 5 × 2    
  x          y    
  <list>     <chr>
1 <raw [35]> a    
2 <raw [35]> b    
3 <raw [35]> c    
4 <raw [35]> d    
5 <raw [35]> e   

The same (serialized) data is returned from DuckDB, however I cannot deserialize it:

output |> dplyr::mutate(x = lapply(x, unserialize))

# A tibble: 5 × 2
  x         y    
  <list>    <chr>
1 <int [1]> a    
2 <int [1]> b    
3 <int [1]> c    
4 <int [1]> d    
5 <int [1]> e  

Any help would be greatly appreciated!

Edit:

The actual data contained in the column that I serialize is essentially a html string, in fact it shows:

# A tibble: 5 × 2    
  x      y    
  <list> <chr>
1 <html> a    
2 <html> b    
3 <html> c    
4 <html> d    
5 <html> e  

After running dplyr::mutate(x = lapply(x, unserialize)), I get back:

# A tibble: 5 × 2    
  x      y    
  <list> <chr>
1 <raw>  a    
2 <raw>  b    
3 <raw>  c    
4 <raw>  d     
5 <raw>  e  

Further info:

This is the data before and after:

> before[1]
[[1]]
<a href='data:text/csv;base64,Zm9v' download='foo.csv'><button style='background-color:#67C2DC; color:white; border:none; padding:5px;font-weight:bold; cursor:pointer; border-radius:4px; font-size:12px;'>CSV</button></a>

> after[1]
[[1]]
   [1] 58 0a 00 00 00 03 00 04 04 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 18 00 02 8c 7a 58 0a 00 00 00 03
...
 [ reached getOption("max.print") -- omitted 166065 entries ]

> class(before)
[1] "list"
> class(before[1])
[1] "list"
> class(before[1][[1]])
[1] "html"      "character"

> class(after)
[1] "list"
> class(after[1])
[1] "list"
> class(after[1][[1]])
[1] "raw"

I have tried all of the below suggestions with no luck. I consistently end up with <raw> after deserialising.

Additionally:

> unserialize(after[1][[1]])
   [1] 58 0a 00 00 00 03 00 04 04 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 18 00 02 8c 5b 58 0a 00 00 00 03
  ...
Share edited Mar 7 at 9:42 FISR asked Mar 6 at 19:29 FISRFISR 1651 silver badge9 bronze badges 5
  • Your data is being unserialized just fine. output$x is now a list-column, because you explicitly told it to be a list by using list. If you are certain that each of output$x will unserialize to an atomic object, either use sapply(x, unserialize) or use do.call(c, lapply(..)). (The latter seems obfuscating but it is actually necessary since sapply can drop some classes, e.g., POSIXt.) – r2evans Commented Mar 6 at 19:42
  • Bottom line, this has nothing to do with duckdb, since if you check identical(input$x[1], output$x[1]) (the raw versions) you'll find it is identical. – r2evans Commented Mar 6 at 19:42
  • I just included some additional details about the actual data type I'm working with – FISR Commented Mar 6 at 20:30
  • This is not reproducible. – r2evans Commented Mar 7 at 0:08
  • To be clear, df <- tibble::tibble(x = lapply(letters[1:5], htmltools::HTML), y = letters[1:5]) and repeating all of the steps before presents no difference in the end: identical(input$x, output$x) is true, and it deserializes just fine. I think you need to find what is different about your list-column of "html" objects. (My only thought is that the serialization within R changed some time ago, where pre-3 I think produced slightly different results. If you control both sides of the process with the same R version, then it's not this.) – r2evans Commented Mar 7 at 2:02
Add a comment  | 

1 Answer 1

Reset to default 3

This is just about "list-columns". Usually frames' columns are vectors of simple scalars (perhaps an over-reduction). However, it is certainly possible to include more complex objects in each "cell" of a column, this is done via "list columns", where instead of a vector (in which all objects must be the same class), we use a list (in which each cell may be length-0, length-1, more complex objects, just about anything).

If you are certain that all data in output$x reduces to a length-1 scalar, then you may be able to replace lapply (which always returns a list) with sapply:

output |> dplyr::mutate(x = sapply(x, unserialize))
# # A tibble: 5 × 2
#       x y    
#   <int> <chr>
# 1     1 a    
# 2     2 b    
# 3     3 c    
# 4     4 d    
# 5     5 e    

This is not always safe, however. For a demonstration of when sapply may not be desired,

now <- Sys.time()
now + 0
# [1] "2025-03-06 14:45:43 EST"
sapply(now, `+`, 0)
# [1] 1741290344
do.call(c, lapply(now, `+`, 0))
# [1] "2025-03-06 14:45:43 EST"

So if you are certain that all the data in output$x reduces to a length-1 scalar and you may have attributes that sapply can dump, do this:

output |> dplyr::mutate(x = do.call(c, lapply(x, unserialize)))

If any of output$x may be length greater than 1, then consider tidyr::unnest:

output |>
  dplyr::mutate(x = lapply(x, unserialize)) |>
  unnest(cols = x)

Note that if any of output$x may be length-0, NULL, or similar, then the rows that contain those empty x values will be removed in the results. For a demonstration, see

set.seed(43)
output |>
  dplyr::mutate(x = lapply(x, function(z) if (runif(1) < 0.5) unserialize(z))) |>
  tidyr::unnest(cols = x)
# # A tibble: 3 × 2
#       x y    
#   <int> <chr>
# 1     1 a    
# 2     3 c    
# 3     5 e    

To better deal with those, I suggest something like this:

set.seed(43)
output |>
  dplyr::mutate(
    x = lapply(x, function(z) if (runif(1) < 0.5) unserialize(z)),
    # additional step to convert length-0 to length-1 NA
    x = lapply(x, function(z) if (is.null(z)) NA else z)
  ) |>
  tidyr::unnest(cols = x)
# # A tibble: 5 × 2
#       x y    
#   <int> <chr>
# 1     1 a    
# 2    NA b    
# 3     3 c    
# 4    NA d    
# 5     5 e    

本文标签: dataframeSerializedeserialize a column with R and DuckDBStack Overflow