dataframe - Serializedeserialize a column with R and DuckDB - Stack Overflow-软件玩家

admin管理员组
文章数量:1410674

tldr:

I finally found the solution for my specific case thanks to @r2evans answer below:

When writing the table:

dplyr::mutate(x = sapply(.data$x, serialize, connection = NULL))

When reading the table:

dplyr::mutate(x = lapply(.data$x, \(x) htmltools::HTML(unserialize(x))))

I am trying to serialize a column in a dataframe prior to writing to a database (since the original data-type is not compatible with the database).

I then read the table back from the database and attempt to deserialize the column back to the original data, however I am unable to do this.

Reproducible example:

df <- tibble::tibble(x = 1:5, y = letters[1:5])

# A tibble: 5 × 2
      x y    
  <int> <chr>
1     1 a    
2     2 b    
3     3 c    
4     4 d    
5     5 e

input <- df |> dplyr::mutate(x = lapply(x, serialize, NULL))

# A tibble: 5 × 2
  x          y    
  <list>     <chr>
1 <raw [35]> a    
2 <raw [35]> b    
3 <raw [35]> c    
4 <raw [35]> d    
5 <raw [35]> e

conn <- duckdb::dbConnect(duckdb::duckdb(), ":memory:")

duckdb::dbWriteTable(conn, "df", input)

output <- DBI::dbGetQuery(conn, "SELECT * FROM df") |> tibble::as_tibble()

# A tibble: 5 × 2    
  x          y    
  <list>     <chr>
1 <raw [35]> a    
2 <raw [35]> b    
3 <raw [35]> c    
4 <raw [35]> d    
5 <raw [35]> e

The same (serialized) data is returned from DuckDB, however I cannot deserialize it:

output |> dplyr::mutate(x = lapply(x, unserialize))

# A tibble: 5 × 2
  x         y    
  <list>    <chr>
1 <int [1]> a    
2 <int [1]> b    
3 <int [1]> c    
4 <int [1]> d    
5 <int [1]> e

Any help would be greatly appreciated!

Edit:

The actual data contained in the column that I serialize is essentially a html string, in fact it shows:

# A tibble: 5 × 2    
  x      y    
  <list> <chr>
1 <html> a    
2 <html> b    
3 <html> c    
4 <html> d    
5 <html> e

After running dplyr::mutate(x = lapply(x, unserialize)), I get back:

# A tibble: 5 × 2    
  x      y    
  <list> <chr>
1 <raw>  a    
2 <raw>  b    
3 <raw>  c    
4 <raw>  d     
5 <raw>  e

Further info:

This is the data before and after:

> before[1]
[[1]]
<a href='data:text/csv;base64,Zm9v' download='foo.csv'><button style='background-color:#67C2DC; color:white; border:none; padding:5px;font-weight:bold; cursor:pointer; border-radius:4px; font-size:12px;'>CSV</button></a>

> after[1]
[[1]]
   [1] 58 0a 00 00 00 03 00 04 04 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 18 00 02 8c 7a 58 0a 00 00 00 03
...
 [ reached getOption("max.print") -- omitted 166065 entries ]

> class(before)
[1] "list"
> class(before[1])
[1] "list"
> class(before[1][[1]])
[1] "html"      "character"

> class(after)
[1] "list"
> class(after[1])
[1] "list"
> class(after[1][[1]])
[1] "raw"

I have tried all of the below suggestions with no luck. I consistently end up with <raw> after deserialising.

Additionally:

> unserialize(after[1][[1]])
   [1] 58 0a 00 00 00 03 00 04 04 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 18 00 02 8c 5b 58 0a 00 00 00 03
  ...

tldr:

I finally found the solution for my specific case thanks to @r2evans answer below:

When writing the table:

dplyr::mutate(x = sapply(.data$x, serialize, connection = NULL))

When reading the table:

dplyr::mutate(x = lapply(.data$x, \(x) htmltools::HTML(unserialize(x))))

I am trying to serialize a column in a dataframe prior to writing to a database (since the original data-type is not compatible with the database).

I then read the table back from the database and attempt to deserialize the column back to the original data, however I am unable to do this.

Reproducible example:

df <- tibble::tibble(x = 1:5, y = letters[1:5])

# A tibble: 5 × 2
      x y    
  <int> <chr>
1     1 a    
2     2 b    
3     3 c    
4     4 d    
5     5 e

input <- df |> dplyr::mutate(x = lapply(x, serialize, NULL))

# A tibble: 5 × 2
  x          y    
  <list>     <chr>
1 <raw [35]> a    
2 <raw [35]> b    
3 <raw [35]> c    
4 <raw [35]> d    
5 <raw [35]> e

conn <- duckdb::dbConnect(duckdb::duckdb(), ":memory:")

duckdb::dbWriteTable(conn, "df", input)

output <- DBI::dbGetQuery(conn, "SELECT * FROM df") |> tibble::as_tibble()

# A tibble: 5 × 2    
  x          y    
  <list>     <chr>
1 <raw [35]> a    
2 <raw [35]> b    
3 <raw [35]> c    
4 <raw [35]> d    
5 <raw [35]> e

The same (serialized) data is returned from DuckDB, however I cannot deserialize it:

output |> dplyr::mutate(x = lapply(x, unserialize))

# A tibble: 5 × 2
  x         y    
  <list>    <chr>
1 <int [1]> a    
2 <int [1]> b    
3 <int [1]> c    
4 <int [1]> d    
5 <int [1]> e

Any help would be greatly appreciated!

Edit:

The actual data contained in the column that I serialize is essentially a html string, in fact it shows:

# A tibble: 5 × 2    
  x      y    
  <list> <chr>
1 <html> a    
2 <html> b    
3 <html> c    
4 <html> d    
5 <html> e

After running dplyr::mutate(x = lapply(x, unserialize)), I get back:

# A tibble: 5 × 2    
  x      y    
  <list> <chr>
1 <raw>  a    
2 <raw>  b    
3 <raw>  c    
4 <raw>  d     
5 <raw>  e

Further info:

This is the data before and after:

> before[1]
[[1]]
<a href='data:text/csv;base64,Zm9v' download='foo.csv'><button style='background-color:#67C2DC; color:white; border:none; padding:5px;font-weight:bold; cursor:pointer; border-radius:4px; font-size:12px;'>CSV</button></a>

> after[1]
[[1]]
   [1] 58 0a 00 00 00 03 00 04 04 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 18 00 02 8c 7a 58 0a 00 00 00 03
...
 [ reached getOption("max.print") -- omitted 166065 entries ]

> class(before)
[1] "list"
> class(before[1])
[1] "list"
> class(before[1][[1]])
[1] "html"      "character"

> class(after)
[1] "list"
> class(after[1])
[1] "list"
> class(after[1][[1]])
[1] "raw"

I have tried all of the below suggestions with no luck. I consistently end up with <raw> after deserialising.

Additionally:

> unserialize(after[1][[1]])
   [1] 58 0a 00 00 00 03 00 04 04 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 18 00 02 8c 5b 58 0a 00 00 00 03
  ...

Share edited Mar 7 at 9:42 asked Mar 6 at 19:29 FISR 1651 silver badge9 bronze badges

Your data is being unserialized just fine. output$x is now a list-column, because you explicitly told it to be a list by using list. If you are certain that each of output$x will unserialize to an atomic object, either use sapply(x, unserialize) or use do.call(c, lapply(..)). (The latter seems obfuscating but it is actually necessary since sapply can drop some classes, e.g., POSIXt.) – r2evans Commented Mar 6 at 19:42
Bottom line, this has nothing to do with duckdb, since if you check identical(input$x[1], output$x[1]) (the raw versions) you'll find it is identical. – r2evans Commented Mar 6 at 19:42
I just included some additional details about the actual data type I'm working with – FISR Commented Mar 6 at 20:30
This is not reproducible. – r2evans Commented Mar 7 at 0:08
To be clear, df <- tibble::tibble(x = lapply(letters[1:5], htmltools::HTML), y = letters[1:5]) and repeating all of the steps before presents no difference in the end: identical(input$x, output$x) is true, and it deserializes just fine. I think you need to find what is different about your list-column of "html" objects. (My only thought is that the serialization within R changed some time ago, where pre-3 I think produced slightly different results. If you control both sides of the process with the same R version, then it's not this.) – r2evans Commented Mar 7 at 2:02

Add a comment |

1 Answer 1

Sorted by: Reset to default 3

This is just about "list-columns". Usually frames' columns are vectors of simple scalars (perhaps an over-reduction). However, it is certainly possible to include more complex objects in each "cell" of a column, this is done via "list columns", where instead of a vector (in which all objects must be the same class), we use a list (in which each cell may be length-0, length-1, more complex objects, just about anything).

If you are certain that all data in output$x reduces to a length-1 scalar, then you may be able to replace lapply (which always returns a list) with sapply:

output |> dplyr::mutate(x = sapply(x, unserialize))
# # A tibble: 5 × 2
#       x y    
#   <int> <chr>
# 1     1 a    
# 2     2 b    
# 3     3 c    
# 4     4 d    
# 5     5 e

This is not always safe, however. For a demonstration of when sapply may not be desired,

now <- Sys.time()
now + 0
# [1] "2025-03-06 14:45:43 EST"
sapply(now, `+`, 0)
# [1] 1741290344
do.call(c, lapply(now, `+`, 0))
# [1] "2025-03-06 14:45:43 EST"

So if you are certain that all the data in output$x reduces to a length-1 scalar and you may have attributes that sapply can dump, do this:

output |> dplyr::mutate(x = do.call(c, lapply(x, unserialize)))

If any of output$x may be length greater than 1, then consider tidyr::unnest:

output |>
  dplyr::mutate(x = lapply(x, unserialize)) |>
  unnest(cols = x)

Note that if any of output$x may be length-0, NULL, or similar, then the rows that contain those empty x values will be removed in the results. For a demonstration, see

set.seed(43)
output |>
  dplyr::mutate(x = lapply(x, function(z) if (runif(1) < 0.5) unserialize(z))) |>
  tidyr::unnest(cols = x)
# # A tibble: 3 × 2
#       x y    
#   <int> <chr>
# 1     1 a    
# 2     3 c    
# 3     5 e

To better deal with those, I suggest something like this:

set.seed(43)
output |>
  dplyr::mutate(
    x = lapply(x, function(z) if (runif(1) < 0.5) unserialize(z)),
    # additional step to convert length-0 to length-1 NA
    x = lapply(x, function(z) if (is.null(z)) NA else z)
  ) |>
  tidyr::unnest(cols = x)
# # A tibble: 5 × 2
#       x y    
#   <int> <chr>
# 1     1 a    
# 2    NA b    
# 3     3 c    
# 4    NA d    
# 5     5 e

本文标签： dataframeSerializedeserialize a column with R and DuckDBStack Overflow

版权声明：本文标题：dataframe - Serializedeserialize a column with R and DuckDB - Stack Overflow 内容由网友自发贡献，该文观点仅代表作者本人，转载请联系作者并注明出处：http://www.betaflare.com/web/1744953956a2634246.html，本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容，一经查实，本站将立刻删除。

编程频道|软件玩家 - 软件改变生活！

dataframe - Serializedeserialize a column with R and DuckDB - Stack Overflow

1 Answer 1

更多相关文章