cassandra - Query for the most recent file version in a folder - Stack Overflow

IT技术

更新时间：2025-01-146

admin管理员组
文章数量:1134247

I'm trying to get all the file_id items for the most recent version for the specified folder_id for the example_table defined below in Cassandra. Is this possible, or is my data designed incorrectly for my use case?

example_table

`folder_id` (Partition Key)	`file_id` (Clustering Key)	`version` (`timestamp` Clustering Key) ORDER DESC
1	1	2
1	1	1
1	2	2
1	2	1
1	3	2
1	3	1

I'm trying to get all the file_id items for the most recent version for the specified folder_id for the example_table defined below in Cassandra. Is this possible, or is my data designed incorrectly for my use case?

example_table

`folder_id` (Partition Key)	`file_id` (Clustering Key)	`version` (`timestamp` Clustering Key) ORDER DESC
1	1	2
1	1	1
1	2	2
1	2	1
1	3	2
1	3	1

I'm thinking the query would be something like this, but I'm not able to get it working.

SELECT folder_id,file_id,MAX(version) FROM example_table WHERE folder_id=1;

desired result

`folder_id` (Partition Key)	`file_id` (Clustering Key)	`version` (`timestamp`Clustering Key) ORDER DESC
1	1	2
1	2	2
1	3	2

Share Improve this question edited Jan 9 at 1:23 Erick Ramirez 16.2k2 gold badges21 silver badges30 bronze badges asked Dec 30, 2024 at 10:06 foshoshin 1683 silver badges12 bronze badges

What does the schema look like? If version is part of the primary key, you may be able get the last item depending on the clustering columns order. – stevenlacerda Commented Dec 30, 2024 at 14:55
@stevenlacerda The schema is described in the table headers and what you're suggesting is exactly what I was trying to do. – foshoshin Commented Dec 30, 2024 at 23:38

Add a comment |

2 Answers 2

Sorted by: Reset to default 2

This may not be the best performing data model as your chosen partition key of folder_id if/can grow infitely.

Instead, if you alterred the table schema like as below, it will only use the partial partition key to perform the scan within that portion of the partition key (as opposed to performing a full table-scan) when allow filtering is used,

token@cqlsh:chetctor> CREATE TABLE IF NOT EXISTS example_table (
  folder_id int,
  file_id int,
  version timestamp,
  PRIMARY KEY((folder_id, file_id), version)
) WITH CLUSTERING ORDER BY (version DESC)
<... other table properties ...>
;

and insert a couple records as:

token@cqlsh:chetctor> insert into example_table (folder_id , file_id , version ) VALUES ( 1,1,totimestamp(now()));
token@cqlsh:chetctor> insert into example_table (folder_id , file_id , version ) VALUES ( 1,1,totimestamp(now()));
token@cqlsh:chetctor> insert into example_table (folder_id , file_id , version ) VALUES ( 1,2,totimestamp(now()));
token@cqlsh:chetctor> insert into example_table (folder_id , file_id , version ) VALUES ( 1,2,totimestamp(now()));
token@cqlsh:chetctor> insert into example_table (folder_id , file_id , version ) VALUES ( 1,3,totimestamp(now()));
token@cqlsh:chetctor> insert into example_table (folder_id , file_id , version ) VALUES ( 1,3,totimestamp(now()));
token@cqlsh:chetctor> select * from example_table ;

 folder_id | file_id | version
-----------+---------+---------------------------------
         1 |       3 | 2024-12-30 15:55:05.565000+0000
         1 |       3 | 2024-12-30 15:54:59.680000+0000
         1 |       2 | 2024-12-30 15:54:53.628000+0000
         1 |       2 | 2024-12-30 15:54:51.540000+0000
         1 |       1 | 2024-12-30 15:54:45.544000+0000
         1 |       1 | 2024-12-30 15:54:34.793000+0000

(6 rows)

While using ALLOW FILTERING is always the bad choice, it may be okay here if we're querying by the partial partition key of folder_id.

token@cqlsh:chetctor> select * from example_table where folder_id=1 per PARTITION LIMIT 1 allow FILTERING ;

 folder_id | file_id | version
-----------+---------+---------------------------------
         1 |       3 | 2024-12-30 15:55:05.565000+0000
         1 |       2 | 2024-12-30 15:54:53.628000+0000
         1 |       1 | 2024-12-30 15:54:45.544000+0000

(3 rows)

This will get us all the latest file_id's (based on the version timestamp value) for the given folder_id of 1. I hope this helps you.

References to read:

PER PARTITON LIMIT
- another better reference here

p/s: I've used DataStax Astra DB's CQL Console as a reference here to demonstrate the above and for full disclosure I am an employee at DataStax (as of this writing).

If you only ever need to retrieve the latest version of a file, Cassandra can handle this natively for you since all writes (INSERT, UPDATE, DELETE) have an associated write-time.

Let me illustrate with an example. Consider this table that stores files partitioned by folder IDs, each file with an associated tag:

CREATE TABLE files_by_folderid (
    folderid int,
    fileid int,
    tag text,
    ...
    PRIMARY KEY (folderid, fileid)
)

When I create a new entry with tag = 'first', the entry is inserted together with the time it was written (shown below using the built-in WRITETIME() CQL function):

INSERT INTO files_by_folderid (folderid, fileid, tag)
    VALUES (100, 1000, 'first');

SELECT folderid, fileid, tag, WRITETIME(tag)
    FROM files_by_folderid
    WHERE folderid = 100;

 folderid | fileid | tag   | writetime(tag)
----------+--------+-------+------------------
      100 |   1000 | first | 1736386878932871

If I insert a new entry for the same file but with a different tag:

INSERT INTO files_by_folderid (folderid, fileid, tag)
    VALUES (100, 1000, 'second');

Retrieving the contents of the folder will return only the latest version (tag = 'second') since Cassandra filters all versions of a row/cell by their write timestamp:

SELECT folderid, fileid, tag, WRITETIME(tag)
    FROM files_by_folderid
    WHERE folderid = 100;

 folderid | fileid | tag    | writetime(tag)
----------+--------+--------+------------------
      100 |   1000 | second | 1736386932829366

If I overwrite the tag for the third time:

INSERT INTO files_by_folderid (folderid, fileid, tag)
    VALUES (100, 1000, 'third');

Cassandra will again only return the newest version of the file (row) based on the latest write timestamp:

SELECT folderid, fileid, tag, WRITETIME(tag)
    FROM files_by_folderid
    WHERE folderid = 100;

 folderid | fileid | tag   | writetime(tag)
----------+--------+-------+------------------
      100 |   1000 | third | 1736386978499439

Unless you have a requirement to explicitly store multiple versions of the files, you don't need to have a complicated data model or unnecessarily store extra data. Cheers!

本文标签： cassandraQuery for the most recent file version in a folderStack Overflow

版权声明：本文标题：cassandra - Query for the most recent file version in a folder - Stack Overflow 内容由网友自发贡献，该文观点仅代表作者本人，转载请联系作者并注明出处：http://www.betaflare.com/web/1736816894a1954125.html，本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容，一经查实，本站将立刻删除。

编程频道|软件玩家 - 软件改变生活！

cassandra - Query for the most recent file version in a folder - Stack Overflow

2 Answers 2

更多相关文章