admin管理员组

文章数量:1134247

I'm trying to get all the file_id items for the most recent version for the specified folder_id for the example_table defined below in Cassandra. Is this possible, or is my data designed incorrectly for my use case?

example_table

folder_id (Partition Key) file_id (Clustering Key) version (timestamp Clustering Key) ORDER DESC
1 1 2
1 1 1
1 2 2
1 2 1
1 3 2
1 3 1

I'm trying to get all the file_id items for the most recent version for the specified folder_id for the example_table defined below in Cassandra. Is this possible, or is my data designed incorrectly for my use case?

example_table

folder_id (Partition Key) file_id (Clustering Key) version (timestamp Clustering Key) ORDER DESC
1 1 2
1 1 1
1 2 2
1 2 1
1 3 2
1 3 1

I'm thinking the query would be something like this, but I'm not able to get it working.

SELECT folder_id,file_id,MAX(version) FROM example_table WHERE folder_id=1;

desired result

folder_id (Partition Key) file_id (Clustering Key) version (timestampClustering Key) ORDER DESC
1 1 2
1 2 2
1 3 2
Share Improve this question edited Jan 9 at 1:23 Erick Ramirez 16.2k2 gold badges21 silver badges30 bronze badges asked Dec 30, 2024 at 10:06 foshoshinfoshoshin 1683 silver badges12 bronze badges 2
  • What does the schema look like? If version is part of the primary key, you may be able get the last item depending on the clustering columns order. – stevenlacerda Commented Dec 30, 2024 at 14:55
  • @stevenlacerda The schema is described in the table headers and what you're suggesting is exactly what I was trying to do. – foshoshin Commented Dec 30, 2024 at 23:38
Add a comment  | 

2 Answers 2

Reset to default 2

This may not be the best performing data model as your chosen partition key of folder_id if/can grow infitely.

Instead, if you alterred the table schema like as below, it will only use the partial partition key to perform the scan within that portion of the partition key (as opposed to performing a full table-scan) when allow filtering is used,

token@cqlsh:chetctor> CREATE TABLE IF NOT EXISTS example_table (
  folder_id int,
  file_id int,
  version timestamp,
  PRIMARY KEY((folder_id, file_id), version)
) WITH CLUSTERING ORDER BY (version DESC)
<... other table properties ...>
;

and insert a couple records as:

token@cqlsh:chetctor> insert into example_table (folder_id , file_id , version ) VALUES ( 1,1,totimestamp(now()));
token@cqlsh:chetctor> insert into example_table (folder_id , file_id , version ) VALUES ( 1,1,totimestamp(now()));
token@cqlsh:chetctor> insert into example_table (folder_id , file_id , version ) VALUES ( 1,2,totimestamp(now()));
token@cqlsh:chetctor> insert into example_table (folder_id , file_id , version ) VALUES ( 1,2,totimestamp(now()));
token@cqlsh:chetctor> insert into example_table (folder_id , file_id , version ) VALUES ( 1,3,totimestamp(now()));
token@cqlsh:chetctor> insert into example_table (folder_id , file_id , version ) VALUES ( 1,3,totimestamp(now()));
token@cqlsh:chetctor> select * from example_table ;

 folder_id | file_id | version
-----------+---------+---------------------------------
         1 |       3 | 2024-12-30 15:55:05.565000+0000
         1 |       3 | 2024-12-30 15:54:59.680000+0000
         1 |       2 | 2024-12-30 15:54:53.628000+0000
         1 |       2 | 2024-12-30 15:54:51.540000+0000
         1 |       1 | 2024-12-30 15:54:45.544000+0000
         1 |       1 | 2024-12-30 15:54:34.793000+0000

(6 rows)

While using ALLOW FILTERING is always the bad choice, it may be okay here if we're querying by the partial partition key of folder_id.

token@cqlsh:chetctor> select * from example_table where folder_id=1 per PARTITION LIMIT 1 allow FILTERING ;

 folder_id | file_id | version
-----------+---------+---------------------------------
         1 |       3 | 2024-12-30 15:55:05.565000+0000
         1 |       2 | 2024-12-30 15:54:53.628000+0000
         1 |       1 | 2024-12-30 15:54:45.544000+0000

(3 rows)

This will get us all the latest file_id's (based on the version timestamp value) for the given folder_id of 1. I hope this helps you.

References to read:

  • PER PARTITON LIMIT
    • another better reference here

p/s: I've used DataStax Astra DB's CQL Console as a reference here to demonstrate the above and for full disclosure I am an employee at DataStax (as of this writing).

If you only ever need to retrieve the latest version of a file, Cassandra can handle this natively for you since all writes (INSERT, UPDATE, DELETE) have an associated write-time.

Let me illustrate with an example. Consider this table that stores files partitioned by folder IDs, each file with an associated tag:

CREATE TABLE files_by_folderid (
    folderid int,
    fileid int,
    tag text,
    ...
    PRIMARY KEY (folderid, fileid)
)

When I create a new entry with tag = 'first', the entry is inserted together with the time it was written (shown below using the built-in WRITETIME() CQL function):

INSERT INTO files_by_folderid (folderid, fileid, tag)
    VALUES (100, 1000, 'first');
SELECT folderid, fileid, tag, WRITETIME(tag)
    FROM files_by_folderid
    WHERE folderid = 100;

 folderid | fileid | tag   | writetime(tag)
----------+--------+-------+------------------
      100 |   1000 | first | 1736386878932871

If I insert a new entry for the same file but with a different tag:

INSERT INTO files_by_folderid (folderid, fileid, tag)
    VALUES (100, 1000, 'second');

Retrieving the contents of the folder will return only the latest version (tag = 'second') since Cassandra filters all versions of a row/cell by their write timestamp:

SELECT folderid, fileid, tag, WRITETIME(tag)
    FROM files_by_folderid
    WHERE folderid = 100;

 folderid | fileid | tag    | writetime(tag)
----------+--------+--------+------------------
      100 |   1000 | second | 1736386932829366

If I overwrite the tag for the third time:

INSERT INTO files_by_folderid (folderid, fileid, tag)
    VALUES (100, 1000, 'third');

Cassandra will again only return the newest version of the file (row) based on the latest write timestamp:

SELECT folderid, fileid, tag, WRITETIME(tag)
    FROM files_by_folderid
    WHERE folderid = 100;

 folderid | fileid | tag   | writetime(tag)
----------+--------+-------+------------------
      100 |   1000 | third | 1736386978499439

Unless you have a requirement to explicitly store multiple versions of the files, you don't need to have a complicated data model or unnecessarily store extra data. Cheers!

本文标签: cassandraQuery for the most recent file version in a folderStack Overflow