admin管理员组文章数量:1134247
I'm trying to get all the file_id
items for the most recent version
for the specified folder_id
for the example_table
defined below in Cassandra. Is this possible, or is my data designed incorrectly for my use case?
example_table
folder_id (Partition Key) |
file_id (Clustering Key) |
version (timestamp Clustering Key) ORDER DESC |
---|---|---|
1 | 1 | 2 |
1 | 1 | 1 |
1 | 2 | 2 |
1 | 2 | 1 |
1 | 3 | 2 |
1 | 3 | 1 |
I'm trying to get all the file_id
items for the most recent version
for the specified folder_id
for the example_table
defined below in Cassandra. Is this possible, or is my data designed incorrectly for my use case?
example_table
folder_id (Partition Key) |
file_id (Clustering Key) |
version (timestamp Clustering Key) ORDER DESC |
---|---|---|
1 | 1 | 2 |
1 | 1 | 1 |
1 | 2 | 2 |
1 | 2 | 1 |
1 | 3 | 2 |
1 | 3 | 1 |
I'm thinking the query would be something like this, but I'm not able to get it working.
SELECT folder_id,file_id,MAX(version) FROM example_table WHERE folder_id=1;
desired result
folder_id (Partition Key) |
file_id (Clustering Key) |
version (timestamp Clustering Key) ORDER DESC |
---|---|---|
1 | 1 | 2 |
1 | 2 | 2 |
1 | 3 | 2 |
- What does the schema look like? If version is part of the primary key, you may be able get the last item depending on the clustering columns order. – stevenlacerda Commented Dec 30, 2024 at 14:55
- @stevenlacerda The schema is described in the table headers and what you're suggesting is exactly what I was trying to do. – foshoshin Commented Dec 30, 2024 at 23:38
2 Answers
Reset to default 2This may not be the best performing data model as your chosen partition key of folder_id
if/can grow infitely.
Instead, if you alterred the table schema like as below, it will only use the partial partition key to perform the scan within that portion of the partition key (as opposed to performing a full table-scan) when allow filtering is used,
token@cqlsh:chetctor> CREATE TABLE IF NOT EXISTS example_table (
folder_id int,
file_id int,
version timestamp,
PRIMARY KEY((folder_id, file_id), version)
) WITH CLUSTERING ORDER BY (version DESC)
<... other table properties ...>
;
and insert a couple records as:
token@cqlsh:chetctor> insert into example_table (folder_id , file_id , version ) VALUES ( 1,1,totimestamp(now()));
token@cqlsh:chetctor> insert into example_table (folder_id , file_id , version ) VALUES ( 1,1,totimestamp(now()));
token@cqlsh:chetctor> insert into example_table (folder_id , file_id , version ) VALUES ( 1,2,totimestamp(now()));
token@cqlsh:chetctor> insert into example_table (folder_id , file_id , version ) VALUES ( 1,2,totimestamp(now()));
token@cqlsh:chetctor> insert into example_table (folder_id , file_id , version ) VALUES ( 1,3,totimestamp(now()));
token@cqlsh:chetctor> insert into example_table (folder_id , file_id , version ) VALUES ( 1,3,totimestamp(now()));
token@cqlsh:chetctor> select * from example_table ;
folder_id | file_id | version
-----------+---------+---------------------------------
1 | 3 | 2024-12-30 15:55:05.565000+0000
1 | 3 | 2024-12-30 15:54:59.680000+0000
1 | 2 | 2024-12-30 15:54:53.628000+0000
1 | 2 | 2024-12-30 15:54:51.540000+0000
1 | 1 | 2024-12-30 15:54:45.544000+0000
1 | 1 | 2024-12-30 15:54:34.793000+0000
(6 rows)
While using ALLOW FILTERING
is always the bad choice, it may be okay here if we're querying by the partial partition key of folder_id
.
token@cqlsh:chetctor> select * from example_table where folder_id=1 per PARTITION LIMIT 1 allow FILTERING ;
folder_id | file_id | version
-----------+---------+---------------------------------
1 | 3 | 2024-12-30 15:55:05.565000+0000
1 | 2 | 2024-12-30 15:54:53.628000+0000
1 | 1 | 2024-12-30 15:54:45.544000+0000
(3 rows)
This will get us all the latest file_id
's (based on the version
timestamp value) for the given folder_id
of 1
. I hope this helps you.
References to read:
PER PARTITON LIMIT
- another better reference here
p/s: I've used DataStax Astra DB's CQL Console as a reference here to demonstrate the above and for full disclosure I am an employee at DataStax (as of this writing).
If you only ever need to retrieve the latest version of a file, Cassandra can handle this natively for you since all writes (INSERT
, UPDATE
, DELETE
) have an associated write-time.
Let me illustrate with an example. Consider this table that stores files partitioned by folder IDs, each file with an associated tag:
CREATE TABLE files_by_folderid (
folderid int,
fileid int,
tag text,
...
PRIMARY KEY (folderid, fileid)
)
When I create a new entry with tag = 'first'
, the entry is inserted together with the time it was written (shown below using the built-in WRITETIME()
CQL function):
INSERT INTO files_by_folderid (folderid, fileid, tag)
VALUES (100, 1000, 'first');
SELECT folderid, fileid, tag, WRITETIME(tag)
FROM files_by_folderid
WHERE folderid = 100;
folderid | fileid | tag | writetime(tag)
----------+--------+-------+------------------
100 | 1000 | first | 1736386878932871
If I insert a new entry for the same file but with a different tag:
INSERT INTO files_by_folderid (folderid, fileid, tag)
VALUES (100, 1000, 'second');
Retrieving the contents of the folder will return only the latest version (tag = 'second'
) since Cassandra filters all versions of a row/cell by their write timestamp:
SELECT folderid, fileid, tag, WRITETIME(tag)
FROM files_by_folderid
WHERE folderid = 100;
folderid | fileid | tag | writetime(tag)
----------+--------+--------+------------------
100 | 1000 | second | 1736386932829366
If I overwrite the tag for the third time:
INSERT INTO files_by_folderid (folderid, fileid, tag)
VALUES (100, 1000, 'third');
Cassandra will again only return the newest version of the file (row) based on the latest write timestamp:
SELECT folderid, fileid, tag, WRITETIME(tag)
FROM files_by_folderid
WHERE folderid = 100;
folderid | fileid | tag | writetime(tag)
----------+--------+-------+------------------
100 | 1000 | third | 1736386978499439
Unless you have a requirement to explicitly store multiple versions of the files, you don't need to have a complicated data model or unnecessarily store extra data. Cheers!
本文标签: cassandraQuery for the most recent file version in a folderStack Overflow
版权声明:本文标题:cassandra - Query for the most recent file version in a folder - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1736816894a1954125.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
发表评论