In a three-node Cassandra cluster, I am consistently running into the same fatal situation on tables that are written exclusively with Cassandra's lightweight transactions (CAS).
Whenever a lightweight transaction fails to reach quorum (only 1 of the 2 required replicas responds), e.g. due to high load, every subsequent attempt to write data within a transaction fails, i.e. it does not return "[applied]"=true.
Using select * from system.paxos where cf_id=<id of table>, I see that there are entries, which I assume to be pending transactions.
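As a rough illustration of what such a pending entry would mean, here is a simplified Python model of one partition's Paxos state on a single replica. The class PaxosRow and its fields are hypothetical, ballots are plain integers rather than Cassandra's timeuuids, and the sketch does not read the real system.paxos table; it only shows the rule that an accepted proposal whose ballot is newer than the last commit is still unfinished.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PaxosRow:
    """Hypothetical, simplified Paxos state of one partition on one replica.

    Ballots are plain integers here; Cassandra uses timeuuids instead.
    """
    promised_ballot: int              # highest ballot this replica promised
    accepted_ballot: Optional[int]    # ballot of the last accepted proposal, if any
    committed_ballot: Optional[int]   # ballot of the most recent commit, if any


def has_uncommitted_proposal(row: PaxosRow) -> bool:
    """True if a proposal was accepted on this replica but never committed.

    In Paxos, a coordinator that learns of such a proposal during prepare must
    finish it (re-propose and commit it) before its own value can win.
    """
    if row.accepted_ballot is None:
        return False
    if row.committed_ballot is None:
        return True
    return row.accepted_ballot > row.committed_ballot


# Accepted at ballot 7, last commit at ballot 5: still pending.
print(has_uncommitted_proposal(PaxosRow(promised_ballot=7, accepted_ballot=7, committed_ballot=5)))  # True
# The accepted proposal has been committed: nothing pending.
print(has_uncommitted_proposal(PaxosRow(promised_ballot=7, accepted_ballot=7, committed_ballot=7)))  # False
```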
Further, in /var/log/Cassandra/system.log I see logs like:
INFO [ScheduledTasks:1] 2025-01-12 21:46:53,005 UncommittedTableData.java:567 - \
Scheduling uncommitted paxos data merge task for <any other table>
INFO [OptionalTasks:1] 2025-01-12 21:46:53,006 PaxosCleanupLocalCoordinator.java:89 - \
Completing uncommitted paxos instances for <table in stalled state> on ranges
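To see how often these messages recur, and for which tables, a short Python filter over system.log may help. This is a sketch assuming the single-line layout shown above (level, [thread], timestamp, SourceFile.java:line - message); the function paxos_messages and the sample table names ks.other_table and ks.stalled_table are hypothetical.

```python
import re
from collections import Counter

# Assumed layout of one system.log line, matching the excerpts above:
# LEVEL [thread] yyyy-MM-dd HH:mm:ss,SSS SourceFile.java:line - message
LOG_LINE = re.compile(
    r"^(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<source>[\w$.]+):(?P<lineno>\d+) - (?P<message>.*)$"
)


def paxos_messages(lines):
    """Yield (timestamp, source file, message) for each parsed line mentioning paxos."""
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and "paxos" in match.group("message").lower():
            yield (f"{match.group('date')} {match.group('time')}",
                   match.group("source"),
                   match.group("message"))


sample = [
    "INFO  [ScheduledTasks:1] 2025-01-12 21:46:53,005 UncommittedTableData.java:567 - "
    "Scheduling uncommitted paxos data merge task for ks.other_table",
    "INFO  [OptionalTasks:1] 2025-01-12 21:46:53,006 PaxosCleanupLocalCoordinator.java:89 - "
    "Completing uncommitted paxos instances for ks.stalled_table on ranges ...",
    "INFO  [CompactionExecutor:2] 2025-01-12 21:47:00,000 CompactionTask.java:241 - Compacted ...",
]

# Count how often each source file logs something about paxos.
counts = Counter(source for _, source, _ in paxos_messages(sample))
print(counts)
```

Against the real log, pass an open file handle for system.log instead of sample; the lines are then read lazily, one at a time.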
However, I can't figure out how to resolve this state: neither nodetool repair -full <keyspace> (and variations) nor restarting all nodes resolved the issue.
Further information:
- Cassandra version: 4.1.5
- replication strategy: SimpleStrategy
- replication factor: 3
1 Answer
Lightweight transactions (LWTs) are expensive operations because they require a read-before-write: the data must be read to evaluate the IF condition in the statement before the write is executed.
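The read-check-write shape of a conditional statement can be sketched with a toy in-memory table in Python. ToyTable and its methods are hypothetical and deliberately ignore replication and Paxos; the returned pair only loosely mirrors the [applied] flag and the current values that cqlsh shows when a condition is not met.

```python
from typing import Any, Dict, Tuple

Row = Dict[str, Any]


class ToyTable:
    """In-memory stand-in for a table, only to illustrate conditional writes.

    Not Cassandra's implementation: no replication, no Paxos, no concurrency,
    just the read-check-write shape of an LWT.
    """

    def __init__(self) -> None:
        self.rows: Dict[Any, Row] = {}

    def insert_if_not_exists(self, key: Any, values: Row) -> Tuple[bool, Row]:
        # Read first: the condition can only be checked against the current data.
        existing = self.rows.get(key)
        if existing is not None:
            # Condition not met: nothing is written, the current row is returned.
            return False, dict(existing)
        self.rows[key] = dict(values)
        return True, {}

    def update_if(self, key: Any, expected: Row, changes: Row) -> Tuple[bool, Row]:
        existing = self.rows.get(key)
        if existing is None:
            # Simplification: a missing row never satisfies the condition here.
            return False, {}
        # Compare the columns named in the IF clause with the values just read.
        if any(existing.get(col) != val for col, val in expected.items()):
            return False, {col: existing.get(col) for col in expected}
        existing.update(changes)
        return True, {}


table = ToyTable()
print(table.insert_if_not_exists("user1", {"balance": 10}))        # (True, {})
print(table.insert_if_not_exists("user1", {"balance": 99}))        # (False, {'balance': 10})
print(table.update_if("user1", {"balance": 10}, {"balance": 20}))  # (True, {})
print(table.update_if("user1", {"balance": 10}, {"balance": 30}))  # (False, {'balance': 20})
```

Every call, successful or not, starts with a read; that extra read is the cost the answer refers to, before any Paxos round-trips are even counted.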
Prior to Paxos v2, added in Cassandra 4.1 (CASSANDRA-17164), LWTs required four round-trips for the [extended] Paxos phases: prepare/promise, serial read, propose/accept, and commit. As a result, LWTs put significantly more load on a cluster than regular writes. If nodes are already overloaded, LWTs can be expected to perform even worse and fail to reach a quorum of replicas.
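The quorum arithmetic and the four phases above can be put into a small Python model. For RF = 3 a quorum is 3 // 2 + 1 = 2 replicas, so a phase in which only one replica answers in time fails, which is the kind of 1-of-2 shortfall described in the question. The function lwt_outcome and its per-phase ack counts are hypothetical, and the assumption that the commit also needs a quorum holds only if the statement's regular consistency level is QUORUM.

```python
from typing import Dict, Optional

# The four Paxos v1 phases listed above; each costs one coordinator-to-replica round-trip.
PAXOS_V1_PHASES = ("prepare/promise", "serial read", "propose/accept", "commit")


def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def lwt_outcome(replication_factor: int,
                acks: Dict[str, int],
                commit_acks_required: Optional[int] = None) -> str:
    """Report where a (simplified) v1 LWT stops if replicas do not answer in time.

    `acks` maps each phase to the hypothetical number of replicas that replied
    before the timeout. The first three phases need a quorum of replicas; the
    commit is assumed to need `commit_acks_required` acks, defaulting to a quorum
    (i.e. assuming the regular consistency level is QUORUM).
    """
    needed = quorum(replication_factor)
    if commit_acks_required is None:
        commit_acks_required = needed
    for phase in PAXOS_V1_PHASES:
        required = commit_acks_required if phase == "commit" else needed
        answered = acks.get(phase, 0)
        if answered < required:
            # Phases run in order, so the first shortfall stops the whole LWT.
            return f"timeout in {phase}: {answered}/{required} replicas answered"
    return "applied"


rf = 3
print("quorum for RF=3:", quorum(rf))  # quorum for RF=3: 2
print(len(PAXOS_V1_PHASES), "round-trips per LWT vs 1 for a plain write")
print(lwt_outcome(rf, dict.fromkeys(PAXOS_V1_PHASES, 3)))  # applied
print(lwt_outcome(rf, {"prepare/promise": 3, "serial read": 2,
                       "propose/accept": 1, "commit": 0}))
# timeout in propose/accept: 1/2 replicas answered
```

The model only counts replies; it ignores retries and contention, but it shows why each extra round-trip is one more chance for an overloaded replica to miss the deadline.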
Running a repair does not solve the underlying problem of overloaded nodes. In fact, repairs add even more load, like adding fuel to a cluster that's already on fire.
You should address the root cause of the problem. I recommend that you review the capacity of your cluster and analyse the utilisation of resources like disk, CPU and memory. You may need to add more nodes. Cheers!
Title: cassandra - Writes fail when lightweight transactions cannot reach quorum - Stack Overflow. Content contributed by users; the views are the author's own. Source: http://www.betaflare.com/web/1742118311a2421574.html