Today I ran a small experiment to see how much CRTP improves performance, since compile-time polymorphism should in theory be cheaper than runtime polymorphism.
#include <chrono>
#include <cstdio> // std::printf

namespace NonCRTP {
class Base {
public:
    virtual void name() = 0;
};

class D1 : public Base {
public:
    void name() override {
        volatile int _ = 1;
    }
};
} // namespace NonCRTP

namespace CRTP {
template <class Derived>
class Base {
public:
    // Forwards to the derived class; resolved at compile time.
    void name() {
        static_cast<Derived*>(this)->name_impl();
    }

protected:
    Base() = default;
};

class D1 : public Base<D1> {
public:
    void name_impl() {
        volatile int _ = 1;
    }
};
} // namespace CRTP

void perf_test(int invocation_counts) {
    std::chrono::nanoseconds ns_duration_noncrtp{};
    std::chrono::nanoseconds ns_duration_crtp{};
    {
        NonCRTP::D1 obj_noncrtp;
        auto time_s{ std::chrono::high_resolution_clock::now() };
        for (int i = 0; i < invocation_counts; ++i) {
            obj_noncrtp.name();
        }
        auto time_e{ std::chrono::high_resolution_clock::now() };
        ns_duration_noncrtp = time_e - time_s;
    }
    {
        CRTP::D1 obj_crtp;
        auto time_s{ std::chrono::high_resolution_clock::now() };
        for (int i = 0; i < invocation_counts; ++i) {
            obj_crtp.name();
        }
        auto time_e{ std::chrono::high_resolution_clock::now() };
        ns_duration_crtp = time_e - time_s;
    }
    // Cast to long long so the format specifier is portable across platforms.
    std::printf("Perf test: %d times of invocation:\n", invocation_counts);
    std::printf(" | -- Non-CRTP: %lld ns\n", static_cast<long long>(ns_duration_noncrtp.count()));
    std::printf(" | ------ CRTP: %lld ns\n", static_cast<long long>(ns_duration_crtp.count()));
    std::printf("\n");
}

int main() {
    perf_test(1);
    perf_test(10);
    perf_test(100);
    perf_test(1000);
    perf_test(10000);
    perf_test(100000);
    perf_test(1000000);
    perf_test(10000000);
    perf_test(100000000);
    return 0;
}
However, the CRTP version turned out not to be consistently faster, and I am curious what causes this.
This is the result without -O2. The CRTP version is not always faster than the non-CRTP one, and is often slower:
-> % clang++ crtp -std=c++20 -o crtp; ./crtp
Perf test: 1 times of invocation:
| -- Non-CRTP: 100 ns
| ------ CRTP: 31 ns
Perf test: 10 times of invocation:
| -- Non-CRTP: 49 ns
| ------ CRTP: 59 ns
Perf test: 100 times of invocation:
| -- Non-CRTP: 171 ns
| ------ CRTP: 239 ns
Perf test: 1000 times of invocation:
| -- Non-CRTP: 1360 ns
| ------ CRTP: 2142 ns
Perf test: 10000 times of invocation:
| -- Non-CRTP: 16609 ns
| ------ CRTP: 60303 ns
Perf test: 100000 times of invocation:
| -- Non-CRTP: 194464 ns
| ------ CRTP: 234327 ns
Perf test: 1000000 times of invocation:
| -- Non-CRTP: 1489131 ns
| ------ CRTP: 2251113 ns
Perf test: 10000000 times of invocation:
| -- Non-CRTP: 14182943 ns
| ------ CRTP: 23018457 ns
Perf test: 100000000 times of invocation:
| -- Non-CRTP: 139714335 ns
| ------ CRTP: 228018231 ns
And here is the result with -O2. This looks better: the two versions are now close, with CRTP faster in some cases but still slower in others:
-> % clang++ crtp -O2 -std=c++20 -o crtp; ./crtp
Perf test: 1 times of invocation:
| -- Non-CRTP: 100 ns
| ------ CRTP: 27 ns
Perf test: 10 times of invocation:
| -- Non-CRTP: 30 ns
| ------ CRTP: 34 ns
Perf test: 100 times of invocation:
| -- Non-CRTP: 67 ns
| ------ CRTP: 78 ns
Perf test: 1000 times of invocation:
| -- Non-CRTP: 288 ns
| ------ CRTP: 289 ns
Perf test: 10000 times of invocation:
| -- Non-CRTP: 2856 ns
| ------ CRTP: 2822 ns
Perf test: 100000 times of invocation:
| -- Non-CRTP: 27937 ns
| ------ CRTP: 27965 ns
Perf test: 1000000 times of invocation:
| -- Non-CRTP: 315920 ns
| ------ CRTP: 270106 ns
Perf test: 10000000 times of invocation:
| -- Non-CRTP: 2632259 ns
| ------ CRTP: 2705106 ns
Perf test: 100000000 times of invocation:
| -- Non-CRTP: 24681903 ns
| ------ CRTP: 22951820 ns
(clang version 18.1.3)
Just a guess: I suppose the result is affected more by caching than by CRTP, though I am not sure. Am I right?
If so, the performance impact of CRTP seems negligible, so what is the best practice for using CRTP?
I would really appreciate any ideas. Thank you in advance.
Asked Mar 26 at 13:08 by Yuwei Zhao (edited Mar 26 at 13:12)
1 Answer
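The point of CRTP is, simply put, to replace calls to virtual functions of the base class with calls that are resolved at compile time. In your test, both loops call the function directly on an object of the derived class. Because the dynamic type of obj_noncrtp is known at compile time, the compiler can bind that call statically, so no virtual dispatch is actually being measured. Without optimization the CRTP version is even slower, most likely because name() and name_impl() are two non-inlined calls per iteration. You should compare the performance of calls made through the base classes, not through the derived ones.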
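To illustrate the difference between calling on a concrete object and calling through the base, here is a minimal sketch. The types Base and D1 and the member id() are hypothetical stand-ins chosen for this example, not the code from the question.

```cpp
// Hypothetical types for illustration; they are not part of the original post.
struct Base {
    virtual ~Base() = default;
    virtual int id() const = 0;
};

struct D1 : Base {
    int id() const override { return 1; }
};

// Called on a concrete object: the dynamic type is known to be D1,
// so the compiler can bind the call statically and inline it.
int call_on_object() {
    D1 obj;
    return obj.id();
}

// Called through a base reference: within this function only Base is visible,
// so the call generally goes through the vtable unless the optimizer can
// prove the dynamic type (for example after inlining at a call site).
int call_through_base(const Base& b) {
    return b.id();
}
```

Both functions return the same value, but only call_through_base is a candidate for dynamic dispatch, which is the cost CRTP is meant to avoid.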
Also, volatile seems to get in the way of compiler optimizations. I modified your code a little to make the effect of CRTP visible: I removed volatile and added an int member that acts as a counter (timings at the end).
Note that the huge gap in performance arises because the compiler can optimize the call to CRTP::base::do_count far more aggressively than the call to NonCRTP::base::do_count. The former is resolved at compile time and can be inlined, after which the optimizer can collapse the whole loop, while the latter is a virtual call dispatched through a base pointer.
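Before the full benchmark, the mechanism can be seen in isolation. The following is a minimal sketch under my own assumptions: the names counter_base, step_counter, do_count_impl and count_n are hypothetical. It also shows two precautions: a protected constructor, so the base cannot be instantiated on its own (casting such an object to Derived would be undefined behavior), and returning the counter, so the result of the loop is actually observed.

```cpp
// Hypothetical CRTP base; names are illustrative only.
template <class Derived>
class counter_base {
public:
    // Statically dispatched: no vtable, and the call can be inlined.
    void do_count() { static_cast<Derived*>(this)->do_count_impl(); }

    int count = 0;

protected:
    // Only derived classes can construct the base, so `this` always
    // points into a real Derived object when do_count() runs.
    counter_base() = default;
};

class step_counter : public counter_base<step_counter> {
public:
    void do_count_impl() { ++count; }
};

// Calls go through the base interface, yet are resolved at compile time.
template <class Derived>
int count_n(counter_base<Derived>& c, int n) {
    for (int i = 0; i < n; ++i) {
        c.do_count();
    }
    return c.count; // returning the counter keeps the work observable
}
```

With a step_counter c, count_n(c, 5) increments the counter five times through the base interface without any virtual call.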
#include <chrono>
#include <cstdio>

namespace NonCRTP {
class base {
public:
    base() = default;
    virtual ~base() = default;
    virtual void do_count() = 0;
    int count = 0;
};

class derived : public base {
public:
    ~derived() { }
    void do_count() override
    {
        ++count;
    }
};
} // namespace NonCRTP

namespace CRTP {
template <class Derived>
class base {
public:
    base() = default;
    ~base() = default;
    // Forwards to Derived::do_count, resolved at compile time.
    void do_count()
    {
        static_cast<Derived *>(this)->do_count();
    }
    int count = 0;
};

class derived : public base<derived> {
public:
    derived() = default;
    ~derived() = default;
    void do_count()
    {
        ++count;
    }
};
} // namespace CRTP

void perf_test(int invocation_counts)
{
    std::chrono::nanoseconds ns_duration_noncrtp{};
    std::chrono::nanoseconds ns_duration_crtp{};
    {
        // Virtual call through a pointer to the base class.
        NonCRTP::base *b1 = new NonCRTP::derived();
        auto time_s{std::chrono::high_resolution_clock::now()};
        for (int i = 0; i < invocation_counts; ++i) {
            b1->do_count();
        }
        auto time_e{std::chrono::high_resolution_clock::now()};
        delete b1;
        ns_duration_noncrtp = time_e - time_s;
    }
    {
        // The object must really be a CRTP::derived: calling do_count() on a
        // bare base<derived> would cast `this` to a type it is not (undefined
        // behavior). The call still goes through the base interface.
        CRTP::derived d2;
        CRTP::base<CRTP::derived> &b2 = d2;
        auto time_s{std::chrono::high_resolution_clock::now()};
        for (int i = 0; i < invocation_counts; ++i) {
            b2.do_count();
        }
        auto time_e{std::chrono::high_resolution_clock::now()};
        ns_duration_crtp = time_e - time_s;
    }
    // Cast to long long so the format specifier is portable across platforms.
    std::printf("Perf test: %d times of invocation:\n", invocation_counts);
    std::printf(
        " | -- Non-CRTP: %lld ns\n",
        static_cast<long long>(ns_duration_noncrtp.count())
    );
    std::printf(
        " | ------ CRTP: %lld ns\n",
        static_cast<long long>(ns_duration_crtp.count())
    );
    std::printf("\n");
}

int main()
{
    perf_test(1);
    perf_test(10);
    perf_test(100);
    perf_test(1000);
    perf_test(10000);
    perf_test(100000);
    perf_test(1000000);
    perf_test(10000000);
    perf_test(100000000);
    return 0;
}
Execution times (compiled with -O2, as you did):
Perf test: 1 times of invocation:
| -- Non-CRTP: 537 ns
| ------ CRTP: 122 ns
Perf test: 10 times of invocation:
| -- Non-CRTP: 261 ns
| ------ CRTP: 112 ns
Perf test: 100 times of invocation:
| -- Non-CRTP: 644 ns
| ------ CRTP: 69 ns
Perf test: 1000 times of invocation:
| -- Non-CRTP: 5934 ns
| ------ CRTP: 68 ns
Perf test: 10000 times of invocation:
| -- Non-CRTP: 58247 ns
| ------ CRTP: 74 ns
Perf test: 100000 times of invocation:
| -- Non-CRTP: 599081 ns
| ------ CRTP: 74 ns
Perf test: 1000000 times of invocation:
| -- Non-CRTP: 5882245 ns
| ------ CRTP: 69 ns
Perf test: 10000000 times of invocation:
| -- Non-CRTP: 38568550 ns
| ------ CRTP: 26 ns
Perf test: 100000000 times of invocation:
| -- Non-CRTP: 160194476 ns
| ------ CRTP: 29 ns
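A quick sanity check on these numbers is to divide each total by the number of calls. The sketch below uses a hypothetical helper, ns_per_call, and the figures from the 100000000-invocation row above. The non-CRTP loop comes out at roughly 1.6 ns per call, which is plausible for an indirect call. The CRTP loop comes out at a small fraction of a nanosecond for all calls combined, far below one clock cycle per call, which suggests the optimizer collapsed or removed the loop rather than running it once per iteration. So the CRTP column most likely measures timer overhead, not the cost of a call.

```cpp
#include <cstdio>

// Hypothetical helper: average cost of one call in nanoseconds.
constexpr double ns_per_call(double total_ns, double calls) {
    return total_ns / calls;
}

// Prints the per-call cost for the largest run reported above.
void print_per_call_costs() {
    // Totals copied from the 100000000-invocation row of the timings.
    std::printf("Non-CRTP: %.3f ns/call\n",
                ns_per_call(160194476.0, 100000000.0));
    std::printf("    CRTP: %.9f ns/call\n",
                ns_per_call(29.0, 100000000.0));
}
```

The takeaway is that CRTP lets the compiler inline and then optimize across calls; the benchmark shows that effect, not a per-call speedup of this magnitude.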
Comment: Instead of high_resolution_clock, use steady_clock. (Ted Lyngmo, commented Mar 26 at 18:06)
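Following the comment's suggestion, a timing helper based on std::chrono::steady_clock could look like the sketch below. The function name time_it is hypothetical. steady_clock is monotonic, so the measured interval cannot go negative if the system time is adjusted during the run.

```cpp
#include <chrono>

// Hypothetical helper: runs f once and returns the elapsed time
// measured with the monotonic std::chrono::steady_clock.
template <class F>
std::chrono::nanoseconds time_it(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
}
```

Each loop in perf_test could be wrapped in a lambda and passed to time_it, which also removes the duplicated start/stop bookkeeping.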