Today I ran a small experiment to get a feel for how much CRTP improves performance, since compile-time polymorphism should in theory be cheaper than runtime polymorphism.

#include <chrono>
#include <cstdint>  // uint64_t
#include <cstdio>   // std::printf

namespace NonCRTP {

class Base {
public:
  virtual void name() = 0;
};

class D1 : public Base {
public:
  void name() override {
    volatile int _ = 1;
  }
};

} // namespace NonCRTP

namespace CRTP {

template <class Derived>
class Base {
public:
  void name() {
    static_cast<Derived*>(this)->name_impl();
  }

protected:
  Base() = default;
};

class D1 : public Base<D1> {
public:
  void name_impl() {
    volatile int _ = 1;
  }
};

} // namespace CRTP

void perf_test(int invocation_counts) {
  std::chrono::nanoseconds ns_duration_noncrtp{};
  std::chrono::nanoseconds ns_duration_crtp{};

  {
    NonCRTP::D1 obj_noncrtp;
    auto time_s{ std::chrono::high_resolution_clock::now() };
    for (int i = 0; i < invocation_counts; ++i) {
      obj_noncrtp.name();
    }

    auto time_e{ std::chrono::high_resolution_clock::now() };
    ns_duration_noncrtp = time_e - time_s;
  }

  {
    CRTP::D1 obj_crtp;
    auto time_s{ std::chrono::high_resolution_clock::now() };
    for (int i = 0; i < invocation_counts; ++i) {
      obj_crtp.name();
    }

    auto time_e{ std::chrono::high_resolution_clock::now() };
    ns_duration_crtp = time_e - time_s;
  }

  std::printf("Perf test: %d times of invocation:\n", invocation_counts);
  std::printf("  | -- Non-CRTP: %lu ns\n", static_cast<uint64_t>(ns_duration_noncrtp.count()));
  std::printf("  | ------ CRTP: %lu ns\n", static_cast<uint64_t>(ns_duration_crtp.count()));
  std::printf("\n");
}

int main() {
  perf_test(1);
  perf_test(10);
  perf_test(100);
  perf_test(1000);
  perf_test(10000);
  perf_test(100000);
  perf_test(1000000);
  perf_test(10000000);
  perf_test(100000000);
  return 0;
}

However, it turned out that the CRTP version is not necessarily faster, and I am curious what causes this.

This is the result without -O2 (the CRTP version is not consistently faster than the non-CRTP one, and is often slower):

-> % clang++ crtp -std=c++20 -o crtp; ./crtp 
Perf test: 1 times of invocation:
  | -- Non-CRTP: 100 ns
  | ------ CRTP: 31 ns

Perf test: 10 times of invocation:
  | -- Non-CRTP: 49 ns
  | ------ CRTP: 59 ns

Perf test: 100 times of invocation:
  | -- Non-CRTP: 171 ns
  | ------ CRTP: 239 ns

Perf test: 1000 times of invocation:
  | -- Non-CRTP: 1360 ns
  | ------ CRTP: 2142 ns

Perf test: 10000 times of invocation:
  | -- Non-CRTP: 16609 ns
  | ------ CRTP: 60303 ns

Perf test: 100000 times of invocation:
  | -- Non-CRTP: 194464 ns
  | ------ CRTP: 234327 ns

Perf test: 1000000 times of invocation:
  | -- Non-CRTP: 1489131 ns
  | ------ CRTP: 2251113 ns

Perf test: 10000000 times of invocation:
  | -- Non-CRTP: 14182943 ns
  | ------ CRTP: 23018457 ns

Perf test: 100000000 times of invocation:
  | -- Non-CRTP: 139714335 ns
  | ------ CRTP: 228018231 ns

And here is the result with -O2 (this looks better: CRTP is now roughly on par with non-CRTP, faster in some cases but still slightly slower in others):

-> % clang++ crtp -O2 -std=c++20 -o crtp; ./crtp
Perf test: 1 times of invocation:
  | -- Non-CRTP: 100 ns
  | ------ CRTP: 27 ns

Perf test: 10 times of invocation:
  | -- Non-CRTP: 30 ns
  | ------ CRTP: 34 ns

Perf test: 100 times of invocation:
  | -- Non-CRTP: 67 ns
  | ------ CRTP: 78 ns

Perf test: 1000 times of invocation:
  | -- Non-CRTP: 288 ns
  | ------ CRTP: 289 ns

Perf test: 10000 times of invocation:
  | -- Non-CRTP: 2856 ns
  | ------ CRTP: 2822 ns

Perf test: 100000 times of invocation:
  | -- Non-CRTP: 27937 ns
  | ------ CRTP: 27965 ns

Perf test: 1000000 times of invocation:
  | -- Non-CRTP: 315920 ns
  | ------ CRTP: 270106 ns

Perf test: 10000000 times of invocation:
  | -- Non-CRTP: 2632259 ns
  | ------ CRTP: 2705106 ns

Perf test: 100000000 times of invocation:
  | -- Non-CRTP: 24681903 ns
  | ------ CRTP: 22951820 ns

(clang version 18.1.3)

Just a guess: I suspect the result is affected more by caching than by CRTP, but I am not sure. Am I right?

If so, the performance impact of CRTP seems negligible. What is the best practice for using CRTP?

I would really appreciate any ideas. Thank you in advance.

Asked Mar 26 at 13:08 by Yuwei Zhao (edited Mar 26 at 13:12)

  • The non-CRTP case can be devirtualized by the compiler. – Jarod42, Mar 26 at 13:10
  • Did you run this through a profiler and see where runtime is actually being spent? Did you perform any cache analysis? Have you compared the assembly between the two implementations? – Stephen Newell, Mar 26 at 13:15
  • The first thing you need to do is look at the disassembly in a really simple case. That can help you work out whether devirtualization is occurring. – Yakk - Adam Nevraumont, Mar 26 at 13:41
  • Side note: don't use high_resolution_clock; use steady_clock. – Ted Lyngmo, Mar 26 at 18:06

1 Answer

The point of CRTP is, simply put, to avoid calls to virtual functions through the base class. In your code, both tests call the function directly on the derived object, so the compiler knows the exact type in each case. To see the effect, compare calls made through the base classes, not the derived ones.

Also, volatile seems to get in the way of compiler optimizations. I modified your code a little to help the effect of CRTP surface: I removed volatile and added an int member as a counter (timings at the end).

Notice that the huge gap in performance arises because the compiler can optimize the call to CRTP::base::do_count far more than the call to NonCRTP::base::do_count: the implementation of the former is visible at compile time rather than hidden behind a pure virtual function (= 0).

#include <chrono>
#include <cstdint>  // uint64_t
#include <cstdio>

namespace NonCRTP {

class base {
public:

    base() = default;
    virtual ~base() = default;

    virtual void do_count() = 0;

    int count = 0;
};

class derived : public base {
public:

    ~derived() { }

    void do_count() override
    {
        ++count;
    }
};

} // namespace NonCRTP

namespace CRTP {

template <class Derived>
class base {
public:

    base() = default;
    ~base() = default;

    void do_count()
    {
        static_cast<Derived *>(this)->do_count();
    }

    int count = 0;
};

class derived : public base<derived> {
public:

    derived() = default;
    ~derived() = default;

    void do_count()
    {
        ++count;
    }
};

} // namespace CRTP

void perf_test(int invocation_counts)
{
    std::chrono::nanoseconds ns_duration_noncrtp{};
    std::chrono::nanoseconds ns_duration_crtp{};

    {
        NonCRTP::base *b1 = new NonCRTP::derived();

        auto time_s{std::chrono::high_resolution_clock::now()};
        for (int i = 0; i < invocation_counts; ++i) {
            b1->do_count();
        }
        auto time_e{std::chrono::high_resolution_clock::now()};

        delete b1;
        ns_duration_noncrtp = time_e - time_s;
    }

    {
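        // The object must really be a derived: base<Derived>::do_count()
        // casts this to Derived *, which is undefined for a bare base object.
        CRTP::derived d2;
        CRTP::base<CRTP::derived> &b2 = d2;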

        auto time_s{std::chrono::high_resolution_clock::now()};
        for (int i = 0; i < invocation_counts; ++i) {
            b2.do_count();
        }
        auto time_e{std::chrono::high_resolution_clock::now()};

        ns_duration_crtp = time_e - time_s;
    }

    std::printf("Perf test: %d times of invocation:\n", invocation_counts);
    std::printf(
        "  | -- Non-CRTP: %lu ns\n",
        static_cast<uint64_t>(ns_duration_noncrtp.count())
    );
    std::printf(
        "  | ------ CRTP: %lu ns\n",
        static_cast<uint64_t>(ns_duration_crtp.count())
    );
    std::printf("\n");
}

int main()
{
    perf_test(1);
    perf_test(10);
    perf_test(100);
    perf_test(1000);
    perf_test(10000);
    perf_test(100000);
    perf_test(1000000);
    perf_test(10000000);
    perf_test(100000000);
    return 0;
}

Execution times (compiled with -O2 like you did):

Perf test: 1 times of invocation:
  | -- Non-CRTP: 537 ns
  | ------ CRTP: 122 ns

Perf test: 10 times of invocation:
  | -- Non-CRTP: 261 ns
  | ------ CRTP: 112 ns

Perf test: 100 times of invocation:
  | -- Non-CRTP: 644 ns
  | ------ CRTP: 69 ns

Perf test: 1000 times of invocation:
  | -- Non-CRTP: 5934 ns
  | ------ CRTP: 68 ns

Perf test: 10000 times of invocation:
  | -- Non-CRTP: 58247 ns
  | ------ CRTP: 74 ns

Perf test: 100000 times of invocation:
  | -- Non-CRTP: 599081 ns
  | ------ CRTP: 74 ns

Perf test: 1000000 times of invocation:
  | -- Non-CRTP: 5882245 ns
  | ------ CRTP: 69 ns

Perf test: 10000000 times of invocation:
  | -- Non-CRTP: 38568550 ns
  | ------ CRTP: 26 ns

Perf test: 100000000 times of invocation:
  | -- Non-CRTP: 160194476 ns
  | ------ CRTP: 29 ns
