Edit:
- Please disregard MSVC Results. I missed that CMake generates an app in
$builddir/Release
, and not in$builddir
. MSVC's results were so close to clang's because they were, in fact, clang's. But don't worry it doesn't matter much, because - This benchmark was designed on assuption that since we access shapes in predicatable way, cache won't give us much trouble. Well, cmuratori and AcheretoTM pointed that it's probably wrong. I checked, and I was wrong. Now virtual calls are even faster, so please disregard everything else as well. Now, the code don't generate all 20 MiB objects for each case, but iterate over 8192 (about 128 KB, half of my L2 cache for a single core) 2560 times. That's the same 20 MiB iterations. Also, vectors are made before every loop and destroyed after every measurement. So, it's always cold start for every loop and low memory consumption for the bench.
Recently, Case Muratori made an article and a video named "Clean" Code, Horrible Performance
I agree that sometimes, a particular software architecture that is regarded as better (that is, Clean Code) in general can reduce performance quite a bit. I also agree that the Clean Code example can be misleading and make people think it's fine even in cases like shape area size. There is a reason why shaders are not written in object-oriented style after all.
Sometimes after the video was published, Casey and Robert (author of Clean Code) engaged in discussion on the merits of different architecture and on this day (16 Mar 2023) Bob said about "a hypothetical compiler that produces identical binary code irrespective of whether the input is operand or operation primal." Casey responded with essentially "it's hypothetical, modern compilers can't do that" and it is as far from reality as proposing "A hypothetical language where the dependency cost is always the same". I think this is a bit unfair since the previous points were more about architecture in general, and Bob agreed that there are cases where it's harmful for performance to use OO style with virtual functions.
About four days ago (13 Mar 2023), I made my own benchmark to be absolutely sure things are as they are, as well as to test different approaches to the problem.
Along with Casey's code, I tested:
- std::variant containing four different classes with no base class:
class square_no_vt
{
public:
square_no_vt(f32 SideInit) : Side(SideInit) {}
f32 Area() {return Side*Side;}
private:
f32 Side;
};
class rectangle_no_vt
{
public:
rectangle_no_vt(f32 WidthInit, f32 HeightInit) : Width(WidthInit), Height(HeightInit) {}
f32 Area() {return Width*Height;}
private:
f32 Width, Height;
};
class triangle_no_vt
{
public:
triangle_no_vt(f32 BaseInit, f32 HeightInit) : Base(BaseInit), Height(HeightInit) {}
f32 Area() {return 0.5f*Base*Height;}
private:
f32 Base, Height;
};
class circle_no_vt
{
public:
circle_no_vt(f32 RadiusInit) : Radius(RadiusInit) {}
f32 Area() {return Pi32*Radius*Radius;}
private:
f32 Radius;
};
using shape_variant = std::variant<square_no_vt, rectangle_no_vt, triangle_no_vt, circle_no_vt>;
template<typename ShapeVariant>
f32 TotalAreaVariant(u32 ShapeCount, ShapeVariant *Shapes) noexcept
{
f32 Accum = 0.0f;
for(u32 ShapeIndex = 0; ShapeIndex < ShapeCount; ++ShapeIndex)
{
Accum += std::visit([](auto shape){return shape.Area();}, Shapes[ShapeIndex]);
}
return Accum;
}
- std::variant with one base class with optimized
Area()
function
class shape_base_no_vt_opt{
public:
shape_base_no_vt_opt(shape_type shape_index_, f32 side1_, f32 side2_) :
shape_index(shape_index_), side1(side1_), side2(side2_)
{}
f32 Area() { return c_table[shape_index] * side1 * side2; }
private:
static constexpr f32 c_table[4] = {1.0f, 1.0f, 0.5f, Pi32};
shape_type shape_index;
f32 side1;
f32 side2;
};
class square_no_vt_opt : public shape_base_no_vt_opt
{
public:
square_no_vt_opt(f32 SideInit) : shape_base_no_vt_opt(Shape_Square, SideInit, SideInit) {}
};
class rectangle_no_vt_opt: public shape_base_no_vt_opt
{
public:
rectangle_no_vt_opt(f32 WidthInit, f32 HeightInit) : shape_base_no_vt_opt(Shape_Rectangle, WidthInit, HeightInit) {}
};
class triangle_no_vt_opt: public shape_base_no_vt_opt
{
public:
triangle_no_vt_opt(f32 BaseInit, f32 HeightInit) : shape_base_no_vt_opt(Shape_Triangle, BaseInit, HeightInit) {}
};
class circle_no_vt_opt: public shape_base_no_vt_opt
{
public:
circle_no_vt_opt(f32 RadiusInit) : shape_base_no_vt_opt(Shape_Circle, RadiusInit, RadiusInit) {}
};
using shape_variant_opt = std::variant<square_no_vt_opt, rectangle_no_vt_opt, triangle_no_vt_opt, circle_no_vt_opt>;
// same TotalAreaVariant as above
- Finally, another
shape_base
overloaded classes, with an intermediate base class with optimized area function
class shape_base_opt: public shape_base{
public:
shape_base_opt(shape_type shape_index_, f32 side1_, f32 side2_) :
shape_index(shape_index_), side1(side1_), side2(side2_)
{}
virtual f32 Area() { return c_table[shape_index] * side1 * side2; }
private:
static constexpr f32 c_table[4] = {1.0f, 1.0f, 0.5f, Pi32};
shape_type shape_index;
f32 side1;
f32 side2;
};
class square_opt : public shape_base_opt
{
public:
square_opt(f32 SideInit) : shape_base_opt(Shape_Square, SideInit, SideInit) {}
};
class rectangle_opt: public shape_base_opt
{
public:
rectangle_opt(f32 WidthInit, f32 HeightInit) : shape_base_opt(Shape_Rectangle, WidthInit, HeightInit) {}
};
class triangle_opt: public shape_base_opt
{
public:
triangle_opt(f32 BaseInit, f32 HeightInit) : shape_base_opt(Shape_Triangle, BaseInit, HeightInit) {}
};
class circle_opt: public shape_base_opt
{
public:
circle_opt(f32 RadiusInit) : shape_base_opt(Shape_Circle, RadiusInit, RadiusInit) {}
};
// used for original shapes as well
f32 TotalAreaVTBL(u32 ShapeCount, shape_base **Shapes)
{
f32 Accum = 0.0f;
for(u32 ShapeIndex = 0; ShapeIndex < ShapeCount; ++ShapeIndex)
{
Accum += Shapes[ShapeIndex]->Area();
}
return Accum;
}
Edit: there's new measurements from Windows machine:
Windows 10 (different machine, so I can't compare it with the above benchmarks):
- clang 15.0.6 (MSVC runtime)
Release:
Bench: virtual functions
Regular loop, time: 154 ms area: 7.50355e+14
Unrolled loop, time: 148 ms area: 7.39093e+14
Bench: switch case
Regular loop, time: 131 ms area: 7.50355e+14
Unrolled loop, time: 119 ms area: 7.39093e+14
Bench: lookup table
Regular loop, time: 18 ms area: 7.45486e+14
Unrolled loop, time: 12 ms area: 7.60836e+14
Bench: std::variant
Regular loop, time: 133 ms area: 7.50355e+14
Unrolled loop, time: 121 ms area: 7.39093e+14
Bench: std::variant optimized
Regular loop, time: 19 ms area: 7.50355e+14
Unrolled loop, time: 18 ms area: 7.39093e+14
Bench: virtual functions optimized
Regular loop, time: 41 ms area: 7.50355e+14
Unrolled loop, time: 39 ms area: 7.39093e+14
Debug:
Bench: virtual functions
Regular loop, time: 251 ms area: 7.50355e+14
Unrolled loop, time: 209 ms area: 7.39093e+14
Bench: switch case
Regular loop, time: 278 ms area: 7.50355e+14
Unrolled loop, time: 237 ms area: 7.39093e+14
Bench: lookup table
Regular loop, time: 120 ms area: 7.45486e+14
Unrolled loop, time: 114 ms area: 7.60836e+14
Bench: std::variant
Regular loop, time: 684 ms area: 7.50355e+14
Unrolled loop, time: 687 ms area: 7.39093e+14
Bench: std::variant optimized
Regular loop, time: 687 ms area: 7.50355e+14
Unrolled loop, time: 701 ms area: 7.39093e+14
Bench: virtual functions optimized
Regular loop, time: 99 ms area: 7.50355e+14
Unrolled loop, time: 74 ms area: 7.39093e+14
- MSVC 2022 17.4.3
Release:
Bench: virtual functions
Regular loop, time: 153 ms area: 7.50355e+14
Unrolled loop, time: 142 ms area: 7.39093e+14
Bench: switch case
Regular loop, time: 125 ms area: 7.50355e+14
Unrolled loop, time: 91 ms area: 7.39093e+14
Bench: lookup table
Regular loop, time: 21 ms area: 7.45486e+14
Unrolled loop, time: 18 ms area: 7.60836e+14
Bench: std::variant
Regular loop, time: 124 ms area: 7.50355e+14
Unrolled loop, time: 79 ms area: 7.39093e+14
Bench: std::variant optimized
Regular loop, time: 116 ms area: 7.50355e+14
Unrolled loop, time: 79 ms area: 7.39093e+14
Bench: virtual functions optimized
Regular loop, time: 41 ms area: 7.50355e+14
Unrolled loop, time: 37 ms area: 7.39093e+14
Debug:
Bench: virtual functions
Regular loop, time: 265 ms area: 7.50355e+14
Unrolled loop, time: 233 ms area: 7.39093e+14
Bench: switch case
Regular loop, time: 398 ms area: 7.50355e+14
Unrolled loop, time: 384 ms area: 7.39093e+14
Bench: lookup table
Regular loop, time: 243 ms area: 7.45486e+14
Unrolled loop, time: 232 ms area: 7.60836e+14
Bench: std::variant
Regular loop, time: 894 ms area: 7.50355e+14
Unrolled loop, time: 887 ms area: 7.39093e+14
Bench: std::variant optimized
Regular loop, time: 993 ms area: 7.50355e+14
Unrolled loop, time: 1027 ms area: 7.39093e+14
Bench: virtual functions optimized
Regular loop, time: 115 ms area: 7.50355e+14
Unrolled loop, time: 91 ms area: 7.39093e+14
Ubuntu 20.04, chrooted in Ubuntu 22.04
- gcc 12.1.0
Release:
Bench: virtual functions
Regular loop, time: 141 ms area: 7.45026e+14
Unrolled loop, time: 128 ms area: 7.36963e+14
Bench: switch case
Regular loop, time: 83 ms area: 7.45026e+14
Unrolled loop, time: 56 ms area: 7.36963e+14
Bench: lookup table
Regular loop, time: 21 ms area: 7.5042e+14
Unrolled loop, time: 12 ms area: 7.65487e+14
Bench: std::variant
Regular loop, time: 78 ms area: 7.45026e+14
Unrolled loop, time: 54 ms area: 7.36963e+14
Bench: std::variant optimized
Regular loop, time: 68 ms area: 7.45026e+14
Unrolled loop, time: 29 ms area: 7.36963e+14
Bench: virtual functions optimized
Regular loop, time: 45 ms area: 7.45026e+14
Unrolled loop, time: 31 ms area: 7.36963e+14
Debug:
Bench: virtual functions
Regular loop, time: 224 ms area: 7.45026e+14
Unrolled loop, time: 186 ms area: 7.36963e+14
Bench: switch case
Regular loop, time: 236 ms area: 7.45026e+14
Unrolled loop, time: 197 ms area: 7.36963e+14
Bench: lookup table
Regular loop, time: 66 ms area: 7.5042e+14
Unrolled loop, time: 47 ms area: 7.65487e+14
Bench: std::variant
Regular loop, time: 1092 ms area: 7.45026e+14
Unrolled loop, time: 1076 ms area: 7.36963e+14
Bench: std::variant optimized
Regular loop, time: 1098 ms area: 7.45026e+14
Unrolled loop, time: 1087 ms area: 7.36963e+14
Bench: virtual functions optimized
Regular loop, time: 83 ms area: 7.45026e+14
Unrolled loop, time: 57 ms area: 7.36963e+14
Ubuntu 20.04
- gcc 11.1.0
Release:
Bench: virtual functions
Regular loop, time: 128 ms area: 7.45026e+14
Unrolled loop, time: 128 ms area: 7.36963e+14
Bench: switch case
Regular loop, time: 88 ms area: 7.45026e+14
Unrolled loop, time: 55 ms area: 7.36963e+14
Bench: lookup table
Regular loop, time: 19 ms area: 7.5042e+14
Unrolled loop, time: 10 ms area: 7.65487e+14
Bench: std::variant
Regular loop, time: 114 ms area: 7.45026e+14
Unrolled loop, time: 108 ms area: 7.36963e+14
Bench: std::variant optimized
Regular loop, time: 115 ms area: 7.45026e+14
Unrolled loop, time: 111 ms area: 7.36963e+14
Bench: virtual functions optimized
Regular loop, time: 43 ms area: 7.45026e+14
Unrolled loop, time: 28 ms area: 7.36963e+14
Debug:
Bench: virtual functions
Regular loop, time: 227 ms area: 7.45026e+14
Unrolled loop, time: 188 ms area: 7.36963e+14
Bench: switch case
Regular loop, time: 231 ms area: 7.45026e+14
Unrolled loop, time: 198 ms area: 7.36963e+14
Bench: lookup table
Regular loop, time: 65 ms area: 7.5042e+14
Unrolled loop, time: 48 ms area: 7.65487e+14
Bench: std::variant
Regular loop, time: 1140 ms area: 7.45026e+14
Unrolled loop, time: 1165 ms area: 7.36963e+14
Bench: std::variant optimized
Regular loop, time: 1150 ms area: 7.45026e+14
Unrolled loop, time: 1151 ms area: 7.36963e+14
Bench: virtual functions optimized
Regular loop, time: 79 ms area: 7.45026e+14
Unrolled loop, time: 59 ms area: 7.36963e+14
- clang 15.0.7
Release:
Bench: virtual functions
Regular loop, time: 136 ms area: 7.45026e+14
Unrolled loop, time: 127 ms area: 7.36963e+14
Bench: switch case
Regular loop, time: 119 ms area: 7.45026e+14
Unrolled loop, time: 112 ms area: 7.36963e+14
Bench: lookup table
Regular loop, time: 19 ms area: 7.5042e+14
Unrolled loop, time: 15 ms area: 7.65487e+14
Bench: std::variant
Regular loop, time: 116 ms area: 7.45026e+14
Unrolled loop, time: 110 ms area: 7.36963e+14
Bench: std::variant optimized
Regular loop, time: 122 ms area: 7.45026e+14
Unrolled loop, time: 117 ms area: 7.36963e+14
Bench: virtual functions optimized
Regular loop, time: 67 ms area: 7.45026e+14
Unrolled loop, time: 31 ms area: 7.36963e+14
Debug:
Bench: virtual functions
Regular loop, time: 208 ms area: 7.45026e+14
Unrolled loop, time: 178 ms area: 7.36963e+14
Bench: switch case
Regular loop, time: 290 ms area: 7.45026e+14
Unrolled loop, time: 264 ms area: 7.36963e+14
Bench: lookup table
Regular loop, time: 105 ms area: 7.5042e+14
Unrolled loop, time: 91 ms area: 7.65487e+14
Bench: std::variant
Regular loop, time: 586 ms area: 7.45026e+14
Unrolled loop, time: 603 ms area: 7.36963e+14
Bench: std::variant optimized
Regular loop, time: 619 ms area: 7.45026e+14
Unrolled loop, time: 631 ms area: 7.36963e+14
Bench: virtual functions optimized
Regular loop, time: 60 ms area: 7.45026e+14
Unrolled loop, time: 48 ms area: 7.36963e+14
So, lets compare it by categories, for regular loops:
gcc 12 | gcc 11 | clang 15 Lin | clang 15 Win | MSVC | |
---|---|---|---|---|---|
virtual functions | 141 | 128 | 136 | 154 | 152 |
switch case | 83 | 88 | 119 | 131 | 130 |
lookup table | 21 | 19 | 19 | 18 | 21 |
std::variant | 78 | 114 | 116 | 133 | 124 |
std::variant optimized | 68 | 115 | 122 | 19 | 118 |
virtual functions optimized | 45 | 43 | 67 | 41 | 45 |
and for unrolled loops:
gcc 12 | gcc 11 | clang 15 Lin | clang 15 Win | MSVC | |
---|---|---|---|---|---|
virtual functions | 128 | 128 | 127 | 148 | 139 |
switch case | 56 | 55 | 112 | 119 | 90 |
lookup table | 12 | 10 | 15 | 12 | 18 |
std::variant | 54 | 108 | 110 | 121 | 79 |
std::variant optimized | 29 | 111 | 117 | 18 | 78 |
virtual functions optimized | 31 | 28 | 31 | 39 | 40 |
Note, Win machine is not the same as Lin, so there will be no direct comparison. I don't think it's worth comparing Debug versions with each other, so I'll leave it as an exercise for a reader.
So, conclusions:
- Yes, counting area of 20 megashapes all placed in memory is not the same as counting area of 8 kiloshapes 2.5k times. All data fitting in at least L3 cache can significantly change results of a benchmark
- Nevertheless, virtual shapes with that have one single base class counts faster than naive shapes with 4 different area functions
- Lookup table version gained from about 2x to just a bit speed, depending on a compiler. 2x is for gcc at least on Linux, MSVC gives no significant speedup
- Effect of unrolling switches and variants is compiler and platform dependent, but never to a degree the LuT code has.
- On most platforms and compilers in the table,
variant
andvisit
is about the same speed as switch, except for GCC 11. - On most platform, the "optimized variant" code has no speedup compared to naive implementation. GCC 12 somehow improved on unrolled loops, and only Clang on windows showed exactly the same performance as the LuT code. I'm not sure why. My guess is, Windows linker can eliminate same code (Partial-redundancy elimination) and clang can take advantage of it and throw away all branches, leaving single function aside. But, again, I may be completely wrong and clang 15.0.6 can do it everywhere, but clang 15.0.7 has some regression and so it can't optimize it like that. Or some clang defaults on Linux and Windows are different enough so it works on Win10 but not on Ubuntu 20.04.
- Finally, optimized version of shapes with single middle-class (let's call it that) is still faster than four naive classes each with a different area counting function. About 3 to 4 times faster now. Using the smae function for both cases. My guess is, it's branch prediction at work, but I can't neither proof nor disproof it. All I want to say, it's not really fair to compare optimized procedural code with unoptimized OO, especially since it's still 2 to 4 times faster.
I think I should clarify lots of stuff:
- I totally agree OO is way less optimizable then Operation First, procedural style. At least, that's what my personal experience is
- I agree that to OO is less understandable than procedural. I mean, just try to understand how FizzBuzzEnterpriseEdition even works. I tried, and I failed to find where
main
is. Maybe one or two level of indirection is fine, but five or more are pain - Therefore, I agree that it's almost impossible to find optimization opportunities in OO code, with IDE always pointing to base class and not the one class I need, and code being dispersed across way too many files
- I agree there is no place for OO in tight loops that should work fast with lots of data
- I do not agree you can't have same performance with Operand First approach.
variant
,visit
and clang on visit shows that it's at least possible with modern C++ compilers. Yes, no base class, no vtables, but the structure of a code is about the same. Should we use it or not is beside the point (but probably we shouldn't except for some strange cases) - I do not agree we should not use clean code approach. Yet, do think that we should limit levels of indirection only for user API, and using rarely and carefully in a code that we fully control.
Well, that's all for now.