Cache xsort and qsort sort.cc3/13/2023 ![]() ![]() I re-wrote this paragaph several times, since sometimes after a reboot stable_sort was slower and sometimes it was faster (as shown above). It is interesting that std::stable_sort which has stricly more requirements on its implementation than std::sort (i.e., any stable sort is also suitable for std::sort) ends up faster. The C++ sort functions, other than perhaps std::partial_sort 9, put in a good showing. Without further ado, let’s just throw std::sort, std::stable_sort and std::partial_sort into the mix: Since the sort code and the comparator are being compiler together, we expect the comparator to be easily inlined, and perhaps other optimizations may occur. Finally, the comparator does a three-way compare (returning distrinct results for and =), but the merge code only needs a two-way compare ( ) - inlining the comparator manages to remove extra code associated with distinguishing the header? Unlike C’s qsort which is generic by virtue of taking a function pointer and information about the object size, the C++ sort functions use templates to achieve genericity and so are implemented directly in header files. The loop has only two loads and one store, so the memory access redundancy between the merge code and the comparator has been eliminated 8. Indeed, we measure with perf stat that the misprediction rate has dropped from to close to 12 mispredicts per element to around 0.75 per element. Yes, there are still two conditional jumps, but they are both just checking for the termination condition (that one of the lists to merge is exhausted), so we expect this loop to be free of branch mispredictions other than the final iteration. How much better could things get if we inline the comparator into the merge loop? That’s what we do in qsort-inlined 6, and here’s the main loop which now includes the comparator function 7 : 0.07 : 401dc8: test rbp,rbpĠ.66 : 401dcb: je 401e0c (msort_param const*, void*, unsigned long, CompareU64)+0xbc>ħ.48 : 401e03: mov QWORD PTR ,raxĠ.71 : 401e0a: jne 401dc8 (msort_param const*, void*, unsigned long, CompareU64)+0x78>Ī key difference is that the core of the loop is now branch free. Note that the comparator has to redundantly load from memory the two locations to compare, something the merge loop already did (the merge loop reads them because it is responsible for moving the elements). This hot loop corresponds directly to this code from glibc (from the file msort.c 5) : ![]() There are also two checks for termination ( test r12,r12 and test rbp,rbp). bench qsort to capture profiling data, and perf report -stdio to print a summary 3:ĥ.24 : 39224: mov rdx,QWORD PTR ģ2.69 : 39236: mov rax,QWORD PTR ĭepending on your level of assembly reading skill, it may not be obvious, but this is basically a classic merge routine: it is merging two lists by comparing the top elements of each list (pointed to by r13 and r15), and then storing the smaller element (the line QWORD PTR ,rax) and loading the next element from that list. Benchmarking Qsortįirst, let’s take a look at what qsort is doing, to see if there is any delicous low-hanging performance fruit. This naturally, this raises the question: how fast is qsort when it comes to sorting integers and can we do better?Īll of the code for this post is available on GitHub, so if you’d like to follow along with the code open in an editor, go right ahead (warning: there are obviously some spoilers if you dig through the code first). bench & perf report) shows that more than 90% of the time spent in this approach is in the sorting routine, qsort - so we are right to focus on this step, rather than say the de-duplication step or the initial random number generation. While Daniel suggests a clever method of avoiding a sort entirely 2, I’m also interested in they why for the underlying performace of the sort method: it takes more than 100 ns per element, which means 100s of CPU clock cycles and usually even more instructions than that (on a superscalar processor)! As a sanity check, a quick benchmark ( perf record. In the case we want sorted output, an obvious solution presents itself: sorting randomly chosen values and de-duplicating the list, which is easy since identical values are now adjacent. Recently, Daniel Lemire tackled the topic of selecting N distinct numbers at random.
0 Comments
Leave a Reply.AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |