Message ID | cover.1723567469.git.olivier@trillion01.com
Series | abstract napi tracking strategy
On 8/13/24 10:44 AM, Olivier Langlois wrote:
> the actual napi tracking strategy is inducing a non-negligible overhead.
> Every time a multishot poll is triggered or any poll armed, if napi is
> enabled on the ring, a lookup is performed to either add a new napi id
> into the napi_list or update its timeout value.
>
> For many scenarios, this is overkill as the napi id list will be pretty
> much static most of the time. To address this common scenario, a new
> abstraction has been created following the common Linux kernel idiom of
> creating an abstract interface with a struct filled with function
> pointers.
>
> Creating an alternate napi tracking strategy is therefore made in 2
> phases.
>
> 1. Introduce the io_napi_tracking_ops interface
> 2. Implement a static napi tracking by defining a new io_napi_tracking_ops

I don't think we should create ops for this, unless there's a strict
need to do so. Indirect function calls aren't cheap, and the CPU-side
mitigations for security issues made them worse.

You're not wrong that ops is not an uncommon idiom in the kernel, but
it's a lot less prevalent as a solution than it used to be, exactly
because of the above reasons.
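[Editor's note: for readers following along, a minimal sketch of the kind of ops struct the cover letter describes — an interface struct filled with function pointers, with one implementation per tracking strategy. The field and function names here are guesses for illustration, not the actual patch.]

```c
#include <stdbool.h>
#include <stddef.h>

/* hypothetical stand-in for the real kernel context type */
struct io_ring_ctx;

/* one tracking strategy == one set of callbacks */
struct io_napi_tracking_ops {
	/* called when a poll is armed or a multishot poll triggers */
	void (*add_id)(struct io_ring_ctx *ctx, unsigned int napi_id);
	/* called from the napi busy-poll loop */
	bool (*do_busy_loop)(struct io_ring_ctx *ctx);
};

/*
 * "static" strategy: the napi id list never changes after setup,
 * so the per-poll add becomes a no-op and the lookup overhead the
 * cover letter complains about disappears.
 */
static void static_add_id(struct io_ring_ctx *ctx, unsigned int napi_id)
{
	(void)ctx;
	(void)napi_id;	/* nothing to do: the list is fixed */
}

static bool static_do_busy_loop(struct io_ring_ctx *ctx)
{
	(void)ctx;
	return true;	/* placeholder for iterating the fixed list */
}

static const struct io_napi_tracking_ops static_tracking = {
	.add_id		= static_add_id,
	.do_busy_loop	= static_do_busy_loop,
};
```

Every call site would then go through `ctx->ops->add_id(...)` style indirect calls, which is exactly the cost being debated in the replies below.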
On Tue, 2024-08-13 at 12:33 -0600, Jens Axboe wrote:
> [...]
> I don't think we should create ops for this, unless there's a strict
> need to do so. Indirect function calls aren't cheap, and the CPU side
> mitigations for security issues made them worse.
>
> You're not wrong that ops is not an uncommon idiom in the kernel, but
> it's a lot less prevalent as a solution than it used to. Exactly
> because of the above reasons.

ok. Do you have a reference explaining this? And what type of construct
would you use instead?

AFAIK, a big performance killer is the branch mispredictions coming from
big switch/case or if/else if/else blocks, and it was precisely the
reason why you removed the big switch/case io_uring was having with
function pointers in io_issue_def...

I consume an enormous amount of programming learning material daily and
this is the first time that I am hearing this.

If there was a performance concern about this type of construct, and
considering that my main programming language is C++, I am a bit
surprised that I have not seen anything about problems with C++ vtbls...

but oh well, I am learning new stuff every day, so please share the
references you have about the topic so that I can perfect my knowledge.

thank you,
On Tue, 2024-08-13 at 12:33 -0600, Jens Axboe wrote:
> [...]
> I don't think we should create ops for this, unless there's a strict
> need to do so. Indirect function calls aren't cheap, and the CPU side
> mitigations for security issues made them worse.

if indirection is a very big deal, it might be possible to remove one
level of indirection. I did entertain the idea of making copies of the
io_napi_tracking_ops structs instead of storing their addresses. I did
not keep this option because of the way that I implemented
io_napi_get_tracking()...

but if this would be an acceptable compromise, this is definitely
something possible.
On 8/13/24 3:25 PM, Olivier Langlois wrote:
> On Tue, 2024-08-13 at 12:33 -0600, Jens Axboe wrote:
>> [...]
>> I don't think we should create ops for this, unless there's a strict
>> need to do so. Indirect function calls aren't cheap, and the CPU side
>> mitigations for security issues made them worse.
>
> ok. Do you have a reference explaining this?
> and what type of construct would you use instead?

See all the spectre nonsense, and the mitigations that followed from
that.

> AFAIK, a big performance killer is the branch mispredictions coming
> from big switch/case or if/else if/else blocks and it was precisely
> the reason why you removed the big switch/case io_uring was having
> with function pointers in io_issue_def...

For sure, which is why io_uring itself ended up using indirect function
calls, because the table just became unwieldy. But that's a different
case from adding it for just a single case, or two. For those, branch
prediction should be fine, as it would always have the same outcome.

> I consume an enormous amount of programming learning material daily
> and this is the first time that I am hearing this.

The kernel and backend programming are a bit different in that regard,
for better or for worse.

> If there was a performance concern about this type of construct and
> considering that my main programming language is C++, I am a bit
> surprised that I have not seen anything about some problems with C++
> vtbls...

It's definitely slower than a direct function call, regardless of
whether this is in the kernel or not. It can be mitigated by having the
common case be predicted with a branch. See INDIRECT_CALL_*() in the
kernel.

> but oh well, I am learning new stuff every day, so please share the
> references you have about the topic so that I can perfect my
> knowledge.

I think lwn had a recent thing on indirect function calls as it
pertains to the security modules, I'd check that first. But the spectre
thing above is likely all you need!
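[Editor's note: a hand-rolled sketch of the idea behind the kernel's INDIRECT_CALL_*() macros mentioned above — compare the function pointer against the expected common target and make a direct, branch-predictable call when it matches, only falling back to the retpolined indirect call otherwise. Function names here are illustrative, not kernel code.]

```c
/*
 * Sketch of the INDIRECT_CALL_1() pattern: the comparison below is a
 * highly predictable branch (same outcome almost every time), and the
 * matching arm is a plain direct call, so the common case avoids the
 * mitigated indirect-call path entirely.
 */
static int fast_path(int x) { return x + 1; }
static int slow_path(int x) { return x - 1; }

static int dispatch(int (*fn)(int), int x)
{
	if (fn == fast_path)
		return fast_path(x);	/* direct call, cheap and predicted */
	return fn(x);			/* rare: true indirect call */
}
```

The real macros in include/linux/indirect_call_wrapper.h generate exactly this shape for one, two, three, or four likely targets.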
On 8/13/24 3:34 PM, Olivier Langlois wrote:
> On Tue, 2024-08-13 at 12:33 -0600, Jens Axboe wrote:
>> [...]
>> I don't think we should create ops for this, unless there's a strict
>> need to do so. Indirect function calls aren't cheap, and the CPU side
>> mitigations for security issues made them worse.
>
> if indirection is a very big deal, it might be possible to remove one
> level of indirection.

It's not that it's a huge deal, it's just more that if we're dealing
with a single abstraction, then I think it's somewhat overdesigning for
the use case. And I'd prefer to avoid that.

> I did entertain the idea of making copies of the io_napi_tracking_ops
> structs instead of storing their addresses. I did not keep this
> option because of the way that I implemented io_napi_get_tracking()...
>
> but if this would be an acceptable compromise, this is definitely
> something possible.

Doesn't really change it, I think. See above.
On 8/13/24 22:25, Olivier Langlois wrote:
> On Tue, 2024-08-13 at 12:33 -0600, Jens Axboe wrote:
>> [...]
>
> ok. Do you have a reference explaining this?
> and what type of construct would you use instead?
>
> AFAIK, a big performance killer is the branch mispredictions coming
> from big switch/case or if/else if/else blocks and it was precisely
> the reason why you removed the big switch/case io_uring was having
> with function pointers in io_issue_def...

Compilers can optimise switch-case very well; look up what jump tables
are, they often work even better than indirect functions, even without
mitigations. And it wasn't converted because of performance, it was a
nice efficient jump table before.

And it's not like compilers can devirtualise indirect calls either; I'd
say it hits the pipeline even harder. Maybe not as hard as a long
if-else-if chain in the final binary, but jump tables help, and we're
talking about a single "if". I totally agree, it's way over-engineered.

> I consume an enormous amount of programming learning material daily
> and this is the first time that I am hearing this.
>
> If there was a performance concern about this type of construct and
> considering that my main programming language is C++, I am a bit
> surprised that I have not seen anything about some problems with C++
> vtbls...

Even without the mitigation business, you can look up a lot about
devirtualisation, which is also why the "final" keyword exists in C++.
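[Editor's note: a small illustration of the jump-table point above. Given a dense switch over a small enum like this, compilers at -O2 (absent -fno-jump-tables) typically emit a single indexed jump rather than a chain of compare-and-branch instructions. The names are invented for the example.]

```c
/* a dense switch over a small, contiguous enum */
enum tracking_mode { TRACK_NONE, TRACK_DYNAMIC, TRACK_STATIC };

static int handle_none(void)    { return 0; }
static int handle_dynamic(void) { return 1; }
static int handle_static(void)  { return 2; }

static int dispatch_mode(enum tracking_mode m)
{
	/* dense cases -> compiler can index a jump table instead of
	 * testing each value in turn */
	switch (m) {
	case TRACK_NONE:    return handle_none();
	case TRACK_DYNAMIC: return handle_dynamic();
	case TRACK_STATIC:  return handle_static();
	}
	return -1;
}
```

Whether this actually beats an indirect call depends on the mitigation flags in effect, which is exactly the correction made in the follow-up message below.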
On 8/13/24 23:36, Pavel Begunkov wrote:
> On 8/13/24 22:25, Olivier Langlois wrote:
>> [...]
>> AFAIK, a big performance killer is the branch mispredictions coming
>> from big switch/case or if/else if/else blocks [...]
>
> Compilers can optimise switch-case very well, look up what jump
> tables is, often works even better than indirect functions even
> without mitigations. And it wasn't converted because of performance,
> it was a nice efficient jump table before.

Correction: switches were optimisable until -fno-jump-tables was added
because of speculation concerns, which makes them equal with indirect
calls; and otherwise they were usually retpolined, making them equal to
indirect calls.
On Tue, 2024-08-13 at 15:44 -0600, Jens Axboe wrote:
> On 8/13/24 3:25 PM, Olivier Langlois wrote:
>> [...]
>> ok. Do you have a reference explaining this?
>> and what type of construct would you use instead?
>
> See all the spectre nonsense, and the mitigations that followed from
> that.
>
> It's definitely slower than a direct function call, regardless of
> whether this is in the kernel or not. Can be mitigated by having the
> common case be predicted with a branch. See INDIRECT_CALL_*() in the
> kernel.
>
> I think lwn had a recent thing on indirect function calls as it
> pertains to the security modules, I'd check that first. But the
> spectre thing above is likely all you need!

Jens, thx a lot for the clarifications. I will for sure investigate
these leads to better understand your fear of function callbacks... I
have little interest in Spectre and other mitigations and security in
general, so I know very little about those topics.

A guy whose knowledge I value a lot is Chandler Carruth from Google,
who gave a talk about Spectre in 2018:
https://www.youtube.com/watch?v=_f7O3IfIR2k

I will rewatch his talk and check LWN about the INDIRECT_CALL_*()
macros for indirect function calls that you mention...

AFAIK, the various kernel mitigations are mostly applied when
transiting from kernel mode back to userspace, because otherwise the
compiled code for a userspace program would be pretty much identical to
kernel compiled code.

To my eyes, what is really important is that the absolute best
technical solution is chosen, and the only way that this discussion can
be settled is with numbers. So I have created a small benchmark program
to compare a function pointer indirect call vs selecting a function in
a 3-branch if/else if/else block. Here are the results:

----------------------------------------------------------
Benchmark             Time             CPU     Iterations
----------------------------------------------------------
BM_test_virtual      0.628 ns        0.627 ns   930255515
BM_test_ifElse        1.59 ns         1.58 ns   446805050

Add this result to my concession in
https://lore.kernel.org/io-uring/f86da1b705e98cac8c72e807ca50d2b4ce3a50a2.camel@trillion01.com/
that you are right for 2 of the function pointers out of the 3
functions of io_napi_tracking_ops...

Hopefully this discussion will lead us toward the best solution. Keep
in mind that this point is a minuscule issue. If you prefer the 2.5x
slower construct for any reason that you do not have to justify, I'll
accept your decision and rework my proposal to go your way. I believe
that offering some form of NAPI static tracking is still a very
interesting feature no matter what the outcome is for this very minor
technical issue.

code:

/*
 * virtual overhead vs if/else google benchmark
 *
 * Olivier Langlois - August 15, 2024
 *
 * compile cmd:
 * g++ -std=c++26 -I.. -pthread -Wall -g -O3 -pipe
 *     -fno-omit-frame-pointer bench_virtual.cpp -lbenchmark -o bench_virtual
 */

#include "benchmark/benchmark.h"

/*
 * CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and
 * Compilers! Oh My!"
 * https://www.youtube.com/watch?v=nXaxk27zwlk
 */
static void escape(void *p)
{
	asm volatile("" : : "g"(p) : "memory");
}

bool no_tracking_do_busy_loop()
{
	int res{0};
	escape(&res);
	return res;
}

bool dynamic_tracking_do_busy_loop()
{
	int res{1};
	escape(&res);
	return res;
}

bool static_tracking_do_busy_loop()
{
	int res{2};
	escape(&res);
	return res;
}

class io_napi_tracking_ops {
public:
	virtual bool do_busy_loop() noexcept = 0;
};

class static_tracking_ops : public io_napi_tracking_ops {
public:
	bool do_busy_loop() noexcept override;
};

bool static_tracking_ops::do_busy_loop() noexcept
{
	return static_tracking_do_busy_loop();
}

bool testVirtual(io_napi_tracking_ops *ptr)
{
	return ptr->do_busy_loop();
}

bool testIfElseDispatch(int i)
{
	if (i == 0)
		return no_tracking_do_busy_loop();
	else if (i == 1)
		return dynamic_tracking_do_busy_loop();
	else
		return static_tracking_do_busy_loop();
}

void BM_test_virtual(benchmark::State &state)
{
	static_tracking_ops vObj;
	volatile io_napi_tracking_ops *ptr = &vObj;
	for (auto _ : state) {
		benchmark::DoNotOptimize(testVirtual(
			const_cast<io_napi_tracking_ops *>(ptr)));
	}
}

void BM_test_ifElse(benchmark::State &state)
{
	volatile int i = 2;
	for (auto _ : state) {
		benchmark::DoNotOptimize(testIfElseDispatch(i));
	}
}

BENCHMARK(BM_test_virtual);
BENCHMARK(BM_test_ifElse);
BENCHMARK_MAIN();
On Thu, 2024-08-15 at 18:17 -0400, Olivier Langlois wrote:
> [...]
> So I have created a small benchmark program to compare a function
> pointer indirect call vs selecting a function in a 3-branch
> if/else if/else block. Here are the results:
>
> ----------------------------------------------------------
> Benchmark             Time             CPU     Iterations
> ----------------------------------------------------------
> BM_test_virtual      0.628 ns        0.627 ns   930255515
> BM_test_ifElse        1.59 ns         1.58 ns   446805050

I have fixed my benchmark:

----------------------------------------------------------
Benchmark             Time             CPU     Iterations
----------------------------------------------------------
BM_test_virtual       2.57 ns         2.53 ns   277764970
BM_test_ifElse        1.58 ns         1.57 ns   445197861

code:

using Func_t = bool (*)();

bool testVirtual(Func_t ptr)
{
	return ptr();
}

void BM_test_virtual(benchmark::State &state)
{
	volatile Func_t ptr = &static_tracking_do_busy_loop;
	for (auto _ : state) {
		benchmark::DoNotOptimize(testVirtual(ptr));
	}
}
On 8/15/24 23:44, Olivier Langlois wrote:
> On Thu, 2024-08-15 at 18:17 -0400, Olivier Langlois wrote:
>> [...]
>> So I have created a small benchmark program to compare a function
>> pointer indirect call vs selecting a function in a 3-branch
>> if/else if/else block.

FWIW, it's just one branch / two different options in this case. We
shouldn't be doing a call if napi has never been requested, so a
napi_dont_do_anything callback is not an option.

> I have fixed my benchmark:
>
> ----------------------------------------------------------
> Benchmark             Time             CPU     Iterations
> ----------------------------------------------------------
> BM_test_virtual       2.57 ns         2.53 ns   277764970
> BM_test_ifElse        1.58 ns         1.57 ns   445197861

You passed the compiler flags for mitigations, right?

-mindirect-branch=thunk -mfunction-return=thunk
-mindirect-branch-register

Regardless of overhead, my concern is the complexity and amount of
extra code. It's just over-engineered. It's like adding a virtual
templated hierarchy of classes just to implement a program that prints
"2".

I pushed what I had (2 last patches), you can use it as a reference,
but be aware that it's a completely untested draft with some obvious
problems and ugly uapi.

https://github.com/isilence/linux.git manual-napi
https://github.com/isilence/linux/commits/manual-napi/
On Fri, 2024-08-16 at 15:26 +0100, Pavel Begunkov wrote:
> I pushed what I had (2 last patches), you can use it as a
> reference, but be aware that it's a completely untested
> draft with some obvious problems and ugly uapi.
>
> https://github.com/isilence/linux.git manual-napi
> https://github.com/isilence/linux/commits/manual-napi/

I did review what you have done... I like your take on this feature
idea... especially the autoremove field.

The one aspect that I prefer in my version over yours is that my
implementation totally avoids calling __io_napi_add(), which includes a
table lookup, from io_poll_check_events(), and that lookup is a big
chunk of the dynamic tracking overhead.

Also, when the napi devices are iterated in __io_napi_do_busy_loop(),
my version totally removes the conditional branching from the loop by
having 2 distinct busy loop functions. But I guess the last point is
arguably unimportant since the code is busy polling...

Anyway, I had the time to create a v2 version of my implementation that
addresses the remarks made about v1, and I have completed testing. It
has been running on my system for the last 24h flawlessly... I will
post it here shortly.

Please take a look at it and pick the best features of both
implementations.