Message ID | cover.1723567469.git.olivier@trillion01.com
Series | abstract napi tracking strategy
On 8/13/24 10:44 AM, Olivier Langlois wrote:
> the actual napi tracking strategy is inducing a non-negligible overhead.
> Every time a multishot poll is triggered or any poll armed, if napi is
> enabled on the ring, a lookup is performed to either add a new napi id
> into the napi_list or update its timeout value.
>
> For many scenarios, this is overkill as the napi id list will be pretty
> much static most of the time. To address this common scenario, a new
> abstraction has been created following the common Linux kernel idiom of
> creating an abstract interface with a struct filled with function
> pointers.
>
> Creating an alternate napi tracking strategy is therefore made in 2
> phases.
>
> 1. Introduce the io_napi_tracking_ops interface
> 2. Implement a static napi tracking by defining a new io_napi_tracking_ops

I don't think we should create ops for this, unless there's a strict
need to do so. Indirect function calls aren't cheap, and the CPU-side
mitigations for security issues made them worse.

You're not wrong that ops is not an uncommon idiom in the kernel, but
it's a lot less prevalent as a solution than it used to be, exactly
because of the above reasons.
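[Editor's note: for readers following along, a minimal sketch of the kind of ops struct the cover letter describes — an interface struct filled with function pointers, with one implementation per tracking strategy. The field and function names here are guesses for illustration, not the actual patch.]

```c
#include <stdbool.h>
#include <stddef.h>

/* hypothetical stand-in for the real kernel context type */
struct io_ring_ctx;

/* one tracking strategy == one set of callbacks */
struct io_napi_tracking_ops {
	/* called when a poll is armed or a multishot poll triggers */
	void (*add_id)(struct io_ring_ctx *ctx, unsigned int napi_id);
	/* called from the napi busy-poll loop */
	bool (*do_busy_loop)(struct io_ring_ctx *ctx);
};

/*
 * "static" strategy: the napi id list never changes after setup,
 * so the per-poll add becomes a no-op and the lookup overhead the
 * cover letter complains about disappears.
 */
static void static_add_id(struct io_ring_ctx *ctx, unsigned int napi_id)
{
	(void)ctx;
	(void)napi_id;	/* nothing to do: the list is fixed */
}

static bool static_do_busy_loop(struct io_ring_ctx *ctx)
{
	(void)ctx;
	return true;	/* placeholder for iterating the fixed list */
}

static const struct io_napi_tracking_ops static_tracking = {
	.add_id		= static_add_id,
	.do_busy_loop	= static_do_busy_loop,
};
```

Every call site would then go through `ctx->ops->add_id(...)` style indirect calls, which is exactly the cost being debated in the replies below.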
On Tue, 2024-08-13 at 12:33 -0600, Jens Axboe wrote:
> [...]
> I don't think we should create ops for this, unless there's a strict
> need to do so. Indirect function calls aren't cheap, and the CPU side
> mitigations for security issues made them worse.
>
> You're not wrong that ops is not an uncommon idiom in the kernel, but
> it's a lot less prevalent as a solution than it used to. Exactly
> because of the above reasons.

ok. Do you have a reference explaining this? And what type of construct
would you use instead?

AFAIK, a big performance killer is the branch mispredictions coming from
big switch/case or if/else if/else blocks, and it was precisely the
reason why you removed the big switch/case io_uring was having with
function pointers in io_issue_def...

I consume an enormous amount of programming learning material daily and
this is the first time that I am hearing this.

If there was a performance concern about this type of construct, and
considering that my main programming language is C++, I am a bit
surprised that I have not seen anything about problems with C++ vtbls...

but oh well, I am learning new stuff every day, so please share the
references you have about the topic so that I can perfect my knowledge.

thank you,
On Tue, 2024-08-13 at 12:33 -0600, Jens Axboe wrote:
> [...]
> I don't think we should create ops for this, unless there's a strict
> need to do so. Indirect function calls aren't cheap, and the CPU side
> mitigations for security issues made them worse.

if indirection is a very big deal, it might be possible to remove one
level of indirection. I did entertain the idea of making copies of the
io_napi_tracking_ops structs instead of storing their addresses. I did
not keep this option because of the way that I implemented
io_napi_get_tracking()...

but if this would be an acceptable compromise, this is definitely
something possible.
On 8/13/24 3:25 PM, Olivier Langlois wrote:
> On Tue, 2024-08-13 at 12:33 -0600, Jens Axboe wrote:
>> [...]
>> I don't think we should create ops for this, unless there's a strict
>> need to do so. Indirect function calls aren't cheap, and the CPU side
>> mitigations for security issues made them worse.
>
> ok. Do you have a reference explaining this?
> and what type of construct would you use instead?

See all the spectre nonsense, and the mitigations that followed from
that.

> AFAIK, a big performance killer is the branch mispredictions coming
> from big switch/case or if/else if/else blocks and it was precisely
> the reason why you removed the big switch/case io_uring was having
> with function pointers in io_issue_def...

For sure, which is why io_uring itself ended up using indirect function
calls, because the table just became unwieldy. But that's a different
case from adding it for just a single case, or two. For those, branch
prediction should be fine, as it would always have the same outcome.

> I consume an enormous amount of programming learning material daily
> and this is the first time that I am hearing this.

The kernel and backend programming are a bit different in that regard,
for better or for worse.

> If there was a performance concern about this type of construct and
> considering that my main programming language is C++, I am a bit
> surprised that I have not seen anything about some problems with C++
> vtbls...

It's definitely slower than a direct function call, regardless of
whether this is in the kernel or not. It can be mitigated by having the
common case be predicted with a branch. See INDIRECT_CALL_*() in the
kernel.

> but oh well, I am learning new stuff every day, so please share the
> references you have about the topic so that I can perfect my
> knowledge.

I think lwn had a recent thing on indirect function calls as it
pertains to the security modules, I'd check that first. But the spectre
thing above is likely all you need!
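[Editor's note: a hand-rolled sketch of the idea behind the kernel's INDIRECT_CALL_*() macros mentioned above — compare the function pointer against the expected common target and make a direct, branch-predictable call when it matches, only falling back to the retpolined indirect call otherwise. Function names here are illustrative, not kernel code.]

```c
/*
 * Sketch of the INDIRECT_CALL_1() pattern: the comparison below is a
 * highly predictable branch (same outcome almost every time), and the
 * matching arm is a plain direct call, so the common case avoids the
 * mitigated indirect-call path entirely.
 */
static int fast_path(int x) { return x + 1; }
static int slow_path(int x) { return x - 1; }

static int dispatch(int (*fn)(int), int x)
{
	if (fn == fast_path)
		return fast_path(x);	/* direct call, cheap and predicted */
	return fn(x);			/* rare: true indirect call */
}
```

The real macros in include/linux/indirect_call_wrapper.h generate exactly this shape for one, two, three, or four likely targets.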
On 8/13/24 3:34 PM, Olivier Langlois wrote:
> On Tue, 2024-08-13 at 12:33 -0600, Jens Axboe wrote:
>> [...]
>> I don't think we should create ops for this, unless there's a strict
>> need to do so. Indirect function calls aren't cheap, and the CPU side
>> mitigations for security issues made them worse.
>
> if indirection is a very big deal, it might be possible to remove one
> level of indirection.

It's not that it's a huge deal, it's just more that if we're dealing
with a single abstraction, then I think it's somewhat overdesigning for
the use case. And I'd prefer to avoid that.

> I did entertain the idea of making copies of the io_napi_tracking_ops
> structs instead of storing their addresses. I did not keep this
> option because of the way that I implemented io_napi_get_tracking()...
>
> but if this would be an acceptable compromise, this is definitely
> something possible.

Doesn't really change it, I think. See above.
On 8/13/24 22:25, Olivier Langlois wrote:
> On Tue, 2024-08-13 at 12:33 -0600, Jens Axboe wrote:
>> [...]
>
> ok. Do you have a reference explaining this?
> and what type of construct would you use instead?
>
> AFAIK, a big performance killer is the branch mispredictions coming
> from big switch/case or if/else if/else blocks and it was precisely
> the reason why you removed the big switch/case io_uring was having
> with function pointers in io_issue_def...

Compilers can optimise switch-case very well; look up what jump tables
are, they often work even better than indirect functions, even without
mitigations. And it wasn't converted because of performance, it was a
nice efficient jump table before.

And it's not like compilers can devirtualise indirect calls either; I'd
say it hits the pipeline even harder. Maybe not as hard as a long
if-else-if chain in the final binary, but jump tables help, and we're
talking about a single "if". I totally agree, it's way over-engineered.

> I consume an enormous amount of programming learning material daily
> and this is the first time that I am hearing this.
>
> If there was a performance concern about this type of construct and
> considering that my main programming language is C++, I am a bit
> surprised that I have not seen anything about some problems with C++
> vtbls...

Even without the mitigation business, you can look up a lot about
devirtualisation, which is also why the "final" keyword exists in C++.
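[Editor's note: a small illustration of the jump-table point above. Given a dense switch over a small enum like this, compilers at -O2 (absent -fno-jump-tables) typically emit a single indexed jump rather than a chain of compare-and-branch instructions. The names are invented for the example.]

```c
/* a dense switch over a small, contiguous enum */
enum tracking_mode { TRACK_NONE, TRACK_DYNAMIC, TRACK_STATIC };

static int handle_none(void)    { return 0; }
static int handle_dynamic(void) { return 1; }
static int handle_static(void)  { return 2; }

static int dispatch_mode(enum tracking_mode m)
{
	/* dense cases -> compiler can index a jump table instead of
	 * testing each value in turn */
	switch (m) {
	case TRACK_NONE:    return handle_none();
	case TRACK_DYNAMIC: return handle_dynamic();
	case TRACK_STATIC:  return handle_static();
	}
	return -1;
}
```

Whether this actually beats an indirect call depends on the mitigation flags in effect, which is exactly the correction made in the follow-up message below.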
On 8/13/24 23:36, Pavel Begunkov wrote:
> On 8/13/24 22:25, Olivier Langlois wrote:
>> [...]
>> AFAIK, a big performance killer is the branch mispredictions coming
>> from big switch/case or if/else if/else blocks [...]
>
> Compilers can optimise switch-case very well, look up what jump
> tables is, often works even better than indirect functions even
> without mitigations. And it wasn't converted because of performance,
> it was a nice efficient jump table before.

Correction: switches were optimisable until -fno-jump-tables was added
because of speculation concerns, which makes them equal with indirect
calls; and otherwise they were usually retpolined, making them equal to
indirect calls.
On Tue, 2024-08-13 at 15:44 -0600, Jens Axboe wrote:
> On 8/13/24 3:25 PM, Olivier Langlois wrote:
>> [...]
>> ok. Do you have a reference explaining this?
>> and what type of construct would you use instead?
>
> See all the spectre nonsense, and the mitigations that followed from
> that.
>
> It's definitely slower than a direct function call, regardless of
> whether this is in the kernel or not. Can be mitigated by having the
> common case be predicted with a branch. See INDIRECT_CALL_*() in the
> kernel.
>
> I think lwn had a recent thing on indirect function calls as it
> pertains to the security modules, I'd check that first. But the
> spectre thing above is likely all you need!

Jens, thx a lot for the clarifications. I will for sure investigate
these leads to better understand your fear of function callbacks... I
have little interest in Spectre and other mitigations and security in
general, so I know very little about those topics.

A guy whose knowledge I value a lot is Chandler Carruth from Google,
who gave a talk about Spectre in 2018:
https://www.youtube.com/watch?v=_f7O3IfIR2k

I will rewatch his talk and check LWN about the INDIRECT_CALL_*()
macros for indirect function calls that you mention...

AFAIK, the various kernel mitigations are mostly applied when
transiting from kernel mode back to userspace, because otherwise the
compiled code for a userspace program would be pretty much identical to
kernel compiled code.

To my eyes, what is really important is that the absolute best
technical solution is chosen, and the only way that this discussion can
be settled is with numbers. So I have created a small benchmark program
to compare a function pointer indirect call vs selecting a function in
a 3-branch if/else if/else block. Here are the results:

----------------------------------------------------------
Benchmark             Time             CPU     Iterations
----------------------------------------------------------
BM_test_virtual      0.628 ns        0.627 ns   930255515
BM_test_ifElse        1.59 ns         1.58 ns   446805050

Add this result to my concession in
https://lore.kernel.org/io-uring/f86da1b705e98cac8c72e807ca50d2b4ce3a50a2.camel@trillion01.com/
that you are right for 2 of the function pointers out of the 3
functions of io_napi_tracking_ops...

Hopefully this discussion will lead us toward the best solution. Keep
in mind that this point is a minuscule issue. If you prefer the 2.5x
slower construct for any reason that you do not have to justify, I'll
accept your decision and rework my proposal to go your way. I believe
that offering some form of NAPI static tracking is still a very
interesting feature no matter what the outcome is for this very minor
technical issue.

code:

/*
 * virtual overhead vs if/else google benchmark
 *
 * Olivier Langlois - August 15, 2024
 *
 * compile cmd:
 * g++ -std=c++26 -I.. -pthread -Wall -g -O3 -pipe
 *     -fno-omit-frame-pointer bench_virtual.cpp -lbenchmark -o bench_virtual
 */

#include "benchmark/benchmark.h"

/*
 * CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and
 * Compilers! Oh My!"
 * https://www.youtube.com/watch?v=nXaxk27zwlk
 */
static void escape(void *p)
{
	asm volatile("" : : "g"(p) : "memory");
}

bool no_tracking_do_busy_loop()
{
	int res{0};
	escape(&res);
	return res;
}

bool dynamic_tracking_do_busy_loop()
{
	int res{1};
	escape(&res);
	return res;
}

bool static_tracking_do_busy_loop()
{
	int res{2};
	escape(&res);
	return res;
}

class io_napi_tracking_ops {
public:
	virtual bool do_busy_loop() noexcept = 0;
};

class static_tracking_ops : public io_napi_tracking_ops {
public:
	bool do_busy_loop() noexcept override;
};

bool static_tracking_ops::do_busy_loop() noexcept
{
	return static_tracking_do_busy_loop();
}

bool testVirtual(io_napi_tracking_ops *ptr)
{
	return ptr->do_busy_loop();
}

bool testIfElseDispatch(int i)
{
	if (i == 0)
		return no_tracking_do_busy_loop();
	else if (i == 1)
		return dynamic_tracking_do_busy_loop();
	else
		return static_tracking_do_busy_loop();
}

void BM_test_virtual(benchmark::State &state)
{
	static_tracking_ops vObj;
	volatile io_napi_tracking_ops *ptr = &vObj;
	for (auto _ : state) {
		benchmark::DoNotOptimize(testVirtual(
			const_cast<io_napi_tracking_ops *>(ptr)));
	}
}

void BM_test_ifElse(benchmark::State &state)
{
	volatile int i = 2;
	for (auto _ : state) {
		benchmark::DoNotOptimize(testIfElseDispatch(i));
	}
}

BENCHMARK(BM_test_virtual);
BENCHMARK(BM_test_ifElse);
BENCHMARK_MAIN();
On Thu, 2024-08-15 at 18:17 -0400, Olivier Langlois wrote:
> [...]
> So I have created a small benchmark program to compare a function
> pointer indirect call vs selecting a function in a 3-branch
> if/else if/else block. Here are the results:
>
> ----------------------------------------------------------
> Benchmark             Time             CPU     Iterations
> ----------------------------------------------------------
> BM_test_virtual      0.628 ns        0.627 ns   930255515
> BM_test_ifElse        1.59 ns         1.58 ns   446805050

I have fixed my benchmark:

----------------------------------------------------------
Benchmark             Time             CPU     Iterations
----------------------------------------------------------
BM_test_virtual       2.57 ns         2.53 ns   277764970
BM_test_ifElse        1.58 ns         1.57 ns   445197861

code:

using Func_t = bool (*)();

bool testVirtual(Func_t ptr)
{
	return ptr();
}

void BM_test_virtual(benchmark::State &state)
{
	volatile Func_t ptr = &static_tracking_do_busy_loop;
	for (auto _ : state) {
		benchmark::DoNotOptimize(testVirtual(ptr));
	}
}
On 8/15/24 23:44, Olivier Langlois wrote:
> On Thu, 2024-08-15 at 18:17 -0400, Olivier Langlois wrote:
>> [...]
>> So I have created a small benchmark program to compare a function
>> pointer indirect call vs selecting a function in a 3-branch
>> if/else if/else block.

FWIW, it's just one branch / two different options in this case. We
shouldn't be doing a call if napi has never been requested, so a
napi_dont_do_anything callback is not an option.

> I have fixed my benchmark:
>
> ----------------------------------------------------------
> Benchmark             Time             CPU     Iterations
> ----------------------------------------------------------
> BM_test_virtual       2.57 ns         2.53 ns   277764970
> BM_test_ifElse        1.58 ns         1.57 ns   445197861

You passed the compiler flags for mitigations, right?

-mindirect-branch=thunk -mfunction-return=thunk
-mindirect-branch-register

Regardless of overhead, my concern is the complexity and amount of
extra code. It's just over-engineered. It's like adding a virtual
templated hierarchy of classes just to implement a program that prints
"2".

I pushed what I had (2 last patches), you can use it as a reference,
but be aware that it's a completely untested draft with some obvious
problems and ugly uapi.

https://github.com/isilence/linux.git manual-napi
https://github.com/isilence/linux/commits/manual-napi/
On Fri, 2024-08-16 at 15:26 +0100, Pavel Begunkov wrote:
> I pushed what I had (2 last patches), you can use it as a
> reference, but be aware that it's a completely untested
> draft with some obvious problems and ugly uapi.
>
> https://github.com/isilence/linux.git manual-napi
> https://github.com/isilence/linux/commits/manual-napi/

I did review what you have done... I like your take on this feature
idea... especially the autoremove field.

The one aspect that I prefer in my version over yours is that my
implementation totally avoids calling __io_napi_add(), which includes a
table lookup, from io_poll_check_events(), and that lookup is a big
chunk of the dynamic tracking overhead.

Also, when the napi devices are iterated in __io_napi_do_busy_loop(),
my version totally removes the conditional branching from the loop by
having 2 distinct busy loop functions. But I guess the last point is
arguably unimportant since the code is busy polling...

Anyway, I had the time to create a v2 version of my implementation that
addresses the remarks made about v1, and I have completed testing. It
has been running on my system for the last 24h flawlessly... I will
post it here shortly.

Please take a look at it and pick the best features of both
implementations.