
[RFC,0/2] mm/page_alloc: free_pcppages_bulk safeguard

Message ID: 20230817-free_pcppages_bulk-v1-0-c14574a9f80c@kernel.org

Message

Chris Li Aug. 18, 2023, 6:05 a.m. UTC
In this patch series I want to safeguard
free_pcppages_bulk() against changes to
pcp->count outside of this function, e.g.
by a BPF program injected on the function tracepoint.

I break the series up into two separate patches,
one for the safeguard and one for the clean up.

Hopefully that is easier to review.
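The hazard can be sketched with a toy userspace model (illustrative only:
`struct toy_pcp`, `drain_unsafe`, and `drain_safe` are made-up names, not
the kernel code) showing why trusting a count snapshot taken before the
drain loop breaks when pcp->count changes mid-loop:

```c
#include <assert.h>

/* Toy model of a pcp free list: 'count' tracks entries, but a hook
 * (modeled by the 'leak' parameters) may consume entries behind our
 * back mid-loop. Names are illustrative, not the kernel's structures. */
struct toy_pcp { int count; };

/* Unsafe variant: trusts the count snapshot taken before the loop. */
static int drain_unsafe(struct toy_pcp *pcp, int todo, int leak_at, int leak)
{
    int freed = 0;
    todo = todo < pcp->count ? todo : pcp->count;  /* min(count, pcp->count) */
    while (todo-- > 0) {
        if (freed == leak_at)
            pcp->count -= leak;  /* state changed behind our back */
        pcp->count--;            /* may drive count negative */
        freed++;
    }
    return pcp->count;
}

/* Safeguarded variant: re-checks the live count every iteration and
 * bails out instead of underflowing. */
static int drain_safe(struct toy_pcp *pcp, int todo, int leak_at, int leak)
{
    int freed = 0;
    while (todo-- > 0 && pcp->count > 0) {
        if (freed == leak_at)
            pcp->count -= leak;
        if (pcp->count <= 0)
            break;               /* live count is gone: stop draining */
        pcp->count--;
        freed++;
    }
    return pcp->count;
}
```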

Signed-off-by: Chris Li <chrisl@kernel.org>
---
Chris Li (2):
      mm/page_alloc: safeguard free_pcppages_bulk
      mm/page_alloc: free_pcppages_bulk clean up

 mm/page_alloc.c | 44 +++++++++++++-------------------------------
 1 file changed, 13 insertions(+), 31 deletions(-)
---
base-commit: 5fb2ea3111f4ecc6dc4891ce5b00f0217aae9a04
change-id: 20230817-free_pcppages_bulk-facc18d6fee7

Best regards,

Comments

Mel Gorman Aug. 21, 2023, 10:32 a.m. UTC | #1
On Thu, Aug 17, 2023 at 11:05:22PM -0700, Chris Li wrote:
> In this patch series I want to safeguard
> the free_pcppage_bulk against change in the
> pcp->count outside of this function. e.g.
> by BPF program inject on the function tracepoint.
> 
> I break up the patches into two seperate patches
> for the safeguard and clean up.
> 
> Hopefully that is easier to review.
> 
> Signed-off-by: Chris Li <chrisl@kernel.org>

This sounds like a maintenance nightmare if internal state can be arbitrarily
modified by a BPF program and still be expected to work properly in all cases.
Every review would have to take into account "what if a BPF script modifies
state behind our back?"
Kemeng Shi Aug. 22, 2023, 1:27 a.m. UTC | #2
on 8/21/2023 6:32 PM, Mel Gorman wrote:
> On Thu, Aug 17, 2023 at 11:05:22PM -0700, Chris Li wrote:
>> In this patch series I want to safeguard
>> the free_pcppage_bulk against change in the
>> pcp->count outside of this function. e.g.
>> by BPF program inject on the function tracepoint.
>>
>> I break up the patches into two seperate patches
>> for the safeguard and clean up.
>>
>> Hopefully that is easier to review.
>>
>> Signed-off-by: Chris Li <chrisl@kernel.org>
> 
> This sounds like a maintenance nightmare if internal state can be arbitrary
> modified by a BPF program and still expected to work properly in all cases.
> Every review would have to take into account "what if a BPF script modifies
> state behind our back?"
> 
Agreed. We assume pcp->count is protected by pcp->lock. Instead of making the
code work when pcp->count can be changed without the lock held, it is more
reasonable to have the BPF program modify pcp->count with pcp->lock held.
Chris Li Aug. 22, 2023, 5:48 p.m. UTC | #3
Hi Mel,

Adding Alexei to the discussion.

On Mon, Aug 21, 2023 at 3:32 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Thu, Aug 17, 2023 at 11:05:22PM -0700, Chris Li wrote:
> > In this patch series I want to safeguard
> > the free_pcppage_bulk against change in the
> > pcp->count outside of this function. e.g.
> > by BPF program inject on the function tracepoint.
> >
> > I break up the patches into two seperate patches
> > for the safeguard and clean up.
> >
> > Hopefully that is easier to review.
> >
> > Signed-off-by: Chris Li <chrisl@kernel.org>
>
> This sounds like a maintenance nightmare if internal state can be arbitrary
> modified by a BPF program and still expected to work properly in all cases.
> Every review would have to take into account "what if a BPF script modifies
> state behind our back?"

Thanks for the feedback.

I agree that it is hard to support BPF changing arbitrary internal
state as a rule. That is why this is an RFC. Would you consider it on a
case-by-case basis?
The kernel panic is bad, and the first patch is actually very small. I can
also change it to generate warnings when we detect the inconsistent state.

How about the second (clean up) patch, or Kemeng's clean up version? I can
modify it to take out the pcp->count check if the verdict is that BPF
changing internal state is just not supported at all. I do wish to get rid
of the pindex_min and pindex_max.

Thanks

Chris
Matthew Wilcox Aug. 22, 2023, 6:28 p.m. UTC | #4
On Tue, Aug 22, 2023 at 10:48:42AM -0700, Chris Li wrote:
> Hi Mel,
> 
> Adding Alexei to the discussion.
> 
> On Mon, Aug 21, 2023 at 3:32 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Thu, Aug 17, 2023 at 11:05:22PM -0700, Chris Li wrote:
> > > In this patch series I want to safeguard
> > > the free_pcppage_bulk against change in the
> > > pcp->count outside of this function. e.g.
> > > by BPF program inject on the function tracepoint.
> > >
> > > I break up the patches into two seperate patches
> > > for the safeguard and clean up.
> > >
> > > Hopefully that is easier to review.
> > >
> > > Signed-off-by: Chris Li <chrisl@kernel.org>
> >
> > This sounds like a maintenance nightmare if internal state can be arbitrary
> > modified by a BPF program and still expected to work properly in all cases.
> > Every review would have to take into account "what if a BPF script modifies
> > state behind our back?"
> 
> Thanks for the feedback.
> 
> I agree that it is hard to support if we allow BPF to change any internal
> stage as a rule.  That is why it is a RFC. Would you consider it case
> by case basis?
> The kernel panic is bad, the first patch is actually very small. I can
> also change it
> to generate warnings if we detect the inconsistent state.

We wouldn't allow C code that hooks spinlocks (e.g. lockdep) to allocate
memory.  I don't understand why BPF code should allocate memory either.
Alexei Starovoitov Aug. 22, 2023, 6:57 p.m. UTC | #5
On Tue, Aug 22, 2023 at 10:48 AM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Mel,
>
> Adding Alexei to the discussion.
>
> On Mon, Aug 21, 2023 at 3:32 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Thu, Aug 17, 2023 at 11:05:22PM -0700, Chris Li wrote:
> > > In this patch series I want to safeguard
> > > the free_pcppage_bulk against change in the
> > > pcp->count outside of this function. e.g.
> > > by BPF program inject on the function tracepoint.
> > >
> > > I break up the patches into two seperate patches
> > > for the safeguard and clean up.
> > >
> > > Hopefully that is easier to review.
> > >
> > > Signed-off-by: Chris Li <chrisl@kernel.org>
> >
> > This sounds like a maintenance nightmare if internal state can be arbitrary
> > modified by a BPF program and still expected to work properly in all cases.
> > Every review would have to take into account "what if a BPF script modifies
> > state behind our back?"

Where did this concern come from?
Since when can BPF modify arbitrary state?

But I wasn't cc-ed on the original patch, so I'm not sure what it attempts to do.
Maybe the concern is valid.

> Thanks for the feedback.
>
> I agree that it is hard to support if we allow BPF to change any internal
> stage as a rule.  That is why it is a RFC. Would you consider it case
> by case basis?
> The kernel panic is bad, the first patch is actually very small. I can
> also change it
> to generate warnings if we detect the inconsistent state.

panic and warns because of bpf prog?!
bpf infra takes all the precautions to make sure that bpf progs
can never cause such damage.

>
> How about the second (clean up) patch or Keming's clean up version? I can modify
> it to take out the pcp->count if the verdict is just not supporting
> BPF changing internal
> state at all. I do wish to get rid of the pindex_min and pindex_max.
>
> Thanks
>
> Chris
Chris Li Aug. 22, 2023, 9:14 p.m. UTC | #6
Hi Kemeng,

On Mon, Aug 21, 2023 at 6:27 PM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> >
> Agreed. We assume pcp->count is protected by pcp->lock. Instead of make code
> work in case pcp->count could be changed without lock held, it's more reasonble
> to modify pcp->count with pcp->lock held in BPF program.

The lock is held when pcp->count is modified; it goes through the kernel
page allocation API. The issue is nested memory allocation inside
spin_lock() introduced by BPF.

The execution sequence is like this:

       count = min(pcp->count, count);

        /* Ensure requested pindex is drained first. */
        pindex = pindex - 1;
        bpf_injected_spin_lock_irqsave {
                 alloc_page();
                 original spin_lock_irqsave(&zone->lock, flags);
        }
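A rough userspace sketch of a re-entrancy guard that would make such a
nested call bail out instead of corrupting state (illustrative only:
`in_drain`, `drain`, and `hook` are made-up names, not the kernel's actual
mechanism; `_Thread_local` stands in for per-CPU state):

```c
#include <stdbool.h>

/* Per-"CPU" flag marking that we are inside the drain path. */
static _Thread_local bool in_drain;

static int nested_calls_rejected;

/* The hook (e.g. an injected tracepoint program) may call back into
 * drain(); forward-declared so drain() can invoke it. */
static void hook(void);

static int drain(int pages)
{
    if (in_drain) {             /* re-entered from the hook: refuse */
        nested_calls_rejected++;
        return 0;
    }
    in_drain = true;
    hook();                     /* models bpf_injected_spin_lock_irqsave */
    in_drain = false;
    return pages;               /* pages actually drained */
}

static void hook(void)
{
    drain(1);                   /* nested attempt bounces off the guard */
}
```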

Chris
Alexei Starovoitov Aug. 22, 2023, 9:19 p.m. UTC | #7
On Tue, Aug 22, 2023 at 2:15 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Kemeng,
>
> On Mon, Aug 21, 2023 at 6:27 PM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> > >
> > Agreed. We assume pcp->count is protected by pcp->lock. Instead of make code
> > work in case pcp->count could be changed without lock held, it's more reasonble
> > to modify pcp->count with pcp->lock held in BPF program.
>
> The lock is holded when pcp->count is modified. It is going through
> the kernel page
> allocation API. The issue is nest memory allocation inside spin_lock()
> introduced by BPF.
>
> The execution sequence is like this:
>
>        count = min(pcp->count, count);
>
>         /* Ensure requested pindex is drained first. */
>         pindex = pindex - 1;
>         bpf_injected_spin_lock_irqsave {
>                  alloc_page();
>                  original spin_lock_irqsave(&zone->lock, flags) ;
>         }

bpf doesn't call into alloc_page() or slab alloc or pcpu alloc from
tracing progs.
All memory is preallocated.
Can you reproduce the issue on the latest upstream kernel?
Chris Li Aug. 22, 2023, 9:29 p.m. UTC | #8
On Tue, Aug 22, 2023 at 2:19 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> >
> > The execution sequence is like this:
> >
> >        count = min(pcp->count, count);
> >
> >         /* Ensure requested pindex is drained first. */
> >         pindex = pindex - 1;
> >         bpf_injected_spin_lock_irqsave {
> >                  alloc_page();
> >                  original spin_lock_irqsave(&zone->lock, flags) ;
> >         }
>
> bpf doesn't call into alloc_page() or slab alloc or pcpu alloc from
> tracing progs.
> All memory is preallocated.

Here is the other patch submission thread, which has more detail on
how to reproduce it:
https://lore.kernel.org/linux-mm/20230817-free_pcppages_bulk-v1-1-c14574a9f80c@kernel.org/

It is on an older version of the kernel.
> Can you reproduce the issue on the latest upstream kernel?

Nope, the fix on the BPF side went in as commit c66a36af7ba3a628.
I am not aware of other cases.

It seems the consensus so far is that we don't support BPF doing
nested allocation under spin locks.
That will implicate any function called under the spinlocks as well.

Do we care about adding more warnings on this kind of allocation at all?

Chris
Chris Li Aug. 22, 2023, 9:34 p.m. UTC | #9
Hi Alexei,

On Tue, Aug 22, 2023 at 11:57 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Aug 22, 2023 at 10:48 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Mel,
> >
> > Adding Alexei to the discussion.
> >
> > On Mon, Aug 21, 2023 at 3:32 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > On Thu, Aug 17, 2023 at 11:05:22PM -0700, Chris Li wrote:
> > > > In this patch series I want to safeguard
> > > > the free_pcppage_bulk against change in the
> > > > pcp->count outside of this function. e.g.
> > > > by BPF program inject on the function tracepoint.
> > > >
> > > > I break up the patches into two seperate patches
> > > > for the safeguard and clean up.
> > > >
> > > > Hopefully that is easier to review.
> > > >
> > > > Signed-off-by: Chris Li <chrisl@kernel.org>
> > >
> > > This sounds like a maintenance nightmare if internal state can be arbitrary
> > > modified by a BPF program and still expected to work properly in all cases.
> > > Every review would have to take into account "what if a BPF script modifies
> > > state behind our back?"
>
> Where did this concern come from?
> Since when BPF can modify arbitrary state?
>
> But I wasn't cc-ed on the original patch, so not sure what it attempts to do.
> Maybe concern is valid.

Sorry I did not CC you on the original patch submission.  I should have.

Here is the link for the 1/2 patch, which has the steps to reproduce.

https://lore.kernel.org/linux-mm/20230817-free_pcppages_bulk-v1-1-c14574a9f80c@kernel.org/

It is using an older version of the BPF program. That spinlock allocation
was fixed in commit c66a36af7ba3a628.

Chris


>
> > Thanks for the feedback.
> >
> > I agree that it is hard to support if we allow BPF to change any internal
> > stage as a rule.  That is why it is a RFC. Would you consider it case
> > by case basis?
> > The kernel panic is bad, the first patch is actually very small. I can
> > also change it
> > to generate warnings if we detect the inconsistent state.
>
> panic and warns because of bpf prog?!


> bpf infra takes all the precaution to make sure that bpf progs
> can never cause such damage.
>
> >
> > How about the second (clean up) patch or Keming's clean up version? I can modify
> > it to take out the pcp->count if the verdict is just not supporting
> > BPF changing internal
> > state at all. I do wish to get rid of the pindex_min and pindex_max.
> >
> > Thanks
> >
> > Chris
>
Alexei Starovoitov Aug. 22, 2023, 9:35 p.m. UTC | #10
On Tue, Aug 22, 2023 at 2:29 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, Aug 22, 2023 at 2:19 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > >
> > > The execution sequence is like this:
> > >
> > >        count = min(pcp->count, count);
> > >
> > >         /* Ensure requested pindex is drained first. */
> > >         pindex = pindex - 1;
> > >         bpf_injected_spin_lock_irqsave {
> > >                  alloc_page();
> > >                  original spin_lock_irqsave(&zone->lock, flags) ;
> > >         }
> >
> > bpf doesn't call into alloc_page() or slab alloc or pcpu alloc from
> > tracing progs.
> > All memory is preallocated.
>
> Here is the other patch submission thread which have more detail of
> how to reproduce it:
> https://lore.kernel.org/linux-mm/20230817-free_pcppages_bulk-v1-1-c14574a9f80c@kernel.org/
>
> It is on older version of the kernel.

Please demonstrate the issue on the latest kernel.
It's an unnecessary time sink for everyone to review patches
targeting an issue in the old kernel.

> > Can you reproduce the issue on the latest upstream kernel?
>
> Hope, the fix on the BPF side went in as commit c66a36af7ba3a628.
> I am not aware of other cases.

That was a temporary workaround on the perf side.
bpf task local storage was properly fixed later.

> It seems the consensus is so far is that we don't support BPF doing
> nested allocation on spin locks.
> That will implite any function called under the spinlocks as well.

We're still talking past each other. bpf uses preallocated memory.
It might look like bpf prog is allocating, but it's actually
not calling into slab.
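That preallocation scheme can be sketched in a toy userspace model
(illustrative only: the `pool_*` names and `struct elem` are made up, not
the actual bpf allocator): elements are carved out up front, and the
tracing-time "allocation" just pops a free list, never calling into slab
or the page allocator.

```c
#include <stddef.h>

#define POOL_SIZE 4

struct elem { struct elem *next; int data; };

static struct elem storage[POOL_SIZE];  /* carved out at "load time" */
static struct elem *free_list;

static void pool_init(void)
{
    /* Thread every preallocated element onto the free list. */
    for (int i = 0; i < POOL_SIZE; i++) {
        storage[i].next = free_list;
        free_list = &storage[i];
    }
}

/* Safe to call from any context: only pointer manipulation, no slab,
 * no page allocator. Returns NULL when the pool is exhausted. */
static struct elem *pool_alloc(void)
{
    struct elem *e = free_list;
    if (e)
        free_list = e->next;
    return e;
}

static void pool_free(struct elem *e)
{
    e->next = free_list;
    free_list = e;
}
```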

> Do we care about adding more warnings on this kind of allocation at all?

bpf doesn't mess with mm state.
If you somehow managed to cause an mm splat with a bpf prog, talk to bpf folks first.
It's a bug somewhere in bpf, not in mm.
Alexei Starovoitov Aug. 22, 2023, 9:37 p.m. UTC | #11
On Tue, Aug 22, 2023 at 2:34 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Alexei,
>
> On Tue, Aug 22, 2023 at 11:57 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Aug 22, 2023 at 10:48 AM Chris Li <chrisl@kernel.org> wrote:
> > >
> > > Hi Mel,
> > >
> > > Adding Alexei to the discussion.
> > >
> > > On Mon, Aug 21, 2023 at 3:32 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> > > >
> > > > On Thu, Aug 17, 2023 at 11:05:22PM -0700, Chris Li wrote:
> > > > > In this patch series I want to safeguard
> > > > > the free_pcppage_bulk against change in the
> > > > > pcp->count outside of this function. e.g.
> > > > > by BPF program inject on the function tracepoint.
> > > > >
> > > > > I break up the patches into two seperate patches
> > > > > for the safeguard and clean up.
> > > > >
> > > > > Hopefully that is easier to review.
> > > > >
> > > > > Signed-off-by: Chris Li <chrisl@kernel.org>
> > > >
> > > > This sounds like a maintenance nightmare if internal state can be arbitrary
> > > > modified by a BPF program and still expected to work properly in all cases.
> > > > Every review would have to take into account "what if a BPF script modifies
> > > > state behind our back?"
> >
> > Where did this concern come from?
> > Since when BPF can modify arbitrary state?
> >
> > But I wasn't cc-ed on the original patch, so not sure what it attempts to do.
> > Maybe concern is valid.
>
> Sorry I did not CC you on the original patch submission.  I should.
>
> Here is the link for the 1/2 patch, which has the step to reproduce.
>
> https://lore.kernel.org/linux-mm/20230817-free_pcppages_bulk-v1-1-c14574a9f80c@kernel.org/
>
> It is using an older version of the BPF program. That spinlock
> allocation was fixed
> in  commit c66a36af7ba3a628.

No. It was a temp workaround. It was fixed on the bpf local storage side later.
Chris Li Aug. 22, 2023, 9:46 p.m. UTC | #12
On Tue, Aug 22, 2023 at 2:36 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Aug 22, 2023 at 2:29 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Tue, Aug 22, 2023 at 2:19 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > The execution sequence is like this:
> > > >
> > > >        count = min(pcp->count, count);
> > > >
> > > >         /* Ensure requested pindex is drained first. */
> > > >         pindex = pindex - 1;
> > > >         bpf_injected_spin_lock_irqsave {
> > > >                  alloc_page();
> > > >                  original spin_lock_irqsave(&zone->lock, flags) ;
> > > >         }
> > >
> > > bpf doesn't call into alloc_page() or slab alloc or pcpu alloc from
> > > tracing progs.
> > > All memory is preallocated.

That is good to know. Thanks.

> >
> > Here is the other patch submission thread which have more detail of
> > how to reproduce it:
> > https://lore.kernel.org/linux-mm/20230817-free_pcppages_bulk-v1-1-c14574a9f80c@kernel.org/
> >
> > It is on older version of the kernel.
>
> Please demonstrate the issue on the latest kernel.
> It's an unnecessary time sink for everyone to review patches
> targeting an issue in the old kernel.

Thanks, that is the answer I was looking for. That is why I tagged it
as RFC.

>
> > > Can you reproduce the issue on the latest upstream kernel?
> >
> > Hope, the fix on the BPF side went in as commit c66a36af7ba3a628.
> > I am not aware of other cases.
>
> That was a temporary workaround on perf side.
> bpf task local storage was properly fixed later.

Ack.

> > It seems the consensus is so far is that we don't support BPF doing
> > nested allocation on spin locks.
> > That will implite any function called under the spinlocks as well.
>
> We're still talking past each other. bpf uses preallocated memory.
> It might look like bpf prog is allocating, but it's actually
> not calling into slab.

Ack.


> > Do we care about adding more warnings on this kind of allocation at all?
>
> bpf doesn't mess with mm state.
> If you somehow managed to cause mm splat with bpf prog talk to bpf folks first.
> It's a bug somewhere in bpf. Not with mm.

Noted. It started as an MM clean up patch. I should have included you earlier.

I will split out part 2 of the patch as a clean up without touching
pcp->count then.

Chris