
[0/6] rds: rdma: Add ability to force GFP_NOIO

Message ID 20240513125346.764076-1-haakon.bugge@oracle.com

Message

Haakon Bugge May 13, 2024, 12:53 p.m. UTC
This series enables RDS and the RDMA stack to be used as a block I/O
device. This is to support a filesystem on top of a raw block device
which uses RDS and the RDMA stack as the network transport layer.

Under intense memory pressure, we get memory reclaims. Assume the
filesystem reclaims memory, goes to the raw block device, which calls
into RDS, which calls the RDMA stack. Now, if regular GFP_KERNEL
allocations in RDS or the RDMA stack require reclaims to be fulfilled,
we end up in a circular dependency.

We break this circular dependency by:

1. Force all allocations in RDS and the relevant RDMA stack to use
   GFP_NOIO, by means of a parenthetic use of
   memalloc_noio_{save,restore} on all relevant entry points (a sketch
   of this follows below).

2. Make sure work-queues inherit current->flags
   wrt. PF_MEMALLOC_{NOIO,NOFS}, such that work executed on the
   work-queue inherits the same flag(s).
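
As an illustration of step 1, a minimal sketch of such a bracketed entry
point could look like the following (the function name is hypothetical
and not taken from the actual patches):

  #include <linux/sched/mm.h>

  /* Hypothetical RDS entry point reachable from the block I/O path. */
  static int rds_example_entry_point(void)
  {
          unsigned int noio_flags;
          int ret;

          /* Sets PF_MEMALLOC_NOIO: GFP_KERNEL allocations below behave as GFP_NOIO. */
          noio_flags = memalloc_noio_save();

          /*
           * Allocations made from here on, directly or in callees, cannot
           * start new I/O for reclaim, so they cannot recurse back into the
           * block device -> RDS -> RDMA path under memory pressure.
           */
          ret = 0;        /* the real work, elided in this sketch */

          memalloc_noio_restore(noio_flags);
          return ret;
  }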

Håkon Bugge (6):
  workqueue: Inherit NOIO and NOFS alloc flags
  rds: Brute force GFP_NOIO
  RDMA/cma: Brute force GFP_NOIO
  RDMA/cm: Brute force GFP_NOIO
  RDMA/mlx5: Brute force GFP_NOIO
  net/mlx5: Brute force GFP_NOIO

 drivers/infiniband/core/cm.c                  | 15 ++++-
 drivers/infiniband/core/cma.c                 | 20 ++++++-
 drivers/infiniband/hw/mlx5/main.c             | 22 +++++--
 .../net/ethernet/mellanox/mlx5/core/main.c    | 14 ++++-
 include/linux/workqueue.h                     |  2 +
 kernel/workqueue.c                            | 17 ++++++
 net/rds/af_rds.c                              | 60 ++++++++++++++++++-
 7 files changed, 138 insertions(+), 12 deletions(-)

--
2.39.3

Comments

Jason Gunthorpe May 13, 2024, 11:03 p.m. UTC | #1
On Mon, May 13, 2024 at 02:53:40PM +0200, Håkon Bugge wrote:
> This series enables RDS and the RDMA stack to be used as a block I/O
> device. This to support a filesystem on top of a raw block device
> which uses RDS and the RDMA stack as the network transport layer.
> 
> Under intense memory pressure, we get memory reclaims. Assume the
> filesystem reclaims memory, goes to the raw block device, which calls
> into RDS, which calls the RDMA stack. Now, if regular GFP_KERNEL
> allocations in RDS or the RDMA stack require reclaims to be fulfilled,
> we end up in a circular dependency.
> 
> We break this circular dependency by:
> 
> 1. Force all allocations in RDS and the relevant RDMA stack to use
>    GFP_NOIO, by means of a parenthetic use of
>    memalloc_noio_{save,restore} on all relevant entry points.

I didn't see an obvious explanation why each of these changes was
necessary. I expected this:
 
> 2. Make sure work-queues inherits current->flags
>    wrt. PF_MEMALLOC_{NOIO,NOFS}, such that work executed on the
>    work-queue inherits the same flag(s).

To broadly capture everything, and I understood this was the general plan
from the MM side instead of direct annotation?

So, can you explain in each case why it needs an explicit change?

And further, is there any validation of this? There is some lockdep
tracking of reclaim, I feel like it should be more robustly hooked up
in RDMA if we expect this to really work..

Jason
Zhu Yanjun May 14, 2024, 8:53 a.m. UTC | #2
On 13.05.24 14:53, Håkon Bugge wrote:
> This series enables RDS and the RDMA stack to be used as a block I/O
> device. This to support a filesystem on top of a raw block device

This is to support a filesystem ... ?

> which uses RDS and the RDMA stack as the network transport layer.
> 
> Under intense memory pressure, we get memory reclaims. Assume the
> filesystem reclaims memory, goes to the raw block device, which calls
> into RDS, which calls the RDMA stack. Now, if regular GFP_KERNEL
> allocations in RDS or the RDMA stack require reclaims to be fulfilled,
> we end up in a circular dependency.
> 
> We break this circular dependency by:
> 
> 1. Force all allocations in RDS and the relevant RDMA stack to use
>     GFP_NOIO, by means of a parenthetic use of
>     memalloc_noio_{save,restore} on all relevant entry points.
> 
> 2. Make sure work-queues inherits current->flags
>     wrt. PF_MEMALLOC_{NOIO,NOFS}, such that work executed on the
>     work-queue inherits the same flag(s).
> 
> Håkon Bugge (6):
>    workqueue: Inherit NOIO and NOFS alloc flags
>    rds: Brute force GFP_NOIO
>    RDMA/cma: Brute force GFP_NOIO
>    RDMA/cm: Brute force GFP_NOIO
>    RDMA/mlx5: Brute force GFP_NOIO
>    net/mlx5: Brute force GFP_NOIO
> 
>   drivers/infiniband/core/cm.c                  | 15 ++++-
>   drivers/infiniband/core/cma.c                 | 20 ++++++-
>   drivers/infiniband/hw/mlx5/main.c             | 22 +++++--
>   .../net/ethernet/mellanox/mlx5/core/main.c    | 14 ++++-
>   include/linux/workqueue.h                     |  2 +
>   kernel/workqueue.c                            | 17 ++++++
>   net/rds/af_rds.c                              | 60 ++++++++++++++++++-
>   7 files changed, 138 insertions(+), 12 deletions(-)
> 
> --
> 2.39.3
>
Zhu Yanjun May 14, 2024, 12:02 p.m. UTC | #3
On 14.05.24 10:53, Zhu Yanjun wrote:
> On 13.05.24 14:53, Håkon Bugge wrote:
>> This series enables RDS and the RDMA stack to be used as a block I/O
>> device. This to support a filesystem on top of a raw block device
> 
> This is to support a filesystem ... ?

Sorry, my bad. I mean, normally RDS is used as a communication 
protocol between Oracle databases. Now, in this patch series, it seems 
that RDS acts as a communication protocol to support a filesystem. So I 
am curious which filesystem RDS is supporting?

Thanks a lot.
Zhu Yanjun

> 
>> which uses RDS and the RDMA stack as the network transport layer.
>>
>> Under intense memory pressure, we get memory reclaims. Assume the
>> filesystem reclaims memory, goes to the raw block device, which calls
>> into RDS, which calls the RDMA stack. Now, if regular GFP_KERNEL
>> allocations in RDS or the RDMA stack require reclaims to be fulfilled,
>> we end up in a circular dependency.
>>
>> We break this circular dependency by:
>>
>> 1. Force all allocations in RDS and the relevant RDMA stack to use
>>     GFP_NOIO, by means of a parenthetic use of
>>     memalloc_noio_{save,restore} on all relevant entry points.
>>
>> 2. Make sure work-queues inherits current->flags
>>     wrt. PF_MEMALLOC_{NOIO,NOFS}, such that work executed on the
>>     work-queue inherits the same flag(s).
>>
>> Håkon Bugge (6):
>>    workqueue: Inherit NOIO and NOFS alloc flags
>>    rds: Brute force GFP_NOIO
>>    RDMA/cma: Brute force GFP_NOIO
>>    RDMA/cm: Brute force GFP_NOIO
>>    RDMA/mlx5: Brute force GFP_NOIO
>>    net/mlx5: Brute force GFP_NOIO
>>
>>   drivers/infiniband/core/cm.c                  | 15 ++++-
>>   drivers/infiniband/core/cma.c                 | 20 ++++++-
>>   drivers/infiniband/hw/mlx5/main.c             | 22 +++++--
>>   .../net/ethernet/mellanox/mlx5/core/main.c    | 14 ++++-
>>   include/linux/workqueue.h                     |  2 +
>>   kernel/workqueue.c                            | 17 ++++++
>>   net/rds/af_rds.c                              | 60 ++++++++++++++++++-
>>   7 files changed, 138 insertions(+), 12 deletions(-)
>>
>> -- 
>> 2.39.3
>>
>
Haakon Bugge May 14, 2024, 6:19 p.m. UTC | #4
Hi Jason,


> On 14 May 2024, at 01:03, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Mon, May 13, 2024 at 02:53:40PM +0200, Håkon Bugge wrote:
>> This series enables RDS and the RDMA stack to be used as a block I/O
>> device. This to support a filesystem on top of a raw block device
>> which uses RDS and the RDMA stack as the network transport layer.
>> 
>> Under intense memory pressure, we get memory reclaims. Assume the
>> filesystem reclaims memory, goes to the raw block device, which calls
>> into RDS, which calls the RDMA stack. Now, if regular GFP_KERNEL
>> allocations in RDS or the RDMA stack require reclaims to be fulfilled,
>> we end up in a circular dependency.
>> 
>> We break this circular dependency by:
>> 
>> 1. Force all allocations in RDS and the relevant RDMA stack to use
>>   GFP_NOIO, by means of a parenthetic use of
>>   memalloc_noio_{save,restore} on all relevant entry points.
> 
> I didn't see an obvious explanation why each of these changes was
> necessary. I expected this:
> 
>> 2. Make sure work-queues inherits current->flags
>>   wrt. PF_MEMALLOC_{NOIO,NOFS}, such that work executed on the
>>   work-queue inherits the same flag(s).

When the modules initialize, it does not help to have 2. unless PF_MEMALLOC_NOIO is already set in current->flags. That is most probably not the case, e.g. considering modprobe. That is why we have these steps in all five modules. During module initialization, work queues are allocated in all of the mentioned modules. Therefore, the module initialization functions need the parenthetic use of memalloc_noio_{save,restore}.
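
To illustrate, a simplified sketch of what this looks like in a module init function (the module and workqueue names are made up; this is not the actual patch code):

  #include <linux/module.h>
  #include <linux/sched/mm.h>
  #include <linux/workqueue.h>

  static struct workqueue_struct *foo_wq;      /* hypothetical */

  static int __init foo_init(void)
  {
          unsigned int noio_flags;
          int ret = 0;

          /* modprobe does not run with PF_MEMALLOC_NOIO set, so set it here. */
          noio_flags = memalloc_noio_save();

          /*
           * The work queue is created while PF_MEMALLOC_NOIO is set; the idea
           * is that, with the workqueue patch in this series, work executed
           * on it later runs with the same flag.
           */
          foo_wq = alloc_workqueue("foo_wq", WQ_MEM_RECLAIM, 0);
          if (!foo_wq)
                  ret = -ENOMEM;

          memalloc_noio_restore(noio_flags);
          return ret;
  }
  module_init(foo_init);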

> To broadly capture everything and understood this was the general plan
> from the MM side instead of direct annotation?
> 
> So, can you explain in each case why it needs an explicit change?

I hope my comment above explains this.

> And further, is there any validation of this? There is some lockdep
> tracking of reclaim, I feel like it should be more robustly hooked up
> in RDMA if we expect this to really work..

Oracle is about to launch a product using this series, so the techniques used have been thoroughly validated, although on an older kernel version.


Thxs, Håkon
Haakon Bugge May 14, 2024, 6:32 p.m. UTC | #5
Hi Yanjun,


> On 14 May 2024, at 14:02, Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
> 
> 
> 
> On 14.05.24 10:53, Zhu Yanjun wrote:
>> On 13.05.24 14:53, Håkon Bugge wrote:
>>> This series enables RDS and the RDMA stack to be used as a block I/O
>>> device. This to support a filesystem on top of a raw block device
>> This is to support a filesystem ... ?
> 
> Sorry. my bad. I mean, normally rds is used to act as a communication protocol between Oracle databases. Now in this patch series, it seems that rds acts as a communication protocol to support a filesystem. So I am curious which filesystem that rds is supporting?

The peer here is a file-server which acts as a block device, what Oracle calls a cell-server. The initiator here is actually using XFS over an Oracle in-kernel pseudo-volume block device.


Thxs,  Håkon
Zhu Yanjun May 15, 2024, 10:25 a.m. UTC | #6
On 14.05.24 20:32, Haakon Bugge wrote:
> Hi Yanjun,
> 
> 
>> On 14 May 2024, at 14:02, Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
>>
>>
>>
>> On 14.05.24 10:53, Zhu Yanjun wrote:
>>> On 13.05.24 14:53, Håkon Bugge wrote:
>>>> This series enables RDS and the RDMA stack to be used as a block I/O
>>>> device. This to support a filesystem on top of a raw block device
>>> This is to support a filesystem ... ?
>>
>> Sorry. my bad. I mean, normally rds is used to act as a communication protocol between Oracle databases. Now in this patch series, it seems that rds acts as a communication protocol to support a filesystem. So I am curious which filesystem that rds is supporting?
> 
> The peer here is a file-server which acts a block device. What Oracle calls a cell-server. The initiator here, is actually using XFS over an Oracle in-kernel pseudo-volume block device.

Thanks Haakon.
There is a link about GFP_NOFS and GFP_NOIO, 
https://lore.kernel.org/linux-fsdevel/ZZcgXI46AinlcBDP@casper.infradead.org/.

I am not sure whether you have read this link. In it, the author lays 
out his ideas about GFP_NOFS and GFP_NOIO.

"
My interest in this is that I'd like to get rid of the FGP_NOFS flag. 
It'd also be good to get rid of the __GFP_FS flag since there's always 
demand for more GFP flags.  I have a git branch with some work in this 
area, so there's a certain amount of conference-driven development going 
on here too.

We could mutatis mutandi for GFP_NOIO, memalloc_noio_save/restore, 
__GFP_IO, etc, so maybe the block people are also interested.  I haven't 
looked into that in any detail though.  I guess we'll see what interest 
this topic gains.
"

Anyway, good luck!

Zhu Yanjun

> 
> 
> Thxs,  Håkon
>
Jason Gunthorpe May 17, 2024, 5:30 p.m. UTC | #7
On Tue, May 14, 2024 at 06:19:53PM +0000, Haakon Bugge wrote:
> Hi Jason,
> 
> 
> > On 14 May 2024, at 01:03, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > 
> > On Mon, May 13, 2024 at 02:53:40PM +0200, Håkon Bugge wrote:
> >> This series enables RDS and the RDMA stack to be used as a block I/O
> >> device. This to support a filesystem on top of a raw block device
> >> which uses RDS and the RDMA stack as the network transport layer.
> >> 
> >> Under intense memory pressure, we get memory reclaims. Assume the
> >> filesystem reclaims memory, goes to the raw block device, which calls
> >> into RDS, which calls the RDMA stack. Now, if regular GFP_KERNEL
> >> allocations in RDS or the RDMA stack require reclaims to be fulfilled,
> >> we end up in a circular dependency.
> >> 
> >> We break this circular dependency by:
> >> 
> >> 1. Force all allocations in RDS and the relevant RDMA stack to use
> >>   GFP_NOIO, by means of a parenthetic use of
> >>   memalloc_noio_{save,restore} on all relevant entry points.
> > 
> > I didn't see an obvious explanation why each of these changes was
> > necessary. I expected this:
> > 
> >> 2. Make sure work-queues inherits current->flags
> >>   wrt. PF_MEMALLOC_{NOIO,NOFS}, such that work executed on the
> >>   work-queue inherits the same flag(s).
> 
> When the modules initialize, it does not help to have 2., unless
> PF_MEMALLOC_NOIO is set in current->flags. That is most probably not
> set, e.g. considering modprobe. That is why we have these steps in
> all the five modules. During module initialization, work queues are
> allocated in all mentioned modules. Therefore, the module
> initialization functions need the paranthetic use of
> memalloc_noio_{save,restore}.

And why would I need these work queues to have NOIO? They are never
called under a filesystem.

You need to explain in every single case how something in a NOIO
context becomes entangled with the unrelated thing you are tagging NOIO.

Historically, when we've tried to do this, we gave up because the entire
subsystem ended up being NOIO.

> > And further, is there any validation of this? There is some lockdep
> > tracking of reclaim, I feel like it should be more robustly hooked up
> > in RDMA if we expect this to really work..
> 
> Oracle is about to launch a product using this series, so the
> techniques used have been thoroughly validated, allthough on an
> older kernel version.

That doesn't really help keep it working. I want to see some kind of
lockdep scheme to enforce this that can validate without ever
triggering reclaim.
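
For example (purely a sketch, not existing infrastructure), even a trivial
assertion at the entry points expected to be reached from the block I/O
path would be a start:

  #include <linux/bug.h>
  #include <linux/sched.h>

  /*
   * Hypothetical helper, no such function exists today: assert that this
   * path really runs with PF_MEMALLOC_NOIO set, so that implicit GFP_NOIO
   * is in effect.
   */
  static inline void assert_noio_context(void)
  {
          WARN_ON_ONCE(!(current->flags & PF_MEMALLOC_NOIO));
  }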

Jason