mbox series

[RFC,0/3] reading proc/pid/maps under RCU

Message ID 20240115183837.205694-1-surenb@google.com (mailing list archive)
Headers show
Series reading proc/pid/maps under RCU | expand

Message

Suren Baghdasaryan Jan. 15, 2024, 6:38 p.m. UTC
The issue this patchset is trying to address is mmap_lock contention when
a low priority task (monitoring, data collecting, etc.) blocks a higher
priority task from making updated to the address space. The contention is
due to the mmap_lock being held for read when reading proc/pid/maps.
With maple_tree introduction, VMA tree traversals are RCU-safe and per-vma
locks make VMA access RCU-safe. this provides an opportunity for lock-less
reading of proc/pid/maps. We still need to overcome a couple obstacles:
1. Make all VMA pointer fields used for proc/pid/maps content generation
RCU-safe;
2. Ensure that proc/pid/maps data tearing, which is currently possible at
page boundaries only, does not get worse.

The patchset deals with these issues but there is a downside which I would
like to get input on:
This change introduces unfairness towards the reader of proc/pid/maps,
which can be blocked by an overly active/malicious address space modifyer.
A couple of ways I though we can address this issue are:
1. After several lock-less retries (or some time limit) to fall back to
taking mmap_lock.
2. Employ lock-less reading only if the reader has low priority,
indicating that blocking it is not critical.
3. Introducing a separate procfs file which publishes the same data in
lock-less manner.

I imagine a combination of these approaches can also be employed.
I would like to get feedback on this from the Linux community.

Note: mmap_read_lock/mmap_read_unlock sequence inside validate_map()
can be replaced with more efficiend rwsem_wait() proposed by Matthew
in [1].

[1] https://lore.kernel.org/all/ZZ1+ZicgN8dZ3zj3@casper.infradead.org/

Suren Baghdasaryan (3):
  mm: make vm_area_struct anon_name field RCU-safe
  seq_file: add validate() operation to seq_operations
  mm/maps: read proc/pid/maps under RCU

 fs/proc/internal.h        |   3 +
 fs/proc/task_mmu.c        | 130 ++++++++++++++++++++++++++++++++++----
 fs/seq_file.c             |  24 ++++++-
 include/linux/mm_inline.h |  10 ++-
 include/linux/mm_types.h  |   3 +-
 include/linux/seq_file.h  |   1 +
 mm/madvise.c              |  30 +++++++--
 7 files changed, 181 insertions(+), 20 deletions(-)

Comments

Vlastimil Babka Jan. 16, 2024, 2:42 p.m. UTC | #1
On 1/15/24 19:38, Suren Baghdasaryan wrote:

Hi,

> The issue this patchset is trying to address is mmap_lock contention when
> a low priority task (monitoring, data collecting, etc.) blocks a higher
> priority task from making updated to the address space. The contention is
> due to the mmap_lock being held for read when reading proc/pid/maps.
> With maple_tree introduction, VMA tree traversals are RCU-safe and per-vma
> locks make VMA access RCU-safe. this provides an opportunity for lock-less
> reading of proc/pid/maps. We still need to overcome a couple obstacles:
> 1. Make all VMA pointer fields used for proc/pid/maps content generation
> RCU-safe;
> 2. Ensure that proc/pid/maps data tearing, which is currently possible at
> page boundaries only, does not get worse.

Hm I thought we were to only choose this more complicated in case additional
tearing becomes a problem, and at first assume that if software can deal
with page boundary tearing, it can deal with sub-page tearing too?

> The patchset deals with these issues but there is a downside which I would
> like to get input on:
> This change introduces unfairness towards the reader of proc/pid/maps,
> which can be blocked by an overly active/malicious address space modifyer.

So this is a consequence of the validate() operation, right? We could avoid
this if we allowed sub-page tearing.

> A couple of ways I though we can address this issue are:
> 1. After several lock-less retries (or some time limit) to fall back to
> taking mmap_lock.
> 2. Employ lock-less reading only if the reader has low priority,
> indicating that blocking it is not critical.
> 3. Introducing a separate procfs file which publishes the same data in
> lock-less manner.
> 
> I imagine a combination of these approaches can also be employed.
> I would like to get feedback on this from the Linux community.
> 
> Note: mmap_read_lock/mmap_read_unlock sequence inside validate_map()
> can be replaced with more efficiend rwsem_wait() proposed by Matthew
> in [1].
> 
> [1] https://lore.kernel.org/all/ZZ1+ZicgN8dZ3zj3@casper.infradead.org/
> 
> Suren Baghdasaryan (3):
>   mm: make vm_area_struct anon_name field RCU-safe
>   seq_file: add validate() operation to seq_operations
>   mm/maps: read proc/pid/maps under RCU
> 
>  fs/proc/internal.h        |   3 +
>  fs/proc/task_mmu.c        | 130 ++++++++++++++++++++++++++++++++++----
>  fs/seq_file.c             |  24 ++++++-
>  include/linux/mm_inline.h |  10 ++-
>  include/linux/mm_types.h  |   3 +-
>  include/linux/seq_file.h  |   1 +
>  mm/madvise.c              |  30 +++++++--
>  7 files changed, 181 insertions(+), 20 deletions(-)
>
Vlastimil Babka Jan. 16, 2024, 2:46 p.m. UTC | #2
On 1/16/24 15:42, Vlastimil Babka wrote:
> On 1/15/24 19:38, Suren Baghdasaryan wrote:
> 
> Hi,
> 
>> The issue this patchset is trying to address is mmap_lock contention when
>> a low priority task (monitoring, data collecting, etc.) blocks a higher
>> priority task from making updated to the address space. The contention is
>> due to the mmap_lock being held for read when reading proc/pid/maps.
>> With maple_tree introduction, VMA tree traversals are RCU-safe and per-vma
>> locks make VMA access RCU-safe. this provides an opportunity for lock-less
>> reading of proc/pid/maps. We still need to overcome a couple obstacles:
>> 1. Make all VMA pointer fields used for proc/pid/maps content generation
>> RCU-safe;
>> 2. Ensure that proc/pid/maps data tearing, which is currently possible at
>> page boundaries only, does not get worse.
> 
> Hm I thought we were to only choose this more complicated in case additional
> tearing becomes a problem, and at first assume that if software can deal
> with page boundary tearing, it can deal with sub-page tearing too?
> 
>> The patchset deals with these issues but there is a downside which I would
>> like to get input on:
>> This change introduces unfairness towards the reader of proc/pid/maps,
>> which can be blocked by an overly active/malicious address space modifyer.
> 
> So this is a consequence of the validate() operation, right? We could avoid
> this if we allowed sub-page tearing.
> 
>> A couple of ways I though we can address this issue are:
>> 1. After several lock-less retries (or some time limit) to fall back to
>> taking mmap_lock.
>> 2. Employ lock-less reading only if the reader has low priority,
>> indicating that blocking it is not critical.
>> 3. Introducing a separate procfs file which publishes the same data in
>> lock-less manner.

Oh and if this option 3 becomes necessary, then such new file shouldn't
validate() either, and whoever wants to avoid the reader contention and
converts their monitoring to the new file will have to account for this
possible extra tearing from the start. So I would suggest trying to change
the existing file with no validate() first, and if existing userspace gets
broken, employ option 3. This would mean no validate() in either case?

>> I imagine a combination of these approaches can also be employed.
>> I would like to get feedback on this from the Linux community.
>> 
>> Note: mmap_read_lock/mmap_read_unlock sequence inside validate_map()
>> can be replaced with more efficiend rwsem_wait() proposed by Matthew
>> in [1].
>> 
>> [1] https://lore.kernel.org/all/ZZ1+ZicgN8dZ3zj3@casper.infradead.org/
>> 
>> Suren Baghdasaryan (3):
>>   mm: make vm_area_struct anon_name field RCU-safe
>>   seq_file: add validate() operation to seq_operations
>>   mm/maps: read proc/pid/maps under RCU
>> 
>>  fs/proc/internal.h        |   3 +
>>  fs/proc/task_mmu.c        | 130 ++++++++++++++++++++++++++++++++++----
>>  fs/seq_file.c             |  24 ++++++-
>>  include/linux/mm_inline.h |  10 ++-
>>  include/linux/mm_types.h  |   3 +-
>>  include/linux/seq_file.h  |   1 +
>>  mm/madvise.c              |  30 +++++++--
>>  7 files changed, 181 insertions(+), 20 deletions(-)
>> 
>
Suren Baghdasaryan Jan. 16, 2024, 5:57 p.m. UTC | #3
On Tue, Jan 16, 2024 at 6:46 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/16/24 15:42, Vlastimil Babka wrote:
> > On 1/15/24 19:38, Suren Baghdasaryan wrote:
> >
> > Hi,
> >
> >> The issue this patchset is trying to address is mmap_lock contention when
> >> a low priority task (monitoring, data collecting, etc.) blocks a higher
> >> priority task from making updated to the address space. The contention is
> >> due to the mmap_lock being held for read when reading proc/pid/maps.
> >> With maple_tree introduction, VMA tree traversals are RCU-safe and per-vma
> >> locks make VMA access RCU-safe. this provides an opportunity for lock-less
> >> reading of proc/pid/maps. We still need to overcome a couple obstacles:
> >> 1. Make all VMA pointer fields used for proc/pid/maps content generation
> >> RCU-safe;
> >> 2. Ensure that proc/pid/maps data tearing, which is currently possible at
> >> page boundaries only, does not get worse.
> >
> > Hm I thought we were to only choose this more complicated in case additional
> > tearing becomes a problem, and at first assume that if software can deal
> > with page boundary tearing, it can deal with sub-page tearing too?

Hi Vlastimil,
Thanks for the feedback!
Yes, originally I thought we wouldn't be able to avoid additional
tearing without a big change but then realized it's not that hard, so
I tried to keep the change in behavior transparent to the userspace.

> >
> >> The patchset deals with these issues but there is a downside which I would
> >> like to get input on:
> >> This change introduces unfairness towards the reader of proc/pid/maps,
> >> which can be blocked by an overly active/malicious address space modifyer.
> >
> > So this is a consequence of the validate() operation, right? We could avoid
> > this if we allowed sub-page tearing.

Yes, if we don't care about sub-page tearing then we could get rid of
validate step and this issue with updaters blocking the reader would
go away. If we choose that direction there will be one more issue to
fix, namely the maple_tree temporary inconsistent state when a VMA is
replaced with another one and we might observe NULL there. We might be
able to use Matthew's rwsem_wait() to deal with that issue.

> >
> >> A couple of ways I though we can address this issue are:
> >> 1. After several lock-less retries (or some time limit) to fall back to
> >> taking mmap_lock.
> >> 2. Employ lock-less reading only if the reader has low priority,
> >> indicating that blocking it is not critical.
> >> 3. Introducing a separate procfs file which publishes the same data in
> >> lock-less manner.
>
> Oh and if this option 3 becomes necessary, then such new file shouldn't
> validate() either, and whoever wants to avoid the reader contention and
> converts their monitoring to the new file will have to account for this
> possible extra tearing from the start. So I would suggest trying to change
> the existing file with no validate() first, and if existing userspace gets
> broken, employ option 3. This would mean no validate() in either case?

Yes but I was trying to avoid introducing additional file which
publishes the same content in a slightly different way. We will have
to explain when userspace should use one vs the other and that would
require going into low level implementation details, I think. Don't
know if that's acceptable/preferable.
Thanks,
Suren.

>
> >> I imagine a combination of these approaches can also be employed.
> >> I would like to get feedback on this from the Linux community.
> >>
> >> Note: mmap_read_lock/mmap_read_unlock sequence inside validate_map()
> >> can be replaced with more efficiend rwsem_wait() proposed by Matthew
> >> in [1].
> >>
> >> [1] https://lore.kernel.org/all/ZZ1+ZicgN8dZ3zj3@casper.infradead.org/
> >>
> >> Suren Baghdasaryan (3):
> >>   mm: make vm_area_struct anon_name field RCU-safe
> >>   seq_file: add validate() operation to seq_operations
> >>   mm/maps: read proc/pid/maps under RCU
> >>
> >>  fs/proc/internal.h        |   3 +
> >>  fs/proc/task_mmu.c        | 130 ++++++++++++++++++++++++++++++++++----
> >>  fs/seq_file.c             |  24 ++++++-
> >>  include/linux/mm_inline.h |  10 ++-
> >>  include/linux/mm_types.h  |   3 +-
> >>  include/linux/seq_file.h  |   1 +
> >>  mm/madvise.c              |  30 +++++++--
> >>  7 files changed, 181 insertions(+), 20 deletions(-)
> >>
> >
>
Suren Baghdasaryan Jan. 18, 2024, 5:58 p.m. UTC | #4
On Tue, Jan 16, 2024 at 9:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Jan 16, 2024 at 6:46 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 1/16/24 15:42, Vlastimil Babka wrote:
> > > On 1/15/24 19:38, Suren Baghdasaryan wrote:
> > >
> > > Hi,
> > >
> > >> The issue this patchset is trying to address is mmap_lock contention when
> > >> a low priority task (monitoring, data collecting, etc.) blocks a higher
> > >> priority task from making updated to the address space. The contention is
> > >> due to the mmap_lock being held for read when reading proc/pid/maps.
> > >> With maple_tree introduction, VMA tree traversals are RCU-safe and per-vma
> > >> locks make VMA access RCU-safe. this provides an opportunity for lock-less
> > >> reading of proc/pid/maps. We still need to overcome a couple obstacles:
> > >> 1. Make all VMA pointer fields used for proc/pid/maps content generation
> > >> RCU-safe;
> > >> 2. Ensure that proc/pid/maps data tearing, which is currently possible at
> > >> page boundaries only, does not get worse.
> > >
> > > Hm I thought we were to only choose this more complicated in case additional
> > > tearing becomes a problem, and at first assume that if software can deal
> > > with page boundary tearing, it can deal with sub-page tearing too?
>
> Hi Vlastimil,
> Thanks for the feedback!
> Yes, originally I thought we wouldn't be able to avoid additional
> tearing without a big change but then realized it's not that hard, so
> I tried to keep the change in behavior transparent to the userspace.

In the absence of other feedback I'm going to implement and post the
originally envisioned approach: remove validation step and avoid any
possibility of blocking but allowing for sub-page tearing. Will use
Matthew's rwsem_wait() to deal with possible inconsistent maple_tree
state.
Thanks,
Suren.

>
> > >
> > >> The patchset deals with these issues but there is a downside which I would
> > >> like to get input on:
> > >> This change introduces unfairness towards the reader of proc/pid/maps,
> > >> which can be blocked by an overly active/malicious address space modifyer.
> > >
> > > So this is a consequence of the validate() operation, right? We could avoid
> > > this if we allowed sub-page tearing.
>
> Yes, if we don't care about sub-page tearing then we could get rid of
> validate step and this issue with updaters blocking the reader would
> go away. If we choose that direction there will be one more issue to
> fix, namely the maple_tree temporary inconsistent state when a VMA is
> replaced with another one and we might observe NULL there. We might be
> able to use Matthew's rwsem_wait() to deal with that issue.
>
> > >
> > >> A couple of ways I though we can address this issue are:
> > >> 1. After several lock-less retries (or some time limit) to fall back to
> > >> taking mmap_lock.
> > >> 2. Employ lock-less reading only if the reader has low priority,
> > >> indicating that blocking it is not critical.
> > >> 3. Introducing a separate procfs file which publishes the same data in
> > >> lock-less manner.
> >
> > Oh and if this option 3 becomes necessary, then such new file shouldn't
> > validate() either, and whoever wants to avoid the reader contention and
> > converts their monitoring to the new file will have to account for this
> > possible extra tearing from the start. So I would suggest trying to change
> > the existing file with no validate() first, and if existing userspace gets
> > broken, employ option 3. This would mean no validate() in either case?
>
> Yes but I was trying to avoid introducing additional file which
> publishes the same content in a slightly different way. We will have
> to explain when userspace should use one vs the other and that would
> require going into low level implementation details, I think. Don't
> know if that's acceptable/preferable.
> Thanks,
> Suren.
>
> >
> > >> I imagine a combination of these approaches can also be employed.
> > >> I would like to get feedback on this from the Linux community.
> > >>
> > >> Note: mmap_read_lock/mmap_read_unlock sequence inside validate_map()
> > >> can be replaced with more efficiend rwsem_wait() proposed by Matthew
> > >> in [1].
> > >>
> > >> [1] https://lore.kernel.org/all/ZZ1+ZicgN8dZ3zj3@casper.infradead.org/
> > >>
> > >> Suren Baghdasaryan (3):
> > >>   mm: make vm_area_struct anon_name field RCU-safe
> > >>   seq_file: add validate() operation to seq_operations
> > >>   mm/maps: read proc/pid/maps under RCU
> > >>
> > >>  fs/proc/internal.h        |   3 +
> > >>  fs/proc/task_mmu.c        | 130 ++++++++++++++++++++++++++++++++++----
> > >>  fs/seq_file.c             |  24 ++++++-
> > >>  include/linux/mm_inline.h |  10 ++-
> > >>  include/linux/mm_types.h  |   3 +-
> > >>  include/linux/seq_file.h  |   1 +
> > >>  mm/madvise.c              |  30 +++++++--
> > >>  7 files changed, 181 insertions(+), 20 deletions(-)
> > >>
> > >
> >
Suren Baghdasaryan Jan. 22, 2024, 7:23 a.m. UTC | #5
On Thu, Jan 18, 2024 at 9:58 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Jan 16, 2024 at 9:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Tue, Jan 16, 2024 at 6:46 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >
> > > On 1/16/24 15:42, Vlastimil Babka wrote:
> > > > On 1/15/24 19:38, Suren Baghdasaryan wrote:
> > > >
> > > > Hi,
> > > >
> > > >> The issue this patchset is trying to address is mmap_lock contention when
> > > >> a low priority task (monitoring, data collecting, etc.) blocks a higher
> > > >> priority task from making updated to the address space. The contention is
> > > >> due to the mmap_lock being held for read when reading proc/pid/maps.
> > > >> With maple_tree introduction, VMA tree traversals are RCU-safe and per-vma
> > > >> locks make VMA access RCU-safe. this provides an opportunity for lock-less
> > > >> reading of proc/pid/maps. We still need to overcome a couple obstacles:
> > > >> 1. Make all VMA pointer fields used for proc/pid/maps content generation
> > > >> RCU-safe;
> > > >> 2. Ensure that proc/pid/maps data tearing, which is currently possible at
> > > >> page boundaries only, does not get worse.
> > > >
> > > > Hm I thought we were to only choose this more complicated in case additional
> > > > tearing becomes a problem, and at first assume that if software can deal
> > > > with page boundary tearing, it can deal with sub-page tearing too?
> >
> > Hi Vlastimil,
> > Thanks for the feedback!
> > Yes, originally I thought we wouldn't be able to avoid additional
> > tearing without a big change but then realized it's not that hard, so
> > I tried to keep the change in behavior transparent to the userspace.
>
> In the absence of other feedback I'm going to implement and post the
> originally envisioned approach: remove validation step and avoid any
> possibility of blocking but allowing for sub-page tearing. Will use
> Matthew's rwsem_wait() to deal with possible inconsistent maple_tree
> state.

I posted v1 at
https://lore.kernel.org/all/20240122071324.2099712-1-surenb@google.com/
In the RFC I used mm_struct.mm_lock_seq to detect if mm is being
changed but then I realized that won't work. mm_struct.mm_lock_seq is
incremented after mm is changed and right before mmap_lock is
write-unlocked. Instead I need a counter that changes once we
write-lock mmap_lock and before any mm changes. So the new patchset
introduces a separate counter to detect possible mm changes. In
addition, I could not use rwsem_wait() and instead had to take
mmap_lock for read to wait for the writer to finish and then record
the new counter while holding mmap_lock for read. That prevents
concurrent mm changes while we are recording the new counter value.

> Thanks,
> Suren.
>
> >
> > > >
> > > >> The patchset deals with these issues but there is a downside which I would
> > > >> like to get input on:
> > > >> This change introduces unfairness towards the reader of proc/pid/maps,
> > > >> which can be blocked by an overly active/malicious address space modifyer.
> > > >
> > > > So this is a consequence of the validate() operation, right? We could avoid
> > > > this if we allowed sub-page tearing.
> >
> > Yes, if we don't care about sub-page tearing then we could get rid of
> > validate step and this issue with updaters blocking the reader would
> > go away. If we choose that direction there will be one more issue to
> > fix, namely the maple_tree temporary inconsistent state when a VMA is
> > replaced with another one and we might observe NULL there. We might be
> > able to use Matthew's rwsem_wait() to deal with that issue.
> >
> > > >
> > > >> A couple of ways I though we can address this issue are:
> > > >> 1. After several lock-less retries (or some time limit) to fall back to
> > > >> taking mmap_lock.
> > > >> 2. Employ lock-less reading only if the reader has low priority,
> > > >> indicating that blocking it is not critical.
> > > >> 3. Introducing a separate procfs file which publishes the same data in
> > > >> lock-less manner.
> > >
> > > Oh and if this option 3 becomes necessary, then such new file shouldn't
> > > validate() either, and whoever wants to avoid the reader contention and
> > > converts their monitoring to the new file will have to account for this
> > > possible extra tearing from the start. So I would suggest trying to change
> > > the existing file with no validate() first, and if existing userspace gets
> > > broken, employ option 3. This would mean no validate() in either case?
> >
> > Yes but I was trying to avoid introducing additional file which
> > publishes the same content in a slightly different way. We will have
> > to explain when userspace should use one vs the other and that would
> > require going into low level implementation details, I think. Don't
> > know if that's acceptable/preferable.
> > Thanks,
> > Suren.
> >
> > >
> > > >> I imagine a combination of these approaches can also be employed.
> > > >> I would like to get feedback on this from the Linux community.
> > > >>
> > > >> Note: mmap_read_lock/mmap_read_unlock sequence inside validate_map()
> > > >> can be replaced with more efficiend rwsem_wait() proposed by Matthew
> > > >> in [1].
> > > >>
> > > >> [1] https://lore.kernel.org/all/ZZ1+ZicgN8dZ3zj3@casper.infradead.org/
> > > >>
> > > >> Suren Baghdasaryan (3):
> > > >>   mm: make vm_area_struct anon_name field RCU-safe
> > > >>   seq_file: add validate() operation to seq_operations
> > > >>   mm/maps: read proc/pid/maps under RCU
> > > >>
> > > >>  fs/proc/internal.h        |   3 +
> > > >>  fs/proc/task_mmu.c        | 130 ++++++++++++++++++++++++++++++++++----
> > > >>  fs/seq_file.c             |  24 ++++++-
> > > >>  include/linux/mm_inline.h |  10 ++-
> > > >>  include/linux/mm_types.h  |   3 +-
> > > >>  include/linux/seq_file.h  |   1 +
> > > >>  mm/madvise.c              |  30 +++++++--
> > > >>  7 files changed, 181 insertions(+), 20 deletions(-)
> > > >>
> > > >
> > >