| Message ID | 20240605002459.4091285-5-andrii@kernel.org (mailing list archive) |
|---|---|
| State | New |
| Series | ioctl()-based API to query VMAs from /proc/<pid>/maps |
On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <andrii@kernel.org> wrote: > > Attempt to use RCU-protected per-VMA lock when looking up requested VMA > as much as possible, only falling back to mmap_lock if per-VMA lock > failed. This is done so that querying of VMAs doesn't interfere with > other critical tasks, like page fault handling. > > This has been suggested by mm folks, and we make use of a newly added > internal API that works like find_vma(), but tries to use per-VMA lock. > > We have two sets of setup/query/teardown helper functions with different > implementations depending on availability of per-VMA lock (conditioned > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties. > > When per-VMA lock is available, lookup is done under RCU, attempting to > take a per-VMA lock. If that fails, we fallback to mmap_lock, but then > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock > immediately. In this configuration mmap_lock is never helf for long, > minimizing disruptions while querying. > > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs > using find_vma() API, and then unlock mmap_lock at the very end once as > well. In this setup we avoid locking/unlocking mmap_lock on every looked > up VMA (depending on query parameters we might need to iterate a few of > them). > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > --- > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 46 insertions(+) > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index 614fbe5d0667..140032ffc551 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file) > PROCMAP_QUERY_VMA_FLAGS \ > ) > > +#ifdef CONFIG_PER_VMA_LOCK > +static int query_vma_setup(struct mm_struct *mm) > +{ > + /* in the presence of per-VMA lock we don't need any setup/teardown */ > + return 0; > +} > + > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma) > +{ > + /* in the presence of per-VMA lock we need to unlock vma, if present */ > + if (vma) > + vma_end_read(vma); > +} > + > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr) > +{ > + struct vm_area_struct *vma; > + > + /* try to use less disruptive per-VMA lock */ > + vma = find_and_lock_vma_rcu(mm, addr); > + if (IS_ERR(vma)) { > + /* failed to take per-VMA lock, fallback to mmap_lock */ > + if (mmap_read_lock_killable(mm)) > + return ERR_PTR(-EINTR); > + > + vma = find_vma(mm, addr); > + if (vma) { > + /* > + * We cannot use vma_start_read() as it may fail due to > + * false locked (see comment in vma_start_read()). We > + * can avoid that by directly locking vm_lock under > + * mmap_lock, which guarantees that nobody can lock the > + * vma for write (vma_start_write()) under us. > + */ > + down_read(&vma->vm_lock->lock); Hi Andrii, The above pattern of locking VMA under mmap_lock and then dropping mmap_lock is becoming more common. Matthew had an RFC proposal for an API to do this here: https://lore.kernel.org/all/ZivhG0yrbpFqORDw@casper.infradead.org/. It might be worth reviving that discussion. > + } > + > + mmap_read_unlock(mm); Later on in your code you are calling get_vma_name() which might call anon_vma_name() to retrieve user-defined VMA name. 
After this patch this operation will be done without holding mmap_lock, however per https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582 this function has to be called with mmap_lock held for read. Indeed with debug flags enabled you should hit this assertion: https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96. > + } > + > + return vma; > +} > +#else > static int query_vma_setup(struct mm_struct *mm) > { > return mmap_read_lock_killable(mm); > @@ -402,6 +445,7 @@ static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsig > { > return find_vma(mm, addr); > } > +#endif > > static struct vm_area_struct *query_matching_vma(struct mm_struct *mm, > unsigned long addr, u32 flags) > @@ -441,8 +485,10 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm, > skip_vma: > /* > * If the user needs closest matching VMA, keep iterating. > + * But before we proceed we might need to unlock current VMA. > */ > addr = vma->vm_end; > + vma_end_read(vma); /* no-op under !CONFIG_PER_VMA_LOCK */ > if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA) > goto next_vma; > no_vma: > -- > 2.43.0 >
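For reference, the assertion Suren points to sits at the very top of anon_vma_name() in mm/madvise.c; roughly (paraphrased from the linked source, not a verbatim copy):

    struct anon_vma_name *anon_vma_name(struct vm_area_struct *vma)
    {
            /* caller must hold mmap_lock for read */
            mmap_assert_locked(vma->vm_mm);

            return vma->anon_name;
    }

So any path that reaches it while holding only the per-VMA lock will trip this assert on debug builds, which is the concern raised above.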
On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > Attempt to use RCU-protected per-VMA lock when looking up requested VMA > > as much as possible, only falling back to mmap_lock if per-VMA lock > > failed. This is done so that querying of VMAs doesn't interfere with > > other critical tasks, like page fault handling. > > > > This has been suggested by mm folks, and we make use of a newly added > > internal API that works like find_vma(), but tries to use per-VMA lock. > > > > We have two sets of setup/query/teardown helper functions with different > > implementations depending on availability of per-VMA lock (conditioned > > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties. > > > > When per-VMA lock is available, lookup is done under RCU, attempting to > > take a per-VMA lock. If that fails, we fallback to mmap_lock, but then > > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock > > immediately. In this configuration mmap_lock is never helf for long, > > minimizing disruptions while querying. > > > > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs > > using find_vma() API, and then unlock mmap_lock at the very end once as > > well. In this setup we avoid locking/unlocking mmap_lock on every looked > > up VMA (depending on query parameters we might need to iterate a few of > > them). > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > --- > > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 46 insertions(+) > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > index 614fbe5d0667..140032ffc551 100644 > > --- a/fs/proc/task_mmu.c > > +++ b/fs/proc/task_mmu.c > > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file) > > PROCMAP_QUERY_VMA_FLAGS \ > > ) > > > > +#ifdef CONFIG_PER_VMA_LOCK > > +static int query_vma_setup(struct mm_struct *mm) > > +{ > > + /* in the presence of per-VMA lock we don't need any setup/teardown */ > > + return 0; > > +} > > + > > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma) > > +{ > > + /* in the presence of per-VMA lock we need to unlock vma, if present */ > > + if (vma) > > + vma_end_read(vma); > > +} > > + > > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr) > > +{ > > + struct vm_area_struct *vma; > > + > > + /* try to use less disruptive per-VMA lock */ > > + vma = find_and_lock_vma_rcu(mm, addr); > > + if (IS_ERR(vma)) { > > + /* failed to take per-VMA lock, fallback to mmap_lock */ > > + if (mmap_read_lock_killable(mm)) > > + return ERR_PTR(-EINTR); > > + > > + vma = find_vma(mm, addr); > > + if (vma) { > > + /* > > + * We cannot use vma_start_read() as it may fail due to > > + * false locked (see comment in vma_start_read()). We > > + * can avoid that by directly locking vm_lock under > > + * mmap_lock, which guarantees that nobody can lock the > > + * vma for write (vma_start_write()) under us. > > + */ > > + down_read(&vma->vm_lock->lock); > > Hi Andrii, > The above pattern of locking VMA under mmap_lock and then dropping > mmap_lock is becoming more common. Matthew had an RFC proposal for an > API to do this here: > https://lore.kernel.org/all/ZivhG0yrbpFqORDw@casper.infradead.org/. It > might be worth reviving that discussion. Sure, it would be nice to have generic and blessed primitives to use here. 
But the good news is that once this is all figured out by you mm folks, it should be easy to make use of those primitives here, right? > > > + } > > + > > + mmap_read_unlock(mm); > > Later on in your code you are calling get_vma_name() which might call > anon_vma_name() to retrieve user-defined VMA name. After this patch > this operation will be done without holding mmap_lock, however per > https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582 > this function has to be called with mmap_lock held for read. Indeed > with debug flags enabled you should hit this assertion: > https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96. Sigh... Ok, what's the suggestion then? Should it be some variant of mmap_assert_locked() || vma_assert_locked() logic, or it's not so simple? Maybe I should just drop the CONFIG_PER_VMA_LOCK changes for now until all these gotchas are figured out for /proc/<pid>/maps anyway, and then we can adapt both text-based and ioctl-based /proc/<pid>/maps APIs on top of whatever the final approach will end up being the right one? Liam, any objections to this? The whole point of this patch set is to add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My implementation is structured in a way that should be easily amenable to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle things that need to be figured for existing text-based /proc/<pid>/maps anyways, I think it would be best to use mmap_lock for now for this new API, and then adopt the same final CONFIG_PER_VMA_LOCK-aware solution. > > > + } > > + > > + return vma; > > +} > > +#else > > static int query_vma_setup(struct mm_struct *mm) > > { > > return mmap_read_lock_killable(mm); > > @@ -402,6 +445,7 @@ static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsig > > { > > return find_vma(mm, addr); > > } > > +#endif > > > > static struct vm_area_struct *query_matching_vma(struct mm_struct *mm, > > unsigned long addr, u32 flags) > > @@ -441,8 +485,10 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm, > > skip_vma: > > /* > > * If the user needs closest matching VMA, keep iterating. > > + * But before we proceed we might need to unlock current VMA. > > */ > > addr = vma->vm_end; > > + vma_end_read(vma); /* no-op under !CONFIG_PER_VMA_LOCK */ > > if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA) > > goto next_vma; > > no_vma: > > -- > > 2.43.0 > >
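To make the "mmap_assert_locked() || vma_assert_locked()" question above concrete, one possible shape of such a relaxed assertion is sketched below. vma_assert_stabilized() is a hypothetical name (no such helper exists in the kernel), and the sketch assumes the per-VMA lock is the rwsem embedded in vma->vm_lock, as used in this series:

    /*
     * Hypothetical sketch only: accept either a held per-VMA read lock or
     * mmap_lock as proof that the VMA is stable. Not an existing kernel API.
     */
    static inline void vma_assert_stabilized(struct vm_area_struct *vma)
    {
    #ifdef CONFIG_PER_VMA_LOCK
            /* a held per-VMA lock keeps the VMA (and its anon_name) stable */
            if (rwsem_is_locked(&vma->vm_lock->lock))
                    return;
    #endif
            /* otherwise fall back to the existing mmap_lock assertion */
            mmap_assert_locked(vma->vm_mm);
    }

Whether such a relaxation is actually safe for anon_vma_name() is exactly the open question in this part of the thread.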
On Thu, Jun 6, 2024 at 9:52 AM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > Attempt to use RCU-protected per-VMA lock when looking up requested VMA > > > as much as possible, only falling back to mmap_lock if per-VMA lock > > > failed. This is done so that querying of VMAs doesn't interfere with > > > other critical tasks, like page fault handling. > > > > > > This has been suggested by mm folks, and we make use of a newly added > > > internal API that works like find_vma(), but tries to use per-VMA lock. > > > > > > We have two sets of setup/query/teardown helper functions with different > > > implementations depending on availability of per-VMA lock (conditioned > > > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties. > > > > > > When per-VMA lock is available, lookup is done under RCU, attempting to > > > take a per-VMA lock. If that fails, we fallback to mmap_lock, but then > > > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock > > > immediately. In this configuration mmap_lock is never helf for long, > > > minimizing disruptions while querying. > > > > > > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs > > > using find_vma() API, and then unlock mmap_lock at the very end once as > > > well. In this setup we avoid locking/unlocking mmap_lock on every looked > > > up VMA (depending on query parameters we might need to iterate a few of > > > them). > > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > --- > > > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ > > > 1 file changed, 46 insertions(+) > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > index 614fbe5d0667..140032ffc551 100644 > > > --- a/fs/proc/task_mmu.c > > > +++ b/fs/proc/task_mmu.c > > > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file) > > > PROCMAP_QUERY_VMA_FLAGS \ > > > ) > > > > > > +#ifdef CONFIG_PER_VMA_LOCK > > > +static int query_vma_setup(struct mm_struct *mm) > > > +{ > > > + /* in the presence of per-VMA lock we don't need any setup/teardown */ > > > + return 0; > > > +} > > > + > > > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma) > > > +{ > > > + /* in the presence of per-VMA lock we need to unlock vma, if present */ > > > + if (vma) > > > + vma_end_read(vma); > > > +} > > > + > > > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr) > > > +{ > > > + struct vm_area_struct *vma; > > > + > > > + /* try to use less disruptive per-VMA lock */ > > > + vma = find_and_lock_vma_rcu(mm, addr); > > > + if (IS_ERR(vma)) { > > > + /* failed to take per-VMA lock, fallback to mmap_lock */ > > > + if (mmap_read_lock_killable(mm)) > > > + return ERR_PTR(-EINTR); > > > + > > > + vma = find_vma(mm, addr); > > > + if (vma) { > > > + /* > > > + * We cannot use vma_start_read() as it may fail due to > > > + * false locked (see comment in vma_start_read()). We > > > + * can avoid that by directly locking vm_lock under > > > + * mmap_lock, which guarantees that nobody can lock the > > > + * vma for write (vma_start_write()) under us. > > > + */ > > > + down_read(&vma->vm_lock->lock); > > > > Hi Andrii, > > The above pattern of locking VMA under mmap_lock and then dropping > > mmap_lock is becoming more common. 
Matthew had an RFC proposal for an > > API to do this here: > > https://lore.kernel.org/all/ZivhG0yrbpFqORDw@casper.infradead.org/. It > > might be worth reviving that discussion. > > Sure, it would be nice to have generic and blessed primitives to use > here. But the good news is that once this is all figured out by you mm > folks, it should be easy to make use of those primitives here, right? > > > > > > + } > > > + > > > + mmap_read_unlock(mm); > > > > Later on in your code you are calling get_vma_name() which might call > > anon_vma_name() to retrieve user-defined VMA name. After this patch > > this operation will be done without holding mmap_lock, however per > > https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582 > > this function has to be called with mmap_lock held for read. Indeed > > with debug flags enabled you should hit this assertion: > > https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96. > > Sigh... Ok, what's the suggestion then? Should it be some variant of > mmap_assert_locked() || vma_assert_locked() logic, or it's not so > simple? > > Maybe I should just drop the CONFIG_PER_VMA_LOCK changes for now until > all these gotchas are figured out for /proc/<pid>/maps anyway, and > then we can adapt both text-based and ioctl-based /proc/<pid>/maps > APIs on top of whatever the final approach will end up being the right > one? > > Liam, any objections to this? The whole point of this patch set is to > add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My > implementation is structured in a way that should be easily amenable > to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle > things that need to be figured for existing text-based > /proc/<pid>/maps anyways, I think it would be best to use mmap_lock > for now for this new API, and then adopt the same final > CONFIG_PER_VMA_LOCK-aware solution. I agree that you should start simple, using mmap_lock first and then work on improvements. Would the proposed solution become useless with coarse mmap_lock'ing? > > > > > > + } > > > + > > > + return vma; > > > +} > > > +#else > > > static int query_vma_setup(struct mm_struct *mm) > > > { > > > return mmap_read_lock_killable(mm); > > > @@ -402,6 +445,7 @@ static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsig > > > { > > > return find_vma(mm, addr); > > > } > > > +#endif > > > > > > static struct vm_area_struct *query_matching_vma(struct mm_struct *mm, > > > unsigned long addr, u32 flags) > > > @@ -441,8 +485,10 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm, > > > skip_vma: > > > /* > > > * If the user needs closest matching VMA, keep iterating. > > > + * But before we proceed we might need to unlock current VMA. > > > */ > > > addr = vma->vm_end; > > > + vma_end_read(vma); /* no-op under !CONFIG_PER_VMA_LOCK */ > > > if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA) > > > goto next_vma; > > > no_vma: > > > -- > > > 2.43.0 > > >
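"Starting simple" here effectively means keeping only the coarse-grained helpers that the patch already provides in its #else branch. A minimal sketch follows; the query_vma_teardown() body is not visible in the quoted hunks and is assumed to just drop mmap_lock:

    static int query_vma_setup(struct mm_struct *mm)
    {
            /* coarse-grained: hold mmap_lock for read across the whole query */
            return mmap_read_lock_killable(mm);
    }

    static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
    {
            /* assumed: mmap_lock was taken in setup and is dropped here */
            mmap_read_unlock(mm);
    }

    static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm,
                                                          unsigned long addr)
    {
            return find_vma(mm, addr);
    }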
* Andrii Nakryiko <andrii.nakryiko@gmail.com> [240606 12:52]: > On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > Attempt to use RCU-protected per-VMA lock when looking up requested VMA > > > as much as possible, only falling back to mmap_lock if per-VMA lock > > > failed. This is done so that querying of VMAs doesn't interfere with > > > other critical tasks, like page fault handling. > > > > > > This has been suggested by mm folks, and we make use of a newly added > > > internal API that works like find_vma(), but tries to use per-VMA lock. > > > > > > We have two sets of setup/query/teardown helper functions with different > > > implementations depending on availability of per-VMA lock (conditioned > > > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties. > > > > > > When per-VMA lock is available, lookup is done under RCU, attempting to > > > take a per-VMA lock. If that fails, we fallback to mmap_lock, but then > > > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock > > > immediately. In this configuration mmap_lock is never helf for long, > > > minimizing disruptions while querying. > > > > > > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs > > > using find_vma() API, and then unlock mmap_lock at the very end once as > > > well. In this setup we avoid locking/unlocking mmap_lock on every looked > > > up VMA (depending on query parameters we might need to iterate a few of > > > them). > > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > --- > > > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ > > > 1 file changed, 46 insertions(+) > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > index 614fbe5d0667..140032ffc551 100644 > > > --- a/fs/proc/task_mmu.c > > > +++ b/fs/proc/task_mmu.c > > > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file) > > > PROCMAP_QUERY_VMA_FLAGS \ > > > ) > > > > > > +#ifdef CONFIG_PER_VMA_LOCK > > > +static int query_vma_setup(struct mm_struct *mm) > > > +{ > > > + /* in the presence of per-VMA lock we don't need any setup/teardown */ > > > + return 0; > > > +} > > > + > > > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma) > > > +{ > > > + /* in the presence of per-VMA lock we need to unlock vma, if present */ > > > + if (vma) > > > + vma_end_read(vma); > > > +} > > > + > > > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr) > > > +{ > > > + struct vm_area_struct *vma; > > > + > > > + /* try to use less disruptive per-VMA lock */ > > > + vma = find_and_lock_vma_rcu(mm, addr); > > > + if (IS_ERR(vma)) { > > > + /* failed to take per-VMA lock, fallback to mmap_lock */ > > > + if (mmap_read_lock_killable(mm)) > > > + return ERR_PTR(-EINTR); > > > + > > > + vma = find_vma(mm, addr); > > > + if (vma) { > > > + /* > > > + * We cannot use vma_start_read() as it may fail due to > > > + * false locked (see comment in vma_start_read()). We > > > + * can avoid that by directly locking vm_lock under > > > + * mmap_lock, which guarantees that nobody can lock the > > > + * vma for write (vma_start_write()) under us. > > > + */ > > > + down_read(&vma->vm_lock->lock); > > > > Hi Andrii, > > The above pattern of locking VMA under mmap_lock and then dropping > > mmap_lock is becoming more common. 
Matthew had an RFC proposal for an > > API to do this here: > > https://lore.kernel.org/all/ZivhG0yrbpFqORDw@casper.infradead.org/. It > > might be worth reviving that discussion. > > Sure, it would be nice to have generic and blessed primitives to use > here. But the good news is that once this is all figured out by you mm > folks, it should be easy to make use of those primitives here, right? > > > > > > + } > > > + > > > + mmap_read_unlock(mm); > > > > Later on in your code you are calling get_vma_name() which might call > > anon_vma_name() to retrieve user-defined VMA name. After this patch > > this operation will be done without holding mmap_lock, however per > > https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582 > > this function has to be called with mmap_lock held for read. Indeed > > with debug flags enabled you should hit this assertion: > > https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96. The documentation on the first link says to hold the lock or take a reference, but then we assert the lock. If you take a reference to the anon vma name, then we will trigger the assert. Either the documentation needs changing or the assert is incorrect - or I'm missing something? > > Sigh... Ok, what's the suggestion then? Should it be some variant of > mmap_assert_locked() || vma_assert_locked() logic, or it's not so > simple? > > Maybe I should just drop the CONFIG_PER_VMA_LOCK changes for now until > all these gotchas are figured out for /proc/<pid>/maps anyway, and > then we can adapt both text-based and ioctl-based /proc/<pid>/maps > APIs on top of whatever the final approach will end up being the right > one? > > Liam, any objections to this? The whole point of this patch set is to > add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My > implementation is structured in a way that should be easily amenable > to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle > things that need to be figured for existing text-based > /proc/<pid>/maps anyways, I think it would be best to use mmap_lock > for now for this new API, and then adopt the same final > CONFIG_PER_VMA_LOCK-aware solution. The reason I was hoping to have the new interface use the per-vma locking from the start is to ensure the guarantees that we provide to the users would not change. We'd also avoid shifting to yet another mmap_lock users. I also didn't think it would complicate your series too much, so I understand why you want to revert to the old locking semantics. I'm fine with you continuing with the series on the old lock. Thanks for trying to make this work. Regards, Liam
On Thu, Jun 6, 2024 at 10:15 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240606 12:52]: > > On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > > > Attempt to use RCU-protected per-VMA lock when looking up requested VMA > > > > as much as possible, only falling back to mmap_lock if per-VMA lock > > > > failed. This is done so that querying of VMAs doesn't interfere with > > > > other critical tasks, like page fault handling. > > > > > > > > This has been suggested by mm folks, and we make use of a newly added > > > > internal API that works like find_vma(), but tries to use per-VMA lock. > > > > > > > > We have two sets of setup/query/teardown helper functions with different > > > > implementations depending on availability of per-VMA lock (conditioned > > > > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties. > > > > > > > > When per-VMA lock is available, lookup is done under RCU, attempting to > > > > take a per-VMA lock. If that fails, we fallback to mmap_lock, but then > > > > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock > > > > immediately. In this configuration mmap_lock is never helf for long, > > > > minimizing disruptions while querying. > > > > > > > > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs > > > > using find_vma() API, and then unlock mmap_lock at the very end once as > > > > well. In this setup we avoid locking/unlocking mmap_lock on every looked > > > > up VMA (depending on query parameters we might need to iterate a few of > > > > them). > > > > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > > --- > > > > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ > > > > 1 file changed, 46 insertions(+) > > > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > > index 614fbe5d0667..140032ffc551 100644 > > > > --- a/fs/proc/task_mmu.c > > > > +++ b/fs/proc/task_mmu.c > > > > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file) > > > > PROCMAP_QUERY_VMA_FLAGS \ > > > > ) > > > > > > > > +#ifdef CONFIG_PER_VMA_LOCK > > > > +static int query_vma_setup(struct mm_struct *mm) > > > > +{ > > > > + /* in the presence of per-VMA lock we don't need any setup/teardown */ > > > > + return 0; > > > > +} > > > > + > > > > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma) > > > > +{ > > > > + /* in the presence of per-VMA lock we need to unlock vma, if present */ > > > > + if (vma) > > > > + vma_end_read(vma); > > > > +} > > > > + > > > > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr) > > > > +{ > > > > + struct vm_area_struct *vma; > > > > + > > > > + /* try to use less disruptive per-VMA lock */ > > > > + vma = find_and_lock_vma_rcu(mm, addr); > > > > + if (IS_ERR(vma)) { > > > > + /* failed to take per-VMA lock, fallback to mmap_lock */ > > > > + if (mmap_read_lock_killable(mm)) > > > > + return ERR_PTR(-EINTR); > > > > + > > > > + vma = find_vma(mm, addr); > > > > + if (vma) { > > > > + /* > > > > + * We cannot use vma_start_read() as it may fail due to > > > > + * false locked (see comment in vma_start_read()). 
We > > > > + * can avoid that by directly locking vm_lock under > > > > + * mmap_lock, which guarantees that nobody can lock the > > > > + * vma for write (vma_start_write()) under us. > > > > + */ > > > > + down_read(&vma->vm_lock->lock); > > > > > > Hi Andrii, > > > The above pattern of locking VMA under mmap_lock and then dropping > > > mmap_lock is becoming more common. Matthew had an RFC proposal for an > > > API to do this here: > > > https://lore.kernel.org/all/ZivhG0yrbpFqORDw@casper.infradead.org/. It > > > might be worth reviving that discussion. > > > > Sure, it would be nice to have generic and blessed primitives to use > > here. But the good news is that once this is all figured out by you mm > > folks, it should be easy to make use of those primitives here, right? > > > > > > > > > + } > > > > + > > > > + mmap_read_unlock(mm); > > > > > > Later on in your code you are calling get_vma_name() which might call > > > anon_vma_name() to retrieve user-defined VMA name. After this patch > > > this operation will be done without holding mmap_lock, however per > > > https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582 > > > this function has to be called with mmap_lock held for read. Indeed > > > with debug flags enabled you should hit this assertion: > > > https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96. > > The documentation on the first link says to hold the lock or take a > reference, but then we assert the lock. If you take a reference to the > anon vma name, then we will trigger the assert. Either the > documentation needs changing or the assert is incorrect - or I'm missing > something? I think the documentation is correct. It says that at the time of calling anon_vma_name() the mmap_lock should be locked (hence the assertion). Then the user can raise anon_vma_name refcount, drop mmap_lock and safely continue using anon_vma_name object. IOW this is fine: mmap_read_lock(vma->mm); anon_name = anon_vma_name(vma); anon_vma_name_get(anon_name); mmap_read_unlock(vma->mm); // keep using anon_name anon_vma_name_put(anon_name); > > > > > Sigh... Ok, what's the suggestion then? Should it be some variant of > > mmap_assert_locked() || vma_assert_locked() logic, or it's not so > > simple? > > > > Maybe I should just drop the CONFIG_PER_VMA_LOCK changes for now until > > all these gotchas are figured out for /proc/<pid>/maps anyway, and > > then we can adapt both text-based and ioctl-based /proc/<pid>/maps > > APIs on top of whatever the final approach will end up being the right > > one? > > > > Liam, any objections to this? The whole point of this patch set is to > > add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My > > implementation is structured in a way that should be easily amenable > > to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle > > things that need to be figured for existing text-based > > /proc/<pid>/maps anyways, I think it would be best to use mmap_lock > > for now for this new API, and then adopt the same final > > CONFIG_PER_VMA_LOCK-aware solution. > > The reason I was hoping to have the new interface use the per-vma > locking from the start is to ensure the guarantees that we provide to > the users would not change. We'd also avoid shifting to yet another > mmap_lock users. > > I also didn't think it would complicate your series too much, so I > understand why you want to revert to the old locking semantics. I'm > fine with you continuing with the series on the old lock. 
Thanks for > trying to make this work. > > Regards, > Liam
On Thu, Jun 6, 2024 at 10:12 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Thu, Jun 6, 2024 at 9:52 AM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > > > On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > > > Attempt to use RCU-protected per-VMA lock when looking up requested VMA > > > > as much as possible, only falling back to mmap_lock if per-VMA lock > > > > failed. This is done so that querying of VMAs doesn't interfere with > > > > other critical tasks, like page fault handling. > > > > > > > > This has been suggested by mm folks, and we make use of a newly added > > > > internal API that works like find_vma(), but tries to use per-VMA lock. > > > > > > > > We have two sets of setup/query/teardown helper functions with different > > > > implementations depending on availability of per-VMA lock (conditioned > > > > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties. > > > > > > > > When per-VMA lock is available, lookup is done under RCU, attempting to > > > > take a per-VMA lock. If that fails, we fallback to mmap_lock, but then > > > > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock > > > > immediately. In this configuration mmap_lock is never helf for long, > > > > minimizing disruptions while querying. > > > > > > > > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs > > > > using find_vma() API, and then unlock mmap_lock at the very end once as > > > > well. In this setup we avoid locking/unlocking mmap_lock on every looked > > > > up VMA (depending on query parameters we might need to iterate a few of > > > > them). > > > > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > > --- > > > > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ > > > > 1 file changed, 46 insertions(+) > > > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > > index 614fbe5d0667..140032ffc551 100644 > > > > --- a/fs/proc/task_mmu.c > > > > +++ b/fs/proc/task_mmu.c > > > > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file) > > > > PROCMAP_QUERY_VMA_FLAGS \ > > > > ) > > > > > > > > +#ifdef CONFIG_PER_VMA_LOCK > > > > +static int query_vma_setup(struct mm_struct *mm) > > > > +{ > > > > + /* in the presence of per-VMA lock we don't need any setup/teardown */ > > > > + return 0; > > > > +} > > > > + > > > > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma) > > > > +{ > > > > + /* in the presence of per-VMA lock we need to unlock vma, if present */ > > > > + if (vma) > > > > + vma_end_read(vma); > > > > +} > > > > + > > > > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr) > > > > +{ > > > > + struct vm_area_struct *vma; > > > > + > > > > + /* try to use less disruptive per-VMA lock */ > > > > + vma = find_and_lock_vma_rcu(mm, addr); > > > > + if (IS_ERR(vma)) { > > > > + /* failed to take per-VMA lock, fallback to mmap_lock */ > > > > + if (mmap_read_lock_killable(mm)) > > > > + return ERR_PTR(-EINTR); > > > > + > > > > + vma = find_vma(mm, addr); > > > > + if (vma) { > > > > + /* > > > > + * We cannot use vma_start_read() as it may fail due to > > > > + * false locked (see comment in vma_start_read()). 
We > > > > + * can avoid that by directly locking vm_lock under > > > > + * mmap_lock, which guarantees that nobody can lock the > > > > + * vma for write (vma_start_write()) under us. > > > > + */ > > > > + down_read(&vma->vm_lock->lock); > > > > > > Hi Andrii, > > > The above pattern of locking VMA under mmap_lock and then dropping > > > mmap_lock is becoming more common. Matthew had an RFC proposal for an > > > API to do this here: > > > https://lore.kernel.org/all/ZivhG0yrbpFqORDw@casper.infradead.org/. It > > > might be worth reviving that discussion. > > > > Sure, it would be nice to have generic and blessed primitives to use > > here. But the good news is that once this is all figured out by you mm > > folks, it should be easy to make use of those primitives here, right? > > > > > > > > > + } > > > > + > > > > + mmap_read_unlock(mm); > > > > > > Later on in your code you are calling get_vma_name() which might call > > > anon_vma_name() to retrieve user-defined VMA name. After this patch > > > this operation will be done without holding mmap_lock, however per > > > https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582 > > > this function has to be called with mmap_lock held for read. Indeed > > > with debug flags enabled you should hit this assertion: > > > https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96. > > > > Sigh... Ok, what's the suggestion then? Should it be some variant of > > mmap_assert_locked() || vma_assert_locked() logic, or it's not so > > simple? > > > > Maybe I should just drop the CONFIG_PER_VMA_LOCK changes for now until > > all these gotchas are figured out for /proc/<pid>/maps anyway, and > > then we can adapt both text-based and ioctl-based /proc/<pid>/maps > > APIs on top of whatever the final approach will end up being the right > > one? > > > > Liam, any objections to this? The whole point of this patch set is to > > add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My > > implementation is structured in a way that should be easily amenable > > to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle > > things that need to be figured for existing text-based > > /proc/<pid>/maps anyways, I think it would be best to use mmap_lock > > for now for this new API, and then adopt the same final > > CONFIG_PER_VMA_LOCK-aware solution. > > I agree that you should start simple, using mmap_lock first and then > work on improvements. Would the proposed solution become useless with > coarse mmap_lock'ing? Sorry, it's not clear what you mean by "proposed solution"? If you mean this new ioctl-based API, no it's still very useful and fast even if we take mmap_lock. But if you mean vm_lock, then I'd say that due to anon_vma_name() complication it makes vm_lock not attractive anymore, because vma_name will be requested pretty much always. And if we need to take mmap_lock anyways, then what's the point of per-VMA lock, right? I'd like to be a good citizen here and help you guys not add new mmap_lock users (and adopt per-VMA lock more widely), but I'm not sure I can solve the anon_vma_name() conundrum, unfortunately. Ultimately, I do care the most about having this new API available for my customers to take advantage of, of course. 
> > > > > > > > > > + } > > > > + > > > > + return vma; > > > > +} > > > > +#else > > > > static int query_vma_setup(struct mm_struct *mm) > > > > { > > > > return mmap_read_lock_killable(mm); > > > > @@ -402,6 +445,7 @@ static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsig > > > > { > > > > return find_vma(mm, addr); > > > > } > > > > +#endif > > > > > > > > static struct vm_area_struct *query_matching_vma(struct mm_struct *mm, > > > > unsigned long addr, u32 flags) > > > > @@ -441,8 +485,10 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm, > > > > skip_vma: > > > > /* > > > > * If the user needs closest matching VMA, keep iterating. > > > > + * But before we proceed we might need to unlock current VMA. > > > > */ > > > > addr = vma->vm_end; > > > > + vma_end_read(vma); /* no-op under !CONFIG_PER_VMA_LOCK */ > > > > if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA) > > > > goto next_vma; > > > > no_vma: > > > > -- > > > > 2.43.0 > > > >
* Suren Baghdasaryan <surenb@google.com> [240606 13:33]: > On Thu, Jun 6, 2024 at 10:15 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > > * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240606 12:52]: > > > On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > > > > > Attempt to use RCU-protected per-VMA lock when looking up requested VMA > > > > > as much as possible, only falling back to mmap_lock if per-VMA lock > > > > > failed. This is done so that querying of VMAs doesn't interfere with > > > > > other critical tasks, like page fault handling. > > > > > > > > > > This has been suggested by mm folks, and we make use of a newly added > > > > > internal API that works like find_vma(), but tries to use per-VMA lock. > > > > > > > > > > We have two sets of setup/query/teardown helper functions with different > > > > > implementations depending on availability of per-VMA lock (conditioned > > > > > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties. > > > > > > > > > > When per-VMA lock is available, lookup is done under RCU, attempting to > > > > > take a per-VMA lock. If that fails, we fallback to mmap_lock, but then > > > > > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock > > > > > immediately. In this configuration mmap_lock is never helf for long, > > > > > minimizing disruptions while querying. > > > > > > > > > > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs > > > > > using find_vma() API, and then unlock mmap_lock at the very end once as > > > > > well. In this setup we avoid locking/unlocking mmap_lock on every looked > > > > > up VMA (depending on query parameters we might need to iterate a few of > > > > > them). 
> > > > > > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > > > --- > > > > > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ > > > > > 1 file changed, 46 insertions(+) > > > > > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > > > index 614fbe5d0667..140032ffc551 100644 > > > > > --- a/fs/proc/task_mmu.c > > > > > +++ b/fs/proc/task_mmu.c > > > > > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file) > > > > > PROCMAP_QUERY_VMA_FLAGS \ > > > > > ) > > > > > > > > > > +#ifdef CONFIG_PER_VMA_LOCK > > > > > +static int query_vma_setup(struct mm_struct *mm) > > > > > +{ > > > > > + /* in the presence of per-VMA lock we don't need any setup/teardown */ > > > > > + return 0; > > > > > +} > > > > > + > > > > > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma) > > > > > +{ > > > > > + /* in the presence of per-VMA lock we need to unlock vma, if present */ > > > > > + if (vma) > > > > > + vma_end_read(vma); > > > > > +} > > > > > + > > > > > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr) > > > > > +{ > > > > > + struct vm_area_struct *vma; > > > > > + > > > > > + /* try to use less disruptive per-VMA lock */ > > > > > + vma = find_and_lock_vma_rcu(mm, addr); > > > > > + if (IS_ERR(vma)) { > > > > > + /* failed to take per-VMA lock, fallback to mmap_lock */ > > > > > + if (mmap_read_lock_killable(mm)) > > > > > + return ERR_PTR(-EINTR); > > > > > + > > > > > + vma = find_vma(mm, addr); > > > > > + if (vma) { > > > > > + /* > > > > > + * We cannot use vma_start_read() as it may fail due to > > > > > + * false locked (see comment in vma_start_read()). We > > > > > + * can avoid that by directly locking vm_lock under > > > > > + * mmap_lock, which guarantees that nobody can lock the > > > > > + * vma for write (vma_start_write()) under us. > > > > > + */ > > > > > + down_read(&vma->vm_lock->lock); > > > > > > > > Hi Andrii, > > > > The above pattern of locking VMA under mmap_lock and then dropping > > > > mmap_lock is becoming more common. Matthew had an RFC proposal for an > > > > API to do this here: > > > > https://lore.kernel.org/all/ZivhG0yrbpFqORDw@casper.infradead.org/. It > > > > might be worth reviving that discussion. > > > > > > Sure, it would be nice to have generic and blessed primitives to use > > > here. But the good news is that once this is all figured out by you mm > > > folks, it should be easy to make use of those primitives here, right? > > > > > > > > > > > > + } > > > > > + > > > > > + mmap_read_unlock(mm); > > > > > > > > Later on in your code you are calling get_vma_name() which might call > > > > anon_vma_name() to retrieve user-defined VMA name. After this patch > > > > this operation will be done without holding mmap_lock, however per > > > > https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582 > > > > this function has to be called with mmap_lock held for read. Indeed > > > > with debug flags enabled you should hit this assertion: > > > > https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96. > > > > The documentation on the first link says to hold the lock or take a > > reference, but then we assert the lock. If you take a reference to the > > anon vma name, then we will trigger the assert. Either the > > documentation needs changing or the assert is incorrect - or I'm missing > > something? > > I think the documentation is correct. 
It says that at the time of > calling anon_vma_name() the mmap_lock should be locked (hence the > assertion). Then the user can raise anon_vma_name refcount, drop > mmap_lock and safely continue using anon_vma_name object. IOW this is > fine: > > mmap_read_lock(vma->mm); > anon_name = anon_vma_name(vma); > anon_vma_name_get(anon_name); > mmap_read_unlock(vma->mm); > // keep using anon_name > anon_vma_name_put(anon_name); > I consider that an optimistic view of what will happen, but okay.
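Formatted as a standalone sketch, the pattern Suren describes looks like this (vma->vm_mm is used for the actual field name; otherwise the calls are as in the quoted example):

    struct anon_vma_name *anon_name;

    mmap_read_lock(vma->vm_mm);
    anon_name = anon_vma_name(vma);   /* legal: mmap_lock is held for read */
    anon_vma_name_get(anon_name);     /* pin the name object (NULL-safe) */
    mmap_read_unlock(vma->vm_mm);

    /* ... keep using anon_name without mmap_lock ... */

    anon_vma_name_put(anon_name);     /* drop the pin when done */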
On Thu, Jun 6, 2024 at 10:15 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240606 12:52]: > > On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > > > Attempt to use RCU-protected per-VMA lock when looking up requested VMA > > > > as much as possible, only falling back to mmap_lock if per-VMA lock > > > > failed. This is done so that querying of VMAs doesn't interfere with > > > > other critical tasks, like page fault handling. > > > > > > > > This has been suggested by mm folks, and we make use of a newly added > > > > internal API that works like find_vma(), but tries to use per-VMA lock. > > > > > > > > We have two sets of setup/query/teardown helper functions with different > > > > implementations depending on availability of per-VMA lock (conditioned > > > > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties. > > > > > > > > When per-VMA lock is available, lookup is done under RCU, attempting to > > > > take a per-VMA lock. If that fails, we fallback to mmap_lock, but then > > > > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock > > > > immediately. In this configuration mmap_lock is never helf for long, > > > > minimizing disruptions while querying. > > > > > > > > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs > > > > using find_vma() API, and then unlock mmap_lock at the very end once as > > > > well. In this setup we avoid locking/unlocking mmap_lock on every looked > > > > up VMA (depending on query parameters we might need to iterate a few of > > > > them). > > > > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > > --- > > > > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ > > > > 1 file changed, 46 insertions(+) > > > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > > index 614fbe5d0667..140032ffc551 100644 > > > > --- a/fs/proc/task_mmu.c > > > > +++ b/fs/proc/task_mmu.c > > > > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file) > > > > PROCMAP_QUERY_VMA_FLAGS \ > > > > ) > > > > > > > > +#ifdef CONFIG_PER_VMA_LOCK > > > > +static int query_vma_setup(struct mm_struct *mm) > > > > +{ > > > > + /* in the presence of per-VMA lock we don't need any setup/teardown */ > > > > + return 0; > > > > +} > > > > + > > > > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma) > > > > +{ > > > > + /* in the presence of per-VMA lock we need to unlock vma, if present */ > > > > + if (vma) > > > > + vma_end_read(vma); > > > > +} > > > > + > > > > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr) > > > > +{ > > > > + struct vm_area_struct *vma; > > > > + > > > > + /* try to use less disruptive per-VMA lock */ > > > > + vma = find_and_lock_vma_rcu(mm, addr); > > > > + if (IS_ERR(vma)) { > > > > + /* failed to take per-VMA lock, fallback to mmap_lock */ > > > > + if (mmap_read_lock_killable(mm)) > > > > + return ERR_PTR(-EINTR); > > > > + > > > > + vma = find_vma(mm, addr); > > > > + if (vma) { > > > > + /* > > > > + * We cannot use vma_start_read() as it may fail due to > > > > + * false locked (see comment in vma_start_read()). 
We > > > > + * can avoid that by directly locking vm_lock under > > > > + * mmap_lock, which guarantees that nobody can lock the > > > > + * vma for write (vma_start_write()) under us. > > > > + */ > > > > + down_read(&vma->vm_lock->lock); > > > > > > Hi Andrii, > > > The above pattern of locking VMA under mmap_lock and then dropping > > > mmap_lock is becoming more common. Matthew had an RFC proposal for an > > > API to do this here: > > > https://lore.kernel.org/all/ZivhG0yrbpFqORDw@casper.infradead.org/. It > > > might be worth reviving that discussion. > > > > Sure, it would be nice to have generic and blessed primitives to use > > here. But the good news is that once this is all figured out by you mm > > folks, it should be easy to make use of those primitives here, right? > > > > > > > > > + } > > > > + > > > > + mmap_read_unlock(mm); > > > > > > Later on in your code you are calling get_vma_name() which might call > > > anon_vma_name() to retrieve user-defined VMA name. After this patch > > > this operation will be done without holding mmap_lock, however per > > > https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582 > > > this function has to be called with mmap_lock held for read. Indeed > > > with debug flags enabled you should hit this assertion: > > > https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96. > > The documentation on the first link says to hold the lock or take a > reference, but then we assert the lock. If you take a reference to the > anon vma name, then we will trigger the assert. Either the > documentation needs changing or the assert is incorrect - or I'm missing > something? > > > > > Sigh... Ok, what's the suggestion then? Should it be some variant of > > mmap_assert_locked() || vma_assert_locked() logic, or it's not so > > simple? > > > > Maybe I should just drop the CONFIG_PER_VMA_LOCK changes for now until > > all these gotchas are figured out for /proc/<pid>/maps anyway, and > > then we can adapt both text-based and ioctl-based /proc/<pid>/maps > > APIs on top of whatever the final approach will end up being the right > > one? > > > > Liam, any objections to this? The whole point of this patch set is to > > add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My > > implementation is structured in a way that should be easily amenable > > to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle > > things that need to be figured for existing text-based > > /proc/<pid>/maps anyways, I think it would be best to use mmap_lock > > for now for this new API, and then adopt the same final > > CONFIG_PER_VMA_LOCK-aware solution. > > The reason I was hoping to have the new interface use the per-vma > locking from the start is to ensure the guarantees that we provide to > the users would not change. We'd also avoid shifting to yet another > mmap_lock users. > Yep, it's completely understandable. And you see that I changed the structure quite a lot to abstract away mmap_lock vs vm_lock details. I'm afraid anon_vma_name() is quite an obstacle, unfortunately, and seems like it should be addressed first, but I'm just not qualified enough to do this. > I also didn't think it would complicate your series too much, so I > understand why you want to revert to the old locking semantics. I'm > fine with you continuing with the series on the old lock. Thanks for > trying to make this work. 
> I'm happy to keep the existing structure of the code, and (intentionally) all the CONFIG_PER_VMA_LOCK logic is in separate patches, so it's easy to do. I'd love to help adopt a per-VMA lock once all the pieces are figured out. Hopefully anon_vma_name() is the last one remaining :) So please keep me cc'ed on relevant patches. As I mentioned, I just don't feel like I would be able to solve the anon_vma_name() problem, but of course I wouldn't want to be completely blocked by it as well. > Regards, > Liam
* Andrii Nakryiko <andrii.nakryiko@gmail.com> [240606 14:09]: ... > > > Liam, any objections to this? The whole point of this patch set is to > > > add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My > > > implementation is structured in a way that should be easily amenable > > > to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle > > > things that need to be figured for existing text-based > > > /proc/<pid>/maps anyways, I think it would be best to use mmap_lock > > > for now for this new API, and then adopt the same final > > > CONFIG_PER_VMA_LOCK-aware solution. > > > > The reason I was hoping to have the new interface use the per-vma > > locking from the start is to ensure the guarantees that we provide to > > the users would not change. We'd also avoid shifting to yet another > > mmap_lock users. > > > > Yep, it's completely understandable. And you see that I changed the > structure quite a lot to abstract away mmap_lock vs vm_lock details. > I'm afraid anon_vma_name() is quite an obstacle, unfortunately, and > seems like it should be addressed first, but I'm just not qualified > enough to do this. > > > I also didn't think it would complicate your series too much, so I > > understand why you want to revert to the old locking semantics. I'm > > fine with you continuing with the series on the old lock. Thanks for > > trying to make this work. > > > > I'm happy to keep the existing structure of the code, and > (intentionally) all the CONFIG_PER_VMA_LOCK logic is in separate > patches, so it's easy to do. I'd love to help adopt a per-VMA lock > once all the pieces are figured out. Hopefully anon_vma_name() is the > last one remaining :) So please keep me cc'ed on relevant patches. > > As I mentioned, I just don't feel like I would be able to solve the > anon_vma_name() problem, but of course I wouldn't want to be > completely blocked by it as well. > Absolutely. Thanks for trying. To be clear, I'm fine with you dropping the per-vma locking from this interface as well. Thanks, Liam
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 614fbe5d0667..140032ffc551 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file)
         PROCMAP_QUERY_VMA_FLAGS \
 )
 
+#ifdef CONFIG_PER_VMA_LOCK
+static int query_vma_setup(struct mm_struct *mm)
+{
+        /* in the presence of per-VMA lock we don't need any setup/teardown */
+        return 0;
+}
+
+static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+        /* in the presence of per-VMA lock we need to unlock vma, if present */
+        if (vma)
+                vma_end_read(vma);
+}
+
+static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
+{
+        struct vm_area_struct *vma;
+
+        /* try to use less disruptive per-VMA lock */
+        vma = find_and_lock_vma_rcu(mm, addr);
+        if (IS_ERR(vma)) {
+                /* failed to take per-VMA lock, fallback to mmap_lock */
+                if (mmap_read_lock_killable(mm))
+                        return ERR_PTR(-EINTR);
+
+                vma = find_vma(mm, addr);
+                if (vma) {
+                        /*
+                         * We cannot use vma_start_read() as it may fail due to
+                         * false locked (see comment in vma_start_read()). We
+                         * can avoid that by directly locking vm_lock under
+                         * mmap_lock, which guarantees that nobody can lock the
+                         * vma for write (vma_start_write()) under us.
+                         */
+                        down_read(&vma->vm_lock->lock);
+                }
+
+                mmap_read_unlock(mm);
+        }
+
+        return vma;
+}
+#else
 static int query_vma_setup(struct mm_struct *mm)
 {
         return mmap_read_lock_killable(mm);
@@ -402,6 +445,7 @@ static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsig
 {
         return find_vma(mm, addr);
 }
+#endif
 
 static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
                                                  unsigned long addr, u32 flags)
@@ -441,8 +485,10 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
 skip_vma:
         /*
          * If the user needs closest matching VMA, keep iterating.
+         * But before we proceed we might need to unlock current VMA.
          */
         addr = vma->vm_end;
+        vma_end_read(vma); /* no-op under !CONFIG_PER_VMA_LOCK */
         if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA)
                 goto next_vma;
 no_vma:
Attempt to use RCU-protected per-VMA lock when looking up requested VMA as much as possible, only falling back to mmap_lock if per-VMA lock failed. This is done so that querying of VMAs doesn't interfere with other critical tasks, like page fault handling.

This has been suggested by mm folks, and we make use of a newly added internal API that works like find_vma(), but tries to use per-VMA lock.

We have two sets of setup/query/teardown helper functions with different implementations depending on availability of per-VMA lock (conditioned on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties.

When per-VMA lock is available, lookup is done under RCU, attempting to take a per-VMA lock. If that fails, we fall back to mmap_lock, but then proceed to unconditionally grab per-VMA lock again, dropping mmap_lock immediately. In this configuration mmap_lock is never held for long, minimizing disruptions while querying.

When per-VMA lock is compiled out, we take mmap_lock once, query VMAs using the find_vma() API, and then unlock mmap_lock at the very end once as well. In this setup we avoid locking/unlocking mmap_lock on every looked-up VMA (depending on query parameters we might need to iterate a few of them).

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)
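As an illustration of how the helpers in this patch are meant to compose, the sketch below strings setup, lookup, and teardown together. It is simplified (flag matching and iteration to the next VMA are elided, and example_query() plus the -ESRCH error choice are illustrative only); the real caller is query_matching_vma() in fs/proc/task_mmu.c, whose full body is not part of this excerpt:

    /* Illustrative only: simplified shape of the query path. */
    static int example_query(struct mm_struct *mm, unsigned long addr)
    {
            struct vm_area_struct *vma;
            int err;

            /* no-op with per-VMA locks, mmap_read_lock_killable() otherwise */
            err = query_vma_setup(mm);
            if (err)
                    return err;

            /* per-VMA read lock (with mmap_lock fallback) or plain find_vma() */
            vma = query_vma_find_by_addr(mm, addr);
            if (IS_ERR(vma)) {
                    err = PTR_ERR(vma);
                    vma = NULL;
                    goto out;
            }
            if (!vma) {
                    err = -ESRCH;
                    goto out;
            }

            /* ... check vma against the query flags, copy out results ... */
    out:
            /* unlocks the per-VMA lock or mmap_lock, whichever is held */
            query_vma_teardown(mm, vma);
            return err;
    }

Between iterated VMAs the patch drops the per-VMA lock explicitly (vma_end_read() after saving vma->vm_end), which is a no-op in the !CONFIG_PER_VMA_LOCK configuration where mmap_lock stays held until teardown.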