diff mbox series

[v10,3/3] mm: add anonymous vma name refcounting

Message ID 20211001205657.815551-3-surenb@google.com (mailing list archive)
State New
Headers show
Series [v10,1/3] mm: rearrange madvise code to allow for reuse | expand

Commit Message

Suren Baghdasaryan Oct. 1, 2021, 8:56 p.m. UTC
While forking a process with high number (64K) of named anonymous vmas the
overhead caused by strdup() is noticeable. Experiments with ARM64 Android
device show up to 40% performance regression when forking a process with
64k unpopulated anonymous vmas using the max name lengths vs the same
process with the same number of anonymous vmas having no name.
Introduce anon_vma_name refcounted structure to avoid the overhead of
copying vma names during fork() and when splitting named anonymous vmas.
When a vma is duplicated, instead of copying the name we increment the
refcount of this structure. Multiple vmas can point to the same
anon_vma_name as long as they increment the refcount. The name member of
anon_vma_name structure is assigned at structure allocation time and is
never changed. If vma name changes then the refcount of the original
structure is dropped, a new anon_vma_name structure is allocated
to hold the new name and the vma pointer is updated to point to the new
structure.
With this approach the fork() performance regressions is reduced 3-4x
times and with usecases using more reasonable number of VMAs (a few
thousand) the regressions is not measurable.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
previous version at:
https://lore.kernel.org/linux-mm/20210902231813.3597709-3-surenb@google.com/

changes in v10
- Changed anon_vma_name_alloc() to use struct_size() and avoid strcpy(),
per Kees Cook

 include/linux/mm_types.h |  9 ++++++++-
 mm/madvise.c             | 42 ++++++++++++++++++++++++++++++++++------
 2 files changed, 44 insertions(+), 7 deletions(-)

Comments

Pavel Machek Oct. 5, 2021, 6:42 p.m. UTC | #1
On Fri 2021-10-01 13:56:57, Suren Baghdasaryan wrote:
> While forking a process with high number (64K) of named anonymous vmas the
> overhead caused by strdup() is noticeable. Experiments with ARM64
Android

I still believe you should simply use numbers and do the
numbers->strings mapping in userspace. We should not need to optimize
strdups in kernel...

Best regards,
								Pavel
Suren Baghdasaryan Oct. 5, 2021, 7:14 p.m. UTC | #2
On Tue, Oct 5, 2021 at 11:42 AM Pavel Machek <pavel@ucw.cz> wrote:
>
> On Fri 2021-10-01 13:56:57, Suren Baghdasaryan wrote:
> > While forking a process with high number (64K) of named anonymous vmas the
> > overhead caused by strdup() is noticeable. Experiments with ARM64
> Android
>
> I still believe you should simply use numbers and do the
> numbers->strings mapping in userspace. We should not need to optimize
> strdups in kernel...

Here are complications with mapping numbers to strings in the userspace:
Approach 1: hardcode number->string in some header file and let all
tools use that mapping. The issue is that whenever that mapping
changes all the tools that are using it (including 3rd party ones)
have to be rebuilt. This is not really maintainable since we don't
control 3rd party tools and even for the ones we control, it will be a
maintenance issue figuring out which version of the tool used which
header file.
Approach 2: have a centralized facility (a process or a DB)
maintaining number->string mapping. This would require an additional
request to this facility whenever we want to make a number->string
conversion. Moreover, when we want to name a VMA, we would have to
register a new VMA name in that facility or check that one already
exists and get its ID. So each prctl() call to name a VMA will be
preceded by such a request (IPC call), maybe with some optimizations
to cache already known number->string pairs. This would be quite
expensive performance-wise. Additional issue with this approach is
that this mapping will have to be persistent to handle a case when the
facility crashes and has to be restored.

As I said before, it complicates userspace quite a bit. Is that a good
enough reason to store the names in the kernel and pay a little more
memory for that? IMHO yes, but I might be wrong.
Thanks,
Suren.

>
> Best regards,
>                                                                 Pavel
> --
> http://www.livejournal.com/~pavelmachek
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
Kees Cook Oct. 5, 2021, 7:21 p.m. UTC | #3
On Tue, Oct 05, 2021 at 12:14:59PM -0700, Suren Baghdasaryan wrote:
> On Tue, Oct 5, 2021 at 11:42 AM Pavel Machek <pavel@ucw.cz> wrote:
> >
> > On Fri 2021-10-01 13:56:57, Suren Baghdasaryan wrote:
> > > While forking a process with high number (64K) of named anonymous vmas the
> > > overhead caused by strdup() is noticeable. Experiments with ARM64
> > Android
> >
> > I still believe you should simply use numbers and do the
> > numbers->strings mapping in userspace. We should not need to optimize
> > strdups in kernel...
> 
> Here are complications with mapping numbers to strings in the userspace:
> Approach 1: hardcode number->string in some header file and let all
> tools use that mapping. The issue is that whenever that mapping
> changes all the tools that are using it (including 3rd party ones)
> have to be rebuilt. This is not really maintainable since we don't
> control 3rd party tools and even for the ones we control, it will be a
> maintenance issue figuring out which version of the tool used which
> header file.
> Approach 2: have a centralized facility (a process or a DB)
> maintaining number->string mapping. This would require an additional
> request to this facility whenever we want to make a number->string
> conversion. Moreover, when we want to name a VMA, we would have to
> register a new VMA name in that facility or check that one already
> exists and get its ID. So each prctl() call to name a VMA will be
> preceded by such a request (IPC call), maybe with some optimizations
> to cache already known number->string pairs. This would be quite
> expensive performance-wise. Additional issue with this approach is
> that this mapping will have to be persistent to handle a case when the
> facility crashes and has to be restored.
> 
> As I said before, it complicates userspace quite a bit. Is that a good
> enough reason to store the names in the kernel and pay a little more
> memory for that? IMHO yes, but I might be wrong.

FWIW, I prefer the strings. It's more human-readable, which is important
for the kinds of cases where the maps are being used for diagnostics,
etc.
Pavel Machek Oct. 5, 2021, 8:04 p.m. UTC | #4
Hi!

> > On Fri 2021-10-01 13:56:57, Suren Baghdasaryan wrote:
> > > While forking a process with high number (64K) of named anonymous vmas the
> > > overhead caused by strdup() is noticeable. Experiments with ARM64
> > Android
> >
> > I still believe you should simply use numbers and do the
> > numbers->strings mapping in userspace. We should not need to optimize
> > strdups in kernel...
> 
> Here are complications with mapping numbers to strings in the userspace:
> Approach 1: hardcode number->string in some header file and let all
> tools use that mapping. The issue is that whenever that mapping
> changes all the tools that are using it (including 3rd party ones)
> have to be rebuilt. This is not really maintainable since we don't
> control 3rd party tools and even for the ones we control, it will be a
> maintenance issue figuring out which version of the tool used which
> header file.

1a) Just put it into a file in /etc... Similar to header file but
easier...

> Approach 2: have a centralized facility (a process or a DB)
> maintaining number->string mapping. This would require an additional
> request to this facility whenever we want to make a number->string
> conversion. Moreover, when we want to name a VMA, we would have to

I see it complicates userspace. But that's better than complicating
kernel, and I don't know what limits on strings you plan, but
considering you'll be outputing the strings in /proc... someone is
going to get confused with parsing.

								Pavel
Suren Baghdasaryan Oct. 5, 2021, 8:43 p.m. UTC | #5
On Tue, Oct 5, 2021 at 1:04 PM Pavel Machek <pavel@ucw.cz> wrote:
>
> Hi!
>
> > > On Fri 2021-10-01 13:56:57, Suren Baghdasaryan wrote:
> > > > While forking a process with high number (64K) of named anonymous vmas the
> > > > overhead caused by strdup() is noticeable. Experiments with ARM64
> > > Android
> > >
> > > I still believe you should simply use numbers and do the
> > > numbers->strings mapping in userspace. We should not need to optimize
> > > strdups in kernel...
> >
> > Here are complications with mapping numbers to strings in the userspace:
> > Approach 1: hardcode number->string in some header file and let all
> > tools use that mapping. The issue is that whenever that mapping
> > changes all the tools that are using it (including 3rd party ones)
> > have to be rebuilt. This is not really maintainable since we don't
> > control 3rd party tools and even for the ones we control, it will be a
> > maintenance issue figuring out which version of the tool used which
> > header file.
>
> 1a) Just put it into a file in /etc... Similar to header file but
> easier...
>
> > Approach 2: have a centralized facility (a process or a DB)
> > maintaining number->string mapping. This would require an additional
> > request to this facility whenever we want to make a number->string
> > conversion. Moreover, when we want to name a VMA, we would have to
>
> I see it complicates userspace. But that's better than complicating
> kernel, and I don't know what limits on strings you plan, but
> considering you'll be outputing the strings in /proc... someone is
> going to get confused with parsing.

I'm not a fan of complicating kernel but the proposed approach seems
simple enough to me. Again this is subjective, so I can't really have
a good argument here. Maybe, as Andrew suggested, I should keep it
under a separate config so that whoever does not care about this
feature pays no price for it?
On the topic of confusing the parsers, if the parser is written so
that it can't ignore new [anon:...] entry then it does not matter
whether we use strings or numbers, it will get confused either way.
Again, if we are concerned about confusing existing parsers I think
having a separate config option set to 'n' would help. This would
prevent some userspace process from naming an anonymous VMA and
causing parser confusion. OTOH on systems where parsers can handle
anon VMA names (Android) we will set that config and use the feature.
Would that address your concerns?


>
>                                                                 Pavel
> --
> http://www.livejournal.com/~pavelmachek
John Hubbard Oct. 6, 2021, 6:57 a.m. UTC | #6
On 10/5/21 13:43, Suren Baghdasaryan wrote:
> On Tue, Oct 5, 2021 at 1:04 PM Pavel Machek <pavel@ucw.cz> wrote:
>>
>> Hi!
>>
>>>> On Fri 2021-10-01 13:56:57, Suren Baghdasaryan wrote:
>>>>> While forking a process with high number (64K) of named anonymous vmas the
>>>>> overhead caused by strdup() is noticeable. Experiments with ARM64
>>>> Android
>>>>
>>>> I still believe you should simply use numbers and do the
>>>> numbers->strings mapping in userspace. We should not need to optimize
>>>> strdups in kernel...
>>>
>>> Here are complications with mapping numbers to strings in the userspace:
>>> Approach 1: hardcode number->string in some header file and let all
>>> tools use that mapping. The issue is that whenever that mapping
>>> changes all the tools that are using it (including 3rd party ones)
>>> have to be rebuilt. This is not really maintainable since we don't
>>> control 3rd party tools and even for the ones we control, it will be a
>>> maintenance issue figuring out which version of the tool used which
>>> header file.
>>
>> 1a) Just put it into a file in /etc... Similar to header file but
>> easier...
>>
>>> Approach 2: have a centralized facility (a process or a DB)
>>> maintaining number->string mapping. This would require an additional
>>> request to this facility whenever we want to make a number->string
>>> conversion. Moreover, when we want to name a VMA, we would have to
>>
>> I see it complicates userspace. But that's better than complicating
>> kernel, and I don't know what limits on strings you plan, but
>> considering you'll be outputing the strings in /proc... someone is
>> going to get confused with parsing.
> 
> I'm not a fan of complicating kernel but the proposed approach seems
> simple enough to me. Again this is subjective, so I can't really have
> a good argument here. Maybe, as Andrew suggested, I should keep it
> under a separate config so that whoever does not care about this
> feature pays no price for it?


For what it's worth, I've been watching this feature proposal evolve,
and a couple of things are starting to become clear. These are of
course judgment calls, though, so even though I'm writing them as
"facts", please read them as merely "one developer's opinion and
preference":

1) Yes, just leave the strings in the kernel, that's simple and
it works, and the alternatives don't really help your case nearly
enough. The kernel changes at a different rate than distros and
user space, and keeping number->string mappings updated and correct
is just basically hopeless.

And you've beaten down the perf problems with kref, so it's fine.

2) At the same time, this feature is Just Not Needed! ...usually.
So the config option seems absolutely appropriate.


Even Pavel here will probably be content with the above mix, I
expect. :)


thanks,
Michal Hocko Oct. 6, 2021, 8:27 a.m. UTC | #7
On Tue 05-10-21 23:57:36, John Hubbard wrote:
[...]
> 1) Yes, just leave the strings in the kernel, that's simple and
> it works, and the alternatives don't really help your case nearly
> enough.

I do not have a strong opinion. Strings are easier to use but they
are more involved and the necessity of kref approach just underlines
that. There are going to be new allocations and that always can lead
to surprising side effects.  These are small (80B at maximum) so the
overall footpring shouldn't all that large by default but it can grow
quite large with a very high max_map_count. There are workloads which
really require the default to be set high (e.g. heavy mremap users). So
if anything all those should be __GFP_ACCOUNT and memcg accounted.

I do agree that numbers are just much more simpler from accounting,
performance and implementation POV.

> The kernel changes at a different rate than distros and
> user space, and keeping number->string mappings updated and correct
> is just basically hopeless.

I am not sure I follow here. This all looks like a userspace policy. No
matter what kind of id you use. It is userspace to set and consume those
ids. Why is it different to use strings from numbers. All parties have
to agree in a naming/numbering convention anyway. Those might differ on
Android or other userspaces. What am I missing?

> And you've beaten down the perf problems with kref, so it's fine.
> 
> 2) At the same time, this feature is Just Not Needed! ...usually.
> So the config option seems absolutely appropriate.

This is not really an answer. Most users are using a distro kernel so
this would need to be enabled in those configs just in case.
David Hildenbrand Oct. 6, 2021, 9:27 a.m. UTC | #8
On 06.10.21 10:27, Michal Hocko wrote:
> On Tue 05-10-21 23:57:36, John Hubbard wrote:
> [...]
>> 1) Yes, just leave the strings in the kernel, that's simple and
>> it works, and the alternatives don't really help your case nearly
>> enough.
> 
> I do not have a strong opinion. Strings are easier to use but they
> are more involved and the necessity of kref approach just underlines
> that. There are going to be new allocations and that always can lead
> to surprising side effects.  These are small (80B at maximum) so the
> overall footpring shouldn't all that large by default but it can grow
> quite large with a very high max_map_count. There are workloads which
> really require the default to be set high (e.g. heavy mremap users). So
> if anything all those should be __GFP_ACCOUNT and memcg accounted.
> 
> I do agree that numbers are just much more simpler from accounting,
> performance and implementation POV.

+1

I can understand that having a string can be quite beneficial e.g., when 
dumping mmaps. If only user space knows the id <-> string mapping, that 
can be quite tricky.

However, I also do wonder if there would be a way to standardize/reserve 
ids, such that a given id always corresponds to a specific user. If we 
use an uint64_t for an id, there would be plenty room to reserve ids ...

I'd really prefer if we can avoid using strings and instead using ids.
Suren Baghdasaryan Oct. 6, 2021, 3:01 p.m. UTC | #9
On Wed, Oct 6, 2021 at 2:27 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 06.10.21 10:27, Michal Hocko wrote:
> > On Tue 05-10-21 23:57:36, John Hubbard wrote:
> > [...]
> >> 1) Yes, just leave the strings in the kernel, that's simple and
> >> it works, and the alternatives don't really help your case nearly
> >> enough.
> >
> > I do not have a strong opinion. Strings are easier to use but they
> > are more involved and the necessity of kref approach just underlines
> > that. There are going to be new allocations and that always can lead
> > to surprising side effects.  These are small (80B at maximum) so the
> > overall footpring shouldn't all that large by default but it can grow
> > quite large with a very high max_map_count. There are workloads which
> > really require the default to be set high (e.g. heavy mremap users). So
> > if anything all those should be __GFP_ACCOUNT and memcg accounted.
> >
> > I do agree that numbers are just much more simpler from accounting,
> > performance and implementation POV.
>
> +1
>
> I can understand that having a string can be quite beneficial e.g., when
> dumping mmaps. If only user space knows the id <-> string mapping, that
> can be quite tricky.
>
> However, I also do wonder if there would be a way to standardize/reserve
> ids, such that a given id always corresponds to a specific user. If we
> use an uint64_t for an id, there would be plenty room to reserve ids ...
>
> I'd really prefer if we can avoid using strings and instead using ids.

I wish it was that simple and for some names like [anon:.bss] or
[anon:dalvik-zygote space] reserving a unique id would work, however
some names like [anon:dalvik-/system/framework/boot-core-icu4j.art]
are generated dynamically at runtime and include package name.
Packages are constantly evolving, new ones are developed, names can
change, etc. So assigning a unique id for these names is not really
feasible.
That leaves us with the central facility option, which as I described
in my previous email would be prohibitive from performance POV (IPC
every time we have a new name or want to convert id to name).

I'm all for simplicity but the simple approach of using ids instead of
names unfortunately would not work for our usecases.

>
> --
> Thanks,
>
> David / dhildenb
>
David Hildenbrand Oct. 6, 2021, 3:07 p.m. UTC | #10
On 06.10.21 17:01, Suren Baghdasaryan wrote:
> On Wed, Oct 6, 2021 at 2:27 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 06.10.21 10:27, Michal Hocko wrote:
>>> On Tue 05-10-21 23:57:36, John Hubbard wrote:
>>> [...]
>>>> 1) Yes, just leave the strings in the kernel, that's simple and
>>>> it works, and the alternatives don't really help your case nearly
>>>> enough.
>>>
>>> I do not have a strong opinion. Strings are easier to use but they
>>> are more involved and the necessity of kref approach just underlines
>>> that. There are going to be new allocations and that always can lead
>>> to surprising side effects.  These are small (80B at maximum) so the
>>> overall footpring shouldn't all that large by default but it can grow
>>> quite large with a very high max_map_count. There are workloads which
>>> really require the default to be set high (e.g. heavy mremap users). So
>>> if anything all those should be __GFP_ACCOUNT and memcg accounted.
>>>
>>> I do agree that numbers are just much more simpler from accounting,
>>> performance and implementation POV.
>>
>> +1
>>
>> I can understand that having a string can be quite beneficial e.g., when
>> dumping mmaps. If only user space knows the id <-> string mapping, that
>> can be quite tricky.
>>
>> However, I also do wonder if there would be a way to standardize/reserve
>> ids, such that a given id always corresponds to a specific user. If we
>> use an uint64_t for an id, there would be plenty room to reserve ids ...
>>
>> I'd really prefer if we can avoid using strings and instead using ids.
> 
> I wish it was that simple and for some names like [anon:.bss] or
> [anon:dalvik-zygote space] reserving a unique id would work, however
> some names like [anon:dalvik-/system/framework/boot-core-icu4j.art]
> are generated dynamically at runtime and include package name.

Valuable information

> Packages are constantly evolving, new ones are developed, names can
> change, etc. So assigning a unique id for these names is not really
> feasible.

So, you'd actually want to generate/reserve an id for a given string at 
runtime, assign that id to the VMA, and have a way to match id <-> 
string somehow?

That reservation service could be inside the kernel or even (better?) in 
user space. The service could for example de-duplicates strings.

My question would be, if we really have to expose these strings to the 
kernel, or if an id is sufficient. Sure, it would move complexity to 
user space, but keeping complexity out of the kernel is usually a good idea.
Suren Baghdasaryan Oct. 6, 2021, 3:20 p.m. UTC | #11
On Wed, Oct 6, 2021 at 8:08 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 06.10.21 17:01, Suren Baghdasaryan wrote:
> > On Wed, Oct 6, 2021 at 2:27 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 06.10.21 10:27, Michal Hocko wrote:
> >>> On Tue 05-10-21 23:57:36, John Hubbard wrote:
> >>> [...]
> >>>> 1) Yes, just leave the strings in the kernel, that's simple and
> >>>> it works, and the alternatives don't really help your case nearly
> >>>> enough.
> >>>
> >>> I do not have a strong opinion. Strings are easier to use but they
> >>> are more involved and the necessity of kref approach just underlines
> >>> that. There are going to be new allocations and that always can lead
> >>> to surprising side effects.  These are small (80B at maximum) so the
> >>> overall footpring shouldn't all that large by default but it can grow
> >>> quite large with a very high max_map_count. There are workloads which
> >>> really require the default to be set high (e.g. heavy mremap users). So
> >>> if anything all those should be __GFP_ACCOUNT and memcg accounted.
> >>>
> >>> I do agree that numbers are just much more simpler from accounting,
> >>> performance and implementation POV.
> >>
> >> +1
> >>
> >> I can understand that having a string can be quite beneficial e.g., when
> >> dumping mmaps. If only user space knows the id <-> string mapping, that
> >> can be quite tricky.
> >>
> >> However, I also do wonder if there would be a way to standardize/reserve
> >> ids, such that a given id always corresponds to a specific user. If we
> >> use an uint64_t for an id, there would be plenty room to reserve ids ...
> >>
> >> I'd really prefer if we can avoid using strings and instead using ids.
> >
> > I wish it was that simple and for some names like [anon:.bss] or
> > [anon:dalvik-zygote space] reserving a unique id would work, however
> > some names like [anon:dalvik-/system/framework/boot-core-icu4j.art]
> > are generated dynamically at runtime and include package name.
>
> Valuable information

Yeah, I should have described it clearer the first time around.

>
> > Packages are constantly evolving, new ones are developed, names can
> > change, etc. So assigning a unique id for these names is not really
> > feasible.
>
> So, you'd actually want to generate/reserve an id for a given string at
> runtime, assign that id to the VMA, and have a way to match id <->
> string somehow?

If we go with ids then yes, that is what we would have to do.

> That reservation service could be inside the kernel or even (better?) in
> user space. The service could for example de-duplicates strings.

Yes but it would require an IPC call to that service potentially on
every mmap() when we want to name a mapped vma. This would be
prohibitive. Even on consumption side, instead of just dumping
/proc/$pid/maps we would have to parse the file and convert all
[anon:id] into [anon:name] with each conversion requiring an IPC call
(assuming no id->name pair caching on the client side).

> My question would be, if we really have to expose these strings to the
> kernel, or if an id is sufficient. Sure, it would move complexity to
> user space, but keeping complexity out of the kernel is usually a good idea.

My worry here is not the additional complexity on the userspace side
but the performance hit we would have to endure due to these
conversions.

> --
> Thanks,
>
> David / dhildenb
>
Pavel Machek Oct. 6, 2021, 5:58 p.m. UTC | #12
Hi!

> > I can understand that having a string can be quite beneficial e.g., when
> > dumping mmaps. If only user space knows the id <-> string mapping, that
> > can be quite tricky.
> >
> > However, I also do wonder if there would be a way to standardize/reserve
> > ids, such that a given id always corresponds to a specific user. If we
> > use an uint64_t for an id, there would be plenty room to reserve ids ...
> >
> > I'd really prefer if we can avoid using strings and instead using ids.
> 
> I wish it was that simple and for some names like [anon:.bss] or
> [anon:dalvik-zygote space] reserving a unique id would work, however
> some names like [anon:dalvik-/system/framework/boot-core-icu4j.art]
> are generated dynamically at runtime and include package name.

I'd be careful; if you allow special characters like that, you will
confuse some kind of parser.

> Packages are constantly evolving, new ones are developed, names can
> change, etc. So assigning a unique id for these names is not really
> feasible.
> That leaves us with the central facility option, which as I described
> in my previous email would be prohibitive from performance POV (IPC
> every time we have a new name or want to convert id to name).

That "central facility" option can be as simple as "mkdir
/somewhere/sanitized_id", using inode numbers for example. You don't
really need IPC.

Plus, I don't really believe the IPC cost would be prohibitive.

Or you could simply hash the string and use the hash as id...

Best regards,
								Pavel
Suren Baghdasaryan Oct. 6, 2021, 6:18 p.m. UTC | #13
On Wed, Oct 6, 2021 at 10:58 AM Pavel Machek <pavel@ucw.cz> wrote:
>
> Hi!
>
> > > I can understand that having a string can be quite beneficial e.g., when
> > > dumping mmaps. If only user space knows the id <-> string mapping, that
> > > can be quite tricky.
> > >
> > > However, I also do wonder if there would be a way to standardize/reserve
> > > ids, such that a given id always corresponds to a specific user. If we
> > > use an uint64_t for an id, there would be plenty room to reserve ids ...
> > >
> > > I'd really prefer if we can avoid using strings and instead using ids.
> >
> > I wish it was that simple and for some names like [anon:.bss] or
> > [anon:dalvik-zygote space] reserving a unique id would work, however
> > some names like [anon:dalvik-/system/framework/boot-core-icu4j.art]
> > are generated dynamically at runtime and include package name.
>
> I'd be careful; if you allow special characters like that, you will
> confuse some kind of parser.

That's why I think having a separate config option with default being
disabled would be useful here. It can be enabled on the systems which
handle [anon:...] correctly without affecting all other systems.

>
> > Packages are constantly evolving, new ones are developed, names can
> > change, etc. So assigning a unique id for these names is not really
> > feasible.
> > That leaves us with the central facility option, which as I described
> > in my previous email would be prohibitive from performance POV (IPC
> > every time we have a new name or want to convert id to name).
>
> That "central facility" option can be as simple as "mkdir
> /somewhere/sanitized_id", using inode numbers for example. You don't
> really need IPC.

Hmm, so the suggestion is to have some directory which contains files
representing IDs, each containing the string name of the associated
vma? Then let's say we are creating a new VMA and want to name it. We
would have to scan that directory, check all files and see if any of
them contain the name we want to reuse the same ID. This is while we
are doing mmap() call, which is often a part of time sensitive
operation (for example app is starting and allocating some memory to
operate). App startup time is one of the main metrics our users care
about and regressing that would not go well with our QA team.

>
> Plus, I don't really believe the IPC cost would be prohibitive.

This option was discussed by the Android performance team and rejected
8 years ago. The problem is that these operations are often happening
in performance-sensitive areas of the process lifecycle.

>
> Or you could simply hash the string and use the hash as id...

I don't see how this would help. You still need to register your
hash->name association somewhere for that hash to be converted back to
the name. Did I misunderstand your suggestion?

>
> Best regards,
>                                                                 Pavel
> --
> http://www.livejournal.com/~pavelmachek
Andrew Morton Oct. 7, 2021, 2:29 a.m. UTC | #14
On Wed, 6 Oct 2021 08:20:20 -0700 Suren Baghdasaryan <surenb@google.com> wrote:

> On Wed, Oct 6, 2021 at 8:08 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 06.10.21 17:01, Suren Baghdasaryan wrote:
> > > On Wed, Oct 6, 2021 at 2:27 AM David Hildenbrand <david@redhat.com> wrote:
> > >>
> > >> On 06.10.21 10:27, Michal Hocko wrote:
> > >>> On Tue 05-10-21 23:57:36, John Hubbard wrote:
> > >>> [...]
> > >>>> 1) Yes, just leave the strings in the kernel, that's simple and
> > >>>> it works, and the alternatives don't really help your case nearly
> > >>>> enough.
> > >>>
> > >>> I do not have a strong opinion. Strings are easier to use but they
> > >>> are more involved and the necessity of kref approach just underlines
> > >>> that. There are going to be new allocations and that always can lead
> > >>> to surprising side effects.  These are small (80B at maximum) so the
> > >>> overall footpring shouldn't all that large by default but it can grow
> > >>> quite large with a very high max_map_count. There are workloads which
> > >>> really require the default to be set high (e.g. heavy mremap users). So
> > >>> if anything all those should be __GFP_ACCOUNT and memcg accounted.
> > >>>
> > >>> I do agree that numbers are just much more simpler from accounting,
> > >>> performance and implementation POV.
> > >>
> > >> +1
> > >>
> > >> I can understand that having a string can be quite beneficial e.g., when
> > >> dumping mmaps. If only user space knows the id <-> string mapping, that
> > >> can be quite tricky.
> > >>
> > >> However, I also do wonder if there would be a way to standardize/reserve
> > >> ids, such that a given id always corresponds to a specific user. If we
> > >> use an uint64_t for an id, there would be plenty room to reserve ids ...
> > >>
> > >> I'd really prefer if we can avoid using strings and instead using ids.
> > >
> > > I wish it was that simple and for some names like [anon:.bss] or
> > > [anon:dalvik-zygote space] reserving a unique id would work, however
> > > some names like [anon:dalvik-/system/framework/boot-core-icu4j.art]
> > > are generated dynamically at runtime and include package name.
> >
> > Valuable information
> 
> Yeah, I should have described it clearer the first time around.

If it gets this fancy then the 80 char limit is likely to become a
significant limitation and the choice should be explained & justified.

Why not 97?  1034?  Why not just strndup_user() and be done with it?

> > My question would be, if we really have to expose these strings to the
> > kernel, or if an id is sufficient. Sure, it would move complexity to
> > user space, but keeping complexity out of the kernel is usually a good idea.
> 
> My worry here is not the additional complexity on the userspace side
> but the performance hit we would have to endure due to these
> conversions.

Has the performance hit been quantified?

I've seen this many times down the ages.  Something which *could* be
done in userspace is instead done in the kernel because coordinating
userspace is Just So Damn Hard.  I guess the central problem is that
userspace isn't centrally coordinated.  I wish we were better at this.
Suren Baghdasaryan Oct. 7, 2021, 2:46 a.m. UTC | #15
On Wed, Oct 6, 2021 at 7:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 6 Oct 2021 08:20:20 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > On Wed, Oct 6, 2021 at 8:08 AM David Hildenbrand <david@redhat.com> wrote:
> > >
> > > On 06.10.21 17:01, Suren Baghdasaryan wrote:
> > > > On Wed, Oct 6, 2021 at 2:27 AM David Hildenbrand <david@redhat.com> wrote:
> > > >>
> > > >> On 06.10.21 10:27, Michal Hocko wrote:
> > > >>> On Tue 05-10-21 23:57:36, John Hubbard wrote:
> > > >>> [...]
> > > >>>> 1) Yes, just leave the strings in the kernel, that's simple and
> > > >>>> it works, and the alternatives don't really help your case nearly
> > > >>>> enough.
> > > >>>
> > > >>> I do not have a strong opinion. Strings are easier to use but they
> > > >>> are more involved and the necessity of kref approach just underlines
> > > >>> that. There are going to be new allocations and that always can lead
> > > >>> to surprising side effects.  These are small (80B at maximum) so the
> > > >>> overall footpring shouldn't all that large by default but it can grow
> > > >>> quite large with a very high max_map_count. There are workloads which
> > > >>> really require the default to be set high (e.g. heavy mremap users). So
> > > >>> if anything all those should be __GFP_ACCOUNT and memcg accounted.
> > > >>>
> > > >>> I do agree that numbers are just much more simpler from accounting,
> > > >>> performance and implementation POV.
> > > >>
> > > >> +1
> > > >>
> > > >> I can understand that having a string can be quite beneficial e.g., when
> > > >> dumping mmaps. If only user space knows the id <-> string mapping, that
> > > >> can be quite tricky.
> > > >>
> > > >> However, I also do wonder if there would be a way to standardize/reserve
> > > >> ids, such that a given id always corresponds to a specific user. If we
> > > >> use an uint64_t for an id, there would be plenty room to reserve ids ...
> > > >>
> > > >> I'd really prefer if we can avoid using strings and instead using ids.
> > > >
> > > > I wish it was that simple and for some names like [anon:.bss] or
> > > > [anon:dalvik-zygote space] reserving a unique id would work, however
> > > > some names like [anon:dalvik-/system/framework/boot-core-icu4j.art]
> > > > are generated dynamically at runtime and include package name.
> > >
> > > Valuable information
> >
> > Yeah, I should have described it clearer the first time around.
>
> If it gets this fancy then the 80 char limit is likely to become a
> significant limitation and the choice should be explained & justified.
>
> Why not 97?  1034?  Why not just strndup_user() and be done with it?

The original patch from 8 years ago used 256 as the limit but Rasmus
argued that the string content should be human-readable, so 80 chars
seems to be a reasonable limit (see:
https://lore.kernel.org/all/d8619a98-2380-ca96-001e-60fe9c6204a6@rasmusvillemoes.dk),
which makes sense to me. We should be able to handle the 80 char limit
by trimming it before calling prctl().

>
> > > My question would be, if we really have to expose these strings to the
> > > kernel, or if an id is sufficient. Sure, it would move complexity to
> > > user space, but keeping complexity out of the kernel is usually a good idea.
> >
> > My worry here is not the additional complexity on the userspace side
> > but the performance hit we would have to endure due to these
> > conversions.
>
> Has the performance hit been quantified?

I'll try to get the data that was collected or at least an estimate. I
imagine collecting such data would require considerable userspace
redesign.

> I've seen this many times down the ages.  Something which *could* be
> done in userspace is instead done in the kernel because coordinating
> userspace is Just So Damn Hard.  I guess the central problem is that
> userspace isn't centrally coordinated.  I wish we were better at this.

It's not just hard, it's also inefficient. And for our usecase
performance is important.

>
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Andrew Morton Oct. 7, 2021, 2:53 a.m. UTC | #16
On Wed, 6 Oct 2021 19:46:57 -0700 Suren Baghdasaryan <surenb@google.com> wrote:

> > > > > I wish it was that simple and for some names like [anon:.bss] or
> > > > > [anon:dalvik-zygote space] reserving a unique id would work, however
> > > > > some names like [anon:dalvik-/system/framework/boot-core-icu4j.art]
> > > > > are generated dynamically at runtime and include package name.
> > > >
> > > > Valuable information
> > >
> > > Yeah, I should have described it clearer the first time around.
> >
> > If it gets this fancy then the 80 char limit is likely to become a
> > significant limitation and the choice should be explained & justified.
> >
> > Why not 97?  1034?  Why not just strndup_user() and be done with it?
> 
> The original patch from 8 years ago used 256 as the limit but Rasmus
> argued that the string content should be human-readable, so 80 chars
> seems to be a reasonable limit (see:
> https://lore.kernel.org/all/d8619a98-2380-ca96-001e-60fe9c6204a6@rasmusvillemoes.dk),
> which makes sense to me. We should be able to handle the 80 char limit
> by trimming it before calling prctl().

What's the downside to making it unlimited?
Suren Baghdasaryan Oct. 7, 2021, 3:01 a.m. UTC | #17
On Wed, Oct 6, 2021 at 7:53 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 6 Oct 2021 19:46:57 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > > > > > I wish it was that simple and for some names like [anon:.bss] or
> > > > > > [anon:dalvik-zygote space] reserving a unique id would work, however
> > > > > > some names like [anon:dalvik-/system/framework/boot-core-icu4j.art]
> > > > > > are generated dynamically at runtime and include package name.
> > > > >
> > > > > Valuable information
> > > >
> > > > Yeah, I should have described it clearer the first time around.
> > >
> > > If it gets this fancy then the 80 char limit is likely to become a
> > > significant limitation and the choice should be explained & justified.
> > >
> > > Why not 97?  1034?  Why not just strndup_user() and be done with it?
> >
> > The original patch from 8 years ago used 256 as the limit but Rasmus
> > argued that the string content should be human-readable, so 80 chars
> > seems to be a reasonable limit (see:
> > https://lore.kernel.org/all/d8619a98-2380-ca96-001e-60fe9c6204a6@rasmusvillemoes.dk),
> > which makes sense to me. We should be able to handle the 80 char limit
> > by trimming it before calling prctl().
>
> What's the downside to making it unlimited?

If we ignore the human-readability argument, I guess the possibility
of abuse and increased memory consumption? I'm guessing parsing such a
string is also easier if there is a known limit?

>
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
David Hildenbrand Oct. 7, 2021, 7:27 a.m. UTC | #18
On 07.10.21 05:01, Suren Baghdasaryan wrote:
> On Wed, Oct 6, 2021 at 7:53 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>>
>> On Wed, 6 Oct 2021 19:46:57 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>>
>>>>>>> I wish it was that simple and for some names like [anon:.bss] or
>>>>>>> [anon:dalvik-zygote space] reserving a unique id would work, however
>>>>>>> some names like [anon:dalvik-/system/framework/boot-core-icu4j.art]
>>>>>>> are generated dynamically at runtime and include package name.
>>>>>>
>>>>>> Valuable information
>>>>>
>>>>> Yeah, I should have described it clearer the first time around.
>>>>
>>>> If it gets this fancy then the 80 char limit is likely to become a
>>>> significant limitation and the choice should be explained & justified.
>>>>
>>>> Why not 97?  1034?  Why not just strndup_user() and be done with it?
>>>
>>> The original patch from 8 years ago used 256 as the limit but Rasmus
>>> argued that the string content should be human-readable, so 80 chars
>>> seems to be a reasonable limit (see:
>>> https://lore.kernel.org/all/d8619a98-2380-ca96-001e-60fe9c6204a6@rasmusvillemoes.dk),
>>> which makes sense to me. We should be able to handle the 80 char limit
>>> by trimming it before calling prctl().
>>
>> What's the downside to making it unlimited?
> 
> If we ignore the human-readability argument, I guess the possibility
> of abuse and increased memory consumption? I'm guessing parsing such a
> string is also easier if there is a known limit?

64k * 80 bytes already makes me nervous enough :)
David Hildenbrand Oct. 7, 2021, 7:33 a.m. UTC | #19
On 06.10.21 17:20, Suren Baghdasaryan wrote:
> On Wed, Oct 6, 2021 at 8:08 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 06.10.21 17:01, Suren Baghdasaryan wrote:
>>> On Wed, Oct 6, 2021 at 2:27 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 06.10.21 10:27, Michal Hocko wrote:
>>>>> On Tue 05-10-21 23:57:36, John Hubbard wrote:
>>>>> [...]
>>>>>> 1) Yes, just leave the strings in the kernel, that's simple and
>>>>>> it works, and the alternatives don't really help your case nearly
>>>>>> enough.
>>>>>
>>>>> I do not have a strong opinion. Strings are easier to use but they
>>>>> are more involved and the necessity of kref approach just underlines
>>>>> that. There are going to be new allocations and that always can lead
>>>>> to surprising side effects.  These are small (80B at maximum) so the
>>>>> overall footpring shouldn't all that large by default but it can grow
>>>>> quite large with a very high max_map_count. There are workloads which
>>>>> really require the default to be set high (e.g. heavy mremap users). So
>>>>> if anything all those should be __GFP_ACCOUNT and memcg accounted.
>>>>>
>>>>> I do agree that numbers are just much more simpler from accounting,
>>>>> performance and implementation POV.
>>>>
>>>> +1
>>>>
>>>> I can understand that having a string can be quite beneficial e.g., when
>>>> dumping mmaps. If only user space knows the id <-> string mapping, that
>>>> can be quite tricky.
>>>>
>>>> However, I also do wonder if there would be a way to standardize/reserve
>>>> ids, such that a given id always corresponds to a specific user. If we
>>>> use an uint64_t for an id, there would be plenty room to reserve ids ...
>>>>
>>>> I'd really prefer if we can avoid using strings and instead using ids.
>>>
>>> I wish it was that simple and for some names like [anon:.bss] or
>>> [anon:dalvik-zygote space] reserving a unique id would work, however
>>> some names like [anon:dalvik-/system/framework/boot-core-icu4j.art]
>>> are generated dynamically at runtime and include package name.
>>
>> Valuable information
> 
> Yeah, I should have described it clearer the first time around.
> 
>>
>>> Packages are constantly evolving, new ones are developed, names can
>>> change, etc. So assigning a unique id for these names is not really
>>> feasible.
>>
>> So, you'd actually want to generate/reserve an id for a given string at
>> runtime, assign that id to the VMA, and have a way to match id <->
>> string somehow?
> 
> If we go with ids then yes, that is what we would have to do.
> 
>> That reservation service could be inside the kernel or even (better?) in
>> user space. The service could for example de-duplicates strings.
> 
> Yes but it would require an IPC call to that service potentially on
> every mmap() when we want to name a mapped vma. This would be
> prohibitive. Even on consumption side, instead of just dumping
> /proc/$pid/maps we would have to parse the file and convert all
> [anon:id] into [anon:name] with each conversion requiring an IPC call
> (assuming no id->name pair caching on the client side).

mmap() and prctl() already do take the mmap sem in write, so they are 
not the "most lightweight" operations so to say.

We already to have two separate operations, first the mmap(), then the 
prctl(). IMHO you could defer the "naming" part to a later point in 
time, without creating too many issues, moving it out of the 
"hot/performance critical phase"

Reading https://lwn.net/Articles/867818/, to me it feels like the use 
case could live with a little larger delay between the mmap popping up 
and a name getting assigned.
Michal Hocko Oct. 7, 2021, 7:59 a.m. UTC | #20
On Wed 06-10-21 08:01:56, Suren Baghdasaryan wrote:
> On Wed, Oct 6, 2021 at 2:27 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 06.10.21 10:27, Michal Hocko wrote:
> > > On Tue 05-10-21 23:57:36, John Hubbard wrote:
> > > [...]
> > >> 1) Yes, just leave the strings in the kernel, that's simple and
> > >> it works, and the alternatives don't really help your case nearly
> > >> enough.
> > >
> > > I do not have a strong opinion. Strings are easier to use but they
> > > are more involved and the necessity of kref approach just underlines
> > > that. There are going to be new allocations and that always can lead
> > > to surprising side effects.  These are small (80B at maximum) so the
> > > overall footpring shouldn't all that large by default but it can grow
> > > quite large with a very high max_map_count. There are workloads which
> > > really require the default to be set high (e.g. heavy mremap users). So
> > > if anything all those should be __GFP_ACCOUNT and memcg accounted.
> > >
> > > I do agree that numbers are just much more simpler from accounting,
> > > performance and implementation POV.
> >
> > +1
> >
> > I can understand that having a string can be quite beneficial e.g., when
> > dumping mmaps. If only user space knows the id <-> string mapping, that
> > can be quite tricky.
> >
> > However, I also do wonder if there would be a way to standardize/reserve
> > ids, such that a given id always corresponds to a specific user. If we
> > use an uint64_t for an id, there would be plenty room to reserve ids ...
> >
> > I'd really prefer if we can avoid using strings and instead using ids.
> 
> I wish it was that simple and for some names like [anon:.bss] or
> [anon:dalvik-zygote space] reserving a unique id would work, however
> some names like [anon:dalvik-/system/framework/boot-core-icu4j.art]
> are generated dynamically at runtime and include package name.
> Packages are constantly evolving, new ones are developed, names can
> change, etc. So assigning a unique id for these names is not really
> feasible.

I still do not follow. If you need a globaly consistent naming then
you need clear rules for that, no matter whether that is number or a
file. How do you handle this with strings currently?
Michal Hocko Oct. 7, 2021, 8:10 a.m. UTC | #21
On Wed 06-10-21 11:18:31, Suren Baghdasaryan wrote:
> On Wed, Oct 6, 2021 at 10:58 AM Pavel Machek <pavel@ucw.cz> wrote:
[...]
> > That "central facility" option can be as simple as "mkdir
> > /somewhere/sanitized_id", using inode numbers for example. You don't
> > really need IPC.
> 
> Hmm, so the suggestion is to have some directory which contains files
> representing IDs, each containing the string name of the associated
> vma? Then let's say we are creating a new VMA and want to name it. We
> would have to scan that directory, check all files and see if any of
> them contain the name we want to reuse the same ID.

I believe Pavel meant something as simple as
$ YOUR_FILE=$YOUR_IDS_DIR/my_string_name
$ touch $YOUR_FILE
$ stat -c %i $YOUR_FILE

YOUR_IDS_DIR can live on a tmpfs and you can even implement a policy on
top of that (who can generate new ids, gurantee uniqness etc...).

The above is certainly not for free of course but if you really need a
system wide consistency when using names then you need some sort of
central authority. How you implement that is not all that important
but I do not think we want to handle that in the kernel.
Pavel Machek Oct. 7, 2021, 8:41 a.m. UTC | #22
Hi!

> [...]
> > > That "central facility" option can be as simple as "mkdir
> > > /somewhere/sanitized_id", using inode numbers for example. You don't
> > > really need IPC.
> > 
> > Hmm, so the suggestion is to have some directory which contains files
> > representing IDs, each containing the string name of the associated
> > vma? Then let's say we are creating a new VMA and want to name it. We
> > would have to scan that directory, check all files and see if any of
> > them contain the name we want to reuse the same ID.
> 
> I believe Pavel meant something as simple as
> $ YOUR_FILE=$YOUR_IDS_DIR/my_string_name
> $ touch $YOUR_FILE
> $ stat -c %i $YOUR_FILE
> 
> YOUR_IDS_DIR can live on a tmpfs and you can even implement a policy on
> top of that (who can generate new ids, gurantee uniqness etc...).
> 
> The above is certainly not for free of course but if you really need a
> system wide consistency when using names then you need some sort of
> central authority. How you implement that is not all that important
> but I do not think we want to handle that in the kernel.

For the record, yes, that is what I meant.

Thanks,
								Pavel
Rasmus Villemoes Oct. 7, 2021, 8:47 a.m. UTC | #23
On 07/10/2021 10.10, Michal Hocko wrote:
> On Wed 06-10-21 11:18:31, Suren Baghdasaryan wrote:
>> On Wed, Oct 6, 2021 at 10:58 AM Pavel Machek <pavel@ucw.cz> wrote:
> [...]
>>> That "central facility" option can be as simple as "mkdir
>>> /somewhere/sanitized_id", using inode numbers for example. You don't
>>> really need IPC.
>>
>> Hmm, so the suggestion is to have some directory which contains files
>> representing IDs, each containing the string name of the associated
>> vma? Then let's say we are creating a new VMA and want to name it. We
>> would have to scan that directory, check all files and see if any of
>> them contain the name we want to reuse the same ID.
> 
> I believe Pavel meant something as simple as
> $ YOUR_FILE=$YOUR_IDS_DIR/my_string_name
> $ touch $YOUR_FILE
> $ stat -c %i $YOUR_FILE

So in terms of syscall overhead, that would be open(..., O_CREAT |
O_CLOEXEC), fstat(), close() - or one could optimistically start by
doing a single lstat() if it is normal that the name is already created
(which I assume).

As for the consumer, one can't directly map an inode number to a dentry,
but whoever first creates the name->id mapping could also be responsible
for doing a "sprintf(buf, "/id/to/name/%016lx", id); symlink(name,
buf)". And if one did the optimistic lstat() succesfully, one would know
that someone else created the file and thus the symlink. And since the
operations are idempotent, the obvious races are irrelevant.

Then the consumer would only need to do a readlink() to get the name.
But that would only be for presentation to a human. Internally all the
aggregation based on the type of anon mem the tool might as well do in
terms of the integer id.

> YOUR_IDS_DIR can live on a tmpfs and you can even implement a policy on
> top of that (who can generate new ids, gurantee uniqness etc...).
> 
> The above is certainly not for free of course but if you really need a
> system wide consistency when using names then you need some sort of
> central authority. How you implement that is not all that important
> but I do not think we want to handle that in the kernel.

IDK. If the whole thing could be put behind a CONFIG_ knob, with _zero_
overhead when not enabled (and I'm a bit worried about all the functions
that grow an extra argument that gets passed around), I don't mind the
string interface. But I don't really have a say either way.

Rasmus
Pavel Machek Oct. 7, 2021, 10:15 a.m. UTC | #24
Hi!

> >> Hmm, so the suggestion is to have some directory which contains files
> >> representing IDs, each containing the string name of the associated
> >> vma? Then let's say we are creating a new VMA and want to name it. We
> >> would have to scan that directory, check all files and see if any of
> >> them contain the name we want to reuse the same ID.
> > 
> > I believe Pavel meant something as simple as
> > $ YOUR_FILE=$YOUR_IDS_DIR/my_string_name
> > $ touch $YOUR_FILE
> > $ stat -c %i $YOUR_FILE
> 
> So in terms of syscall overhead, that would be open(..., O_CREAT |
> O_CLOEXEC), fstat(), close() - or one could optimistically start by

You could get to two if you used mkdir instead of open.

> > YOUR_IDS_DIR can live on a tmpfs and you can even implement a policy on
> > top of that (who can generate new ids, gurantee uniqness etc...).
> > 
> > The above is certainly not for free of course but if you really need a
> > system wide consistency when using names then you need some sort of
> > central authority. How you implement that is not all that important
> > but I do not think we want to handle that in the kernel.
> 
> IDK. If the whole thing could be put behind a CONFIG_ knob, with _zero_
> overhead when not enabled (and I'm a bit worried about all the functions
> that grow an extra argument that gets passed around), I don't mind the
> string interface. But I don't really have a say either way.

If this is ever useful outside of Android, eventually distros will
have it enabled.

Best regards,
								Pavel
Suren Baghdasaryan Oct. 7, 2021, 3:42 p.m. UTC | #25
On Thu, Oct 7, 2021 at 12:33 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 06.10.21 17:20, Suren Baghdasaryan wrote:
> > On Wed, Oct 6, 2021 at 8:08 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 06.10.21 17:01, Suren Baghdasaryan wrote:
> >>> On Wed, Oct 6, 2021 at 2:27 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 06.10.21 10:27, Michal Hocko wrote:
> >>>>> On Tue 05-10-21 23:57:36, John Hubbard wrote:
> >>>>> [...]
> >>>>>> 1) Yes, just leave the strings in the kernel, that's simple and
> >>>>>> it works, and the alternatives don't really help your case nearly
> >>>>>> enough.
> >>>>>
> >>>>> I do not have a strong opinion. Strings are easier to use but they
> >>>>> are more involved and the necessity of kref approach just underlines
> >>>>> that. There are going to be new allocations and that always can lead
> >>>>> to surprising side effects.  These are small (80B at maximum) so the
> >>>>> overall footpring shouldn't all that large by default but it can grow
> >>>>> quite large with a very high max_map_count. There are workloads which
> >>>>> really require the default to be set high (e.g. heavy mremap users). So
> >>>>> if anything all those should be __GFP_ACCOUNT and memcg accounted.
> >>>>>
> >>>>> I do agree that numbers are just much more simpler from accounting,
> >>>>> performance and implementation POV.
> >>>>
> >>>> +1
> >>>>
> >>>> I can understand that having a string can be quite beneficial e.g., when
> >>>> dumping mmaps. If only user space knows the id <-> string mapping, that
> >>>> can be quite tricky.
> >>>>
> >>>> However, I also do wonder if there would be a way to standardize/reserve
> >>>> ids, such that a given id always corresponds to a specific user. If we
> >>>> use an uint64_t for an id, there would be plenty room to reserve ids ...
> >>>>
> >>>> I'd really prefer if we can avoid using strings and instead using ids.
> >>>
> >>> I wish it was that simple and for some names like [anon:.bss] or
> >>> [anon:dalvik-zygote space] reserving a unique id would work, however
> >>> some names like [anon:dalvik-/system/framework/boot-core-icu4j.art]
> >>> are generated dynamically at runtime and include package name.
> >>
> >> Valuable information
> >
> > Yeah, I should have described it clearer the first time around.
> >
> >>
> >>> Packages are constantly evolving, new ones are developed, names can
> >>> change, etc. So assigning a unique id for these names is not really
> >>> feasible.
> >>
> >> So, you'd actually want to generate/reserve an id for a given string at
> >> runtime, assign that id to the VMA, and have a way to match id <->
> >> string somehow?
> >
> > If we go with ids then yes, that is what we would have to do.
> >
> >> That reservation service could be inside the kernel or even (better?) in
> >> user space. The service could for example de-duplicates strings.
> >
> > Yes but it would require an IPC call to that service potentially on
> > every mmap() when we want to name a mapped vma. This would be
> > prohibitive. Even on consumption side, instead of just dumping
> > /proc/$pid/maps we would have to parse the file and convert all
> > [anon:id] into [anon:name] with each conversion requiring an IPC call
> > (assuming no id->name pair caching on the client side).
>
> mmap() and prctl() already do take the mmap sem in write, so they are
> not the "most lightweight" operations so to say.
>
> We already to have two separate operations, first the mmap(), then the
> prctl(). IMHO you could defer the "naming" part to a later point in
> time, without creating too many issues, moving it out of the
> "hot/performance critical phase"
>
> Reading https://lwn.net/Articles/867818/, to me it feels like the use
> case could live with a little larger delay between the mmap popping up
> and a name getting assigned.

That might be doable if occasional inconsistency can be tolerated (we
can't guarantee that maps won't be read before the deferred work name
the vma). However I would prefer an efficient solution vs the one
which is inefficient but can be deferred.

>
> --
> Thanks,
>
> David / dhildenb
>
Suren Baghdasaryan Oct. 7, 2021, 3:45 p.m. UTC | #26
On Thu, Oct 7, 2021 at 12:59 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 06-10-21 08:01:56, Suren Baghdasaryan wrote:
> > On Wed, Oct 6, 2021 at 2:27 AM David Hildenbrand <david@redhat.com> wrote:
> > >
> > > On 06.10.21 10:27, Michal Hocko wrote:
> > > > On Tue 05-10-21 23:57:36, John Hubbard wrote:
> > > > [...]
> > > >> 1) Yes, just leave the strings in the kernel, that's simple and
> > > >> it works, and the alternatives don't really help your case nearly
> > > >> enough.
> > > >
> > > > I do not have a strong opinion. Strings are easier to use but they
> > > > are more involved and the necessity of kref approach just underlines
> > > > that. There are going to be new allocations and that always can lead
> > > > to surprising side effects.  These are small (80B at maximum) so the
> > > > overall footpring shouldn't all that large by default but it can grow
> > > > quite large with a very high max_map_count. There are workloads which
> > > > really require the default to be set high (e.g. heavy mremap users). So
> > > > if anything all those should be __GFP_ACCOUNT and memcg accounted.
> > > >
> > > > I do agree that numbers are just much more simpler from accounting,
> > > > performance and implementation POV.
> > >
> > > +1
> > >
> > > I can understand that having a string can be quite beneficial e.g., when
> > > dumping mmaps. If only user space knows the id <-> string mapping, that
> > > can be quite tricky.
> > >
> > > However, I also do wonder if there would be a way to standardize/reserve
> > > ids, such that a given id always corresponds to a specific user. If we
> > > use an uint64_t for an id, there would be plenty room to reserve ids ...
> > >
> > > I'd really prefer if we can avoid using strings and instead using ids.
> >
> > I wish it was that simple and for some names like [anon:.bss] or
> > [anon:dalvik-zygote space] reserving a unique id would work, however
> > some names like [anon:dalvik-/system/framework/boot-core-icu4j.art]
> > are generated dynamically at runtime and include package name.
> > Packages are constantly evolving, new ones are developed, names can
> > change, etc. So assigning a unique id for these names is not really
> > feasible.
>
> I still do not follow. If you need a globaly consistent naming then
> you need clear rules for that, no matter whether that is number or a
> file. How do you handle this with strings currently?

Some names represent standard categories, some are unique. A simple
tool could calculate and report the total for each name, a more
advanced tool might recognize some standard names and process them
differently. From kernel's POV, it's just a name used by the userspace
to categorize anonymous memory areas.

>
> --
> Michal Hocko
> SUSE Labs
Suren Baghdasaryan Oct. 7, 2021, 4:04 p.m. UTC | #27
On Thu, Oct 7, 2021 at 3:15 AM Pavel Machek <pavel@ucw.cz> wrote:
>
> Hi!
>
> > >> Hmm, so the suggestion is to have some directory which contains files
> > >> representing IDs, each containing the string name of the associated
> > >> vma? Then let's say we are creating a new VMA and want to name it. We
> > >> would have to scan that directory, check all files and see if any of
> > >> them contain the name we want to reuse the same ID.
> > >
> > > I believe Pavel meant something as simple as
> > > $ YOUR_FILE=$YOUR_IDS_DIR/my_string_name
> > > $ touch $YOUR_FILE
> > > $ stat -c %i $YOUR_FILE

Ah, ok, now I understand the proposal. Thanks for the clarification!
So, this would use filesystem as a directory for inode->name mappings.
One rough edge for me is that the consumer would still need to parse
/proc/$pid/maps and convert [anon:inode] into [anon:name] instead of
just dumping the content for the user. Would it be acceptable if we
require the ID provided by prctl() to always be a valid inode and
show_map_vma() would do the inode-to-filename conversion when
generating maps/smaps files? I know that inode->dentry is not
one-to-one mapping but we can simply output the first dentry name.
WDYT?

> >
> > So in terms of syscall overhead, that would be open(..., O_CREAT |
> > O_CLOEXEC), fstat(), close() - or one could optimistically start by
>
> You could get to two if you used mkdir instead of open.
>
> > > YOUR_IDS_DIR can live on a tmpfs and you can even implement a policy on
> > > top of that (who can generate new ids, gurantee uniqness etc...).
> > >
> > > The above is certainly not for free of course but if you really need a
> > > system wide consistency when using names then you need some sort of
> > > central authority. How you implement that is not all that important
> > > but I do not think we want to handle that in the kernel.

Ideally it would be great if $YOUR_IDS_DIR/my_string_name entries
could be generated by the kernel in response to userspace calling
prctl(..., name) but I haven't looked into complexity of doing that,
so I would not propose that at this point.
Thanks for sharing the ideas!
Suren.

> >
> > IDK. If the whole thing could be put behind a CONFIG_ knob, with _zero_
> > overhead when not enabled (and I'm a bit worried about all the functions
> > that grow an extra argument that gets passed around), I don't mind the
> > string interface. But I don't really have a say either way.
>
> If this is ever useful outside of Android, eventually distros will
> have it enabled.
>
> Best regards,
>                                                                 Pavel
> --
> http://www.livejournal.com/~pavelmachek
Michal Hocko Oct. 7, 2021, 4:37 p.m. UTC | #28
On Thu 07-10-21 08:45:21, Suren Baghdasaryan wrote:
> On Thu, Oct 7, 2021 at 12:59 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 06-10-21 08:01:56, Suren Baghdasaryan wrote:
> > > On Wed, Oct 6, 2021 at 2:27 AM David Hildenbrand <david@redhat.com> wrote:
> > > >
> > > > On 06.10.21 10:27, Michal Hocko wrote:
> > > > > On Tue 05-10-21 23:57:36, John Hubbard wrote:
> > > > > [...]
> > > > >> 1) Yes, just leave the strings in the kernel, that's simple and
> > > > >> it works, and the alternatives don't really help your case nearly
> > > > >> enough.
> > > > >
> > > > > I do not have a strong opinion. Strings are easier to use but they
> > > > > are more involved and the necessity of kref approach just underlines
> > > > > that. There are going to be new allocations and that always can lead
> > > > > to surprising side effects.  These are small (80B at maximum) so the
> > > > > overall footpring shouldn't all that large by default but it can grow
> > > > > quite large with a very high max_map_count. There are workloads which
> > > > > really require the default to be set high (e.g. heavy mremap users). So
> > > > > if anything all those should be __GFP_ACCOUNT and memcg accounted.
> > > > >
> > > > > I do agree that numbers are just much more simpler from accounting,
> > > > > performance and implementation POV.
> > > >
> > > > +1
> > > >
> > > > I can understand that having a string can be quite beneficial e.g., when
> > > > dumping mmaps. If only user space knows the id <-> string mapping, that
> > > > can be quite tricky.
> > > >
> > > > However, I also do wonder if there would be a way to standardize/reserve
> > > > ids, such that a given id always corresponds to a specific user. If we
> > > > use an uint64_t for an id, there would be plenty room to reserve ids ...
> > > >
> > > > I'd really prefer if we can avoid using strings and instead using ids.
> > >
> > > I wish it was that simple and for some names like [anon:.bss] or
> > > [anon:dalvik-zygote space] reserving a unique id would work, however
> > > some names like [anon:dalvik-/system/framework/boot-core-icu4j.art]
> > > are generated dynamically at runtime and include package name.
> > > Packages are constantly evolving, new ones are developed, names can
> > > change, etc. So assigning a unique id for these names is not really
> > > feasible.
> >
> > I still do not follow. If you need a globaly consistent naming then
> > you need clear rules for that, no matter whether that is number or a
> > file. How do you handle this with strings currently?
> 
> Some names represent standard categories, some are unique. A simple
> tool could calculate and report the total for each name, a more
> advanced tool might recognize some standard names and process them
> differently. From kernel's POV, it's just a name used by the userspace
> to categorize anonymous memory areas.

OK, so there is no real authority or any real naming convention. You
just hope that applications will behave so that the consumer of those
names can make proper calls. Correct?

In that case the same applies to numbers and I do not see any strong
argument for strings other than it is more pleasing to a human eye when
reading the file. And that doesn't sound like a strong argument to make
the kernel more complicated. Functionally both approaches are equal from
a practical POV.
Michal Hocko Oct. 7, 2021, 4:40 p.m. UTC | #29
On Thu 07-10-21 09:04:09, Suren Baghdasaryan wrote:
> On Thu, Oct 7, 2021 at 3:15 AM Pavel Machek <pavel@ucw.cz> wrote:
> >
> > Hi!
> >
> > > >> Hmm, so the suggestion is to have some directory which contains files
> > > >> representing IDs, each containing the string name of the associated
> > > >> vma? Then let's say we are creating a new VMA and want to name it. We
> > > >> would have to scan that directory, check all files and see if any of
> > > >> them contain the name we want to reuse the same ID.
> > > >
> > > > I believe Pavel meant something as simple as
> > > > $ YOUR_FILE=$YOUR_IDS_DIR/my_string_name
> > > > $ touch $YOUR_FILE
> > > > $ stat -c %i $YOUR_FILE
> 
> Ah, ok, now I understand the proposal. Thanks for the clarification!
> So, this would use filesystem as a directory for inode->name mappings.
> One rough edge for me is that the consumer would still need to parse
> /proc/$pid/maps and convert [anon:inode] into [anon:name] instead of
> just dumping the content for the user. Would it be acceptable if we
> require the ID provided by prctl() to always be a valid inode and
> show_map_vma() would do the inode-to-filename conversion when
> generating maps/smaps files? I know that inode->dentry is not
> one-to-one mapping but we can simply output the first dentry name.
> WDYT?

No. You do not want to dictate any particular way of the mapping. The
above is just one way to do that without developing any actual mapping
yourself. You just use a filesystem for that. Kernel doesn't and
shouldn't understand the meaning of those numbers. It has no business in
that.

In a way this would be pushing policy into the kernel.
Suren Baghdasaryan Oct. 7, 2021, 4:43 p.m. UTC | #30
On Thu, Oct 7, 2021 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 07-10-21 08:45:21, Suren Baghdasaryan wrote:
> > On Thu, Oct 7, 2021 at 12:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 06-10-21 08:01:56, Suren Baghdasaryan wrote:
> > > > On Wed, Oct 6, 2021 at 2:27 AM David Hildenbrand <david@redhat.com> wrote:
> > > > >
> > > > > On 06.10.21 10:27, Michal Hocko wrote:
> > > > > > On Tue 05-10-21 23:57:36, John Hubbard wrote:
> > > > > > [...]
> > > > > >> 1) Yes, just leave the strings in the kernel, that's simple and
> > > > > >> it works, and the alternatives don't really help your case nearly
> > > > > >> enough.
> > > > > >
> > > > > > I do not have a strong opinion. Strings are easier to use but they
> > > > > > are more involved and the necessity of kref approach just underlines
> > > > > > that. There are going to be new allocations and that always can lead
> > > > > > to surprising side effects.  These are small (80B at maximum) so the
> > > > > > overall footpring shouldn't all that large by default but it can grow
> > > > > > quite large with a very high max_map_count. There are workloads which
> > > > > > really require the default to be set high (e.g. heavy mremap users). So
> > > > > > if anything all those should be __GFP_ACCOUNT and memcg accounted.
> > > > > >
> > > > > > I do agree that numbers are just much more simpler from accounting,
> > > > > > performance and implementation POV.
> > > > >
> > > > > +1
> > > > >
> > > > > I can understand that having a string can be quite beneficial e.g., when
> > > > > dumping mmaps. If only user space knows the id <-> string mapping, that
> > > > > can be quite tricky.
> > > > >
> > > > > However, I also do wonder if there would be a way to standardize/reserve
> > > > > ids, such that a given id always corresponds to a specific user. If we
> > > > > use an uint64_t for an id, there would be plenty room to reserve ids ...
> > > > >
> > > > > I'd really prefer if we can avoid using strings and instead using ids.
> > > >
> > > > I wish it was that simple and for some names like [anon:.bss] or
> > > > [anon:dalvik-zygote space] reserving a unique id would work, however
> > > > some names like [anon:dalvik-/system/framework/boot-core-icu4j.art]
> > > > are generated dynamically at runtime and include package name.
> > > > Packages are constantly evolving, new ones are developed, names can
> > > > change, etc. So assigning a unique id for these names is not really
> > > > feasible.
> > >
> > > I still do not follow. If you need a globaly consistent naming then
> > > you need clear rules for that, no matter whether that is number or a
> > > file. How do you handle this with strings currently?
> >
> > Some names represent standard categories, some are unique. A simple
> > tool could calculate and report the total for each name, a more
> > advanced tool might recognize some standard names and process them
> > differently. From kernel's POV, it's just a name used by the userspace
> > to categorize anonymous memory areas.
>
> OK, so there is no real authority or any real naming convention. You
> just hope that applications will behave so that the consumer of those
> names can make proper calls. Correct?
>
> In that case the same applies to numbers and I do not see any strong
> argument for strings other than it is more pleasing to a human eye when
> reading the file. And that doesn't sound like a strong argument to make
> the kernel more complicated. Functionally both approaches are equal from
> a practical POV.

I don't think that's correct. Names like [anon:.bss],
[anon:dalvik-zygote space] and
[anon:dalvik-/system/framework/boot-core-icu4j.art] provide user with
actionable information about the use of that memory or the allocator
using it. Names like [anon:1], [anon:2] and [anon:3] do not convey any
valuable information for the user until they are converted into
descriptive names.

> --
> Michal Hocko
> SUSE Labs
Suren Baghdasaryan Oct. 7, 2021, 4:58 p.m. UTC | #31
On Thu, Oct 7, 2021 at 9:40 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 07-10-21 09:04:09, Suren Baghdasaryan wrote:
> > On Thu, Oct 7, 2021 at 3:15 AM Pavel Machek <pavel@ucw.cz> wrote:
> > >
> > > Hi!
> > >
> > > > >> Hmm, so the suggestion is to have some directory which contains files
> > > > >> representing IDs, each containing the string name of the associated
> > > > >> vma? Then let's say we are creating a new VMA and want to name it. We
> > > > >> would have to scan that directory, check all files and see if any of
> > > > >> them contain the name we want to reuse the same ID.
> > > > >
> > > > > I believe Pavel meant something as simple as
> > > > > $ YOUR_FILE=$YOUR_IDS_DIR/my_string_name
> > > > > $ touch $YOUR_FILE
> > > > > $ stat -c %i $YOUR_FILE
> >
> > Ah, ok, now I understand the proposal. Thanks for the clarification!
> > So, this would use filesystem as a directory for inode->name mappings.
> > One rough edge for me is that the consumer would still need to parse
> > /proc/$pid/maps and convert [anon:inode] into [anon:name] instead of
> > just dumping the content for the user. Would it be acceptable if we
> > require the ID provided by prctl() to always be a valid inode and
> > show_map_vma() would do the inode-to-filename conversion when
> > generating maps/smaps files? I know that inode->dentry is not
> > one-to-one mapping but we can simply output the first dentry name.
> > WDYT?
>
> No. You do not want to dictate any particular way of the mapping. The
> above is just one way to do that without developing any actual mapping
> yourself. You just use a filesystem for that. Kernel doesn't and
> shouldn't understand the meaning of those numbers. It has no business in
> that.
>
> In a way this would be pushing policy into the kernel.

I can see your point. Any other ideas on how to prevent tools from
doing this id-to-name conversion themselves?
Aside from the obvious inefficiency of requiring tools to parse
maps/smaps and convert ids to names it also creates a tool->system
dependency. Tools would have to know how the system interprets these
IDs and on the systems where the ID is an inode they would have to
know where the soflinks proposed by Rasmus reside and convert them. If
I'm a tool developer who wants the tool to work on multiple systems
this becomes quite messy.

>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Oct. 7, 2021, 5:25 p.m. UTC | #32
On Thu 07-10-21 09:43:14, Suren Baghdasaryan wrote:
> On Thu, Oct 7, 2021 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > OK, so there is no real authority or any real naming convention. You
> > just hope that applications will behave so that the consumer of those
> > names can make proper calls. Correct?
> >
> > In that case the same applies to numbers and I do not see any strong
> > argument for strings other than it is more pleasing to a human eye when
> > reading the file. And that doesn't sound like a strong argument to make
> > the kernel more complicated. Functionally both approaches are equal from
> > a practical POV.
> 
> I don't think that's correct. Names like [anon:.bss],
> [anon:dalvik-zygote space] and
> [anon:dalvik-/system/framework/boot-core-icu4j.art] provide user with
> actionable information about the use of that memory or the allocator
> using it.

No, none of the above is really actionable without a common
understanding. Both dalvik* are a complete gibberish to me.
Suren Baghdasaryan Oct. 7, 2021, 5:30 p.m. UTC | #33
On Thu, Oct 7, 2021 at 10:25 AM 'Michal Hocko' via kernel-team
<kernel-team@android.com> wrote:
>
> On Thu 07-10-21 09:43:14, Suren Baghdasaryan wrote:
> > On Thu, Oct 7, 2021 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > OK, so there is no real authority or any real naming convention. You
> > > just hope that applications will behave so that the consumer of those
> > > names can make proper calls. Correct?
> > >
> > > In that case the same applies to numbers and I do not see any strong
> > > argument for strings other than it is more pleasing to a human eye when
> > > reading the file. And that doesn't sound like a strong argument to make
> > > the kernel more complicated. Functionally both approaches are equal from
> > > a practical POV.
> >
> > I don't think that's correct. Names like [anon:.bss],
> > [anon:dalvik-zygote space] and
> > [anon:dalvik-/system/framework/boot-core-icu4j.art] provide user with
> > actionable information about the use of that memory or the allocator
> > using it.
>
> No, none of the above is really actionable without a common
> understanding. Both dalvik* are a complete gibberish to me.

Ok, maybe I was unclear. Some names, as the first two in the above
example are quite standard for Android and tools do use them to
identify specific specialized areas. Some names are not standardized
and can contain package names, like
anon:dalvik-/system/framework/boot-core-icu4j.art. In this case while
tools do not process them in any special way, they still convey enough
information for a user to identify the corresponding component.

> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Michal Hocko Oct. 7, 2021, 5:31 p.m. UTC | #34
On Thu 07-10-21 09:58:02, Suren Baghdasaryan wrote:
> On Thu, Oct 7, 2021 at 9:40 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 07-10-21 09:04:09, Suren Baghdasaryan wrote:
> > > On Thu, Oct 7, 2021 at 3:15 AM Pavel Machek <pavel@ucw.cz> wrote:
> > > >
> > > > Hi!
> > > >
> > > > > >> Hmm, so the suggestion is to have some directory which contains files
> > > > > >> representing IDs, each containing the string name of the associated
> > > > > >> vma? Then let's say we are creating a new VMA and want to name it. We
> > > > > >> would have to scan that directory, check all files and see if any of
> > > > > >> them contain the name we want to reuse the same ID.
> > > > > >
> > > > > > I believe Pavel meant something as simple as
> > > > > > $ YOUR_FILE=$YOUR_IDS_DIR/my_string_name
> > > > > > $ touch $YOUR_FILE
> > > > > > $ stat -c %i $YOUR_FILE
> > >
> > > Ah, ok, now I understand the proposal. Thanks for the clarification!
> > > So, this would use filesystem as a directory for inode->name mappings.
> > > One rough edge for me is that the consumer would still need to parse
> > > /proc/$pid/maps and convert [anon:inode] into [anon:name] instead of
> > > just dumping the content for the user. Would it be acceptable if we
> > > require the ID provided by prctl() to always be a valid inode and
> > > show_map_vma() would do the inode-to-filename conversion when
> > > generating maps/smaps files? I know that inode->dentry is not
> > > one-to-one mapping but we can simply output the first dentry name.
> > > WDYT?
> >
> > No. You do not want to dictate any particular way of the mapping. The
> > above is just one way to do that without developing any actual mapping
> > yourself. You just use a filesystem for that. Kernel doesn't and
> > shouldn't understand the meaning of those numbers. It has no business in
> > that.
> >
> > In a way this would be pushing policy into the kernel.
> 
> I can see your point. Any other ideas on how to prevent tools from
> doing this id-to-name conversion themselves?

I really fail to understand why you really want to prevent them from that.
Really, the whole thing is just a cookie that kernel maintains for memory
mappings so that two parties can understand what the meaning of that
mapping is from a higher level. They both have to agree on the naming
but the kernel shouldn't dictate any specific convention because the
kernel _doesn't_ _care_. These things are not really anything actionable
for the kernel. It is just a metadata.
Suren Baghdasaryan Oct. 7, 2021, 5:50 p.m. UTC | #35
On Thu, Oct 7, 2021 at 10:31 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 07-10-21 09:58:02, Suren Baghdasaryan wrote:
> > On Thu, Oct 7, 2021 at 9:40 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 07-10-21 09:04:09, Suren Baghdasaryan wrote:
> > > > On Thu, Oct 7, 2021 at 3:15 AM Pavel Machek <pavel@ucw.cz> wrote:
> > > > >
> > > > > Hi!
> > > > >
> > > > > > >> Hmm, so the suggestion is to have some directory which contains files
> > > > > > >> representing IDs, each containing the string name of the associated
> > > > > > >> vma? Then let's say we are creating a new VMA and want to name it. We
> > > > > > >> would have to scan that directory, check all files and see if any of
> > > > > > >> them contain the name we want to reuse the same ID.
> > > > > > >
> > > > > > > I believe Pavel meant something as simple as
> > > > > > > $ YOUR_FILE=$YOUR_IDS_DIR/my_string_name
> > > > > > > $ touch $YOUR_FILE
> > > > > > > $ stat -c %i $YOUR_FILE
> > > >
> > > > Ah, ok, now I understand the proposal. Thanks for the clarification!
> > > > So, this would use filesystem as a directory for inode->name mappings.
> > > > One rough edge for me is that the consumer would still need to parse
> > > > /proc/$pid/maps and convert [anon:inode] into [anon:name] instead of
> > > > just dumping the content for the user. Would it be acceptable if we
> > > > require the ID provided by prctl() to always be a valid inode and
> > > > show_map_vma() would do the inode-to-filename conversion when
> > > > generating maps/smaps files? I know that inode->dentry is not
> > > > one-to-one mapping but we can simply output the first dentry name.
> > > > WDYT?
> > >
> > > No. You do not want to dictate any particular way of the mapping. The
> > > above is just one way to do that without developing any actual mapping
> > > yourself. You just use a filesystem for that. Kernel doesn't and
> > > shouldn't understand the meaning of those numbers. It has no business in
> > > that.
> > >
> > > In a way this would be pushing policy into the kernel.
> >
> > I can see your point. Any other ideas on how to prevent tools from
> > doing this id-to-name conversion themselves?
>
> I really fail to understand why you really want to prevent them from that.
> Really, the whole thing is just a cookie that kernel maintains for memory
> mappings so that two parties can understand what the meaning of that
> mapping is from a higher level. They both have to agree on the naming
> but the kernel shouldn't dictate any specific convention because the
> kernel _doesn't_ _care_. These things are not really anything actionable
> for the kernel. It is just a metadata.

The desire is for one of these two parties to be a human who can get
the data and use it as is without additional conversions.
/proc/$pid/maps could report FD numbers instead of pathnames, which
could be converted to pathnames in userspace. However we do not do
that because pathnames are more convenient for humans to identify a
specific resource. Same logic applies here IMHO.

>
> --
> Michal Hocko
> SUSE Labs
Kees Cook Oct. 7, 2021, 6:12 p.m. UTC | #36
On Thu, Oct 07, 2021 at 10:50:24AM -0700, Suren Baghdasaryan wrote:
> On Thu, Oct 7, 2021 at 10:31 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 07-10-21 09:58:02, Suren Baghdasaryan wrote:
> > > On Thu, Oct 7, 2021 at 9:40 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 07-10-21 09:04:09, Suren Baghdasaryan wrote:
> > > > > On Thu, Oct 7, 2021 at 3:15 AM Pavel Machek <pavel@ucw.cz> wrote:
> > > > > >
> > > > > > Hi!
> > > > > >
> > > > > > > >> Hmm, so the suggestion is to have some directory which contains files
> > > > > > > >> representing IDs, each containing the string name of the associated
> > > > > > > >> vma? Then let's say we are creating a new VMA and want to name it. We
> > > > > > > >> would have to scan that directory, check all files and see if any of
> > > > > > > >> them contain the name we want to reuse the same ID.
> > > > > > > >
> > > > > > > > I believe Pavel meant something as simple as
> > > > > > > > $ YOUR_FILE=$YOUR_IDS_DIR/my_string_name
> > > > > > > > $ touch $YOUR_FILE
> > > > > > > > $ stat -c %i $YOUR_FILE
> > > > >
> > > > > Ah, ok, now I understand the proposal. Thanks for the clarification!
> > > > > So, this would use filesystem as a directory for inode->name mappings.
> > > > > One rough edge for me is that the consumer would still need to parse
> > > > > /proc/$pid/maps and convert [anon:inode] into [anon:name] instead of
> > > > > just dumping the content for the user. Would it be acceptable if we
> > > > > require the ID provided by prctl() to always be a valid inode and
> > > > > show_map_vma() would do the inode-to-filename conversion when
> > > > > generating maps/smaps files? I know that inode->dentry is not
> > > > > one-to-one mapping but we can simply output the first dentry name.
> > > > > WDYT?
> > > >
> > > > No. You do not want to dictate any particular way of the mapping. The
> > > > above is just one way to do that without developing any actual mapping
> > > > yourself. You just use a filesystem for that. Kernel doesn't and
> > > > shouldn't understand the meaning of those numbers. It has no business in
> > > > that.
> > > >
> > > > In a way this would be pushing policy into the kernel.
> > >
> > > I can see your point. Any other ideas on how to prevent tools from
> > > doing this id-to-name conversion themselves?
> >
> > I really fail to understand why you really want to prevent them from that.
> > Really, the whole thing is just a cookie that kernel maintains for memory
> > mappings so that two parties can understand what the meaning of that
> > mapping is from a higher level. They both have to agree on the naming
> > but the kernel shouldn't dictate any specific convention because the
> > kernel _doesn't_ _care_. These things are not really anything actionable
> > for the kernel. It is just a metadata.
> 
> The desire is for one of these two parties to be a human who can get
> the data and use it as is without additional conversions.
> /proc/$pid/maps could report FD numbers instead of pathnames, which
> could be converted to pathnames in userspace. However we do not do
> that because pathnames are more convenient for humans to identify a
> specific resource. Same logic applies here IMHO.

Yes, please. It really seems like the folks that are interested in this
feature want strings. (I certainly do.) For those not interested in the
feature, it sounds like a CONFIG to keep it away would be sufficient.
Can we just move forward with that?
Suren Baghdasaryan Oct. 7, 2021, 6:50 p.m. UTC | #37
On Thu, Oct 7, 2021 at 11:13 AM Kees Cook <keescook@chromium.org> wrote:
>
> On Thu, Oct 07, 2021 at 10:50:24AM -0700, Suren Baghdasaryan wrote:
> > On Thu, Oct 7, 2021 at 10:31 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 07-10-21 09:58:02, Suren Baghdasaryan wrote:
> > > > On Thu, Oct 7, 2021 at 9:40 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Thu 07-10-21 09:04:09, Suren Baghdasaryan wrote:
> > > > > > On Thu, Oct 7, 2021 at 3:15 AM Pavel Machek <pavel@ucw.cz> wrote:
> > > > > > >
> > > > > > > Hi!
> > > > > > >
> > > > > > > > >> Hmm, so the suggestion is to have some directory which contains files
> > > > > > > > >> representing IDs, each containing the string name of the associated
> > > > > > > > >> vma? Then let's say we are creating a new VMA and want to name it. We
> > > > > > > > >> would have to scan that directory, check all files and see if any of
> > > > > > > > >> them contain the name we want to reuse the same ID.
> > > > > > > > >
> > > > > > > > > I believe Pavel meant something as simple as
> > > > > > > > > $ YOUR_FILE=$YOUR_IDS_DIR/my_string_name
> > > > > > > > > $ touch $YOUR_FILE
> > > > > > > > > $ stat -c %i $YOUR_FILE
> > > > > >
> > > > > > Ah, ok, now I understand the proposal. Thanks for the clarification!
> > > > > > So, this would use filesystem as a directory for inode->name mappings.
> > > > > > One rough edge for me is that the consumer would still need to parse
> > > > > > /proc/$pid/maps and convert [anon:inode] into [anon:name] instead of
> > > > > > just dumping the content for the user. Would it be acceptable if we
> > > > > > require the ID provided by prctl() to always be a valid inode and
> > > > > > show_map_vma() would do the inode-to-filename conversion when
> > > > > > generating maps/smaps files? I know that inode->dentry is not
> > > > > > one-to-one mapping but we can simply output the first dentry name.
> > > > > > WDYT?
> > > > >
> > > > > No. You do not want to dictate any particular way of the mapping. The
> > > > > above is just one way to do that without developing any actual mapping
> > > > > yourself. You just use a filesystem for that. Kernel doesn't and
> > > > > shouldn't understand the meaning of those numbers. It has no business in
> > > > > that.
> > > > >
> > > > > In a way this would be pushing policy into the kernel.
> > > >
> > > > I can see your point. Any other ideas on how to prevent tools from
> > > > doing this id-to-name conversion themselves?
> > >
> > > I really fail to understand why you really want to prevent them from that.
> > > Really, the whole thing is just a cookie that kernel maintains for memory
> > > mappings so that two parties can understand what the meaning of that
> > > mapping is from a higher level. They both have to agree on the naming
> > > but the kernel shouldn't dictate any specific convention because the
> > > kernel _doesn't_ _care_. These things are not really anything actionable
> > > for the kernel. It is just a metadata.
> >
> > The desire is for one of these two parties to be a human who can get
> > the data and use it as is without additional conversions.
> > /proc/$pid/maps could report FD numbers instead of pathnames, which
> > could be converted to pathnames in userspace. However we do not do
> > that because pathnames are more convenient for humans to identify a
> > specific resource. Same logic applies here IMHO.
>
> Yes, please. It really seems like the folks that are interested in this
> feature want strings. (I certainly do.) For those not interested in the
> feature, it sounds like a CONFIG to keep it away would be sufficient.
> Can we just move forward with that?

Would love to if others are ok with this.

>
> --
> Kees Cook
John Hubbard Oct. 7, 2021, 7:02 p.m. UTC | #38
On 10/7/21 11:50, Suren Baghdasaryan wrote:
...
>>>>>>>>>> I believe Pavel meant something as simple as
>>>>>>>>>> $ YOUR_FILE=$YOUR_IDS_DIR/my_string_name
>>>>>>>>>> $ touch $YOUR_FILE
>>>>>>>>>> $ stat -c %i $YOUR_FILE
>>>>>>>
>>>>>>> Ah, ok, now I understand the proposal. Thanks for the clarification!
>>>>>>> So, this would use filesystem as a directory for inode->name mappings.
>>>>>>> One rough edge for me is that the consumer would still need to parse
>>>>>>> /proc/$pid/maps and convert [anon:inode] into [anon:name] instead of
>>>>>>> just dumping the content for the user. Would it be acceptable if we
>>>>>>> require the ID provided by prctl() to always be a valid inode and
>>>>>>> show_map_vma() would do the inode-to-filename conversion when
>>>>>>> generating maps/smaps files? I know that inode->dentry is not
>>>>>>> one-to-one mapping but we can simply output the first dentry name.
>>>>>>> WDYT?
>>>>>>
>>>>>> No. You do not want to dictate any particular way of the mapping. The
>>>>>> above is just one way to do that without developing any actual mapping
>>>>>> yourself. You just use a filesystem for that. Kernel doesn't and
>>>>>> shouldn't understand the meaning of those numbers. It has no business in
>>>>>> that.
>>>>>>
>>>>>> In a way this would be pushing policy into the kernel.
>>>>>
>>>>> I can see your point. Any other ideas on how to prevent tools from
>>>>> doing this id-to-name conversion themselves?
>>>>
>>>> I really fail to understand why you really want to prevent them from that.
>>>> Really, the whole thing is just a cookie that kernel maintains for memory
>>>> mappings so that two parties can understand what the meaning of that
>>>> mapping is from a higher level. They both have to agree on the naming
>>>> but the kernel shouldn't dictate any specific convention because the
>>>> kernel _doesn't_ _care_. These things are not really anything actionable
>>>> for the kernel. It is just a metadata.
>>>
>>> The desire is for one of these two parties to be a human who can get
>>> the data and use it as is without additional conversions.
>>> /proc/$pid/maps could report FD numbers instead of pathnames, which
>>> could be converted to pathnames in userspace. However we do not do
>>> that because pathnames are more convenient for humans to identify a
>>> specific resource. Same logic applies here IMHO.
>>
>> Yes, please. It really seems like the folks that are interested in this
>> feature want strings. (I certainly do.) For those not interested in the
>> feature, it sounds like a CONFIG to keep it away would be sufficient.
>> Can we just move forward with that?
> 
> Would love to if others are ok with this.
> 

If this doesn't get accepted, then another way forward would to continue
the ideas above to their logical conclusion, and create a new file system:
vma-fs.  Like debug-fs and other special file systems, similar policy and
motivation. Also protected by a CONFIG option.

Actually this seems at least as natural as the procfs approach, especially
given the nature of these strings, which feel more like dir+file names, than
simple strings.

thanks,
Suren Baghdasaryan Oct. 7, 2021, 9:32 p.m. UTC | #39
On Thu, Oct 7, 2021 at 12:03 PM 'John Hubbard' via kernel-team
<kernel-team@android.com> wrote:
>
> On 10/7/21 11:50, Suren Baghdasaryan wrote:
> ...
> >>>>>>>>>> I believe Pavel meant something as simple as
> >>>>>>>>>> $ YOUR_FILE=$YOUR_IDS_DIR/my_string_name
> >>>>>>>>>> $ touch $YOUR_FILE
> >>>>>>>>>> $ stat -c %i $YOUR_FILE
> >>>>>>>
> >>>>>>> Ah, ok, now I understand the proposal. Thanks for the clarification!
> >>>>>>> So, this would use filesystem as a directory for inode->name mappings.
> >>>>>>> One rough edge for me is that the consumer would still need to parse
> >>>>>>> /proc/$pid/maps and convert [anon:inode] into [anon:name] instead of
> >>>>>>> just dumping the content for the user. Would it be acceptable if we
> >>>>>>> require the ID provided by prctl() to always be a valid inode and
> >>>>>>> show_map_vma() would do the inode-to-filename conversion when
> >>>>>>> generating maps/smaps files? I know that inode->dentry is not
> >>>>>>> one-to-one mapping but we can simply output the first dentry name.
> >>>>>>> WDYT?
> >>>>>>
> >>>>>> No. You do not want to dictate any particular way of the mapping. The
> >>>>>> above is just one way to do that without developing any actual mapping
> >>>>>> yourself. You just use a filesystem for that. Kernel doesn't and
> >>>>>> shouldn't understand the meaning of those numbers. It has no business in
> >>>>>> that.
> >>>>>>
> >>>>>> In a way this would be pushing policy into the kernel.
> >>>>>
> >>>>> I can see your point. Any other ideas on how to prevent tools from
> >>>>> doing this id-to-name conversion themselves?
> >>>>
> >>>> I really fail to understand why you really want to prevent them from that.
> >>>> Really, the whole thing is just a cookie that kernel maintains for memory
> >>>> mappings so that two parties can understand what the meaning of that
> >>>> mapping is from a higher level. They both have to agree on the naming
> >>>> but the kernel shouldn't dictate any specific convention because the
> >>>> kernel _doesn't_ _care_. These things are not really anything actionable
> >>>> for the kernel. It is just a metadata.
> >>>
> >>> The desire is for one of these two parties to be a human who can get
> >>> the data and use it as is without additional conversions.
> >>> /proc/$pid/maps could report FD numbers instead of pathnames, which
> >>> could be converted to pathnames in userspace. However we do not do
> >>> that because pathnames are more convenient for humans to identify a
> >>> specific resource. Same logic applies here IMHO.
> >>
> >> Yes, please. It really seems like the folks that are interested in this
> >> feature want strings. (I certainly do.) For those not interested in the
> >> feature, it sounds like a CONFIG to keep it away would be sufficient.
> >> Can we just move forward with that?
> >
> > Would love to if others are ok with this.
> >
>
> If this doesn't get accepted, then another way forward would to continue
> the ideas above to their logical conclusion, and create a new file system:
> vma-fs.  Like debug-fs and other special file systems, similar policy and
> motivation. Also protected by a CONFIG option.

TBH, I would prefer to have the current simple solution protected with
a CONFIG option.

>
> Actually this seems at least as natural as the procfs approach, especially
> given the nature of these strings, which feel more like dir+file names, than
> simple strings.
>
> thanks,
> --
> John Hubbard
> NVIDIA
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Liam R. Howlett Oct. 8, 2021, 1:04 a.m. UTC | #40
* Suren Baghdasaryan <surenb@google.com> [211007 17:32]:
> On Thu, Oct 7, 2021 at 12:03 PM 'John Hubbard' via kernel-team
> <kernel-team@android.com> wrote:
> >
> > On 10/7/21 11:50, Suren Baghdasaryan wrote:
> > ...
> > >>>>>>>>>> I believe Pavel meant something as simple as
> > >>>>>>>>>> $ YOUR_FILE=$YOUR_IDS_DIR/my_string_name
> > >>>>>>>>>> $ touch $YOUR_FILE
> > >>>>>>>>>> $ stat -c %i $YOUR_FILE
> > >>>>>>>
> > >>>>>>> Ah, ok, now I understand the proposal. Thanks for the clarification!
> > >>>>>>> So, this would use filesystem as a directory for inode->name mappings.
> > >>>>>>> One rough edge for me is that the consumer would still need to parse
> > >>>>>>> /proc/$pid/maps and convert [anon:inode] into [anon:name] instead of
> > >>>>>>> just dumping the content for the user. Would it be acceptable if we
> > >>>>>>> require the ID provided by prctl() to always be a valid inode and
> > >>>>>>> show_map_vma() would do the inode-to-filename conversion when
> > >>>>>>> generating maps/smaps files? I know that inode->dentry is not
> > >>>>>>> one-to-one mapping but we can simply output the first dentry name.
> > >>>>>>> WDYT?
> > >>>>>>
> > >>>>>> No. You do not want to dictate any particular way of the mapping. The
> > >>>>>> above is just one way to do that without developing any actual mapping
> > >>>>>> yourself. You just use a filesystem for that. Kernel doesn't and
> > >>>>>> shouldn't understand the meaning of those numbers. It has no business in
> > >>>>>> that.
> > >>>>>>
> > >>>>>> In a way this would be pushing policy into the kernel.
> > >>>>>
> > >>>>> I can see your point. Any other ideas on how to prevent tools from
> > >>>>> doing this id-to-name conversion themselves?
> > >>>>
> > >>>> I really fail to understand why you really want to prevent them from that.
> > >>>> Really, the whole thing is just a cookie that kernel maintains for memory
> > >>>> mappings so that two parties can understand what the meaning of that
> > >>>> mapping is from a higher level. They both have to agree on the naming
> > >>>> but the kernel shouldn't dictate any specific convention because the
> > >>>> kernel _doesn't_ _care_. These things are not really anything actionable
> > >>>> for the kernel. It is just a metadata.
> > >>>
> > >>> The desire is for one of these two parties to be a human who can get
> > >>> the data and use it as is without additional conversions.
> > >>> /proc/$pid/maps could report FD numbers instead of pathnames, which
> > >>> could be converted to pathnames in userspace. However we do not do
> > >>> that because pathnames are more convenient for humans to identify a
> > >>> specific resource. Same logic applies here IMHO.
> > >>
> > >> Yes, please. It really seems like the folks that are interested in this
> > >> feature want strings. (I certainly do.) For those not interested in the
> > >> feature, it sounds like a CONFIG to keep it away would be sufficient.
> > >> Can we just move forward with that?
> > >
> > > Would love to if others are ok with this.
> > >
> >
> > If this doesn't get accepted, then another way forward would to continue
> > the ideas above to their logical conclusion, and create a new file system:
> > vma-fs.  Like debug-fs and other special file systems, similar policy and
> > motivation. Also protected by a CONFIG option.
> 
> TBH, I would prefer to have the current simple solution protected with
> a CONFIG option.
> 
> >
> > Actually this seems at least as natural as the procfs approach, especially
> > given the nature of these strings, which feel more like dir+file names, than
> > simple strings.

I think the current locking around VMAs makes this a very tricky, if not
impossible, path.  Watching a proc file which takes the mmap_lock() is
painful enough.  Considering how hard it has been to have this feature
added, I cannot see locking changes being accepted as a more feasible
approach nor can I see increased mmap_lock() contention from any feature
being desired.

I like the CONFIG option.  The patches are in good shape and have a
clever way around the (unlikely?) scalability issue that existed.

Thanks,
Liam
Michal Hocko Oct. 8, 2021, 6:34 a.m. UTC | #41
On Thu 07-10-21 11:12:58, Kees Cook wrote:
> On Thu, Oct 07, 2021 at 10:50:24AM -0700, Suren Baghdasaryan wrote:
> > On Thu, Oct 7, 2021 at 10:31 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 07-10-21 09:58:02, Suren Baghdasaryan wrote:
> > > > On Thu, Oct 7, 2021 at 9:40 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Thu 07-10-21 09:04:09, Suren Baghdasaryan wrote:
> > > > > > On Thu, Oct 7, 2021 at 3:15 AM Pavel Machek <pavel@ucw.cz> wrote:
> > > > > > >
> > > > > > > Hi!
> > > > > > >
> > > > > > > > >> Hmm, so the suggestion is to have some directory which contains files
> > > > > > > > >> representing IDs, each containing the string name of the associated
> > > > > > > > >> vma? Then let's say we are creating a new VMA and want to name it. We
> > > > > > > > >> would have to scan that directory, check all files and see if any of
> > > > > > > > >> them contain the name we want to reuse the same ID.
> > > > > > > > >
> > > > > > > > > I believe Pavel meant something as simple as
> > > > > > > > > $ YOUR_FILE=$YOUR_IDS_DIR/my_string_name
> > > > > > > > > $ touch $YOUR_FILE
> > > > > > > > > $ stat -c %i $YOUR_FILE
> > > > > >
> > > > > > Ah, ok, now I understand the proposal. Thanks for the clarification!
> > > > > > So, this would use filesystem as a directory for inode->name mappings.
> > > > > > One rough edge for me is that the consumer would still need to parse
> > > > > > /proc/$pid/maps and convert [anon:inode] into [anon:name] instead of
> > > > > > just dumping the content for the user. Would it be acceptable if we
> > > > > > require the ID provided by prctl() to always be a valid inode and
> > > > > > show_map_vma() would do the inode-to-filename conversion when
> > > > > > generating maps/smaps files? I know that inode->dentry is not
> > > > > > one-to-one mapping but we can simply output the first dentry name.
> > > > > > WDYT?
> > > > >
> > > > > No. You do not want to dictate any particular way of the mapping. The
> > > > > above is just one way to do that without developing any actual mapping
> > > > > yourself. You just use a filesystem for that. Kernel doesn't and
> > > > > shouldn't understand the meaning of those numbers. It has no business in
> > > > > that.
> > > > >
> > > > > In a way this would be pushing policy into the kernel.
> > > >
> > > > I can see your point. Any other ideas on how to prevent tools from
> > > > doing this id-to-name conversion themselves?
> > >
> > > I really fail to understand why you really want to prevent them from that.
> > > Really, the whole thing is just a cookie that kernel maintains for memory
> > > mappings so that two parties can understand what the meaning of that
> > > mapping is from a higher level. They both have to agree on the naming
> > > but the kernel shouldn't dictate any specific convention because the
> > > kernel _doesn't_ _care_. These things are not really anything actionable
> > > for the kernel. It is just a metadata.
> > 
> > The desire is for one of these two parties to be a human who can get
> > the data and use it as is without additional conversions.
> > /proc/$pid/maps could report FD numbers instead of pathnames, which
> > could be converted to pathnames in userspace. However we do not do
> > that because pathnames are more convenient for humans to identify a
> > specific resource. Same logic applies here IMHO.
> 
> Yes, please. It really seems like the folks that are interested in this
> feature want strings. (I certainly do.)

I am sorry but there were no strong arguments mentioned for strings so
far. Effectively string require a more complex and more resource hungry
solution. The only advantage is that strings are nicer to read for
humans.

There hasn't been any plan presented for actual naming convention or how
those names would be used in practice. Except for a more advanced
resource management and that sounds like something that can work with
ids just fine.

> For those not interested in the
> feature, it sounds like a CONFIG to keep it away would be sufficient.

CONFIG is not an answer here as already pointed out. Distro kernels will
be forced to enable this because there might be somebody to use this
feature.

Initially I was not really feeling strongly one way or other but more we
are discussing the topic the more I see that strings have a very weak
justification behind.
Rasmus Villemoes Oct. 8, 2021, 7:25 a.m. UTC | #42
On 07/10/2021 21.02, John Hubbard wrote:
> On 10/7/21 11:50, Suren Baghdasaryan wrote:
> ...

>>>> The desire is for one of these two parties to be a human who can get
>>>> the data and use it as is without additional conversions.
>>>> /proc/$pid/maps could report FD numbers instead of pathnames, which
>>>> could be converted to pathnames in userspace. However we do not do
>>>> that because pathnames are more convenient for humans to identify a
>>>> specific resource. Same logic applies here IMHO.
>>>
>>> Yes, please. It really seems like the folks that are interested in this
>>> feature want strings. (I certainly do.) For those not interested in the
>>> feature, it sounds like a CONFIG to keep it away would be sufficient.
>>> Can we just move forward with that?
>>
>> Would love to if others are ok with this.
>>
> 
> If this doesn't get accepted, then another way forward would to continue
> the ideas above to their logical conclusion, and create a new file system:
> vma-fs. 

Or: Why can't the library/application that wants a VMA backed by memory
to have some associated name not just

  fd = open("/run/named-vmas/foobar#24", O_CLOEXEC|O_RDWR|O_EXCL|O_CREAT);
  unlink("/run/named-vmas/foobar#24");
  ftruncate(fd, ...);
  mmap(fd);

where /run/named-vmas is a tmpfs (probably with some per-user/per-app
subdirs). That requires no changes in the kernel at all. Yes, it lacks
the automatic cleanup of using real anon mmap in case there's a crash
between open and unlink, but in an environment like Android that should
be solvable.

Rasmus
David Hildenbrand Oct. 8, 2021, 7:43 a.m. UTC | #43
On 08.10.21 09:25, Rasmus Villemoes wrote:
> On 07/10/2021 21.02, John Hubbard wrote:
>> On 10/7/21 11:50, Suren Baghdasaryan wrote:
>> ...
> 
>>>>> The desire is for one of these two parties to be a human who can get
>>>>> the data and use it as is without additional conversions.
>>>>> /proc/$pid/maps could report FD numbers instead of pathnames, which
>>>>> could be converted to pathnames in userspace. However we do not do
>>>>> that because pathnames are more convenient for humans to identify a
>>>>> specific resource. Same logic applies here IMHO.
>>>>
>>>> Yes, please. It really seems like the folks that are interested in this
>>>> feature want strings. (I certainly do.) For those not interested in the
>>>> feature, it sounds like a CONFIG to keep it away would be sufficient.
>>>> Can we just move forward with that?
>>>
>>> Would love to if others are ok with this.
>>>
>>
>> If this doesn't get accepted, then another way forward would to continue
>> the ideas above to their logical conclusion, and create a new file system:
>> vma-fs.
> 
> Or: Why can't the library/application that wants a VMA backed by memory
> to have some associated name not just
> 
>    fd = open("/run/named-vmas/foobar#24", O_CLOEXEC|O_RDWR|O_EXCL|O_CREAT);
>    unlink("/run/named-vmas/foobar#24");
>    ftruncate(fd, ...);
>    mmap(fd);
> 
> where /run/named-vmas is a tmpfs (probably with some per-user/per-app
> subdirs). That requires no changes in the kernel at all. Yes, it lacks
> the automatic cleanup of using real anon mmap in case there's a crash
> between open and unlink, but in an environment like Android that should
> be solvable.

I'm going to point out that we already do have names for memfds.

"The name supplied in name is used as a filename and will be displayed 
as  the  target  of  the corresponding  symbolic  link  in  the 
directory /proc/self/fd/." It's also displayed in /proc/self/maps.

So theoretically, without any kernel changes one might be able to 
achieve something as proposed in this patch via memfd_create().

There is one issue to be fixed:

MAP_PRIVATE on memfd results in a double memory consumption on any 
access via the mapping last time I checked (one page for pagecache, one 
page for private mapping).
Dave Hansen Oct. 8, 2021, 2:14 p.m. UTC | #44
On 10/7/21 11:34 PM, Michal Hocko wrote:
>> Yes, please. It really seems like the folks that are interested in this
>> feature want strings. (I certainly do.)
> I am sorry but there were no strong arguments mentioned for strings so
> far.

The folks who want this have maintained an out-of-tree patch using
strings.  They've maintained it for the better part of a decade.  I
don't know how widely this shipped in the Android ecosystem, but I
suspect we're talking about billions of devices.  Right?

This is a feature that, if accepted into mainline, will get enabled and
used on billions of devices.  If we dumb this down to integers, it's not
100% clear that it _will_ get used.

That's a pretty strong argument in my book, even if the contributors
have difficulty articulating exactly why they want strings.
Michal Hocko Oct. 8, 2021, 2:57 p.m. UTC | #45
On Fri 08-10-21 07:14:58, Dave Hansen wrote:
> On 10/7/21 11:34 PM, Michal Hocko wrote:
> >> Yes, please. It really seems like the folks that are interested in this
> >> feature want strings. (I certainly do.)
> > I am sorry but there were no strong arguments mentioned for strings so
> > far.
> 
> The folks who want this have maintained an out-of-tree patch using
> strings.  They've maintained it for the better part of a decade.  I
> don't know how widely this shipped in the Android ecosystem, but I
> suspect we're talking about billions of devices.  Right?
> 
> This is a feature that, if accepted into mainline, will get enabled and
> used on billions of devices.  If we dumb this down to integers, it's not
> 100% clear that it _will_ get used.
> 
> That's a pretty strong argument in my book, even if the contributors
> have difficulty articulating exactly why they want strings.

I would agree that if integers would make this unusable then this would
be a strong argument. But I haven't really heard any arguments like that
so far. I have heard about IPC overhead and other speculations that do
not seem really convincing. We shouldn't hand wave concerns regarding
the implementation complexity and resource handling just by "somebody
has been using this for decates", right?

Do not get me wrong. This is going to become a user interface and we
will have to maintain it for ever. As such an extra scrutiny has to be
applied.
Suren Baghdasaryan Oct. 8, 2021, 4:10 p.m. UTC | #46
On Fri, Oct 8, 2021 at 7:57 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 08-10-21 07:14:58, Dave Hansen wrote:
> > On 10/7/21 11:34 PM, Michal Hocko wrote:
> > >> Yes, please. It really seems like the folks that are interested in this
> > >> feature want strings. (I certainly do.)
> > > I am sorry but there were no strong arguments mentioned for strings so
> > > far.
> >
> > The folks who want this have maintained an out-of-tree patch using
> > strings.  They've maintained it for the better part of a decade.  I
> > don't know how widely this shipped in the Android ecosystem, but I
> > suspect we're talking about billions of devices.  Right?

Correct.

> >
> > This is a feature that, if accepted into mainline, will get enabled and
> > used on billions of devices.  If we dumb this down to integers, it's not
> > 100% clear that it _will_ get used.

Not as is and not with some major changes in the userspace, which
relied on a simple interface: set a name to a vma, observe that name
in the /proc/$pid/maps.

> >
> > That's a pretty strong argument in my book, even if the contributors
> > have difficulty articulating exactly why they want strings.
>
> I would agree that if integers would make this unusable then this would
> be a strong argument. But I haven't really heard any arguments like that
> so far. I have heard about IPC overhead and other speculations that do
> not seem really convincing. We shouldn't hand wave concerns regarding
> the implementation complexity and resource handling just by "somebody
> has been using this for decates", right?
>
> Do not get me wrong. This is going to become a user interface and we
> will have to maintain it for ever. As such an extra scrutiny has to be
> applied.

I don't know how to better articulate this. IPC transactions on
Android cannot be scheduled efficiently. We're going to have to stall
after mmap, make binder transaction, schedule a new process, get the
ID, make binder reply, schedule back to the original thread, resume.
Doing this potentially for every mmap is a non-starter. Deferring this
job is possible but we still have to do all this work, so it still
requires cpu cycles and power, not mentioning the additional
complexity in the userspace. I'm adding a rep from the performance
team, maybe Tim can explain this better.

There were a couple suggestions on using filesystem/memfd for naming
purposes which I need to explore but if that works the approach will
likely not involve any IDs. We want human-readable names in the maps
file, not a number.

Thanks for all the feedback and ideas!

> --
> Michal Hocko
> SUSE Labs
Kees Cook Oct. 8, 2021, 8:58 p.m. UTC | #47
On Fri, Oct 08, 2021 at 08:34:47AM +0200, Michal Hocko wrote:
> I am sorry but there were no strong arguments mentioned for strings so
> far. Effectively string require a more complex and more resource hungry
> solution. The only advantage is that strings are nicer to read for
> humans.

How I see it:

- Strings are already present in the "maps" output, so this doesn't
  create a burden on userspace to grow new parsers.

- Strings for "anon" specifically have no required format (this is good)
  it's informational like the task_struct::comm and can (roughly)
  anything. There's no naming convention for memfds, AF_UNIX, etc. Why
  is one needed here? That seems like a completely unreasonable
  requirement.

- Strings need to be in kernel space because cross-process GUP has been a
  constant source of security flaws.

> There hasn't been any plan presented for actual naming convention or how
> those names would be used in practice. Except for a more advanced
> resource management and that sounds like something that can work with
> ids just fine.

There doesn't need to be and there shouldn't be. Why aren't memfds names
an id? Because, to quote the man-page, "... serves only for debugging
purposes. Names do not affect the behavior of the file descriptor, and
as such multiple files can have the same name without any side effects."

And they aren't filtered _at all_ either. I think the anonymous vma name
series has gone out of its way to be safe and sane while still providing
the ease-of-use that it was designed to provide.

> Initially I was not really feeling strongly one way or other but more we
> are discussing the topic the more I see that strings have a very weak
> justification behind.

I just don't see any _down_ side to gaining this. There's only resource
utilization when it's used, and the complexity is minimal.
Kees Cook Oct. 8, 2021, 9:13 p.m. UTC | #48
On Fri, Oct 08, 2021 at 09:43:59AM +0200, David Hildenbrand wrote:
> I'm going to point out that we already do have names for memfds.

I just did the same here[1]. :P

> [...] It's also displayed in /proc/self/maps.

I missed that part! /me screams forever

We really need to filter this at creation time. :( At least
seq_file_path() escapes "\n" for it, but not "\r", so humans on a
terminal could get very confused...

$ ./memfd '^M0000000000000000-ffffffffffffffff rwxp 00000000 00:00 0 [stack]' &
[1] 2953833
$ cat /proc/2953833/maps
...
0000000000000000-ffffffffffffffff rwxp 00000000 00:00 0     [stack] (deleted)
...


-Kees

[1] https://lore.kernel.org/lkml/202110081344.FE6A7A82@keescook
Michal Hocko Oct. 11, 2021, 8:36 a.m. UTC | #49
On Fri 08-10-21 13:58:01, Kees Cook wrote:
> - Strings for "anon" specifically have no required format (this is good)
>   it's informational like the task_struct::comm and can (roughly)
>   anything. There's no naming convention for memfds, AF_UNIX, etc. Why
>   is one needed here? That seems like a completely unreasonable
>   requirement.

I might be misreading the justification for the feature. Patch 2 is
talking about tools that need to understand memeory usage to make
further actions. Also Suren was suggesting "numbering convetion" as an
argument against. 

So can we get a clear example how is this being used actually? If this
is just to be used to debug by humans than I can see an argument for
human readable form. If this is, however, meant to be used by tools to
make some actions then the argument for strings is much weaker.
Suren Baghdasaryan Oct. 12, 2021, 1:18 a.m. UTC | #50
On Mon, Oct 11, 2021 at 1:36 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 08-10-21 13:58:01, Kees Cook wrote:
> > - Strings for "anon" specifically have no required format (this is good)
> >   it's informational like the task_struct::comm and can (roughly)
> >   anything. There's no naming convention for memfds, AF_UNIX, etc. Why
> >   is one needed here? That seems like a completely unreasonable
> >   requirement.
>
> I might be misreading the justification for the feature. Patch 2 is
> talking about tools that need to understand memeory usage to make
> further actions. Also Suren was suggesting "numbering convetion" as an
> argument against.
>
> So can we get a clear example how is this being used actually? If this
> is just to be used to debug by humans than I can see an argument for
> human readable form. If this is, however, meant to be used by tools to
> make some actions then the argument for strings is much weaker.

The simplest usecase is when we notice that a process consumes more
memory than usual and we do "cat /proc/$(pidof my_process)/maps" to
check which area is contributing to this growth. The names we assign
to anonymous areas are descriptive enough for a developer to get an
idea where the increased consumption is coming from and how to proceed
with their investigation.
There are of course cases when tools are involved, but the end-user is
always a human and the final report should contain easily
understandable data.

IIUC, the main argument here is whether the userspace can provide
tools to perform the translations between ids and names, with the
kernel accepting and reporting ids instead of strings. Technically
it's possible, but to be practical that conversion should be fast
because we will need to make name->id conversion potentially for each
mmap. On the consumer side the performance is not as critical, but the
fact that instead of dumping /proc/$pid/maps we will have to parse the
file, do id->name conversion and replace all [anon:id] with
[anon:name] would be an issue when we do that in bulk, for example
when collecting system-wide data for a bugreport.

I went ahead and implemented the proposed userspace solution involving
tmpfs as a repository for name->id mapping (more precisely
filename->inode mapping). Profiling shows that open()+fstat()+close()
takes:
- roughly 15 times longer than mmap() with 1000 unique names each
being reused 50 times.
- roughly 3 times longer than mmap() with 100 unique names each being
reused 500 times. This is due to lstat() optimization suggested by
Rasmus which avoids open() and close().
For comparison, proposed prctl() takes roughly the same amount of time
as mmap() and does not depend on the number of unique names.

I'm still evaluating the proposal to use memfds but I'm not sure if
the issue that David Hildenbrand mentioned about additional memory
consumed in pagecache (which has to be addressed) is the only one we
will encounter with this approach. If anyone knows of any potential
issues with using memfds as named anonymous memory, I would really
appreciate your feedback before I go too far in that direction.
Thanks,
Suren.

> --
> Michal Hocko
> SUSE Labs
Suren Baghdasaryan Oct. 12, 2021, 1:20 a.m. UTC | #51
On Mon, Oct 11, 2021 at 6:18 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Oct 11, 2021 at 1:36 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 08-10-21 13:58:01, Kees Cook wrote:
> > > - Strings for "anon" specifically have no required format (this is good)
> > >   it's informational like the task_struct::comm and can (roughly)
> > >   anything. There's no naming convention for memfds, AF_UNIX, etc. Why
> > >   is one needed here? That seems like a completely unreasonable
> > >   requirement.
> >
> > I might be misreading the justification for the feature. Patch 2 is
> > talking about tools that need to understand memeory usage to make
> > further actions. Also Suren was suggesting "numbering convetion" as an
> > argument against.
> >
> > So can we get a clear example how is this being used actually? If this
> > is just to be used to debug by humans than I can see an argument for
> > human readable form. If this is, however, meant to be used by tools to
> > make some actions then the argument for strings is much weaker.
>
> The simplest usecase is when we notice that a process consumes more
> memory than usual and we do "cat /proc/$(pidof my_process)/maps" to
> check which area is contributing to this growth. The names we assign
> to anonymous areas are descriptive enough for a developer to get an
> idea where the increased consumption is coming from and how to proceed
> with their investigation.
> There are of course cases when tools are involved, but the end-user is
> always a human and the final report should contain easily
> understandable data.
>
> IIUC, the main argument here is whether the userspace can provide
> tools to perform the translations between ids and names, with the
> kernel accepting and reporting ids instead of strings. Technically
> it's possible, but to be practical that conversion should be fast
> because we will need to make name->id conversion potentially for each
> mmap. On the consumer side the performance is not as critical, but the
> fact that instead of dumping /proc/$pid/maps we will have to parse the
> file, do id->name conversion and replace all [anon:id] with
> [anon:name] would be an issue when we do that in bulk, for example
> when collecting system-wide data for a bugreport.
>
> I went ahead and implemented the proposed userspace solution involving
> tmpfs as a repository for name->id mapping (more precisely
> filename->inode mapping). Profiling shows that open()+fstat()+close()
> takes:
> - roughly 15 times longer than mmap() with 1000 unique names each
> being reused 50 times.
> - roughly 3 times longer than mmap() with 100 unique names each being
> reused 500 times. This is due to lstat() optimization suggested by
> Rasmus which avoids open() and close().
> For comparison, proposed prctl() takes roughly the same amount of time
> as mmap() and does not depend on the number of unique names.
>
> I'm still evaluating the proposal to use memfds but I'm not sure if
> the issue that David Hildenbrand mentioned about additional memory
> consumed in pagecache (which has to be addressed) is the only one we
> will encounter with this approach. If anyone knows of any potential
> issues with using memfds as named anonymous memory, I would really
> appreciate your feedback before I go too far in that direction.
> Thanks,
> Suren.

Just noticed that timmurray@ was dropped from the last reply. Adding him back.

>
> > --
> > Michal Hocko
> > SUSE Labs
Johannes Weiner Oct. 12, 2021, 3 a.m. UTC | #52
On Mon, Oct 11, 2021 at 06:20:25PM -0700, Suren Baghdasaryan wrote:
> On Mon, Oct 11, 2021 at 6:18 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Mon, Oct 11, 2021 at 1:36 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 08-10-21 13:58:01, Kees Cook wrote:
> > > > - Strings for "anon" specifically have no required format (this is good)
> > > >   it's informational like the task_struct::comm and can (roughly)
> > > >   anything. There's no naming convention for memfds, AF_UNIX, etc. Why
> > > >   is one needed here? That seems like a completely unreasonable
> > > >   requirement.
> > >
> > > I might be misreading the justification for the feature. Patch 2 is
> > > talking about tools that need to understand memeory usage to make
> > > further actions. Also Suren was suggesting "numbering convetion" as an
> > > argument against.
> > >
> > > So can we get a clear example how is this being used actually? If this
> > > is just to be used to debug by humans than I can see an argument for
> > > human readable form. If this is, however, meant to be used by tools to
> > > make some actions then the argument for strings is much weaker.
> >
> > The simplest usecase is when we notice that a process consumes more
> > memory than usual and we do "cat /proc/$(pidof my_process)/maps" to
> > check which area is contributing to this growth. The names we assign
> > to anonymous areas are descriptive enough for a developer to get an
> > idea where the increased consumption is coming from and how to proceed
> > with their investigation.
> > There are of course cases when tools are involved, but the end-user is
> > always a human and the final report should contain easily
> > understandable data.
> >
> > IIUC, the main argument here is whether the userspace can provide
> > tools to perform the translations between ids and names, with the
> > kernel accepting and reporting ids instead of strings. Technically
> > it's possible, but to be practical that conversion should be fast
> > because we will need to make name->id conversion potentially for each
> > mmap. On the consumer side the performance is not as critical, but the
> > fact that instead of dumping /proc/$pid/maps we will have to parse the
> > file, do id->name conversion and replace all [anon:id] with
> > [anon:name] would be an issue when we do that in bulk, for example
> > when collecting system-wide data for a bugreport.

Is that something you need to do client-side? Or could the bug tool
upload the userspace-maintained name:ids database alongside the
/proc/pid/maps dump for external processing?
Suren Baghdasaryan Oct. 12, 2021, 5:36 a.m. UTC | #53
On Mon, Oct 11, 2021 at 8:00 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Oct 11, 2021 at 06:20:25PM -0700, Suren Baghdasaryan wrote:
> > On Mon, Oct 11, 2021 at 6:18 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Mon, Oct 11, 2021 at 1:36 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Fri 08-10-21 13:58:01, Kees Cook wrote:
> > > > > - Strings for "anon" specifically have no required format (this is good)
> > > > >   it's informational like the task_struct::comm and can (roughly)
> > > > >   anything. There's no naming convention for memfds, AF_UNIX, etc. Why
> > > > >   is one needed here? That seems like a completely unreasonable
> > > > >   requirement.
> > > >
> > > > I might be misreading the justification for the feature. Patch 2 is
> > > > talking about tools that need to understand memeory usage to make
> > > > further actions. Also Suren was suggesting "numbering convetion" as an
> > > > argument against.
> > > >
> > > > So can we get a clear example how is this being used actually? If this
> > > > is just to be used to debug by humans than I can see an argument for
> > > > human readable form. If this is, however, meant to be used by tools to
> > > > make some actions then the argument for strings is much weaker.
> > >
> > > The simplest usecase is when we notice that a process consumes more
> > > memory than usual and we do "cat /proc/$(pidof my_process)/maps" to
> > > check which area is contributing to this growth. The names we assign
> > > to anonymous areas are descriptive enough for a developer to get an
> > > idea where the increased consumption is coming from and how to proceed
> > > with their investigation.
> > > There are of course cases when tools are involved, but the end-user is
> > > always a human and the final report should contain easily
> > > understandable data.
> > >
> > > IIUC, the main argument here is whether the userspace can provide
> > > tools to perform the translations between ids and names, with the
> > > kernel accepting and reporting ids instead of strings. Technically
> > > it's possible, but to be practical that conversion should be fast
> > > because we will need to make name->id conversion potentially for each
> > > mmap. On the consumer side the performance is not as critical, but the
> > > fact that instead of dumping /proc/$pid/maps we will have to parse the
> > > file, do id->name conversion and replace all [anon:id] with
> > > [anon:name] would be an issue when we do that in bulk, for example
> > > when collecting system-wide data for a bugreport.
>
> Is that something you need to do client-side? Or could the bug tool
> upload the userspace-maintained name:ids database alongside the
> /proc/pid/maps dump for external processing?

You can generate a bugreport and analyze it locally or submit it as an
attachment to a bug for further analyzes.
Sure, we can attach the id->name conversion table to the bugreport but
either way, some tool would have to post-process it to resolve the
ids. If we are not analyzing the results immediately then that step
can be postponed and I think that's what you mean? If so, then yes,
that is correct.
Michal Hocko Oct. 12, 2021, 7:36 a.m. UTC | #54
On Mon 11-10-21 18:20:25, Suren Baghdasaryan wrote:
> On Mon, Oct 11, 2021 at 6:18 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Mon, Oct 11, 2021 at 1:36 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 08-10-21 13:58:01, Kees Cook wrote:
> > > > - Strings for "anon" specifically have no required format (this is good)
> > > >   it's informational like the task_struct::comm and can (roughly)
> > > >   anything. There's no naming convention for memfds, AF_UNIX, etc. Why
> > > >   is one needed here? That seems like a completely unreasonable
> > > >   requirement.
> > >
> > > I might be misreading the justification for the feature. Patch 2 is
> > > talking about tools that need to understand memeory usage to make
> > > further actions. Also Suren was suggesting "numbering convetion" as an
> > > argument against.
> > >
> > > So can we get a clear example how is this being used actually? If this
> > > is just to be used to debug by humans than I can see an argument for
> > > human readable form. If this is, however, meant to be used by tools to
> > > make some actions then the argument for strings is much weaker.
> >
> > The simplest usecase is when we notice that a process consumes more
> > memory than usual and we do "cat /proc/$(pidof my_process)/maps" to
> > check which area is contributing to this growth. The names we assign
> > to anonymous areas are descriptive enough for a developer to get an
> > idea where the increased consumption is coming from and how to proceed
> > with their investigation.
> > There are of course cases when tools are involved, but the end-user is
> > always a human and the final report should contain easily
> > understandable data.

OK, it would have been much more preferable to be explicit about this
main use case from the very beginning. Just to make sure we are at the
same page. Is the primary usecase usage and bug reporting?

My initial understanding was that at userspace managed memory management
could make an educated guess about targeted reclaim (e.g. MADV_{FREE,COLD,PAGEOUT}
for cached data in memory like uncompressed images/data). Such a usecase
would clearly require a standardized id/naming convention to be
application neutral.

> > IIUC, the main argument here is whether the userspace can provide
> > tools to perform the translations between ids and names, with the
> > kernel accepting and reporting ids instead of strings. Technically
> > it's possible, but to be practical that conversion should be fast
> > because we will need to make name->id conversion potentially for each
> > mmap. On the consumer side the performance is not as critical, but the
> > fact that instead of dumping /proc/$pid/maps we will have to parse the
> > file, do id->name conversion and replace all [anon:id] with
> > [anon:name] would be an issue when we do that in bulk, for example
> > when collecting system-wide data for a bugreport.

Whether you use ids or human readable strings you still have to
understand the underlying meaning to make any educated guess. Let me
give you an example. Say I have an application with a memory leak. Right
now I can only tell that it is anonymous memory growing but it is not
clear who uses that anonymous. You are adding a means to tell different
users appart. That is really helpful. Now I know this is an anon
user 1234 or MySuperAnonMemory. Neither of the will not tell me more
without a id/naming convention or reading the code. A convention can be
useful for the most common users (e.g. a specific allocator) but I am
rather dubious there are many more that would be _generally_ recognized
without some understanding of the said application.

Maybe the situation in Android is different because the runtime is more
coupled but is it reasonable to expect any common naming conventions for
general Linux platforms?

I am slightly worried that we have spent way too much time talking
specifics about id->name translation rather than the actual usability
of the token.
David Hildenbrand Oct. 12, 2021, 7:43 a.m. UTC | #55
> I'm still evaluating the proposal to use memfds but I'm not sure if
> the issue that David Hildenbrand mentioned about additional memory
> consumed in pagecache (which has to be addressed) is the only one we
> will encounter with this approach. If anyone knows of any potential
> issues with using memfds as named anonymous memory, I would really
> appreciate your feedback before I go too far in that direction.

[MAP_PRIVATE memfd only behave that way with 4k, not with huge pages, so 
I think it just has to be fixed. It doesn't make any sense to allocate a 
page for the pagecache ("populate the file") when accessing via a 
private mapping that's supposed to leave the file untouched]

My gut feeling is if you really need a string as identifier, then try 
going with memfds. Yes, we might hit some road blocks to be sorted out, 
but it just logically makes sense to me: Files have names. These names 
exist before mapping and after mapping. They "name" the content.

Maybe it's just me, but the whole interface, setting the name via a 
prctl after the mapping was already instantiated doesn't really spark 
joy at my end. That's not a strong pushback, but if we can avoid it 
using something that's already there, that would be very much preferred.
Suren Baghdasaryan Oct. 12, 2021, 4:50 p.m. UTC | #56
On Tue, Oct 12, 2021 at 12:37 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 11-10-21 18:20:25, Suren Baghdasaryan wrote:
> > On Mon, Oct 11, 2021 at 6:18 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Mon, Oct 11, 2021 at 1:36 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Fri 08-10-21 13:58:01, Kees Cook wrote:
> > > > > - Strings for "anon" specifically have no required format (this is good)
> > > > >   it's informational like the task_struct::comm and can (roughly)
> > > > >   anything. There's no naming convention for memfds, AF_UNIX, etc. Why
> > > > >   is one needed here? That seems like a completely unreasonable
> > > > >   requirement.
> > > >
> > > > I might be misreading the justification for the feature. Patch 2 is
> > > > talking about tools that need to understand memeory usage to make
> > > > further actions. Also Suren was suggesting "numbering convetion" as an
> > > > argument against.
> > > >
> > > > So can we get a clear example how is this being used actually? If this
> > > > is just to be used to debug by humans than I can see an argument for
> > > > human readable form. If this is, however, meant to be used by tools to
> > > > make some actions then the argument for strings is much weaker.
> > >
> > > The simplest usecase is when we notice that a process consumes more
> > > memory than usual and we do "cat /proc/$(pidof my_process)/maps" to
> > > check which area is contributing to this growth. The names we assign
> > > to anonymous areas are descriptive enough for a developer to get an
> > > idea where the increased consumption is coming from and how to proceed
> > > with their investigation.
> > > There are of course cases when tools are involved, but the end-user is
> > > always a human and the final report should contain easily
> > > understandable data.
>
> OK, it would have been much more preferable to be explicit about this
> main use case from the very beginning. Just to make sure we are at the
> same page. Is the primary usecase usage and bug reporting?

Sorry, I should have spent more time on patch #2 description. Yes,
debugging memory issues is the primary usecase. In fact that's the
only usecase in Android AFAIK.

>
> My initial understanding was that at userspace managed memory management
> could make an educated guess about targeted reclaim (e.g. MADV_{FREE,COLD,PAGEOUT}
> for cached data in memory like uncompressed images/data). Such a usecase
> would clearly require a standardized id/naming convention to be
> application neutral.

Ah, now I understand your angle. Our prior work on process_madvise()
probably helped in leading your thoughts in this direction :) Sorry
about the confusion.

>
> > > IIUC, the main argument here is whether the userspace can provide
> > > tools to perform the translations between ids and names, with the
> > > kernel accepting and reporting ids instead of strings. Technically
> > > it's possible, but to be practical that conversion should be fast
> > > because we will need to make name->id conversion potentially for each
> > > mmap. On the consumer side the performance is not as critical, but the
> > > fact that instead of dumping /proc/$pid/maps we will have to parse the
> > > file, do id->name conversion and replace all [anon:id] with
> > > [anon:name] would be an issue when we do that in bulk, for example
> > > when collecting system-wide data for a bugreport.
>
> Whether you use ids or human readable strings you still have to
> understand the underlying meaning to make any educated guess. Let me
> give you an example. Say I have an application with a memory leak. Right
> now I can only tell that it is anonymous memory growing but it is not
> clear who uses that anonymous. You are adding a means to tell different
> users appart. That is really helpful. Now I know this is an anon
> user 1234 or MySuperAnonMemory. Neither of the will not tell me more
> without a id/naming convention or reading the code. A convention can be
> useful for the most common users (e.g. a specific allocator) but I am
> rather dubious there are many more that would be _generally_ recognized
> without some understanding of the said application.

I guess an example would be better to clarify this. Here are some vma
names from Google maps app:

[anon:dalvik-main space (region space)]
[anon:dalvik-/apex/com.android.art/javalib/boot.art]
[anon:dalvik-/apex/com.android.art/javalib/boot-apache-xml.art]
[anon:.bss]
[anon:dalvik-zygote space]
[anon:dalvik-non moving space]
[anon:dalvik-free list large object space]
[anon:dalvik-/product/app/Maps/oat/arm64/Maps.art]
[anon:stack_and_tls:20792]
[anon:stack_and_tls:20791]
[anon:dalvik-LinearAlloc]
[anon:dalvik-CompilerMetadata]
[anon:dalvik-indirect ref table]
[anon:dalvik-live stack]
[anon:dalvik-allocation stack]
[anon:dalvik-large object free list space allocation info map]
[anon:scudo:primary]
[anon:scudo:secondary]
[anon:bionic_alloc_small_objects]

Most of them have names standard for Android and can be recognized by
developers and even Android framework (example where "anon:dalvik-main
space" and other standard names are being parsed:
https://cs.android.com/android/platform/superproject/+/master:frameworks/base/core/jni/android_os_Debug.cpp;l=340).
Names like "anon:dalvik-/apex/com.android.art/javalib/boot.art" help
the developer to recognize the component responsible for the memory.
Names like "anon:stack_and_tls:20792" include the TID of the thread
which uses this memory. All this information can help in narrowing
down memory consumption investigation. Hopefully these examples
clarify the usage a bit better?

>
> Maybe the situation in Android is different because the runtime is more
> coupled but is it reasonable to expect any common naming conventions for
> general Linux platforms?

Well, to be useful the system would have to agree to *some* convention I guess.

>
> I am slightly worried that we have spent way too much time talking
> specifics about id->name translation rather than the actual usability
> of the token.

Agree. I'll try to avoid further confusions.
Thanks!

> --
> Michal Hocko
> SUSE Labs
Suren Baghdasaryan Oct. 12, 2021, 5:01 p.m. UTC | #57
On Tue, Oct 12, 2021 at 12:44 AM David Hildenbrand <david@redhat.com> wrote:
>
> > I'm still evaluating the proposal to use memfds but I'm not sure if
> > the issue that David Hildenbrand mentioned about additional memory
> > consumed in pagecache (which has to be addressed) is the only one we
> > will encounter with this approach. If anyone knows of any potential
> > issues with using memfds as named anonymous memory, I would really
> > appreciate your feedback before I go too far in that direction.
>
> [MAP_PRIVATE memfd only behave that way with 4k, not with huge pages, so
> I think it just has to be fixed. It doesn't make any sense to allocate a
> page for the pagecache ("populate the file") when accessing via a
> private mapping that's supposed to leave the file untouched]
>
> My gut feeling is if you really need a string as identifier, then try
> going with memfds. Yes, we might hit some road blocks to be sorted out,
> but it just logically makes sense to me: Files have names. These names
> exist before mapping and after mapping. They "name" the content.

I'm investigating this direction. I don't have much background with
memfds, so I'll need to digest the code first.

>
> Maybe it's just me, but the whole interface, setting the name via a
> prctl after the mapping was already instantiated doesn't really spark
> joy at my end. That's not a strong pushback, but if we can avoid it
> using something that's already there, that would be very much preferred.

Actually that's one of my worries about using memfds. There might be
cases when we need to name a vma after it was mapped. memfd_create()
would not allow us to do that AFAIKT. But I need to check all usages
to say if that's really an issue.
Thanks!

>
> --
> Thanks,
>
> David / dhildenb
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Johannes Weiner Oct. 12, 2021, 6:26 p.m. UTC | #58
On Mon, Oct 11, 2021 at 10:36:24PM -0700, Suren Baghdasaryan wrote:
> On Mon, Oct 11, 2021 at 8:00 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Mon, Oct 11, 2021 at 06:20:25PM -0700, Suren Baghdasaryan wrote:
> > > On Mon, Oct 11, 2021 at 6:18 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Mon, Oct 11, 2021 at 1:36 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Fri 08-10-21 13:58:01, Kees Cook wrote:
> > > > > > - Strings for "anon" specifically have no required format (this is good)
> > > > > >   it's informational like the task_struct::comm and can (roughly)
> > > > > >   anything. There's no naming convention for memfds, AF_UNIX, etc. Why
> > > > > >   is one needed here? That seems like a completely unreasonable
> > > > > >   requirement.
> > > > >
> > > > > I might be misreading the justification for the feature. Patch 2 is
> > > > > talking about tools that need to understand memeory usage to make
> > > > > further actions. Also Suren was suggesting "numbering convetion" as an
> > > > > argument against.
> > > > >
> > > > > So can we get a clear example how is this being used actually? If this
> > > > > is just to be used to debug by humans than I can see an argument for
> > > > > human readable form. If this is, however, meant to be used by tools to
> > > > > make some actions then the argument for strings is much weaker.
> > > >
> > > > The simplest usecase is when we notice that a process consumes more
> > > > memory than usual and we do "cat /proc/$(pidof my_process)/maps" to
> > > > check which area is contributing to this growth. The names we assign
> > > > to anonymous areas are descriptive enough for a developer to get an
> > > > idea where the increased consumption is coming from and how to proceed
> > > > with their investigation.
> > > > There are of course cases when tools are involved, but the end-user is
> > > > always a human and the final report should contain easily
> > > > understandable data.
> > > >
> > > > IIUC, the main argument here is whether the userspace can provide
> > > > tools to perform the translations between ids and names, with the
> > > > kernel accepting and reporting ids instead of strings. Technically
> > > > it's possible, but to be practical that conversion should be fast
> > > > because we will need to make name->id conversion potentially for each
> > > > mmap. On the consumer side the performance is not as critical, but the
> > > > fact that instead of dumping /proc/$pid/maps we will have to parse the
> > > > file, do id->name conversion and replace all [anon:id] with
> > > > [anon:name] would be an issue when we do that in bulk, for example
> > > > when collecting system-wide data for a bugreport.
> >
> > Is that something you need to do client-side? Or could the bug tool
> > upload the userspace-maintained name:ids database alongside the
> > /proc/pid/maps dump for external processing?
> 
> You can generate a bugreport and analyze it locally or submit it as an
> attachment to a bug for further analyzes.
> Sure, we can attach the id->name conversion table to the bugreport but
> either way, some tool would have to post-process it to resolve the
> ids. If we are not analyzing the results immediately then that step
> can be postponed and I think that's what you mean? If so, then yes,
> that is correct.

Right, somebody needs to do it at some point, but I suppose it's less
of a problem if a developer machine does it than a mobile device.

One advantage of an ID over a string - besides not having to maintain
a deduplicating arbitrary string storage in the kernel - is that we
may be able to auto-assign unique IDs to VMAs in the kernel, in a way
that we could not with strings. You'd still have to do IPC calls to
write new name mappings into your db, but you wouldn't have to do the
prctl() to assign stuff in the kernel at all.

(We'd have to think of a solution of how IDs work with vma merging and
splitting, but I think to a certain degree that's policy and we should
be able to find something workable - a MAP_ID flag, using anon_vma as
identity, assigning IDs at mmap time and do merges only for protection
changes etc. etc.)
Suren Baghdasaryan Oct. 12, 2021, 6:52 p.m. UTC | #59
On Tue, Oct 12, 2021 at 11:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Oct 11, 2021 at 10:36:24PM -0700, Suren Baghdasaryan wrote:
> > On Mon, Oct 11, 2021 at 8:00 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Mon, Oct 11, 2021 at 06:20:25PM -0700, Suren Baghdasaryan wrote:
> > > > On Mon, Oct 11, 2021 at 6:18 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > >
> > > > > On Mon, Oct 11, 2021 at 1:36 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Fri 08-10-21 13:58:01, Kees Cook wrote:
> > > > > > > - Strings for "anon" specifically have no required format (this is good)
> > > > > > >   it's informational like the task_struct::comm and can (roughly)
> > > > > > >   anything. There's no naming convention for memfds, AF_UNIX, etc. Why
> > > > > > >   is one needed here? That seems like a completely unreasonable
> > > > > > >   requirement.
> > > > > >
> > > > > > I might be misreading the justification for the feature. Patch 2 is
> > > > > > talking about tools that need to understand memeory usage to make
> > > > > > further actions. Also Suren was suggesting "numbering convetion" as an
> > > > > > argument against.
> > > > > >
> > > > > > So can we get a clear example how is this being used actually? If this
> > > > > > is just to be used to debug by humans than I can see an argument for
> > > > > > human readable form. If this is, however, meant to be used by tools to
> > > > > > make some actions then the argument for strings is much weaker.
> > > > >
> > > > > The simplest usecase is when we notice that a process consumes more
> > > > > memory than usual and we do "cat /proc/$(pidof my_process)/maps" to
> > > > > check which area is contributing to this growth. The names we assign
> > > > > to anonymous areas are descriptive enough for a developer to get an
> > > > > idea where the increased consumption is coming from and how to proceed
> > > > > with their investigation.
> > > > > There are of course cases when tools are involved, but the end-user is
> > > > > always a human and the final report should contain easily
> > > > > understandable data.
> > > > >
> > > > > IIUC, the main argument here is whether the userspace can provide
> > > > > tools to perform the translations between ids and names, with the
> > > > > kernel accepting and reporting ids instead of strings. Technically
> > > > > it's possible, but to be practical that conversion should be fast
> > > > > because we will need to make name->id conversion potentially for each
> > > > > mmap. On the consumer side the performance is not as critical, but the
> > > > > fact that instead of dumping /proc/$pid/maps we will have to parse the
> > > > > file, do id->name conversion and replace all [anon:id] with
> > > > > [anon:name] would be an issue when we do that in bulk, for example
> > > > > when collecting system-wide data for a bugreport.
> > >
> > > Is that something you need to do client-side? Or could the bug tool
> > > upload the userspace-maintained name:ids database alongside the
> > > /proc/pid/maps dump for external processing?
> >
> > You can generate a bugreport and analyze it locally or submit it as an
> > attachment to a bug for further analyzes.
> > Sure, we can attach the id->name conversion table to the bugreport but
> > either way, some tool would have to post-process it to resolve the
> > ids. If we are not analyzing the results immediately then that step
> > can be postponed and I think that's what you mean? If so, then yes,
> > that is correct.
>
> Right, somebody needs to do it at some point, but I suppose it's less
> of a problem if a developer machine does it than a mobile device.

True, and that's why I mentioned that it's not as critical as the
efficiency at mmap() time. In any case, if we could avoid translations
at all that would be ideal.

>
> One advantage of an ID over a string - besides not having to maintain
> a deduplicating arbitrary string storage in the kernel - is that we
> may be able to auto-assign unique IDs to VMAs in the kernel, in a way
> that we could not with strings. You'd still have to do IPC calls to
> write new name mappings into your db, but you wouldn't have to do the
> prctl() to assign stuff in the kernel at all.

You still have to retrieve that tag from the kernel to record it in
your db, so this would still require some syscall, no?

> (We'd have to think of a solution of how IDs work with vma merging and
> splitting, but I think to a certain degree that's policy and we should
> be able to find something workable - a MAP_ID flag, using anon_vma as
> identity, assigning IDs at mmap time and do merges only for protection
> changes etc. etc.)

Overall, I think keeping the kernel out of this and letting it treat
this tag as a cookie which only userspace cares about is simpler.
Unless you see other uses where kernel's involvement is needed.
Johannes Weiner Oct. 12, 2021, 8:41 p.m. UTC | #60
On Tue, Oct 12, 2021 at 11:52:42AM -0700, Suren Baghdasaryan wrote:
> On Tue, Oct 12, 2021 at 11:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Mon, Oct 11, 2021 at 10:36:24PM -0700, Suren Baghdasaryan wrote:
> > > On Mon, Oct 11, 2021 at 8:00 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > On Mon, Oct 11, 2021 at 06:20:25PM -0700, Suren Baghdasaryan wrote:
> > > > > On Mon, Oct 11, 2021 at 6:18 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > >
> > > > > > On Mon, Oct 11, 2021 at 1:36 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Fri 08-10-21 13:58:01, Kees Cook wrote:
> > > > > > > > - Strings for "anon" specifically have no required format (this is good)
> > > > > > > >   it's informational like the task_struct::comm and can (roughly)
> > > > > > > >   anything. There's no naming convention for memfds, AF_UNIX, etc. Why
> > > > > > > >   is one needed here? That seems like a completely unreasonable
> > > > > > > >   requirement.
> > > > > > >
> > > > > > > I might be misreading the justification for the feature. Patch 2 is
> > > > > > > talking about tools that need to understand memeory usage to make
> > > > > > > further actions. Also Suren was suggesting "numbering convetion" as an
> > > > > > > argument against.
> > > > > > >
> > > > > > > So can we get a clear example how is this being used actually? If this
> > > > > > > is just to be used to debug by humans than I can see an argument for
> > > > > > > human readable form. If this is, however, meant to be used by tools to
> > > > > > > make some actions then the argument for strings is much weaker.
> > > > > >
> > > > > > The simplest usecase is when we notice that a process consumes more
> > > > > > memory than usual and we do "cat /proc/$(pidof my_process)/maps" to
> > > > > > check which area is contributing to this growth. The names we assign
> > > > > > to anonymous areas are descriptive enough for a developer to get an
> > > > > > idea where the increased consumption is coming from and how to proceed
> > > > > > with their investigation.
> > > > > > There are of course cases when tools are involved, but the end-user is
> > > > > > always a human and the final report should contain easily
> > > > > > understandable data.
> > > > > >
> > > > > > IIUC, the main argument here is whether the userspace can provide
> > > > > > tools to perform the translations between ids and names, with the
> > > > > > kernel accepting and reporting ids instead of strings. Technically
> > > > > > it's possible, but to be practical that conversion should be fast
> > > > > > because we will need to make name->id conversion potentially for each
> > > > > > mmap. On the consumer side the performance is not as critical, but the
> > > > > > fact that instead of dumping /proc/$pid/maps we will have to parse the
> > > > > > file, do id->name conversion and replace all [anon:id] with
> > > > > > [anon:name] would be an issue when we do that in bulk, for example
> > > > > > when collecting system-wide data for a bugreport.
> > > >
> > > > Is that something you need to do client-side? Or could the bug tool
> > > > upload the userspace-maintained name:ids database alongside the
> > > > /proc/pid/maps dump for external processing?
> > >
> > > You can generate a bugreport and analyze it locally or submit it as an
> > > attachment to a bug for further analyzes.
> > > Sure, we can attach the id->name conversion table to the bugreport but
> > > either way, some tool would have to post-process it to resolve the
> > > ids. If we are not analyzing the results immediately then that step
> > > can be postponed and I think that's what you mean? If so, then yes,
> > > that is correct.
> >
> > Right, somebody needs to do it at some point, but I suppose it's less
> > of a problem if a developer machine does it than a mobile device.
> 
> True, and that's why I mentioned that it's not as critical as the
> efficiency at mmap() time. In any case, if we could avoid translations
> at all that would be ideal.
> 
> >
> > One advantage of an ID over a string - besides not having to maintain
> > a deduplicating arbitrary string storage in the kernel - is that we
> > may be able to auto-assign unique IDs to VMAs in the kernel, in a way
> > that we could not with strings. You'd still have to do IPC calls to
> > write new name mappings into your db, but you wouldn't have to do the
> > prctl() to assign stuff in the kernel at all.
> 
> You still have to retrieve that tag from the kernel to record it in
> your db, so this would still require some syscall, no?

Don't you have to do this with the string setting interface as well?
How do you know the vma address to pass into the prctl()? Is this
somehow coordinated with the mmap()?

> > (We'd have to think of a solution of how IDs work with vma merging and
> > splitting, but I think to a certain degree that's policy and we should
> > be able to find something workable - a MAP_ID flag, using anon_vma as
> > identity, assigning IDs at mmap time and do merges only for protection
> > changes etc. etc.)
> 
> Overall, I think keeping the kernel out of this and letting it treat
> this tag as a cookie which only userspace cares about is simpler.
> Unless you see other uses where kernel's involvement is needed.

It depends on what you consider keeping the kernel out of it. A small
extension to assign unique IDs to mappings automatically in an
intuitive way (with a compat option to disable) is a much smaller ABI
commitment than a prctl()-controlled string storage.

When I say policy on how to assign the ID, I didn't mean that it
should be a free for all. Rather that we should pick one reasonable
way to do it, comparable to picking the parameters for how long the
stored strings could be, which characters to allow etc.
Suren Baghdasaryan Oct. 12, 2021, 8:59 p.m. UTC | #61
On Tue, Oct 12, 2021 at 1:41 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Oct 12, 2021 at 11:52:42AM -0700, Suren Baghdasaryan wrote:
> > On Tue, Oct 12, 2021 at 11:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Mon, Oct 11, 2021 at 10:36:24PM -0700, Suren Baghdasaryan wrote:
> > > > On Mon, Oct 11, 2021 at 8:00 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > >
> > > > > On Mon, Oct 11, 2021 at 06:20:25PM -0700, Suren Baghdasaryan wrote:
> > > > > > On Mon, Oct 11, 2021 at 6:18 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > >
> > > > > > > On Mon, Oct 11, 2021 at 1:36 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > >
> > > > > > > > On Fri 08-10-21 13:58:01, Kees Cook wrote:
> > > > > > > > > - Strings for "anon" specifically have no required format (this is good)
> > > > > > > > >   it's informational like the task_struct::comm and can (roughly)
> > > > > > > > >   anything. There's no naming convention for memfds, AF_UNIX, etc. Why
> > > > > > > > >   is one needed here? That seems like a completely unreasonable
> > > > > > > > >   requirement.
> > > > > > > >
> > > > > > > > I might be misreading the justification for the feature. Patch 2 is
> > > > > > > > talking about tools that need to understand memeory usage to make
> > > > > > > > further actions. Also Suren was suggesting "numbering convetion" as an
> > > > > > > > argument against.
> > > > > > > >
> > > > > > > > So can we get a clear example how is this being used actually? If this
> > > > > > > > is just to be used to debug by humans than I can see an argument for
> > > > > > > > human readable form. If this is, however, meant to be used by tools to
> > > > > > > > make some actions then the argument for strings is much weaker.
> > > > > > >
> > > > > > > The simplest usecase is when we notice that a process consumes more
> > > > > > > memory than usual and we do "cat /proc/$(pidof my_process)/maps" to
> > > > > > > check which area is contributing to this growth. The names we assign
> > > > > > > to anonymous areas are descriptive enough for a developer to get an
> > > > > > > idea where the increased consumption is coming from and how to proceed
> > > > > > > with their investigation.
> > > > > > > There are of course cases when tools are involved, but the end-user is
> > > > > > > always a human and the final report should contain easily
> > > > > > > understandable data.
> > > > > > >
> > > > > > > IIUC, the main argument here is whether the userspace can provide
> > > > > > > tools to perform the translations between ids and names, with the
> > > > > > > kernel accepting and reporting ids instead of strings. Technically
> > > > > > > it's possible, but to be practical that conversion should be fast
> > > > > > > because we will need to make name->id conversion potentially for each
> > > > > > > mmap. On the consumer side the performance is not as critical, but the
> > > > > > > fact that instead of dumping /proc/$pid/maps we will have to parse the
> > > > > > > file, do id->name conversion and replace all [anon:id] with
> > > > > > > [anon:name] would be an issue when we do that in bulk, for example
> > > > > > > when collecting system-wide data for a bugreport.
> > > > >
> > > > > Is that something you need to do client-side? Or could the bug tool
> > > > > upload the userspace-maintained name:ids database alongside the
> > > > > /proc/pid/maps dump for external processing?
> > > >
> > > > You can generate a bugreport and analyze it locally or submit it as an
> > > > attachment to a bug for further analyzes.
> > > > Sure, we can attach the id->name conversion table to the bugreport but
> > > > either way, some tool would have to post-process it to resolve the
> > > > ids. If we are not analyzing the results immediately then that step
> > > > can be postponed and I think that's what you mean? If so, then yes,
> > > > that is correct.
> > >
> > > Right, somebody needs to do it at some point, but I suppose it's less
> > > of a problem if a developer machine does it than a mobile device.
> >
> > True, and that's why I mentioned that it's not as critical as the
> > efficiency at mmap() time. In any case, if we could avoid translations
> > at all that would be ideal.
> >
> > >
> > > One advantage of an ID over a string - besides not having to maintain
> > > a deduplicating arbitrary string storage in the kernel - is that we
> > > may be able to auto-assign unique IDs to VMAs in the kernel, in a way
> > > that we could not with strings. You'd still have to do IPC calls to
> > > write new name mappings into your db, but you wouldn't have to do the
> > > prctl() to assign stuff in the kernel at all.
> >
> > You still have to retrieve that tag from the kernel to record it in
> > your db, so this would still require some syscall, no?
>
> Don't you have to do this with the string setting interface as well?
> How do you know the vma address to pass into the prctl()? Is this
> somehow coordinated with the mmap()?

Sure. The sequence is:

ptr = mmap(NULL, size, ...);
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ptr, size, name);

>
> > > (We'd have to think of a solution of how IDs work with vma merging and
> > > splitting, but I think to a certain degree that's policy and we should
> > > be able to find something workable - a MAP_ID flag, using anon_vma as
> > > identity, assigning IDs at mmap time and do merges only for protection
> > > changes etc. etc.)
> >
> > Overall, I think keeping the kernel out of this and letting it treat
> > this tag as a cookie which only userspace cares about is simpler.
> > Unless you see other uses where kernel's involvement is needed.
>
> It depends on what you consider keeping the kernel out of it. A small
> extension to assign unique IDs to mappings automatically in an
> intuitive way (with a compat option to disable) is a much smaller ABI
> commitment than a prctl()-controlled string storage.

I'm not saying it's hard or complex. I just don't see the advantage of
generating these IDs in the kernel vs passing them from userspace.
Maybe I'm missing some usecase?

> When I say policy on how to assign the ID, I didn't mean that it
> should be a free for all. Rather that we should pick one reasonable
> way to do it, comparable to picking the parameters for how long the
> stored strings could be, which characters to allow etc.
Suren Baghdasaryan Oct. 14, 2021, 8:16 p.m. UTC | #62
On Tue, Oct 12, 2021 at 10:01 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Oct 12, 2021 at 12:44 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > > I'm still evaluating the proposal to use memfds but I'm not sure if
> > > the issue that David Hildenbrand mentioned about additional memory
> > > consumed in pagecache (which has to be addressed) is the only one we
> > > will encounter with this approach. If anyone knows of any potential
> > > issues with using memfds as named anonymous memory, I would really
> > > appreciate your feedback before I go too far in that direction.
> >
> > [MAP_PRIVATE memfd only behave that way with 4k, not with huge pages, so
> > I think it just has to be fixed. It doesn't make any sense to allocate a
> > page for the pagecache ("populate the file") when accessing via a
> > private mapping that's supposed to leave the file untouched]
> >
> > My gut feeling is if you really need a string as identifier, then try
> > going with memfds. Yes, we might hit some road blocks to be sorted out,
> > but it just logically makes sense to me: Files have names. These names
> > exist before mapping and after mapping. They "name" the content.
>
> I'm investigating this direction. I don't have much background with
> memfds, so I'll need to digest the code first.

I've done some investigation into the possibility of using memfds to
name anonymous VMAs. Here are my findings:

1. Forking a process with anonymous vmas named using memfd is 5-15%
slower than with prctl (depends on the number of VMAs in the process
being forked). Profiling shows that i_mmap_lock_write() dominates
dup_mmap(). Exit path is also slower by roughly 9% with
free_pgtables() and fput() dominating exit_mmap(). Fork performance is
important for Android because almost all processes are forked from
zygote, therefore this limitation already makes this approach
prohibitive.

2. mremap() usage to grow the mapping has an issue when used with memfds:

fd = memfd_create(name, MFD_ALLOW_SEALING);
ftruncate(fd, size_bytes);
ptr = mmap(NULL, size_bytes, prot, MAP_PRIVATE, fd, 0);
close(fd);
ptr = mremap(ptr, size_bytes, size_bytes * 2, MREMAP_MAYMOVE);
touch_mem(ptr, size_bytes * 2);

This would generate a SIGBUS in touch_mem(). I believe it's because
ftruncate() specified the size to be size_bytes and we are accessing
more than that after remapping. prctl() does not have this limitation
and we do have a usecase for growing a named VMA.

3. Leaves an fd exposed, even briefly, which may lead to unexpected
flaws (e.g. anything using mmap MAP_SHARED could allow exposures or
overwrites). Even MAP_PRIVATE, if an attacker writes into the file
after ftruncate() and before mmap(), can cause private memory to be
initialized with unexpected data.

4. There is a usecase in the Android userspace where vma naming
happens after memory was allocated. Bionic linker does in-memory
relocations and then names some relocated sections.

In the light of these findings, could the current patchset be reconsidered?
Thanks,
Suren.


>
> >
> > Maybe it's just me, but the whole interface, setting the name via a
> > prctl after the mapping was already instantiated doesn't really spark
> > joy at my end. That's not a strong pushback, but if we can avoid it
> > using something that's already there, that would be very much preferred.
>
> Actually that's one of my worries about using memfds. There might be
> cases when we need to name a vma after it was mapped. memfd_create()
> would not allow us to do that AFAIKT. But I need to check all usages
> to say if that's really an issue.
> Thanks!
>
> >
> > --
> > Thanks,
> >
> > David / dhildenb
> >
> > --
> > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
> >
David Hildenbrand Oct. 15, 2021, 8:03 a.m. UTC | #63
On 14.10.21 22:16, Suren Baghdasaryan wrote:
> On Tue, Oct 12, 2021 at 10:01 AM Suren Baghdasaryan <surenb@google.com> wrote:
>>
>> On Tue, Oct 12, 2021 at 12:44 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>>> I'm still evaluating the proposal to use memfds but I'm not sure if
>>>> the issue that David Hildenbrand mentioned about additional memory
>>>> consumed in pagecache (which has to be addressed) is the only one we
>>>> will encounter with this approach. If anyone knows of any potential
>>>> issues with using memfds as named anonymous memory, I would really
>>>> appreciate your feedback before I go too far in that direction.
>>>
>>> [MAP_PRIVATE memfd only behave that way with 4k, not with huge pages, so
>>> I think it just has to be fixed. It doesn't make any sense to allocate a
>>> page for the pagecache ("populate the file") when accessing via a
>>> private mapping that's supposed to leave the file untouched]
>>>
>>> My gut feeling is if you really need a string as identifier, then try
>>> going with memfds. Yes, we might hit some road blocks to be sorted out,
>>> but it just logically makes sense to me: Files have names. These names
>>> exist before mapping and after mapping. They "name" the content.
>>
>> I'm investigating this direction. I don't have much background with
>> memfds, so I'll need to digest the code first.
> 
> I've done some investigation into the possibility of using memfds to
> name anonymous VMAs. Here are my findings:

Thanks for exploring the alternatives!

> 
> 1. Forking a process with anonymous vmas named using memfd is 5-15%
> slower than with prctl (depends on the number of VMAs in the process
> being forked). Profiling shows that i_mmap_lock_write() dominates
> dup_mmap(). Exit path is also slower by roughly 9% with
> free_pgtables() and fput() dominating exit_mmap(). Fork performance is
> important for Android because almost all processes are forked from
> zygote, therefore this limitation already makes this approach
> prohibitive.

Interesting, naturally I wonder if that can be optimized.

> 
> 2. mremap() usage to grow the mapping has an issue when used with memfds:
> 
> fd = memfd_create(name, MFD_ALLOW_SEALING);
> ftruncate(fd, size_bytes);
> ptr = mmap(NULL, size_bytes, prot, MAP_PRIVATE, fd, 0);
> close(fd);
> ptr = mremap(ptr, size_bytes, size_bytes * 2, MREMAP_MAYMOVE);
> touch_mem(ptr, size_bytes * 2);
> 
> This would generate a SIGBUS in touch_mem(). I believe it's because
> ftruncate() specified the size to be size_bytes and we are accessing
> more than that after remapping. prctl() does not have this limitation
> and we do have a usecase for growing a named VMA.

Can't you simply size the memfd much larger? I mean, it doesn't really
cost much, does it?

> 
> 3. Leaves an fd exposed, even briefly, which may lead to unexpected
> flaws (e.g. anything using mmap MAP_SHARED could allow exposures or
> overwrites). Even MAP_PRIVATE, if an attacker writes into the file
> after ftruncate() and before mmap(), can cause private memory to be
> initialized with unexpected data.

I don't quite follow. Can you elaborate what exactly the issue here is?
We use a temporary fd, yes, but how is that a problem?

Any attacker can just write any random memory memory in the address
space, so I don't see the issue.

> 
> 4. There is a usecase in the Android userspace where vma naming
> happens after memory was allocated. Bionic linker does in-memory
> relocations and then names some relocated sections.

Would renaming a memfd be an option or is that "too late" ?

> 
> In the light of these findings, could the current patchset be reconsidered?
> Thanks,
> Suren.
>
Suren Baghdasaryan Oct. 15, 2021, 4:30 p.m. UTC | #64
On Fri, Oct 15, 2021 at 1:04 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 14.10.21 22:16, Suren Baghdasaryan wrote:
> > On Tue, Oct 12, 2021 at 10:01 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >>
> >> On Tue, Oct 12, 2021 at 12:44 AM David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>>> I'm still evaluating the proposal to use memfds but I'm not sure if
> >>>> the issue that David Hildenbrand mentioned about additional memory
> >>>> consumed in pagecache (which has to be addressed) is the only one we
> >>>> will encounter with this approach. If anyone knows of any potential
> >>>> issues with using memfds as named anonymous memory, I would really
> >>>> appreciate your feedback before I go too far in that direction.
> >>>
> >>> [MAP_PRIVATE memfd only behave that way with 4k, not with huge pages, so
> >>> I think it just has to be fixed. It doesn't make any sense to allocate a
> >>> page for the pagecache ("populate the file") when accessing via a
> >>> private mapping that's supposed to leave the file untouched]
> >>>
> >>> My gut feeling is if you really need a string as identifier, then try
> >>> going with memfds. Yes, we might hit some road blocks to be sorted out,
> >>> but it just logically makes sense to me: Files have names. These names
> >>> exist before mapping and after mapping. They "name" the content.
> >>
> >> I'm investigating this direction. I don't have much background with
> >> memfds, so I'll need to digest the code first.
> >
> > I've done some investigation into the possibility of using memfds to
> > name anonymous VMAs. Here are my findings:
>
> Thanks for exploring the alternatives!

Thanks for pointing to them!

>
> >
> > 1. Forking a process with anonymous vmas named using memfd is 5-15%
> > slower than with prctl (depends on the number of VMAs in the process
> > being forked). Profiling shows that i_mmap_lock_write() dominates
> > dup_mmap(). Exit path is also slower by roughly 9% with
> > free_pgtables() and fput() dominating exit_mmap(). Fork performance is
> > important for Android because almost all processes are forked from
> > zygote, therefore this limitation already makes this approach
> > prohibitive.
>
> Interesting, naturally I wonder if that can be optimized.

Maybe but it looks like we simply do additional things for file-backed
memory, which seems natural. The call to i_mmap_lock_write() is from
here: https://elixir.bootlin.com/linux/latest/source/kernel/fork.c#L565

>
> >
> > 2. mremap() usage to grow the mapping has an issue when used with memfds:
> >
> > fd = memfd_create(name, MFD_ALLOW_SEALING);
> > ftruncate(fd, size_bytes);
> > ptr = mmap(NULL, size_bytes, prot, MAP_PRIVATE, fd, 0);
> > close(fd);
> > ptr = mremap(ptr, size_bytes, size_bytes * 2, MREMAP_MAYMOVE);
> > touch_mem(ptr, size_bytes * 2);
> >
> > This would generate a SIGBUS in touch_mem(). I believe it's because
> > ftruncate() specified the size to be size_bytes and we are accessing
> > more than that after remapping. prctl() does not have this limitation
> > and we do have a usecase for growing a named VMA.
>
> Can't you simply size the memfd much larger? I mean, it doesn't really
> cost much, does it?

If we know beforehand what the max size it can reach then that would
be possible. I would really hate to miscalculate here and cause a
simple memory access to generate signals. Tracking such corner cases
in the field is not an easy task and I would rather avoid the
possibility of it.

>
> >
> > 3. Leaves an fd exposed, even briefly, which may lead to unexpected
> > flaws (e.g. anything using mmap MAP_SHARED could allow exposures or
> > overwrites). Even MAP_PRIVATE, if an attacker writes into the file
> > after ftruncate() and before mmap(), can cause private memory to be
> > initialized with unexpected data.
>
> I don't quite follow. Can you elaborate what exactly the issue here is?
> We use a temporary fd, yes, but how is that a problem?
>
> Any attacker can just write any random memory memory in the address
> space, so I don't see the issue.

It feels to me that introducing another handle to the memory region is
a potential attack vector but I'm not a security expert. Maybe Kees
can assess this better?

>
> >
> > 4. There is a usecase in the Android userspace where vma naming
> > happens after memory was allocated. Bionic linker does in-memory
> > relocations and then names some relocated sections.
>
> Would renaming a memfd be an option or is that "too late" ?

My understanding is that linker allocates space to load and relocate
the code, performs the relocations in that space and then names some
of the regions after that. Whether it can be redesigned to allocate
multiple named regions and perform the relocation between them I did
not really try since it would be a project by itself.

TBH, at some point I just look at the amount of required changes (both
kernel and userspace) and new limitations that userspace has to adhere
to for fitting memfds to my usecase, and I feel that it's just not
worth it. In the end we end up using the same refcounted strings with
vma->vm_file->f_count as the refcount and name stored in
vma->vm_file->f_path->dentry but with more overhead.
Thanks,
Suren.

>
> >
> > In the light of these findings, could the current patchset be reconsidered?
> > Thanks,
> > Suren.
> >
>
>
> --
> Thanks,
>
> David / dhildenb
>
David Hildenbrand Oct. 15, 2021, 4:39 p.m. UTC | #65
>>>
>>> 1. Forking a process with anonymous vmas named using memfd is 5-15%
>>> slower than with prctl (depends on the number of VMAs in the process
>>> being forked). Profiling shows that i_mmap_lock_write() dominates
>>> dup_mmap(). Exit path is also slower by roughly 9% with
>>> free_pgtables() and fput() dominating exit_mmap(). Fork performance is
>>> important for Android because almost all processes are forked from
>>> zygote, therefore this limitation already makes this approach
>>> prohibitive.
>>
>> Interesting, naturally I wonder if that can be optimized.
> 
> Maybe but it looks like we simply do additional things for file-backed
> memory, which seems natural. The call to i_mmap_lock_write() is from
> here: https://elixir.bootlin.com/linux/latest/source/kernel/fork.c#L565
> 
>>
>>>
>>> 2. mremap() usage to grow the mapping has an issue when used with memfds:
>>>
>>> fd = memfd_create(name, MFD_ALLOW_SEALING);
>>> ftruncate(fd, size_bytes);
>>> ptr = mmap(NULL, size_bytes, prot, MAP_PRIVATE, fd, 0);
>>> close(fd);
>>> ptr = mremap(ptr, size_bytes, size_bytes * 2, MREMAP_MAYMOVE);
>>> touch_mem(ptr, size_bytes * 2);
>>>
>>> This would generate a SIGBUS in touch_mem(). I believe it's because
>>> ftruncate() specified the size to be size_bytes and we are accessing
>>> more than that after remapping. prctl() does not have this limitation
>>> and we do have a usecase for growing a named VMA.
>>
>> Can't you simply size the memfd much larger? I mean, it doesn't really
>> cost much, does it?
> 
> If we know beforehand what the max size it can reach then that would
> be possible. I would really hate to miscalculate here and cause a
> simple memory access to generate signals. Tracking such corner cases
> in the field is not an easy task and I would rather avoid the
> possibility of it.

The question would be if you cannot simply add some extremely large
number, because the file size itself doesn't really matter for memfd IIRC.

Having that said, without trying it out, I wouldn't know from the top of
my head if memremap would work that way on an already closed fd that ahs
a sufficient size :/ If you have the example still somewhere, I would be
interested if that would work in general.

[...]

>>
>>>
>>> 4. There is a usecase in the Android userspace where vma naming
>>> happens after memory was allocated. Bionic linker does in-memory
>>> relocations and then names some relocated sections.
>>
>> Would renaming a memfd be an option or is that "too late" ?
> 
> My understanding is that linker allocates space to load and relocate
> the code, performs the relocations in that space and then names some
> of the regions after that. Whether it can be redesigned to allocate
> multiple named regions and perform the relocation between them I did
> not really try since it would be a project by itself.
> 
> TBH, at some point I just look at the amount of required changes (both
> kernel and userspace) and new limitations that userspace has to adhere
> to for fitting memfds to my usecase, and I feel that it's just not
> worth it. In the end we end up using the same refcounted strings with
> vma->vm_file->f_count as the refcount and name stored in
> vma->vm_file->f_path->dentry but with more overhead.

Yes, but it's glued to files which naturally have names :)

Again, I appreciate that you looked into alternatives! I can see the
late renaming could be the biggest blocker if user space cannot be
adjusted easily to be compatible with that using memfds.
Kees Cook Oct. 15, 2021, 5:45 p.m. UTC | #66
On Fri, Oct 15, 2021 at 09:30:09AM -0700, Suren Baghdasaryan wrote:
> On Fri, Oct 15, 2021 at 1:04 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 14.10.21 22:16, Suren Baghdasaryan wrote:
> > > [...]
> > > 3. Leaves an fd exposed, even briefly, which may lead to unexpected
> > > flaws (e.g. anything using mmap MAP_SHARED could allow exposures or
> > > overwrites). Even MAP_PRIVATE, if an attacker writes into the file
> > > after ftruncate() and before mmap(), can cause private memory to be
> > > initialized with unexpected data.
> >
> > I don't quite follow. Can you elaborate what exactly the issue here is?
> > We use a temporary fd, yes, but how is that a problem?
> >
> > Any attacker can just write any random memory memory in the address
> > space, so I don't see the issue.
> 
> It feels to me that introducing another handle to the memory region is
> a potential attack vector but I'm not a security expert. Maybe Kees
> can assess this better?

This case is kind of just an extension of "we don't need an fd, we need
a name". There is a lot of resulting baggage suddenly added to using
anonymous VMA (fork overhead to deal with the fds, etc), but for me, this
particular situation above is what really demonstrates the "unexpected
side-effects" of trying to swap an anonymous mmap for a memfd: there is
now an _external handle_ attached to memory that doesn't pass through
any of the existing security boundaries normally associated with process
memory (i.e. ptrace). Here's the example race:

victim process			attacker process (same uid)
memfd_create(name, flags);
	-> /proc/$pid/fd/3
ftruncate(3, size);
				open("/proc/$victim/fd/3", O_RDWR)
					-> 3
				mmap(NULL, size,
				     PROT_READ | PROT_WRITE | PROT_EXEC,
				     MAP_SHARED, 3, 0);
					-> addr
				memset(addr, 0xFF, size);

mmap(NULL, size, prot,
     MAP_PRIVATE, 3, 0);
	-> addr
close(3);

surprise, addr[0] != 0x00

And again, yes, we could program defensively, but it's a surprising
situation with new corner cases that haven't been present for years of
Just Using Anon VMAs. :) I would be worried about other vectors we
haven't imagined yet.

So, I think between both the overhead of files and the expanded attack
surface make memfd unsuited for this use-case.

-Kees
Suren Baghdasaryan Oct. 15, 2021, 6:33 p.m. UTC | #67
On Fri, Oct 15, 2021 at 9:39 AM David Hildenbrand <david@redhat.com> wrote:
>
>
> >>>
> >>> 1. Forking a process with anonymous vmas named using memfd is 5-15%
> >>> slower than with prctl (depends on the number of VMAs in the process
> >>> being forked). Profiling shows that i_mmap_lock_write() dominates
> >>> dup_mmap(). Exit path is also slower by roughly 9% with
> >>> free_pgtables() and fput() dominating exit_mmap(). Fork performance is
> >>> important for Android because almost all processes are forked from
> >>> zygote, therefore this limitation already makes this approach
> >>> prohibitive.
> >>
> >> Interesting, naturally I wonder if that can be optimized.
> >
> > Maybe but it looks like we simply do additional things for file-backed
> > memory, which seems natural. The call to i_mmap_lock_write() is from
> > here: https://elixir.bootlin.com/linux/latest/source/kernel/fork.c#L565
> >
> >>
> >>>
> >>> 2. mremap() usage to grow the mapping has an issue when used with memfds:
> >>>
> >>> fd = memfd_create(name, MFD_ALLOW_SEALING);
> >>> ftruncate(fd, size_bytes);
> >>> ptr = mmap(NULL, size_bytes, prot, MAP_PRIVATE, fd, 0);
> >>> close(fd);
> >>> ptr = mremap(ptr, size_bytes, size_bytes * 2, MREMAP_MAYMOVE);
> >>> touch_mem(ptr, size_bytes * 2);
> >>>
> >>> This would generate a SIGBUS in touch_mem(). I believe it's because
> >>> ftruncate() specified the size to be size_bytes and we are accessing
> >>> more than that after remapping. prctl() does not have this limitation
> >>> and we do have a usecase for growing a named VMA.
> >>
> >> Can't you simply size the memfd much larger? I mean, it doesn't really
> >> cost much, does it?
> >
> > If we know beforehand what the max size it can reach then that would
> > be possible. I would really hate to miscalculate here and cause a
> > simple memory access to generate signals. Tracking such corner cases
> > in the field is not an easy task and I would rather avoid the
> > possibility of it.
>
> The question would be if you cannot simply add some extremely large
> number, because the file size itself doesn't really matter for memfd IIRC.
>
> Having that said, without trying it out, I wouldn't know from the top of
> my head if memremap would work that way on an already closed fd that ahs
> a sufficient size :/ If you have the example still somewhere, I would be
> interested if that would work in general.

Yes, I tried a simple test like this and it works:

fd = memfd_create(name, MFD_ALLOW_SEALING);
ftruncate(fd, size_bytes * 2);
ptr = mmap(NULL, size_bytes, prot, MAP_PRIVATE, fd, 0);
close(fd);
ptr = mremap(ptr, size_bytes, size_bytes * 2, MREMAP_MAYMOVE);
touch_mem(ptr, size_bytes * 2);

I understand your suggestion but it's just another hoop we have to
jump to make this work and feels unnatural from userspace POV. Also
virtual address space exhaustion might be an issue for 32bit userspace
with this approach.

>
> [...]
>
> >>
> >>>
> >>> 4. There is a usecase in the Android userspace where vma naming
> >>> happens after memory was allocated. Bionic linker does in-memory
> >>> relocations and then names some relocated sections.
> >>
> >> Would renaming a memfd be an option or is that "too late" ?
> >
> > My understanding is that linker allocates space to load and relocate
> > the code, performs the relocations in that space and then names some
> > of the regions after that. Whether it can be redesigned to allocate
> > multiple named regions and perform the relocation between them I did
> > not really try since it would be a project by itself.
> >
> > TBH, at some point I just look at the amount of required changes (both
> > kernel and userspace) and new limitations that userspace has to adhere
> > to for fitting memfds to my usecase, and I feel that it's just not
> > worth it. In the end we end up using the same refcounted strings with
> > vma->vm_file->f_count as the refcount and name stored in
> > vma->vm_file->f_path->dentry but with more overhead.
>
> Yes, but it's glued to files which naturally have names :)

Yeah, I understand your motivations and that's why I'm exploring these
possibilities but it proves to be just too costly for a feature as
simple as naming a vma :)

>
> Again, I appreciate that you looked into alternatives! I can see the
> late renaming could be the biggest blocker if user space cannot be
> adjusted easily to be compatible with that using memfds.

Yeah, it would definitely be hard for Android to adopt this.

If there are no objections to the current approach I would like to
respin another version with the CONFIG option added sometime early
next week. If anyone has objections, please let me know.
Thanks,
Suren.

>
> --
> Thanks,
>
> David / dhildenb
>
diff mbox series

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f1896c9b2da9..30c571b4f8d3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -5,6 +5,7 @@ 
 #include <linux/mm_types_task.h>
 
 #include <linux/auxvec.h>
+#include <linux/kref.h>
 #include <linux/list.h>
 #include <linux/spinlock.h>
 #include <linux/rbtree.h>
@@ -310,6 +311,12 @@  struct vm_userfaultfd_ctx {
 struct vm_userfaultfd_ctx {};
 #endif /* CONFIG_USERFAULTFD */
 
+struct anon_vma_name {
+	struct kref kref;
+	/* The name needs to be at the end because it is dynamically sized. */
+	char name[];
+};
+
 /*
  * This struct describes a virtual memory area. There is one of these
  * per VM-area/task. A VM area is any part of the process virtual memory
@@ -361,7 +368,7 @@  struct vm_area_struct {
 			unsigned long rb_subtree_last;
 		} shared;
 		/* Serialized by mmap_sem. */
-		char *anon_name;
+		struct anon_vma_name *anon_name;
 	};
 
 	/*
diff --git a/mm/madvise.c b/mm/madvise.c
index 69729930dde1..2f1ec13faaaa 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -63,6 +63,29 @@  static int madvise_need_mmap_write(int behavior)
 	}
 }
 
+static struct anon_vma_name *anon_vma_name_alloc(const char *name)
+{
+	struct anon_vma_name *anon_name;
+	size_t count;
+
+	/* Add 1 for NUL terminator at the end of the anon_name->name */
+	count = strlen(name) + 1;
+	anon_name = kmalloc(struct_size(anon_name, name, count), GFP_KERNEL);
+	if (anon_name) {
+		kref_init(&anon_name->kref);
+		memcpy(anon_name->name, name, count);
+	}
+
+	return anon_name;
+}
+
+static void vma_anon_name_free(struct kref *kref)
+{
+	struct anon_vma_name *anon_name =
+			container_of(kref, struct anon_vma_name, kref);
+	kfree(anon_name);
+}
+
 static inline bool has_vma_anon_name(struct vm_area_struct *vma)
 {
 	return !vma->vm_file && vma->anon_name;
@@ -75,7 +98,7 @@  const char *vma_anon_name(struct vm_area_struct *vma)
 
 	mmap_assert_locked(vma->vm_mm);
 
-	return vma->anon_name;
+	return vma->anon_name->name;
 }
 
 void dup_vma_anon_name(struct vm_area_struct *orig_vma,
@@ -84,34 +107,41 @@  void dup_vma_anon_name(struct vm_area_struct *orig_vma,
 	if (!has_vma_anon_name(orig_vma))
 		return;
 
-	new_vma->anon_name = kstrdup(orig_vma->anon_name, GFP_KERNEL);
+	kref_get(&orig_vma->anon_name->kref);
+	new_vma->anon_name = orig_vma->anon_name;
 }
 
 void free_vma_anon_name(struct vm_area_struct *vma)
 {
+	struct anon_vma_name *anon_name;
+
 	if (!has_vma_anon_name(vma))
 		return;
 
-	kfree(vma->anon_name);
+	anon_name = vma->anon_name;
 	vma->anon_name = NULL;
+	kref_put(&anon_name->kref, vma_anon_name_free);
 }
 
 /* mmap_lock should be write-locked */
 static int replace_vma_anon_name(struct vm_area_struct *vma, const char *name)
 {
+	const char *anon_name;
+
 	if (!name) {
 		free_vma_anon_name(vma);
 		return 0;
 	}
 
-	if (vma->anon_name) {
+	anon_name = vma_anon_name(vma);
+	if (anon_name) {
 		/* Same name, nothing to do here */
-		if (!strcmp(name, vma->anon_name))
+		if (!strcmp(name, anon_name))
 			return 0;
 
 		free_vma_anon_name(vma);
 	}
-	vma->anon_name = kstrdup(name, GFP_KERNEL);
+	vma->anon_name = anon_vma_name_alloc(name);
 	if (!vma->anon_name)
 		return -ENOMEM;