
[RFC,0/4] mm: Introduce lazy exec permission setting on a page

Message ID: 1550045191-27483-1-git-send-email-anshuman.khandual@arm.com

Message

Anshuman Khandual Feb. 13, 2019, 8:06 a.m. UTC
Setting an exec permission on a page normally triggers I-cache invalidation
which might be expensive. I-cache invalidation is not mandatory on a given
page if there is no immediate exec access on it. Non-fault modification of
user page table from generic memory paths like migration can be improved if
setting of the exec permission on the page can be deferred till actual use.
There was a performance report [1] which highlighted the problem.

This introduces [pte|pmd]_mklazyexec(), which clears the exec permission on
a page during migration. The deferred exec permission is then restored with
maybe_[pmd]_mkexec() during an exec page fault (FAULT_FLAG_INSTRUCTION) if
the corresponding VMA carries the exec flag (VM_EXEC).

This framework is encapsulated under CONFIG_ARCH_SUPPORTS_LAZY_EXEC so that
non-subscribing architectures don't take any performance hit. For now only
the generic memory migration path uses this framework, but it can later be
extended to other generic memory paths as well.

The series enables CONFIG_ARCH_SUPPORTS_LAZY_EXEC on arm64, defines the
required helper functions, and changes ptep_set_access_flags() to allow the
non-exec to exec transition.

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620357.html
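
As an illustration of the intended mechanics, a rough arm64 sketch of the two
helpers could look like this (hypothetical code for this cover letter, not
the actual patches; set_pte_bit()/clear_pte_bit() and PTE_UXN are existing
arm64 primitives):

	/* Hypothetical sketch of the lazy-exec helpers */
	static inline pte_t pte_mklazyexec(pte_t pte)
	{
		/* Defer I-cache maintenance: map non-executable for now */
		return set_pte_bit(pte, __pgprot(PTE_UXN));
	}

	static inline pte_t maybe_mkexec(pte_t pte, struct vm_area_struct *vma)
	{
		/* Instruction fault in an exec VMA: restore exec permission */
		if (vma->vm_flags & VM_EXEC)
			return clear_pte_bit(pte, __pgprot(PTE_UXN));
		return pte;
	}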

Anshuman Khandual (4):
  mm: Introduce lazy exec permission setting on a page
  arm64/mm: Identify user level instruction faults
  arm64/mm: Allow non-exec to exec transition in ptep_set_access_flags()
  arm64/mm: Enable ARCH_SUPPORTS_LAZY_EXEC

 arch/arm64/Kconfig               |  1 +
 arch/arm64/include/asm/pgtable.h | 17 +++++++++++++++++
 arch/arm64/mm/fault.c            | 22 ++++++++++++++--------
 include/asm-generic/pgtable.h    | 12 ++++++++++++
 include/linux/mm.h               | 26 ++++++++++++++++++++++++++
 mm/Kconfig                       |  9 +++++++++
 mm/huge_memory.c                 |  5 +++++
 mm/hugetlb.c                     |  2 ++
 mm/memory.c                      |  4 ++++
 mm/migrate.c                     |  2 ++
 10 files changed, 92 insertions(+), 8 deletions(-)

Comments

Catalin Marinas Feb. 13, 2019, 11:21 a.m. UTC | #1
On Wed, Feb 13, 2019 at 01:36:27PM +0530, Anshuman Khandual wrote:
> Setting an exec permission on a page normally triggers I-cache invalidation
> which might be expensive. I-cache invalidation is not mandatory on a given
> page if there is no immediate exec access on it. Non-fault modification of
> user page table from generic memory paths like migration can be improved if
> setting of the exec permission on the page can be deferred till actual use.
> There was a performance report [1] which highlighted the problem.
[...]
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620357.html

FTR, this performance regression has been addressed by commit
132fdc379eb1 ("arm64: Do not issue IPIs for user executable ptes"). That
said, I still think this patch series is valuable for further optimising
the page migration path on arm64 (and can be extended to other
architectures that currently require I/D cache maintenance for
executable pages).

BTW, if you are going to post new versions of this series, please
include linux-arch and linux-arm-kernel.
Michal Hocko Feb. 13, 2019, 3:38 p.m. UTC | #2
On Wed 13-02-19 11:21:36, Catalin Marinas wrote:
> On Wed, Feb 13, 2019 at 01:36:27PM +0530, Anshuman Khandual wrote:
> > Setting an exec permission on a page normally triggers I-cache invalidation
> > which might be expensive. I-cache invalidation is not mandatory on a given
> > page if there is no immediate exec access on it. Non-fault modification of
> > user page table from generic memory paths like migration can be improved if
> > setting of the exec permission on the page can be deferred till actual use.
> > There was a performance report [1] which highlighted the problem.
> [...]
> > [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620357.html
> 
> FTR, this performance regression has been addressed by commit
> 132fdc379eb1 ("arm64: Do not issue IPIs for user executable ptes"). That
> said, I still think this patch series is valuable for further optimising
> the page migration path on arm64 (and can be extended to other
> architectures that currently require I/D cache maintenance for
> executable pages).

Are there any numbers to show the optimization impact?
Dave Hansen Feb. 13, 2019, 3:44 p.m. UTC | #3
On 2/13/19 12:06 AM, Anshuman Khandual wrote:
> Setting an exec permission on a page normally triggers I-cache invalidation
> which might be expensive. I-cache invalidation is not mandatory on a given
> page if there is no immediate exec access on it. Non-fault modification of
> user page table from generic memory paths like migration can be improved if
> setting of the exec permission on the page can be deferred till actual use.
> There was a performance report [1] which highlighted the problem.

How does this happen?  If the page was not executed, then it'll
(presumably) be non-present which won't require icache invalidation.
So, this would only be for pages that have been executed (and won't
again before the next migration), *or* for pages that were mapped
executable but never executed.

Any idea which one it is?

If it's pages that got mapped in but were never executed, how did that
happen?  Was it fault-around?  If so, maybe it would just be simpler to
not do fault-around for executable pages on these platforms.
Anshuman Khandual Feb. 14, 2019, 4:12 a.m. UTC | #4
On 02/13/2019 09:14 PM, Dave Hansen wrote:
> On 2/13/19 12:06 AM, Anshuman Khandual wrote:
>> Setting an exec permission on a page normally triggers I-cache invalidation
>> which might be expensive. I-cache invalidation is not mandatory on a given
>> page if there is no immediate exec access on it. Non-fault modification of
>> user page table from generic memory paths like migration can be improved if
>> setting of the exec permission on the page can be deferred till actual use.
>> There was a performance report [1] which highlighted the problem.
> 
> How does this happen?  If the page was not executed, then it'll
> (presumably) be non-present which won't require icache invalidation.
> So, this would only be for pages that have been executed (and won't
> again before the next migration), *or* for pages that were mapped
> executable but never executed.
I-cache invalidation happens while migrating a 'mapped and executable' page
irrespective of whether that page was ever actually executed after being mapped there
in the first place.

> 
> Any idea which one it is?
> 

I am not sure about this particular reported case. But I was able to
reproduce the problem through a test case where a buffer was mapped with
R|W|X, faulted/mapped in through a write, migrated, and then executed from.
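
A minimal sketch of such a reproducer (hypothetical code, not the actual test
case; it assumes an arm64 box with at least two NUMA nodes and libnuma for
move_pages()):

	#include <string.h>
	#include <sys/mman.h>
	#include <numaif.h>	/* move_pages(); link with -lnuma */

	int main(void)
	{
		unsigned int ret_insn = 0xd65f03c0;	/* AArch64 RET */
		void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		void *pages[1] = { buf };
		int node = 1, status = -1;

		memcpy(buf, &ret_insn, 4);	/* fault/map through a write */
		__builtin___clear_cache((char *)buf, (char *)buf + 4);
		move_pages(0, 1, pages, &node, &status, MPOL_MF_MOVE);
		((void (*)(void))buf)();	/* first exec access after migration */
		return 0;
	}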

> If it's pages that got mapped in but were never executed, how did that
> happen?  Was it fault-around?  If so, maybe it would just be simpler to
> not do fault-around for executable pages on these platforms.
A page can get mapped through a different access (write) without being
executed. Even if it got mapped through execution, with the corresponding
invalidation, the invalidation does not have to be repeated after migration
until the page sees another exec access. This series just tries to hold off
the invalidation after migration until that subsequent exec access.
Anshuman Khandual Feb. 14, 2019, 6:04 a.m. UTC | #5
On 02/13/2019 09:08 PM, Michal Hocko wrote:
> On Wed 13-02-19 11:21:36, Catalin Marinas wrote:
>> On Wed, Feb 13, 2019 at 01:36:27PM +0530, Anshuman Khandual wrote:
>>> Setting an exec permission on a page normally triggers I-cache invalidation
>>> which might be expensive. I-cache invalidation is not mandatory on a given
>>> page if there is no immediate exec access on it. Non-fault modification of
>>> user page table from generic memory paths like migration can be improved if
>>> setting of the exec permission on the page can be deferred till actual use.
>>> There was a performance report [1] which highlighted the problem.
>> [...]
>>> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620357.html
>>
>> FTR, this performance regression has been addressed by commit
>> 132fdc379eb1 ("arm64: Do not issue IPIs for user executable ptes"). That
>> said, I still think this patch series is valuable for further optimising
>> the page migration path on arm64 (and can be extended to other
>> architectures that currently require I/D cache maintenance for
>> executable pages).
> 
> Are there any numbers to show the optimization impact?

This series transfers execution cost linearly with nr_pages from migration path
to subsequent exec access path for normal, THP and HugeTLB pages. The experiment
is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for
HugeTLB and THP migration enablement on arm64 platform.

A. [Normal Pages]

nr_pages	migration1 	migration2	execfault1	execfault2	

1000 		7.000000	3.000000	24.000000	31.000000
5000 		38.000000 	18.000000	127.000000	153.000000
10000 		80.000000 	40.000000	289.000000	343.000000
15000		120.000000	60.000000	435.000000	514.000000
19900 		159.000000	79.000000	576.000000	681.000000

B. [THP Pages]

nr_pages	migration1 	migration2	execfault1	execfault2

10 		22.000000	3.000000	131.000000	146.000000
30 		72.000000	15.000000	443.000000	503.000000
50 		121.000000	24.000000	739.000000	837.000000
100 		242.000000	49.000000	1485.000000	1673.000000
199 		473.000000 	98.000000	2685.000000	3327.000000

C. [HugeTLB Pages]

nr_pages	migration1 	migration2	execfault1	execfault2

10		97.000000 	79.000000	125.000000	144.000000
30 		292.000000 	235.000000	408.000000	463.000000
50 		487.000000 	392.000000	674.000000	777.000000
100 		995.000000 	802.000000	1480.000000	1671.000000
130 		1300.000000 	1048.000000	1925.000000	2172.000000

NOTE:

migration1: Execution time (ms) for migrating nr_pages without patches
migration2: Execution time (ms) for migrating nr_pages with patches
execfault1: Execution time (ms) for executing nr_pages without patches
execfault2: Execution time (ms) for executing nr_pages with patches
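
Reading the last row of table A as an example: migrating 19900 pages gets
roughly twice as fast (159 ms down to 79 ms), while executing from those
pages afterwards gets about 18% slower (576 ms up to 681 ms). So for normal
pages, if every migrated page is executed from later, the patches are a net
loss; the win depends on only a fraction of the migrated pages seeing a
subsequent exec access.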
Michal Hocko Feb. 14, 2019, 8:38 a.m. UTC | #6
On Thu 14-02-19 11:34:09, Anshuman Khandual wrote:
> 
> 
> On 02/13/2019 09:08 PM, Michal Hocko wrote:
> > On Wed 13-02-19 11:21:36, Catalin Marinas wrote:
> >> On Wed, Feb 13, 2019 at 01:36:27PM +0530, Anshuman Khandual wrote:
> >>> Setting an exec permission on a page normally triggers I-cache invalidation
> >>> which might be expensive. I-cache invalidation is not mandatory on a given
> >>> page if there is no immediate exec access on it. Non-fault modification of
> >>> user page table from generic memory paths like migration can be improved if
> >>> setting of the exec permission on the page can be deferred till actual use.
> >>> There was a performance report [1] which highlighted the problem.
> >> [...]
> >>> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620357.html
> >>
> >> FTR, this performance regression has been addressed by commit
> >> 132fdc379eb1 ("arm64: Do not issue IPIs for user executable ptes"). That
> >> said, I still think this patch series is valuable for further optimising
> >> the page migration path on arm64 (and can be extended to other
> >> architectures that currently require I/D cache maintenance for
> >> executable pages).
> > 
> > Are there any numbers to show the optimization impact?
> 
> This series transfers execution cost linearly with nr_pages from migration path
> to subsequent exec access path for normal, THP and HugeTLB pages. The experiment
> is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for
> HugeTLB and THP migration enablement on arm64 platform.

Please make sure that these numbers are in the changelog. I am also
missing an explanation of why this is an overall win. Why should we pay
on the later access rather than during the migration, which is arguably
a slower path? What is the usecase that benefits from the cost shift?
Catalin Marinas Feb. 14, 2019, 10:19 a.m. UTC | #7
On Thu, Feb 14, 2019 at 09:38:44AM +0100, Michal Hocko wrote:
> On Thu 14-02-19 11:34:09, Anshuman Khandual wrote:
> > On 02/13/2019 09:08 PM, Michal Hocko wrote:
> > > Are there any numbers to show the optimization impact?
> > 
> > This series transfers execution cost linearly with nr_pages from migration path
> > to subsequent exec access path for normal, THP and HugeTLB pages. The experiment
> > is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for
> > HugeTLB and THP migration enablement on arm64 platform.
> 
> Please make sure that these numbers are in the changelog. I am also
> missing an explanation why this is an overal win. Why should we pay
> on the later access rather than the migration which is arguably a slower
> path. What is the usecase that benefits from the cost shift?

Originally the investigation started because of a regression we had
sending IPIs on each set_pte_at(PROT_EXEC). This has been fixed
separately, so the original value of this patchset has been diminished.

Trying to frame the problem, let's analyse the overall cost of migration
+ execute. Removing other invariants like cost of the initial mapping of
the pages or the mapping of new pages after migration, we have:

M - number of mapped executable pages just before migration
N - number of previously mapped pages that will be executed after
    migration (N <= M)
D - cost of migrating page data
I - cost of I-cache maintenance for a page
F - cost of an instruction fault (handle_mm_fault() + set_pte_at()
    without the actual I-cache maintenance)

Tc - total migration cost current kernel (including executing)
Tp - total migration cost patched kernel (including executing)

  Tc = M * (D + I)
  Tp = M * D + N * (F + I)

To be useful, we want this patchset to lead to:

  Tp < Tc

Simplifying:

  M * D + N * (F + I) < M * (D + I)
  ...
  F < I * (M - N) / N

So the question is, in a *real-world* scenario, what proportion of the
mapped executable pages would still be executed from after migration.
I'd leave this as a task for Anshuman to investigate and come up with
some numbers (and it's fine if it's just in the noise, we won't need
this patchset).
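
As a purely illustrative instance of that inequality (numbers invented, not
measured): with M = 100 mapped executable pages of which only N = 10 are
executed after migration, the patched kernel wins whenever
F < I * (100 - 10) / 10 = 9 * I, i.e. while one instruction fault costs less
than nine times the per-page I-cache maintenance. As N approaches M the bound
shrinks towards zero, so if nearly every page is executed again, any fault
cost at all makes the patched kernel slower.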

Also note that there are ARM CPU implementations that don't need I-cache
maintenance (the I side can snoop the D side), so for those this
patchset introduces an additional cost. But we can make the decision in
the arch code via pte_mklazyexec().
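
For example (hypothetical sketch), an arm64 implementation with CTR_EL0.DIC
could make the hook a no-op using the existing ARM64_HAS_CACHE_DIC capability:

	static inline pte_t pte_mklazyexec(pte_t pte)
	{
		/* I side snoops D side: exec is cheap, keep the permission */
		if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
			return pte;
		return set_pte_bit(pte, __pgprot(PTE_UXN));
	}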

We implemented something similar in arm64 KVM (d0e22b4ac3ba "KVM:
arm/arm64: Limit icache invalidation to prefetch aborts") but the
use-case was different: previously KVM considered all pages executable
though the vast majority were only data pages in guests.
Michal Hocko Feb. 14, 2019, 12:28 p.m. UTC | #8
On Thu 14-02-19 10:19:37, Catalin Marinas wrote:
> On Thu, Feb 14, 2019 at 09:38:44AM +0100, Michal Hocko wrote:
> > On Thu 14-02-19 11:34:09, Anshuman Khandual wrote:
> > > On 02/13/2019 09:08 PM, Michal Hocko wrote:
> > > > Are there any numbers to show the optimization impact?
> > > 
> > > This series transfers execution cost linearly with nr_pages from migration path
> > > to subsequent exec access path for normal, THP and HugeTLB pages. The experiment
> > > is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for
> > > HugeTLB and THP migration enablement on arm64 platform.
> > 
> > Please make sure that these numbers are in the changelog. I am also
> > missing an explanation of why this is an overall win. Why should we pay
> > on the later access rather than during the migration, which is arguably
> > a slower path? What is the usecase that benefits from the cost shift?
> 
> Originally the investigation started because of a regression we had
> sending IPIs on each set_pte_at(PROT_EXEC). This has been fixed
> separately, so the original value of this patchset has been diminished.
> 
> Trying to frame the problem, let's analyse the overall cost of migration
> + execute. Removing other invariants like cost of the initial mapping of
> the pages or the mapping of new pages after migration, we have:
> 
> M - number of mapped executable pages just before migration
> N - number of previously mapped pages that will be executed after
>     migration (N <= M)
> D - cost of migrating page data
> I - cost of I-cache maintenance for a page
> F - cost of an instruction fault (handle_mm_fault() + set_pte_at()
>     without the actual I-cache maintenance)
> 
> Tc - total migration cost current kernel (including executing)
> Tp - total migration cost patched kernel (including executing)
> 
>   Tc = M * (D + I)
>   Tp = M * D + N * (F + I)
> 
> To be useful, we want this patchset to lead to:
> 
>   Tp < Tc
> 
> Simplifying:
> 
>   M * D + N * (F + I) < M * (D + I)
>   ...
>   F < I * (M - N) / N
> 
> So the question is, in a *real-world* scenario, what proportion of the
> mapped executable pages would still be executed from after migration.
> I'd leave this as a task for Anshuman to investigate and come up with
> some numbers (and it's fine if it's just in the noise, we won't need
> this patchset).

Yeah, betting on accessing only a smaller subset of the migrated memory
is something I figured out. But I am really missing a usecase or a
larger set of them to actually benefit from it. We have different
triggers for a migration. E.g. numa balancing. I would expect that
migrated pages are likely to be accessed after migration because
the primary reason to migrate them is that they are accessed from a
remote node. Then we have compaction, which is a completely different story.
It is hard to assume any further access for migrated pages here. Then we
have an explicit move_pages syscall and I would expect this to be
somewhere in the middle. One would expect that the caller knows why the
memory is migrated and it will be used but again, we cannot really
assume anything.

This would suggest that this depends on the migration reason quite a
lot. So I would really like to see a more comprehensive analysis of
different workloads to see whether this is really worth it.

Thanks!
Dave Hansen Feb. 14, 2019, 3:38 p.m. UTC | #9
On 2/13/19 10:04 PM, Anshuman Khandual wrote:
>> Are there any numbers to show the optimization impact?
> This series transfers execution cost linearly with nr_pages from migration path
> to subsequent exec access path for normal, THP and HugeTLB pages. The experiment
> is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for
> HugeTLB and THP migration enablement on arm64 platform.
> 
> A. [Normal Pages]
> 
> nr_pages	migration1 	migration2	execfault1	execfault2	
> 
> 1000 		7.000000	3.000000	24.000000	31.000000
> 5000 		38.000000 	18.000000	127.000000	153.000000
> 10000 		80.000000 	40.000000	289.000000	343.000000
> 15000		120.000000	60.000000	435.000000	514.000000
> 19900 		159.000000	79.000000	576.000000	681.000000

Do these numbers comprehend the increased fault costs or just the
decreased migration costs?
Dave Hansen Feb. 14, 2019, 4:55 p.m. UTC | #10
On 2/13/19 8:12 PM, Anshuman Khandual wrote:
> On 02/13/2019 09:14 PM, Dave Hansen wrote:
>> On 2/13/19 12:06 AM, Anshuman Khandual wrote:
>>> Setting an exec permission on a page normally triggers I-cache invalidation
>>> which might be expensive. I-cache invalidation is not mandatory on a given
>>> page if there is no immediate exec access on it. Non-fault modification of
>>> user page table from generic memory paths like migration can be improved if
>>> setting of the exec permission on the page can be deferred till actual use.
>>> There was a performance report [1] which highlighted the problem.
>>
>> How does this happen?  If the page was not executed, then it'll
>> (presumably) be non-present which won't require icache invalidation.
>> So, this would only be for pages that have been executed (and won't
>> again before the next migration), *or* for pages that were mapped
>> executable but never executed.
> I-cache invalidation happens while migrating a 'mapped and executable' page
> irrespective of whether that page was ever actually executed after being mapped there
> in the first place.

Ahh, got it.  I also assume that the Accessed bit on these platforms is
also managed similar to how we do it on x86 such that it can't be used
to drive invalidation decisions?

>> Any idea which one it is?
> 
> I am not sure about this particular reported case. But I was able to
> reproduce the problem through a test case where a buffer was mapped with
> R|W|X, faulted/mapped in through a write, migrated, and then executed from.

Could you make sure, please?

Write and Execute at the same time are generally a "bad idea".  Given
the hardware, I'm not surprised that this problem pops up, but it would
be great to find out if this is a real application, or a "doctor it
hurts when I do this."

>> If it's pages that got mapped in but were never executed, how did that
>> happen?  Was it fault-around?  If so, maybe it would just be simpler to
>> not do fault-around for executable pages on these platforms.
> A page can get mapped through a different access (write) without being
> executed. Even if it got mapped through execution, with the corresponding
> invalidation, the invalidation does not have to be repeated after migration
> until the page sees another exec access. This series just tries to hold off
> the invalidation after migration until that subsequent exec access.

This set generally seems to be assuming an environment with "lots of
migration, and not much execution".  That seems like a kinda odd
situation to me.
Anshuman Khandual Feb. 15, 2019, 8:45 a.m. UTC | #11

On 02/14/2019 05:58 PM, Michal Hocko wrote:
> On Thu 14-02-19 10:19:37, Catalin Marinas wrote:
>> On Thu, Feb 14, 2019 at 09:38:44AM +0100, Michal Hocko wrote:
>>> On Thu 14-02-19 11:34:09, Anshuman Khandual wrote:
>>>> On 02/13/2019 09:08 PM, Michal Hocko wrote:
>>>>> Are there any numbers to show the optimization impact?
>>>>
>>>> This series transfers execution cost linearly with nr_pages from migration path
>>>> to subsequent exec access path for normal, THP and HugeTLB pages. The experiment
>>>> is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for
>>>> HugeTLB and THP migration enablement on arm64 platform.
>>>
>>> Please make sure that these numbers are in the changelog. I am also
>>> missing an explanation of why this is an overall win. Why should we pay
>>> on the later access rather than during the migration, which is arguably
>>> a slower path? What is the usecase that benefits from the cost shift?
>>
>> Originally the investigation started because of a regression we had
>> sending IPIs on each set_pte_at(PROT_EXEC). This has been fixed
>> separately, so the original value of this patchset has been diminished.
>>
>> Trying to frame the problem, let's analyse the overall cost of migration
>> + execute. Removing other invariants like cost of the initial mapping of
>> the pages or the mapping of new pages after migration, we have:
>>
>> M - number of mapped executable pages just before migration
>> N - number of previously mapped pages that will be executed after
>>     migration (N <= M)
>> D - cost of migrating page data
>> I - cost of I-cache maintenance for a page
>> F - cost of an instruction fault (handle_mm_fault() + set_pte_at()
>>     without the actual I-cache maintenance)
>>
>> Tc - total migration cost current kernel (including executing)
>> Tp - total migration cost patched kernel (including executing)
>>
>>   Tc = M * (D + I)
>>   Tp = M * D + N * (F + I)
>>
>> To be useful, we want this patchset to lead to:
>>
>>   Tp < Tc
>>
>> Simplifying:
>>
>>   M * D + N * (F + I) < M * (D + I)
>>   ...
>>   F < I * (M - N) / N
>>
>> So the question is, in a *real-world* scenario, what proportion of the
>> mapped executable pages would still be executed from after migration.
>> I'd leave this as a task for Anshuman to investigate and come up with
>> some numbers (and it's fine if it's just in the noise, we won't need
>> this patchset).
> 
> Yeah, betting on accessing only a smaller subset of the migrated memory
> is something I figured out. But I am really missing a usecase or a
> larger set of them to actually benefit from it. We have different
> triggers for a migration. E.g. numa balancing. I would expect that
> migrated pages are likely to be accessed after migration because
> the primary reason to migrate them is that they are accessed from a
> remote node. Then we have compaction, which is a completely different story.

That access might not have been an exec fault; it could have been a bunch
of write faults which triggered the NUMA migration. So a NUMA-triggered
migration does not necessarily mean continuing exec faults before and after
migration.

Compaction might move around mapped pages with exec permission which have
no recent history of exec accesses before compaction and might never see
any future exec access either.

> It is hard to assume any further access for migrated pages here. Then we
> have an explicit move_pages syscall and I would expect this to be
> somewhere in the middle. One would expect that the caller knows why the
> memory is migrated and it will be used but again, we cannot really
> assume anything.

What if the caller knows that it won't be used ever again, or not in the
near future, and is hence migrating it to a different node which has less
expensive and slower memory? The kernel should not assume either way, but
it can decide to be conservative about spending time preparing for future
exec faults.

But being conservative during migration risks additional exec faults which
would have been avoided had the exec permission stayed on, followed by an
I-cache invalidation. Deferring the I-cache invalidation requires removing
the exec permission completely (unless there is some magic I am not aware
of), i.e. unmapping the page for exec access and risking an exec fault next
time around.

This problem gets particularly amplified for mixed-permission (WRITE | EXEC)
user space mappings, where things like NUMA migration, compaction etc. are
probably triggered by write faults and the additional exec permission never
really gets used.

> 
> This would suggest that this depends on the migration reason quite a
> lot. So I would really like to see a more comprehensive analysis of
> different workloads to see whether this is really worth it.

Sure. Could you please give some more details on how to go about this and
what specifically you are looking for? User initiated migration through
system calls seems a bit tricky, as an application can be written primarily
to benefit from this series. If real world applications can give some
better insights, then I wonder which ones. Or do we need to understand more
about compaction and NUMA triggered migration, which are kernel driven?
Statistics from compaction/NUMA migration can reveal what ratio of the exec
enabled mappings gets exec faulted again later on after kernel driven
migrations (compaction/NUMA), which are more or less random and do not
depend too much on application behavior.

- Anshuman
Michal Hocko Feb. 15, 2019, 9:27 a.m. UTC | #12
On Fri 15-02-19 14:15:58, Anshuman Khandual wrote:
> On 02/14/2019 05:58 PM, Michal Hocko wrote:
> > It is hard to assume any further access for migrated pages here. Then we
> > have an explicit move_pages syscall and I would expect this to be
> > somewhere in the middle. One would expect that the caller knows why the
> > memory is migrated and it will be used but again, we cannot really
> > assume anything.
> 
> What if the caller knows that it won't be used ever again, or not in the
> near future, and is hence migrating it to a different node which has less
> expensive and slower memory? The kernel should not assume either way, but
> it can decide to be conservative about spending time preparing for future
> exec faults.
>
> But being conservative during migration risks additional exec faults which
> would have been avoided had the exec permission stayed on, followed by an
> I-cache invalidation. Deferring the I-cache invalidation requires removing
> the exec permission completely (unless there is some magic I am not aware
> of), i.e. unmapping the page for exec access and risking an exec fault next
> time around.
>
> This problem gets particularly amplified for mixed-permission (WRITE | EXEC)
> user space mappings, where things like NUMA migration, compaction etc. are
> probably triggered by write faults and the additional exec permission never
> really gets used.

Please quantify that and provide us with some _data_

> > This would suggest that this depends on the migration reason quite a
> > lot. So I would really like to see a more comprehensive analysis of
> > different workloads to see whether this is really worth it.
> 
> Sure. Could you please give some more details on how to go about this and
> what specifically you are looking for?

You are proposing an optimization without actually providing any
justification. The overhead is not removed; it is just shifted from one
path to another. So you should have some pretty convincing arguments
to back that shift as a general win. You can go and test on a wider range
of workloads and isolate the worst/best case behavior. I fully realize
that this is tedious. Another option would be to define conditions when
the optimization is going to be a huge win and have some convincing
arguments that many/most workloads are falling into that category while
pathological ones are not suffering much.

This is no different from any other optimizations/heuristics we have.

Btw. have you considered making this optimization conditional on the
migration reason or vma flags?
Anshuman Khandual Feb. 18, 2019, 3:07 a.m. UTC | #13
On 02/15/2019 02:57 PM, Michal Hocko wrote:
> On Fri 15-02-19 14:15:58, Anshuman Khandual wrote:
>> On 02/14/2019 05:58 PM, Michal Hocko wrote:
>>> It is hard to assume any further access for migrated pages here. Then we
>>> have an explicit move_pages syscall and I would expect this to be
>>> somewhere in the middle. One would expect that the caller knows why the
>>> memory is migrated and it will be used but again, we cannot really
>>> assume anything.
>>
>> What if the caller knows that it won't be used ever again, or not in the
>> near future, and is hence migrating it to a different node which has less
>> expensive and slower memory? The kernel should not assume either way, but
>> it can decide to be conservative about spending time preparing for future
>> exec faults.
>>
>> But being conservative during migration risks additional exec faults which
>> would have been avoided had the exec permission stayed on, followed by an
>> I-cache invalidation. Deferring the I-cache invalidation requires removing
>> the exec permission completely (unless there is some magic I am not aware
>> of), i.e. unmapping the page for exec access and risking an exec fault next
>> time around.
>>
>> This problem gets particularly amplified for mixed-permission (WRITE | EXEC)
>> user space mappings, where things like NUMA migration, compaction etc. are
>> probably triggered by write faults and the additional exec permission never
>> really gets used.
> 
> Please quantify that and provide us with some _data_
>
>>> This would suggest that this depends on the migration reason quite a
>>> lot. So I would really like to see a more comprehensive analysis of
>>> different workloads to see whether this is really worth it.
>>
>> Sure. Could you please give some more details on how to go about this and
>> what specifically you are looking for?
> 
> You are proposing an optimization without actually providing any
> justification. The overhead is not removed; it is just shifted from one
> path to another. So you should have some pretty convincing arguments
> to back that shift as a general win. You can go and test on a wider range
> of workloads and isolate the worst/best case behavior. I fully realize
> that this is tedious. Another option would be to define conditions when
> the optimization is going to be a huge win and have some convincing

Yeah, a conditional approach might narrow down the field and provide a
better probability of a general win. The system call (move_pages/mbind)
based migrations from user space are better placed for a win because the
user might just want to put those pages aside for rare exec accesses in
the future, and the worst case cost of the deferral is not too high either.
A hint for the kernel regarding probable rare exec accesses in the future
would have been better, but I am afraid it would then require a new user
interface. But I think the lazy exec decision can be taken right away for
MR_SYSCALL triggered migrations of VMAs with mixed permissions
([VM_READ]|VM_WRITE|VM_EXEC), knowing that in the worst case the cost is
just an extra exec fault after getting migrated.
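
A minimal sketch of how such a conditional check could look (hypothetical;
it assumes the migration reason is made available at the point where the
migration path builds the new PTE, which the current series does not do;
enum migrate_reason and MR_SYSCALL already exist in linux/migrate.h):

	/* Hypothetical: defer exec only for syscall-driven migration
	 * of mixed-permission (write + exec) VMAs. */
	static bool should_lazy_exec(struct vm_area_struct *vma,
				     enum migrate_reason reason)
	{
		if (reason != MR_SYSCALL)
			return false;
		return (vma->vm_flags & (VM_WRITE | VM_EXEC)) ==
		       (VM_WRITE | VM_EXEC);
	}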

MR_NUMA_MISPLACED triggered migrations require explicit tracking of the
fault type (exec/write/[read]) per VMA, along with its applicable
permissions, to determine whether exec permission deferral would be helpful
or not. These stats can also be used for all other kernel or user initiated
migrations like MR_COMPACTION, MR_MEMORY_FAILURE, MR_MEMORY_HOTPLUG and
MR_CONTIG_RANGE.

Would it be worth adding explicit fault type tracking per VMA? Could it be
used for some other purpose as well?

> arguments that many/most workloads are falling into that category while
> pathological ones are not suffering much.
> 
> This is no different from any other optimizations/heuristics we have.

Sure. Will think about this further.

> 
> Btw. have you considered making this optimization conditional on the
> migration reason or vma flags?

Started considering it after our discussions here. It makes sense to look
into the migration reason and the VMA flags right away, but as I mentioned
earlier, VMA fault type stats could really help with this as well.
Anshuman Khandual Feb. 18, 2019, 3:19 a.m. UTC | #14
On 02/14/2019 09:08 PM, Dave Hansen wrote:
> On 2/13/19 10:04 PM, Anshuman Khandual wrote:
>>> Are there any numbers to show the optimization impact?
>> This series transfers execution cost linearly with nr_pages from migration path
>> to subsequent exec access path for normal, THP and HugeTLB pages. The experiment
>> is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for
>> HugeTLB and THP migration enablement on arm64 platform.
>>
>> A. [Normal Pages]
>>
>> nr_pages	migration1 	migration2	execfault1	execfault2	
>>
>> 1000 		7.000000	3.000000	24.000000	31.000000
>> 5000 		38.000000 	18.000000	127.000000	153.000000
>> 10000 		80.000000 	40.000000	289.000000	343.000000
>> 15000		120.000000	60.000000	435.000000	514.000000
>> 19900 		159.000000	79.000000	576.000000	681.000000
> 
> Do these numbers comprehend the increased fault costs or just the
> decreased migration costs?

Both. It transfers cost from migration path to exec fault path.
Anshuman Khandual Feb. 18, 2019, 8:31 a.m. UTC | #15
On 02/14/2019 10:25 PM, Dave Hansen wrote:
> On 2/13/19 8:12 PM, Anshuman Khandual wrote:
>> On 02/13/2019 09:14 PM, Dave Hansen wrote:
>>> On 2/13/19 12:06 AM, Anshuman Khandual wrote:
>>>> Setting an exec permission on a page normally triggers I-cache invalidation
>>>> which might be expensive. I-cache invalidation is not mandatory on a given
>>>> page if there is no immediate exec access on it. Non-fault modification of
>>>> user page table from generic memory paths like migration can be improved if
>>>> setting of the exec permission on the page can be deferred till actual use.
>>>> There was a performance report [1] which highlighted the problem.
>>>
>>> How does this happen?  If the page was not executed, then it'll
>>> (presumably) be non-present which won't require icache invalidation.
>>> So, this would only be for pages that have been executed (and won't
>>> again before the next migration), *or* for pages that were mapped
>>> executable but never executed.
>> I-cache invalidation happens while migrating a 'mapped and executable' page
>> irrespective of whether that page was ever actually executed after being mapped there
>> in the first place.
> 
> Ahh, got it.  I also assume that the Accessed bit on these platforms is
> also managed similar to how we do it on x86 such that it can't be used
> to drive invalidation decisions?

Drive I-cache invalidation? Could you please elaborate on this. Is not the
access bit mechanism there to identify dirty pages after write faults when
it is SW updated, or write accesses when HW updated? In the SW updated
method, a given PTE goes through pte_young() during a page fault. Then how
does one differentiate an exec fault/access from a write fault/access and
decide to invalidate the I-cache? Just being curious.

> 
>>> Any idea which one it is?
>>
>> I am not sure about this particular reported case. But I was able to
>> reproduce the problem through a test case where a buffer was mapped with
>> R|W|X, faulted/mapped in through a write, migrated, and then executed from.
> 
> Could you make sure, please?

The test in the report [1] does not create any explicit PROT_EXEC maps and just
attempts to migrate all pages of the process (which has 10 child processes)
including the exec pages. So the only exec mappings would be the primary text
segment and the mapped shared glibc segment. Looks like the shared libraries
have some mapped pages.

$cat /proc/[PID]/numa_maps  | grep libc

ffffaa4c9000 default file=/lib/aarch64-linux-gnu/libc-2.28.so mapped=150 mapmax=57 N0=150 kernelpagesize_kB=4
ffffaa621000 default file=/lib/aarch64-linux-gnu/libc-2.28.so
ffffaa630000 default file=/lib/aarch64-linux-gnu/libc-2.28.so anon=4 dirty=4 mapmax=11 N0=4 kernelpagesize_kB=4
ffffaa634000 default file=/lib/aarch64-linux-gnu/libc-2.28.so anon=2 dirty=2 mapmax=11 N0=2 kernelpagesize_kB=4

Will keep looking into this.

> 
> Write and Execute at the same time are generally a "bad idea".  Given

But won't this be the case for all run-time generated code which gets
written to a buffer before being executed from there?

> the hardware, I'm not surprised that this problem pops up, but it would
> be great to find out if this is a real application, or a "doctor it
> hurts when I do this."

Is not that a problem though :)

> 
>>> If it's pages that got mapped in but were never executed, how did that
>>> happen?  Was it fault-around?  If so, maybe it would just be simpler to
>>> not do fault-around for executable pages on these platforms.
>> A page can get mapped through a different access (write) without being
>> executed. Even if it got mapped through execution, with the corresponding
>> invalidation, the invalidation does not have to be repeated after migration
>> until the page sees another exec access. This series just tries to hold off
>> the invalidation after migration until that subsequent exec access.
> 
> This set generally seems to be assuming an environment with "lots of
> migration, and not much execution".  That seems like a kinda odd
> situation to me.

Irrespective of the reported problem, which is user driven, there are many
kernel triggered migrations which can accumulate I-cache invalidation cost
over time on a memory heavy system with a high number of exec enabled user
pages. Will that be such a rare situation?

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620357.html
Catalin Marinas Feb. 18, 2019, 9:04 a.m. UTC | #16
On Mon, Feb 18, 2019 at 02:01:55PM +0530, Anshuman Khandual wrote:
> On 02/14/2019 10:25 PM, Dave Hansen wrote:
> > On 2/13/19 8:12 PM, Anshuman Khandual wrote:
> >> On 02/13/2019 09:14 PM, Dave Hansen wrote:
> >>> On 2/13/19 12:06 AM, Anshuman Khandual wrote:
> >>>> Setting an exec permission on a page normally triggers I-cache invalidation
> >>>> which might be expensive. I-cache invalidation is not mandatory on a given
> >>>> page if there is no immediate exec access on it. Non-fault modification of
> >>>> user page table from generic memory paths like migration can be improved if
> >>>> setting of the exec permission on the page can be deferred till actual use.
> >>>> There was a performance report [1] which highlighted the problem.
> >>>
> >>> How does this happen?  If the page was not executed, then it'll
> >>> (presumably) be non-present which won't require icache invalidation.
> >>> So, this would only be for pages that have been executed (and won't
> >>> again before the next migration), *or* for pages that were mapped
> >>> executable but never executed.
> >> I-cache invalidation happens while migrating a 'mapped and executable' page
> >> irrespective of whether that page was ever actually executed after being mapped there
> >> in the first place.
> > 
> > Ahh, got it.  I also assume that the Accessed bit on these platforms is
> > also managed similar to how we do it on x86 such that it can't be used
> > to drive invalidation decisions?
> 
> Drive I-cache invalidation? Could you please elaborate on this. Is not the
> access bit mechanism there to identify dirty pages after write faults when
> it is SW updated, or write accesses when HW updated? In the SW updated
> method, a given PTE goes through pte_young() during a page fault. Then how
> does one differentiate an exec fault/access from a write fault/access and
> decide to invalidate the I-cache? Just being curious.

The access flag is used to identify young/old pages only (the dirty bit
is used to track writes to a page). Depending on the Arm implementation,
the access bit/flag could be managed by hardware transparently, so no
fault taken to the kernel on accessing through an 'old' pte.
Anshuman Khandual Feb. 18, 2019, 9:16 a.m. UTC | #17
On 02/18/2019 02:34 PM, Catalin Marinas wrote:
> On Mon, Feb 18, 2019 at 02:01:55PM +0530, Anshuman Khandual wrote:
>> On 02/14/2019 10:25 PM, Dave Hansen wrote:
>>> On 2/13/19 8:12 PM, Anshuman Khandual wrote:
>>>> On 02/13/2019 09:14 PM, Dave Hansen wrote:
>>>>> On 2/13/19 12:06 AM, Anshuman Khandual wrote:
>>>>>> Setting an exec permission on a page normally triggers I-cache invalidation
>>>>>> which might be expensive. I-cache invalidation is not mandatory on a given
>>>>>> page if there is no immediate exec access on it. Non-fault modification of
>>>>>> user page table from generic memory paths like migration can be improved if
>>>>>> setting of the exec permission on the page can be deferred till actual use.
>>>>>> There was a performance report [1] which highlighted the problem.
>>>>>
>>>>> How does this happen?  If the page was not executed, then it'll
>>>>> (presumably) be non-present which won't require icache invalidation.
>>>>> So, this would only be for pages that have been executed (and won't
>>>>> again before the next migration), *or* for pages that were mapped
>>>>> executable but never executed.
>>>> I-cache invalidation happens while migrating a 'mapped and executable' page
>>>> irrespective of whether that page was ever actually executed after being mapped there
>>>> in the first place.
>>>
>>> Ahh, got it.  I also assume that the Accessed bit on these platforms is
>>> also managed similar to how we do it on x86 such that it can't be used
>>> to drive invalidation decisions?
>>
>> Drive I-cache invalidation? Could you please elaborate on this. Is not the
>> access bit mechanism there to identify dirty pages after write faults when
>> it is SW updated, or write accesses when HW updated? In the SW updated
>> method, a given PTE goes through pte_young() during a page fault. Then how
>> does one differentiate an exec fault/access from a write fault/access and
>> decide to invalidate the I-cache? Just being curious.
> 
> The access flag is used to identify young/old pages only (the dirty bit
> is used to track writes to a page). Depending on the Arm implementation,
> the access bit/flag could be managed by hardware transparently, so no
> fault taken to the kernel on accessing through an 'old' pte.

Then there is no way to identify an exec fault with either of these
facilities (access/reference bit or dirty bit), whether managed by SW or
HW. Still wondering about the previous comment where Dave mentioned how it
could be used to drive I-cache invalidation.
Dave Hansen Feb. 18, 2019, 6:20 p.m. UTC | #18
On 2/18/19 12:31 AM, Anshuman Khandual wrote:
>> Ahh, got it.  I also assume that the Accessed bit on these platforms is
>> also managed similar to how we do it on x86 such that it can't be used
>> to drive invalidation decisions?
> 
> Drive I-cache invalidation? Could you please elaborate on this. Is not the
> access bit mechanism there to identify dirty pages after write faults when
> it is SW updated, or write accesses when HW updated? In the SW updated
> method, a given PTE goes through pte_young() during a page fault. Then how
> does one differentiate an exec fault/access from a write fault/access and
> decide to invalidate the I-cache? Just being curious.

Let's say this was on x86 where the Accessed bit is set by the hardware
on any access.  Let's also say that Linux invalidated the TLB any time
that bit was cleared in software (it doesn't, but let's pretend it did).

In that case, if we needed to do icache invalidation, we could optimize
it by only invalidating the icache when we see the Accessed bit set.
That's because any execution would first set the Accessed bit before the
icache was populated.
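
As a sketch of that thought experiment (hypothetical; neither x86 nor arm64
does this today, and pte_young()/flush_icache_page() are existing generic
helpers):

	/* Only a page actually touched since the Accessed bit was last
	 * cleared can have entered the I-cache. */
	if (pte_young(pte))
		flush_icache_page(vma, page);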

So, my question:

>>>> Any idea which one it is?
>>>
>>> I am not sure about this particular reported case. But I was able to
>>> reproduce the problem through a test case where a buffer was mapped with
>>> R|W|X, faulted/mapped in through a write, migrated, and then executed from.
>>
>> Could you make sure, please?
> 
> The test in the report [1] does not create any explicit PROT_EXEC maps and just
> attempts to migrate all pages of the process (which has 10 child processes)
> including the exec pages. So the only exec mappings would be the primary text
> segment and the mapped shared glibc segment. Looks like the shared libraries
> have some mapped pages.

Yeah, but the executable ones are also read-only in your example:

> $cat /proc/[PID]/numa_maps  | grep libc
> 
> ffffaa4c9000 default file=/lib/aarch64-linux-gnu/libc-2.28.so mapped=150 mapmax=57 N0=150 kernelpagesize_kB=4

^ These are all page-cache, executable and read-only.

> ffffaa621000 default file=/lib/aarch64-linux-gnu/libc-2.28.so
> ffffaa630000 default file=/lib/aarch64-linux-gnu/libc-2.28.so anon=4 dirty=4 mapmax=11 N0=4 kernelpagesize_kB=4
> ffffaa634000 default file=/lib/aarch64-linux-gnu/libc-2.28.so anon=2 dirty=2 mapmax=11 N0=2 kernelpagesize_kB=4

This last one is the only read-write one and it's not executable.


>> Write and Execute at the same time are generally a "bad idea".  Given
> 
> But won't this be the case for all run-time generated code which gets
> written to a buffer before being executed from there?

No.  They usually are r=1,w=1,x=0, then transition to r=1,w=0,x=1.  It's
never simultaneously executable and writable.
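
That is, the typical JIT sequence looks like this (hypothetical sketch;
emit_code() stands in for whatever generates the instructions):

	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	emit_code(buf, len);				/* r=1,w=1,x=0 */
	__builtin___clear_cache((char *)buf, (char *)buf + len);
	mprotect(buf, len, PROT_READ | PROT_EXEC);	/* r=1,w=0,x=1 */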

>> the hardware, I'm not surprised that this problem pops up, but it would
>> be great to find out if this is a real application, or a "doctor it
>> hurts when I do this."
> 
> Is not that a problem though :)

The point is that it's not a real-world problem.  You can certainly
expose this, but do *real* apps do this rather than something entirely
synthetic?

>> This set generally seems to be assuming an environment with "lots of
>> migration, and not much execution".  That seems like a kinda odd
>> situation to me.
> 
> Irrespective of the reported problem, which is user driven, there are many
> kernel triggered migrations which can accumulate I-cache invalidation cost
> over time on a memory heavy system with a high number of exec enabled user
> pages. Will that be such a rare situation?
> 
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620357.html

I translate "trivial C application" to "highly synthetic microbenchmark".

I suspect what's happening here is that somebody wrote a micro that
worked well on x86, although it was being rather stupid.  Somebody got
an arm system, and voila: it's slower.  Someone says "Oh, this arm
system is slower than x86!"

Again, the big question: do you have real-world applications with writable,
executable pages?  The kernel essentially has *zero* of these because
they're such a massive security risk.  Adding this feature will
encourage folks to replicate this massive security risk in userspace.

Seems like a bad idea.