| Message ID | 20170809200755.11234-5-tycho@docker.com (mailing list archive) |
|---|---|
| State | New, archived |
On Wed, Aug 09, 2017 at 02:07:49PM -0600, Tycho Andersen wrote:
> From: Juerg Haefliger <juerg.haefliger@hpe.com>
>
> Add a hook for flushing a single TLB entry on arm64.
>
> Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
> Tested-by: Tycho Andersen <tycho@docker.com>
> ---
>  arch/arm64/include/asm/tlbflush.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index af1c76981911..8e0c49105d3e 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -184,6 +184,14 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
>  	isb();
>  }
>
> +static inline void __flush_tlb_one(unsigned long addr)
> +{
> +	dsb(ishst);
> +	__tlbi(vaae1is, addr >> 12);
> +	dsb(ish);
> +	isb();
> +}

Is this going to be called by generic code?

It would be nice if we could drop 'kernel' into the name, to make it
clear this is intended to affect the kernel mappings, which have
different maintenance requirements to user mappings.

We should be able to implement this more simply as:

static inline void flush_tlb_kernel_page(unsigned long addr)
{
	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
}

Thanks,
Mark.
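[Editorial note: for readers less familiar with arm64 TLB maintenance, the proposed helper is repeated below with each step annotated. The comments are editorial additions, not part of the submitted patch.]

```c
/* The helper from the patch above, with editorial annotations. */
static inline void __flush_tlb_one(unsigned long addr)
{
	/*
	 * Make any prior page-table updates visible to the page-table
	 * walkers before issuing the invalidate (DSB: inner-shareable
	 * domain, stores only).
	 */
	dsb(ishst);

	/*
	 * TLBI VAAE1IS: invalidate by VA, All ASIDs, EL1, Inner
	 * Shareable -- broadcast to every CPU in the inner-shareable
	 * domain. The register operand carries VA[55:12], hence the
	 * shift by 12.
	 */
	__tlbi(vaae1is, addr >> 12);

	/* Wait for the broadcast invalidation to complete everywhere... */
	dsb(ish);

	/* ...then synchronize this CPU's context before continuing. */
	isb();
}
```

Note that the "IS" (Inner Shareable) TLBI variants broadcast to all CPUs in the inner-shareable domain, so unlike the x86 __flush_tlb_one() discussed later in this thread, this arm64 helper is not a local-only flush.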
Hi Mark,

On Sat, Aug 12, 2017 at 12:26:03PM +0100, Mark Rutland wrote:
> On Wed, Aug 09, 2017 at 02:07:49PM -0600, Tycho Andersen wrote:
> > From: Juerg Haefliger <juerg.haefliger@hpe.com>
> >
> > Add a hook for flushing a single TLB entry on arm64.
> >
> > Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
> > Tested-by: Tycho Andersen <tycho@docker.com>
> > ---
> >  arch/arm64/include/asm/tlbflush.h | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> > index af1c76981911..8e0c49105d3e 100644
> > --- a/arch/arm64/include/asm/tlbflush.h
> > +++ b/arch/arm64/include/asm/tlbflush.h
> > @@ -184,6 +184,14 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
> >  	isb();
> >  }
> >
> > +static inline void __flush_tlb_one(unsigned long addr)
> > +{
> > +	dsb(ishst);
> > +	__tlbi(vaae1is, addr >> 12);
> > +	dsb(ish);
> > +	isb();
> > +}
>
> Is this going to be called by generic code?

Yes, it's called in mm/xpfo.c:xpfo_kunmap.

> It would be nice if we could drop 'kernel' into the name, to make it
> clear this is intended to affect the kernel mappings, which have
> different maintenance requirements to user mappings.
>
> We should be able to implement this more simply as:
>
> static inline void flush_tlb_kernel_page(unsigned long addr)
> {
> 	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> }

It's named __flush_tlb_one after the x86 (and a few other arches)
function of the same name. I can change it to flush_tlb_kernel_page,
but then we'll need some x86-specific code to map the name as well.

Maybe since it's called from generic code that's warranted though?
I'll change the implementation for now; let me know what you want to
do about the name.

Cheers,

Tycho
On Mon, Aug 14, 2017 at 10:35:36AM -0600, Tycho Andersen wrote:
> Hi Mark,
>
> On Sat, Aug 12, 2017 at 12:26:03PM +0100, Mark Rutland wrote:
> > On Wed, Aug 09, 2017 at 02:07:49PM -0600, Tycho Andersen wrote:
> > > +static inline void __flush_tlb_one(unsigned long addr)
> > > +{
> > > +	dsb(ishst);
> > > +	__tlbi(vaae1is, addr >> 12);
> > > +	dsb(ish);
> > > +	isb();
> > > +}
> >
> > Is this going to be called by generic code?
>
> Yes, it's called in mm/xpfo.c:xpfo_kunmap.
>
> > It would be nice if we could drop 'kernel' into the name, to make it
> > clear this is intended to affect the kernel mappings, which have
> > different maintenance requirements to user mappings.
>
> It's named __flush_tlb_one after the x86 (and a few other arches)
> function of the same name. I can change it to flush_tlb_kernel_page,
> but then we'll need some x86-specific code to map the name as well.
>
> Maybe since it's called from generic code that's warranted though?
> I'll change the implementation for now; let me know what you want to
> do about the name.

I think it would be preferable to do so, to align with
flush_tlb_kernel_range(), which is an existing generic interface.

That said, is there any reason not to use flush_tlb_kernel_range()
directly?

Thanks,
Mark.
On Mon, Aug 14, 2017 at 05:50:47PM +0100, Mark Rutland wrote:
> On Mon, Aug 14, 2017 at 10:35:36AM -0600, Tycho Andersen wrote:
> > Hi Mark,
> >
> > On Sat, Aug 12, 2017 at 12:26:03PM +0100, Mark Rutland wrote:
> > > On Wed, Aug 09, 2017 at 02:07:49PM -0600, Tycho Andersen wrote:
> > > > +static inline void __flush_tlb_one(unsigned long addr)
> > > > +{
> > > > +	dsb(ishst);
> > > > +	__tlbi(vaae1is, addr >> 12);
> > > > +	dsb(ish);
> > > > +	isb();
> > > > +}
> > >
> > > Is this going to be called by generic code?
> >
> > Yes, it's called in mm/xpfo.c:xpfo_kunmap.
> >
> > > It would be nice if we could drop 'kernel' into the name, to make it
> > > clear this is intended to affect the kernel mappings, which have
> > > different maintenance requirements to user mappings.
>
> > It's named __flush_tlb_one after the x86 (and a few other arches)
> > function of the same name. I can change it to flush_tlb_kernel_page,
> > but then we'll need some x86-specific code to map the name as well.
> >
> > Maybe since it's called from generic code that's warranted though?
> > I'll change the implementation for now; let me know what you want to
> > do about the name.
>
> I think it would be preferable to do so, to align with
> flush_tlb_kernel_range(), which is an existing generic interface.
>
> That said, is there any reason not to use flush_tlb_kernel_range()
> directly?

I don't think so; I'll change the generic code to that and drop this
patch.

Thanks!

Tycho
Hi Mark,

On Mon, Aug 14, 2017 at 05:50:47PM +0100, Mark Rutland wrote:
> That said, is there any reason not to use flush_tlb_kernel_range()
> directly?

So it turns out that there is a difference between __flush_tlb_one() and
flush_tlb_kernel_range() on x86: flush_tlb_kernel_range() flushes all
the TLBs via on_each_cpu(), whereas __flush_tlb_one() only flushes the
local TLB (which I think is enough here). As you might expect, this is
quite a performance hit (at least under kvm); I ran a little kernbench:

# __flush_tlb_one
Wed Aug 23 15:47:33 UTC 2017
4.13.0-rc5+
Average Half load -j 2 Run (std deviation):
Elapsed Time 50.3233 (1.82716)
User Time 87.1233 (1.26871)
System Time 15.36 (0.500899)
Percent CPU 203.667 (4.04145)
Context Switches 7350.33 (1339.65)
Sleeps 16008.3 (980.362)

Average Optimal load -j 4 Run (std deviation):
Elapsed Time 27.4267 (0.215019)
User Time 88.6983 (1.91501)
System Time 13.1933 (2.39488)
Percent CPU 286.333 (90.6083)
Context Switches 11393 (4509.14)
Sleeps 15764.7 (698.048)

# flush_tlb_kernel_range()
Wed Aug 23 16:00:03 UTC 2017
4.13.0-rc5+
Average Half load -j 2 Run (std deviation):
Elapsed Time 86.57 (1.06099)
User Time 103.25 (1.85475)
System Time 75.4433 (0.415852)
Percent CPU 205.667 (3.21455)
Context Switches 9363.33 (1361.57)
Sleeps 14703.3 (1439.12)

Average Optimal load -j 4 Run (std deviation):
Elapsed Time 51.27 (0.615873)
User Time 110.328 (7.93884)
System Time 74.06 (1.55788)
Percent CPU 288 (90.2197)
Context Switches 16557.5 (7930.01)
Sleeps 14774.7 (921.746)

So, I think we need to keep something like __flush_tlb_one around. I'll
call it flush_one_local_tlb() for now, and will cc x86@ on the next
version to see if they have any insight.

Cheers,

Tycho
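[Editorial note: the gap Tycho measured comes down to the two x86 paths sketched below, paraphrased from the v4.13-era arch/x86 code with VM-event counters and the flush-ceiling heuristic elided; see arch/x86/mm/tlb.c of that era for the real implementation.]

```c
/*
 * Simplified sketch of the two v4.13-era x86 paths being compared
 * (bookkeeping and the flush-ceiling heuristic elided).
 */

/* Local-only: invalidate one entry on the *current* CPU. */
static inline void __flush_tlb_one(unsigned long addr)
{
	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
}

struct flush_tlb_info {
	unsigned long start;
	unsigned long end;
};

/* Runs on every CPU, in IPI context. */
static void do_kernel_range_flush(void *info)
{
	struct flush_tlb_info *f = info;
	unsigned long addr;

	/* Flush the range entry by entry on this CPU. */
	for (addr = f->start; addr < f->end; addr += PAGE_SIZE)
		__flush_tlb_one(addr);
}

/* Cross-CPU: IPIs every online CPU and flushes there, too. */
void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
	struct flush_tlb_info info = { .start = start, .end = end };

	on_each_cpu(do_kernel_range_flush, &info, 1);
}
```

The IPI broadcast in on_each_cpu() on every unmap is what shows up as the roughly fivefold system-time increase in the kernbench runs above.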
On Wed, Aug 23, 2017 at 10:58:42AM -0600, Tycho Andersen wrote:
> Hi Mark,
>
> On Mon, Aug 14, 2017 at 05:50:47PM +0100, Mark Rutland wrote:
> > That said, is there any reason not to use flush_tlb_kernel_range()
> > directly?
>
> So it turns out that there is a difference between __flush_tlb_one() and
> flush_tlb_kernel_range() on x86: flush_tlb_kernel_range() flushes all
> the TLBs via on_each_cpu(), whereas __flush_tlb_one() only flushes the
> local TLB (which I think is enough here).

That sounds suspicious; I don't think that __flush_tlb_one() is
sufficient.

If you only do local TLB maintenance, then the page is left accessible
to other CPUs via the (stale) kernel mappings, i.e. the page isn't
exclusively mapped by userspace.

Thanks,
Mark.
On 08/23/2017 07:04 PM, Mark Rutland wrote:
> On Wed, Aug 23, 2017 at 10:58:42AM -0600, Tycho Andersen wrote:
>> Hi Mark,
>>
>> On Mon, Aug 14, 2017 at 05:50:47PM +0100, Mark Rutland wrote:
>>> That said, is there any reason not to use flush_tlb_kernel_range()
>>> directly?
>>
>> So it turns out that there is a difference between __flush_tlb_one() and
>> flush_tlb_kernel_range() on x86: flush_tlb_kernel_range() flushes all
>> the TLBs via on_each_cpu(), whereas __flush_tlb_one() only flushes the
>> local TLB (which I think is enough here).
>
> That sounds suspicious; I don't think that __flush_tlb_one() is
> sufficient.
>
> If you only do local TLB maintenance, then the page is left accessible
> to other CPUs via the (stale) kernel mappings, i.e. the page isn't
> exclusively mapped by userspace.

We flush all CPUs to get rid of stale entries when a new page is
allocated to userspace that was previously allocated to the kernel.
Is that the scenario you were thinking of?

...Juerg

> Thanks,
> Mark.
On Wed, Aug 30, 2017 at 07:31:25AM +0200, Juerg Haefliger wrote:
>
>
> On 08/23/2017 07:04 PM, Mark Rutland wrote:
> > On Wed, Aug 23, 2017 at 10:58:42AM -0600, Tycho Andersen wrote:
> >> Hi Mark,
> >>
> >> On Mon, Aug 14, 2017 at 05:50:47PM +0100, Mark Rutland wrote:
> >>> That said, is there any reason not to use flush_tlb_kernel_range()
> >>> directly?
> >>
> >> So it turns out that there is a difference between __flush_tlb_one() and
> >> flush_tlb_kernel_range() on x86: flush_tlb_kernel_range() flushes all
> >> the TLBs via on_each_cpu(), whereas __flush_tlb_one() only flushes the
> >> local TLB (which I think is enough here).
> >
> > That sounds suspicious; I don't think that __flush_tlb_one() is
> > sufficient.
> >
> > If you only do local TLB maintenance, then the page is left accessible
> > to other CPUs via the (stale) kernel mappings, i.e. the page isn't
> > exclusively mapped by userspace.
>
> We flush all CPUs to get rid of stale entries when a new page is
> allocated to userspace that was previously allocated to the kernel.
> Is that the scenario you were thinking of?

I think there are two cases: the one you describe above, where the
pages are first allocated, and a second one, where e.g. the pages are
mapped into the kernel because of DMA or whatever. In the case you
describe above, I think we're doing the right thing (which is why my
test worked correctly, because it tested this case).

In the second case, when the pages are unmapped (i.e. the kernel is
done doing DMA), do we need to flush the other CPUs' TLBs? I think the
current code is not quite correct, because if multiple tasks (CPUs)
map the pages, only the TLB of the last one is flushed when the
mapping is cleared, because the TLB is only flushed when ->mapcount
drops to zero, leaving stale entries in the other TLBs. It's not clear
to me what to do about this case.

Thoughts?

Tycho

> ...Juerg
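[Editorial note: the ->mapcount behavior Tycho describes has roughly the shape sketched below. This is a hypothetical paraphrase built from his description, not the verbatim series; set_kpte() and flush_one_local_tlb() are names taken from the series and from Tycho's earlier message, respectively.]

```c
/* Minimal per-page state for this sketch. */
struct xpfo {
	atomic_t mapcount;	/* number of concurrent kernel mappings */
	struct page *page;
};

/*
 * Hypothetical paraphrase of the xpfo_kunmap() path under discussion,
 * reconstructed from the description above -- not the verbatim patch.
 */
static void xpfo_kunmap_sketch(void *kaddr, struct xpfo *xpfo)
{
	/*
	 * Earlier unmappers just drop their reference and return,
	 * performing no TLB maintenance at all.
	 */
	if (!atomic_dec_and_test(&xpfo->mapcount))
		return;

	/* Last unmapper: remove the kernel mapping for the page... */
	set_kpte(kaddr, xpfo->page, __pgprot(0));

	/*
	 * ...and flush. With a local-only flush, only the TLB of the
	 * CPU that happens to run this final kunmap is cleaned; any
	 * other CPU that mapped the page earlier keeps a stale entry.
	 */
	flush_one_local_tlb((unsigned long)kaddr);
}
```

Whether that final flush must be a broadcast, and whether earlier unmappers need maintenance of their own, is exactly the open question in the rest of the thread.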
On 08/30/2017 06:47 PM, Tycho Andersen wrote:
> On Wed, Aug 30, 2017 at 07:31:25AM +0200, Juerg Haefliger wrote:
>>
>>
>> On 08/23/2017 07:04 PM, Mark Rutland wrote:
>>> On Wed, Aug 23, 2017 at 10:58:42AM -0600, Tycho Andersen wrote:
>>>> Hi Mark,
>>>>
>>>> On Mon, Aug 14, 2017 at 05:50:47PM +0100, Mark Rutland wrote:
>>>>> That said, is there any reason not to use flush_tlb_kernel_range()
>>>>> directly?
>>>>
>>>> So it turns out that there is a difference between __flush_tlb_one() and
>>>> flush_tlb_kernel_range() on x86: flush_tlb_kernel_range() flushes all
>>>> the TLBs via on_each_cpu(), whereas __flush_tlb_one() only flushes the
>>>> local TLB (which I think is enough here).
>>>
>>> That sounds suspicious; I don't think that __flush_tlb_one() is
>>> sufficient.
>>>
>>> If you only do local TLB maintenance, then the page is left accessible
>>> to other CPUs via the (stale) kernel mappings, i.e. the page isn't
>>> exclusively mapped by userspace.
>>
>> We flush all CPUs to get rid of stale entries when a new page is
>> allocated to userspace that was previously allocated to the kernel.
>> Is that the scenario you were thinking of?
>
> I think there are two cases: the one you describe above, where the
> pages are first allocated, and a second one, where e.g. the pages are
> mapped into the kernel because of DMA or whatever. In the case you
> describe above, I think we're doing the right thing (which is why my
> test worked correctly, because it tested this case).
>
> In the second case, when the pages are unmapped (i.e. the kernel is
> done doing DMA), do we need to flush the other CPUs' TLBs? I think the
> current code is not quite correct, because if multiple tasks (CPUs)
> map the pages, only the TLB of the last one is flushed when the
> mapping is cleared, because the TLB is only flushed when ->mapcount
> drops to zero, leaving stale entries in the other TLBs. It's not clear
> to me what to do about this case.

For this to happen, multiple CPUs need to have the same userspace page
mapped at the same time. Is this a valid scenario?

...Juerg

> Thoughts?
>
> Tycho
On Thu, Aug 31, 2017 at 11:43:53AM +0200, Juerg Haefliger wrote:
> On 08/30/2017 06:47 PM, Tycho Andersen wrote:
> > On Wed, Aug 30, 2017 at 07:31:25AM +0200, Juerg Haefliger wrote:
> >>
> >>
> >> On 08/23/2017 07:04 PM, Mark Rutland wrote:
> >>> On Wed, Aug 23, 2017 at 10:58:42AM -0600, Tycho Andersen wrote:
> >>>> Hi Mark,
> >>>>
> >>>> On Mon, Aug 14, 2017 at 05:50:47PM +0100, Mark Rutland wrote:
> >>>>> That said, is there any reason not to use flush_tlb_kernel_range()
> >>>>> directly?
> >>>>
> >>>> So it turns out that there is a difference between __flush_tlb_one() and
> >>>> flush_tlb_kernel_range() on x86: flush_tlb_kernel_range() flushes all
> >>>> the TLBs via on_each_cpu(), whereas __flush_tlb_one() only flushes the
> >>>> local TLB (which I think is enough here).
> >>>
> >>> That sounds suspicious; I don't think that __flush_tlb_one() is
> >>> sufficient.
> >>>
> >>> If you only do local TLB maintenance, then the page is left accessible
> >>> to other CPUs via the (stale) kernel mappings, i.e. the page isn't
> >>> exclusively mapped by userspace.
> >>
> >> We flush all CPUs to get rid of stale entries when a new page is
> >> allocated to userspace that was previously allocated to the kernel.
> >> Is that the scenario you were thinking of?
> >
> > I think there are two cases: the one you describe above, where the
> > pages are first allocated, and a second one, where e.g. the pages are
> > mapped into the kernel because of DMA or whatever. In the case you
> > describe above, I think we're doing the right thing (which is why my
> > test worked correctly, because it tested this case).
> >
> > In the second case, when the pages are unmapped (i.e. the kernel is
> > done doing DMA), do we need to flush the other CPUs' TLBs? I think the
> > current code is not quite correct, because if multiple tasks (CPUs)
> > map the pages, only the TLB of the last one is flushed when the
> > mapping is cleared, because the TLB is only flushed when ->mapcount
> > drops to zero, leaving stale entries in the other TLBs. It's not clear
> > to me what to do about this case.
>
> For this to happen, multiple CPUs need to have the same userspace page
> mapped at the same time. Is this a valid scenario?

I believe so. I think you could trigger that with a multi-threaded
application running across several CPUs. All those threads would share
the same page tables.

Thanks,
Mark.
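[Editorial note: Mark's scenario in code form, as a minimal userspace illustration (not from the thread). Any ordinary multi-threaded program qualifies: its threads share one set of page tables, so several CPUs can cache TLB entries for the same user page at once.]

```c
#include <pthread.h>
#include <stdlib.h>

/* One shared buffer; both threads fault it in, so the CPUs they run on
 * can each end up with a TLB entry for the same user page. */
static char *shared;

static void *touch(void *arg)
{
	/* Each thread writes the shared page from whichever CPU the
	 * scheduler picks; the page tables (and the page) are common. */
	shared[0] = 1;
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	shared = malloc(4096);
	pthread_create(&a, NULL, touch, NULL);
	pthread_create(&b, NULL, touch, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	free(shared);
	return 0;
}
```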
```diff
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index af1c76981911..8e0c49105d3e 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -184,6 +184,14 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
 	isb();
 }
 
+static inline void __flush_tlb_one(unsigned long addr)
+{
+	dsb(ishst);
+	__tlbi(vaae1is, addr >> 12);
+	dsb(ish);
+	isb();
+}
+
 /*
  * Used to invalidate the TLB (walk caches) corresponding to intermediate page
  * table levels (pgd/pud/pmd).
```