
[V2] arm64: optimized copy_to_user and copy_from_user assembly code

Message ID 1407538940-9167-1-git-send-email-fkan@apm.com (mailing list archive)
State New, archived

Commit Message

Feng Kan Aug. 8, 2014, 11:02 p.m. UTC
Use the glibc cortex-strings work authored by Linaro as the base to
create new copy to/from user kernel routines.

Iperf performance increase:
		-l (size)		1 core result
Optimized 	64B			44-51Mb/s
		1500B			4.9Gb/s
		30000B			16.2Gb/s
Original	64B			34-50.7Mb/s
		1500B			4.7Gb/s
		30000B			14.5Gb/s

Signed-off-by: Feng Kan <fkan@apm.com>
---
 arch/arm64/lib/copy_from_user.S |  36 +-----
 arch/arm64/lib/copy_template.S  | 278 ++++++++++++++++++++++++++++++++++++++++
 arch/arm64/lib/copy_to_user.S   |  31 +----
 3 files changed, 284 insertions(+), 61 deletions(-)
 create mode 100644 arch/arm64/lib/copy_template.S

Comments

Radha Mohan Aug. 11, 2014, 3:01 a.m. UTC | #1
Hi Feng,


> +
> +.Lcpy_not_short:
> +       /*
> +        * We don't much care about the alignment of DST, but we want SRC
> +        * to be 128-bit (16 byte) aligned so that we don't cross cache line
> +        * boundaries on both loads and stores.
> +        */

Could you please explain why destination alignment is not an issue? Is
this a generic implementation that you are referring to, or is it
specific to your platform?

Feng Kan Aug. 11, 2014, 6:05 p.m. UTC | #2
On Sun, Aug 10, 2014 at 8:01 PM, Radha Mohan <mohun106@gmail.com> wrote:
> Hi Feng,
>
>
>> +
>> +.Lcpy_not_short:
>> +       /*
>> +        * We don't much care about the alignment of DST, but we want SRC
>> +        * to be 128-bit (16 byte) aligned so that we don't cross cache line
>> +        * boundaries on both loads and stores.
>> +        */
>
> Could you please explain why destination alignment is not an issue? Is
> this a generic implementation that you are referring to, or is it
> specific to your platform?
This follows the Linaro cortex-strings optimization routines.

https://launchpad.net/cortex-strings

Zhichang submitted something similar for memcpy based on the
same optimization work.

Sorry, resending in text mode.
zhichang.yuan@linaro.org Aug. 13, 2014, 3:13 a.m. UTC | #3
Hi Feng,

On 2014/08/12 02:05, Feng Kan wrote:
> On Sun, Aug 10, 2014 at 8:01 PM, Radha Mohan <mohun106@gmail.com> wrote:
>> Hi Feng,
>>
>>
>>> +
>>> +.Lcpy_not_short:
>>> +       /*
>>> +        * We don't much care about the alignment of DST, but we want SRC
>>> +        * to be 128-bit (16 byte) aligned so that we don't cross cache line
>>> +        * boundaries on both loads and stores.
>>> +        */
>> Could you please explain why destination alignment is not an issue? Is
>> this a generic implementation that you are referring to, or is it
>> specific to your platform?
> This follows the Linaro cortex-strings optimization routines.
>
> https://launchpad.net/cortex-strings
>
> Zhichang submitted something similar for memcpy based on the
> same optimization work.
>
> Sorry, resending in text mode.

If both dst and src are unaligned and their alignment offsets are not equal, I haven't found a better way
to handle it.
Luckily, ARMv8 supports unaligned memory accesses.
At the beginning of my patch work, I also thought it might be better for all loads and stores to be aligned. I
wrote the code just like the ARMv7 memcpy: first load the data from SRC, buffer it in several
registers and combine it into a new 16-byte word, then store it to the aligned DST. But the performance was a
bit worse.
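
For reference, a minimal sketch of that rejected approach, simplified to 8 bytes
per iteration (the register assignments, the shift setup and the label are
illustrative only, not taken from any submitted patch):

	/*
	 * Hypothetical ARMv7-style inner loop: x1 has been rounded down to an
	 * 8-byte boundary, x0 is 8-byte aligned, x5 holds the misalignment in
	 * bits (assumed non-zero), x6 = 64 - x5, and x8 already contains the
	 * first aligned source word.  Copies 8 bytes per iteration.
	 */
1:	lsr	x7, x8, x5		// tail of the previous aligned word
	ldr	x8, [x1], #8		// fetch the next aligned source word
	lsl	x9, x8, x6		// head of the new word
	orr	x7, x7, x9		// combine into one destination word
	str	x7, [x0], #8		// aligned store to the destination
	subs	x2, x2, #8
	b.ne	1b

The extra lsr/lsl/orr per word illustrates the additional work in the critical
loop that the measurements found to be slower.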

~Zhichang

Philipp Tomsich Aug. 13, 2014, 8:19 a.m. UTC | #4
Feng & Zhichang,

On 13 Aug 2014, at 05:13 , zhichang.yuan <zhichang.yuan@linaro.org> wrote:

> If both dst and src are unaligned and their alignment offsets are not equal, I haven't found a better way
> to handle it.
> Luckily, ARMv8 supports unaligned memory accesses.
> At the beginning of my patch work, I also thought it might be better for all loads and stores to be aligned. I
> wrote the code just like the ARMv7 memcpy: first load the data from SRC, buffer it in several
> registers and combine it into a new 16-byte word, then store it to the aligned DST. But the performance was a
> bit worse.

When looking at the underlying effects in the execution pipeline, the store operations are non-critical for throughput; what we need to optimize is the throughput of the load operations. This is because the stores have no dependent operations, and the store pipeline will absorb any lingering effects of the misalignment (given that mechanisms like write-allocate make the cache effects of stores more difficult to predict, I'm glad we don't have to go into too much detail on those). There is no throughput/bandwidth limit on the store pipeline that we could even theoretically hit with such a loop.

The load operations are much more critical to what we are trying to achieve: our progress through the loop depends on the loads returning their results before we can issue the associated stores, so we need to ensure optimal and deterministic throughput on them. As a misaligned load is likely to carry a penalty (e.g. on XGene it will typically carry a small penalty when crossing a cache line, especially if the second cache line isn't cached yet), we need to avoid misaligned loads.

If we tried to buffer the data and then perform aligned stores, we would only introduce additional instructions and latency into our critical loop.
At the same time, given that misaligned store operations are essentially free (as argued above), there is no benefit to be gained from the extra work required.
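
To make the source-alignment prologue concrete, here is the relevant fragment
from the patch annotated with a worked example (the address 0x1003 is purely
hypothetical):

	neg	tmp2, src		// src = 0x1003  ->  tmp2 = 0x...effd
	ands	tmp2, tmp2, #15		// tmp2 = 13, bytes until src is 16-byte aligned
	b.eq	2f			// already aligned: skip the prologue
	sub	count, count, tmp2	// account for the 13 bytes handled here
	USER(11f, ldp A_l, A_h, [src])	// one unaligned 16-byte load from 0x1003
	add	src, src, tmp2		// src = 0x1010, now 16-byte aligned
	USER(11f, stp A_l, A_h, [dst])	// bytes 13..15 will be written again by the
	add	dst, dst, tmp2		// next aligned chunk, with the same data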

I hope this explains the observed behaviour somewhat better.

Cheers,
Philipp.
Steve Capper Nov. 27, 2014, 11:43 a.m. UTC | #5
On Fri, Aug 08, 2014 at 04:02:20PM -0700, Feng Kan wrote:
> Use the glibc cortex-strings work authored by Linaro as the base to
> create new copy to/from user kernel routines.
> 
> Iperf performance increase:
> 		-l (size)		1 core result
> Optimized 	64B			44-51Mb/s
> 		1500B			4.9Gb/s
> 		30000B			16.2Gb/s
> Original	64B			34-50.7Mb/s
> 		1500B			4.7Gb/s
> 		30000B			14.5Gb/s

I ran into some horrible crashes with this patch; it had been applied
to the Ubuntu kernel I was running. Details below:

> 
> Signed-off-by: Feng Kan <fkan@apm.com>
> ---
>  arch/arm64/lib/copy_from_user.S |  36 +-----
>  arch/arm64/lib/copy_template.S  | 278 ++++++++++++++++++++++++++++++++++++++++
>  arch/arm64/lib/copy_to_user.S   |  31 +----
>  3 files changed, 284 insertions(+), 61 deletions(-)
>  create mode 100644 arch/arm64/lib/copy_template.S
> 
> diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S
> index 5e27add..c4c5187 100644
> --- a/arch/arm64/lib/copy_from_user.S
> +++ b/arch/arm64/lib/copy_from_user.S
> @@ -15,7 +15,6 @@
>   */
>  
>  #include <linux/linkage.h>
> -#include <asm/assembler.h>
>  
>  /*
>   * Copy from user space to a kernel buffer (alignment handled by the hardware)
> @@ -28,39 +27,10 @@
>   *	x0 - bytes not copied
>   */
>  ENTRY(__copy_from_user)
> -	add	x4, x1, x2			// upper user buffer boundary
> -	subs	x2, x2, #8
> -	b.mi	2f
> -1:
> -USER(9f, ldr	x3, [x1], #8	)
> -	subs	x2, x2, #8
> -	str	x3, [x0], #8
> -	b.pl	1b
> -2:	adds	x2, x2, #4
> -	b.mi	3f
> -USER(9f, ldr	w3, [x1], #4	)
> -	sub	x2, x2, #4
> -	str	w3, [x0], #4
> -3:	adds	x2, x2, #2
> -	b.mi	4f
> -USER(9f, ldrh	w3, [x1], #2	)
> -	sub	x2, x2, #2
> -	strh	w3, [x0], #2
> -4:	adds	x2, x2, #1
> -	b.mi	5f
> -USER(9f, ldrb	w3, [x1]	)
> -	strb	w3, [x0]
> -5:	mov	x0, #0
> -	ret
> +#include "copy_template.S"
>  ENDPROC(__copy_from_user)
>  
>  	.section .fixup,"ax"
> -	.align	2
> -9:	sub	x2, x4, x1
> -	mov	x3, x2
> -10:	strb	wzr, [x0], #1			// zero remaining buffer space
> -	subs	x3, x3, #1
> -	b.ne	10b
> -	mov	x0, x2				// bytes not copied
> -	ret
> +	.align    2
> +	copy_abort_table
>  	.previous
> diff --git a/arch/arm64/lib/copy_template.S b/arch/arm64/lib/copy_template.S
> new file mode 100644
> index 0000000..f2c7003
> --- /dev/null
> +++ b/arch/arm64/lib/copy_template.S
> @@ -0,0 +1,278 @@
> +/*
> + * Copyright (c) 2013, Applied Micro Circuits Corporation
> + * Copyright (c) 2012-2013, Linaro Limited
> + *
> + * Author: Feng Kan <fkan@apm.com>
> + * Author: Philipp Tomsich <philipp.tomsich@theobroma-systems.com>
> + *
> + * The code is adopted from the memcpy routine by Linaro Limited.
> + *
> + * This file is free software: you may copy, redistribute and/or modify it
> + * under the terms of the GNU General Public License as published by the
> + * Free Software Foundation, either version 2 of the License, or (at your
> + * option) any later version.
> + *
> + * This file is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> + *
> + * This file incorporates work covered by the following copyright and
> + * permission notice:
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions are met:
> + *      1 Redistributions of source code must retain the above copyright
> + *        notice, this list of conditions and the following disclaimer.
> + *      2 Redistributions in binary form must reproduce the above copyright
> + *        notice, this list of conditions and the following disclaimer in the
> + *        documentation and/or other materials provided with the distribution.
> + *      3 Neither the name of the Linaro nor the
> + *        names of its contributors may be used to endorse or promote products
> + *        derived from this software without specific prior written permission.
> + *
> + *  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +#include <asm/assembler.h>
> +
> +dstin	.req x0
> +src	.req x1
> +count	.req x2
> +tmp1	.req x3
> +tmp1w	.req w3
> +tmp2	.req x4
> +tmp2w	.req w4
> +tmp3	.req x5
> +tmp3w	.req w5
> +dst	.req x6
> +
> +A_l	.req x7
> +A_h	.req x8
> +B_l	.req x9
> +B_h	.req x10
> +C_l	.req x11
> +C_h	.req x12
> +D_l	.req x13
> +D_h	.req x14
> +
> +	mov	dst, dstin
> +	cmp	count, #64
> +	b.ge	.Lcpy_not_short
> +	cmp	count, #15
> +	b.le	.Ltail15tiny
> +
> +	/*
> +	 * Deal with small copies quickly by dropping straight into the
> +	 * exit block.
> +	 */
> +.Ltail63:
> +	/*
> +	 * Copy up to 48 bytes of data.  At this point we only need the
> +	 * bottom 6 bits of count to be accurate.
> +	 */
> +	ands	tmp1, count, #0x30
> +	b.eq	.Ltail15
> +	add	dst, dst, tmp1
> +	add	src, src, tmp1
> +	cmp	tmp1w, #0x20
> +	b.eq	1f
> +	b.lt	2f
> +	USER(8f, ldp A_l, A_h, [src, #-48])
> +	USER(8f, stp A_l, A_h, [dst, #-48])
> +1:
> +	USER(8f, ldp A_l, A_h, [src, #-32])
> +	USER(8f, stp A_l, A_h, [dst, #-32])
> +2:
> +	USER(8f, ldp A_l, A_h, [src, #-16])
> +	USER(8f, stp A_l, A_h, [dst, #-16])
> +
> +.Ltail15:
> +	ands	count, count, #15
> +	beq	1f
> +	add	src, src, count
> +	USER(9f, ldp A_l, A_h, [src, #-16])
> +	add	dst, dst, count
> +	USER(9f, stp A_l, A_h, [dst, #-16])
> +1:
> +	b	.Lsuccess
> +
> +.Ltail15tiny:
> +	/*
> +	 * Copy up to 15 bytes of data.  Does not assume additional data
> +	 * being copied.
> +	 */
> +	tbz	count, #3, 1f
> +	USER(10f, ldr tmp1, [src], #8)
> +	USER(10f, str tmp1, [dst], #8)
> +1:
> +	tbz	count, #2, 1f
> +	USER(10f, ldr tmp1w, [src], #4)
> +	USER(10f, str tmp1w, [dst], #4)
> +1:
> +	tbz	count, #1, 1f
> +	USER(10f, ldrh tmp1w, [src], #2)
> +	USER(10f, strh tmp1w, [dst], #2)
> +1:
> +	tbz	count, #0, 1f
> +	USER(10f, ldrb tmp1w, [src])
> +	USER(10f, strb tmp1w, [dst])
> +1:
> +	b	.Lsuccess
> +
> +.Lcpy_not_short:
> +	/*
> +	 * We don't much care about the alignment of DST, but we want SRC
> +	 * to be 128-bit (16 byte) aligned so that we don't cross cache line
> +	 * boundaries on both loads and stores.
> +	 */
> +	neg	tmp2, src
> +	ands	tmp2, tmp2, #15		/* Bytes to reach alignment.  */
> +	b.eq	2f
> +	sub	count, count, tmp2
> +	/*
> +	 * Copy more data than needed; it's faster than jumping
> +	 * around copying sub-Quadword quantities.  We know that
> +	 * it can't overrun.
> +	 */
> +	USER(11f, ldp A_l, A_h, [src])
> +	add	src, src, tmp2
> +	USER(11f, stp A_l, A_h, [dst])
> +	add	dst, dst, tmp2
> +	/* There may be less than 63 bytes to go now.  */
> +	cmp	count, #63
> +	b.le	.Ltail63
> +2:
> +	subs	count, count, #128
> +	b.ge	.Lcpy_body_large
> +	/*
> +	 * Less than 128 bytes to copy, so handle 64 here and then jump
> +	 * to the tail.
> +	 */
> +	USER(12f, ldp A_l, A_h, [src])
> +	USER(12f, ldp B_l, B_h, [src, #16])
> +	USER(12f, ldp C_l, C_h, [src, #32])
> +	USER(12f, ldp D_l, D_h, [src, #48])
> +	USER(12f, stp A_l, A_h, [dst])
> +	USER(12f, stp B_l, B_h, [dst, #16])
> +	USER(12f, stp C_l, C_h, [dst, #32])
> +	USER(12f, stp D_l, D_h, [dst, #48])
> +	tst	count, #0x3f
> +	add	src, src, #64
> +	add	dst, dst, #64
> +	b.ne	.Ltail63
> +	b	.Lsuccess
> +
> +	/*
> +	 * Critical loop.  Start at a new cache line boundary.  Assuming
> +	 * 64 bytes per line this ensures the entire loop is in one line.
> +	 */
> +	.p2align 6
> +.Lcpy_body_large:
> +	/* There are at least 128 bytes to copy.  */
> +	USER(12f, ldp A_l, A_h, [src, #0])
> +	sub	dst, dst, #16			/* Pre-bias.  */
> +	USER(13f, ldp B_l, B_h, [src, #16])
> +	USER(13f, ldp C_l, C_h, [src, #32])
> +	USER(13f, ldp D_l, D_h, [src, #48]!)	/* src += 64 - Pre-bias. */
> +1:
> +	USER(13f, stp A_l, A_h, [dst, #16])
> +	USER(13f, ldp A_l, A_h, [src, #16])
> +	USER(13f, stp B_l, B_h, [dst, #32])
> +	USER(13f, ldp B_l, B_h, [src, #32])
> +	USER(13f, stp C_l, C_h, [dst, #48])
> +	USER(13f, ldp C_l, C_h, [src, #48])
> +	USER(13f, stp D_l, D_h, [dst, #64]!)
> +	USER(13f, ldp D_l, D_h, [src, #64]!)
> +	subs	count, count, #64
> +	b.ge	1b
> +	USER(14f, stp A_l, A_h, [dst, #16])
> +	USER(14f, stp B_l, B_h, [dst, #32])
> +	USER(14f, stp C_l, C_h, [dst, #48])
> +	USER(14f, stp D_l, D_h, [dst, #64])
> +	add	src, src, #16
> +	add	dst, dst, #64 + 16
> +	tst	count, #0x3f
> +	b.ne	.Ltail63
> +.Lsuccess:
> +	/* Nothing left to copy */
> +	mov	x0, #0
> +	ret
> +
> +	.macro	copy_abort_table
> +8:
> +	/*
> +	 * Count bytes remain
> +	 * dst points to (dst + tmp1)
> +	 */
> +	mov	x0, count
> +	sub	dst, dst, tmp1
> +	b	.Lfinalize
> +9:
> +	/*
> +	 * 16 bytes remain
> +	 * dst is accurate
> +	 */
> +	mov	x0, #16
> +	b	.Lfinalize
> +10:
> +	/*
> +	 * count is accurate
> +	 * dst is accurate
> +	 */
> +	mov	x0, count
> +	b	.Lfinalize
> +11:
> +	/*
> +	 *(count + tmp2) bytes remain
> +	 * dst points to the start of the remaining bytes
> +	 */
> +	add	x0, count, tmp2
> +	b	.Lfinalize
> +12:
> +	/*
> +	 * (count + 128) bytes remain
> +	 * dst is accurate
> +	 */
> +	add	x0, count, #128
> +	b	.Lfinalize
> +13:
> +	/*
> +	 * (count + 128) bytes remain
> +	 * dst is pre-biased to (dst + 16)
> +	 */
> +	add	x0, count, #128
> +	sub	dst, dst, #16
> +	b	.Lfinalize
> +14:
> +	/*
> +	 * count is accurate
> +	 * dst is pre-biased to (dst + 16)
> +	 */
> +	mov	x0, count
> +	sub	dst, dst, #16
> +	/* fall-through */
> +.Lfinalize:
> +	/*
> +	 * Zeroize remaining destination-buffer
> +	 */
> +	mov	count, x0
> +20:
> +	/* Zero remaining buffer space */
> +	strb	wzr, [dst], #1
> +	subs	count, count, #1
> +	b.ne	20b
> +	ret
> +	.endm
> diff --git a/arch/arm64/lib/copy_to_user.S b/arch/arm64/lib/copy_to_user.S
> index a0aeeb9..08787b0 100644
> --- a/arch/arm64/lib/copy_to_user.S
> +++ b/arch/arm64/lib/copy_to_user.S
> @@ -15,7 +15,6 @@
>   */
>  
>  #include <linux/linkage.h>
> -#include <asm/assembler.h>
>  
>  /*
>   * Copy to user space from a kernel buffer (alignment handled by the hardware)
> @@ -28,34 +27,10 @@
>   *	x0 - bytes not copied
>   */
>  ENTRY(__copy_to_user)
> -	add	x4, x0, x2			// upper user buffer boundary
> -	subs	x2, x2, #8
> -	b.mi	2f
> -1:
> -	ldr	x3, [x1], #8
> -	subs	x2, x2, #8
> -USER(9f, str	x3, [x0], #8	)
> -	b.pl	1b
> -2:	adds	x2, x2, #4
> -	b.mi	3f
> -	ldr	w3, [x1], #4
> -	sub	x2, x2, #4
> -USER(9f, str	w3, [x0], #4	)
> -3:	adds	x2, x2, #2
> -	b.mi	4f
> -	ldrh	w3, [x1], #2
> -	sub	x2, x2, #2
> -USER(9f, strh	w3, [x0], #2	)
> -4:	adds	x2, x2, #1
> -	b.mi	5f
> -	ldrb	w3, [x1]
> -USER(9f, strb	w3, [x0]	)
> -5:	mov	x0, #0
> -	ret
> +#include "copy_template.S"
>  ENDPROC(__copy_to_user)
>  
>  	.section .fixup,"ax"
> -	.align	2
> -9:	sub	x0, x4, x0			// bytes not copied
> -	ret
> +	.align    2
> +	copy_abort_table

The exact same fixup code is being used for copy_to_user and
copy_from_user.

For the copy_from_user case we want to zero the rest of the kernel
destination buffer when we hit a pagefault reading from user space.

However, for the copy_to_user case we most definitely don't want to
write zeros in the destination buffer when we hit a pagefault writing
to user space! I get unhandled pagefaults here, when copy_to_user is
called:

   0xffffffc00073c638 <+8920>:  strb    wzr, [x6],#1
   0xffffffc00073c63c <+8924>:  subs    x2, x2, #0x1
   0xffffffc00073c640 <+8928>:  b.ne    0xffffffc00073c638 <__hyp_text_end+8920>
   0xffffffc00073c644 <+8932>:  ret

I would suggest re-working the fixup path and testing both fixup paths
thoroughly by placing the system under memory pressure and confirming
that they are both "hit".
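
One way to split the behaviour while still sharing the template would be to
parameterize the abort table so that only copy_from_user falls into the zeroing
loop. A rough sketch, not the posted code (the zero_rest argument and its use
are hypothetical):

	/* Hypothetical: make the zeroing opt-in per caller. */
	.macro	copy_abort_table zero_rest=0
	/* ... abort entries 8..14 unchanged, each leaving x0 = bytes not copied ... */
.Lfinalize:
	.if \zero_rest
	mov	count, x0
20:	strb	wzr, [dst], #1		// zero the remaining kernel buffer
	subs	count, count, #1	// (wanted for copy_from_user only)
	b.ne	20b
	.endif
	ret				// x0 = bytes not copied
	.endm

	/* copy_from_user.S would then invoke "copy_abort_table zero_rest=1",
	 * copy_to_user.S plain "copy_abort_table". */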

Dann, Tim,
Could you please revert this from the Ubuntu kernels, whilst it is
being fixed?

Cheers,
dann frazier Dec. 2, 2014, 10:59 p.m. UTC | #6
On Thu, Nov 27, 2014 at 4:43 AM, Steve Capper <steve.capper@linaro.org> wrote:
> On Fri, Aug 08, 2014 at 04:02:20PM -0700, Feng Kan wrote:
>> [...]
>
> I ran into some horrible crashes with this patch; it had been applied
> to the Ubuntu kernel I was running. Details below:
>
>> [...]
>
> The exact same fixup code is being used for copy_to_user and
> copy_from_user.
>
> For the copy_from_user case we want to zero the rest of the kernel
> destination buffer when we hit a pagefault reading from user space.
>
> However, for the copy_to_user case we most definitely don't want to
> write zeros in the destination buffer when we hit a pagefault writing
> to user space! I get unhandled pagefaults here, when copy_to_user is
> called:
>
>    0xffffffc00073c638 <+8920>:  strb    wzr, [x6],#1
>    0xffffffc00073c63c <+8924>:  subs    x2, x2, #0x1
>    0xffffffc00073c640 <+8928>:  b.ne    0xffffffc00073c638 <__hyp_text_end+8920>
>    0xffffffc00073c644 <+8932>:  ret
>
> I would suggest re-working the fixup path and testing both fixup paths
> thoroughly by placing the system under memory pressure and confirming
> that they are both "hit".
>
> Dann, Tim,
> Could you please revert this from the Ubuntu kernels, whilst it is
> being fixed?

Seems wise, thanks for the heads-up. My coworker has started the process here:
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1398596

 -dann

> Cheers,
> --
> Steve
Craig Magina Dec. 3, 2014, 8:01 p.m. UTC | #7
On Tue, Dec 2, 2014 at 5:59 PM, Dann Frazier <dann.frazier@canonical.com> wrote:
> On Thu, Nov 27, 2014 at 4:43 AM, Steve Capper <steve.capper@linaro.org> wrote:
>> On Fri, Aug 08, 2014 at 04:02:20PM -0700, Feng Kan wrote:
>>> [...]
>>
>> I ran into some horrible crashes with this patch; it had been applied
>> to the Ubuntu kernel I was running. Details below:
>>
>>> [...]
>>
>> The exact same fixup code is being used for copy_to_user and
>> copy_from_user.
>>
>> For the copy_from_user case we want to zero the rest of the kernel
>> destination buffer when we hit a pagefault reading from user space.
>>
>> However, for the copy_to_user case we most definitely don't want to
>> write zeros in the destination buffer when we hit a pagefault writing
>> to user space! I get unhandled pagefaults here, when copy_to_user is
>> called:
>>
>>    0xffffffc00073c638 <+8920>:  strb    wzr, [x6],#1
>>    0xffffffc00073c63c <+8924>:  subs    x2, x2, #0x1
>>    0xffffffc00073c640 <+8928>:  b.ne    0xffffffc00073c638 <__hyp_text_end+8920>
>>    0xffffffc00073c644 <+8932>:  ret
>>
>> I would suggest re-working the fixup path and testing both fixup paths
>> thoroughly by placing the system under memory pressure and confirming
>> that they are both "hit".
>>
>> Dann, Tim,
>> Could you please revert this from the Ubuntu kernels, whilst it is
>> being fixed?
>
> Seems wise, thanks for the heads-up. My coworker has started the process here:
>   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1398596
>
>  -dann
>
>> Cheers,
>> --
>> Steve
>

Could you provide the steps you used to reproduce this issue? I have
already submitted an SRU to revert this patch, but would like to have
the method you used to reproduce it for testing. Thanks for reporting
this.
Steve Capper Dec. 4, 2014, 12:27 p.m. UTC | #8
On 3 December 2014 at 20:01, Craig Magina <craig.magina@canonical.com> wrote:
[...]
>
> Could you provide the steps you used to reproduce this issue? I have
> already submitted an SRU to revert this patch, but would like to have
> the method you used to reproduce it for testing. Thanks for reporting
> this.
>
> --
> Craig Magina

Hi Craig,
It's not easy to reproduce as the problem occurs when the kernel is
copying memory to userspace and experiences a page fault.

I have come across it whilst running "perf record -a" on a system that
is swapping.

Also, building source code under memory pressure (i.e. kernel compile
just after the system has swapped out pages) gave me this problem
too.

Cheers,
--
Steve
Philipp Tomsich Dec. 4, 2014, 1:56 p.m. UTC | #9
Craig/Steve/Dann,

trying to understand what went wrong and when, I just went back to my original mail (May 1st, 2013) providing the original optimized functions for an iperf run at APM… this had two separate error-handling paths for _to_user and _from_user, where the _to_user case didn’t zeroize.

I can’t quite track when these two paths were unified (or by who)… but I can confirm that only the _from_user path was designed to zeroize the buffer on error.
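
For reference, the __copy_to_user fixup removed by this patch already showed the
intended _to_user behaviour: report the number of bytes not copied and return,
without touching the (user-space) destination:

	.section .fixup,"ax"
	.align	2
9:	sub	x0, x4, x0		// bytes not copied (x4 = upper user buffer boundary)
	ret
	.previous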

@Craig: I’ll also forward you the original files, which I had earlier sent to Dann on Jul 3rd 2014. This should provide a point-of-reference for resolving this.

Best,
Philipp.

> On 04 Dec 2014, at 13:27, Steve Capper <steve.capper@linaro.org> wrote:
> 
> On 3 December 2014 at 20:01, Craig Magina <craig.magina@canonical.com> wrote:
> [...]
>> 
>> Could you provide the steps you used to reproduce this issue? I have
>> already submitted an SRU to revert this patch, but would like to have
>> the method you used to reproduce it for testing. Thanks for reporting
>> this.
>> 
>> --
>> Craig Magina
> 
> Hi Craig,
> It's not easy to reproduce as the problem occurs when the kernel is
> copying memory to userspace and experiences a page fault.
> 
> I have come across it whilst running "perf record -a" on a system that
> is swapping.
> 
> Also, building source code under memory pressure (i.e. kernel compile
> just after the system has swapped out pages) gave me this problem
> too.
> 
> Cheers,
> --
> Steve
dann frazier Dec. 9, 2014, 10:18 p.m. UTC | #10
On Thu, Dec 4, 2014 at 6:56 AM, Dr. Philipp Tomsich
<philipp.tomsich@theobroma-systems.com> wrote:
> Craig/Steve/Dann,
>
> trying to understand what went wrong and when, I just went back to my original mail (May 1st, 2013) providing the original optimized functions for an iperf run at APM… this had two separate error-handling paths for _to_user and _from_user, where the _to_user case didn’t zeroize.
>
> I can’t quite track when these two paths were unified (or by who)… but I can confirm that only the _from_user path was designed to zeroize the buffer on error.

This is the source we pulled (previous version had some issues w/
licensing iirc):
  http://www.spinics.net/lists/arm-kernel/msg353650.html

 -dann

> @Craig: I’ll also forward you the original files, which I had earlier sent to Dann on Jul 3rd 2014. This should provide a point-of-reference for resolving this.
>
> Best,
> Philipp.
>
>> On 04 Dec 2014, at 13:27, Steve Capper <steve.capper@linaro.org> wrote:
>>
>> On 3 December 2014 at 20:01, Craig Magina <craig.magina@canonical.com> wrote:
>> [...]
>>>
>>> Could you provide the steps you used to reproduce this issue? I have
>>> already submitted an SRU to revert this patch, but would like to have
>>> the method you used to reproduce it for testing. Thanks for reporting
>>> this.
>>>
>>> --
>>> Craig Magina
>>
>> Hi Craig,
>> It's not easy to reproduce as the problem occurs when the kernel is
>> copying memory to userspace and experiences a page fault.
>>
>> I have come across it whilst running "perf record -a" on a system that
>> is swapping.
>>
>> Also, building source code under memory pressure (i.e. kernel compile
>> just after the system has swapped out pages) gave me this problem
>> too.
>>
>> Cheers,
>> --
>> Steve
>

Patch

diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S
index 5e27add..c4c5187 100644
--- a/arch/arm64/lib/copy_from_user.S
+++ b/arch/arm64/lib/copy_from_user.S
@@ -15,7 +15,6 @@ 
  */
 
 #include <linux/linkage.h>
-#include <asm/assembler.h>
 
 /*
  * Copy from user space to a kernel buffer (alignment handled by the hardware)
@@ -28,39 +27,10 @@ 
  *	x0 - bytes not copied
  */
 ENTRY(__copy_from_user)
-	add	x4, x1, x2			// upper user buffer boundary
-	subs	x2, x2, #8
-	b.mi	2f
-1:
-USER(9f, ldr	x3, [x1], #8	)
-	subs	x2, x2, #8
-	str	x3, [x0], #8
-	b.pl	1b
-2:	adds	x2, x2, #4
-	b.mi	3f
-USER(9f, ldr	w3, [x1], #4	)
-	sub	x2, x2, #4
-	str	w3, [x0], #4
-3:	adds	x2, x2, #2
-	b.mi	4f
-USER(9f, ldrh	w3, [x1], #2	)
-	sub	x2, x2, #2
-	strh	w3, [x0], #2
-4:	adds	x2, x2, #1
-	b.mi	5f
-USER(9f, ldrb	w3, [x1]	)
-	strb	w3, [x0]
-5:	mov	x0, #0
-	ret
+#include "copy_template.S"
 ENDPROC(__copy_from_user)
 
 	.section .fixup,"ax"
-	.align	2
-9:	sub	x2, x4, x1
-	mov	x3, x2
-10:	strb	wzr, [x0], #1			// zero remaining buffer space
-	subs	x3, x3, #1
-	b.ne	10b
-	mov	x0, x2				// bytes not copied
-	ret
+	.align    2
+	copy_abort_table
 	.previous
diff --git a/arch/arm64/lib/copy_template.S b/arch/arm64/lib/copy_template.S
new file mode 100644
index 0000000..f2c7003
--- /dev/null
+++ b/arch/arm64/lib/copy_template.S
@@ -0,0 +1,278 @@ 
+/*
+ * Copyright (c) 2013, Applied Micro Circuits Corporation
+ * Copyright (c) 2012-2013, Linaro Limited
+ *
+ * Author: Feng Kan <fkan@apm.com>
+ * Author: Philipp Tomsich <philipp.tomsich@theobroma-systems.com>
+ *
+ * The code is adopted from the memcpy routine by Linaro Limited.
+ *
+ * This file is free software: you may copy, redistribute and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation, either version 2 of the License, or (at your
+ * option) any later version.
+ *
+ * This file is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ *
+ * This file incorporates work covered by the following copyright and
+ * permission notice:
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *      1 Redistributions of source code must retain the above copyright
+ *        notice, this list of conditions and the following disclaimer.
+ *      2 Redistributions in binary form must reproduce the above copyright
+ *        notice, this list of conditions and the following disclaimer in the
+ *        documentation and/or other materials provided with the distribution.
+ *      3 Neither the name of the Linaro nor the
+ *        names of its contributors may be used to endorse or promote products
+ *        derived from this software without specific prior written permission.
+ *
+ *  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#include <asm/assembler.h>
+
+dstin	.req x0
+src	.req x1
+count	.req x2
+tmp1	.req x3
+tmp1w	.req w3
+tmp2	.req x4
+tmp2w	.req w4
+tmp3	.req x5
+tmp3w	.req w5
+dst	.req x6
+
+A_l	.req x7
+A_h	.req x8
+B_l	.req x9
+B_h	.req x10
+C_l	.req x11
+C_h	.req x12
+D_l	.req x13
+D_h	.req x14
+
+	mov	dst, dstin
+	cmp	count, #64
+	b.ge	.Lcpy_not_short
+	cmp	count, #15
+	b.le	.Ltail15tiny
+
+	/*
+	 * Deal with small copies quickly by dropping straight into the
+	 * exit block.
+	 */
+.Ltail63:
+	/*
+	 * Copy up to 48 bytes of data.  At this point we only need the
+	 * bottom 6 bits of count to be accurate.
+	 */
+	ands	tmp1, count, #0x30
+	b.eq	.Ltail15
+	add	dst, dst, tmp1
+	add	src, src, tmp1
+	cmp	tmp1w, #0x20
+	b.eq	1f
+	b.lt	2f
+	USER(8f, ldp A_l, A_h, [src, #-48])
+	USER(8f, stp A_l, A_h, [dst, #-48])
+1:
+	USER(8f, ldp A_l, A_h, [src, #-32])
+	USER(8f, stp A_l, A_h, [dst, #-32])
+2:
+	USER(8f, ldp A_l, A_h, [src, #-16])
+	USER(8f, stp A_l, A_h, [dst, #-16])
+
+.Ltail15:
+	ands	count, count, #15
+	beq	1f
+	add	src, src, count
+	USER(9f, ldp A_l, A_h, [src, #-16])
+	add	dst, dst, count
+	USER(9f, stp A_l, A_h, [dst, #-16])
+1:
+	b	.Lsuccess
+
+.Ltail15tiny:
+	/*
+	 * Copy up to 15 bytes of data.  Does not assume additional data
+	 * being copied.
+	 */
+	tbz	count, #3, 1f
+	USER(10f, ldr tmp1, [src], #8)
+	USER(10f, str tmp1, [dst], #8)
+1:
+	tbz	count, #2, 1f
+	USER(10f, ldr tmp1w, [src], #4)
+	USER(10f, str tmp1w, [dst], #4)
+1:
+	tbz	count, #1, 1f
+	USER(10f, ldrh tmp1w, [src], #2)
+	USER(10f, strh tmp1w, [dst], #2)
+1:
+	tbz	count, #0, 1f
+	USER(10f, ldrb tmp1w, [src])
+	USER(10f, strb tmp1w, [dst])
+1:
+	b	.Lsuccess
+
+.Lcpy_not_short:
+	/*
+	 * We don't much care about the alignment of DST, but we want SRC
+	 * to be 128-bit (16 byte) aligned so that we don't cross cache line
+	 * boundaries on both loads and stores.
+	 */
+	neg	tmp2, src
+	ands	tmp2, tmp2, #15		/* Bytes to reach alignment.  */
+	b.eq	2f
+	sub	count, count, tmp2
+	/*
+	 * Copy more data than needed; it's faster than jumping
+	 * around copying sub-Quadword quantities.  We know that
+	 * it can't overrun.
+	 */
+	USER(11f, ldp A_l, A_h, [src])
+	add	src, src, tmp2
+	USER(11f, stp A_l, A_h, [dst])
+	add	dst, dst, tmp2
+	/* There may be less than 63 bytes to go now.  */
+	cmp	count, #63
+	b.le	.Ltail63
+2:
+	subs	count, count, #128
+	b.ge	.Lcpy_body_large
+	/*
+	 * Less than 128 bytes to copy, so handle 64 here and then jump
+	 * to the tail.
+	 */
+	USER(12f, ldp A_l, A_h, [src])
+	USER(12f, ldp B_l, B_h, [src, #16])
+	USER(12f, ldp C_l, C_h, [src, #32])
+	USER(12f, ldp D_l, D_h, [src, #48])
+	USER(12f, stp A_l, A_h, [dst])
+	USER(12f, stp B_l, B_h, [dst, #16])
+	USER(12f, stp C_l, C_h, [dst, #32])
+	USER(12f, stp D_l, D_h, [dst, #48])
+	tst	count, #0x3f
+	add	src, src, #64
+	add	dst, dst, #64
+	b.ne	.Ltail63
+	b	.Lsuccess
+
+	/*
+	 * Critical loop.  Start at a new cache line boundary.  Assuming
+	 * 64 bytes per line this ensures the entire loop is in one line.
+	 */
+	.p2align 6
+.Lcpy_body_large:
+	/* There are at least 128 bytes to copy.  */
+	USER(12f, ldp A_l, A_h, [src, #0])
+	sub	dst, dst, #16			/* Pre-bias.  */
+	USER(13f, ldp B_l, B_h, [src, #16])
+	USER(13f, ldp C_l, C_h, [src, #32])
+	USER(13f, ldp D_l, D_h, [src, #48]!)	/* src += 64 - Pre-bias. */
+1:
+	USER(13f, stp A_l, A_h, [dst, #16])
+	USER(13f, ldp A_l, A_h, [src, #16])
+	USER(13f, stp B_l, B_h, [dst, #32])
+	USER(13f, ldp B_l, B_h, [src, #32])
+	USER(13f, stp C_l, C_h, [dst, #48])
+	USER(13f, ldp C_l, C_h, [src, #48])
+	USER(13f, stp D_l, D_h, [dst, #64]!)
+	USER(13f, ldp D_l, D_h, [src, #64]!)
+	subs	count, count, #64
+	b.ge	1b
+	USER(14f, stp A_l, A_h, [dst, #16])
+	USER(14f, stp B_l, B_h, [dst, #32])
+	USER(14f, stp C_l, C_h, [dst, #48])
+	USER(14f, stp D_l, D_h, [dst, #64])
+	add	src, src, #16
+	add	dst, dst, #64 + 16
+	tst	count, #0x3f
+	b.ne	.Ltail63
+.Lsuccess:
+	/* Nothing left to copy */
+	mov	x0, #0
+	ret
+
+	.macro	copy_abort_table
+8:
+	/*
+	 * Count bytes remain
+	 * dst points to (dst + tmp1)
+	 */
+	mov	x0, count
+	sub	dst, dst, tmp1
+	b	.Lfinalize
+9:
+	/*
+	 * 16 bytes remain
+	 * dst is accurate
+	 */
+	mov	x0, #16
+	b	.Lfinalize
+10:
+	/*
+	 * count is accurate
+	 * dst is accurate
+	 */
+	mov	x0, count
+	b	.Lfinalize
+11:
+	/*
+	 *(count + tmp2) bytes remain
+	 * dst points to the start of the remaining bytes
+	 */
+	add	x0, count, tmp2
+	b	.Lfinalize
+12:
+	/*
+	 * (count + 128) bytes remain
+	 * dst is accurate
+	 */
+	add	x0, count, #128
+	b	.Lfinalize
+13:
+	/*
+	 * (count + 128) bytes remain
+	 * dst is pre-biased to (dst + 16)
+	 */
+	add	x0, count, #128
+	sub	dst, dst, #16
+	b	.Lfinalize
+14:
+	/*
+	 * count is accurate
+	 * dst is pre-biased to (dst + 16)
+	 */
+	mov	x0, count
+	sub	dst, dst, #16
+	/* fall-through */
+.Lfinalize:
+	/*
+	 * Zeroize remaining destination-buffer
+	 */
+	mov	count, x0
+20:
+	/* Zero remaining buffer space */
+	strb	wzr, [dst], #1
+	subs	count, count, #1
+	b.ne	20b
+	ret
+	.endm
diff --git a/arch/arm64/lib/copy_to_user.S b/arch/arm64/lib/copy_to_user.S
index a0aeeb9..08787b0 100644
--- a/arch/arm64/lib/copy_to_user.S
+++ b/arch/arm64/lib/copy_to_user.S
@@ -15,7 +15,6 @@ 
  */
 
 #include <linux/linkage.h>
-#include <asm/assembler.h>
 
 /*
  * Copy to user space from a kernel buffer (alignment handled by the hardware)
@@ -28,34 +27,10 @@ 
  *	x0 - bytes not copied
  */
 ENTRY(__copy_to_user)
-	add	x4, x0, x2			// upper user buffer boundary
-	subs	x2, x2, #8
-	b.mi	2f
-1:
-	ldr	x3, [x1], #8
-	subs	x2, x2, #8
-USER(9f, str	x3, [x0], #8	)
-	b.pl	1b
-2:	adds	x2, x2, #4
-	b.mi	3f
-	ldr	w3, [x1], #4
-	sub	x2, x2, #4
-USER(9f, str	w3, [x0], #4	)
-3:	adds	x2, x2, #2
-	b.mi	4f
-	ldrh	w3, [x1], #2
-	sub	x2, x2, #2
-USER(9f, strh	w3, [x0], #2	)
-4:	adds	x2, x2, #1
-	b.mi	5f
-	ldrb	w3, [x1]
-USER(9f, strb	w3, [x0]	)
-5:	mov	x0, #0
-	ret
+#include "copy_template.S"
 ENDPROC(__copy_to_user)
 
 	.section .fixup,"ax"
-	.align	2
-9:	sub	x0, x4, x0			// bytes not copied
-	ret
+	.align    2
+	copy_abort_table
 	.previous