Message ID | 1407538940-9167-1-git-send-email-fkan@apm.com (mailing list archive) |
---|---|
State | New, archived |
Hi Feng,

> +
> +.Lcpy_not_short:
> +	/*
> +	 * We don't much care about the alignment of DST, but we want SRC
> +	 * to be 128-bit (16 byte) aligned so that we don't cross cache line
> +	 * boundaries on both loads and stores.
> +	 */

Could you please tell why destination alignment is not an issue? Is this a
generic implementation that you are referring to, or is it specific to your
platform?
On Sun, Aug 10, 2014 at 8:01 PM, Radha Mohan <mohun106@gmail.com> wrote:
> Hi Feng,
>
>> +.Lcpy_not_short:
>> +	/*
>> +	 * We don't much care about the alignment of DST, but we want SRC
>> +	 * to be 128-bit (16 byte) aligned so that we don't cross cache line
>> +	 * boundaries on both loads and stores.
>> +	 */
>
> Could you please tell why destination alignment is not an issue? Is
> this a generic implementation that you are referring to, or is it
> specific to your platform?

This is based on the Linaro cortex-strings optimization routines:

https://launchpad.net/cortex-strings

Zhichang submitted something similar for memcpy from the same
optimization work.

Sorry, resending in text mode.
Hi Feng,

On 2014-08-12 02:05, Feng Kan wrote:
> On Sun, Aug 10, 2014 at 8:01 PM, Radha Mohan <mohun106@gmail.com> wrote:
>> Could you please tell why destination alignment is not an issue? Is
>> this a generic implementation that you are referring to, or is it
>> specific to your platform?
>
> This is based on the Linaro cortex-strings optimization routines:
>
> https://launchpad.net/cortex-strings
>
> Zhichang submitted something similar for memcpy from the same
> optimization work.

If both dst and src are misaligned and their alignment offsets differ, I
haven't found a better way to handle it. Fortunately, ARMv8 supports
unaligned memory access.

At the beginning of my patch work I also thought it might be better to keep
all loads and stores aligned. I wrote the code just like the ARMv7 memcpy:
first load the data from SRC, buffer it in several registers and combine it
into new 16-byte words, then store them to the aligned DST. But the
performance was a bit worse.

~Zhichang
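As an aside, the head-alignment step being discussed can be sketched in C
roughly as follows. This is only an illustration of the idea, not the kernel
code; the function name is made up, and the n >= 16 assumption mirrors what
the assembly has already checked before reaching .Lcpy_not_short.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Illustrative C model of the ".Lcpy_not_short" head handling, assuming
 * n >= 16 as the assembly guarantees at that point.  Not kernel code.
 */
static void *copy_align_src16(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	/* Bytes needed to bring src up to a 16-byte boundary:
	 * the C equivalent of "neg tmp2, src; ands tmp2, tmp2, #15". */
	size_t head = (size_t)(-(uintptr_t)s & 15);

	if (head) {
		/* Copy more than needed; it cannot overrun since n >= 16. */
		memcpy(d, s, 16);
		d += head;
		s += head;	/* src is now 16-byte aligned */
		n -= head;
	}
	/* The main loop would now issue only aligned loads from s, while
	 * the stores through d may remain unaligned. */
	memcpy(d, s, n);
	return dst;
}
```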
Feng & Zhichang,

On 13 Aug 2014, at 05:13, zhichang.yuan <zhichang.yuan@linaro.org> wrote:

> If both dst and src are misaligned and their alignment offsets differ, I
> haven't found a better way to handle it. Fortunately, ARMv8 supports
> unaligned memory access.
>
> At the beginning of my patch work I also thought it might be better to keep
> all loads and stores aligned. I wrote the code just like the ARMv7 memcpy:
> first load the data from SRC, buffer it in several registers and combine it
> into new 16-byte words, then store them to the aligned DST. But the
> performance was a bit worse.

When looking at the underlying effects in the execution pipeline, the
store operations are non-critical for throughput; we need to optimize for
throughput on the load operations instead. This is because the stores have
no dependent operations and the store pipeline will absorb any lingering
effects of the misalignment (given that mechanisms like write-allocate make
cache effects on stores more difficult to predict, I'm glad we don't have
to go into too much detail on those); there are no throughput/bandwidth
limits on the store pipeline that we could even theoretically hit with such
a loop.

The load operations are much more critical in the context of what we try
to achieve: since our progress through the loop depends on the loads
returning their results before we can issue the associated stores, we need
to ensure optimal and deterministic throughput on them. As a misaligned
load is likely to carry a penalty (e.g. on XGene it will typically carry a
small penalty when crossing a cache line, especially if the second cache
line isn't cached yet), we need to avoid misaligned loads.

If we tried to buffer data and then perform aligned stores, we would only
introduce additional instructions and latency into our critical loop. At
the same time, given that misaligned stores are essentially free as
described above, there is no benefit to be gained from the extra work.

I hope this explains the observed behaviour somewhat better.

Cheers,
Philipp.
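For anyone who wants to see the effect on their own core, a rough userspace
sketch along these lines can compare the two cases. The buffer size,
iteration count, offsets and the fixed 16-byte memcpy are arbitrary choices
made for illustration, not taken from the thread, and the measured gap will
vary from one microarchitecture to another.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdalign.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define BUF   (1 << 20)
#define ITERS 2000

static alignas(16) unsigned char a[BUF + 64];
static alignas(16) unsigned char b[BUF + 64];

/* 16 bytes per step, roughly analogous to one ldp/stp pair. */
static void copy16(unsigned char *d, const unsigned char *s, size_t n)
{
	for (size_t i = 0; i + 16 <= n; i += 16)
		memcpy(d + i, s + i, 16);
}

static double bench(size_t src_off, size_t dst_off)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < ITERS; i++)
		copy16(b + dst_off, a + src_off, BUF);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
	printf("aligned src, misaligned dst: %.3f s\n", bench(0, 7));
	printf("misaligned src, aligned dst: %.3f s\n", bench(7, 0));
	return 0;
}
```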
On Fri, Aug 08, 2014 at 04:02:20PM -0700, Feng Kan wrote:
> Using the glibc cortex-strings work authored by Linaro as a base to
> create new copy to/from user kernel routines.
>
> Iperf performance increase:
>              -l (size)    1 core result
>   Optimized  64B          44-51Mb/s
>              1500B        4.9Gb/s
>              30000B       16.2Gb/s
>   Original   64B          34-50.7Mb/s
>              1500B        4.7Gb/s
>              30000B       14.5Gb/s

I ran into some horrible crashes with this patch; it had been applied
to the Ubuntu kernel I was running. Details below:

> Signed-off-by: Feng Kan <fkan@apm.com>
> ---
>  arch/arm64/lib/copy_from_user.S |  36 +-----
>  arch/arm64/lib/copy_template.S  | 278 ++++++++++++++++++++++++++++++++++++++++
>  arch/arm64/lib/copy_to_user.S   |  31 +----
>  3 files changed, 284 insertions(+), 61 deletions(-)
>  create mode 100644 arch/arm64/lib/copy_template.S

[...]

> +.Lfinalize:
> +	/*
> +	 * Zeroize remaining destination-buffer
> +	 */
> +	mov	count, x0
> +20:
> +	/* Zero remaining buffer space */
> +	strb	wzr, [dst], #1
> +	subs	count, count, #1
> +	b.ne	20b
> +	ret
> +	.endm

[...]

> ENTRY(__copy_to_user)
[...]
> +#include "copy_template.S"
> ENDPROC(__copy_to_user)
>
> .section .fixup,"ax"
> -	.align	2
> -9:	sub	x0, x4, x0			// bytes not copied
> -	ret
> +	.align	2
> +	copy_abort_table

The exact same fixup code is being used for copy_to_user and
copy_from_user.

For the copy_from_user case we want to zero the rest of the kernel
destination buffer when we hit a pagefault reading from user space.

However, for the copy_to_user case we most definitely don't want to
write zeros in the destination buffer when we hit a pagefault writing
to user space! I get unhandled pagefaults here, when copy_to_user is
called:

 0xffffffc00073c638 <+8920>:	strb	wzr, [x6],#1
 0xffffffc00073c63c <+8924>:	subs	x2, x2, #0x1
 0xffffffc00073c640 <+8928>:	b.ne	0xffffffc00073c638 <__hyp_text_end+8920>
 0xffffffc00073c644 <+8932>:	ret

I would suggest re-working the fixup path and testing both fixup paths
thoroughly by placing the system under memory pressure and confirming
that they are both "hit".

Dann, Tim,
Could you please revert this from the Ubuntu kernels, whilst it is
being fixed?

Cheers,
--
Steve
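To spell out the contract difference Steve describes, here is a minimal C
model (illustrative only, not kernel code): on a fault, the copy_from_user
fixup may zero the untouched tail of the kernel buffer it owns, while the
copy_to_user fixup must not touch the user buffer at all, since that buffer
is exactly where the fault just happened and where the shared zeroing loop
faults again.

```c
#include <stddef.h>
#include <string.h>

/*
 * Minimal C model of the two fixup contracts (illustrative, not kernel code).
 * "copied" is the number of bytes transferred before the fault, "n" the
 * total requested; both fixups must return the number of bytes not copied.
 */

/* Fault while reading user memory: the destination is a kernel buffer we
 * own, so zero-filling its tail is safe and expected. */
static size_t fixup_copy_from_user(void *kernel_dst, size_t copied, size_t n)
{
	memset((char *)kernel_dst + copied, 0, n - copied);
	return n - copied;
}

/* Fault while writing user memory: the destination is the very user buffer
 * that just faulted, so the fixup must not write to it at all. */
static size_t fixup_copy_to_user(size_t copied, size_t n)
{
	return n - copied;
}
```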
On Thu, Nov 27, 2014 at 4:43 AM, Steve Capper <steve.capper@linaro.org> wrote:
> On Fri, Aug 08, 2014 at 04:02:20PM -0700, Feng Kan wrote:
[...]
> The exact same fixup code is being used for copy_to_user and
> copy_from_user.
>
> For the copy_from_user case we want to zero the rest of the kernel
> destination buffer when we hit a pagefault reading from user space.
>
> However, for the copy_to_user case we most definitely don't want to
> write zeros in the destination buffer when we hit a pagefault writing
> to user space!
> I get unhandled pagefaults here, when copy_to_user is called:
>
>  0xffffffc00073c638 <+8920>:	strb	wzr, [x6],#1
>  0xffffffc00073c63c <+8924>:	subs	x2, x2, #0x1
>  0xffffffc00073c640 <+8928>:	b.ne	0xffffffc00073c638 <__hyp_text_end+8920>
>  0xffffffc00073c644 <+8932>:	ret
>
> I would suggest re-working the fixup path and testing both fixup paths
> thoroughly by placing the system under memory pressure and confirming
> that they are both "hit".
>
> Dann, Tim,
> Could you please revert this from the Ubuntu kernels, whilst it is
> being fixed?

Seems wise, thanks for the heads-up. My coworker has started the process here:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1398596

 -dann

> Cheers,
> --
> Steve
On Tue, Dec 2, 2014 at 5:59 PM, Dann Frazier <dann.frazier@canonical.com> wrote:
> On Thu, Nov 27, 2014 at 4:43 AM, Steve Capper <steve.capper@linaro.org> wrote:
[...]
>> The exact same fixup code is being used for copy_to_user and
>> copy_from_user.
>>
>> For the copy_from_user case we want to zero the rest of the kernel
>> destination buffer when we hit a pagefault reading from user space.
>>
>> However, for the copy_to_user case we most definitely don't want to
>> write zeros in the destination buffer when we hit a pagefault writing
>> to user space!
>> I get unhandled pagefaults here, when copy_to_user is called:
>>
>>  0xffffffc00073c638 <+8920>:	strb	wzr, [x6],#1
>>  0xffffffc00073c63c <+8924>:	subs	x2, x2, #0x1
>>  0xffffffc00073c640 <+8928>:	b.ne	0xffffffc00073c638 <__hyp_text_end+8920>
>>  0xffffffc00073c644 <+8932>:	ret
>>
>> I would suggest re-working the fixup path and testing both fixup paths
>> thoroughly by placing the system under memory pressure and confirming
>> that they are both "hit".
>>
>> Dann, Tim,
>> Could you please revert this from the Ubuntu kernels, whilst it is
>> being fixed?
>
> Seems wise, thanks for the heads-up. My coworker has started the process here:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1398596
>
>  -dann

Could you provide the steps you used to reproduce this issue? I have
already submitted an SRU to revert this patch, but would like to have
the method you used to reproduce it for testing. Thanks for reporting
this.

--
Craig Magina
On 3 December 2014 at 20:01, Craig Magina <craig.magina@canonical.com> wrote:
[...]
>
> Could you provide the steps you used to reproduce this issue? I have
> already submitted an SRU to revert this patch, but would like to have
> the method you used to reproduce it for testing. Thanks for reporting
> this.
>
> --
> Craig Magina

Hi Craig,
It's not easy to reproduce, as the problem occurs when the kernel is
copying memory to userspace and experiences a page fault.

I have come across it whilst running "perf record -a" on a system that
is swapping.

Also, building source code under memory pressure (i.e. a kernel compile
just after the system has swapped out pages) gave me this problem too.

Cheers,
--
Steve
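A more deterministic reproducer (a suggestion, not something tested in this
thread) is to hand a syscall a destination buffer whose tail is unmapped, so
that copy_to_user() is guaranteed to fault partway through. The file chosen
and the offsets below are arbitrary; on a correct kernel the read() returns a
short count or -1/EFAULT rather than triggering an oops.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);

	/* Two adjacent pages; make the second one inaccessible so that the
	 * kernel's copy_to_user() faults partway through the buffer. */
	char *buf = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;
	mprotect(buf + page, page, PROT_NONE);

	/* /proc/self/status is served via seq_file, i.e. copy_to_user(). */
	int fd = open("/proc/self/status", O_RDONLY);
	if (fd < 0)
		return 1;

	/* Start 64 bytes before the protected page and ask for more, so the
	 * copy is forced to cross into the inaccessible page. */
	ssize_t n = read(fd, buf + page - 64, page);

	/* Expected on a correct kernel: a short read or -1 with EFAULT,
	 * no oops, and no zero bytes written into the user buffer. */
	printf("read returned %zd\n", n);
	return 0;
}
```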
Craig/Steve/Dann,

trying to understand what went wrong and when, I just went back to my
original mail (May 1st, 2013) providing the original optimized functions
for an iperf run at APM… this had two separate error-handling paths for
_to_user and _from_user, where the _to_user case didn't zeroize.

I can't quite track when these two paths were unified (or by whom)… but I
can confirm that only the _from_user path was designed to zeroize the
buffer on error.

@Craig: I'll also forward you the original files, which I had earlier sent
to Dann on Jul 3rd 2014. This should provide a point of reference for
resolving this.

Best,
Philipp.

> On 04 Dec 2014, at 13:27, Steve Capper <steve.capper@linaro.org> wrote:
>
> On 3 December 2014 at 20:01, Craig Magina <craig.magina@canonical.com> wrote:
> [...]
>>
>> Could you provide the steps you used to reproduce this issue? I have
>> already submitted an SRU to revert this patch, but would like to have
>> the method you used to reproduce it for testing. Thanks for reporting
>> this.
>
> Hi Craig,
> It's not easy to reproduce, as the problem occurs when the kernel is
> copying memory to userspace and experiences a page fault.
>
> I have come across it whilst running "perf record -a" on a system that
> is swapping.
>
> Also, building source code under memory pressure (i.e. a kernel compile
> just after the system has swapped out pages) gave me this problem too.
>
> Cheers,
> --
> Steve
On Thu, Dec 4, 2014 at 6:56 AM, Dr. Philipp Tomsich
<philipp.tomsich@theobroma-systems.com> wrote:
> Craig/Steve/Dann,
>
> trying to understand what went wrong and when, I just went back to my
> original mail (May 1st, 2013) providing the original optimized functions
> for an iperf run at APM… this had two separate error-handling paths for
> _to_user and _from_user, where the _to_user case didn't zeroize.
>
> I can't quite track when these two paths were unified (or by whom)… but I
> can confirm that only the _from_user path was designed to zeroize the
> buffer on error.

This is the source we pulled (previous version had some issues w/
licensing iirc):
http://www.spinics.net/lists/arm-kernel/msg353650.html

 -dann

> @Craig: I'll also forward you the original files, which I had earlier sent
> to Dann on Jul 3rd 2014. This should provide a point of reference for
> resolving this.
>
> Best,
> Philipp.
[...]
diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S index 5e27add..c4c5187 100644 --- a/arch/arm64/lib/copy_from_user.S +++ b/arch/arm64/lib/copy_from_user.S @@ -15,7 +15,6 @@ */ #include <linux/linkage.h> -#include <asm/assembler.h> /* * Copy from user space to a kernel buffer (alignment handled by the hardware) @@ -28,39 +27,10 @@ * x0 - bytes not copied */ ENTRY(__copy_from_user) - add x4, x1, x2 // upper user buffer boundary - subs x2, x2, #8 - b.mi 2f -1: -USER(9f, ldr x3, [x1], #8 ) - subs x2, x2, #8 - str x3, [x0], #8 - b.pl 1b -2: adds x2, x2, #4 - b.mi 3f -USER(9f, ldr w3, [x1], #4 ) - sub x2, x2, #4 - str w3, [x0], #4 -3: adds x2, x2, #2 - b.mi 4f -USER(9f, ldrh w3, [x1], #2 ) - sub x2, x2, #2 - strh w3, [x0], #2 -4: adds x2, x2, #1 - b.mi 5f -USER(9f, ldrb w3, [x1] ) - strb w3, [x0] -5: mov x0, #0 - ret +#include "copy_template.S" ENDPROC(__copy_from_user) .section .fixup,"ax" - .align 2 -9: sub x2, x4, x1 - mov x3, x2 -10: strb wzr, [x0], #1 // zero remaining buffer space - subs x3, x3, #1 - b.ne 10b - mov x0, x2 // bytes not copied - ret + .align 2 + copy_abort_table .previous diff --git a/arch/arm64/lib/copy_template.S b/arch/arm64/lib/copy_template.S new file mode 100644 index 0000000..f2c7003 --- /dev/null +++ b/arch/arm64/lib/copy_template.S @@ -0,0 +1,278 @@ +/* + * Copyright (c) 2013, Applied Micro Circuits Corporation + * Copyright (c) 2012-2013, Linaro Limited + * + * Author: Feng Kan <fkan@apm.com> + * Author: Philipp Tomsich <philipp.tomsich@theobroma-systems.com> + * + * The code is adopted from the memcpy routine by Linaro Limited. + * + * This file is free software: you may copy, redistribute and/or modify it + * under the terms of the GNU General Public License as published by the + * Free Software Foundation, either version 2 of the License, or (at your + * option) any later version. + * + * This file is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see <http://www.gnu.org/licenses/>. + * + * This file incorporates work covered by the following copyright and + * permission notice: + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * 1 Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2 Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3 Neither the name of the Linaro nor the + * names of its contributors may be used to endorse or promote products + * derived from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT + * HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ +#include <asm/assembler.h> + +dstin .req x0 +src .req x1 +count .req x2 +tmp1 .req x3 +tmp1w .req w3 +tmp2 .req x4 +tmp2w .req w4 +tmp3 .req x5 +tmp3w .req w5 +dst .req x6 + +A_l .req x7 +A_h .req x8 +B_l .req x9 +B_h .req x10 +C_l .req x11 +C_h .req x12 +D_l .req x13 +D_h .req x14 + + mov dst, dstin + cmp count, #64 + b.ge .Lcpy_not_short + cmp count, #15 + b.le .Ltail15tiny + + /* + * Deal with small copies quickly by dropping straight into the + * exit block. + */ +.Ltail63: + /* + * Copy up to 48 bytes of data. At this point we only need the + * bottom 6 bits of count to be accurate. + */ + ands tmp1, count, #0x30 + b.eq .Ltail15 + add dst, dst, tmp1 + add src, src, tmp1 + cmp tmp1w, #0x20 + b.eq 1f + b.lt 2f + USER(8f, ldp A_l, A_h, [src, #-48]) + USER(8f, stp A_l, A_h, [dst, #-48]) +1: + USER(8f, ldp A_l, A_h, [src, #-32]) + USER(8f, stp A_l, A_h, [dst, #-32]) +2: + USER(8f, ldp A_l, A_h, [src, #-16]) + USER(8f, stp A_l, A_h, [dst, #-16]) + +.Ltail15: + ands count, count, #15 + beq 1f + add src, src, count + USER(9f, ldp A_l, A_h, [src, #-16]) + add dst, dst, count + USER(9f, stp A_l, A_h, [dst, #-16]) +1: + b .Lsuccess + +.Ltail15tiny: + /* + * Copy up to 15 bytes of data. Does not assume additional data + * being copied. + */ + tbz count, #3, 1f + USER(10f, ldr tmp1, [src], #8) + USER(10f, str tmp1, [dst], #8) +1: + tbz count, #2, 1f + USER(10f, ldr tmp1w, [src], #4) + USER(10f, str tmp1w, [dst], #4) +1: + tbz count, #1, 1f + USER(10f, ldrh tmp1w, [src], #2) + USER(10f, strh tmp1w, [dst], #2) +1: + tbz count, #0, 1f + USER(10f, ldrb tmp1w, [src]) + USER(10f, strb tmp1w, [dst]) +1: + b .Lsuccess + +.Lcpy_not_short: + /* + * We don't much care about the alignment of DST, but we want SRC + * to be 128-bit (16 byte) aligned so that we don't cross cache line + * boundaries on both loads and stores. + */ + neg tmp2, src + ands tmp2, tmp2, #15 /* Bytes to reach alignment. */ + b.eq 2f + sub count, count, tmp2 + /* + * Copy more data than needed; it's faster than jumping + * around copying sub-Quadword quantities. We know that + * it can't overrun. + */ + USER(11f, ldp A_l, A_h, [src]) + add src, src, tmp2 + USER(11f, stp A_l, A_h, [dst]) + add dst, dst, tmp2 + /* There may be less than 63 bytes to go now. */ + cmp count, #63 + b.le .Ltail63 +2: + subs count, count, #128 + b.ge .Lcpy_body_large + /* + * Less than 128 bytes to copy, so handle 64 here and then jump + * to the tail. + */ + USER(12f, ldp A_l, A_h, [src]) + USER(12f, ldp B_l, B_h, [src, #16]) + USER(12f, ldp C_l, C_h, [src, #32]) + USER(12f, ldp D_l, D_h, [src, #48]) + USER(12f, stp A_l, A_h, [dst]) + USER(12f, stp B_l, B_h, [dst, #16]) + USER(12f, stp C_l, C_h, [dst, #32]) + USER(12f, stp D_l, D_h, [dst, #48]) + tst count, #0x3f + add src, src, #64 + add dst, dst, #64 + b.ne .Ltail63 + b .Lsuccess + + /* + * Critical loop. Start at a new cache line boundary. Assuming + * 64 bytes per line this ensures the entire loop is in one line. 
+ */ + .p2align 6 +.Lcpy_body_large: + /* There are at least 128 bytes to copy. */ + USER(12f, ldp A_l, A_h, [src, #0]) + sub dst, dst, #16 /* Pre-bias. */ + USER(13f, ldp B_l, B_h, [src, #16]) + USER(13f, ldp C_l, C_h, [src, #32]) + USER(13f, ldp D_l, D_h, [src, #48]!) /* src += 64 - Pre-bias. */ +1: + USER(13f, stp A_l, A_h, [dst, #16]) + USER(13f, ldp A_l, A_h, [src, #16]) + USER(13f, stp B_l, B_h, [dst, #32]) + USER(13f, ldp B_l, B_h, [src, #32]) + USER(13f, stp C_l, C_h, [dst, #48]) + USER(13f, ldp C_l, C_h, [src, #48]) + USER(13f, stp D_l, D_h, [dst, #64]!) + USER(13f, ldp D_l, D_h, [src, #64]!) + subs count, count, #64 + b.ge 1b + USER(14f, stp A_l, A_h, [dst, #16]) + USER(14f, stp B_l, B_h, [dst, #32]) + USER(14f, stp C_l, C_h, [dst, #48]) + USER(14f, stp D_l, D_h, [dst, #64]) + add src, src, #16 + add dst, dst, #64 + 16 + tst count, #0x3f + b.ne .Ltail63 +.Lsuccess: + /* Nothing left to copy */ + mov x0, #0 + ret + + .macro copy_abort_table +8: + /* + * Count bytes remain + * dst points to (dst + tmp1) + */ + mov x0, count + sub dst, dst, tmp1 + b .Lfinalize +9: + /* + * 16 bytes remain + * dst is accurate + */ + mov x0, #16 + b .Lfinalize +10: + /* + * count is accurate + * dst is accurate + */ + mov x0, count + b .Lfinalize +11: + /* + *(count + tmp2) bytes remain + * dst points to the start of the remaining bytes + */ + add x0, count, tmp2 + b .Lfinalize +12: + /* + * (count + 128) bytes remain + * dst is accurate + */ + add x0, count, #128 + b .Lfinalize +13: + /* + * (count + 128) bytes remain + * dst is pre-biased to (dst + 16) + */ + add x0, count, #128 + sub dst, dst, #16 + b .Lfinalize +14: + /* + * count is accurate + * dst is pre-biased to (dst + 16) + */ + mov x0, count + sub dst, dst, #16 + /* fall-through */ +.Lfinalize: + /* + * Zeroize remaining destination-buffer + */ + mov count, x0 +20: + /* Zero remaining buffer space */ + strb wzr, [dst], #1 + subs count, count, #1 + b.ne 20b + ret + .endm diff --git a/arch/arm64/lib/copy_to_user.S b/arch/arm64/lib/copy_to_user.S index a0aeeb9..08787b0 100644 --- a/arch/arm64/lib/copy_to_user.S +++ b/arch/arm64/lib/copy_to_user.S @@ -15,7 +15,6 @@ */ #include <linux/linkage.h> -#include <asm/assembler.h> /* * Copy to user space from a kernel buffer (alignment handled by the hardware) @@ -28,34 +27,10 @@ * x0 - bytes not copied */ ENTRY(__copy_to_user) - add x4, x0, x2 // upper user buffer boundary - subs x2, x2, #8 - b.mi 2f -1: - ldr x3, [x1], #8 - subs x2, x2, #8 -USER(9f, str x3, [x0], #8 ) - b.pl 1b -2: adds x2, x2, #4 - b.mi 3f - ldr w3, [x1], #4 - sub x2, x2, #4 -USER(9f, str w3, [x0], #4 ) -3: adds x2, x2, #2 - b.mi 4f - ldrh w3, [x1], #2 - sub x2, x2, #2 -USER(9f, strh w3, [x0], #2 ) -4: adds x2, x2, #1 - b.mi 5f - ldrb w3, [x1] -USER(9f, strb w3, [x0] ) -5: mov x0, #0 - ret +#include "copy_template.S" ENDPROC(__copy_to_user) .section .fixup,"ax" - .align 2 -9: sub x0, x4, x0 // bytes not copied - ret + .align 2 + copy_abort_table .previous
Using the glibc cortex-strings work authored by Linaro as a base to
create new copy to/from user kernel routines.

Iperf performance increase:
             -l (size)    1 core result
  Optimized  64B          44-51Mb/s
             1500B        4.9Gb/s
             30000B       16.2Gb/s
  Original   64B          34-50.7Mb/s
             1500B        4.7Gb/s
             30000B       14.5Gb/s

Signed-off-by: Feng Kan <fkan@apm.com>
---
 arch/arm64/lib/copy_from_user.S |  36 +-----
 arch/arm64/lib/copy_template.S  | 278 ++++++++++++++++++++++++++++++++++++++++
 arch/arm64/lib/copy_to_user.S   |  31 +----
 3 files changed, 284 insertions(+), 61 deletions(-)
 create mode 100644 arch/arm64/lib/copy_template.S