Message ID | 1360587435-28386-1-git-send-email-ivan.djelic@parrot.com (mailing list archive) |
---|---|
State | New, archived |
On 11.02.2013 13:57, Ivan Djelic wrote:
> Recent GCC versions (e.g. GCC-4.7.2) perform optimizations based on
> assumptions about the implementation of memset and similar functions.
> The current ARM optimized memset code does not return the value of
> its first argument, as is usually expected from standard implementations.
>
> For instance in the following function:
>
> void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
> {
> 	memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
> 	waiter->magic = waiter;
> 	INIT_LIST_HEAD(&waiter->list);
> }
>
> compiled as:
>
> 800554d0 <debug_mutex_lock_common>:
> 800554d0:	e92d4008	push	{r3, lr}
> 800554d4:	e1a00001	mov	r0, r1
> 800554d8:	e3a02010	mov	r2, #16	; 0x10
> 800554dc:	e3a01011	mov	r1, #17	; 0x11
> 800554e0:	eb04426e	bl	80165ea0 <memset>
> 800554e4:	e1a03000	mov	r3, r0
> 800554e8:	e583000c	str	r0, [r3, #12]
> 800554ec:	e5830000	str	r0, [r3]
> 800554f0:	e5830004	str	r0, [r3, #4]
> 800554f4:	e8bd8008	pop	{r3, pc}
>
> GCC assumes memset returns the value of pointer 'waiter' in register r0,
> causing register/memory corruption.
>
> This patch fixes the return value of the assembly version of memset.
> It adds a 'mov' instruction and merges an additional load+store into
> existing load/store instructions.
> For ease of review, here is a breakdown of the patch into 4 simple steps:
>
> Step 1
> ======
> Perform the following substitutions:
> ip -> r8, then
> r0 -> ip,
> and insert 'mov ip, r0' as the first statement of the function.
> At this point, we have a memset() implementation returning the proper result,
> but corrupting r8 on some paths (the ones that were using ip).
>
> Step 2
> ======
> Make sure r8 is saved and restored when (! CALGN(1)+0) == 1:
>
> save r8:
> -	str	lr, [sp, #-4]!
> +	stmfd	sp!, {r8, lr}
>
> and restore r8 on both exit paths:
> -	ldmeqfd	sp!, {pc}		@ Now <64 bytes to go.
> +	ldmeqfd	sp!, {r8, pc}		@ Now <64 bytes to go.
> (...)
> 	tst	r2, #16
> 	stmneia	ip!, {r1, r3, r8, lr}
> -	ldr	lr, [sp], #4
> +	ldmfd	sp!, {r8, lr}
>
> Step 3
> ======
> Make sure r8 is saved and restored when (! CALGN(1)+0) == 0:
>
> save r8:
> -	stmfd	sp!, {r4-r7, lr}
> +	stmfd	sp!, {r4-r8, lr}
>
> and restore r8 on both exit paths:
> 	bgt	3b
> -	ldmeqfd	sp!, {r4-r7, pc}
> +	ldmeqfd	sp!, {r4-r8, pc}
> (...)
> 	tst	r2, #16
> 	stmneia	ip!, {r4-r7}
> -	ldmfd	sp!, {r4-r7, lr}
> +	ldmfd	sp!, {r4-r8, lr}
>
> Step 4
> ======
> Rewrite register list "r4-r7, r8" as "r4-r8".
>
> Signed-off-by: Ivan Djelic <ivan.djelic@parrot.com>
> Reviewed-by: Nicolas Pitre <nico@linaro.org>

Sent as 7668/1 to rmk's patch system.

Thanks

Dirk
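The optimization GCC applies here is legal because ISO C guarantees that memset() returns its first argument, so the compiler is free to keep using r0 after the call instead of preserving the destination pointer in a callee-saved register. A minimal C sketch of the pattern (the function name and struct layout below are illustrative, not taken from the kernel):

#include <string.h>

/* Illustrative stand-in for struct mutex_waiter. */
struct waiter {
	void *next;
	void *prev;
	void *magic;
};

void init_waiter(struct waiter *w)
{
	/*
	 * GCC may emit 'bl memset' and then use r0 -- the value
	 * memset() is defined to return -- as 'w' for the store
	 * below.  If the assembly memset clobbers r0, the store
	 * goes to whatever address r0 happens to hold.
	 */
	memset(w, 0x11, sizeof(*w));
	w->magic = w;
}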
On Wed, Mar 06, 2013 at 07:15:17PM +0000, Dirk Behme wrote:
> On 11.02.2013 13:57, Ivan Djelic wrote:
> > Recent GCC versions (e.g. GCC-4.7.2) perform optimizations based on
> > assumptions about the implementation of memset and similar functions.
> > The current ARM optimized memset code does not return the value of
> > its first argument, as is usually expected from standard implementations.
[...]
> > Signed-off-by: Ivan Djelic <ivan.djelic@parrot.com>
> > Reviewed-by: Nicolas Pitre <nico@linaro.org>
>
> Sent as 7668/1 to rmk's patch system.

OK, thanks a lot!

--
Ivan
On Wed, Mar 06, 2013 at 08:15:17PM +0100, Dirk Behme wrote:
> On 11.02.2013 13:57, Ivan Djelic wrote:
>> Recent GCC versions (e.g. GCC-4.7.2) perform optimizations based on
>> assumptions about the implementation of memset and similar functions.
>> The current ARM optimized memset code does not return the value of
>> its first argument, as is usually expected from standard implementations.
[...]
>> Signed-off-by: Ivan Djelic <ivan.djelic@parrot.com>
>> Reviewed-by: Nicolas Pitre <nico@linaro.org>
>
> Sent as 7668/1 to rmk's patch system.

Thanks, except for one minor detail.  As you are the one sending it to me,
you are "handling" the patch, so it should have your sign-off too.

Rather than resubmit, please send me an email followup with the sign-off
tag included.  Thanks.
On 07.03.2013 16:17, Russell King - ARM Linux wrote:
> On Wed, Mar 06, 2013 at 08:15:17PM +0100, Dirk Behme wrote:
>> On 11.02.2013 13:57, Ivan Djelic wrote:
>>> Recent GCC versions (e.g. GCC-4.7.2) perform optimizations based on
>>> assumptions about the implementation of memset and similar functions.
>>> The current ARM optimized memset code does not return the value of
>>> its first argument, as is usually expected from standard implementations.

I've just tried this patch with kernel 3.8.2 on an ARMv5 system where I
have been using gcc 4.7.2 for several months, and where most parts of the
system are compiled with gcc 4.7.2 too.

And I had at least one problem, which manifested itself with

[  181.198559] pts1: unknown flag 212

while trying to establish a BTLE connection.

So I assume either the patch is wrong, the patch isn't the whole story,
an existing bug is now being triggered, or ... I don't know.

Regards,

Alexander
On Sun, Mar 10, 2013 at 06:06:11PM +0100, Alexander Holler wrote:
> On 07.03.2013 16:17, Russell King - ARM Linux wrote:
[...]
> I've just tried this patch with kernel 3.8.2 on an ARMv5 system where I
> have been using gcc 4.7.2 for several months, and where most parts of the
> system are compiled with gcc 4.7.2 too.
>
> And I had at least one problem, which manifested itself with

Yes, the patch _is_ wrong.  Reverted.  I was trusting Nicolas' review
of it, but the patch is definitely wrong.  Look carefully at this
fragment of code:

1:	subs	r2, r2, #4		@ 1 do we have enough
	blt	5f			@ 1 bytes to align with?
	cmp	r3, #2			@ 1
	strltb	r1, [ip], #1		@ 1
	strleb	r1, [ip], #1		@ 1
	strb	r1, [ip], #1		@ 1
	add	r2, r2, r3		@ 1 (r2 = r2 - (4 - r3))
/*
 * The pointer is now aligned and the length is adjusted.  Try doing the
 * memset again.
 */

ENTRY(memset)
/*
 * Preserve the contents of r0 for the return value.
 */
	mov	ip, r0
	ands	r3, ip, #3		@ 1 unaligned?
	bne	1b			@ 1

and consider what happens when 'r0' is not aligned to a word... We end
up aligning the pointer in "1:" and then fall through into memset again,
which reloads the old misaligned pointer.
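To make the fix direction concrete: one way to close this hole, sketched under the assumption that the rest of the patch stays as-is (labels are illustrative, and this is not necessarily the exact fix that was eventually merged), is to move the alignment fixup below the entry point, so 'mov ip, r0' executes exactly once and the fixup rejoins the already-aligned path instead of falling back through the entry:

ENTRY(memset)
	ands	r3, r0, #3		@ unaligned?
	mov	ip, r0			@ preserve r0 as the return value
	bne	6f			@ go align ip below
/*
 * The pointer in ip is now known to be aligned to a word boundary.
 */
1:	orr	r1, r1, r1, lsl #8
	@ ... aligned fill path as in the patch, unchanged ...

6:	subs	r2, r2, #4		@ do we have enough
	blt	5b			@ bytes to align with? ('5:' is the
					@ existing byte-store tail)
	cmp	r3, #2
	strltb	r1, [ip], #1
	strleb	r1, [ip], #1
	strb	r1, [ip], #1
	add	r2, r2, r3		@ r2 = r2 - (4 - r3)
	b	1b			@ retry with the aligned ip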
On 10.03.2013 18:28, Russell King - ARM Linux wrote:
> On Sun, Mar 10, 2013 at 06:06:11PM +0100, Alexander Holler wrote:
[...]
> Yes, the patch _is_ wrong.  Reverted.  I was trusting Nicolas' review
> of it, but the patch is definitely wrong.  Look carefully at this
> fragment of code:
[...]
> and consider what happens when 'r0' is not aligned to a word... We end
> up aligning the pointer in "1:" and then fall through into memset again,
> which reloads the old misaligned pointer.

Thanks a lot for the very fast answer. I myself wasn't in the mood to go
through ARM assembler (which I don't read that often), sorry.

Regards,

Alexander
On Sun, 10 Mar 2013, Russell King - ARM Linux wrote:
> On Sun, Mar 10, 2013 at 06:06:11PM +0100, Alexander Holler wrote:
[...]
> Yes, the patch _is_ wrong.  Reverted.  I was trusting Nicolas' review
> of it, but the patch is definitely wrong.  Look carefully at this
> fragment of code:

Dang.  Indeed.  Sorry about that.

Nicolas
On Sun, Mar 10, 2013 at 05:28:54PM +0000, Russell King - ARM Linux wrote:
> On Sun, Mar 10, 2013 at 06:06:11PM +0100, Alexander Holler wrote:
[...]
> Yes, the patch _is_ wrong.  Reverted.  I was trusting Nicolas' review
> of it, but the patch is definitely wrong.  Look carefully at this
> fragment of code:
[...]
> and consider what happens when 'r0' is not aligned to a word... We end
> up aligning the pointer in "1:" and then fall through into memset again,
> which reloads the old misaligned pointer.

Oops... Indeed. Thanks very much for catching that, and sorry for the
regression.

--
Ivan.
Step 1
======
Perform the following substitutions:
ip -> r8, then
r0 -> ip,
and insert 'mov ip, r0' as the first statement of the function.
At this point, we have a memset() implementation returning the proper result,
but corrupting r8 on some paths (the ones that were using ip).

Step 2
======
Make sure r8 is saved and restored when (! CALGN(1)+0) == 1:

save r8:
-	str	lr, [sp, #-4]!
+	stmfd	sp!, {r8, lr}

and restore r8 on both exit paths:
-	ldmeqfd	sp!, {pc}		@ Now <64 bytes to go.
+	ldmeqfd	sp!, {r8, pc}		@ Now <64 bytes to go.
(...)
	tst	r2, #16
	stmneia	ip!, {r1, r3, r8, lr}
-	ldr	lr, [sp], #4
+	ldmfd	sp!, {r8, lr}

Step 3
======
Make sure r8 is saved and restored when (! CALGN(1)+0) == 0:

save r8:
-	stmfd	sp!, {r4-r7, lr}
+	stmfd	sp!, {r4-r8, lr}

and restore r8 on both exit paths:
	bgt	3b
-	ldmeqfd	sp!, {r4-r7, pc}
+	ldmeqfd	sp!, {r4-r8, pc}
(...)
	tst	r2, #16
	stmneia	ip!, {r4-r7}
-	ldmfd	sp!, {r4-r7, lr}
+	ldmfd	sp!, {r4-r8, lr}

Step 4
======
Rewrite register list "r4-r7, r8" as "r4-r8".

Signed-off-by: Ivan Djelic <ivan.djelic@parrot.com>
Reviewed-by: Nicolas Pitre <nico@linaro.org>
---
v2 changelog:
- added patch breakdown explanation to commit text

 arch/arm/lib/memset.S | 85 +++++++++++++++++++++++++------------------------
 1 file changed, 44 insertions(+), 41 deletions(-)

diff --git a/arch/arm/lib/memset.S b/arch/arm/lib/memset.S
index 650d592..d912e73 100644
--- a/arch/arm/lib/memset.S
+++ b/arch/arm/lib/memset.S
@@ -19,9 +19,9 @@
 1:	subs	r2, r2, #4		@ 1 do we have enough
 	blt	5f			@ 1 bytes to align with?
 	cmp	r3, #2			@ 1
-	strltb	r1, [r0], #1		@ 1
-	strleb	r1, [r0], #1		@ 1
-	strb	r1, [r0], #1		@ 1
+	strltb	r1, [ip], #1		@ 1
+	strleb	r1, [ip], #1		@ 1
+	strb	r1, [ip], #1		@ 1
 	add	r2, r2, r3		@ 1 (r2 = r2 - (4 - r3))
 /*
  * The pointer is now aligned and the length is adjusted.  Try doing the
@@ -29,10 +29,14 @@
  */

 ENTRY(memset)
-	ands	r3, r0, #3		@ 1 unaligned?
+/*
+ * Preserve the contents of r0 for the return value.
+ */
+	mov	ip, r0
+	ands	r3, ip, #3		@ 1 unaligned?
 	bne	1b			@ 1
 /*
- * we know that the pointer in r0 is aligned to a word boundary.
+ * we know that the pointer in ip is aligned to a word boundary.
  */
 	orr	r1, r1, r1, lsl #8
 	orr	r1, r1, r1, lsl #16
@@ -43,29 +47,28 @@ ENTRY(memset)
 #if ! CALGN(1)+0

 /*
- * We need an extra register for this loop - save the return address and
- * use the LR
+ * We need 2 extra registers for this loop - use r8 and the LR
  */
-	str	lr, [sp, #-4]!
-	mov	ip, r1
+	stmfd	sp!, {r8, lr}
+	mov	r8, r1
 	mov	lr, r1

 2:	subs	r2, r2, #64
-	stmgeia	r0!, {r1, r3, ip, lr}	@ 64 bytes at a time.
-	stmgeia	r0!, {r1, r3, ip, lr}
-	stmgeia	r0!, {r1, r3, ip, lr}
-	stmgeia	r0!, {r1, r3, ip, lr}
+	stmgeia	ip!, {r1, r3, r8, lr}	@ 64 bytes at a time.
+	stmgeia	ip!, {r1, r3, r8, lr}
+	stmgeia	ip!, {r1, r3, r8, lr}
+	stmgeia	ip!, {r1, r3, r8, lr}
 	bgt	2b
-	ldmeqfd	sp!, {pc}		@ Now <64 bytes to go.
+	ldmeqfd	sp!, {r8, pc}		@ Now <64 bytes to go.
 /*
  * No need to correct the count; we're only testing bits from now on
  */
 	tst	r2, #32
-	stmneia	r0!, {r1, r3, ip, lr}
-	stmneia	r0!, {r1, r3, ip, lr}
+	stmneia	ip!, {r1, r3, r8, lr}
+	stmneia	ip!, {r1, r3, r8, lr}
 	tst	r2, #16
-	stmneia	r0!, {r1, r3, ip, lr}
-	ldr	lr, [sp], #4
+	stmneia	ip!, {r1, r3, r8, lr}
+	ldmfd	sp!, {r8, lr}

 #else

@@ -74,54 +77,54 @@ ENTRY(memset)
  * whole cache lines at once.
  */

-	stmfd	sp!, {r4-r7, lr}
+	stmfd	sp!, {r4-r8, lr}
 	mov	r4, r1
 	mov	r5, r1
 	mov	r6, r1
 	mov	r7, r1
-	mov	ip, r1
+	mov	r8, r1
 	mov	lr, r1

 	cmp	r2, #96
-	tstgt	r0, #31
+	tstgt	ip, #31
 	ble	3f

-	and	ip, r0, #31
-	rsb	ip, ip, #32
-	sub	r2, r2, ip
-	movs	ip, ip, lsl #(32 - 4)
-	stmcsia	r0!, {r4, r5, r6, r7}
-	stmmiia	r0!, {r4, r5}
-	tst	ip, #(1 << 30)
-	mov	ip, r1
-	strne	r1, [r0], #4
+	and	r8, ip, #31
+	rsb	r8, r8, #32
+	sub	r2, r2, r8
+	movs	r8, r8, lsl #(32 - 4)
+	stmcsia	ip!, {r4, r5, r6, r7}
+	stmmiia	ip!, {r4, r5}
+	tst	r8, #(1 << 30)
+	mov	r8, r1
+	strne	r1, [ip], #4

 3:	subs	r2, r2, #64
-	stmgeia	r0!, {r1, r3-r7, ip, lr}
-	stmgeia	r0!, {r1, r3-r7, ip, lr}
+	stmgeia	ip!, {r1, r3-r8, lr}
+	stmgeia	ip!, {r1, r3-r8, lr}
 	bgt	3b
-	ldmeqfd	sp!, {r4-r7, pc}
+	ldmeqfd	sp!, {r4-r8, pc}

 	tst	r2, #32
-	stmneia	r0!, {r1, r3-r7, ip, lr}
+	stmneia	ip!, {r1, r3-r8, lr}
 	tst	r2, #16
-	stmneia	r0!, {r4-r7}
-	ldmfd	sp!, {r4-r7, lr}
+	stmneia	ip!, {r4-r7}
+	ldmfd	sp!, {r4-r8, lr}

 #endif

 4:	tst	r2, #8
-	stmneia	r0!, {r1, r3}
+	stmneia	ip!, {r1, r3}
 	tst	r2, #4
-	strne	r1, [r0], #4
+	strne	r1, [ip], #4
 /*
  * When we get here, we've got less than 4 bytes to zero.  We
  * may have an unaligned pointer as well.
  */
 5:	tst	r2, #2
-	strneb	r1, [r0], #1
-	strneb	r1, [r0], #1
+	strneb	r1, [ip], #1
+	strneb	r1, [ip], #1
 	tst	r2, #1
-	strneb	r1, [r0], #1
+	strneb	r1, [ip], #1
 	mov	pc, lr
 ENDPROC(memset)
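A bug of this kind (clobbered return value, broken only for misaligned pointers) is cheap to screen for in userspace before it reaches a kernel. A standalone C harness along these lines -- a sketch, exercising whichever memset implementation it is linked against rather than the kernel's -- checks both the return value and the fill for every initial alignment:

#include <assert.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	static unsigned char buf[256];

	/*
	 * Try every word alignment of the destination and a range of
	 * lengths, checking the return value, the filled bytes, and
	 * that the byte just past the fill is untouched.
	 */
	for (int off = 0; off < 4; off++) {
		for (size_t len = 0; len + off + 1 < sizeof(buf); len++) {
			memset(buf, 0xAA, sizeof(buf));
			void *ret = memset(buf + off, 0x55, len);
			assert(ret == buf + off);	/* return value */
			for (size_t i = 0; i < len; i++)
				assert(buf[off + i] == 0x55);
			assert(buf[off + len] == 0xAA);	/* no overrun */
		}
	}
	printf("memset: return value and fill OK\n");
	return 0;
}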