Message ID | 1396926260-7705-1-git-send-email-victor.kamensky@linaro.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, Apr 07, 2014 at 08:04:20PM -0700, Victor Kamensky wrote: > After instruction write into xol area, on ARM V7 > architecture code need to flush icache to sync up dcache > and icache. Having just 'flush_dcache_page(page)' call is > not enough - it is possible to have stale instruction > sitting in icache for given xol area slot address. > > Introduce arch_uprobe_xol_sync_dcache_icache weak function > that by default calls 'flush_dcache_page(page)' and on > ARM define one that calls __cpuc_coherent_user_range. You're right that something is needed. Would flush_icache_range() not be the correct thing to call here? Sync'ing the I and D sides for a whole page seems excessive to install a probe. I believe __cpuc_coherent_kern_range() is not intended as an API, and should only be used in special situations in arch-specific code. It's the correct operation, but arm provides flush_icache_range as a wrapper to call. flush_icache_range() is documented, with the correct effect, in Documentation/cachetlb.txt, so I believe that all arches should provide it, even if it's a no-op. Cheers ---Dave > > Signed-off-by: Victor Kamensky <victor.kamensky@linaro.org> > --- > arch/arm/kernel/uprobes.c | 6 ++++++ > include/linux/uprobes.h | 3 +++ > kernel/events/uprobes.c | 20 +++++++++++++++----- > 3 files changed, 24 insertions(+), 5 deletions(-) > > diff --git a/arch/arm/kernel/uprobes.c b/arch/arm/kernel/uprobes.c > index f9bacee..841750b 100644 > --- a/arch/arm/kernel/uprobes.c > +++ b/arch/arm/kernel/uprobes.c > @@ -113,6 +113,12 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, > return 0; > } > > +void arch_uprobe_xol_sync_dcache_icache(struct page *page, > + unsigned long addr, unsigned long size) > +{ > + __cpuc_coherent_user_range(addr, size); > +} > + > int arch_uprobe_pre_xol(struct arch_uprobe *auprobe, struct pt_regs *regs) > { > struct uprobe_task *utask = current->utask; > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h > index edff2b9..2fe3dfd 100644 > --- a/include/linux/uprobes.h > +++ b/include/linux/uprobes.h > @@ -32,6 +32,7 @@ struct vm_area_struct; > struct mm_struct; > struct inode; > struct notifier_block; > +struct page; > > #define UPROBE_HANDLER_REMOVE 1 > #define UPROBE_HANDLER_MASK 1 > @@ -127,6 +128,8 @@ extern int arch_uprobe_exception_notify(struct notifier_block *self, unsigned l > extern void arch_uprobe_abort_xol(struct arch_uprobe *aup, struct pt_regs *regs); > extern unsigned long arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs *regs); > extern bool __weak arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs); > +extern void __weak arch_uprobe_xol_sync_dcache_icache(struct page *page, > + unsigned long addr, unsigned long size); > #else /* !CONFIG_UPROBES */ > struct uprobes_state { > }; > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > index 04709b6..0027883 100644 > --- a/kernel/events/uprobes.c > +++ b/kernel/events/uprobes.c > @@ -1299,11 +1299,9 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) > /* Initialize the slot */ > copy_to_page(area->page, xol_vaddr, > &uprobe->arch.ixol, sizeof(uprobe->arch.ixol)); > - /* > - * We probably need flush_icache_user_range() but it needs vma. > - * This should work on supported architectures too. > - */ > - flush_dcache_page(area->page); > + > + arch_uprobe_xol_sync_dcache_icache(area->page, > + xol_vaddr, sizeof(uprobe->arch.ixol)); > > return xol_vaddr; > } > @@ -1346,6 +1344,18 @@ static void xol_free_insn_slot(struct task_struct *tsk) > } > } > > +void __weak arch_uprobe_xol_sync_dcache_icache(struct page *page, > + unsigned long addr, unsigned long len) > +{ > + /* > + * We probably need flush_icache_user_range() but it needs vma. > + * This should work on most of architectures by default. If > + * architecture needs to do something different it can define > + * its own version of the function. > + */ > + flush_dcache_page(page); > +} > + > /** > * uprobe_get_swbp_addr - compute address of swbp given post-swbp regs > * @regs: Reflects the saved state of the task after it has hit a breakpoint > -- > 1.8.1.4 > > > _______________________________________________ > linaro-kernel mailing list > linaro-kernel@lists.linaro.org > http://lists.linaro.org/mailman/listinfo/linaro-kernel
Hi David, On 8 April 2014 06:05, David Long <dave.long@linaro.org> wrote: > > Victor, do you want to own the action of coming up with a patch? If not I'm > OK with taking this forward, although I suspect both of us will need help > testing for anything other than ARM. Please take it forward. I can do some testing of patches you will come up with. Thanks, Victor
On 8 April 2014 07:09, Victor Kamensky <victor.kamensky@linaro.org> wrote: > On 8 April 2014 06:30, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: >> On Tue, Apr 08, 2014 at 09:05:49AM -0400, David Long wrote: >>> Unfortunately copy_to_user_page() also needs a pointer to a vma struct >>> so, while it presumably provides the model to follow, it can't simply be >>> dropped in. >> >> Well, isn't this code doing the same thing as ptrace? It seems to want >> to modify a page in userspace of another process to change instructions >> that are going to be executed. That's what ptrace does, and ptrace >> already copes with all the issues there. > > As I see it, the difference between ptrace use and xol single stepping > it that in first all cores should be involved and potentially cache operation > would be broadcasted. In case of uprobes only local core is involved. > The way I read uprobes code xol slot will be used once by current core > to execute one instruction, then it will hit next breakpoint and xol slot > will be freed and potentially reused by next uprobe. > > I assume that while doing xol single stepping task cannot migrate to > another core, wondering whether it is absolutely true. It seems that > similar ppc code that calls flush_dcache_page follow this assumption. > If uprobes xol single stepping need to handle all cores it will be very > expensive. > > In ptrace case caches of all cores must be handled. > > Having said above I do agree that large portion of flush_ptrace_access > could be reused between these two use cases. Looking at flush_ptrace_access more closely. Now I am not sure that ptrace write code could easily reused. 1) flush_ptrace_access seems to handle both data and text segments write. In case of xol write we always know that it is code write 2) as I pointed before flush_ptrace_access handles smp case whereas xol write does not need to do that 3) flush_ptrace_access needs vma, which uprobe code does not have on layer where xol is installed. That is probably most critical issue that could stop suggested code sharing. And note that vma in flush_ptrace_access is needed only for cases cases 1) and 2) above, about which xol write does not really care. If we take only required pieces from flush_ptrace_access would not xol cache flush look like this: void arch_uprobe_xol_sync_dcache_icache(struct page *page, unsigned long addr, unsigned long size) { if (icache_is_vipt_aliasing()) flush_icache_alias(page_to_pfn(page), addr, size); else __cpuc_coherent_user_range(addr, size); } of course, above function should be properly places or flush_icache_alias function should be made globally visible. Note on the side: flush_ptrace_access works with user-land memory but it still uses __cpuc_coherent_kern_range. I think it should use __cpuc_coherent_user_range. It looks like implementation wise it is the same but user variant seems more appropriate in given situation. Or am I missing something here? Thanks, Victor > Thanks, > Victor > >> Given that we've already solved that problem, wouldn't it be a good idea >> if the tracing code would stop trying to reinvent broken solutions to >> problems we have already solved? >> >> -- >> FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly >> improving, and getting towards what was expected from it.
On 8 April 2014 09:19, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > On Tue, Apr 08, 2014 at 08:35:01AM -0700, Victor Kamensky wrote: >> Looking at flush_ptrace_access more closely. Now I am not sure that >> ptrace write code could easily reused. >> >> 1) flush_ptrace_access seems to handle both data and text segments >> write. In case of xol write we always know that it is code write > > Of course it has to, but writing code is the harder of the two > problems. With writes to data segments, the only thing that has to > be dealt with is the data cache. With code, not only do you need to > deal with the data cache, but you also need to deal with the instruction > cache too. > >> 2) as I pointed before flush_ptrace_access handles smp case whereas >> xol write does not need to do that > > Are you sure about that? > > If I'm reading the code correctly, uprobes inserts a trapping instruction > into the userspace program. When that instruction is hit, it checks > whether the thread is the desired one, and may request a slot in this > magic page, which is when the write happens. > > The uprobes special page is shared across all threads which share the > mm_struct, so in the case of a multi-threaded program running on a SMP > machine, this page is visible to multiple CPUs. > > Is it possible for uprobes to be active on more than one thread at a > time? If so, because that page is shared, you could end up writing > to a partial cache line from two threads. From what I can see, ixol[] > is two words, and there's normally 8 works per cache line on ARM, or > occasionally 16. > > So, the question now is: is it possible to have uprobes active on more > than one thread, and for two threads to hit the uprobes processing, both > needing a slot in the page, hitting the same cache line? > > Now, what happens if thread 1 on CPU1 gets there first with its write. > Then thread 2 on CPU2 gets there, causing the cache line to migrate to > CPU2. Then CPU1 does it's (non-broadcasted) flush, meanwhile CPU2 then > gets preempted and goes off and does something else. > > Please tell me that can't happen. :) > >> 3) flush_ptrace_access needs vma, which uprobe code does not have on >> layer where xol is installed. That is probably most critical issue that >> could stop suggested code sharing. And note that vma in >> flush_ptrace_access is needed only for cases cases 1) and 2) above, >> about which xol write does not really care. If we take only required >> pieces from flush_ptrace_access would not xol cache flush look like >> this: >> >> void arch_uprobe_xol_sync_dcache_icache(struct page *page, >> unsigned long addr, unsigned long size) >> { >> if (icache_is_vipt_aliasing()) >> flush_icache_alias(page_to_pfn(page), addr, size); >> else >> __cpuc_coherent_user_range(addr, size); >> } > > And this is only half the story. > > What you're saying above is: > > - if the *instruction* cache is an aliasing VIPT cache, then we can > map the user page at the same colour and flush the data and instruction > caches according to that colour. > Problem: if you have a VIPT aliasing data cache, and you wrote > to the page via a mapping which had a different colour > (which is highly likely given you're using kmap_atomic()) > then you are not flushing the new instructions out of > the data cache. > - if the *instruction* cache is not an aliasing VIPT cache, then we > flush using the userspace mapping directly. > Problem: what if the data cache is aliasing... same problem > as the above case. > > This is why we have the check for cache_is_vipt_aliasing() in there - > that's not _just_ to deal with ptrace data modification, it's there for > text modification as well. Thanks! I got it now. Did not fully understand flush_ptrace_access code. And did not catch (misread) difference between cache_is_vipt_aliasing and icache_is_vipt_aliasing(). Probably some technique that retrieves required information from vma like 1) is_exec = vma->vm_flags & VM_EXEC 2) core_in_mm = cpumask_test_cpu(smp_processor_id(), mm_cpumask(vma->vm_mm)) 3) broadcast = cache_ops_need_broadcast() and passes it to common function would allow to share common code and relieve need to search for vma after xol write call. After xol write will call common function will be called with is_exec = 1, core_in_mm = 1, broadcast = 0. Although that common function will have seven parameters and will look a bit ugly and unnatural. It could be compensated by making it inline so flush_ptrace_access and flush_uprobe_xol_access function would effectively have its own copy and correctly optimized. Other option is to have flush_ptrace_access that takes vma and flush_uprobe_xol_access function side by side in the same file with comment that begs to keep common logic between two functions. But not sure how that would work in long term. Comments? Preferences? Any other suggestions how to deal with lack of vma in uprobe xol write case? Thanks, Victor >> Note on the side: flush_ptrace_access works with user-land memory >> but it still uses __cpuc_coherent_kern_range. I think it should use >> __cpuc_coherent_user_range. > > No, because __cpuc_coherent_kern_range() there is used on the *kernel* > mapping of the page, not on the *user* mapping (which may fault). The > kernel mapping will never fault. > >> It looks like implementation wise it is >> the same but user variant seems more appropriate in given situation. >> Or am I missing something here? > > They do seem to be the same for both cases, so we could probably remove > the distinction. It's unlikely we'll ever see an implementation needing > a difference between the two now. > > -- > FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly > improving, and getting towards what was expected from it.
=========== It seems to me that ARMv7 uprobes need proper icache flush after xol write. Please look at [1] discussion for similar issue on ppc. It seems that flush_dcache_page was sufficient for latter architectures of PPC but it does not look that it is good enough for ARMv7. AFAIK know ARM V7 does not have "snooping Harvard caches" and needs something like __cpuc_coherent_user_range function call to sync up icache and dcache after instruction write through dcache. Patch that I propose follows this cover letter. There I introduced weak arch_uprobe_xol_sync_dcache_icache function that does traditional flush_dcache_page call and I redefined this function to __cpuc_coherent_user_range call in ARM v7 > case. [1] http://linux-kernel.2935.n7.nabble.com/Re-PATCH-6-9-uprobes-flush-cache-after-xol-write-td216886.html Longer story ============ I was trying Dave's armv7 uprobes with SystemTap on Arndale board. I used Linaro linux branch 3.14 based that contained Dave's armv7 uprobes topic code. I believe it should be pretty much the same as armv7 uprobes code that went to Russell's tree. I was able to do one function simple test - it worked fine for me. But when I've tried to run many function like "probe process("foobar").function("*")" probe SystemTap my target process always crashed. After quite a bit of chasing the issue, I was able to come up with test case that shows several probes installed against 'ls' process. First probe placed at 'push {r4, r5, r6, r7, r8, r9, r10, r11, lr}' instruction, which is first in _getopt_initialize function, then script adds few more probes at _getopt_initialize addresses that are executed latter. And in those probes I dump registers set and top of stack. By looking at execution of script one may easily conclude that it looks like that for each probe 'push {r4, r5, r6, r7, r8, r9, r10, r11, lr}' instruction is always executed - one may see 36 bytes increase of stack size and see copy of corresponding registers on the stack. The code path is the following: handle_swbp -> pre_ssout pre_ssout -> xol_get_insn_slot xol_get_insn_slot -> copy_to_page xol_get_insn_slot -> flush_dcache_page pre_ssout -> arch_uprobe_pre_xol pre_ssout function calls xol_get_insn_slot which finds available slot in XOL area, that is mapped into user process and copies required instruction into xol slot. After that it calls flush_dcache_page, but icache is not flushed in ARM case by this function. So I think the following thing happens: first time first xol slot got 'push {r4, r5, r6, r7, r8, r9, r10, r11, lr}' instruction and it retrieved into icache. Latter when other probes are executed the same first slot of xol area it will get different instruction but because icache is not flushed CPU keep executing 'push' instruction. When I add the following testing patch that flush icache in arch_uprobe_pre_xol [kamensky@kamensky-w530 git]$ git diff diff --git a/arch/arm/kernel/uprobes.c b/arch/arm/kernel/uprobes.c index f9bacee..ef34623 100644 --- a/arch/arm/kernel/uprobes.c +++ b/arch/arm/kernel/uprobes.c @@ -117,6 +117,8 @@ int arch_uprobe_pre_xol(struct arch_uprobe *auprobe, struct pt_regs *regs) { struct uprobe_task *utask = current->utask; + __flush_icache_all(); + if (auprobe->prehandler) auprobe->prehandler(auprobe, &utask->autask, regs);