Message ID: 58FF2C060200007800153D45@prv-mh.provo.novell.com (mailing list archive)
State: New, archived
Hi,

At 02:59 -0600 on 25 Apr (1493089158), Jan Beulich wrote:
> Jann's explanation of the problem:
>
> "start situation:
> - domain A and domain B are PV domains
> - domain A and B both have currently scheduled vCPUs, and the vCPUs
>   are not scheduled away
> - domain A has XSM_TARGET access to domain B
> - page X is owned by domain B and has no mappings
> - page X is zeroed
>
> steps:
> - domain A uses do_mmu_update() to map page X in domain A as writable
> - domain A accesses page X through the new PTE, creating a TLB entry
> - domain A removes its mapping of page X
> - type count of page X goes to 0
> - tlbflush_timestamp of page X is bumped
> - domain B maps page X as L1 pagetable
> - type of page X changes to PGT_l1_page_table
> - TLB flush is forced using domain_dirty_cpumask of domain B
> - page X is mapped as L1 pagetable in domain B
>
> At this point, domain B's vCPUs are guaranteed to have no
> incorrectly-typed stale TLB entries for page X, but AFAICS domain A's
> vCPUs can still have stale TLB entries that map page X as writable,
> permitting domain A to control a live pagetable of domain B."

AIUI this patch solves the problem by immediately flushing domain A's
TLB entries at the point where domain A removes its mapping of page X.

Could we, instead, bitwise OR domain A's domain_dirty_cpumask into
domain B's domain_dirty_cpumask at the same point?

Then when domain B flushes TLBs in the last step (in __get_page_type())
it will catch any stale TLB entries from domain A as well. But in the
(hopefully common) case where there's a delay between domain A's
__put_page_type() and domain B's __get_page_type(), the usual TLB
timestamp filtering will suppress some of the IPIs/flushes.

Cheers,

Tim.

> Domain A necessarily is Dom0 (DomU-s with XSM_TARGET permission are
> being created only for HVM domains, but domain B needs to be PV here),
> so this is not a security issue, but nevertheless seems desirable to
> correct.
>
> Reported-by: Jann Horn <jannh@google.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v2: Don't consider page's time stamp (relevant only for the owning
>     domain).
>
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -1266,6 +1266,18 @@ void put_page_from_l1e(l1_pgentry_t l1e,
>      if ( (l1e_get_flags(l1e) & _PAGE_RW) &&
>           ((l1e_owner == pg_owner) || !paging_mode_external(pg_owner)) )
>      {
> +        /*
> +         * Don't leave stale writable TLB entries in the unmapping domain's
> +         * page tables, to prevent them allowing access to pages required to
> +         * be read-only (e.g. after pg_owner changed them to page table or
> +         * segment descriptor pages).
> +         */
> +        if ( unlikely(l1e_owner != pg_owner) )
> +        {
> +            perfc_incr(need_flush_tlb_flush);
> +            flush_tlb_mask(l1e_owner->domain_dirty_cpumask);
> +        }
> +
>          put_page_and_type(page);
>      }
>      else
>>> On 25.04.17 at 12:59, <tim@xen.org> wrote:
> At 02:59 -0600 on 25 Apr (1493089158), Jan Beulich wrote:
>> Jann's explanation of the problem:
>> [...]
>
> AIUI this patch solves the problem by immediately flushing domain A's
> TLB entries at the point where domain A removes its mapping of page X.
>
> Could we, instead, bitwise OR domain A's domain_dirty_cpumask into
> domain B's domain_dirty_cpumask at the same point?
>
> Then when domain B flushes TLBs in the last step (in __get_page_type())
> it will catch any stale TLB entries from domain A as well. But in the
> (hopefully common) case where there's a delay between domain A's
> __put_page_type() and domain B's __get_page_type(), the usual TLB
> timestamp filtering will suppress some of the IPIs/flushes.

Oh, I see. Yes, I think this would be fine. However, we don't have
a suitable cpumask accessor allowing us to do this ORing atomically,
so we'd have to open code it.
Do you think such a slightly ugly approach would be worth it here?
Foreign mappings shouldn't be _that_ performance critical...

And then, considering that this will result in time stamp based filtering
again, I'm no longer sure I was right to agree with Jann on the flush
here needing to be unconditional. Regardless of page table owner
matching page owner, the time stamp stored for the page will always
be applicable (it's a global property). So we wouldn't even need to
OR in the whole dirty mask here, but could already pre-filter (or if we
stayed with the flush-on-put approach, then v1 would have been
correct).

Jan
At 05:59 -0600 on 25 Apr (1493099950), Jan Beulich wrote:
> >>> On 25.04.17 at 12:59, <tim@xen.org> wrote:
> > [...]
> > Could we, instead, bitwise OR domain A's domain_dirty_cpumask into
> > domain B's domain_dirty_cpumask at the same point?
> > [...]
>
> Oh, I see.
> Yes, I think this would be fine. However, we don't have
> a suitable cpumask accessor allowing us to do this ORing atomically,
> so we'd have to open code it.

Probably better to build the accessor than to open code here. :)

> Do you think such a slightly ugly approach would be worth it here?
> Foreign mappings shouldn't be _that_ performance critical..

I have no real idea, though there are quite a lot of them in domain
building/migration. I can imagine a busy multi-vcpu dom0 could
generate a lot of IPIs, almost all of which could be merged.

> And then, considering that this will result in time stamp based filtering
> again, I'm no longer sure I was right to agree with Jann on the flush
> here needing to be unconditional. Regardless of page table owner
> matching page owner, the time stamp stored for the page will always
> be applicable (it's a global property). So we wouldn't even need to
> OR in the whole dirty mask here, but could already pre-filter (or if we
> stayed with the flush-on-put approach, then v1 would have been
> correct).

I don't think so. The page's timestamp is set when its typecount
falls to zero, which hasn't happened yet -- we hold a typecount
ourselves here.

In theory we could filter the bits we're adding against a local
timestamp, but that would have to be tlbflush_current_time()
because the TLB entries we care about are live right now, and
filtering against that is (probably) a noop.

Tim.
>>> On 26.04.17 at 10:44, <tim@xen.org> wrote:
> At 05:59 -0600 on 25 Apr (1493099950), Jan Beulich wrote:
>> >>> On 25.04.17 at 12:59, <tim@xen.org> wrote:
>> > [...]
>> > Then when domain B flushes TLBs in the last step (in __get_page_type())
>> > it will catch any stale TLB entries from domain A as well.
>> > But in the (hopefully common) case where there's a delay between
>> > domain A's __put_page_type() and domain B's __get_page_type(), the
>> > usual TLB timestamp filtering will suppress some of the IPIs/flushes.
>>
>> Oh, I see. Yes, I think this would be fine. However, we don't have
>> a suitable cpumask accessor allowing us to do this ORing atomically,
>> so we'd have to open code it.
>
> Probably better to build the accessor than to open code here. :)

Hmm, that would mean building a whole group of accessors (as I
wouldn't want to introduce an atomic OR one without any of the others
which exist as non-atomic ones). Plus there would be the question of
how to name them - the current inconsistency (single bit operations
being atomic unless prefixed by two underscores, while multi-bit
operations are non-atomic despite their lack of leading underscores)
doesn't really help here. IOW the original question wasn't really
whether to introduce accessors, but whether the approach you suggest
is worthwhile despite the lack of such accessors. Which I think you ...

>> Do you think such a slightly ugly approach would be worth it here?
>> Foreign mappings shouldn't be _that_ performance critical..
>
> I have no real idea, though there are quite a lot of them in domain
> building/migration. I can imagine a busy multi-vcpu dom0 could
> generate a lot of IPIs, almost all of which could be merged.

... believe it would be.

>> And then, considering that this will result in time stamp based filtering
>> again, I'm no longer sure I was right to agree with Jann on the flush
>> here needing to be unconditional. [...]
>
> I don't think so.
> The page's timestamp is set when its typecount
> falls to zero, which hasn't happened yet -- we hold a typecount
> ourselves here.
>
> In theory we could filter the bits we're adding against a local
> timestamp, but that would have to be tlbflush_current_time()
> because the TLB entries we care about are live right now, and
> filtering against that is (probably) a noop.

Good point. I guess I'll switch to the open coded merging approach
then for v3.

Jan
>>> On 25.04.17 at 12:59, <tim@xen.org> wrote:
> [...]
> Could we, instead, bitwise OR domain A's domain_dirty_cpumask into
> domain B's domain_dirty_cpumask at the same point?
>
> Then when domain B flushes TLBs in the last step (in __get_page_type())
> it will catch any stale TLB entries from domain A as well. But in the
> (hopefully common) case where there's a delay between domain A's
> __put_page_type() and domain B's __get_page_type(), the usual TLB
> timestamp filtering will suppress some of the IPIs/flushes.

So I've given this a try, and failed miserably (including losing an
XFS volume on the test machine).
The problem is the BUG_ON() at the top of
domain_relinquish_resources() - there will, very likely, be bits
remaining set if the code added to put_page_from_l1e() set some of
them pretty recently (even if we avoid setting any once ->is_dying
has been set).

Jan
At 08:07 -0600 on 26 Apr (1493194043), Jan Beulich wrote:
> >>> On 25.04.17 at 12:59, <tim@xen.org> wrote:
> > [...]
> > Could we, instead, bitwise OR domain A's domain_dirty_cpumask into
> > domain B's domain_dirty_cpumask at the same point?
> > [...]
> So I've given this a try, and failed miserably (including losing an
> XFS volume on the test machine). The problem is the BUG_ON() at
> the top of domain_relinquish_resources() - there will, very likely, be
> bits remaining set if the code added to put_page_from_l1e() set
> some pretty recently (irrespective of avoiding to set any once
> ->is_dying has been set).

Yeah. :( Would it be correct to just remove that BUG_ON(), or replace it
with an explicit check that there are no running vcpus?

Or is using domain_dirty_cpumask like this too much of a stretch?
E.g. PV TLB flushes use it, and would maybe be more expensive until
the dom0 CPUs fall out of the mask (which isn't guaranteed to happen).

We could add a new mask just for this case, and clear CPUs from it as
they're flushed. But that sounds like a lot of work...

Maybe worth measuring the impact of the current patch before going too
far with this?

Tim.
>>> On 26.04.17 at 16:25, <tim@xen.org> wrote:
> At 08:07 -0600 on 26 Apr (1493194043), Jan Beulich wrote:
>> >>> On 25.04.17 at 12:59, <tim@xen.org> wrote:
>> > [...]
>> So I've given this a try, and failed miserably (including losing an
>> XFS volume on the test machine). The problem is the BUG_ON() at
>> the top of domain_relinquish_resources() [...]
>
> Yeah. :( Would it be correct to just remove that BUG_ON(), or replace it
> with an explicit check that there are no running vcpus?
>
> Or is using domain_dirty_cpumask like this too much of a stretch?
> E.g. PV TLB flushes use it, and would maybe be more expensive until
> the dom0 CPUs fall out of the mask (which isn't guaranteed to happen).

Right, since effectively some of the bits may never clear (if pg_owner
never runs on some pCPU l1e_owner does run on), I now think this model
could even have introduced more overhead than the immediate flushing.

> We could add a new mask just for this case, and clear CPUs from it as
> they're flushed. But that sounds like a lot of work...

Wouldn't it suffice to set bits in this mask in put_page_from_l1e()
and consume/clear them in __get_page_type()? Right now I can't see it
being necessary for correctness to fiddle with any of the other
flushes using the domain dirty mask.

But then again this may not be much of a win, unless the put
operations come through in meaningful batches, not interleaved by any
type changes (the latter ought to be guaranteed during domain
construction and teardown at least, as the guest itself can't do
anything at that time to effect type changes). Hence I wonder
whether ...

> Maybe worth measuring the impact of the current patch before going too
> far with this?

...
it wouldn't better be the other way around: We use the patch
in its current (or even v1) form, and try to do something about
performance only if we really find a case where it matters.

To be honest, I'm not even sure how I could meaningfully measure the
impact here: Simply counting how many extra flushes there would
end up being wouldn't seem all that useful, and whether there
would be any measurable difference in the overall execution time
of e.g. domain creation I would highly doubt (but if it's that what
you're after, I could certainly collect a few numbers).

Jan
At 03:23 -0600 on 27 Apr (1493263380), Jan Beulich wrote:
> ... it wouldn't better be the other way around: We use the patch
> in its current (or even v1) form, and try to do something about
> performance only if we really find a case where it matters.
> [...] whether there would be any measurable difference in the overall
> execution time of e.g. domain creation I would highly doubt (but if
> it's that what you're after, I could certainly collect a few numbers).

I think that would be a good idea, just as a sanity-check. But apart
from that the patch looks correct to me, so:

Reviewed-by: Tim Deegan <tim@xen.org>

for v2 (not v1).

Cheers,

Tim.
>>> On 27.04.17 at 11:51, <tim@xen.org> wrote:
> At 03:23 -0600 on 27 Apr (1493263380), Jan Beulich wrote:
>> [...] whether there would be any measurable difference in the overall
>> execution time of e.g. domain creation I would highly doubt (but if
>> it's that what you're after, I could certainly collect a few numbers).
>
> I think that would be a good idea, just as a sanity-check.

As it turns out there is a measurable effect: xc_dom_boot_image()
for a 4Gb PV guest takes about 70% longer now. Otoh it is itself
responsible for less than 10% of the overall time libxl__build_dom()
takes, and that in turn is only a pretty small portion of the overall
"xl create".

Jan
At 04:52 -0600 on 28 Apr (1493355160), Jan Beulich wrote:
> >>> On 27.04.17 at 11:51, <tim@xen.org> wrote:
> > [...]
> As it turns out there is a measurable effect: xc_dom_boot_image()
> for a 4Gb PV guest takes about 70% longer now. Otoh it is itself
> responsible for less than 10% of the overall time libxl__build_dom()
> takes, and that in turn is only a pretty small portion of the overall
> "xl create".

Do you think that slowdown is OK? I'm not sure -- I'd be inclined to
avoid it, but could be persuaded, and it's not me doing the work. :)

Andrew, what do you think?

Tim.
>>> On 02.05.17 at 10:32, <tim@xen.org> wrote:
> At 04:52 -0600 on 28 Apr (1493355160), Jan Beulich wrote:
>> As it turns out there is a measurable effect: xc_dom_boot_image()
>> for a 4Gb PV guest takes about 70% longer now. Otoh it is itself
>> responsible for less than 10% of the overall time libxl__build_dom()
>> takes, and that in turn is only a pretty small portion of the overall
>> "xl create".
>
> Do you think that slowdown is OK? I'm not sure -- I'd be inclined to
> avoid it, but could be persuaded, and it's not me doing the work. :)

Well, if there was a way to avoid it in a clean way without too much
code churn, I'd be all for avoiding it. The avenues we've explored so
far either didn't work (using pg_owner's dirty mask) or didn't promise
to actually reduce the flush overhead in a meaningful way (adding a
separate mask to be merged into the mask used for the flush in
__get_page_type()), unless - as has been the case before - I didn't
fully understand your thoughts there.

Jan
At 02:50 -0600 on 02 May (1493693403), Jan Beulich wrote:
> [...]
> Well, if there was a way to avoid it in a clean way without too much
> code churn, I'd be all for avoiding it. The avenues we've explored so
> far either didn't work (using pg_owner's dirty mask) or didn't promise
> to actually reduce the flush overhead in a meaningful way (adding a
> separate mask to be merged into the mask used for the flush in
> __get_page_type()), unless - as has been the case before - I didn't
> fully understand your thoughts there.
Quoting your earlier response:

> Wouldn't it suffice to set bits in this mask in put_page_from_l1e()
> and consume/clear them in __get_page_type()? Right now I can't
> see it being necessary for correctness to fiddle with any of the
> other flushes using the domain dirty mask.
>
> But then again this may not be much of a win, unless the put
> operations come through in meaningful batches, not interleaved
> by any type changes (the latter ought to be guaranteed during
> domain construction and teardown at least, as the guest itself
> can't do anything at that time to effect type changes).

I'm not sure how much batching there needs to be. I agree that the
domain creation case should work well, though. Let me think about the
scenarios when dom B is live:

1. Dom A drops its foreign map of page X; dom B immediately changes the
   type of page X. This case isn't helped at all, but I don't see any
   way to improve it -- dom A's TLBs need to be flushed right away.

2. Dom A drops its foreign map of page X; dom B immediately changes
   the type of page Y. Now dom A's dirty CPUs are in the new map, but B
   may not need to flush them right away. B can filter by page Y's
   timestamp, and flush (and clear) only some of the cpus in the map.

So that seems good, but then there's a risk that cpus never get
cleared from the map, and __get_page_type() ends up doing a lot of
unnecessary work filtering timestamps. When is it safe to remove a CPU
from that map?

- obvs safe if we IPI it to flush the TLB (though we may need memory
  barriers -- need to think about a race with CPU C putting A _into_
  the map at the same time...)
- we could track the timestamp of the most recent addition to the
  map, and drop any CPU whose TLB has been flushed since that,
  but that still lets unrelated unmaps keep CPUs alive in the map...
- we could double-buffer the map: always add CPUs to the active map;
  from time to time, swap maps and flush everything in the non-active
  map (filtered by the TLB timestamp when we last swapped over).

Bah, this is turning into a tar pit. Let's stick to the v2 patch as
being (relatively) simple and correct, and revisit this if it causes
trouble. :)

Thanks,

Tim.
On 02/05/17 10:43, Tim Deegan wrote:
> At 02:50 -0600 on 02 May (1493693403), Jan Beulich wrote:
>> [...]
>>>> As it turns out there is a measurable effect: xc_dom_boot_image()
>>>> for a 4Gb PV guest takes about 70% longer now. Otoh it is itself
>>>> responsible for less than 10% of the overall time libxl__build_dom()
>>>> takes, and that in turn is only a pretty small portion of the overall
>>>> "xl create".
>>> Do you think that slowdown is OK? I'm not sure -- I'd be inclined to
>>> avoid it, but could be persuaded, and it's not me doing the work. :)
>> Well, if there was a way to avoid it in a clean way without too much
>> code churn, I'd be all for avoiding it. [...]
> Quoting your earlier response: > >> Wouldn't it suffice to set bits in this mask in put_page_from_l1e() >> and consume/clear them in __get_page_type()? Right now I can't >> see it being necessary for correctness to fiddle with any of the >> other flushes using the domain dirty mask. >> >> But then again this may not be much of a win, unless the put >> operations come through in meaningful batches, not interleaved >> by any type changes (the latter ought to be guaranteed during >> domain construction and teardown at least, as the guest itself >> can't do anything at that time to effect type changes). > I'm not sure how much batching there needs to be. I agree that the > domain creation case should work well though. Let me think about the > scenarios when dom B is live: > > 1. Dom A drops its foreign map of page X; dom B immediately changes the > type of page X. This case isn't helped at all, but I don't see any > way to improve it -- dom A's TLBs need to be flushed right away. > > 2. Dom A drops its foreign map of page X; dom B immediately changes > the type of page Y. Now dom A's dirty CPUs are in the new map, but B > may not need to flush them right away. B can filter by page Y's > timestamp, and flush (and clear) only some of the cpus in the map. > > So that seems good, but then there's a risk that cpus never get > cleared from the map, and __get_page_type() ends up doing a lot of > unnecessary work filtering timestaps. When is it safe to remove a CPU > from that map? > - obvs safe if we IPI it to flush the TLB (though may need memory > barriers -- need to think about a race with CPU C putting A _into_ > the map at the same time...) > - we could track the timestamp of the most recent addition to the > map, and drop any CPU whose TLB has been flushed since that, > but that still lets unrelated unmaps keep CPUs alive in the map... 
> - we could double-buffer the map: always add CPUs to the active map;
>   from time to time, swap maps and flush everything in the non-active
>   map (filtered by the TLB timestamp when we last swapped over).
>
> Bah, this is turning into a tar pit. Let's stick to the v2 patch as
> being (relatively) simple and correct, and revisit this if it causes
> trouble. :)

:(

A 70% performance hit for guest creation is certainly going to cause
problems, but we obviously need to prioritise correctness in this case.

~Andrew
>>> On 02.05.17 at 19:37, <andrew.cooper3@citrix.com> wrote: > On 02/05/17 10:43, Tim Deegan wrote: >> At 02:50 -0600 on 02 May (1493693403), Jan Beulich wrote: >>>>>> On 02.05.17 at 10:32, <tim@xen.org> wrote: >>>> At 04:52 -0600 on 28 Apr (1493355160), Jan Beulich wrote: >>>>>>>> On 27.04.17 at 11:51, <tim@xen.org> wrote: >>>>>> At 03:23 -0600 on 27 Apr (1493263380), Jan Beulich wrote: >>>>>>> ... it wouldn't better be the other way around: We use the patch >>>>>>> in its current (or even v1) form, and try to do something about >>>>>>> performance only if we really find a case where it matters. To be >>>>>>> honest, I'm not even sure how I could meaningfully measure the >>>>>>> impact here: Simply counting how many extra flushes there would >>>>>>> end up being wouldn't seem all that useful, and whether there >>>>>>> would be any measurable difference in the overall execution time >>>>>>> of e.g. domain creation I would highly doubt (but if it's that what >>>>>>> you're after, I could certainly collect a few numbers). >>>>>> I think that would be a good idea, just as a sanity-check. >>>>> As it turns out there is a measurable effect: xc_dom_boot_image() >>>>> for a 4Gb PV guest takes about 70% longer now. Otoh it is itself >>>>> responsible for less than 10% of the overall time libxl__build_dom() >>>>> takes, and that in turn is only a pretty small portion of the overall >>>>> "xl create". >>>> Do you think that slowdown is OK? I'm not sure -- I'd be inclined to >>>> avoid it, but could be persuaded, and it's not me doing the work. :) >>> Well, if there was a way to avoid it in a clean way without too much >>> code churn, I'd be all for avoiding it. 
The avenues we've explored so >>> far either didn't work (using pg_owner's dirty mask) or didn't promise >>> to actually reduce the flush overhead in a meaningful way (adding a >>> separate mask to be merged into the mask used for the flush in >>> __get_page_type()), unless - as has been the case before - I didn't >>> fully understand your thoughts there. >> Quoting your earlier response: >> >>> Wouldn't it suffice to set bits in this mask in put_page_from_l1e() >>> and consume/clear them in __get_page_type()? Right now I can't >>> see it being necessary for correctness to fiddle with any of the >>> other flushes using the domain dirty mask. >>> >>> But then again this may not be much of a win, unless the put >>> operations come through in meaningful batches, not interleaved >>> by any type changes (the latter ought to be guaranteed during >>> domain construction and teardown at least, as the guest itself >>> can't do anything at that time to effect type changes). >> I'm not sure how much batching there needs to be. I agree that the >> domain creation case should work well though. Let me think about the >> scenarios when dom B is live: >> >> 1. Dom A drops its foreign map of page X; dom B immediately changes the >> type of page X. This case isn't helped at all, but I don't see any >> way to improve it -- dom A's TLBs need to be flushed right away. >> >> 2. Dom A drops its foreign map of page X; dom B immediately changes >> the type of page Y. Now dom A's dirty CPUs are in the new map, but B >> may not need to flush them right away. B can filter by page Y's >> timestamp, and flush (and clear) only some of the cpus in the map. >> >> So that seems good, but then there's a risk that cpus never get >> cleared from the map, and __get_page_type() ends up doing a lot of >> unnecessary work filtering timestaps. When is it safe to remove a CPU >> from that map? 
>> - obvs safe if we IPI it to flush the TLB (though may need memory
>>   barriers -- need to think about a race with CPU C putting A _into_
>>   the map at the same time...)
>> - we could track the timestamp of the most recent addition to the
>>   map, and drop any CPU whose TLB has been flushed since that,
>>   but that still lets unrelated unmaps keep CPUs alive in the map...
>> - we could double-buffer the map: always add CPUs to the active map;
>>   from time to time, swap maps and flush everything in the non-active
>>   map (filtered by the TLB timestamp when we last swapped over).
>>
>> Bah, this is turning into a tar pit. Let's stick to the v2 patch as
>> being (relatively) simple and correct, and revisit this if it causes
>> trouble. :)
>
> :(
>
> A 70% performance hit for guest creation is certainly going to cause
> problems, but we obviously need to prioritise correctness in this case.

Hmm, you did understand that the 70% hit is on a specific sub-part
of the overall process, not guest creation as a whole? Anyway, your
reply is neither an ack nor a nak nor an indication of what needs to
change ...

Jan
On 03/05/17 08:21, Jan Beulich wrote: >>>> On 02.05.17 at 19:37, <andrew.cooper3@citrix.com> wrote: >> On 02/05/17 10:43, Tim Deegan wrote: >>> At 02:50 -0600 on 02 May (1493693403), Jan Beulich wrote: >>>>>>> On 02.05.17 at 10:32, <tim@xen.org> wrote: >>>>> At 04:52 -0600 on 28 Apr (1493355160), Jan Beulich wrote: >>>>>>>>> On 27.04.17 at 11:51, <tim@xen.org> wrote: >>>>>>> At 03:23 -0600 on 27 Apr (1493263380), Jan Beulich wrote: >>>>>>>> ... it wouldn't better be the other way around: We use the patch >>>>>>>> in its current (or even v1) form, and try to do something about >>>>>>>> performance only if we really find a case where it matters. To be >>>>>>>> honest, I'm not even sure how I could meaningfully measure the >>>>>>>> impact here: Simply counting how many extra flushes there would >>>>>>>> end up being wouldn't seem all that useful, and whether there >>>>>>>> would be any measurable difference in the overall execution time >>>>>>>> of e.g. domain creation I would highly doubt (but if it's that what >>>>>>>> you're after, I could certainly collect a few numbers). >>>>>>> I think that would be a good idea, just as a sanity-check. >>>>>> As it turns out there is a measurable effect: xc_dom_boot_image() >>>>>> for a 4Gb PV guest takes about 70% longer now. Otoh it is itself >>>>>> responsible for less than 10% of the overall time libxl__build_dom() >>>>>> takes, and that in turn is only a pretty small portion of the overall >>>>>> "xl create". >>>>> Do you think that slowdown is OK? I'm not sure -- I'd be inclined to >>>>> avoid it, but could be persuaded, and it's not me doing the work. :) >>>> Well, if there was a way to avoid it in a clean way without too much >>>> code churn, I'd be all for avoiding it. 
The avenues we've explored so >>>> far either didn't work (using pg_owner's dirty mask) or didn't promise >>>> to actually reduce the flush overhead in a meaningful way (adding a >>>> separate mask to be merged into the mask used for the flush in >>>> __get_page_type()), unless - as has been the case before - I didn't >>>> fully understand your thoughts there. >>> Quoting your earlier response: >>> >>>> Wouldn't it suffice to set bits in this mask in put_page_from_l1e() >>>> and consume/clear them in __get_page_type()? Right now I can't >>>> see it being necessary for correctness to fiddle with any of the >>>> other flushes using the domain dirty mask. >>>> >>>> But then again this may not be much of a win, unless the put >>>> operations come through in meaningful batches, not interleaved >>>> by any type changes (the latter ought to be guaranteed during >>>> domain construction and teardown at least, as the guest itself >>>> can't do anything at that time to effect type changes). >>> I'm not sure how much batching there needs to be. I agree that the >>> domain creation case should work well though. Let me think about the >>> scenarios when dom B is live: >>> >>> 1. Dom A drops its foreign map of page X; dom B immediately changes the >>> type of page X. This case isn't helped at all, but I don't see any >>> way to improve it -- dom A's TLBs need to be flushed right away. >>> >>> 2. Dom A drops its foreign map of page X; dom B immediately changes >>> the type of page Y. Now dom A's dirty CPUs are in the new map, but B >>> may not need to flush them right away. B can filter by page Y's >>> timestamp, and flush (and clear) only some of the cpus in the map. >>> >>> So that seems good, but then there's a risk that cpus never get >>> cleared from the map, and __get_page_type() ends up doing a lot of >>> unnecessary work filtering timestaps. When is it safe to remove a CPU >>> from that map? 
>>> - obvs safe if we IPI it to flush the TLB (though may need memory
>>>   barriers -- need to think about a race with CPU C putting A _into_
>>>   the map at the same time...)
>>> - we could track the timestamp of the most recent addition to the
>>>   map, and drop any CPU whose TLB has been flushed since that,
>>>   but that still lets unrelated unmaps keep CPUs alive in the map...
>>> - we could double-buffer the map: always add CPUs to the active map;
>>>   from time to time, swap maps and flush everything in the non-active
>>>   map (filtered by the TLB timestamp when we last swapped over).
>>>
>>> Bah, this is turning into a tar pit. Let's stick to the v2 patch as
>>> being (relatively) simple and correct, and revisit this if it causes
>>> trouble. :)
>> :(
>>
>> A 70% performance hit for guest creation is certainly going to cause
>> problems, but we obviously need to prioritise correctness in this case.
> Hmm, you did understand that the 70% hit is on a specific sub-part
> of the overall process, not guest creation as a whole? Anyway, your
> reply is neither an ack nor a nak nor an indication of what needs to
> change ...

Yes - I realise it isn't all of domain creation, but this performance
hit will also hit migration, qemu DMA mappings, etc.

XenServer has started a side-by-side performance work-up of this
change, as presented at the root of this thread. We should hopefully
have some numbers in the next day or two.

~Andrew
On 03/05/17 10:56, Andrew Cooper wrote: > On 03/05/17 08:21, Jan Beulich wrote: > >>>> On 02.05.17 at 19:37, <andrew.cooper3@citrix.com> wrote: > >> On 02/05/17 10:43, Tim Deegan wrote: > >>> At 02:50 -0600 on 02 May (1493693403), Jan Beulich wrote: > >>>>>>> On 02.05.17 at 10:32, <tim@xen.org> wrote: > >>>>> At 04:52 -0600 on 28 Apr (1493355160), Jan Beulich wrote: > >>>>>>>>> On 27.04.17 at 11:51, <tim@xen.org> wrote: > >>>>>>> At 03:23 -0600 on 27 Apr (1493263380), Jan Beulich wrote: > >>>>>>>> ... it wouldn't better be the other way around: We use the > >>>>>>>> patch in its current (or even v1) form, and try to do something > >>>>>>>> about performance only if we really find a case where it > >>>>>>>> matters. To be honest, I'm not even sure how I could > >>>>>>>> meaningfully measure the impact here: Simply counting how > many > >>>>>>>> extra flushes there would end up being wouldn't seem all that > >>>>>>>> useful, and whether there would be any measurable difference in > >>>>>>>> the overall execution time of e.g. domain creation I would > >>>>>>>> highly doubt (but if it's that what you're after, I could certainly > collect a few numbers). > >>>>>>> I think that would be a good idea, just as a sanity-check. > >>>>>> As it turns out there is a measurable effect: xc_dom_boot_image() > >>>>>> for a 4Gb PV guest takes about 70% longer now. Otoh it is itself > >>>>>> responsible for less than 10% of the overall time > >>>>>> libxl__build_dom() takes, and that in turn is only a pretty small > >>>>>> portion of the overall "xl create". > >>>>> Do you think that slowdown is OK? I'm not sure -- I'd be inclined > >>>>> to avoid it, but could be persuaded, and it's not me doing the > >>>>> work. :) > >>>> Well, if there was a way to avoid it in a clean way without too > >>>> much code churn, I'd be all for avoiding it. 
The avenues we've > >>>> explored so far either didn't work (using pg_owner's dirty mask) or > >>>> didn't promise to actually reduce the flush overhead in a > >>>> meaningful way (adding a separate mask to be merged into the mask > >>>> used for the flush in __get_page_type()), unless - as has been the > >>>> case before - I didn't fully understand your thoughts there. > >>> Quoting your earlier response: > >>> > >>>> Wouldn't it suffice to set bits in this mask in put_page_from_l1e() > >>>> and consume/clear them in __get_page_type()? Right now I can't see > >>>> it being necessary for correctness to fiddle with any of the other > >>>> flushes using the domain dirty mask. > >>>> > >>>> But then again this may not be much of a win, unless the put > >>>> operations come through in meaningful batches, not interleaved by > >>>> any type changes (the latter ought to be guaranteed during domain > >>>> construction and teardown at least, as the guest itself can't do > >>>> anything at that time to effect type changes). > >>> I'm not sure how much batching there needs to be. I agree that the > >>> domain creation case should work well though. Let me think about > >>> the scenarios when dom B is live: > >>> > >>> 1. Dom A drops its foreign map of page X; dom B immediately changes > >>> the type of page X. This case isn't helped at all, but I don't see > >>> any way to improve it -- dom A's TLBs need to be flushed right away. > >>> > >>> 2. Dom A drops its foreign map of page X; dom B immediately changes > >>> the type of page Y. Now dom A's dirty CPUs are in the new map, but > >>> B may not need to flush them right away. B can filter by page Y's > >>> timestamp, and flush (and clear) only some of the cpus in the map. > >>> > >>> So that seems good, but then there's a risk that cpus never get > >>> cleared from the map, and __get_page_type() ends up doing a lot of > >>> unnecessary work filtering timestaps. When is it safe to remove a > >>> CPU from that map? 
> >>> - obvs safe if we IPI it to flush the TLB (though may need memory > >>> barriers -- need to think about a race with CPU C putting A _into_ > >>> the map at the same time...) > >>> - we could track the timestamp of the most recent addition to the > >>> map, and drop any CPU whose TLB has been flushed since that, > >>> but that still lets unrelated unmaps keep CPUs alive in the map... > >>> - we could double-buffer the map: always add CPUs to the active map; > >>> from time to time, swap maps and flush everything in the non-active > >>> map (filtered by the TLB timestamp when we last swapped over). > >>> > >>> Bah, this is turning into a tar pit. Let's stick to the v2 patch as > >>> being (relatively) simple and correct, and revisit this if it causes > >>> trouble. :) > >> :( > >> > >> A 70% performance hit for guest creation is certainly going to cause > >> problems, but we obviously need to prioritise correctness in this case. > > Hmm, you did understand that the 70% hit is on a specific sub-part of > > the overall process, not guest creation as a whole? Anyway, your reply > > is neither an ack nor a nak nor an indication of what needs to change > > ... > > Yes - I realise it isn't all of domain creation, but this performance hit will also > hit migration, qemu DMA mappings, etc. > > XenServer has started a side-by-side performance work-up of this change, as > presented at the root of this thread. We should hopefully have some > number in the next day or two. > I did some measurements on two builds of a recent version of XenServer using Xen upstream 4.9.0-3.0. The only difference between the builds was the patch x86-put-l1e-foreign-flush.patch in https://lists.xenproject.org/archives/html/xen-devel/2017-04/msg02945.html. 
I observed no measurable difference between these builds with a guest
RAM value of 4G, 8G and 14G for the following operations:
- time xe vm-start
- time xe vm-shutdown
- vm downtime during "xe vm-migration" (as measured by pinging the vm
  during migration and verifying for how long pings would fail when
  both domains are paused)
- time xe vm-migrate # for HVM guests (eg. win7 and win10)

But I observed a difference for the duration of "time xe vm-migrate"
for PV guests (eg. centos68, debian70, ubuntu1204). For centos68, for
instance, I obtained the following values on a machine with an Intel
E3-1281v3 3.7GHz CPU, averaged over 10 runs for each data point:

| Guest RAM | no patch | with patch | difference | diff/RAM |
|      14GB |   10.44s |     13.46s |      3.02s | 0.22s/GB |
|       8GB |    6.46s |      8.28s |      1.82s | 0.23s/GB |
|       4GB |    3.85s |      4.74s |      0.89s | 0.22s/GB |

From these numbers, if the patch is present, it looks like VM migration
of a PV guest would take an extra 1s for each extra 5GB of guest RAM.
The VMs are mostly idle during migration. At this point, it's not clear
to me why this difference is only visible on VM migration (as opposed
to VM start for example), and only on a PV guest (as opposed to an HVM).

Marcus
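As a quick sanity check, the diff/RAM column in the measurements above is consistent across all three rows, at roughly 0.22s per GB (i.e. about 1s per 5GB). An illustrative helper, not part of any measurement harness:

```c
#include <assert.h>

/* Extra migration time per GB of guest RAM implied by one table row. */
double per_gb_overhead(double with_patch_s, double no_patch_s,
                       double ram_gb)
{
    return (with_patch_s - no_patch_s) / ram_gb;
}
```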
On 05/05/17 19:16, Marcus Granado wrote: > On 03/05/17 10:56, Andrew Cooper wrote: >> On 03/05/17 08:21, Jan Beulich wrote: >>>>>> On 02.05.17 at 19:37, <andrew.cooper3@citrix.com> wrote: >>>> On 02/05/17 10:43, Tim Deegan wrote: >>>>> At 02:50 -0600 on 02 May (1493693403), Jan Beulich wrote: >>>>>>>>> On 02.05.17 at 10:32, <tim@xen.org> wrote: >>>>>>> At 04:52 -0600 on 28 Apr (1493355160), Jan Beulich wrote: >>>>>>>>>>> On 27.04.17 at 11:51, <tim@xen.org> wrote: >>>>>>>>> At 03:23 -0600 on 27 Apr (1493263380), Jan Beulich wrote: >>>>>>>>>> ... it wouldn't better be the other way around: We use the >>>>>>>>>> patch in its current (or even v1) form, and try to do something >>>>>>>>>> about performance only if we really find a case where it >>>>>>>>>> matters. To be honest, I'm not even sure how I could >>>>>>>>>> meaningfully measure the impact here: Simply counting how >> many >>>>>>>>>> extra flushes there would end up being wouldn't seem all that >>>>>>>>>> useful, and whether there would be any measurable difference in >>>>>>>>>> the overall execution time of e.g. domain creation I would >>>>>>>>>> highly doubt (but if it's that what you're after, I could certainly >> collect a few numbers). >>>>>>>>> I think that would be a good idea, just as a sanity-check. >>>>>>>> As it turns out there is a measurable effect: xc_dom_boot_image() >>>>>>>> for a 4Gb PV guest takes about 70% longer now. Otoh it is itself >>>>>>>> responsible for less than 10% of the overall time >>>>>>>> libxl__build_dom() takes, and that in turn is only a pretty small >>>>>>>> portion of the overall "xl create". >>>>>>> Do you think that slowdown is OK? I'm not sure -- I'd be inclined >>>>>>> to avoid it, but could be persuaded, and it's not me doing the >>>>>>> work. :) >>>>>> Well, if there was a way to avoid it in a clean way without too >>>>>> much code churn, I'd be all for avoiding it. 
The avenues we've >>>>>> explored so far either didn't work (using pg_owner's dirty mask) or >>>>>> didn't promise to actually reduce the flush overhead in a >>>>>> meaningful way (adding a separate mask to be merged into the mask >>>>>> used for the flush in __get_page_type()), unless - as has been the >>>>>> case before - I didn't fully understand your thoughts there. >>>>> Quoting your earlier response: >>>>> >>>>>> Wouldn't it suffice to set bits in this mask in put_page_from_l1e() >>>>>> and consume/clear them in __get_page_type()? Right now I can't see >>>>>> it being necessary for correctness to fiddle with any of the other >>>>>> flushes using the domain dirty mask. >>>>>> >>>>>> But then again this may not be much of a win, unless the put >>>>>> operations come through in meaningful batches, not interleaved by >>>>>> any type changes (the latter ought to be guaranteed during domain >>>>>> construction and teardown at least, as the guest itself can't do >>>>>> anything at that time to effect type changes). >>>>> I'm not sure how much batching there needs to be. I agree that the >>>>> domain creation case should work well though. Let me think about >>>>> the scenarios when dom B is live: >>>>> >>>>> 1. Dom A drops its foreign map of page X; dom B immediately changes >>>>> the type of page X. This case isn't helped at all, but I don't see >>>>> any way to improve it -- dom A's TLBs need to be flushed right away. >>>>> >>>>> 2. Dom A drops its foreign map of page X; dom B immediately changes >>>>> the type of page Y. Now dom A's dirty CPUs are in the new map, but >>>>> B may not need to flush them right away. B can filter by page Y's >>>>> timestamp, and flush (and clear) only some of the cpus in the map. >>>>> >>>>> So that seems good, but then there's a risk that cpus never get >>>>> cleared from the map, and __get_page_type() ends up doing a lot of >>>>> unnecessary work filtering timestaps. When is it safe to remove a >>>>> CPU from that map? 
>>>>> - obvs safe if we IPI it to flush the TLB (though may need memory >>>>> barriers -- need to think about a race with CPU C putting A _into_ >>>>> the map at the same time...) >>>>> - we could track the timestamp of the most recent addition to the >>>>> map, and drop any CPU whose TLB has been flushed since that, >>>>> but that still lets unrelated unmaps keep CPUs alive in the map... >>>>> - we could double-buffer the map: always add CPUs to the active map; >>>>> from time to time, swap maps and flush everything in the non-active >>>>> map (filtered by the TLB timestamp when we last swapped over). >>>>> >>>>> Bah, this is turning into a tar pit. Let's stick to the v2 patch as >>>>> being (relatively) simple and correct, and revisit this if it causes >>>>> trouble. :) >>>> :( >>>> >>>> A 70% performance hit for guest creation is certainly going to cause >>>> problems, but we obviously need to prioritise correctness in this case. >>> Hmm, you did understand that the 70% hit is on a specific sub-part of >>> the overall process, not guest creation as a whole? Anyway, your reply >>> is neither an ack nor a nak nor an indication of what needs to change >>> ... >> Yes - I realise it isn't all of domain creation, but this performance hit will also >> hit migration, qemu DMA mappings, etc. >> >> XenServer has started a side-by-side performance work-up of this change, as >> presented at the root of this thread. We should hopefully have some >> number in the next day or two. >> > I did some measurements on two builds of a recent version of XenServer using Xen upstream 4.9.0-3.0. The upstream base of this XenServer build was c/s ba39e9b, one change past 4.9.0-rc3 > The only difference between the builds was the patch x86-put-l1e-foreign-flush.patch in https://lists.xenproject.org/archives/html/xen-devel/2017-04/msg02945.html. 
>
> I observed no measurable difference between these builds with a guest
> RAM value of 4G, 8G and 14G for the following operations:
> - time xe vm-start
> - time xe vm-shutdown
> - vm downtime during "xe vm-migration" (as measured by pinging the vm
>   during migration and verifying for how long pings would fail when
>   both domains are paused)
> - time xe vm-migrate # for HVM guests (eg. win7 and win10)
>
> But I observed a difference for the duration of "time xe vm-migrate"
> for PV guests (eg. centos68, debian70, ubuntu1204). For centos68, for
> instance, I obtained the following values on a machine with a Intel
> E3-1281v3 3.7Ghz CPU, averaged over 10 runs for each data point:
>
> | Guest RAM | no patch | with patch | difference | diff/RAM |
> |      14GB |   10.44s |     13.46s |      3.02s | 0.22s/GB |
> |       8GB |    6.46s |      8.28s |      1.82s | 0.23s/GB |
> |       4GB |    3.85s |      4.74s |      0.89s | 0.22s/GB |
>
> From these numbers, if the patch is present, it looks like VM
> migration of a PV guest would take an extra 1s for each extra 5GB of
> guest RAM. The VMs are mostly idle during migration. At this point,
> it's not clear to me why this difference is only visible on VM
> migration (as opposed to VM start for example), and only on a PV
> guest (as opposed to an HVM).

The difference between start and migrate can be explained.

During domain creation, we only have to foreign map the areas of the
guest we need to write into (guest kernel/initrd, or hvmloader/acpi
tables), which is independent of the quantity of RAM the guest has.

During migration, we must foreign map all guest RAM. Furthermore, we
unmap and potentially remap again later if the RAM gets dirtied.

(A 64bit toolstack process could potentially foreign map the entire
guest and reuse the mappings. A 32bit toolstack process very definitely
can't, so the migration logic uses the simpler approach of not reusing
mappings at all.)

I am at a loss to explain why the overhead is only observed when
migrating PV guests.
I would expect migrating HVM guests to have identical properties in
this regard.

~Andrew
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1266,6 +1266,18 @@ void put_page_from_l1e(l1_pgentry_t l1e,
     if ( (l1e_get_flags(l1e) & _PAGE_RW) &&
          ((l1e_owner == pg_owner) || !paging_mode_external(pg_owner)) )
     {
+        /*
+         * Don't leave stale writable TLB entries in the unmapping domain's
+         * page tables, to prevent them allowing access to pages required to
+         * be read-only (e.g. after pg_owner changed them to page table or
+         * segment descriptor pages).
+         */
+        if ( unlikely(l1e_owner != pg_owner) )
+        {
+            perfc_incr(need_flush_tlb_flush);
+            flush_tlb_mask(l1e_owner->domain_dirty_cpumask);
+        }
+
         put_page_and_type(page);
     }
     else
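For reference, the deferred-mask alternative discussed up-thread (set bits in put_page_from_l1e(), then consume and timestamp-filter them in __get_page_type()) could be modelled roughly like this. This is a toy, self-contained simulation: the function names, the single global pending mask, and the monotonic-counter timestamp scheme are illustrative simplifications, not Xen's actual tlbflush-clock machinery.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8
typedef uint32_t cpumask_t;

uint32_t tlbflush_clock = 1;           /* global virtual flush time */
uint32_t cpu_last_flush[NR_CPUS];      /* when each CPU last flushed */
cpumask_t pending_flush;               /* deferred by put_page_deferred() */

/* A CPU needs flushing if it hasn't flushed since the page's stamp. */
int needs_flush(int cpu, uint32_t page_stamp)
{
    return cpu_last_flush[cpu] <= page_stamp;
}

/* Stand-in for put_page_from_l1e(): record the unmapper's dirty CPUs,
 * but send no IPI here. */
void put_page_deferred(cpumask_t l1e_owner_dirty)
{
    pending_flush |= l1e_owner_dirty;
}

/* Stand-in for the flush in __get_page_type(): flush the union of the
 * page owner's dirty mask and the pending mask, filtered by the page's
 * TLB timestamp.  Returns the number of CPUs actually flushed. */
int get_page_type_flush(cpumask_t owner_dirty, uint32_t page_stamp)
{
    cpumask_t todo = owner_dirty | pending_flush;
    int cpu, flushed = 0;

    for ( cpu = 0; cpu < NR_CPUS; cpu++ )
        if ( ((todo >> cpu) & 1) && needs_flush(cpu, page_stamp) )
        {
            cpu_last_flush[cpu] = ++tlbflush_clock;
            flushed++;
        }
    pending_flush = 0;                 /* mask consumed */
    return flushed;
}
```

The second type change in the usage below shows the hoped-for win: CPUs already flushed since the page's timestamp are filtered out, so batched unmaps cost at most one flush per CPU rather than one per unmap.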