Message ID | 74048f89-fee7-06c2-ffd5-6e5a14bdf440@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v4] gnttab: defer allocation of status frame tracking array | expand |
Hi Jan, On 04/05/2021 09:42, Jan Beulich wrote: > This array can be large when many grant frames are permitted; avoid > allocating it when it's not going to be used anyway, by doing this only > in gnttab_populate_status_frames(). While the delaying of the respective > memory allocation adds possible reasons for failure of the respective > enclosing operations, there are other memory allocations there already, > so callers can't expect these operations to always succeed anyway. > > As to the re-ordering at the end of gnttab_unpopulate_status_frames(), > this is merely to represent intended order of actions (shrink array > bound, then free higher array entries). If there were racing accesses, > suitable barriers would need adding in addition. Please drop the last sentence, this is at best misleading because you can't just add barriers to make it race free (see the discussion on v2 for more details). With the sentence dropped: Reviewed-by: Julien Grall <jgrall@amazon.com> Cheers, [1] <f82ddfe7-853d-ca15-2373-a38068f65ef7@xen.org>
On 04/05/2021 09:42, Jan Beulich wrote: > This array can be large when many grant frames are permitted; avoid > allocating it when it's not going to be used anyway, by doing this only > in gnttab_populate_status_frames(). While the delaying of the respective > memory allocation adds possible reasons for failure of the respective > enclosing operations, there are other memory allocations there already, > so callers can't expect these operations to always succeed anyway. > > As to the re-ordering at the end of gnttab_unpopulate_status_frames(), > this is merely to represent intended order of actions (shrink array > bound, then free higher array entries). If there were racing accesses, > suitable barriers would need adding in addition. > > Signed-off-by: Jan Beulich <jbeulich@suse.com> Nack. The argument you make says that the grant status frames is "large enough" to care about not allocating. (I frankly disagree, but that isn't relevant to my to my nack). The change in logic here moves a failure in DOMCTL_createdomain, to GNTTABOP_setversion. We know, from the last minute screwups in XSA-226, that there are versions of Windows and Linux in the field, used by customers, which will BUG()/BSOD when GNTTABOP_setversion doesn't succeed. You're literally adding even more complexity to the grant table, to also increase the chance of regressing VMs in the wild. This is not ok. The only form of this patch which is in any way acceptable, is to avoid the allocation when you know *in DOMCTL_createdomain* that it will never be needed by the VM. So far, that is from Kconfig and/or the command line options. ~Andrew
On 05.05.2021 12:49, Andrew Cooper wrote: > On 04/05/2021 09:42, Jan Beulich wrote: >> This array can be large when many grant frames are permitted; avoid >> allocating it when it's not going to be used anyway, by doing this only >> in gnttab_populate_status_frames(). While the delaying of the respective >> memory allocation adds possible reasons for failure of the respective >> enclosing operations, there are other memory allocations there already, >> so callers can't expect these operations to always succeed anyway. >> >> As to the re-ordering at the end of gnttab_unpopulate_status_frames(), >> this is merely to represent intended order of actions (shrink array >> bound, then free higher array entries). If there were racing accesses, >> suitable barriers would need adding in addition. >> >> Signed-off-by: Jan Beulich <jbeulich@suse.com> > > Nack. > > The argument you make says that the grant status frames is "large > enough" to care about not allocating. (I frankly disagree, but that > isn't relevant to my to my nack). > > The change in logic here moves a failure in DOMCTL_createdomain, to > GNTTABOP_setversion. We know, from the last minute screwups in XSA-226, > that there are versions of Windows and Linux in the field, used by > customers, which will BUG()/BSOD when GNTTABOP_setversion doesn't succeed. > > You're literally adding even more complexity to the grant table, to also > increase the chance of regressing VMs in the wild. This is not ok. To me, arguing like this is not okay. The code should have been written like this in the first place. There's absolutely no reason to allocate memory up front which is never going to be used. > The only form of this patch which is in any way acceptable, is to avoid > the allocation when you know *in DOMCTL_createdomain* that it will never > be needed by the VM. So far, that is from Kconfig and/or the command > line options. Well, may I remind you that this is how this patch had started? And may I further remind you that it was Julien who asked me to convert to this model? And may I then remind you that I already did suggest that the two of you should come to an agreement? I'm not going to act as a moderator in this process. But I insist that it is really bad practice to drop the ball and leave patches (and alike) hanging in the air. Just to be clear - in case I don't observe the two of you agreeing on whichever outcome in the foreseeable future, I'm going to make attempts to get another opinion from yet another REST maintainer. Since I can live with both forms of the change (but would prefer the more aggressive savings as done in this version), I can also live with whatever 4th opinion would surface. But in case the result was not in your favor, I'd then consider your Nack overruled. Jan
Hi Andrew, On 05/05/2021 11:49, Andrew Cooper wrote: > On 04/05/2021 09:42, Jan Beulich wrote: >> This array can be large when many grant frames are permitted; avoid >> allocating it when it's not going to be used anyway, by doing this only >> in gnttab_populate_status_frames(). While the delaying of the respective >> memory allocation adds possible reasons for failure of the respective >> enclosing operations, there are other memory allocations there already, >> so callers can't expect these operations to always succeed anyway. >> >> As to the re-ordering at the end of gnttab_unpopulate_status_frames(), >> this is merely to represent intended order of actions (shrink array >> bound, then free higher array entries). If there were racing accesses, >> suitable barriers would need adding in addition. >> >> Signed-off-by: Jan Beulich <jbeulich@suse.com> > > Nack. > > The argument you make says that the grant status frames is "large > enough" to care about not allocating. (I frankly disagree, but that > isn't relevant to my to my nack). > > The change in logic here moves a failure in DOMCTL_createdomain, to > GNTTABOP_setversion. We know, from the last minute screwups in XSA-226, > that there are versions of Windows and Linux in the field, used by > customers, which will BUG()/BSOD when GNTTABOP_setversion doesn't succeed. Unfortunately, my reply to this on the v2 was left unanswered by you. So I will rewrite it here with more details in the hope this can lead to a consensus now. From a discussion during January's community call, the behavior was related to an old version version of Windows PV drivers. Newer version will not use Grant Table v2. However, your comment seems to suggest that GNTTABOP_setversion will never fail today. AFAICT, this is not true today because Xen will return -ENOMEM when trying to populate the status frame (see the call to gnttab_populate_status_frame()). Therefore... > > You're literally adding even more complexity to the grant table, to also > increase the chance of regressing VMs in the wild. This is not ok. ... I am not sure why you are saying this is a regression. > The only form of this patch which is in any way acceptable, is to avoid > the allocation when you know *in DOMCTL_createdomain* that it will never > be needed by the VM. So far, that is from Kconfig and/or the command > line options. I can see how allocating memory upfront is easier for accounting purpose. However, it also means we may not be able to reduce the footprint if we don't know what the guest OS will use it. This can happen for instance, if you let your customers to run random OSes and never restrict them to v1. I believe that in most of the cases v1 will be used by the guest. But we technically can't restrict it after the fact (e.g. on reboot) as this may regress customers workload. That's where the current approach is quite appealing because the allocation is done a runtime. So it does cater a lot more uses cases than your suggestion. Cheers,
On 05.05.2021 12:49, Andrew Cooper wrote: > On 04/05/2021 09:42, Jan Beulich wrote: >> This array can be large when many grant frames are permitted; avoid >> allocating it when it's not going to be used anyway, by doing this only >> in gnttab_populate_status_frames(). While the delaying of the respective >> memory allocation adds possible reasons for failure of the respective >> enclosing operations, there are other memory allocations there already, >> so callers can't expect these operations to always succeed anyway. >> >> As to the re-ordering at the end of gnttab_unpopulate_status_frames(), >> this is merely to represent intended order of actions (shrink array >> bound, then free higher array entries). If there were racing accesses, >> suitable barriers would need adding in addition. >> >> Signed-off-by: Jan Beulich <jbeulich@suse.com> > > Nack. > > The argument you make says that the grant status frames is "large > enough" to care about not allocating. (I frankly disagree, but that > isn't relevant to my to my nack). > > The change in logic here moves a failure in DOMCTL_createdomain, to > GNTTABOP_setversion. We know, from the last minute screwups in XSA-226, > that there are versions of Windows and Linux in the field, used by > customers, which will BUG()/BSOD when GNTTABOP_setversion doesn't succeed. So after you did re-state the Linux aspect of this on the call yesterday I went and looked. In the first phase of Linux supporting v2 at all (3.3 - 3.16) there indeed was a panic() upon certain kinds of failures. Only up to 3.13 was there an actual attempt to use v2. Also, while there was some code there to support actual v2 features (sub-page or transitive grants), all of this was dead. Hence their claim /* * If we've already used version 2 features, * but then suddenly discover that they're not * available (e.g. migrating to an older * version of Xen), almost unbounded badness * can happen. */ was bogus at best. If there was no way at all for set_version to fail prior to the change, here I could probably accept your position. But as said before by Julien - there are pre-existing memory allocations (of typically larger size) there, and hence any guest assuming the call can't fail is flawed already anyway. And no, I don't view this as a reason for us to try to eliminate all memory allocations from that hypercall (which would in particular mean populating status frames irrespective of whether a guest would ever switch to v2). Guest flaws would better be addressed in the guests. Upon re-introduction in 4.15 no such fatal behavior was put back in place. I notice though that even up-to-date Linux has problematic behavior around GNTTABOP_setup_table - the call is followed (after an explicit -ENOSYS check) by "BUG_ON(rc || setup.status)". The amount of memory needed here (including the status table) is potentially far higher than for set_version. And what's important: setup_table gets invoked only after set_version, so any table expansion would happen here, with - if any table growing is needed at all - a far higher risk for failure. This ordering of operations (set_version before setup_table) as well as the "error checking" after setup_table was also in effect up to 3.13. Therefore the same conclusion can be drawn there: The main risk for crashing the guest due to memory shortage in Xen is there, not with the version switch. What you've observed with XSA-226 (or rather, I assume, with a custom extension of yours at the time, to prohibit use of v2, which was upstreamed only later) was entirely unrelated to memory shortage, but was instead a result of v2 suddenly becoming unavailable altogether. As to Windows, in the pvdrivers/win/ git repos I haven't been able to find any use of GrantTableSetVersion(). Of course I can't exclude other versions or other driver variants making use of set_version in an inappropriate way. But as long as they don't grow the number of grant table entries before such a call, the main risk of memory allocation failure would again be with other hypercalls. Jan
--- a/xen/common/grant_table.c +++ b/xen/common/grant_table.c @@ -1747,6 +1747,17 @@ gnttab_populate_status_frames(struct dom /* Make sure, prior version checks are architectural visible */ block_speculation(); + if ( gt->status == ZERO_BLOCK_PTR ) + { + gt->status = xzalloc_array(grant_status_t *, + grant_to_status_frames(gt->max_grant_frames)); + if ( !gt->status ) + { + gt->status = ZERO_BLOCK_PTR; + return -ENOMEM; + } + } + for ( i = nr_status_frames(gt); i < req_status_frames; i++ ) { if ( (gt->status[i] = alloc_xenheap_page()) == NULL ) @@ -1767,18 +1778,25 @@ status_alloc_failed: free_xenheap_page(gt->status[i]); gt->status[i] = NULL; } + + if ( !nr_status_frames(gt) ) + { + xfree(gt->status); + gt->status = ZERO_BLOCK_PTR; + } + return -ENOMEM; } static int gnttab_unpopulate_status_frames(struct domain *d, struct grant_table *gt) { - unsigned int i; + unsigned int i, n = nr_status_frames(gt); /* Make sure, prior version checks are architectural visible */ block_speculation(); - for ( i = 0; i < nr_status_frames(gt); i++ ) + for ( i = 0; i < n; i++ ) { struct page_info *pg = virt_to_page(gt->status[i]); gfn_t gfn = gnttab_get_frame_gfn(gt, true, i); @@ -1833,12 +1849,11 @@ gnttab_unpopulate_status_frames(struct d page_set_owner(pg, NULL); } - for ( i = 0; i < nr_status_frames(gt); i++ ) - { - free_xenheap_page(gt->status[i]); - gt->status[i] = NULL; - } gt->nr_status_frames = 0; + for ( i = 0; i < n; i++ ) + free_xenheap_page(gt->status[i]); + xfree(gt->status); + gt->status = ZERO_BLOCK_PTR; return 0; } @@ -1969,11 +1984,11 @@ int grant_table_init(struct domain *d, i if ( gt->shared_raw == NULL ) goto out; - /* Status pages for grant table - for version 2 */ - gt->status = xzalloc_array(grant_status_t *, - grant_to_status_frames(gt->max_grant_frames)); - if ( gt->status == NULL ) - goto out; + /* + * Status page tracking array for v2 gets allocated on demand. But don't + * leave a NULL pointer there. + */ + gt->status = ZERO_BLOCK_PTR; grant_write_lock(gt); @@ -4047,11 +4062,13 @@ int gnttab_acquire_resource( if ( gt->gt_version != 2 ) break; + /* This may change gt->status, so has to happen before setting vaddrs. */ + rc = gnttab_get_status_frame_mfn(d, final_frame, &tmp); + /* Check that void ** is a suitable representation for gt->status. */ BUILD_BUG_ON(!__builtin_types_compatible_p( typeof(gt->status), grant_status_t **)); vaddrs = (void **)gt->status; - rc = gnttab_get_status_frame_mfn(d, final_frame, &tmp); break; }
This array can be large when many grant frames are permitted; avoid allocating it when it's not going to be used anyway, by doing this only in gnttab_populate_status_frames(). While the delaying of the respective memory allocation adds possible reasons for failure of the respective enclosing operations, there are other memory allocations there already, so callers can't expect these operations to always succeed anyway. As to the re-ordering at the end of gnttab_unpopulate_status_frames(), this is merely to represent intended order of actions (shrink array bound, then free higher array entries). If there were racing accesses, suitable barriers would need adding in addition. Signed-off-by: Jan Beulich <jbeulich@suse.com> --- v4: Add a comment. Add a few blank lines. Extend description. v3: Drop smp_wmb(). Re-base. v2: Defer allocation to when a domain actually switches to the v2 grant API.