From patchwork Wed Feb 8 23:43:23 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Laszlo Ersek X-Patchwork-Id: 9563797 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id BA1F760146 for ; Wed, 8 Feb 2017 23:43:55 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 52F5028535 for ; Wed, 8 Feb 2017 23:43:55 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 46F8D28539; Wed, 8 Feb 2017 23:43:55 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from lists.gnu.org (lists.gnu.org [208.118.235.17]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 0C0D928535 for ; Wed, 8 Feb 2017 23:43:53 +0000 (UTC) Received: from localhost ([::1]:33795 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1cbbtv-0000PS-2g for patchwork-qemu-devel@patchwork.kernel.org; Wed, 08 Feb 2017 18:43:51 -0500 Received: from eggs.gnu.org ([2001:4830:134:3::10]:49581) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1cbbtc-0000PI-56 for qemu-devel@nongnu.org; Wed, 08 Feb 2017 18:43:34 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1cbbtX-0007yd-Tg for qemu-devel@nongnu.org; Wed, 08 Feb 2017 18:43:32 -0500 Received: from mx1.redhat.com ([209.132.183.28]:38540) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1cbbtX-0007yQ-KL for qemu-devel@nongnu.org; Wed, 08 Feb 2017 18:43:27 -0500 Received: from smtp.corp.redhat.com (int-mx16.intmail.prod.int.phx2.redhat.com [10.5.11.28]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 8740FC05AA41; Wed, 8 Feb 2017 23:43:27 +0000 (UTC) Received: from lacos-laptop-7.usersys.redhat.com (ovpn-116-31.phx2.redhat.com [10.3.116.31]) by smtp.corp.redhat.com (Postfix) with ESMTP id 1236D660C0; Wed, 8 Feb 2017 23:43:25 +0000 (UTC) To: Ben Warren References: <28ff6ff023dd041fc7557955dea916adce72efea.1486285434.git.ben@skyportsystems.com> <14f224ed-08e2-cbad-9d1d-8f559cd399a6@redhat.com> From: Laszlo Ersek Message-ID: Date: Thu, 9 Feb 2017 00:43:23 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.7.0 MIME-Version: 1.0 In-Reply-To: X-Scanned-By: MIMEDefang 2.74 on 10.5.11.28 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.32]); Wed, 08 Feb 2017 23:43:27 +0000 (UTC) X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] [fuzzy] X-Received-From: 209.132.183.28 Subject: Re: [Qemu-devel] [PATCH v5 05/10] ACPI: Add Virtual Machine Generation ID support X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Igor Mammedov , qemu-devel@nongnu.org, "Michael S. Tsirkin" Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" X-Virus-Scanned: ClamAV using ClamSMTP On 02/08/17 23:34, Ben Warren wrote: > >> On Feb 7, 2017, at 4:48 PM, Laszlo Ersek > > wrote: >> >> On 02/05/17 10:12, ben@skyportsystems.com >> wrote: >>> From: Ben Warren > >>> >>> This implements the VM Generation ID feature by passing a 128-bit >>> GUID to the guest via a fw_cfg blob. >>> Any time the GUID changes, an ACPI notify event is sent to the guest >>> >>> The user interface is a simple device with one parameter: >>> - guid (string, must be "auto" or in UUID format >>> xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx) >>> >>> Signed-off-by: Ben Warren >> > >>> --- >>> default-configs/i386-softmmu.mak | 1 + >>> default-configs/x86_64-softmmu.mak | 1 + >>> hw/acpi/Makefile.objs | 1 + >>> hw/acpi/vmgenid.c | 206 >>> +++++++++++++++++++++++++++++++++++ >>> hw/i386/acpi-build.c | 10 ++ >>> include/hw/acpi/acpi_dev_interface.h | 1 + >>> include/hw/acpi/vmgenid.h | 37 +++++++ >>> 7 files changed, 257 insertions(+) >>> create mode 100644 hw/acpi/vmgenid.c >>> create mode 100644 include/hw/acpi/vmgenid.h >> >> [snip code] >> >> So, I'm late to the game. I can't possibly comment on all the concerns >> that have been raised scattered around the thread, exactly *where* they >> have been raised. However, I will try to collect the concerns here. >> >> >> (1) Byte swapping for the UUID. >> >> The idea of QemuUUID is that wherever it is stored in a non-ephemeral >> fashion, it is in BE representation. The guest needs it in LE, so we >> should convert it temporarily -- using a local variable -- just before >> writing it to guest memory, and then forget about the LE representation >> promptly. >> >> As far as I understand, we all agree on this (Michael, Ben, myself -- I >> think Igor hasn't commented on this). >> >> > Sure, will rework. >> (2) Structure of the vmgenid fw_cfg blob, AKA "why that darn offset 40 >> decimal". >> >> I explained it under points (6) and (7) in the following message: >> >> Message-Id: > > >> URL: https://www.mail-archive.com/qemu-devel@nongnu.org/msg425226.html >> >> The story is that *wherever* an ADD_POINTER command points to -- that >> is, at the *exact* target address --, OVMF will look for an ACPI table >> header, somewhat heuristically. If it finds a byte pattern that (a) fits >> into the remaining blob and (b) passes some superficial ACPI table >> header checks, such as length and checksum, then OVMF assumes that the >> blob contains an ACPI table there, and passes the address (again, the >> exact, relocated, absolute target address of ADD_POINTER) to >> EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable(). >> >> We want to disable this heuristic for the vmgenid blob. *If* the blob >> contained only 16 bytes (for the GUID), then the heuristic would >> automatically disable itself, because the ACPI table header (36 bytes) >> is larger than 16 bytes, so OVMF wouldn't even look. However, for the >> caching and other VMGENID requirements, we need to allocate a full page >> with the blob (4KB), hence OVMF *will* look. If we placed the GUID right >> at the start of the page, then OVMF would sanity-check it as an ACPI >> table header. The check would *most likely* not pass, so things would be >> fine in practice, but we can do better than that: just put 40 zero bytes >> at the front of the blob. >> >> And this is why the ADD_POINTER command has to point to the beginning of >> the blob (i.e., 36 zero bytes), to "suppress the OVMF ACPI SDT header >> detection". (The other 4 bytes are for arriving at an address divisible >> by 8, which is a VMGENID requirement for the GUID.) >> >> The consequence is that both the ADDR method and QEMU's guest memory >> access code have to add 40 manually. >> > Igor has recommended that I have the add_pointer() call that patches AML > have an offset of 40 from the start of the source file, which will > result in the VGIA AML variable pointing to the GUID, not offset by 40. > I assume this isn’t a problem for you, but please confirm. No, that would be a problem. The explanation as to why is right above. Please process the rest of this sub-thread; Igor has agreed that my request makes sense, he'd just like to see the reasoning behind it documented: http://lists.nongnu.org/archive/html/qemu-devel/2017-02/msg01640.html To which I suggested that you please capture the above argument in a new subsection in the docs file: http://lists.nongnu.org/archive/html/qemu-devel/2017-02/msg01642.html To which Igor replied that we should reference that documentation from the exact spot in the code, where 40 decimal is added: http://lists.nongnu.org/archive/html/qemu-devel/2017-02/msg01647.html That's apparently our agreement on this. :) >> >> (3) Whether the VGIA named object should be a DWORD or a QWORD in ACPI. >> Personally, I'm fine with either, but I see that DWORD is more >> compatible with old OSes. >> >> What I really care about though is the new WRITE_POINTER command >> structure. That should support 8 bytes as well. >> >> So this is exactly how Michael summarized it ultimately: >> - use a DWORD for VGIA in ACPI, >> - with a matching 4-byte wide ADD_POINTER command; >> - write the address with a 4-byte wide WRITE_POINTER command into >> fw_cfg, >> - *but* the WRITE_POINTER command struct should support 8-byte as well, >> - and the fw_cfg file that is written (vmgenid_addr) should have size 8 >> from the start. (The guest will simply write offsets 0..3 inclusive, >> in LE byte order.) >> > This is an easy change, WRITE_POINTER already supports 8-byte writes. >> >> (4) IIRC Igor asked if sizing the blob to 4KB would guarantee that the >> GUID ends up in a cacheable page -- yes (see also Michael's followup). >> This size guarantees that the GUID will have a full page to itself, and >> its allocated from normal guest RAM, as Reserved memory (SeaBIOS, IIRC) >> or EfiACPIMemoryNVS memory (OVMF). >> >> Note that the VMGENID spec says, in ACPI terms, "It must not be in >> ranges reported as AddressRangeMemory or AddressRangeACPI", but that's >> fine: >> >> ACPI speak UEFI speak suitable for VMGENID? >> ------------------ --------------------- --------------------- >> AddressRangeMemory EfiConventionalMemory no >> AddressRangeACPI EfiACPIReclaimMemory no >> AddressRangeNVS EfiACPIMemoryNVS yes, used by OVMF >> AddressRangeReserved EfiReservedMemoryType yes, used by SeaBIOS >> >> >> (5) The race and fw_cfg callbacks. >> >> (a) Michael and Ben identified a problem where >> - QEMU places the initial GUID in the "vmgenid" fw_cfg blob, >> - the guest downloads it, >> - the host admin changes the GUID via the monitor, >> - and the guest writes the address to "vmgenid_addr" *only* after that. >> >> In other words, the monitor action occurs between the guest's processing >> of the ALLOCATE and WRITE_POINTER linker/loader commands. >> >> (b) A similar problem is rebooting the guest. >> - The guest starts up fine, >> - the OS runs, >> - the host admin changes the GUID via the monitor, >> - and the guest sees the update fine. >> - At this point the guest is rebooted (within the same QEMU instance). >> >> Since the fw_cfg blob is not rebuilt -- more precisely, the blob *is* >> rebuilt, but it is then thrown away in acpi_build_update() --, the guest >> will see a different GUID from the last value set on the monitor. (And >> querying the monitor will actually return that last-set value, which is >> no longer known by the guest.) >> >> >> The suggestion for these problems was to add a callback to "one" of the >> fw_cfg files in question. >> >> Let's investigate the writeable "vmgenid_addr" file first. Here the idea >> is to rewrite the GUID in guest RAM as soon as the address is known from >> the guest, regardless of what GUID the guest placed there originally, >> through downloading the "vmgenid" blob. >> >> (i) We have no write callbacks at the moment. Before we removed the data >> port based write support in QEMU 2.4, the write callback used to be >> invoked when the end of the fw_cfg blob was reached. Since we intend to >> pack several addresses in the same writeable fw_cfg blob down the road, >> at different offsets, this method wouldn't work; the final offset in the >> blob might not be reached at all. (Another thing that wouldn't work is >> an 8-byte writeable fw_cfg blob, and a 4-byte WRITE_POINTER command that >> only rewrites offsets 0..3 inclusve -- see (3) above.) >> >> So I don't think that a write callback is viable, at least with the >> "many addresses at different offsets in the same blob" feature. Unless >> we want to register different callbacks (or different callback >> arguments) for different offsets. And then a large write could trigger a >> series of callbacks, even, if it covers several special offsets. This >> looks too complex to me. >> >> (ii) What we do have now is a select callback. Unfortunately, the guest >> is not required to select the item and atomically rewrite the blob, in >> the same fw_cfg DMA operation. The guest is allowed to do the selection >> with the IO port method, and then write the blob separately with DMA. >> (This is what OVMF does.) IOW, when we'd get the callback (for the >> select), the address would not have been written yet. >> >> So I don't think that a select callback for the writeable "vmgenid_addr" >> file is viable either. >> >> (iii) Let's see a select callback for the "vmgenid" blob itself! The >> idea is, whenever the guest selects this blob, immediately refresh the >> blob from the last set GUID. Then let the guest download the >> just-refreshed blob. >> > I like this approach, and have it working. Cool! :) > I assume by “refresh the > blob” you mean making a call to fw_cfg_modify_file(). If that is the > case, I need to modify the function fw_cfg_modify_bytes_read() because > of the following commented-out problem lines (656:657 in fw_cfg.c): > > /* return the old data to the function caller, avoid memory leak */ > ptr = s->entries[arch][key].data; > s->entries[arch][key].data = data; > s->entries[arch][key].len = len; > // s->entries[arch][key].callback_opaque = NULL; > // s->entries[arch][key].allow_write = false; > > As you can see, this function modifies not only a fw_cfg object’s data, > but also its metadata, wiping out the callback parameter and forcing it > to read-only. I don’t know the history of this function, so want to > tread lightly. The second parameter doesn’t impact me, since this > fw_cfg object is read-only, but the callback parameter one does. It is justified to tread lightly :), because you really don't need to call fw_cfg_modify_file(). Instead, in vmgenid_add_fw_cfg(), please do this: and then, whenever necessary, you can poke the GUID right into "vms->guid_blob" (i.e., the fw_cfg data), at the appropriate offset (40). No reallocation / metadata changes are necessary. However, this means that you have to extend vmgenid_post_load() *too*, with the same fw_cfg blob update. (You don't have to migrate the blob, it is a whole bunch of zeroes with a GUID in the middle that has to be refreshed on the target host anyway.) > >> The main observation here is that we can *fail* QMP commands. That's >> good enough if we can tell *when* to fail them. >> >> Under this idea, we would fail the "set-vm-generation-id" QMP command >> with "EAGAIN" if the quest had selected the "vmgenid" blob since machine >> reset -- this can be tracked with a latch -- *but* the particular offset >> range in the "vmgenid_addr" blob were still zero. >> >> This is exactly the window between the ALLOCATE and WRITE_POINTER >> commands where updates from the monitor would be lost. So let's just >> tell the management layer to try again later (like 1 second later). By >> that time, any sane guest will have written the address into >> "vmgenid_addr”. >> > This seems sane. I’ll just return EAGAIN if vmgenid_addr is zero, > meaning the guest is not fully synchronized yet so don’t accept input. Right, it means that the guest is temporarily in a state where it cannot accept a new GUID. Regarding the error structure -- we have structured errors! --, please don't use error_setg(), nor error_setg_errno(). "EAGAIN" above was just an example. Both of the aforementioned functions format an ERROR_CLASS_GENERIC_ERROR, which is not really useful for libvirt if libvirt wants to treat this condition specially. Instead, we have another error class that looks appropriate: ERROR_CLASS_DEVICE_NOT_ACTIVE. If you grep the source for it, you'll find two independent users, which is (IMO) license to use it for our purposes as well. error_set(errp, ERROR_CLASS_DEVICE_NOT_ACTIVE, "", arguments ...); See also "@DeviceNotActive" under "@QapiErrorClass" in "qapi/common.json". If you grep the libvirt source code for "DeviceNotActive", there are several hits (with custom error handling) in "src/qemu/qemu_monitor_json.c". > > One other more radical idea I had was to just remove the monitor GUID > modification capability completely. You are going to laugh, but this was actually what I was about to suggest while writing the email you've just replied to :) I could not figure out *why* the management layer would want to change the VMGUID for an already running domain. Generate a random one at startup, or at incoming migration (hence, on the command line!), which also covers restart from snapshot I guess, is fully justified, but why mess with it otherwise? Where does this requirement come from? In the docs patch, you mention Hyper-V and Xen as other VMMs that support vmgenid. I didn't look into Hyper-V, but googling Xen, it looked like they had a similar management interface (for runtime). At that point I said to myself, "okay, so this is for 'feature parity' for all I know", and deleted like three paragraphs :) > The VM Generation ID needs to > change only in the following situations: > - new VM (command line processed) > - start from snapshot (command line processed, VGIA restored from VmState. (this is also "incoming migration") > - VM recovered from backup (like a new VM, command line processed). > - VM imported, copied or cloned (like a new VM, command line processed). > - failed over from a disaster recovery environment. Here I’m not sure. > It’s possible I suppose that one would have a DR VM in hot standby, but > would want to change its VM Generation ID in order to force Windows to > re-synchronize Active Directory, crypto seeds etc. Here the monitor > might be necessary. Well, I really don't know. I guess we can follow approach (iii), just to be safe. Thanks! Laszlo diff --git a/hw/acpi/vmgenid.c b/hw/acpi/vmgenid.c index af8b35cebbe7..3860e9717e12 100644 --- a/hw/acpi/vmgenid.c +++ b/hw/acpi/vmgenid.c @@ -111,6 +111,8 @@ void vmgenid_add_fw_cfg(FWCfgState *s, GArray *guid) assert(obj); VmGenIdState *vms = VMGENID(obj); + vms->guid_blob = guid->data; + /* Create a read-only fw_cfg file for GUID */ fw_cfg_add_file(s, VMGENID_GUID_FW_CFG_FILE, guid->data, VMGENID_FW_CFG_SIZE);