diff mbox

[RFC,1/2] misc: Add vboxguest driver for Virtual Box Guest integration

Message ID 82e9619b-0e6d-ae10-28c8-87295ae85389@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Hans de Goede Aug. 12, 2017, 9:56 p.m. UTC
Hi,

On 11-08-17 23:23, Arnd Bergmann wrote:
> On Fri, Aug 11, 2017 at 3:23 PM, Hans de Goede <hdegoede@redhat.com> wrote:
>> This commit adds a driver for the Virtual Box Guest PCI device used in
>> Virtual Box virtual machines. Enabling this driver will add support for
>> Virtual Box Guest integration features such as copy-and-paste, seamless
>> mode and OpenGL pass-through.
>>
>> This driver also offers vboxguest IPC functionality which is needed
>> for the vboxfs driver which offers folder sharing support.
>>
>> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
> 
> Since you are still working on coding style cleanups, I'll comment mainly on
> the ioctl interface for now.

Thank you for the taking the time to review the ioctl interface.

> Overall it already looks much better than
> I expected
> from previous experience with the vbox source.

I'm glad you like it, cleaning it up has been quite a bit of work,
so I'm happy to see this being appreciated.

> Most of the issues I would normally comment on are already moot based
> on the assumption that we won't be able to make substantial changes to
> either the user space portion or the hypervisor side.
> 
>> +/**
>> + * Inflate the balloon by one chunk.
>> + *
>> + * The caller owns the balloon mutex.
>> + *
>> + * @returns VBox status code
>> + * @param   gdev        The Guest extension device.
>> + * @param   chunk_idx  Index of the chunk.
>> + */
>> +static int vbg_balloon_inflate(PVBOXGUESTDEVEXT gdev, u32 chunk_idx)
>> +{
>> +       VMMDevChangeMemBalloon *req = gdev->mem_balloon.change_req;
>> +       struct page **pages;
>> +       int i, rc;
>> +
>> +       pages = kmalloc(sizeof(*pages) * VMMDEV_MEMORY_BALLOON_CHUNK_PAGES,
>> +                       GFP_KERNEL | __GFP_NOWARN);
>> +       if (!pages)
>> +               return VERR_NO_MEMORY;
>> +
>> +       req->header.size = sizeof(*req);
>> +       req->inflate = true;
>> +       req->pages = VMMDEV_MEMORY_BALLOON_CHUNK_PAGES;
> 
> The balloon section seems to be rather simplistic with its ioctl interface.
> Ideally this should use CONFIG_BALLOON_COMPACTION and an
> oom notifier like the virtio_balloon driver does. Yes, I realize that only
> one of the six (or more) balloon drivers we have in the kernel does it ;-)

The way the balloon code works is that the baloon-size is fully controlled
by the host, all the guest-additions do is wait for an event indicating that
the host wants to change it and then call this ioctl which will get the
desired balloon size from the host and tries to adjust the size accordingly.

Hooking into the oom killer is not really useful because there is no way we
can indicate to the host we are running low on mem and I don't think that
trying to shrink the balloon to be smaller then the host requested size
is a good idea.

> 
>> +/**
>> + * Callback for heartbeat timer.
>> + */
>> +static void vbg_heartbeat_timer(unsigned long data)
>> +{
>> +       PVBOXGUESTDEVEXT gdev = (PVBOXGUESTDEVEXT)data;
>> +
>> +       vbg_req_perform(gdev, gdev->guest_heartbeat_req);
>> +       mod_timer(&gdev->heartbeat_timer,
>> +                 msecs_to_jiffies(gdev->heartbeat_interval_ms));
>> +}
> 
> This looks like a watchdog and should use the drivers/watchdog
> subsystem interfaces.

This is not really a watchdog, this also is fully under host control,
the host controls if the guest should send a heartbeat and if so at
which interval, where normally with a watchdog starting the timer
and deciding the interval is under control of the watchdog user.

Also if we miss to generate the heartbeat then the host will just
log a message, not reset the system as one would expect with a
watchdog. TL;DR: I understand why suggest this but the watchdog
interface seems a poor match for this.

> 
>> +/**
>> + * vbg_query_host_version try get the host feature mask and version information
>> + * (vbg_host_version).
>> + *
>> + * @returns 0 or negative errno value (ignored).
>> + * @param   gdev         The Guest extension device.
>> + */
>> +static int vbg_query_host_version(PVBOXGUESTDEVEXT gdev)
>> +{
>> +       VMMDevReqHostVersion *req;
>> +       int rc, ret;
>> +
>> +       req = vbg_req_alloc(sizeof(*req), VMMDevReq_GetHostVersion);
>> +       if (!req)
>> +               return -ENOMEM;
> 
> While I understand you want to keep the ioctl interface, the version information
> looks like something that we should also have in sysfs in an appropriate
> location.

Yes I can add a sysfs attribute for the version and flags :)

>> diff --git a/drivers/misc/vboxguest/vboxguest_linux.c b/drivers/misc/vboxguest/vboxguest_linux.c
>> new file mode 100644
>> index 000000000000..8468c7139b98
>> --- /dev/null
>> +++ b/drivers/misc/vboxguest/vboxguest_linux.c
> 
> This looks like a fairly short file, and could be merged into the core file.

I would prefer to keep it separate to keep more or less 1 : 1 mapping
between mainline kernel and vbox upstream svn files.

>> +/** The file_operations structures. */
>> +static const struct file_operations vbg_misc_device_fops = {
>> +       .owner                  = THIS_MODULE,
>> +       .open                   = vbg_misc_device_open,
>> +       .release                = vbg_misc_device_close,
>> +       .unlocked_ioctl         = vgdrvLinuxIOCtl,
>> +};
>> +static const struct file_operations vbg_misc_device_user_fops = {
>> +       .owner                  = THIS_MODULE,
>> +       .open                   = vbg_misc_device_user_open,
>> +       .release                = vbg_misc_device_close,
>> +       .unlocked_ioctl         = vgdrvLinuxIOCtl,
>> +};
> 
> These lack a .compat_ioctl callback.

Good catch, so I guess that the code paths for 32 bit OpenGL pass
on 64 bits have never been used / tested under Linux, I will take
a look at this.

>> +/**
>> + * Device I/O Control entry point.
>> + *
>> + * @returns -ENOMEM or -EFAULT for errors inside the ioctl callback; 0
>> + * on success, or a positive VBox status code on vbox guest-device errors.
>> + *
>> + * @param   pInode      Associated inode pointer.
>> + * @param   pFilp       Associated file pointer.
>> + * @param   uCmd        The function specified to ioctl().
>> + * @param   ulArg       The argument specified to ioctl().
>> + */
>> +static long vgdrvLinuxIOCtl(struct file *pFilp, unsigned int uCmd,
>> +                           unsigned long ulArg)
>> +{
>> +       PVBOXGUESTSESSION session = (PVBOXGUESTSESSION) pFilp->private_data;
>> +       u32 cbData = _IOC_SIZE(uCmd);
>> +       void *pvBufFree;
>> +       void *pvBuf;
>> +       int rc, ret = 0;
>> +       u64 au64Buf[32 / sizeof(u64)];
>> +
>> +       /*
>> +        * For small amounts of data being passed we use a stack based buffer
>> +        * except for VMMREQUESTs where the data must not be on the stack.
>> +        */
>> +       if (cbData <= sizeof(au64Buf) &&
>> +           VBOXGUEST_IOCTL_STRIP_SIZE(uCmd) !=
>> +           VBOXGUEST_IOCTL_STRIP_SIZE(VBOXGUEST_IOCTL_VMMREQUEST(0))) {
>> +               pvBufFree = NULL;
>> +               pvBuf = &au64Buf[0];
>> +       } else {
>> +               /* __GFP_DMA32 for VBOXGUEST_IOCTL_VMMREQUEST */
>> +               pvBufFree = pvBuf = kmalloc(cbData, GFP_KERNEL | __GFP_DMA32);
>> +               if (!pvBuf)
>> +                       return -ENOMEM;
>> +       }
>> +       if (copy_from_user(pvBuf, (void *)ulArg, cbData) == 0) {
> 
> There should be a check for a maximum argument size.

Good point.

> I'd also change the commands
> to not always do both read and write, but only whichever applies. This function
> would then do the copy_from_user/copy_to_user conditionally.

This is actually an upstream TODO item I believe. One which would
probably be best fixed by using _IOR/_IOW/_IORW IOCTL macros, but
that needs to be coordinated with upstream.

>> +static int hgcm_call_preprocess_linaddr(const HGCMFunctionParameter *src_parm,
>> +                                       bool is_user, void **bounce_buf_ret,
>> +                                       size_t *pcb_extra)
>> +{
>> +       void *buf, *bounce_buf;
>> +       bool copy_in;
>> +       u32 len;
>> +       int ret;
>> +
>> +       buf = (void *)src_parm->u.Pointer.u.linearAddr;
>> +       len = src_parm->u.Pointer.size;
>> +       copy_in = src_parm->type != VMMDevHGCMParmType_LinAddr_Out;
>> +
>> +       if (!is_user) {
>> +               if (WARN_ON(len > VBGLR0_MAX_HGCM_KERNEL_PARM))
>> +                       return VERR_OUT_OF_RANGE;
>> +
>> +               hgcm_call_inc_pcb_extra(buf, len, pcb_extra);
>> +               return VINF_SUCCESS;
>> +       }
>> +
>> +       if (len > VBGLR0_MAX_HGCM_USER_PARM)
>> +               return VERR_OUT_OF_RANGE;
>> +
>> +       if (len <= PAGE_SIZE * 2)
>> +               bounce_buf = kmalloc(len, GFP_KERNEL);
>> +       else
>> +               bounce_buf = vmalloc(len);
>> +
> 
> we have kvmalloc() for this now

Cool I did not know that, will fix.

> 
> 
>> +#ifdef CONFIG_X86_64
>> +int vbg_hgcm_call32(VBOXGUESTDEVEXT *gdev, VBoxGuestHGCMCallInfo * pCallInfo,
>> +                   u32 cbCallInfo, u32 timeout_ms, bool is_user)
>> +{
>> +       VBoxGuestHGCMCallInfo *pCallInfo64 = NULL;
>> +       HGCMFunctionParameter *pParm64 = NULL;
>> +       HGCMFunctionParameter32 *pParm32 = NULL;
>> +       u32 cParms = pCallInfo->cParms;
>> +       u32 iParm;
>> +       int rc = VINF_SUCCESS;
>> +
>> +       /*
>> +        * The simple approach, allocate a temporary request and convert the parameters.
>> +        */
>> +       pCallInfo64 = kzalloc(sizeof(*pCallInfo64) +
>> +                             cParms * sizeof(HGCMFunctionParameter),
>> +                             GFP_KERNEL);
>> +       if (!pCallInfo64)
>> +               return VERR_NO_MEMORY;
>> +
>> +       *pCallInfo64 = *pCallInfo;
>> +       pParm32 = VBOXGUEST_HGCM_CALL_PARMS32(pCallInfo);
>> +       pParm64 = VBOXGUEST_HGCM_CALL_PARMS(pCallInfo64);
>> +       for (iParm = 0; iParm < cParms; iParm++, pParm32++, pParm64++) {
>> +               switch (pParm32->type) {
>> +               case VMMDevHGCMParmType_32bit:
>> +                       pParm64->type = VMMDevHGCMParmType_32bit;
>> +                       pParm64->u.value32 = pParm32->u.value32;
>> +                       break;
>> +
>> +               case VMMDevHGCMParmType_64bit:
>> +                       pParm64->type = VMMDevHGCMParmType_64bit;
>> +                       pParm64->u.value64 = pParm32->u.value64;
>> +                       break;
>> +
>> +               case VMMDevHGCMParmType_LinAddr_Out:
>> +               case VMMDevHGCMParmType_LinAddr:
>> +               case VMMDevHGCMParmType_LinAddr_In:
>> +                       pParm64->type = pParm32->type;
>> +                       pParm64->u.Pointer.size = pParm32->u.Pointer.size;
>> +                       pParm64->u.Pointer.u.linearAddr =
>> +                           pParm32->u.Pointer.u.linearAddr;
>> +                       break;
> 
> It would be good to clean this up and always use the same structure here.

The HGCMFunctionParameter struct describes function-parameters
in a Host-Guest-Control-Method function call, so this is part of the
host ABI interface and we cannot change this.

Any 32 bit apps running on a 64 bit kernel will be using the 32 bit
version of this struct and we need to translate.

> 
>> diff --git a/include/linux/vbox_err.h b/include/linux/vbox_err.h
>> new file mode 100644
>> index 000000000000..906ff7d2585d
>> --- /dev/null
>> +++ b/include/linux/vbox_err.h
>> @@ -0,0 +1,6 @@
>> +#ifndef __VBOX_ERR_H__
>> +#define __VBOX_ERR_H__
>> +
>> +#include <uapi/linux/vbox_err.h>
>> +
>> +#endif
>> diff --git a/include/linux/vbox_ostypes.h b/include/linux/vbox_ostypes.h
>> new file mode 100644
>> index 000000000000..ea2a391f135f
>> --- /dev/null
>> +++ b/include/linux/vbox_ostypes.h
>> @@ -0,0 +1,6 @@
>> +#ifndef __VBOX_OSTYPES_H__
>> +#define __VBOX_OSTYPES_H__
>> +
>> +#include <uapi/linux/vbox_ostypes.h>
>> +
>> +#endif
>> diff --git a/include/linux/vboxguest.h b/include/linux/vboxguest.h
>> new file mode 100644
>> index 000000000000..fca5d199a884
>> --- /dev/null
>> +++ b/include/linux/vboxguest.h
>> @@ -0,0 +1,6 @@
>> +#ifndef __VBOXGUEST_H__
>> +#define __VBOXGUEST_H__
>> +
>> +#include <uapi/linux/vboxguest.h>
>> +
>> +#endif
> 
> This should not be needed, we add -I${srctree}/include/uapi/ to the include path
> already.

Ok.

> 
>> +/**
>> + * The layout of VMMDEV RAM region that contains information for guest.
>> + */
>> +typedef struct VMMDevMemory {
>> +       /** The size of this structure. */
>> +       u32 u32Size;
>> +       /** The structure version. (VMMDEV_MEMORY_VERSION) */
>> +       u32 u32Version;
>> +
>> +       union {
>> +               struct {
>> +                       /** Flag telling that VMMDev has events pending. */
>> +                       bool fHaveEvents;
>> +               } V1_04;
>> +
> 
> As this is a hardware interface, maybe use u32 instead of 'bool'.

I guess that should work here, since we only every do "if (fHaveEvents)"
but in other cases where we also set bool variables in the hardware
interface structs we also need to know which bit(s) get set on true
to move this over to non bool use.

I agree that the choice of bool in a (virtual) hardware interface
is unfortunate, but we should not be trying to change the (virtual)
hardware definition IMHO.

> Strictly speaking, the ring buffer structures would even have to
> be __le32 and use byteorder specific read/write access, but that
> could perhaps be skipped here when you add a Kconfig
> dependency on !CONFIG_CPU_BIG_ENDIAN (which won't
> ever be set on x86).
> 
>> diff --git a/include/uapi/linux/vbox_err.h b/include/uapi/linux/vbox_err.h
>> new file mode 100644
>> index 000000000000..e6e7ba835e36
>> --- /dev/null
>> +++ b/include/uapi/linux/vbox_err.h
>> diff --git a/include/uapi/linux/vbox_ostypes.h b/include/uapi/linux/vbox_ostypes.h
>> new file mode 100644
>> index 000000000000..abe9a38ebfbd
>> --- /dev/null
>> +++ b/include/uapi/linux/vbox_ostypes.h
>> @@ -0,0 +1,158 @@
> 
> Some of these headers are not really ABI between the kernel and user
> space but are between the vbox host and guest, so include/uapi is maybe
> not the best place for them.

I assume you refer mainly to vbox_vmmdev.h here, the problem is that
the ioctl API leaks a lot of vbox_vmmdev.h things through
VBOXGUEST_IOCTL_VMMREQUEST which allows userspace to submit arbitrary
vmmdev-requests. Note that the vboxguest driver does do several sanity
checks on these, including limiting them to a specific list of allowed
requests. But as said this does leak almost all of the vbox_vmmdev.h
hardware interface into the userspace API.

I checked and VBOXGUEST_IOCTL_VMMREQUEST is used in various places,
so I cannot just drop it.

> Do we need all of this both in the kernel and in the header that actually
> defines the kernel/user interface?

Yes, see above.

>> +/**
>> + * @name VBoxGuest IOCTL codes and structures.
>> + *
>> + * The range 0..15 is for basic driver communication.
>> + * The range 16..31 is for HGCM communication.
>> + * The range 32..47 is reserved for future use.
>> + * The range 48..63 is for OS specific communication.
>> + * The 7th bit is reserved for future hacks.
>> + * The 8th bit is reserved for distinguishing between 32-bit and 64-bit
>> + * processes in future 64-bit guest additions.
>> + * @{
>> + */
>> +#if __BITS_PER_LONG == 64
>> +#define VBOXGUEST_IOCTL_FLAG           128
>> +#else
>> +#define VBOXGUEST_IOCTL_FLAG           0
>> +#endif
>> +/** @} */
> 
> This seems misguided, we already know the difference between
> 32-bit and 64-bit tasks based on the entry point, and if the
> interface is properly designed then we don't care about this
> difference at all.

I guess (and really guess at that) that this is there because on
some guest OS-es supported by vbox the distinction between a
32 bit or 64 bit app calling the ioctl cannot be made in another
way.

But I would not mind dropping this flag for linux.
Michael / Knut would something like this be acceptable ?


(extend with some other changes such as properly adding a .compat_ioctl
file-op ?


>> +#define VBOXGUEST_IOCTL_CODE_(function, size) \
>> +       _IOC(_IOC_READ|_IOC_WRITE, 'V', (function), (size))
>> +#define VBOXGUEST_IOCTL_STRIP_SIZE(code) \
>> +       VBOXGUEST_IOCTL_CODE_(_IOC_NR((code)), 0)
>> +
>> +#define VBOXGUEST_IOCTL_CODE(function, size) \
>> +       VBOXGUEST_IOCTL_CODE_((function) | VBOXGUEST_IOCTL_FLAG, size)
>> +/* Define 32 bit codes to support 32 bit applications in 64 bit guest driver. */
>> +#define VBOXGUEST_IOCTL_CODE_32(function, size) \
>> +VBOXGUEST_IOCTL_CODE_(function, size)
> 
> If the command numbers can be changed wihtout causing too many
> problems, I'd just do away with these wrappers and use _IOR/_IOW/_IORW
> as needed.

The upstream vboxguest.h header defines VBOXGUEST_IOCTL_CODE for many
different guest platforms. I guess not all of them have the
_IOR/_IOW/_IORW flags embedded in the ioctl-code concept and I believe
some of them only have a small amount of bits which can actually be
used for the code, so we cannot just cram these flags in there.

I think that instead we need an ioctl_direction_flags table which can
be used to determine if we need todo the copy_from_user and friends,
without changing the codes.

> 
>> +/** Input and output buffers layout of the IOCTL_VBOXGUEST_WAITEVENT */
>> +typedef struct VBoxGuestWaitEventInfo {
>> +       /** timeout in milliseconds */
>> +       u32 u32TimeoutIn;
>> +       /** events to wait for */
>> +       u32 u32EventMaskIn;
>> +       /** result code */
>> +       u32 u32Result;
>> +       /** events occurred */
>> +       u32 u32EventFlagsOut;
>> +} VBoxGuestWaitEventInfo;
>> +VBOXGUEST_ASSERT_SIZE(VBoxGuestWaitEventInfo, 16);
> 
> 'u32' is not an approprioate type for a kernel header, use '__u32'
> instead.

Huh I thought that u32 was preferred, but I guess that it is not allowed
in uapi headers due to potential conflicts and it should be __u32 in uapi
headers ?

Regards,

Hans

Comments

Greg KH Aug. 12, 2017, 11:22 p.m. UTC | #1
On Sat, Aug 12, 2017 at 11:56:22PM +0200, Hans de Goede wrote:
> > 'u32' is not an approprioate type for a kernel header, use '__u32'
> > instead.
> 
> Huh I thought that u32 was preferred, but I guess that it is not allowed
> in uapi headers due to potential conflicts and it should be __u32 in uapi
> headers ?

The __ version should always be used in structures/variables that cross
the user/kernel boundry as they are guaranteed to be correct that way.
That is a requirement.

Use the "normal" non __ versions for in-kernel only variables.

hope this helps,

greg k-h
Arnd Bergmann Aug. 14, 2017, 9:30 a.m. UTC | #2
On Sat, Aug 12, 2017 at 11:56 PM, Hans de Goede <hdegoede@redhat.com> wrote:
> On 11-08-17 23:23, Arnd Bergmann wrote:
>> On Fri, Aug 11, 2017 at 3:23 PM, Hans de Goede <hdegoede@redhat.com>
>> wrote:
>> Most of the issues I would normally comment on are already moot based
>> on the assumption that we won't be able to make substantial changes to
>> either the user space portion or the hypervisor side.
>>
>>> +/**
>>> + * Inflate the balloon by one chunk.
>>> + *
>>> + * The caller owns the balloon mutex.
>>> + *
>>> + * @returns VBox status code
>>> + * @param   gdev        The Guest extension device.
>>> + * @param   chunk_idx  Index of the chunk.
>>> + */
>>> +static int vbg_balloon_inflate(PVBOXGUESTDEVEXT gdev, u32 chunk_idx)
>>> +{
>>> +       VMMDevChangeMemBalloon *req = gdev->mem_balloon.change_req;
>>> +       struct page **pages;
>>> +       int i, rc;
>>> +
>>> +       pages = kmalloc(sizeof(*pages) *
>>> VMMDEV_MEMORY_BALLOON_CHUNK_PAGES,
>>> +                       GFP_KERNEL | __GFP_NOWARN);
>>> +       if (!pages)
>>> +               return VERR_NO_MEMORY;
>>> +
>>> +       req->header.size = sizeof(*req);
>>> +       req->inflate = true;
>>> +       req->pages = VMMDEV_MEMORY_BALLOON_CHUNK_PAGES;
>>
>>
>> The balloon section seems to be rather simplistic with its ioctl
>> interface.
>> Ideally this should use CONFIG_BALLOON_COMPACTION and an
>> oom notifier like the virtio_balloon driver does. Yes, I realize that only
>> one of the six (or more) balloon drivers we have in the kernel does it ;-)
>
>
> The way the balloon code works is that the baloon-size is fully controlled
> by the host, all the guest-additions do is wait for an event indicating that
> the host wants to change it and then call this ioctl which will get the
> desired balloon size from the host and tries to adjust the size accordingly.
>
> Hooking into the oom killer is not really useful because there is no way we
> can indicate to the host we are running low on mem and I don't think that
> trying to shrink the balloon to be smaller then the host requested size
> is a good idea.

Are you sure the guest can't just claim the memory back by using it?
Usually that is how the balloon drivers work: the host asks the guest
to free up some memory, and the guest frees up memory that it
tells to the host, but if the guest runs out of memory, it just starts using
it again and the host faults in empty pages.

For CONFIG_BALLOON_COMPACTION, you don't even need that,
you just put some memory into the balloon before you take some
other page out that is required for compaction. This might be less
of an issue here when the balloon always operates on 1MB chunks.

>>> +/**
>>> + * Callback for heartbeat timer.
>>> + */
>>> +static void vbg_heartbeat_timer(unsigned long data)
>>> +{
>>> +       PVBOXGUESTDEVEXT gdev = (PVBOXGUESTDEVEXT)data;
>>> +
>>> +       vbg_req_perform(gdev, gdev->guest_heartbeat_req);
>>> +       mod_timer(&gdev->heartbeat_timer,
>>> +                 msecs_to_jiffies(gdev->heartbeat_interval_ms));
>>> +}
>>
>>
>> This looks like a watchdog and should use the drivers/watchdog
>> subsystem interfaces.
>
>
> This is not really a watchdog, this also is fully under host control,
> the host controls if the guest should send a heartbeat and if so at
> which interval, where normally with a watchdog starting the timer
> and deciding the interval is under control of the watchdog user.
>
> Also if we miss to generate the heartbeat then the host will just
> log a message, not reset the system as one would expect with a
> watchdog. TL;DR: I understand why suggest this but the watchdog
> interface seems a poor match for this.

Ok.

>>> diff --git a/drivers/misc/vboxguest/vboxguest_linux.c
>>> b/drivers/misc/vboxguest/vboxguest_linux.c
>>> new file mode 100644
>>> index 000000000000..8468c7139b98
>>> --- /dev/null
>>> +++ b/drivers/misc/vboxguest/vboxguest_linux.c
>>
>>
>> This looks like a fairly short file, and could be merged into the core
>> file.
>
>
> I would prefer to keep it separate to keep more or less 1 : 1 mapping
> between mainline kernel and vbox upstream svn files.

ok.

>>> +        */
>>> +       if (cbData <= sizeof(au64Buf) &&
>>> +           VBOXGUEST_IOCTL_STRIP_SIZE(uCmd) !=
>>> +           VBOXGUEST_IOCTL_STRIP_SIZE(VBOXGUEST_IOCTL_VMMREQUEST(0))) {
>>> +               pvBufFree = NULL;
>>> +               pvBuf = &au64Buf[0];
>>> +       } else {
>>> +               /* __GFP_DMA32 for VBOXGUEST_IOCTL_VMMREQUEST */
>>> +               pvBufFree = pvBuf = kmalloc(cbData, GFP_KERNEL |
>>> __GFP_DMA32);
>>> +               if (!pvBuf)
>>> +                       return -ENOMEM;
>>> +       }
>>> +       if (copy_from_user(pvBuf, (void *)ulArg, cbData) == 0) {

>
>> I'd also change the commands
>> to not always do both read and write, but only whichever applies. This
>> function
>> would then do the copy_from_user/copy_to_user conditionally.
>
>
> This is actually an upstream TODO item I believe. One which would
> probably be best fixed by using _IOR/_IOW/_IORW IOCTL macros, but
> that needs to be coordinated with upstream.
>

>> It would be good to clean this up and always use the same structure here.
>
>
> The HGCMFunctionParameter struct describes function-parameters
> in a Host-Guest-Control-Method function call, so this is part of the
> host ABI interface and we cannot change this.
>
> Any 32 bit apps running on a 64 bit kernel will be using the 32 bit
> version of this struct and we need to translate.

I don't see the difference between changing the command codes and
changing the structure layout here. In both cases, that is an incompatible
ABI change but shouldn't much impact the source-level ABI if you do
it right.

>>> +/**
>>> + * The layout of VMMDEV RAM region that contains information for guest.
>>> + */
>>> +typedef struct VMMDevMemory {
>>> +       /** The size of this structure. */
>>> +       u32 u32Size;
>>> +       /** The structure version. (VMMDEV_MEMORY_VERSION) */
>>> +       u32 u32Version;
>>> +
>>> +       union {
>>> +               struct {
>>> +                       /** Flag telling that VMMDev has events pending.
>>> */
>>> +                       bool fHaveEvents;
>>> +               } V1_04;
>>> +
>>
>>
>> As this is a hardware interface, maybe use u32 instead of 'bool'.
>
>
> I guess that should work here, since we only every do "if (fHaveEvents)"
> but in other cases where we also set bool variables in the hardware
> interface structs we also need to know which bit(s) get set on true
> to move this over to non bool use.
>
> I agree that the choice of bool in a (virtual) hardware interface
> is unfortunate, but we should not be trying to change the (virtual)
> hardware definition IMHO.

According to the x86 psABI (https://www.uclibc.org/docs/psABI-x86_64.pdf),
a _Bool should be represented in memory like an 'unsigned char' but only
contain the values 0 or 1 (the upper 7 bits are zero).

Before linux-2.6.18, we defined 'bool' as 'typedef enum { false, true}
__packed bool;', which would make it a 32-bit integer that can take the
values 0 and 1.

Inside of the union, the two have the same binary layout
on little-endian architectures, as the information is still stored in
the low bit of the first of four bytes.

I think if you do

       union {
               struct {
                       /** Flag telling that VMMDev has events pending. */
                       u8 fHaveEvents;
                       u8 pad[3];
               } V1_04;

that will be a portable representation if the existing ABI.

>>
>>
>> Some of these headers are not really ABI between the kernel and user
>> space but are between the vbox host and guest, so include/uapi is maybe
>> not the best place for them.
>
>
> I assume you refer mainly to vbox_vmmdev.h here, the problem is that
> the ioctl API leaks a lot of vbox_vmmdev.h things through
> VBOXGUEST_IOCTL_VMMREQUEST which allows userspace to submit arbitrary
> vmmdev-requests. Note that the vboxguest driver does do several sanity
> checks on these, including limiting them to a specific list of allowed
> requests. But as said this does leak almost all of the vbox_vmmdev.h
> hardware interface into the userspace API.
>
> I checked and VBOXGUEST_IOCTL_VMMREQUEST is used in various places,
> so I cannot just drop it.

I see.

>>> +#define VBOXGUEST_IOCTL_CODE(function, size) \
>>> +       VBOXGUEST_IOCTL_CODE_((function) | VBOXGUEST_IOCTL_FLAG, size)
>>> +/* Define 32 bit codes to support 32 bit applications in 64 bit guest
>>> driver. */
>>> +#define VBOXGUEST_IOCTL_CODE_32(function, size) \
>>> +VBOXGUEST_IOCTL_CODE_(function, size)
>>
>>
>> If the command numbers can be changed wihtout causing too many
>> problems, I'd just do away with these wrappers and use _IOR/_IOW/_IORW
>> as needed.
>
>
> The upstream vboxguest.h header defines VBOXGUEST_IOCTL_CODE for many
> different guest platforms. I guess not all of them have the
> _IOR/_IOW/_IORW flags embedded in the ioctl-code concept and I believe
> some of them only have a small amount of bits which can actually be
> used for the code, so we cannot just cram these flags in there.

Do you have a specific example? I think the BSDs and derived operating
systems are actually much stricter with those flags than Linux is, and the
macros I suggested originate in BSD.

> I think that instead we need an ioctl_direction_flags table which can
> be used to determine if we need todo the copy_from_user and friends,
> without changing the codes.

I would not bother with that. Either we fix the command codes to correctly
reflect the direction, or we should actually copy the data both ways as the
current implementation does.

       Arnd
Hans de Goede Aug. 14, 2017, 12:15 p.m. UTC | #3
Hi,

On 14-08-17 11:30, Arnd Bergmann wrote:
> On Sat, Aug 12, 2017 at 11:56 PM, Hans de Goede <hdegoede@redhat.com> wrote:
>> On 11-08-17 23:23, Arnd Bergmann wrote:
>>> On Fri, Aug 11, 2017 at 3:23 PM, Hans de Goede <hdegoede@redhat.com>
>>> wrote:
>>> Most of the issues I would normally comment on are already moot based
>>> on the assumption that we won't be able to make substantial changes to
>>> either the user space portion or the hypervisor side.
>>>
>>>> +/**
>>>> + * Inflate the balloon by one chunk.
>>>> + *
>>>> + * The caller owns the balloon mutex.
>>>> + *
>>>> + * @returns VBox status code
>>>> + * @param   gdev        The Guest extension device.
>>>> + * @param   chunk_idx  Index of the chunk.
>>>> + */
>>>> +static int vbg_balloon_inflate(PVBOXGUESTDEVEXT gdev, u32 chunk_idx)
>>>> +{
>>>> +       VMMDevChangeMemBalloon *req = gdev->mem_balloon.change_req;
>>>> +       struct page **pages;
>>>> +       int i, rc;
>>>> +
>>>> +       pages = kmalloc(sizeof(*pages) *
>>>> VMMDEV_MEMORY_BALLOON_CHUNK_PAGES,
>>>> +                       GFP_KERNEL | __GFP_NOWARN);
>>>> +       if (!pages)
>>>> +               return VERR_NO_MEMORY;
>>>> +
>>>> +       req->header.size = sizeof(*req);
>>>> +       req->inflate = true;
>>>> +       req->pages = VMMDEV_MEMORY_BALLOON_CHUNK_PAGES;
>>>
>>>
>>> The balloon section seems to be rather simplistic with its ioctl
>>> interface.
>>> Ideally this should use CONFIG_BALLOON_COMPACTION and an
>>> oom notifier like the virtio_balloon driver does. Yes, I realize that only
>>> one of the six (or more) balloon drivers we have in the kernel does it ;-)
>>
>>
>> The way the balloon code works is that the baloon-size is fully controlled
>> by the host, all the guest-additions do is wait for an event indicating that
>> the host wants to change it and then call this ioctl which will get the
>> desired balloon size from the host and tries to adjust the size accordingly.
>>
>> Hooking into the oom killer is not really useful because there is no way we
>> can indicate to the host we are running low on mem and I don't think that
>> trying to shrink the balloon to be smaller then the host requested size
>> is a good idea.
> 
> Are you sure the guest can't just claim the memory back by using it?

I don't think so the original code certainly does not do anything of
the sorts, the balloon code uses alloc_page(GFP_KERNEL | __GFP_NOWARN)
and only calls __free_page() again on the pages after a successful
deflate request to the host.

So first we tell the host we want a chunk (1MiB worth of pages) back and
only if the host says ok do we call __free_page which means it can
get used for something else again.

Michael, Knut can you shed some light on this ?

> Usually that is how the balloon drivers work: the host asks the guest
> to free up some memory, and the guest frees up memory that it
> tells to the host, but if the guest runs out of memory, it just starts using
> it again and the host faults in empty pages.
> 
> For CONFIG_BALLOON_COMPACTION, you don't even need that,
> you just put some memory into the balloon before you take some
> other page out that is required for compaction. This might be less
> of an issue here when the balloon always operates on 1MB chunks.

<snip>

>>>> +        */
>>>> +       if (cbData <= sizeof(au64Buf) &&
>>>> +           VBOXGUEST_IOCTL_STRIP_SIZE(uCmd) !=
>>>> +           VBOXGUEST_IOCTL_STRIP_SIZE(VBOXGUEST_IOCTL_VMMREQUEST(0))) {
>>>> +               pvBufFree = NULL;
>>>> +               pvBuf = &au64Buf[0];
>>>> +       } else {
>>>> +               /* __GFP_DMA32 for VBOXGUEST_IOCTL_VMMREQUEST */
>>>> +               pvBufFree = pvBuf = kmalloc(cbData, GFP_KERNEL |
>>>> __GFP_DMA32);
>>>> +               if (!pvBuf)
>>>> +                       return -ENOMEM;
>>>> +       }
>>>> +       if (copy_from_user(pvBuf, (void *)ulArg, cbData) == 0) {
> 
>>
>>> I'd also change the commands
>>> to not always do both read and write, but only whichever applies. This
>>> function
>>> would then do the copy_from_user/copy_to_user conditionally.
>>
>>
>> This is actually an upstream TODO item I believe. One which would
>> probably be best fixed by using _IOR/_IOW/_IORW IOCTL macros, but
>> that needs to be coordinated with upstream.
>>
> 
>>> It would be good to clean this up and always use the same structure here.
>>
>>
>> The HGCMFunctionParameter struct describes function-parameters
>> in a Host-Guest-Control-Method function call, so this is part of the
>> host ABI interface and we cannot change this.
>>
>> Any 32 bit apps running on a 64 bit kernel will be using the 32 bit
>> version of this struct and we need to translate. >
> I don't see the difference between changing the command codes and
> changing the structure layout here. In both cases, that is an incompatible
> ABI change but shouldn't much impact the source-level ABI if you do
> it right.

This struct is defined in vmmdev.h because it is part of the hardware
interface, a 32 bit host expects the 32 bit version of the struct
when calling into the host, where as a 64 bit host expects the 64 bit
version.

The hgcm-call ioctl uses this struct _directly_, so a 64 bit app
will pass in the 64 bit version and a 32 bit app the 32 bit version,
and on 64 bit we need to translate the 32 bit version coming from
a 32 bit app.

The only way around this would be to make userspace always see /
use the 64 bit version but then we would need to always translate
the structs back to their 32 bit version on 32 bit hosts, so we
would still end up with a translation function and now we are
calling it a lot more often.


>>>> + * The layout of VMMDEV RAM region that contains information for guest.
>>>> + */
>>>> +typedef struct VMMDevMemory {
>>>> +       /** The size of this structure. */
>>>> +       u32 u32Size;
>>>> +       /** The structure version. (VMMDEV_MEMORY_VERSION) */
>>>> +       u32 u32Version;
>>>> +
>>>> +       union {
>>>> +               struct {
>>>> +                       /** Flag telling that VMMDev has events pending.
>>>> */
>>>> +                       bool fHaveEvents;
>>>> +               } V1_04;
>>>> +
>>>
>>>
>>> As this is a hardware interface, maybe use u32 instead of 'bool'.
>>
>>
>> I guess that should work here, since we only every do "if (fHaveEvents)"
>> but in other cases where we also set bool variables in the hardware
>> interface structs we also need to know which bit(s) get set on true
>> to move this over to non bool use.
>>
>> I agree that the choice of bool in a (virtual) hardware interface
>> is unfortunate, but we should not be trying to change the (virtual)
>> hardware definition IMHO.
> 
> According to the x86 psABI (https://www.uclibc.org/docs/psABI-x86_64.pdf),
> a _Bool should be represented in memory like an 'unsigned char' but only
> contain the values 0 or 1 (the upper 7 bits are zero).
> 
> Before linux-2.6.18, we defined 'bool' as 'typedef enum { false, true}
> __packed bool;', which would make it a 32-bit integer that can take the
> values 0 and 1.
> 
> Inside of the union, the two have the same binary layout
> on little-endian architectures, as the information is still stored in
> the low bit of the first of four bytes.
> 
> I think if you do
> 
>         union {
>                 struct {
>                         /** Flag telling that VMMDev has events pending. */
>                         u8 fHaveEvents;
>                         u8 pad[3];
>                 } V1_04;
> 
> that will be a portable representation if the existing ABI.

Ok, looking at the various VMMDEV_ASSERT_SIZE calls in there bool indeed
seems to be u8, so I should be able to fix this.

Michael, Knut any input on this ?  Do you want me to submit a similar
change upstream ?

<snip>

>>>> +#define VBOXGUEST_IOCTL_CODE(function, size) \
>>>> +       VBOXGUEST_IOCTL_CODE_((function) | VBOXGUEST_IOCTL_FLAG, size)
>>>> +/* Define 32 bit codes to support 32 bit applications in 64 bit guest
>>>> driver. */
>>>> +#define VBOXGUEST_IOCTL_CODE_32(function, size) \
>>>> +VBOXGUEST_IOCTL_CODE_(function, size)
>>>
>>>
>>> If the command numbers can be changed wihtout causing too many
>>> problems, I'd just do away with these wrappers and use _IOR/_IOW/_IORW
>>> as needed.
>>
>>
>> The upstream vboxguest.h header defines VBOXGUEST_IOCTL_CODE for many
>> different guest platforms. I guess not all of them have the
>> _IOR/_IOW/_IORW flags embedded in the ioctl-code concept and I believe
>> some of them only have a small amount of bits which can actually be
>> used for the code, so we cannot just cram these flags in there.
> 
> Do you have a specific example? I think the BSDs and derived operating
> systems are actually much stricter with those flags than Linux is, and the
> macros I suggested originate in BSD.

So the vbox-upstream header has this:

/** @name VBoxGuest IOCTL codes and structures.
  *
  * The range 0..15 is for basic driver communication.
  * The range 16..31 is for HGCM communication.
  * The range 32..47 is reserved for future use.
  * The range 48..63 is for OS specific communication.
  * The 7th bit is reserved for future hacks.
  * The 8th bit is reserved for distinguishing between 32-bit and 64-bit
  * processes in future 64-bit guest additions.
  *
  * @remarks When creating new IOCtl interfaces keep in mind that not all OSes supports
  *          reporting back the output size. (This got messed up a little bit in VBoxDrv.)
  *
  *          The request size is also a little bit tricky as it's passed as part of the
  *          request code on unix. The size field is 14 bits on Linux, 12 bits on *BSD,
  *          13 bits Darwin, and 8-bits on Solaris. All the BSDs and Darwin kernels
  *          will make use of the size field, while Linux and Solaris will not. We're of
  *          course using the size to validate and/or map/lock the request, so it has
  *          to be valid.
  *
  *          For Solaris we will have to do something special though, 255 isn't
  *          sufficient for all we need. A 4KB restriction (BSD) is probably not
  *          too problematic (yet) as a general one.
  *
  *          More info can be found in SUPDRVIOC.h and related sources.
  *
  * @remarks If adding interfaces that only has input or only has output, some new macros
  *          needs to be created so the most efficient IOCtl data buffering method can be
  *          used.
  * @{
  */

Reading this a second time I think that the reasons why there are no
separate R / W / RW versions of the macros is because all current ioctls
are RW and from the top of my mind (I've not done a full audit) that seems
correct.

So yes we should be able to just use _IOWR instead of _IOC(_IOC_READ|_IOC_WRITE
but depending on how discussion on the 64 bit flag goes we may still
need a wrapper to add in the 64 bit flag, in which case not much will change.

Hmm, rereading the header I see that the 64 bit flag is not used everywhere, a bunch
of ioctls use VBOXGUEST_IOCTL_CODE_ instead of VBOXGUEST_IOCTL_CODE, so they never
get the flag (really 2 different macros with only a _ difference in name is a bad
idea).

So the flag is only used on the following calls:

VBOXGUEST_IOCTL_GETVMMDEVPORT

VBOXGUEST_IOCTL_WRITE_CORE_DUMP
VBOXGUEST_IOCTL_HGCM_CONNECT
VBOXGUEST_IOCTL_HGCM_DISCONNECT
VBOXGUEST_IOCTL_HGCM_CALL
VBOXGUEST_IOCTL_HGCM_CALL_TIMED

VBOXGUEST_IOCTL_HGCM_CALL_USERDATA
VBOXGUEST_IOCTL_SET_MOUSE_NOTIFY_CALLBACK
VBOXGUEST_IOCTL_GUEST_CAPS_ACQUIRE

Of which the first one and the last 3 are not used under Linux.

So would could use _IOWR('V', ... directly for most of the ioctls
and then keep the 64 bit flag and have:

VBOXGUEST_IOCTL_CODE
VBOXGUEST_IOCTL_CODE32

Just for the HGCM calls.

Also I believe it would be good to move the following over to
not using the 64 bit flag:

VBOXGUEST_IOCTL_WRITE_CORE_DUMP
VBOXGUEST_IOCTL_GETVMMDEVPORT
VBOXGUEST_IOCTL_HGCM_CALL_USERDATA
VBOXGUEST_IOCTL_SET_MOUSE_NOTIFY_CALLBACK
VBOXGUEST_IOCTL_GUEST_CAPS_ACQUIRE

Since they have no 32 bit equivalent. Michael, Knut would you
be willing to take a patch moving those to using
VBOXGUEST_IOCTL_CODE_ and thus not using the 64 bit flag?

>> I think that instead we need an ioctl_direction_flags table which can
>> be used to determine if we need todo the copy_from_user and friends,
>> without changing the codes.
> 
> I would not bother with that. Either we fix the command codes to correctly
> reflect the direction, or we should actually copy the data both ways as the
> current implementation does.

Ok.

Regards,

Hans
Hans de Goede Aug. 21, 2017, 11:43 a.m. UTC | #4
Hi,

On 14-08-17 14:15, Hans de Goede wrote:
> Hi,
> 
> On 14-08-17 11:30, Arnd Bergmann wrote:
>> On Sat, Aug 12, 2017 at 11:56 PM, Hans de Goede <hdegoede@redhat.com> wrote:
>>> On 11-08-17 23:23, Arnd Bergmann wrote:
>>>> On Fri, Aug 11, 2017 at 3:23 PM, Hans de Goede <hdegoede@redhat.com>
>>>> wrote:
>>>> Most of the issues I would normally comment on are already moot based
>>>> on the assumption that we won't be able to make substantial changes to
>>>> either the user space portion or the hypervisor side.
>>>>
>>>>> +/**
>>>>> + * Inflate the balloon by one chunk.
>>>>> + *
>>>>> + * The caller owns the balloon mutex.
>>>>> + *
>>>>> + * @returns VBox status code
>>>>> + * @param   gdev        The Guest extension device.
>>>>> + * @param   chunk_idx  Index of the chunk.
>>>>> + */
>>>>> +static int vbg_balloon_inflate(PVBOXGUESTDEVEXT gdev, u32 chunk_idx)
>>>>> +{
>>>>> +       VMMDevChangeMemBalloon *req = gdev->mem_balloon.change_req;
>>>>> +       struct page **pages;
>>>>> +       int i, rc;
>>>>> +
>>>>> +       pages = kmalloc(sizeof(*pages) *
>>>>> VMMDEV_MEMORY_BALLOON_CHUNK_PAGES,
>>>>> +                       GFP_KERNEL | __GFP_NOWARN);
>>>>> +       if (!pages)
>>>>> +               return VERR_NO_MEMORY;
>>>>> +
>>>>> +       req->header.size = sizeof(*req);
>>>>> +       req->inflate = true;
>>>>> +       req->pages = VMMDEV_MEMORY_BALLOON_CHUNK_PAGES;
>>>>
>>>>
>>>> The balloon section seems to be rather simplistic with its ioctl
>>>> interface.
>>>> Ideally this should use CONFIG_BALLOON_COMPACTION and an
>>>> oom notifier like the virtio_balloon driver does. Yes, I realize that only
>>>> one of the six (or more) balloon drivers we have in the kernel does it ;-)
>>>
>>>
>>> The way the balloon code works is that the baloon-size is fully controlled
>>> by the host, all the guest-additions do is wait for an event indicating that
>>> the host wants to change it and then call this ioctl which will get the
>>> desired balloon size from the host and tries to adjust the size accordingly.
>>>
>>> Hooking into the oom killer is not really useful because there is no way we
>>> can indicate to the host we are running low on mem and I don't think that
>>> trying to shrink the balloon to be smaller then the host requested size
>>> is a good idea.
>>
>> Are you sure the guest can't just claim the memory back by using it?
> 
> I don't think so the original code certainly does not do anything of
> the sorts, the balloon code uses alloc_page(GFP_KERNEL | __GFP_NOWARN)
> and only calls __free_page() again on the pages after a successful
> deflate request to the host.
> 
> So first we tell the host we want a chunk (1MiB worth of pages) back and
> only if the host says ok do we call __free_page which means it can
> get used for something else again.

Ok, so while working on the final cleanups of the code I realized that
the kernel driver will actually re-claim the memory without the host
indicating that this is ok when the vboxservice which monitors for
host events to change the balloon size dies. When it dies the kernel
tries to reclaim all the memory from the host and this seems to work fine.

So we could hook up to an OOM signal and try to reclaim some memory here
then, but the out-of-tree version of the driver does not do this.

Michael, Knut, what is the intended behavior of the kernel here?

I'm also thinking that under Linux since we never use userspace to
alloc memory for the balloon, we don't need to involve userspace at
all, the drivers interrupt handler could detect the event that the
host wants to change the balloonsize itself and start a workqueue
item which then asks for what the new size should be and do the
adjustment all inside the kernel.  I think that will even lead to
simpler code as we no longer need to track the balloon userspace
owner and do reclaim when it dies.

Typing this out I really see no reason to not do this inside the
kernel, so I think I'm going to implement that right away / for
the [RFC v2] I plan to post later today or tomorrow.

Regards,

Hans




> 
> Michael, Knut can you shed some light on this ?
> 
>> Usually that is how the balloon drivers work: the host asks the guest
>> to free up some memory, and the guest frees up memory that it
>> tells to the host, but if the guest runs out of memory, it just starts using
>> it again and the host faults in empty pages.
>>
>> For CONFIG_BALLOON_COMPACTION, you don't even need that,
>> you just put some memory into the balloon before you take some
>> other page out that is required for compaction. This might be less
>> of an issue here when the balloon always operates on 1MB chunks.
> 
> <snip>
> 
>>>>> +        */
>>>>> +       if (cbData <= sizeof(au64Buf) &&
>>>>> +           VBOXGUEST_IOCTL_STRIP_SIZE(uCmd) !=
>>>>> +           VBOXGUEST_IOCTL_STRIP_SIZE(VBOXGUEST_IOCTL_VMMREQUEST(0))) {
>>>>> +               pvBufFree = NULL;
>>>>> +               pvBuf = &au64Buf[0];
>>>>> +       } else {
>>>>> +               /* __GFP_DMA32 for VBOXGUEST_IOCTL_VMMREQUEST */
>>>>> +               pvBufFree = pvBuf = kmalloc(cbData, GFP_KERNEL |
>>>>> __GFP_DMA32);
>>>>> +               if (!pvBuf)
>>>>> +                       return -ENOMEM;
>>>>> +       }
>>>>> +       if (copy_from_user(pvBuf, (void *)ulArg, cbData) == 0) {
>>
>>>
>>>> I'd also change the commands
>>>> to not always do both read and write, but only whichever applies. This
>>>> function
>>>> would then do the copy_from_user/copy_to_user conditionally.
>>>
>>>
>>> This is actually an upstream TODO item I believe. One which would
>>> probably be best fixed by using _IOR/_IOW/_IORW IOCTL macros, but
>>> that needs to be coordinated with upstream.
>>>
>>
>>>> It would be good to clean this up and always use the same structure here.
>>>
>>>
>>> The HGCMFunctionParameter struct describes function-parameters
>>> in a Host-Guest-Control-Method function call, so this is part of the
>>> host ABI interface and we cannot change this.
>>>
>>> Any 32 bit apps running on a 64 bit kernel will be using the 32 bit
>>> version of this struct and we need to translate. >
>> I don't see the difference between changing the command codes and
>> changing the structure layout here. In both cases, that is an incompatible
>> ABI change but shouldn't much impact the source-level ABI if you do
>> it right.
> 
> This struct is defined in vmmdev.h because it is part of the hardware
> interface, a 32 bit host expects the 32 bit version of the struct
> when calling into the host, where as a 64 bit host expects the 64 bit
> version.
> 
> The hgcm-call ioctl uses this struct _directly_, so a 64 bit app
> will pass in the 64 bit version and a 32 bit app the 32 bit version,
> and on 64 bit we need to translate the 32 bit version coming from
> a 32 bit app.
> 
> The only way around this would be to make userspace always see /
> use the 64 bit version but then we would need to always translate
> the structs back to their 32 bit version on 32 bit hosts, so we
> would still end up with a translation function and now we are
> calling it a lot more often.
> 
> 
>>>>> + * The layout of VMMDEV RAM region that contains information for guest.
>>>>> + */
>>>>> +typedef struct VMMDevMemory {
>>>>> +       /** The size of this structure. */
>>>>> +       u32 u32Size;
>>>>> +       /** The structure version. (VMMDEV_MEMORY_VERSION) */
>>>>> +       u32 u32Version;
>>>>> +
>>>>> +       union {
>>>>> +               struct {
>>>>> +                       /** Flag telling that VMMDev has events pending.
>>>>> */
>>>>> +                       bool fHaveEvents;
>>>>> +               } V1_04;
>>>>> +
>>>>
>>>>
>>>> As this is a hardware interface, maybe use u32 instead of 'bool'.
>>>
>>>
>>> I guess that should work here, since we only every do "if (fHaveEvents)"
>>> but in other cases where we also set bool variables in the hardware
>>> interface structs we also need to know which bit(s) get set on true
>>> to move this over to non bool use.
>>>
>>> I agree that the choice of bool in a (virtual) hardware interface
>>> is unfortunate, but we should not be trying to change the (virtual)
>>> hardware definition IMHO.
>>
>> According to the x86 psABI (https://www.uclibc.org/docs/psABI-x86_64.pdf),
>> a _Bool should be represented in memory like an 'unsigned char' but only
>> contain the values 0 or 1 (the upper 7 bits are zero).
>>
>> Before linux-2.6.18, we defined 'bool' as 'typedef enum { false, true}
>> __packed bool;', which would make it a 32-bit integer that can take the
>> values 0 and 1.
>>
>> Inside of the union, the two have the same binary layout
>> on little-endian architectures, as the information is still stored in
>> the low bit of the first of four bytes.
>>
>> I think if you do
>>
>>         union {
>>                 struct {
>>                         /** Flag telling that VMMDev has events pending. */
>>                         u8 fHaveEvents;
>>                         u8 pad[3];
>>                 } V1_04;
>>
>> that will be a portable representation if the existing ABI.
> 
> Ok, looking at the various VMMDEV_ASSERT_SIZE calls in there bool indeed
> seems to be u8, so I should be able to fix this.
> 
> Michael, Knut any input on this ?  Do you want me to submit a similar
> change upstream ?
> 
> <snip>
> 
>>>>> +#define VBOXGUEST_IOCTL_CODE(function, size) \
>>>>> +       VBOXGUEST_IOCTL_CODE_((function) | VBOXGUEST_IOCTL_FLAG, size)
>>>>> +/* Define 32 bit codes to support 32 bit applications in 64 bit guest
>>>>> driver. */
>>>>> +#define VBOXGUEST_IOCTL_CODE_32(function, size) \
>>>>> +VBOXGUEST_IOCTL_CODE_(function, size)
>>>>
>>>>
>>>> If the command numbers can be changed wihtout causing too many
>>>> problems, I'd just do away with these wrappers and use _IOR/_IOW/_IORW
>>>> as needed.
>>>
>>>
>>> The upstream vboxguest.h header defines VBOXGUEST_IOCTL_CODE for many
>>> different guest platforms. I guess not all of them have the
>>> _IOR/_IOW/_IORW flags embedded in the ioctl-code concept and I believe
>>> some of them only have a small amount of bits which can actually be
>>> used for the code, so we cannot just cram these flags in there.
>>
>> Do you have a specific example? I think the BSDs and derived operating
>> systems are actually much stricter with those flags than Linux is, and the
>> macros I suggested originate in BSD.
> 
> So the vbox-upstream header has this:
> 
> /** @name VBoxGuest IOCTL codes and structures.
>   *
>   * The range 0..15 is for basic driver communication.
>   * The range 16..31 is for HGCM communication.
>   * The range 32..47 is reserved for future use.
>   * The range 48..63 is for OS specific communication.
>   * The 7th bit is reserved for future hacks.
>   * The 8th bit is reserved for distinguishing between 32-bit and 64-bit
>   * processes in future 64-bit guest additions.
>   *
>   * @remarks When creating new IOCtl interfaces keep in mind that not all OSes supports
>   *          reporting back the output size. (This got messed up a little bit in VBoxDrv.)
>   *
>   *          The request size is also a little bit tricky as it's passed as part of the
>   *          request code on unix. The size field is 14 bits on Linux, 12 bits on *BSD,
>   *          13 bits Darwin, and 8-bits on Solaris. All the BSDs and Darwin kernels
>   *          will make use of the size field, while Linux and Solaris will not. We're of
>   *          course using the size to validate and/or map/lock the request, so it has
>   *          to be valid.
>   *
>   *          For Solaris we will have to do something special though, 255 isn't
>   *          sufficient for all we need. A 4KB restriction (BSD) is probably not
>   *          too problematic (yet) as a general one.
>   *
>   *          More info can be found in SUPDRVIOC.h and related sources.
>   *
>   * @remarks If adding interfaces that only has input or only has output, some new macros
>   *          needs to be created so the most efficient IOCtl data buffering method can be
>   *          used.
>   * @{
>   */
> 
> Reading this a second time I think that the reasons why there are no
> separate R / W / RW versions of the macros is because all current ioctls
> are RW and from the top of my mind (I've not done a full audit) that seems
> correct.
> 
> So yes we should be able to just use _IOWR instead of _IOC(_IOC_READ|_IOC_WRITE
> but depending on how discussion on the 64 bit flag goes we may still
> need a wrapper to add in the 64 bit flag, in which case not much will change.
> 
> Hmm, rereading the header I see that the 64 bit flag is not used everywhere, a bunch
> of ioctls use VBOXGUEST_IOCTL_CODE_ instead of VBOXGUEST_IOCTL_CODE, so they never
> get the flag (really 2 different macros with only a _ difference in name is a bad
> idea).
> 
> So the flag is only used on the following calls:
> 
> VBOXGUEST_IOCTL_GETVMMDEVPORT
> 
> VBOXGUEST_IOCTL_WRITE_CORE_DUMP
> VBOXGUEST_IOCTL_HGCM_CONNECT
> VBOXGUEST_IOCTL_HGCM_DISCONNECT
> VBOXGUEST_IOCTL_HGCM_CALL
> VBOXGUEST_IOCTL_HGCM_CALL_TIMED
> 
> VBOXGUEST_IOCTL_HGCM_CALL_USERDATA
> VBOXGUEST_IOCTL_SET_MOUSE_NOTIFY_CALLBACK
> VBOXGUEST_IOCTL_GUEST_CAPS_ACQUIRE
> 
> Of which the first one and the last 3 are not used under Linux.
> 
> So would could use _IOWR('V', ... directly for most of the ioctls
> and then keep the 64 bit flag and have:
> 
> VBOXGUEST_IOCTL_CODE
> VBOXGUEST_IOCTL_CODE32
> 
> Just for the HGCM calls.
> 
> Also I believe it would be good to move the following over to
> not using the 64 bit flag:
> 
> VBOXGUEST_IOCTL_WRITE_CORE_DUMP
> VBOXGUEST_IOCTL_GETVMMDEVPORT
> VBOXGUEST_IOCTL_HGCM_CALL_USERDATA
> VBOXGUEST_IOCTL_SET_MOUSE_NOTIFY_CALLBACK
> VBOXGUEST_IOCTL_GUEST_CAPS_ACQUIRE
> 
> Since they have no 32 bit equivalent. Michael, Knut would you
> be willing to take a patch moving those to using
> VBOXGUEST_IOCTL_CODE_ and thus not using the 64 bit flag?
> 
>>> I think that instead we need an ioctl_direction_flags table which can
>>> be used to determine if we need todo the copy_from_user and friends,
>>> without changing the codes.
>>
>> I would not bother with that. Either we fix the command codes to correctly
>> reflect the direction, or we should actually copy the data both ways as the
>> current implementation does.
> 
> Ok.
> 
> Regards,
> 
> Hans
Hans de Goede Aug. 21, 2017, 12:04 p.m. UTC | #5
Hi,

On 21-08-17 13:43, Hans de Goede wrote:

<snip>

> I'm also thinking that under Linux since we never use userspace to
> alloc memory for the balloon, we don't need to involve userspace at
> all, the drivers interrupt handler could detect the event that the
> host wants to change the balloonsize itself and start a workqueue
> item which then asks for what the new size should be and do the
> adjustment all inside the kernel.  I think that will even lead to
> simpler code as we no longer need to track the balloon userspace
> owner and do reclaim when it dies.
> 
> Typing this out I really see no reason to not do this inside the
> kernel, so I think I'm going to implement that right away / for
> the [RFC v2] I plan to post later today or tomorrow.

Done:

https://github.com/jwrdegoede/vboxguest/commit/2920882cb0ce9af11a843affc58d4a08ef47e16b

Regards,

Hans
diff mbox

Patch

--- a/include/VBox/VBoxGuest.h
+++ b/include/VBox/VBoxGuest.h
@@ -123,9 +123,9 @@ 
   *          used.
   * @{
   */
-#if defined(RT_ARCH_AMD64) || defined(RT_ARCH_SPARC64)
+#if (defined(RT_ARCH_AMD64) || defined(RT_ARCH_SPARC64)) && !defined(RT_OS_LINUX)
  # define VBOXGUEST_IOCTL_FLAG     128
-#elif defined(RT_ARCH_X86) || defined(RT_ARCH_SPARC)
+#elif defined(RT_ARCH_X86) || defined(RT_ARCH_SPARC) || defined(RT_OS_LINUX)
  # define VBOXGUEST_IOCTL_FLAG     0
  #else
  # error "dunno which arch this is!"