| Message ID | 1460690887-32751-6-git-send-email-famz@redhat.com (mailing list archive) |
|---|---|
| State | New, archived |
On 04/15/2016 06:27 AM, Fam Zheng wrote: > virtlockd in libvirt locks the first byte, we lock byte 1 to avoid > the intervene. > > Suggested-by: "Daniel P. Berrange" <berrange@redhat.com> > Signed-off-by: Fam Zheng <famz@redhat.com> > --- > block/raw-posix.c | 35 +++++++++++++++++++++++++++++++++++ > 1 file changed, 35 insertions(+) > > diff --git a/block/raw-posix.c b/block/raw-posix.c > index 906d5c9..3a2c17f 100644 > --- a/block/raw-posix.c > +++ b/block/raw-posix.c > @@ -35,6 +35,7 @@ > #include "raw-aio.h" > #include "qapi/util.h" > #include "qapi/qmp/qstring.h" > +#include "glib.h" > > #if defined(__APPLE__) && (__MACH__) > #include <paths.h> > @@ -397,6 +398,38 @@ static void raw_attach_aio_context(BlockDriverState *bs, > #endif > } > > +static int raw_lockf(BlockDriverState *bs, BdrvLockfCmd cmd) > +{ > + > + BDRVRawState *s = bs->opaque; > + int ret; > + struct flock fl = (struct flock) { > + .l_whence = SEEK_SET, > + /* Locking byte 1 avoids interfereing with virtlockd. */ > + .l_start = 1, > + .l_len = 1, > + }; > + > + switch (cmd) { > + case BDRV_LOCKF_RWLOCK: > + fl.l_type = F_WRLCK; > + break; > + case BDRV_LOCKF_ROLOCK: > + fl.l_type = F_RDLCK; > + break; > + case BDRV_LOCKF_UNLOCK: > + fl.l_type = F_UNLCK; > + break; > + default: > + abort(); > + } > + ret = fcntl(s->fd, F_SETLK, &fl); > + if (ret) { > + ret = -errno; > + } > + return ret; > +} > + > #ifdef CONFIG_LINUX_AIO > static int raw_set_aio(void **aio_ctx, int *use_aio, int bdrv_flags) > { > @@ -1960,6 +1993,8 @@ BlockDriver bdrv_file = { > .bdrv_detach_aio_context = raw_detach_aio_context, > .bdrv_attach_aio_context = raw_attach_aio_context, > > + .bdrv_lockf = raw_lockf, > + > .create_opts = &raw_create_opts, > }; > would it be better to use int flock(int fd, int operation); DESCRIPTION Apply or remove an advisory lock on the open file specified by fd. The argument operation is one of the following: LOCK_SH Place a shared lock. More than one process may hold a shared lock for a given file at a given time. LOCK_EX Place an exclusive lock. Only one process may hold an exclusive lock for a given file at a given time. LOCK_UN Remove an existing lock held by this process. for this purpose? Sorry that missed this point in the initial review... We will not intersect with libvirt for the case. Den
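[Editor's note: for comparison, below is a minimal sketch of what the flock()-based variant Denis suggests could look like. It is hypothetical and not part of the posted patch; BlockDriverState, BDRVRawState and BdrvLockfCmd are the types used elsewhere in the series, and raw_flock is an invented name.]

```c
/* Hypothetical flock()-based variant of raw_lockf(), shown only to
 * compare with the fcntl() version in the patch. */
#include <sys/file.h>
#include <errno.h>

static int raw_flock(BlockDriverState *bs, BdrvLockfCmd cmd)
{
    BDRVRawState *s = bs->opaque;
    int op;

    switch (cmd) {
    case BDRV_LOCKF_RWLOCK:
        op = LOCK_EX;   /* exclusive: a single writer */
        break;
    case BDRV_LOCKF_ROLOCK:
        op = LOCK_SH;   /* shared: multiple readers */
        break;
    case BDRV_LOCKF_UNLOCK:
        op = LOCK_UN;
        break;
    default:
        abort();
    }
    /* LOCK_NB keeps the call non-blocking, mirroring F_SETLK. */
    if (flock(s->fd, op | LOCK_NB) < 0) {
        return -errno;
    }
    return 0;
}
```

Unlike the fcntl() version, flock() always covers the whole file, so it cannot be restricted to byte 1 to stay out of virtlockd's way.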
On Fri, Apr 15, 2016 at 11:27:55AM +0800, Fam Zheng wrote: > virtlockd in libvirt locks the first byte, we lock byte 1 to avoid > the intervene. > +static int raw_lockf(BlockDriverState *bs, BdrvLockfCmd cmd) > +{ > + > + BDRVRawState *s = bs->opaque; > + int ret; > + struct flock fl = (struct flock) { > + .l_whence = SEEK_SET, > + /* Locking byte 1 avoids interfereing with virtlockd. */ > + .l_start = 1, > + .l_len = 1, > + }; > + > + switch (cmd) { > + case BDRV_LOCKF_RWLOCK: > + fl.l_type = F_WRLCK; > + break; > + case BDRV_LOCKF_ROLOCK: > + fl.l_type = F_RDLCK; > + break; > + case BDRV_LOCKF_UNLOCK: > + fl.l_type = F_UNLCK; > + break; My understanding is this prevents libguestfs from working on live disk images -- we want to be able to read live disk images (using a writable overlay and the real disk image as a read-only backing file). So please don't do this; or suggest how we can ignore the lock for backing files. Rich.
On Sun, 04/17 20:27, Richard W.M. Jones wrote: > On Fri, Apr 15, 2016 at 11:27:55AM +0800, Fam Zheng wrote: > > virtlockd in libvirt locks the first byte, we lock byte 1 to avoid > > the intervene. > > +static int raw_lockf(BlockDriverState *bs, BdrvLockfCmd cmd) > > +{ > > + > > + BDRVRawState *s = bs->opaque; > > + int ret; > > + struct flock fl = (struct flock) { > > + .l_whence = SEEK_SET, > > + /* Locking byte 1 avoids interfereing with virtlockd. */ > > + .l_start = 1, > > + .l_len = 1, > > + }; > > + > > + switch (cmd) { > > + case BDRV_LOCKF_RWLOCK: > > + fl.l_type = F_WRLCK; > > + break; > > + case BDRV_LOCKF_ROLOCK: > > + fl.l_type = F_RDLCK; > > + break; > > + case BDRV_LOCKF_UNLOCK: > > + fl.l_type = F_UNLCK; > > + break; > > My understanding is this prevents libguestfs from working on live disk > images -- we want to be able to read live disk images (using a > writable overlay and the real disk image as a read-only backing file). Do you lock the live image or the backing file? If not, you can just read/write as before, as what the -L option does in this series. Otherwise, you should use an RO lock on the backing image, which still works. Fam
On Sat, 04/16 16:29, Denis V. Lunev wrote: > On 04/15/2016 06:27 AM, Fam Zheng wrote: > >virtlockd in libvirt locks the first byte, we lock byte 1 to avoid > >the intervene. > > > >Suggested-by: "Daniel P. Berrange" <berrange@redhat.com> > >Signed-off-by: Fam Zheng <famz@redhat.com> > >--- > > block/raw-posix.c | 35 +++++++++++++++++++++++++++++++++++ > > 1 file changed, 35 insertions(+) > > > >diff --git a/block/raw-posix.c b/block/raw-posix.c > >index 906d5c9..3a2c17f 100644 > >--- a/block/raw-posix.c > >+++ b/block/raw-posix.c > >@@ -35,6 +35,7 @@ > > #include "raw-aio.h" > > #include "qapi/util.h" > > #include "qapi/qmp/qstring.h" > >+#include "glib.h" > > #if defined(__APPLE__) && (__MACH__) > > #include <paths.h> > >@@ -397,6 +398,38 @@ static void raw_attach_aio_context(BlockDriverState *bs, > > #endif > > } > >+static int raw_lockf(BlockDriverState *bs, BdrvLockfCmd cmd) > >+{ > >+ > >+ BDRVRawState *s = bs->opaque; > >+ int ret; > >+ struct flock fl = (struct flock) { > >+ .l_whence = SEEK_SET, > >+ /* Locking byte 1 avoids interfereing with virtlockd. */ > >+ .l_start = 1, > >+ .l_len = 1, > >+ }; > >+ > >+ switch (cmd) { > >+ case BDRV_LOCKF_RWLOCK: > >+ fl.l_type = F_WRLCK; > >+ break; > >+ case BDRV_LOCKF_ROLOCK: > >+ fl.l_type = F_RDLCK; > >+ break; > >+ case BDRV_LOCKF_UNLOCK: > >+ fl.l_type = F_UNLCK; > >+ break; > >+ default: > >+ abort(); > >+ } > >+ ret = fcntl(s->fd, F_SETLK, &fl); > >+ if (ret) { > >+ ret = -errno; > >+ } > >+ return ret; > >+} > >+ > > #ifdef CONFIG_LINUX_AIO > > static int raw_set_aio(void **aio_ctx, int *use_aio, int bdrv_flags) > > { > >@@ -1960,6 +1993,8 @@ BlockDriver bdrv_file = { > > .bdrv_detach_aio_context = raw_detach_aio_context, > > .bdrv_attach_aio_context = raw_attach_aio_context, > >+ .bdrv_lockf = raw_lockf, > >+ > > .create_opts = &raw_create_opts, > > }; > would it be better to use > > int flock(int fd, int operation); > > DESCRIPTION > Apply or remove an advisory lock on the open file specified by fd. > The > argument operation is one of the following: > > LOCK_SH Place a shared lock. More than one process may hold > a > shared lock for a given file at a given time. > > LOCK_EX Place an exclusive lock. Only one process may hold > an > exclusive lock for a given file at a given time. > > LOCK_UN Remove an existing lock held by this process. > > for this purpose? > > Sorry that missed this point in the initial review... > We will not intersect with libvirt for the case. > As noted in the cover letter, flock() is nop on NFS mount points on Linux, so fcntl is safer. Fam
On 04/18/2016 04:12 AM, Fam Zheng wrote: > On Sat, 04/16 16:29, Denis V. Lunev wrote: >> On 04/15/2016 06:27 AM, Fam Zheng wrote: >>> virtlockd in libvirt locks the first byte, we lock byte 1 to avoid >>> the intervene. >>> >>> Suggested-by: "Daniel P. Berrange" <berrange@redhat.com> >>> Signed-off-by: Fam Zheng <famz@redhat.com> >>> --- >>> block/raw-posix.c | 35 +++++++++++++++++++++++++++++++++++ >>> 1 file changed, 35 insertions(+) >>> >>> diff --git a/block/raw-posix.c b/block/raw-posix.c >>> index 906d5c9..3a2c17f 100644 >>> --- a/block/raw-posix.c >>> +++ b/block/raw-posix.c >>> @@ -35,6 +35,7 @@ >>> #include "raw-aio.h" >>> #include "qapi/util.h" >>> #include "qapi/qmp/qstring.h" >>> +#include "glib.h" >>> #if defined(__APPLE__) && (__MACH__) >>> #include <paths.h> >>> @@ -397,6 +398,38 @@ static void raw_attach_aio_context(BlockDriverState *bs, >>> #endif >>> } >>> +static int raw_lockf(BlockDriverState *bs, BdrvLockfCmd cmd) >>> +{ >>> + >>> + BDRVRawState *s = bs->opaque; >>> + int ret; >>> + struct flock fl = (struct flock) { >>> + .l_whence = SEEK_SET, >>> + /* Locking byte 1 avoids interfereing with virtlockd. */ >>> + .l_start = 1, >>> + .l_len = 1, >>> + }; >>> + >>> + switch (cmd) { >>> + case BDRV_LOCKF_RWLOCK: >>> + fl.l_type = F_WRLCK; >>> + break; >>> + case BDRV_LOCKF_ROLOCK: >>> + fl.l_type = F_RDLCK; >>> + break; >>> + case BDRV_LOCKF_UNLOCK: >>> + fl.l_type = F_UNLCK; >>> + break; >>> + default: >>> + abort(); >>> + } >>> + ret = fcntl(s->fd, F_SETLK, &fl); >>> + if (ret) { >>> + ret = -errno; >>> + } >>> + return ret; >>> +} >>> + >>> #ifdef CONFIG_LINUX_AIO >>> static int raw_set_aio(void **aio_ctx, int *use_aio, int bdrv_flags) >>> { >>> @@ -1960,6 +1993,8 @@ BlockDriver bdrv_file = { >>> .bdrv_detach_aio_context = raw_detach_aio_context, >>> .bdrv_attach_aio_context = raw_attach_aio_context, >>> + .bdrv_lockf = raw_lockf, >>> + >>> .create_opts = &raw_create_opts, >>> }; >> would it be better to use >> >> int flock(int fd, int operation); >> >> DESCRIPTION >> Apply or remove an advisory lock on the open file specified by fd. >> The >> argument operation is one of the following: >> >> LOCK_SH Place a shared lock. More than one process may hold >> a >> shared lock for a given file at a given time. >> >> LOCK_EX Place an exclusive lock. Only one process may hold >> an >> exclusive lock for a given file at a given time. >> >> LOCK_UN Remove an existing lock held by this process. >> >> for this purpose? >> >> Sorry that missed this point in the initial review... >> We will not intersect with libvirt for the case. >> > As noted in the cover letter, flock() is nop on NFS mount points on Linux, so > fcntl is safer. > > Fam > that seems STRANGE to me. For the time being flock() was working for Linux NFS server. Here is the implementation. int nfs_flock(struct file *filp, int cmd, struct file_lock *fl) { struct inode *inode = filp->f_mapping->host; int is_local = 0; dprintk("NFS: flock(%pD2, t=%x, fl=%x)\n", filp, fl->fl_type, fl->fl_flags); if (!(fl->fl_flags & FL_FLOCK)) return -ENOLCK; /* * The NFSv4 protocol doesn't support LOCK_MAND, which is not part of * any standard. In principle we might be able to support LOCK_MAND * on NFSv2/3 since NLMv3/4 support DOS share modes, but for now the * NFS code is not set up for it. 
     */
    if (fl->fl_type & LOCK_MAND)
        return -EINVAL;

    if (NFS_SERVER(inode)->flags & NFS_MOUNT_LOCAL_FLOCK)
        is_local = 1;

    /* We're simulating flock() locks using posix locks on the server */
    if (fl->fl_type == F_UNLCK)
        return do_unlk(filp, cmd, fl, is_local);
    return do_setlk(filp, cmd, fl, is_local);
}

Den
On Mon, Apr 18, 2016 at 09:10:36AM +0800, Fam Zheng wrote: > On Sun, 04/17 20:27, Richard W.M. Jones wrote: > > On Fri, Apr 15, 2016 at 11:27:55AM +0800, Fam Zheng wrote: > > > virtlockd in libvirt locks the first byte, we lock byte 1 to avoid > > > the intervene. > > > +static int raw_lockf(BlockDriverState *bs, BdrvLockfCmd cmd) > > > +{ > > > + > > > + BDRVRawState *s = bs->opaque; > > > + int ret; > > > + struct flock fl = (struct flock) { > > > + .l_whence = SEEK_SET, > > > + /* Locking byte 1 avoids interfereing with virtlockd. */ > > > + .l_start = 1, > > > + .l_len = 1, > > > + }; > > > + > > > + switch (cmd) { > > > + case BDRV_LOCKF_RWLOCK: > > > + fl.l_type = F_WRLCK; > > > + break; > > > + case BDRV_LOCKF_ROLOCK: > > > + fl.l_type = F_RDLCK; > > > + break; > > > + case BDRV_LOCKF_UNLOCK: > > > + fl.l_type = F_UNLCK; > > > + break; > > > > My understanding is this prevents libguestfs from working on live disk > > images -- we want to be able to read live disk images (using a > > writable overlay and the real disk image as a read-only backing file). > > Do you lock the live image or the backing file? At the moment we don't need to do anything, but we do have the problem that if someone uses libguestfs in write mode on a live image, then obviously they end up with a corrupted guest. However in this email I'm talking about using libguestfs in "read-only" mode on a live guest, which is a completely different thing, and should not be prevented. My assumption [with this patch applied] is the live image (being live) is locked by some other qemu. Now libguestfs creates a temporary overlay with the live image as backing, using a command similar to: qemu-img create -f qcow2 -b live.img tmp-overlay.img and opens 'tmp-overlay.img'. But since the live image has an exclusive lock, the open will *fail*, and that is a problem. > If not, you can just read/write as before, as what the -L option > does in this series. Otherwise, you should use an RO lock on the > backing image, which still works. fcntl-locks don't allow a shared lock to share with an exclusive lock. Rich.
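[Editor's note: a standalone illustration of the conflict Rich describes — with POSIX record locks, a read (shared) lock request fails while another process holds a write (exclusive) lock on the same byte. This is a self-contained demo using a scratch file, not QEMU code.]

```c
/* Demo: an F_RDLCK request on byte 1 fails while another process holds
 * F_WRLCK on that byte. Illustration only; not QEMU code. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static int lock_byte1(int fd, short type)
{
    struct flock fl = {
        .l_type   = type,
        .l_whence = SEEK_SET,
        .l_start  = 1,
        .l_len    = 1,
    };
    return fcntl(fd, F_SETLK, &fl) == -1 ? -errno : 0;
}

int main(void)
{
    int fd = open("scratch.img", O_RDWR | O_CREAT, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    if (fork() == 0) {
        /* Child plays the running VM: take the exclusive lock and hold it. */
        lock_byte1(fd, F_WRLCK);
        sleep(2);
        _exit(0);
    }

    sleep(1);   /* crude, but enough for a demo: let the child lock first */

    /* Parent plays the overlay user asking for a shared lock.  POSIX locks
     * are owned per process, so the child's write lock conflicts even
     * though both processes refer to the same file. */
    int ret = lock_byte1(fd, F_RDLCK);
    printf("shared lock: %s\n", ret == 0 ? "acquired" : strerror(-ret));

    wait(NULL);
    return 0;
}
```

This is exactly what the overlay case would hit against a live image: the shared lock request is refused, so the read-only open fails.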
On Mon, Apr 18, 2016 at 09:12:44AM +0800, Fam Zheng wrote: > On Sat, 04/16 16:29, Denis V. Lunev wrote: > > On 04/15/2016 06:27 AM, Fam Zheng wrote: > > >virtlockd in libvirt locks the first byte, we lock byte 1 to avoid > > >the intervene. > > > > > >Suggested-by: "Daniel P. Berrange" <berrange@redhat.com> > > >Signed-off-by: Fam Zheng <famz@redhat.com> > > >--- > > > block/raw-posix.c | 35 +++++++++++++++++++++++++++++++++++ > > > 1 file changed, 35 insertions(+) > > >@@ -1960,6 +1993,8 @@ BlockDriver bdrv_file = { > > > .bdrv_detach_aio_context = raw_detach_aio_context, > > > .bdrv_attach_aio_context = raw_attach_aio_context, > > >+ .bdrv_lockf = raw_lockf, > > >+ > > > .create_opts = &raw_create_opts, > > > }; > > would it be better to use > > > > int flock(int fd, int operation); > > > > DESCRIPTION > > Apply or remove an advisory lock on the open file specified by fd. > > The > > argument operation is one of the following: > > > > LOCK_SH Place a shared lock. More than one process may hold > > a > > shared lock for a given file at a given time. > > > > LOCK_EX Place an exclusive lock. Only one process may hold > > an > > exclusive lock for a given file at a given time. > > > > LOCK_UN Remove an existing lock held by this process. > > > > for this purpose? > > > > Sorry that missed this point in the initial review... > > We will not intersect with libvirt for the case. > > As noted in the cover letter, flock() is nop on NFS mount points on Linux, so > fcntl is safer. Actually on Modern linux flock is implemented in terms of fcntl on Linux. Flock does not do byte-range locking so by using flock on NFS, IIUC you'll get a fcntl lock for the entire file range on the server, which will clash with the fcntl lock that virtlogd has acquired. On older linux flock is a no-op on NFS. Other UNIX have varying degrees of usable for flock on NFS. So in general fcntl is a better bet for NFS compatibility and will not clash with libvirt. The only minor issue is that some versions of OCFS do not support fcntl locks at all, only flock. Regards, Daniel
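[Editor's note: to make the byte ranges concrete, here is how the three locks in question would look as fcntl() requests. The virtlockd and NFS-emulation details are as described in this thread — first byte, whole-file POSIX lock on the server — not taken from their source code, and the lock types are only illustrative.]

```c
#include <fcntl.h>

/* virtlockd: locks the first byte of the image (per this thread). */
struct flock virtlockd_range = {
    .l_type = F_WRLCK, .l_whence = SEEK_SET, .l_start = 0, .l_len = 1,
};

/* This patch: byte 1, deliberately disjoint from virtlockd's byte 0. */
struct flock qemu_range = {
    .l_type = F_WRLCK, .l_whence = SEEK_SET, .l_start = 1, .l_len = 1,
};

/* flock() emulated over NFS: effectively a POSIX lock on the whole file
 * (l_len == 0 means "to the end of the file"), overlapping both ranges
 * above and therefore clashing with virtlockd. */
struct flock emulated_flock_range = {
    .l_type = F_WRLCK, .l_whence = SEEK_SET, .l_start = 0, .l_len = 0,
};
```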
On 04/18/2016 12:34 PM, Daniel P. Berrange wrote: > On Mon, Apr 18, 2016 at 09:12:44AM +0800, Fam Zheng wrote: >> On Sat, 04/16 16:29, Denis V. Lunev wrote: >>> On 04/15/2016 06:27 AM, Fam Zheng wrote: >>>> virtlockd in libvirt locks the first byte, we lock byte 1 to avoid >>>> the intervene. >>>> >>>> Suggested-by: "Daniel P. Berrange" <berrange@redhat.com> >>>> Signed-off-by: Fam Zheng <famz@redhat.com> >>>> --- >>>> block/raw-posix.c | 35 +++++++++++++++++++++++++++++++++++ >>>> 1 file changed, 35 insertions(+) > >>>> @@ -1960,6 +1993,8 @@ BlockDriver bdrv_file = { >>>> .bdrv_detach_aio_context = raw_detach_aio_context, >>>> .bdrv_attach_aio_context = raw_attach_aio_context, >>>> + .bdrv_lockf = raw_lockf, >>>> + >>>> .create_opts = &raw_create_opts, >>>> }; >>> would it be better to use >>> >>> int flock(int fd, int operation); >>> >>> DESCRIPTION >>> Apply or remove an advisory lock on the open file specified by fd. >>> The >>> argument operation is one of the following: >>> >>> LOCK_SH Place a shared lock. More than one process may hold >>> a >>> shared lock for a given file at a given time. >>> >>> LOCK_EX Place an exclusive lock. Only one process may hold >>> an >>> exclusive lock for a given file at a given time. >>> >>> LOCK_UN Remove an existing lock held by this process. >>> >>> for this purpose? >>> >>> Sorry that missed this point in the initial review... >>> We will not intersect with libvirt for the case. >> As noted in the cover letter, flock() is nop on NFS mount points on Linux, so >> fcntl is safer. > Actually on Modern linux flock is implemented in terms of fcntl > on Linux. Flock does not do byte-range locking so by using > flock on NFS, IIUC you'll get a fcntl lock for the entire file > range on the server, which will clash with the fcntl lock that > virtlogd has acquired. On older linux flock is a no-op on NFS. > Other UNIX have varying degrees of usable for flock on NFS. So > in general fcntl is a better bet for NFS compatibility and will > not clash with libvirt. > > The only minor issue is that some versions of OCFS do not support > fcntl locks at all, only flock. > > Regards, > Daniel thank you for clarification this is not a bid deal at all :) This approach would work fine. Den
On Mon, 04/18 09:04, Richard W.M. Jones wrote: > On Mon, Apr 18, 2016 at 09:10:36AM +0800, Fam Zheng wrote: > > On Sun, 04/17 20:27, Richard W.M. Jones wrote: > > > On Fri, Apr 15, 2016 at 11:27:55AM +0800, Fam Zheng wrote: > > > > virtlockd in libvirt locks the first byte, we lock byte 1 to avoid > > > > the intervene. > > > > +static int raw_lockf(BlockDriverState *bs, BdrvLockfCmd cmd) > > > > +{ > > > > + > > > > + BDRVRawState *s = bs->opaque; > > > > + int ret; > > > > + struct flock fl = (struct flock) { > > > > + .l_whence = SEEK_SET, > > > > + /* Locking byte 1 avoids interfereing with virtlockd. */ > > > > + .l_start = 1, > > > > + .l_len = 1, > > > > + }; > > > > + > > > > + switch (cmd) { > > > > + case BDRV_LOCKF_RWLOCK: > > > > + fl.l_type = F_WRLCK; > > > > + break; > > > > + case BDRV_LOCKF_ROLOCK: > > > > + fl.l_type = F_RDLCK; > > > > + break; > > > > + case BDRV_LOCKF_UNLOCK: > > > > + fl.l_type = F_UNLCK; > > > > + break; > > > > > > My understanding is this prevents libguestfs from working on live disk > > > images -- we want to be able to read live disk images (using a > > > writable overlay and the real disk image as a read-only backing file). > > > > Do you lock the live image or the backing file? > > At the moment we don't need to do anything, but we do have the problem > that if someone uses libguestfs in write mode on a live image, then > obviously they end up with a corrupted guest. > > However in this email I'm talking about using libguestfs in > "read-only" mode on a live guest, which is a completely different > thing, and should not be prevented. > > My assumption [with this patch applied] is the live image (being live) > is locked by some other qemu. > > Now libguestfs creates a temporary overlay with the live image as > backing, using a command similar to: > > qemu-img create -f qcow2 -b live.img tmp-overlay.img > > and opens 'tmp-overlay.img'. But since the live image has an > exclusive lock, the open will *fail*, and that is a problem. I'm not sure that is what we want to support. The image is being read-write open by the VM, any other accessing is a bad idea. Fam > > > If not, you can just read/write as before, as what the -L option > > does in this series. Otherwise, you should use an RO lock on the > > backing image, which still works. > > fcntl-locks don't allow a shared lock to share with an exclusive lock. > > Rich. > > -- > Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones > Read my programming and virtualization blog: http://rwmj.wordpress.com > virt-df lists disk usage of guests without needing to install any > software inside the virtual machine. Supports Linux and Windows. > http://people.redhat.com/~rjones/virt-df/
On Tue, Apr 19, 2016 at 08:37:14PM +0800, Fam Zheng wrote: > On Mon, 04/18 09:04, Richard W.M. Jones wrote: > > On Mon, Apr 18, 2016 at 09:10:36AM +0800, Fam Zheng wrote: > > > On Sun, 04/17 20:27, Richard W.M. Jones wrote: > > > > On Fri, Apr 15, 2016 at 11:27:55AM +0800, Fam Zheng wrote: > > > > > virtlockd in libvirt locks the first byte, we lock byte 1 to avoid > > > > > the intervene. > > > > > +static int raw_lockf(BlockDriverState *bs, BdrvLockfCmd cmd) > > > > > +{ > > > > > + > > > > > + BDRVRawState *s = bs->opaque; > > > > > + int ret; > > > > > + struct flock fl = (struct flock) { > > > > > + .l_whence = SEEK_SET, > > > > > + /* Locking byte 1 avoids interfereing with virtlockd. */ > > > > > + .l_start = 1, > > > > > + .l_len = 1, > > > > > + }; > > > > > + > > > > > + switch (cmd) { > > > > > + case BDRV_LOCKF_RWLOCK: > > > > > + fl.l_type = F_WRLCK; > > > > > + break; > > > > > + case BDRV_LOCKF_ROLOCK: > > > > > + fl.l_type = F_RDLCK; > > > > > + break; > > > > > + case BDRV_LOCKF_UNLOCK: > > > > > + fl.l_type = F_UNLCK; > > > > > + break; > > > > > > > > My understanding is this prevents libguestfs from working on live disk > > > > images -- we want to be able to read live disk images (using a > > > > writable overlay and the real disk image as a read-only backing file). > > > > > > Do you lock the live image or the backing file? > > > > At the moment we don't need to do anything, but we do have the problem > > that if someone uses libguestfs in write mode on a live image, then > > obviously they end up with a corrupted guest. > > > > However in this email I'm talking about using libguestfs in > > "read-only" mode on a live guest, which is a completely different > > thing, and should not be prevented. > > > > My assumption [with this patch applied] is the live image (being live) > > is locked by some other qemu. > > > > Now libguestfs creates a temporary overlay with the live image as > > backing, using a command similar to: > > > > qemu-img create -f qcow2 -b live.img tmp-overlay.img > > > > and opens 'tmp-overlay.img'. But since the live image has an > > exclusive lock, the open will *fail*, and that is a problem. > > I'm not sure that is what we want to support. The image is being read-write > open by the VM, any other accessing is a bad idea. We've done this successfully for years, for people monitoring their VMs using virt-df, pulling out files using guestfish and so on. We allow you to do it while the guest is live and running, with the proviso that a consistent view cannot always be guaranteed (although it works so reliably that it's never really a problem), or users can briefly pause VMs if they need a guaranteed consistent view. I'm afraid the onus is on you to explain why this existing practice is a bad idea. Rich.
On Tue, 04/19 14:07, Richard W.M. Jones wrote: > On Tue, Apr 19, 2016 at 08:37:14PM +0800, Fam Zheng wrote: > > On Mon, 04/18 09:04, Richard W.M. Jones wrote: > > > On Mon, Apr 18, 2016 at 09:10:36AM +0800, Fam Zheng wrote: > > > > On Sun, 04/17 20:27, Richard W.M. Jones wrote: > > > > > On Fri, Apr 15, 2016 at 11:27:55AM +0800, Fam Zheng wrote: > > > > > > virtlockd in libvirt locks the first byte, we lock byte 1 to avoid > > > > > > the intervene. > > > > > > +static int raw_lockf(BlockDriverState *bs, BdrvLockfCmd cmd) > > > > > > +{ > > > > > > + > > > > > > + BDRVRawState *s = bs->opaque; > > > > > > + int ret; > > > > > > + struct flock fl = (struct flock) { > > > > > > + .l_whence = SEEK_SET, > > > > > > + /* Locking byte 1 avoids interfereing with virtlockd. */ > > > > > > + .l_start = 1, > > > > > > + .l_len = 1, > > > > > > + }; > > > > > > + > > > > > > + switch (cmd) { > > > > > > + case BDRV_LOCKF_RWLOCK: > > > > > > + fl.l_type = F_WRLCK; > > > > > > + break; > > > > > > + case BDRV_LOCKF_ROLOCK: > > > > > > + fl.l_type = F_RDLCK; > > > > > > + break; > > > > > > + case BDRV_LOCKF_UNLOCK: > > > > > > + fl.l_type = F_UNLCK; > > > > > > + break; > > > > > > > > > > My understanding is this prevents libguestfs from working on live disk > > > > > images -- we want to be able to read live disk images (using a > > > > > writable overlay and the real disk image as a read-only backing file). > > > > > > > > Do you lock the live image or the backing file? > > > > > > At the moment we don't need to do anything, but we do have the problem > > > that if someone uses libguestfs in write mode on a live image, then > > > obviously they end up with a corrupted guest. > > > > > > However in this email I'm talking about using libguestfs in > > > "read-only" mode on a live guest, which is a completely different > > > thing, and should not be prevented. > > > > > > My assumption [with this patch applied] is the live image (being live) > > > is locked by some other qemu. > > > > > > Now libguestfs creates a temporary overlay with the live image as > > > backing, using a command similar to: > > > > > > qemu-img create -f qcow2 -b live.img tmp-overlay.img > > > > > > and opens 'tmp-overlay.img'. But since the live image has an > > > exclusive lock, the open will *fail*, and that is a problem. > > > > I'm not sure that is what we want to support. The image is being read-write > > open by the VM, any other accessing is a bad idea. > > We've done this successfully for years, for people monitoring their > VMs using virt-df, pulling out files using guestfish and so on. We > allow you to do it while the guest is live and running, with the > proviso that a consistent view cannot always be guaranteed (although > it works so reliably that it's never really a problem), or users can > briefly pause VMs if they need a guaranteed consistent view. > > I'm afraid the onus is on you to explain why this existing practice is > a bad idea. It is bad idea because it can produce erroneous data. Perhaps it's not critical and is rare enough to be practically useful. As a tradeoff, I guess, we can skip the shared lock in this series. Does that work for you? Fam
On Mon, Apr 18, 2016 at 09:04:19AM +0100, Richard W.M. Jones wrote: > On Mon, Apr 18, 2016 at 09:10:36AM +0800, Fam Zheng wrote: > > On Sun, 04/17 20:27, Richard W.M. Jones wrote: > > > On Fri, Apr 15, 2016 at 11:27:55AM +0800, Fam Zheng wrote: > > > > virtlockd in libvirt locks the first byte, we lock byte 1 to avoid > > > > the intervene. > > > > +static int raw_lockf(BlockDriverState *bs, BdrvLockfCmd cmd) > > > > +{ > > > > + > > > > + BDRVRawState *s = bs->opaque; > > > > + int ret; > > > > + struct flock fl = (struct flock) { > > > > + .l_whence = SEEK_SET, > > > > + /* Locking byte 1 avoids interfereing with virtlockd. */ > > > > + .l_start = 1, > > > > + .l_len = 1, > > > > + }; > > > > + > > > > + switch (cmd) { > > > > + case BDRV_LOCKF_RWLOCK: > > > > + fl.l_type = F_WRLCK; > > > > + break; > > > > + case BDRV_LOCKF_ROLOCK: > > > > + fl.l_type = F_RDLCK; > > > > + break; > > > > + case BDRV_LOCKF_UNLOCK: > > > > + fl.l_type = F_UNLCK; > > > > + break; > > > > > > My understanding is this prevents libguestfs from working on live disk > > > images -- we want to be able to read live disk images (using a > > > writable overlay and the real disk image as a read-only backing file). > > > > Do you lock the live image or the backing file? > > At the moment we don't need to do anything, but we do have the problem > that if someone uses libguestfs in write mode on a live image, then > obviously they end up with a corrupted guest. > > However in this email I'm talking about using libguestfs in > "read-only" mode on a live guest, which is a completely different > thing, and should not be prevented. > > My assumption [with this patch applied] is the live image (being live) > is locked by some other qemu. > > Now libguestfs creates a temporary overlay with the live image as > backing, using a command similar to: > > qemu-img create -f qcow2 -b live.img tmp-overlay.img > > and opens 'tmp-overlay.img'. But since the live image has an > exclusive lock, the open will *fail*, and that is a problem. Have you ever considered integration with the QEMU NBD server. We don't have APIs for enabling it explicitly in libvirt, but it strikes me that it could be ideally suited for your needs. eg a hypothetical libvirt command to export a disk via NBD: virsh dom-export-disk myguest --readonly vda1 localhost 9000 qemu-img create -f qcow2 -b nbd:localhost:9000 tmp-overlay.img ...do stuff... virsh dom-unexport-disk myguest vda1 Or to make cleanup easier, perhaps there's a way to tell QEMU to close its NBD server after it has had 1 client connection. With this approach, you wouldn't need to take any lock on the underlying real image. Regards, Daniel
On Tue, Apr 19, 2016 at 09:19:02PM +0800, Fam Zheng wrote: > On Tue, 04/19 14:07, Richard W.M. Jones wrote: > > We've done this successfully for years, for people monitoring their > > VMs using virt-df, pulling out files using guestfish and so on. We > > allow you to do it while the guest is live and running, with the > > proviso that a consistent view cannot always be guaranteed (although > > it works so reliably that it's never really a problem), or users can > > briefly pause VMs if they need a guaranteed consistent view. > > > > I'm afraid the onus is on you to explain why this existing practice is > > a bad idea. > > It is bad idea because it can produce erroneous data. Perhaps it's not > critical and is rare enough to be practically useful. As explained above, we deal with the inconsistent case (by detecting and ignoring it, or retrying), and then there's the case where we pause the guest to get consistent data. > As a tradeoff, I guess, we can skip the shared lock in this series. Does that > work for you? I'd prefer some kind of no lock / ignore lock. There is a legitimate case where you want to have the shared lock behaviour, but also a legitimate one for turning it off. I'm not opposed to the idea -- there are very real cases where your patch saves people from themselves. Off topic: How does this patch deal with genuinely shared (writable) disks, as used in cluster filesystems? Rich.
On Tue, Apr 19, 2016 at 02:34:30PM +0100, Daniel P. Berrange wrote: > Have you ever considered integration with the QEMU NBD server. We > don't have APIs for enabling it explicitly in libvirt, but it strikes > me that it could be ideally suited for your needs. > > eg a hypothetical libvirt command to export a disk via NBD: > > virsh dom-export-disk myguest --readonly vda1 localhost 9000 > qemu-img create -f qcow2 -b nbd:localhost:9000 tmp-overlay.img > ...do stuff... > virsh dom-unexport-disk myguest vda1 > > Or to make cleanup easier, perhaps there's a way to tell QEMU > to close its NBD server after it has had 1 client connection. > > With this approach, you wouldn't need to take any lock on the > underlying real image. It's been on my todo list for a long time, but this does require libvirt to be involved, and libvirt is not the default for upstream libguestfs. Even with the libvirt backend, there are still uses cases like: virt-df -a /mnt/vms/disk.img -h virt-inspector -a /mnt/vms/disk.img | grep some_insecure_software for f in /mnt/vms/*; do virt-alignment-scan -a $f; done and so on. Rich.
On Tue, Apr 19, 2016 at 02:36:05PM +0100, Richard W.M. Jones wrote: > > I'd prefer some kind of no lock / ignore lock. There is a legitimate > case where you want to have the shared lock behaviour, but also a > legitimate one for turning it off. I'm not opposed to the idea -- > there are very real cases where your patch saves people from > themselves. [snip] > Off topic: How does this patch deal with genuinely shared (writable) > disks, as used in cluster filesystems? This series does add support for a lock_image=no parameter for disks, which turns off all locking. IIUC, this was intended for use when the disk image must be shared between guests. IOW, this series is doing the following:

- read-only, exclusive -> F_RDLCK
- writable, shared -> no locking
- writable, exclusive -> F_WRLCK

This leaves open the possibility of mistakes if one VM is given a disk in write+exclusive mode while another VM gets the same disk in write+shared mode. In virtlockd we took a different approach to this:

- read-only, exclusive -> no locking
- writable, shared -> F_RDLCK
- writable, exclusive -> F_WRLCK

because this lets the kernel prevent the case where the disk image is presented in write+shared mode to one VM and write+exclusive mode to another VM. Not doing any locking at all in the read-only case also allows the scenario needed by libguestfs, where an overlay is created for a disk that is in use. Regards, Daniel
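[Editor's note: expressed as code, the virtlockd-style mapping Daniel describes might look like the following hypothetical helper. It is not part of the posted series, which instead keys its locking off BdrvLockfCmd.]

```c
/* Hypothetical mapping from image access mode to an fcntl lock type,
 * following the virtlockd-style scheme described above.  Returning -1
 * means "take no lock at all". */
#include <fcntl.h>
#include <stdbool.h>

static int lock_type_for(bool writable, bool exclusive)
{
    if (!writable) {
        return -1;              /* read-only users never contend for the lock */
    }
    return exclusive ? F_WRLCK  /* sole writer */
                     : F_RDLCK; /* shared writers, e.g. cluster filesystems */
}
```

With that mapping the kernel refuses the dangerous mix of one exclusive writer and one shared writer, while read-only users such as a libguestfs overlay are never blocked.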
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 906d5c9..3a2c17f 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -35,6 +35,7 @@
 #include "raw-aio.h"
 #include "qapi/util.h"
 #include "qapi/qmp/qstring.h"
+#include "glib.h"

 #if defined(__APPLE__) && (__MACH__)
 #include <paths.h>
@@ -397,6 +398,38 @@ static void raw_attach_aio_context(BlockDriverState *bs,
 #endif
 }

+static int raw_lockf(BlockDriverState *bs, BdrvLockfCmd cmd)
+{
+
+    BDRVRawState *s = bs->opaque;
+    int ret;
+    struct flock fl = (struct flock) {
+        .l_whence = SEEK_SET,
+        /* Locking byte 1 avoids interfering with virtlockd. */
+        .l_start = 1,
+        .l_len = 1,
+    };
+
+    switch (cmd) {
+    case BDRV_LOCKF_RWLOCK:
+        fl.l_type = F_WRLCK;
+        break;
+    case BDRV_LOCKF_ROLOCK:
+        fl.l_type = F_RDLCK;
+        break;
+    case BDRV_LOCKF_UNLOCK:
+        fl.l_type = F_UNLCK;
+        break;
+    default:
+        abort();
+    }
+    ret = fcntl(s->fd, F_SETLK, &fl);
+    if (ret) {
+        ret = -errno;
+    }
+    return ret;
+}
+
 #ifdef CONFIG_LINUX_AIO
 static int raw_set_aio(void **aio_ctx, int *use_aio, int bdrv_flags)
 {
@@ -1960,6 +1993,8 @@ BlockDriver bdrv_file = {
     .bdrv_detach_aio_context = raw_detach_aio_context,
     .bdrv_attach_aio_context = raw_attach_aio_context,

+    .bdrv_lockf = raw_lockf,
+
     .create_opts = &raw_create_opts,
 };
virtlockd in libvirt locks the first byte, so we lock byte 1 to avoid interfering with it.

Suggested-by: "Daniel P. Berrange" <berrange@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
---
 block/raw-posix.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)