diff mbox series

[v2] docs: document file-posix locking protocol

Message ID 20210703135033.835344-1-vsementsov@virtuozzo.com (mailing list archive)
State New, archived
Headers show
Series [v2] docs: document file-posix locking protocol | expand

Commit Message

Vladimir Sementsov-Ogievskiy July 3, 2021, 1:50 p.m. UTC
Let's document how we use file locks in file-posix driver, to allow
external programs to "communicate" in this way with Qemu.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---

v2: improve some descriptions
    add examples
    add notice about old bad POSIX file locks

 docs/system/qemu-block-drivers.rst.inc | 186 +++++++++++++++++++++++++
 1 file changed, 186 insertions(+)

Comments

Nir Soffer July 3, 2021, 2:50 p.m. UTC | #1
On Sat, Jul 3, 2021 at 4:51 PM Vladimir Sementsov-Ogievskiy
<vsementsov@virtuozzo.com> wrote:
>
> Let's document how we use file locks in file-posix driver, to allow
> external programs to "communicate" in this way with Qemu.

This makes the locking implementation public, so qemu can never change
it without breaking external programs. I'm not sure this is an issue since
even now qemu cannot change without breaking compatibility with older
qemu versions.

Maybe a better way to integrate with external programs is to provide
a library/tool to perform locking?

For example we can have tool like:

   qemu-img lock [how] image command

This example will take the lock specified by "how" on image while "command"
is running.

> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>
> v2: improve some descriptions
>     add examples
>     add notice about old bad POSIX file locks
>
>  docs/system/qemu-block-drivers.rst.inc | 186 +++++++++++++++++++++++++
>  1 file changed, 186 insertions(+)
>
> diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
> index 16225710eb..74fb71600d 100644
> --- a/docs/system/qemu-block-drivers.rst.inc
> +++ b/docs/system/qemu-block-drivers.rst.inc
> @@ -909,3 +909,189 @@ some additional tasks, hooking io requests.
>    .. option:: prealloc-size
>
>      How much to preallocate (in bytes), default 128M.
> +
> +Image locking protocol
> +~~~~~~~~~~~~~~~~~~~~~~
> +
> +QEMU holds rd locks and never rw locks. Instead, GETLK fcntl is used with F_WRLCK
> +to handle permissions as described below.
> +QEMU process may rd-lock the following bytes of the image with corresponding
> +meaning:
> +
> +Permission bytes. If permission byte is rd-locked, it means that some process
> +uses corresponding permission on that file.
> +
> +Byte    Operation
> +100     read
> +          Lock holder can read
> +101     write
> +          Lock holder can write
> +102     write-unchanged
> +          Lock holder can write same data if it sure, that this write doesn't
> +          break concurrent readers. This is mostly used internally in Qemu
> +          and it wouldn't be good idea to exploit it somehow.
> +103     resize
> +          Lock holder can resize the file. "write" permission is also required
> +          for resizing, so lock byte 103 only if you also lock byte 101.
> +104     graph-mod
> +          Undefined. QEMU may sometimes locks this byte, but external programs
> +          should not. QEMU will stop locking this byte in future
> +
> +Unshare bytes. If permission byte is rd-locked, it means that some process
> +does not allow the others use corresponding options on that file.
> +
> +Byte    Operation
> +200     read
> +          Lock holder don't allow read operation to other processes.
> +201     write
> +          Lock holder don't allow write operation to other processes. This
> +          still allows others to do write-uncahnged operations. Better not
> +          exploit outside of Qemu.
> +202     write-unchanged
> +          Lock holder don't allow write-unchanged operation to other processes.
> +203     resize
> +          Lock holder don't allow resizing the file by other processes.
> +204     graph-mod
> +          Undefined. QEMU may sometimes locks this byte, but external programs
> +          should not. QEMU will stop locking this byte in future
> +
> +Handling the permissions works as follows: assume we want to open the file to do
> +some operations and in the same time want to disallow some operation to other
> +processes. So, we want to lock some of the bytes described above. We operate as
> +follows:
> +
> +1. rd-lock all needed bytes, both "permission" bytes and "unshare" bytes.
> +
> +2. For each "unshare" byte we rd-locked, do GETLK that "tries" to wr-lock
> +corresponding "permission" byte. So, we check is there any other process that
> +uses the permission we want to unshare. If it exists we fail.
> +
> +3. For each "permission" byte we rd-locked, do GETLK that "tries" to wr-lock
> +corresponding "unshare" byte. So, we check is there any other process that
> +unshares the permission we want to have. If it exists we fail.
> +
> +Important notice: Qemu may fallback to POSIX file locks only if OFD locks
> +unavailable. Other programs should behave similarly: use POSIX file locks
> +only if OFD locks unavailable and if you are OK with drawbacks of POSIX
> +file locks (for example, they are lost on close() of any file descriptor
> +for that file).

Worth an example.

> +
> +Image locking examples
> +~~~~~~~~~~~~~~~~~~~~~~
> +
> +Read-only, allow others to write
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +So, we want to read and don't care what other users do with the image. We only
> +need to lock byte 100. Operation is as follows:
> +
> +1. rd-lock byte 100
> +
> +.. highlight:: c
> +
> +    struct flock fl = {
> +        .l_whence = SEEK_SET,
> +        .l_start  = 100,
> +        .l_len    = 1,
> +        .l_type   = F_RDLCK,
> +    };
> +    ret = fcntl(fd, F_OFD_SETLK, &fl);
> +    if (ret == -1) {
> +        /* Error */
> +    }
> +
> +2. try wr-lock byte 200, to check that no one is against our read access
> +
> +.. highlight:: c
> +
> +    struct flock fl = {
> +        .l_whence = SEEK_SET,
> +        .l_start  = 200,
> +        .l_len    = 1,
> +        .l_type   = F_WRLCK,
> +    };
> +    ret = fcntl(fd, F_OFD_GETLK, &fl);
> +    if (ret != -1 && fl.l_type == F_UNLCK) {
> +        /*
> +         * We are lucky, nobody against. So, now we have RO access
> +         * that we want.
> +         */
> +    } else {
> +        /* Error, or RO access is blocked by someone. We don't have access */
> +    }
> +
> +3. Now we can operate read the data.
> +
> +4. When finished, release the lock:
> +
> +.. highlight:: c
> +
> +    struct flock fl = {
> +        .l_whence = SEEK_SET,
> +        .l_start  = 100,
> +        .l_len    = 1,
> +        .l_type   = F_UNLCK,
> +    };
> +    ret = fcntl(fd, F_OFD_SETLK, &fl);
> +
> +RW, allow others to read only
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +We want to read and write, and don't want others to modify the image.
> +So, let's lock bytes 100, 101, 201. Operation is as follows:
> +
> +1. rd-lock bytes 100 (read), 101 (write), 201 (don't allow others to write)
> +
> +.. highlight:: c
> +
> +    for byte in (100, 101, 201) {

Using python syntax here is a little bit confusing.

> +        struct flock fl = {
> +            .l_whence = SEEK_SET,
> +            .l_start  = byte,
> +            .l_len    = 1,
> +            .l_type   = F_RDLCK,
> +        };
> +        ret = fcntl(fd, F_OFD_SETLK, &fl);
> +        if (ret == -1) {
> +            /* Error */
> +        }
> +    }
> +
> +2. try wr-lock bytes 200 (to check that no one is against our read access),
> +   201 (no one against our write access), 101 (there are no writers currently)
> +
> +.. highlight:: c
> +
> +    for byte in (200, 201, 101) {
> +        struct flock fl = {
> +            .l_whence = SEEK_SET,
> +            .l_start  = byte,
> +            .l_len    = 1,
> +            .l_type   = F_WRLCK,
> +        };
> +        ret = fcntl(fd, F_OFD_GETLK, &fl);
> +        if (ret != -1 && fl.l_type == F_UNLCK) {
> +            /* We are lucky, nobody against. */
> +        } else {
> +            /*
> +             * Error, or feature we want is blocked by someone.
> +             * We don't have access.
> +             */
> +        }
> +    }
> +
> +3. Now we can read and write.
> +
> +4. When finished, release locks:
> +
> +.. highlight:: c
> +
> +    for byte in (100, 101, 201) {
> +        struct flock fl = {
> +            .l_whence = SEEK_SET,
> +            .l_start  = byte,
> +            .l_len    = 1,
> +            .l_type   = F_UNLCK,
> +        };
> +        fcntl(fd, F_OFD_SETLK, &fl);
> +    }
> --
> 2.29.2

Having this is great even if the locking protocol is not made public.

Nir
Vladimir Sementsov-Ogievskiy July 5, 2021, 7:55 a.m. UTC | #2
03.07.2021 17:50, Nir Soffer wrote:
> On Sat, Jul 3, 2021 at 4:51 PM Vladimir Sementsov-Ogievskiy
> <vsementsov@virtuozzo.com> wrote:
>>
>> Let's document how we use file locks in file-posix driver, to allow
>> external programs to "communicate" in this way with Qemu.
> 
> This makes the locking implementation public, so qemu can never change
> it without breaking external programs. I'm not sure this is an issue since
> even now qemu cannot change without breaking compatibility with older
> qemu versions.

Yes, that's my thought too. I think, that's enough to say that we actually have "public" protocol, just undocumented.

Note, that breaking that compatibility may break shared migration, and migration without one host (which may be used for live upgrade of qemu).

> 
> Maybe a better way to integrate with external programs is to provide
> a library/tool to perform locking?
> 
> For example we can have tool like:
> 
>     qemu-img lock [how] image command
> 
> This example will take the lock specified by "how" on image while "command"
> is running.

Having a parallel process, that takes locks "for us" is a pain. At least we should handle unexpected death of that process. Some filesystems may not allow opening file in two processes in some circumstances. And it just breaks normal operation with file locks: lock should be taken by the process that use it.

Library has GPL limitation of use.

> 
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>
>> v2: improve some descriptions
>>      add examples
>>      add notice about old bad POSIX file locks
>>
>>   docs/system/qemu-block-drivers.rst.inc | 186 +++++++++++++++++++++++++
>>   1 file changed, 186 insertions(+)
>>
>> diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
>> index 16225710eb..74fb71600d 100644
>> --- a/docs/system/qemu-block-drivers.rst.inc
>> +++ b/docs/system/qemu-block-drivers.rst.inc
>> @@ -909,3 +909,189 @@ some additional tasks, hooking io requests.
>>     .. option:: prealloc-size
>>
>>       How much to preallocate (in bytes), default 128M.
>> +
>> +Image locking protocol
>> +~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +QEMU holds rd locks and never rw locks. Instead, GETLK fcntl is used with F_WRLCK
>> +to handle permissions as described below.
>> +QEMU process may rd-lock the following bytes of the image with corresponding
>> +meaning:
>> +
>> +Permission bytes. If permission byte is rd-locked, it means that some process
>> +uses corresponding permission on that file.
>> +
>> +Byte    Operation
>> +100     read
>> +          Lock holder can read
>> +101     write
>> +          Lock holder can write
>> +102     write-unchanged
>> +          Lock holder can write same data if it sure, that this write doesn't
>> +          break concurrent readers. This is mostly used internally in Qemu
>> +          and it wouldn't be good idea to exploit it somehow.
>> +103     resize
>> +          Lock holder can resize the file. "write" permission is also required
>> +          for resizing, so lock byte 103 only if you also lock byte 101.
>> +104     graph-mod
>> +          Undefined. QEMU may sometimes locks this byte, but external programs
>> +          should not. QEMU will stop locking this byte in future
>> +
>> +Unshare bytes. If permission byte is rd-locked, it means that some process
>> +does not allow the others use corresponding options on that file.
>> +
>> +Byte    Operation
>> +200     read
>> +          Lock holder don't allow read operation to other processes.
>> +201     write
>> +          Lock holder don't allow write operation to other processes. This
>> +          still allows others to do write-uncahnged operations. Better not
>> +          exploit outside of Qemu.
>> +202     write-unchanged
>> +          Lock holder don't allow write-unchanged operation to other processes.
>> +203     resize
>> +          Lock holder don't allow resizing the file by other processes.
>> +204     graph-mod
>> +          Undefined. QEMU may sometimes locks this byte, but external programs
>> +          should not. QEMU will stop locking this byte in future
>> +
>> +Handling the permissions works as follows: assume we want to open the file to do
>> +some operations and in the same time want to disallow some operation to other
>> +processes. So, we want to lock some of the bytes described above. We operate as
>> +follows:
>> +
>> +1. rd-lock all needed bytes, both "permission" bytes and "unshare" bytes.
>> +
>> +2. For each "unshare" byte we rd-locked, do GETLK that "tries" to wr-lock
>> +corresponding "permission" byte. So, we check is there any other process that
>> +uses the permission we want to unshare. If it exists we fail.
>> +
>> +3. For each "permission" byte we rd-locked, do GETLK that "tries" to wr-lock
>> +corresponding "unshare" byte. So, we check is there any other process that
>> +unshares the permission we want to have. If it exists we fail.
>> +
>> +Important notice: Qemu may fallback to POSIX file locks only if OFD locks
>> +unavailable. Other programs should behave similarly: use POSIX file locks
>> +only if OFD locks unavailable and if you are OK with drawbacks of POSIX
>> +file locks (for example, they are lost on close() of any file descriptor
>> +for that file).
> 
> Worth an example.
> 
>> +
>> +Image locking examples
>> +~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Read-only, allow others to write
>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> +So, we want to read and don't care what other users do with the image. We only
>> +need to lock byte 100. Operation is as follows:
>> +
>> +1. rd-lock byte 100
>> +
>> +.. highlight:: c
>> +
>> +    struct flock fl = {
>> +        .l_whence = SEEK_SET,
>> +        .l_start  = 100,
>> +        .l_len    = 1,
>> +        .l_type   = F_RDLCK,
>> +    };
>> +    ret = fcntl(fd, F_OFD_SETLK, &fl);
>> +    if (ret == -1) {
>> +        /* Error */
>> +    }
>> +
>> +2. try wr-lock byte 200, to check that no one is against our read access
>> +
>> +.. highlight:: c
>> +
>> +    struct flock fl = {
>> +        .l_whence = SEEK_SET,
>> +        .l_start  = 200,
>> +        .l_len    = 1,
>> +        .l_type   = F_WRLCK,
>> +    };
>> +    ret = fcntl(fd, F_OFD_GETLK, &fl);
>> +    if (ret != -1 && fl.l_type == F_UNLCK) {
>> +        /*
>> +         * We are lucky, nobody against. So, now we have RO access
>> +         * that we want.
>> +         */
>> +    } else {
>> +        /* Error, or RO access is blocked by someone. We don't have access */
>> +    }
>> +
>> +3. Now we can operate read the data.
>> +
>> +4. When finished, release the lock:
>> +
>> +.. highlight:: c
>> +
>> +    struct flock fl = {
>> +        .l_whence = SEEK_SET,
>> +        .l_start  = 100,
>> +        .l_len    = 1,
>> +        .l_type   = F_UNLCK,
>> +    };
>> +    ret = fcntl(fd, F_OFD_SETLK, &fl);
>> +
>> +RW, allow others to read only
>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> +We want to read and write, and don't want others to modify the image.
>> +So, let's lock bytes 100, 101, 201. Operation is as follows:
>> +
>> +1. rd-lock bytes 100 (read), 101 (write), 201 (don't allow others to write)
>> +
>> +.. highlight:: c
>> +
>> +    for byte in (100, 101, 201) {
> 
> Using python syntax here is a little bit confusing.
> 
>> +        struct flock fl = {
>> +            .l_whence = SEEK_SET,
>> +            .l_start  = byte,
>> +            .l_len    = 1,
>> +            .l_type   = F_RDLCK,
>> +        };
>> +        ret = fcntl(fd, F_OFD_SETLK, &fl);
>> +        if (ret == -1) {
>> +            /* Error */
>> +        }
>> +    }
>> +
>> +2. try wr-lock bytes 200 (to check that no one is against our read access),
>> +   201 (no one against our write access), 101 (there are no writers currently)
>> +
>> +.. highlight:: c
>> +
>> +    for byte in (200, 201, 101) {
>> +        struct flock fl = {
>> +            .l_whence = SEEK_SET,
>> +            .l_start  = byte,
>> +            .l_len    = 1,
>> +            .l_type   = F_WRLCK,
>> +        };
>> +        ret = fcntl(fd, F_OFD_GETLK, &fl);
>> +        if (ret != -1 && fl.l_type == F_UNLCK) {
>> +            /* We are lucky, nobody against. */
>> +        } else {
>> +            /*
>> +             * Error, or feature we want is blocked by someone.
>> +             * We don't have access.
>> +             */
>> +        }
>> +    }
>> +
>> +3. Now we can read and write.
>> +
>> +4. When finished, release locks:
>> +
>> +.. highlight:: c
>> +
>> +    for byte in (100, 101, 201) {
>> +        struct flock fl = {
>> +            .l_whence = SEEK_SET,
>> +            .l_start  = byte,
>> +            .l_len    = 1,
>> +            .l_type   = F_UNLCK,
>> +        };
>> +        fcntl(fd, F_OFD_SETLK, &fl);
>> +    }
>> --
>> 2.29.2
> 
> Having this is great even if the locking protocol is not made public.
> 
> Nir
>
Denis V. Lunev July 5, 2021, 8:26 a.m. UTC | #3
On 7/5/21 10:55 AM, Vladimir Sementsov-Ogievskiy wrote:
> 03.07.2021 17:50, Nir Soffer wrote:
>> On Sat, Jul 3, 2021 at 4:51 PM Vladimir Sementsov-Ogievskiy
>> <vsementsov@virtuozzo.com> wrote:
>>>
>>> Let's document how we use file locks in file-posix driver, to allow
>>> external programs to "communicate" in this way with Qemu.
>>
>> This makes the locking implementation public, so qemu can never change
>> it without breaking external programs. I'm not sure this is an issue
>> since
>> even now qemu cannot change without breaking compatibility with older
>> qemu versions.
>
> Yes, that's my thought too. I think, that's enough to say that we
> actually have "public" protocol, just undocumented.
>
> Note, that breaking that compatibility may break shared migration, and
> migration without one host (which may be used for live upgrade of qemu).
>
>>
>> Maybe a better way to integrate with external programs is to provide
>> a library/tool to perform locking?
>>
>> For example we can have tool like:
>>
>>     qemu-img lock [how] image command
>>
>> This example will take the lock specified by "how" on image while
>> "command"
>> is running.
>
> Having a parallel process, that takes locks "for us" is a pain. At
> least we should handle unexpected death of that process. Some
> filesystems may not allow opening file in two processes in some
> circumstances. And it just breaks normal operation with file locks:
> lock should be taken by the process that use it.
>
> Library has GPL limitation of use.

and there are also some important consequences. 3rd party implements
QCOW2 support in a 3rd party way. Thus it opens the image and creates
3rd party data structures for it.

It handles metadata processing etc. Thus once the "locking" library will
be ready to operate, some bits indicating that the image is in use would
be on the filesystem, f.e. "busy" state and thus we would need to perform
the "locking" in QEMU code through a very specific QEMU data structure
creation. The library could do locking first, but in that case 3rd party
code would have same problems.

In general, this is not only QCOW2 problem, such locking is a problem
for all supported formats and thus we come to the dilemma:
- to document
- or to provide an utility
In that case we should provide locking for all "alien" formats, which
is looking overcomplicated at my opinion.

In general, the format itself is opened with providing the information
for all 3rd parties to integrate with. The locking is an important part
of interoperability and thus should be published too.

Den
Vladimir Sementsov-Ogievskiy July 15, 2021, 5:13 p.m. UTC | #4
03.07.2021 17:50, Nir Soffer wrote:
> On Sat, Jul 3, 2021 at 4:51 PM Vladimir Sementsov-Ogievskiy
> <vsementsov@virtuozzo.com> wrote:

[..]

>> +
>> +Important notice: Qemu may fallback to POSIX file locks only if OFD locks
>> +unavailable. Other programs should behave similarly: use POSIX file locks
>> +only if OFD locks unavailable and if you are OK with drawbacks of POSIX
>> +file locks (for example, they are lost on close() of any file descriptor
>> +for that file).
> 
> Worth an example.

Hmm.. Copying here the whole #ifdef and probing logic around these locks from Qemu is too much..

I can't imagine what small and short could be added here.

Actually I think, OFD is old enough so we shouldn't care too much about older kernels without it. Let's just rewrite paragraph to something like this:

Don't use POSIX locks, they are known to be unsafe. Qemu uses OFD, so to be compatible, use OFD locks. Qemu may use POSIX locks when OFD is not available in the system. Other programs are not recommended to open images on such old systems if there is a risk of parallel access to the same image.


Related question, are POSIX locks somehow compatible with OFD locks? If one program use OFD and the other use POSIX locks on the same file.. Will it work or not?

> 
>> +
>> +Image locking examples
>> +~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Read-only, allow others to write
>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[..]

>> +RW, allow others to read only
>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> +We want to read and write, and don't want others to modify the image.
>> +So, let's lock bytes 100, 101, 201. Operation is as follows:
>> +
>> +1. rd-lock bytes 100 (read), 101 (write), 201 (don't allow others to write)
>> +
>> +.. highlight:: c
>> +
>> +    for byte in (100, 101, 201) {
> 
> Using python syntax here is a little bit confusing.

Agree, as everything other is C..

Will change to something like

int offsets[] = {100, 101, 201}, *off, *end = offsets + 3;

for (off = offsets; off < end; off++) {


> 
>> +        struct flock fl = {
>> +            .l_whence = SEEK_SET,
>> +            .l_start  = byte,
>> +            .l_len    = 1,
>> +            .l_type   = F_RDLCK,
>> +        };
>> +        ret = fcntl(fd, F_OFD_SETLK, &fl);
>> +        if (ret == -1) {
>> +            /* Error */
>> +        }
>> +    }
>> +

[..]

> 
> Having this is great even if the locking protocol is not made public.
> 

Thanks!
Daniel P. Berrangé July 15, 2021, 5:19 p.m. UTC | #5
On Thu, Jul 15, 2021 at 08:13:40PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 03.07.2021 17:50, Nir Soffer wrote:
> > On Sat, Jul 3, 2021 at 4:51 PM Vladimir Sementsov-Ogievskiy
> > <vsementsov@virtuozzo.com> wrote:
> 
> [..]
> 
> > > +
> > > +Important notice: Qemu may fallback to POSIX file locks only if OFD locks
> > > +unavailable. Other programs should behave similarly: use POSIX file locks
> > > +only if OFD locks unavailable and if you are OK with drawbacks of POSIX
> > > +file locks (for example, they are lost on close() of any file descriptor
> > > +for that file).
> > 
> > Worth an example.
> 
> Hmm.. Copying here the whole #ifdef and probing logic around these locks from Qemu is too much..
> 
> I can't imagine what small and short could be added here.
> 
> Actually I think, OFD is old enough so we shouldn't care
> too much about older kernels without it. Let's just rewrite
> paragraph to something like this:

Yes, that's a good point. From a Linux POV, I think our platform
support matrix means we can assume existance of OFD locking
unconditionally. The kernel impl transparently applies to all
filesystems too.

So we could likely change the code such that Linux always uses
OFD and non-Linux uses traditional POSIX locks.

> Related question, are POSIX locks somehow compatible with OFD
> locks? If one program use OFD and the other use POSIX locks on
> the same file.. Will it work or not?

Yes, the kernel level implementation uses the same locking primitives
internally. The only difference between the two is how they're invoked
from userspace, and the semantics around lock release when closing
files. So it is fully compatible with applications mixing the two
APIs for fcntl POSIX and OFD locking

The flock() syscall is however completely independant locking
from the fcntl based locking.


Regards,
Daniel
Vladimir Sementsov-Ogievskiy July 15, 2021, 8 p.m. UTC | #6
03.07.2021 16:50, Vladimir Sementsov-Ogievskiy wrote:
> +Permission bytes. If permission byte is rd-locked, it means that some process
> +uses corresponding permission on that file.
> +
> +Byte    Operation
> +100     read
> +          Lock holder can read
> +101     write
> +          Lock holder can write
> +102     write-unchanged
> +          Lock holder can write same data if it sure, that this write doesn't
> +          break concurrent readers. This is mostly used internally in Qemu
> +          and it wouldn't be good idea to exploit it somehow.

Let's make it more strict:

New software should never lock this byte and interpret this byte locked by other process like write permission (same as 101).

> +103     resize
> +          Lock holder can resize the file. "write" permission is also required
> +          for resizing, so lock byte 103 only if you also lock byte 101.
> +104     graph-mod
> +          Undefined. QEMU may sometimes locks this byte, but external programs
> +          should not. QEMU will stop locking this byte in future
> +
> +Unshare bytes. If permission byte is rd-locked, it means that some process
> +does not allow the others use corresponding options on that file.
> +
> +Byte    Operation
> +200     read
> +          Lock holder don't allow read operation to other processes.
> +201     write
> +          Lock holder don't allow write operation to other processes. This
> +          still allows others to do write-uncahnged operations. Better not
> +          exploit outside of Qemu.
> +202     write-unchanged
> +          Lock holder don't allow write-unchanged operation to other processes.

And here, correspondingly:

New software should never lock this byte and interpret this byte locked by other process like write permission unshared (same as 201).

> +203     resize
> +          Lock holder don't allow resizing the file by other processes.
> +204     graph-mod
> +          Undefined. QEMU may sometimes locks this byte, but external programs
> +          should not. QEMU will stop locking this byte in future
> +
> +Handling the permissions works as follows: assume we want to open the file to do
> +some operations and in the same time want to disallow some operation to other
> +processes. So, we want to lock some of the bytes described above. We operate as
> +follows:
> +
Vladimir Sementsov-Ogievskiy July 16, 2021, 4:21 p.m. UTC | #7
15.07.2021 23:00, Vladimir Sementsov-Ogievskiy wrote:
> 03.07.2021 16:50, Vladimir Sementsov-Ogievskiy wrote:
>> +Permission bytes. If permission byte is rd-locked, it means that some process
>> +uses corresponding permission on that file.
>> +
>> +Byte    Operation
>> +100     read
>> +          Lock holder can read
>> +101     write
>> +          Lock holder can write
>> +102     write-unchanged
>> +          Lock holder can write same data if it sure, that this write doesn't
>> +          break concurrent readers. This is mostly used internally in Qemu
>> +          and it wouldn't be good idea to exploit it somehow.
> 
> Let's make it more strict:
> 
> New software should never lock this byte and interpret this byte locked by other process like write permission (same as 101).

[*]

> 
>> +103     resize
>> +          Lock holder can resize the file. "write" permission is also required
>> +          for resizing, so lock byte 103 only if you also lock byte 101.
>> +104     graph-mod
>> +          Undefined. QEMU may sometimes locks this byte, but external programs
>> +          should not. QEMU will stop locking this byte in future
>> +
>> +Unshare bytes. If permission byte is rd-locked, it means that some process
>> +does not allow the others use corresponding options on that file.
>> +
>> +Byte    Operation
>> +200     read
>> +          Lock holder don't allow read operation to other processes.
>> +201     write
>> +          Lock holder don't allow write operation to other processes. This
>> +          still allows others to do write-uncahnged operations. Better not
>> +          exploit outside of Qemu.
>> +202     write-unchanged
>> +          Lock holder don't allow write-unchanged operation to other processes.
> 
> And here, correspondingly:
> 
> New software should never lock this byte and interpret this byte locked by other process like write permission unshared (same as 201).

Hmm, no. For [*] work correctly, new software should always lock 202 if it locks 201.


> 
>> +203     resize
>> +          Lock holder don't allow resizing the file by other processes.
>> +204     graph-mod
>> +          Undefined. QEMU may sometimes locks this byte, but external programs
>> +          should not. QEMU will stop locking this byte in future
>> +
>> +Handling the permissions works as follows: assume we want to open the file to do
>> +some operations and in the same time want to disallow some operation to other
>> +processes. So, we want to lock some of the bytes described above. We operate as
>> +follows:
>> +
> 
> 

More on this.

Hmm, we can just drop graph-mod. I don't think it is needed now, and anyway no sense in locking it in file-posix.

write-unchanged is rather strange here. One process may read data and write it back and consider it WRITE_UNCHANGED, but other process may change the file intermediatly, and for it the WRITE_UNCHANGED of first process is not "unchanged"..

It probably good to drop write-unchanged from the protocol at all. But what if we have old qemu that can lock write-unchanged (byte 102) and not write (byte 101)? The audit do we really have such a case now or in some previous version is rather hard to do..
So, if we want to support it in the most compatible way, all new software should lock both 201 and 202 to unshare write.

Locking 102 in pair with 101 doesn't make real sense: if we has write access, we don't need additional write-unchanged anyway. Still, we may document that these bytes should be locked together just for consistency with "unshare" bytes.

Any thoughts?

Probably it worth also document that 203 is recommended to be locked if application want to lock 201, to avoid bad behaving process, that locks 103 and not 101 and thinks that it has permission to resize.
Vladimir Sementsov-Ogievskiy July 16, 2021, 6:47 p.m. UTC | #8
16.07.2021 19:21, Vladimir Sementsov-Ogievskiy wrote:
> 15.07.2021 23:00, Vladimir Sementsov-Ogievskiy wrote:
>> 03.07.2021 16:50, Vladimir Sementsov-Ogievskiy wrote:
>>> +Permission bytes. If permission byte is rd-locked, it means that some process
>>> +uses corresponding permission on that file.
>>> +
>>> +Byte    Operation
>>> +100     read
>>> +          Lock holder can read
>>> +101     write
>>> +          Lock holder can write
>>> +102     write-unchanged
>>> +          Lock holder can write same data if it sure, that this write doesn't
>>> +          break concurrent readers. This is mostly used internally in Qemu
>>> +          and it wouldn't be good idea to exploit it somehow.
>>
>> Let's make it more strict:
>>
>> New software should never lock this byte and interpret this byte locked by other process like write permission (same as 101).
> 
> [*]
> 
>>
>>> +103     resize
>>> +          Lock holder can resize the file. "write" permission is also required
>>> +          for resizing, so lock byte 103 only if you also lock byte 101.
>>> +104     graph-mod
>>> +          Undefined. QEMU may sometimes locks this byte, but external programs
>>> +          should not. QEMU will stop locking this byte in future
>>> +
>>> +Unshare bytes. If permission byte is rd-locked, it means that some process
>>> +does not allow the others use corresponding options on that file.
>>> +
>>> +Byte    Operation
>>> +200     read
>>> +          Lock holder don't allow read operation to other processes.
>>> +201     write
>>> +          Lock holder don't allow write operation to other processes. This
>>> +          still allows others to do write-uncahnged operations. Better not
>>> +          exploit outside of Qemu.
>>> +202     write-unchanged
>>> +          Lock holder don't allow write-unchanged operation to other processes.
>>
>> And here, correspondingly:
>>
>> New software should never lock this byte and interpret this byte locked by other process like write permission unshared (same as 201).
> 
> Hmm, no. For [*] work correctly, new software should always lock 202 if it locks 201.
> 
> 
>>
>>> +203     resize
>>> +          Lock holder don't allow resizing the file by other processes.
>>> +204     graph-mod
>>> +          Undefined. QEMU may sometimes locks this byte, but external programs
>>> +          should not. QEMU will stop locking this byte in future
>>> +
>>> +Handling the permissions works as follows: assume we want to open the file to do
>>> +some operations and in the same time want to disallow some operation to other
>>> +processes. So, we want to lock some of the bytes described above. We operate as
>>> +follows:
>>> +
>>
>>
> 
> More on this.
> 
> Hmm, we can just drop graph-mod. I don't think it is needed now, and anyway no sense in locking it in file-posix.
> 
> write-unchanged is rather strange here. One process may read data and write it back and consider it WRITE_UNCHANGED, but other process may change the file intermediatly, and for it the WRITE_UNCHANGED of first process is not "unchanged"..
> 
> It probably good to drop write-unchanged from the protocol at all. But what if we have old qemu that can lock write-unchanged (byte 102) and not write (byte 101)? The audit do we really have such a case now or in some previous version is rather hard to do..
> So, if we want to support it in the most compatible way, all new software should lock both 201 and 202 to unshare write.
> 
> Locking 102 in pair with 101 doesn't make real sense: if we has write access, we don't need additional write-unchanged anyway. Still, we may document that these bytes should be locked together just for consistency with "unshare" bytes.
> 
> Any thoughts?
> 
> Probably it worth also document that 203 is recommended to be locked if application want to lock 201, to avoid bad behaving process, that locks 103 and not 101 and thinks that it has permission to resize.
> 


And one more question to consider.

What about read? Actually what's the sense of locking byte 200? What's wrong for us if some other process access the file in RO mode?

In Qemu this permissin is called CONSISTENT_READ.. Who knows what it means? How much consistent may be read by process A, when process B has write permission and do write at the moment?

Hmm, looking at BLK_PERM_CONSISTENT_READ description, I see that it mostly about intermediate nodes of commit block job. If we read from intermediate node of commite block job we may read old data (original view of the node) or new data (that was recently committed and actually unrelated to this intermediate node that we are going to remove).

But I doubt that it make sense for interprocess syncrhonisation: if want to read consistantly, we must unshare write on the whole backing chain.. And no real sense in sharing READ..


Finally, it seems to me that only the following scenarios make sense:

1. read-only access, don't care about existing writing Qemu process: just open RO, don't care about locks (like qemu-img --force-share)

2. RO or RW access, want consistency: unshare all operations, lock bytes 20*. Than no real sense in locking any of 10*.. But for consistency and safety better to lock them all.
Vladimir Sementsov-Ogievskiy July 16, 2021, 8:35 p.m. UTC | #9
16.07.2021 21:47, Vladimir Sementsov-Ogievskiy wrote:
> 16.07.2021 19:21, Vladimir Sementsov-Ogievskiy wrote:
>> 15.07.2021 23:00, Vladimir Sementsov-Ogievskiy wrote:
>>> 03.07.2021 16:50, Vladimir Sementsov-Ogievskiy wrote:
>>>> +Permission bytes. If permission byte is rd-locked, it means that some process
>>>> +uses corresponding permission on that file.
>>>> +
>>>> +Byte    Operation
>>>> +100     read
>>>> +          Lock holder can read
>>>> +101     write
>>>> +          Lock holder can write
>>>> +102     write-unchanged
>>>> +          Lock holder can write same data if it sure, that this write doesn't
>>>> +          break concurrent readers. This is mostly used internally in Qemu
>>>> +          and it wouldn't be good idea to exploit it somehow.
>>>
>>> Let's make it more strict:
>>>
>>> New software should never lock this byte and interpret this byte locked by other process like write permission (same as 101).
>>
>> [*]
>>
>>>
>>>> +103     resize
>>>> +          Lock holder can resize the file. "write" permission is also required
>>>> +          for resizing, so lock byte 103 only if you also lock byte 101.
>>>> +104     graph-mod
>>>> +          Undefined. QEMU may sometimes locks this byte, but external programs
>>>> +          should not. QEMU will stop locking this byte in future
>>>> +
>>>> +Unshare bytes. If permission byte is rd-locked, it means that some process
>>>> +does not allow the others use corresponding options on that file.
>>>> +
>>>> +Byte    Operation
>>>> +200     read
>>>> +          Lock holder don't allow read operation to other processes.
>>>> +201     write
>>>> +          Lock holder don't allow write operation to other processes. This
>>>> +          still allows others to do write-uncahnged operations. Better not
>>>> +          exploit outside of Qemu.
>>>> +202     write-unchanged
>>>> +          Lock holder don't allow write-unchanged operation to other processes.
>>>
>>> And here, correspondingly:
>>>
>>> New software should never lock this byte and interpret this byte locked by other process like write permission unshared (same as 201).
>>
>> Hmm, no. For [*] work correctly, new software should always lock 202 if it locks 201.
>>
>>
>>>
>>>> +203     resize
>>>> +          Lock holder don't allow resizing the file by other processes.
>>>> +204     graph-mod
>>>> +          Undefined. QEMU may sometimes locks this byte, but external programs
>>>> +          should not. QEMU will stop locking this byte in future
>>>> +
>>>> +Handling the permissions works as follows: assume we want to open the file to do
>>>> +some operations and in the same time want to disallow some operation to other
>>>> +processes. So, we want to lock some of the bytes described above. We operate as
>>>> +follows:
>>>> +
>>>
>>>
>>
>> More on this.
>>
>> Hmm, we can just drop graph-mod. I don't think it is needed now, and anyway no sense in locking it in file-posix.
>>
>> write-unchanged is rather strange here. One process may read data and write it back and consider it WRITE_UNCHANGED, but other process may change the file intermediatly, and for it the WRITE_UNCHANGED of first process is not "unchanged"..
>>
>> It probably good to drop write-unchanged from the protocol at all. But what if we have old qemu that can lock write-unchanged (byte 102) and not write (byte 101)? The audit do we really have such a case now or in some previous version is rather hard to do..
>> So, if we want to support it in the most compatible way, all new software should lock both 201 and 202 to unshare write.
>>
>> Locking 102 in pair with 101 doesn't make real sense: if we has write access, we don't need additional write-unchanged anyway. Still, we may document that these bytes should be locked together just for consistency with "unshare" bytes.
>>
>> Any thoughts?
>>
>> Probably it worth also document that 203 is recommended to be locked if application want to lock 201, to avoid bad behaving process, that locks 103 and not 101 and thinks that it has permission to resize.
>>
> 
> 
> And one more question to consider.
> 
> What about read? Actually what's the sense of locking byte 200? What's wrong for us if some other process access the file in RO mode?
> 
> In Qemu this permissin is called CONSISTENT_READ.. Who knows what it means? How much consistent may be read by process A, when process B has write permission and do write at the moment?
> 
> Hmm, looking at BLK_PERM_CONSISTENT_READ description, I see that it mostly about intermediate nodes of commit block job. If we read from intermediate node of commite block job we may read old data (original view of the node) or new data (that was recently committed and actually unrelated to this intermediate node that we are going to remove).
> 
> But I doubt that it make sense for interprocess syncrhonisation: if want to read consistantly, we must unshare write on the whole backing chain.. And no real sense in sharing READ..
> 
> 
> Finally, it seems to me that only the following scenarios make sense:
> 
> 1. read-only access, don't care about existing writing Qemu process: just open RO, don't care about locks (like qemu-img --force-share)
> 
> 2. RO or RW access, want consistency: unshare all operations, lock bytes 20*. Than no real sense in locking any of 10*.. But for consistency and safety better to lock them all.
> 
> 


And final interesting question: we do force-share write in block-job filters, to not break the job itself.. Interesting, don't we release file locks during block-jobs, so that another process can write to the image during the job? I'll check it later.
diff mbox series

Patch

diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
index 16225710eb..74fb71600d 100644
--- a/docs/system/qemu-block-drivers.rst.inc
+++ b/docs/system/qemu-block-drivers.rst.inc
@@ -909,3 +909,189 @@  some additional tasks, hooking io requests.
   .. option:: prealloc-size
 
     How much to preallocate (in bytes), default 128M.
+
+Image locking protocol
+~~~~~~~~~~~~~~~~~~~~~~
+
+QEMU holds rd locks and never rw locks. Instead, GETLK fcntl is used with F_WRLCK
+to handle permissions as described below.
+QEMU process may rd-lock the following bytes of the image with corresponding
+meaning:
+
+Permission bytes. If permission byte is rd-locked, it means that some process
+uses corresponding permission on that file.
+
+Byte    Operation
+100     read
+          Lock holder can read
+101     write
+          Lock holder can write
+102     write-unchanged
+          Lock holder can write same data if it sure, that this write doesn't
+          break concurrent readers. This is mostly used internally in Qemu
+          and it wouldn't be good idea to exploit it somehow.
+103     resize
+          Lock holder can resize the file. "write" permission is also required
+          for resizing, so lock byte 103 only if you also lock byte 101.
+104     graph-mod
+          Undefined. QEMU may sometimes locks this byte, but external programs
+          should not. QEMU will stop locking this byte in future
+
+Unshare bytes. If permission byte is rd-locked, it means that some process
+does not allow the others use corresponding options on that file.
+
+Byte    Operation
+200     read
+          Lock holder don't allow read operation to other processes.
+201     write
+          Lock holder don't allow write operation to other processes. This
+          still allows others to do write-uncahnged operations. Better not
+          exploit outside of Qemu.
+202     write-unchanged
+          Lock holder don't allow write-unchanged operation to other processes.
+203     resize
+          Lock holder don't allow resizing the file by other processes.
+204     graph-mod
+          Undefined. QEMU may sometimes locks this byte, but external programs
+          should not. QEMU will stop locking this byte in future
+
+Handling the permissions works as follows: assume we want to open the file to do
+some operations and in the same time want to disallow some operation to other
+processes. So, we want to lock some of the bytes described above. We operate as
+follows:
+
+1. rd-lock all needed bytes, both "permission" bytes and "unshare" bytes.
+
+2. For each "unshare" byte we rd-locked, do GETLK that "tries" to wr-lock
+corresponding "permission" byte. So, we check is there any other process that
+uses the permission we want to unshare. If it exists we fail.
+
+3. For each "permission" byte we rd-locked, do GETLK that "tries" to wr-lock
+corresponding "unshare" byte. So, we check is there any other process that
+unshares the permission we want to have. If it exists we fail.
+
+Important notice: Qemu may fallback to POSIX file locks only if OFD locks
+unavailable. Other programs should behave similarly: use POSIX file locks
+only if OFD locks unavailable and if you are OK with drawbacks of POSIX
+file locks (for example, they are lost on close() of any file descriptor
+for that file).
+
+Image locking examples
+~~~~~~~~~~~~~~~~~~~~~~
+
+Read-only, allow others to write
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+So, we want to read and don't care what other users do with the image. We only
+need to lock byte 100. Operation is as follows:
+
+1. rd-lock byte 100
+
+.. highlight:: c
+
+    struct flock fl = {
+        .l_whence = SEEK_SET,
+        .l_start  = 100,
+        .l_len    = 1,
+        .l_type   = F_RDLCK,
+    };
+    ret = fcntl(fd, F_OFD_SETLK, &fl);
+    if (ret == -1) {
+        /* Error */
+    }
+
+2. try wr-lock byte 200, to check that no one is against our read access
+
+.. highlight:: c
+
+    struct flock fl = {
+        .l_whence = SEEK_SET,
+        .l_start  = 200,
+        .l_len    = 1,
+        .l_type   = F_WRLCK,
+    };
+    ret = fcntl(fd, F_OFD_GETLK, &fl);
+    if (ret != -1 && fl.l_type == F_UNLCK) {
+        /*
+         * We are lucky, nobody against. So, now we have RO access
+         * that we want.
+         */
+    } else {
+        /* Error, or RO access is blocked by someone. We don't have access */
+    }
+
+3. Now we can operate read the data.
+
+4. When finished, release the lock:
+
+.. highlight:: c
+
+    struct flock fl = {
+        .l_whence = SEEK_SET,
+        .l_start  = 100,
+        .l_len    = 1,
+        .l_type   = F_UNLCK,
+    };
+    ret = fcntl(fd, F_OFD_SETLK, &fl);
+
+RW, allow others to read only
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+We want to read and write, and don't want others to modify the image.
+So, let's lock bytes 100, 101, 201. Operation is as follows:
+
+1. rd-lock bytes 100 (read), 101 (write), 201 (don't allow others to write)
+
+.. highlight:: c
+
+    for byte in (100, 101, 201) {
+        struct flock fl = {
+            .l_whence = SEEK_SET,
+            .l_start  = byte,
+            .l_len    = 1,
+            .l_type   = F_RDLCK,
+        };
+        ret = fcntl(fd, F_OFD_SETLK, &fl);
+        if (ret == -1) {
+            /* Error */
+        }
+    }
+
+2. try wr-lock bytes 200 (to check that no one is against our read access),
+   201 (no one against our write access), 101 (there are no writers currently)
+
+.. highlight:: c
+
+    for byte in (200, 201, 101) {
+        struct flock fl = {
+            .l_whence = SEEK_SET,
+            .l_start  = byte,
+            .l_len    = 1,
+            .l_type   = F_WRLCK,
+        };
+        ret = fcntl(fd, F_OFD_GETLK, &fl);
+        if (ret != -1 && fl.l_type == F_UNLCK) {
+            /* We are lucky, nobody against. */
+        } else {
+            /*
+             * Error, or feature we want is blocked by someone.
+             * We don't have access.
+             */
+        }
+    }
+
+3. Now we can read and write.
+
+4. When finished, release locks:
+
+.. highlight:: c
+
+    for byte in (100, 101, 201) {
+        struct flock fl = {
+            .l_whence = SEEK_SET,
+            .l_start  = byte,
+            .l_len    = 1,
+            .l_type   = F_UNLCK,
+        };
+        fcntl(fd, F_OFD_SETLK, &fl);
+    }