
[00/11,V1] rbd journaling feature

Message ID 1543841435-13652-1-git-send-email-dongsheng.yang@easystack.cn (mailing list archive)
State New, archived

Commit Message

Dongsheng Yang Dec. 3, 2018, 12:50 p.m. UTC
Hi all,
   This is V1 of the journaling feature for kernel rbd, which makes mirroring possible in Kubernetes.
It passes /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh with a small change, shown below:

```
[root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
index e019de5..9d00d3e 100755
--- a/qa/workunits/rbd/rbd_mirror_helpers.sh
+++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
@@ -854,9 +854,9 @@ write_image()
 
     test -n "${size}" || size=4096
 
-    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
-       --io-size ${size} --io-threads 1 --io-total $((size * count)) \
-       --io-pattern rand
+    rbd --cluster ${cluster} -p ${pool} map ${image}
+    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --group_reporting --eta-newline 1
+    rbd --cluster ${cluster} -p ${pool} unmap ${image}
 }
 
 stress_write_image()
```

Changelog from RFC:
        1. Error out if an unsupported event type is encountered during replay.
        2. Only one memory copy, from bio to msg.
        3. Use async I/O for journal appending.
        4. No mutex around I/O.

Dongsheng Yang (11):
  libceph: support prefix and suffix in bio_iter
  libceph: support op append
  libceph: introduce cls_journaler_client
  libceph: introduce generic journaling
  libceph: journaling: introduce api to replay uncommitted journal
    events
  libceph: journaling: introduce api for journal appending
  libceph: journaling: trim object set when we found there is no client
    refer it
  rbd: wait image request all complete in lock releasing
  rbd: introduce completion for each img_request
  rbd: introduce a lock_flag in rbd_dev to make some process
    exclusive-lock protected
  rbd: enable journaling

 drivers/block/rbd.c                       |  635 +++++++++-
 include/linux/ceph/cls_journaler_client.h |   93 ++
 include/linux/ceph/journaler.h            |  176 +++
 include/linux/ceph/messenger.h            |    9 +
 net/ceph/Makefile                         |    3 +-
 net/ceph/cls_journaler_client.c           |  535 +++++++++
 net/ceph/journaler.c                      | 1843 +++++++++++++++++++++++++++++
 net/ceph/messenger.c                      |   96 +-
 net/ceph/osd_client.c                     |   14 +-
 9 files changed, 3360 insertions(+), 44 deletions(-)
 create mode 100644 include/linux/ceph/cls_journaler_client.h
 create mode 100644 include/linux/ceph/journaler.h
 create mode 100644 net/ceph/cls_journaler_client.c
 create mode 100644 net/ceph/journaler.c

Comments

Ilya Dryomov Dec. 4, 2018, 2:03 p.m. UTC | #1
On Mon, Dec 3, 2018 at 1:50 PM Dongsheng Yang
<dongsheng.yang@easystack.cn> wrote:
>
> Hi all,
>    This is V1 to implement the journaling feature in kernel rbd, which makes mirroring in kubernetes possible.
> It passed the /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh, with a little change as below:
>
> ```
> [root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
> diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
> index e019de5..9d00d3e 100755
> --- a/qa/workunits/rbd/rbd_mirror_helpers.sh
> +++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
> @@ -854,9 +854,9 @@ write_image()
>
>      test -n "${size}" || size=4096
>
> -    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
> -       --io-size ${size} --io-threads 1 --io-total $((size * count)) \
> -       --io-pattern rand
> +    rbd --cluster ${cluster} -p ${pool} map ${image}
> +    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --group_reporting --eta-newline 1
> +    rbd --cluster ${cluster} -p ${pool} unmap ${image}
>  }
>
>  stress_write_image()
> ```
>
> Changelog from RFC:
>         1. error out if there is some unsupported event type in replaying

So the journal is still replayed in the kernel.  Was there a design
discussion about this?

Like I said in one of my replies to the RFC, I think we should avoid
replaying the journal in the kernel and try to come up with a design
where it's done by librbd.

The fundamental problem with replaying the journal in the kernel and
therefore supporting only a couple of event types is that the journal
has to be replayed not only at "rbd map" time, but also every time the
exclusive lock is reacquired.  Whereas we can safely error out at "rbd
map" time, I don't see a sane way to handle a case where an unsupported
event type is encountered after reacquiring the lock when the image is
already mapped.

Thanks,

                Ilya
Jason Dillaman Dec. 4, 2018, 2:17 p.m. UTC | #2
On Tue, Dec 4, 2018 at 9:03 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>
> On Mon, Dec 3, 2018 at 1:50 PM Dongsheng Yang
> <dongsheng.yang@easystack.cn> wrote:
> >
> > Hi all,
> >    This is V1 to implement the journaling feature in kernel rbd, which makes mirroring in kubernetes possible.
> > It passed the /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh, with a little change as below:
> >
> > ```
> > [root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
> > diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
> > index e019de5..9d00d3e 100755
> > --- a/qa/workunits/rbd/rbd_mirror_helpers.sh
> > +++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
> > @@ -854,9 +854,9 @@ write_image()
> >
> >      test -n "${size}" || size=4096
> >
> > -    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
> > -       --io-size ${size} --io-threads 1 --io-total $((size * count)) \
> > -       --io-pattern rand
> > +    rbd --cluster ${cluster} -p ${pool} map ${image}
> > +    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --group_reporting --eta-newline 1
> > +    rbd --cluster ${cluster} -p ${pool} unmap ${image}
> >  }
> >
> >  stress_write_image()
> > ```
> >
> > Changelog from RFC:
> >         1. error out if there is some unsupported event type in replaying
>
> So the journal is still replayed in the kernel.  Was there a design
> discussion about this?
>
> Like I said in one of my replies to the RFC, I think we should avoid
> replaying the journal in the kernel and try to come up with a design
> where it's done by librbd.

+1 to this. If "rbd [device] map" first replays the journal (by just
acquiring the exclusive lock via the API), then krbd would only need
to check that there are no events to replay. If there are one or more
events that it needs to replay for some reason, it implies that it
lost the exclusive lock to another client that changed the image *and*
failed to commit the entries to the journal. I think it seems
reasonable to just move the volume to R/O in that case since something
odd was occurring.

> The fundamental problem with replaying the journal in the kernel and
> therefore supporting only a couple of event types is that the journal
> has to be replayed not only at "rbd map" time, but also every time the
> exclusive lock is reacquired.  Whereas we can safely error out at "rbd
> map" time, I don't see a sane way to handle a case where an unsupported
> event type is encountered after reacquiring the lock when the image is
> already mapped.
>
> Thanks,
>
>                 Ilya
Ilya Dryomov Dec. 4, 2018, 3:15 p.m. UTC | #3
On Tue, Dec 4, 2018 at 3:17 PM Jason Dillaman <jdillama@redhat.com> wrote:
>
> On Tue, Dec 4, 2018 at 9:03 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> >
> > On Mon, Dec 3, 2018 at 1:50 PM Dongsheng Yang
> > <dongsheng.yang@easystack.cn> wrote:
> > >
> > > Hi all,
> > >    This is V1 to implement the journaling feature in kernel rbd, which makes mirroring in kubernetes possible.
> > > It passed the /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh, with a little change as below:
> > >
> > > ```
> > > [root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
> > > diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
> > > index e019de5..9d00d3e 100755
> > > --- a/qa/workunits/rbd/rbd_mirror_helpers.sh
> > > +++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
> > > @@ -854,9 +854,9 @@ write_image()
> > >
> > >      test -n "${size}" || size=4096
> > >
> > > -    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
> > > -       --io-size ${size} --io-threads 1 --io-total $((size * count)) \
> > > -       --io-pattern rand
> > > +    rbd --cluster ${cluster} -p ${pool} map ${image}
> > > +    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --group_reporting --eta-newline 1
> > > +    rbd --cluster ${cluster} -p ${pool} unmap ${image}
> > >  }
> > >
> > >  stress_write_image()
> > > ```
> > >
> > > Changelog from RFC:
> > >         1. error out if there is some unsupported event type in replaying
> >
> > So the journal is still replayed in the kernel.  Was there a design
> > discussion about this?
> >
> > Like I said in one of my replies to the RFC, I think we should avoid
> > replaying the journal in the kernel and try to come up with a design
> > where it's done by librbd.
>
> +1 to this. If "rbd [device] map" first replays the journal (by just
> acquiring the exclusive lock via the API), then krbd would only need
> to check that there are no events to replay. If there are one or more
> events that it needs to replay for some reason, it implies that it
> lost the exclusive lock to another client that changed the image *and*
> failed to commit the entries to the journal. I think it seems
> reasonable to just move the volume to R/O in that case since something
> odd was occurring.

"rbd map" is easy -- we can fail it with a nice error message.  The
real issue is replay on reacquire.  Quoting myself:

  The fundamental problem with replaying the journal in the kernel and
  therefore supporting only a couple of event types is that the journal
  has to be replayed not only at "rbd map" time, but also every time the
  exclusive lock is reacquired.  Whereas we can safely error out at "rbd
  map" time, I don't see a sane way to handle a case where an unsupported
  event type is encountered after reacquiring the lock when the image is
  already mapped.

Consider the following scenario: the kernel gives up the lock for
creating a snapshot, librbd writes SNAP_CREATE event to the journal and
crashes.  Its watch times out, the kernel reacquires the lock but there
is an unsupported entry in the journal.  What should we do?

Marking the device read-only doesn't feel right.  I wouldn't want my
storage to freeze just because a maintenance pod went down.  I think we
need to look into making it so that the kernel can ask librbd to replay
the journal on its behalf.  This will solve the general case, take care
of "rbd map" and also avoid introducing a hard dependency on the tool
for mapping images.

Thanks,

                Ilya
Jason Dillaman Dec. 4, 2018, 4:01 p.m. UTC | #4
On Tue, Dec 4, 2018 at 10:16 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>
> On Tue, Dec 4, 2018 at 3:17 PM Jason Dillaman <jdillama@redhat.com> wrote:
> >
> > On Tue, Dec 4, 2018 at 9:03 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> > >
> > > On Mon, Dec 3, 2018 at 1:50 PM Dongsheng Yang
> > > <dongsheng.yang@easystack.cn> wrote:
> > > >
> > > > Hi all,
> > > >    This is V1 to implement the journaling feature in kernel rbd, which makes mirroring in kubernetes possible.
> > > > It passed the /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh, with a little change as below:
> > > >
> > > > ```
> > > > [root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
> > > > diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
> > > > index e019de5..9d00d3e 100755
> > > > --- a/qa/workunits/rbd/rbd_mirror_helpers.sh
> > > > +++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
> > > > @@ -854,9 +854,9 @@ write_image()
> > > >
> > > >      test -n "${size}" || size=4096
> > > >
> > > > -    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
> > > > -       --io-size ${size} --io-threads 1 --io-total $((size * count)) \
> > > > -       --io-pattern rand
> > > > +    rbd --cluster ${cluster} -p ${pool} map ${image}
> > > > +    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --group_reporting --eta-newline 1
> > > > +    rbd --cluster ${cluster} -p ${pool} unmap ${image}
> > > >  }
> > > >
> > > >  stress_write_image()
> > > > ```
> > > >
> > > > Changelog from RFC:
> > > >         1. error out if there is some unsupported event type in replaying
> > >
> > > So the journal is still replayed in the kernel.  Was there a design
> > > discussion about this?
> > >
> > > Like I said in one of my replies to the RFC, I think we should avoid
> > > replaying the journal in the kernel and try to come up with a design
> > > where it's done by librbd.
> >
> > +1 to this. If "rbd [device] map" first replays the journal (by just
> > acquiring the exclusive lock via the API), then krbd would only need
> > to check that there are no events to replay. If there are one or more
> > events that it needs to replay for some reason, it implies that it
> > lost the exclusive lock to another client that changed the image *and*
> > failed to commit the entries to the journal. I think it seems
> > reasonable to just move the volume to R/O in that case since something
> > odd was occurring.
>
> "rbd map" is easy -- we can fail it with a nice error message.  The
> real issue is replay on reacquire.  Quoting myself:
>
>   The fundamental problem with replaying the journal in the kernel and
>   therefore supporting only a couple of event types is that the journal
>   has to be replayed not only at "rbd map" time, but also every time the
>   exclusive lock is reacquired.  Whereas we can safely error out at "rbd
>   map" time, I don't see a sane way to handle a case where an unsupported
>   event type is encountered after reacquiring the lock when the image is
>   already mapped.
>
> Consider the following scenario: the kernel gives up the lock for
> creating a snapshot, librbd writes SNAP_CREATE event to the journal and
> crashes.  Its watch times out, the kernel reacquires the lock but there
> is an unsupported entry in the journal.  What should we do?
>
> Marking the device read-only doesn't feel right.  I wouldn't want my
> storage to freeze just because a maintenance pod went down.  I think we
> need to look into making it so that the kernel can ask librbd to replay
> the journal on its behalf.  This will solve the general case, take care
> of "rbd map" and also avoid introducing a hard dependency on the tool
> for mapping images.

What would be the best mechanism for doing such things? Have a daemon
waiting in user-space listening to a netlink notification from krbd,
or udev notification and user-space trigger, or ...?

> Thanks,
>
>                 Ilya

--
Jason
Ilya Dryomov Dec. 4, 2018, 5:16 p.m. UTC | #5
On Tue, Dec 4, 2018 at 5:01 PM Jason Dillaman <jdillama@redhat.com> wrote:
>
> On Tue, Dec 4, 2018 at 10:16 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> >
> > On Tue, Dec 4, 2018 at 3:17 PM Jason Dillaman <jdillama@redhat.com> wrote:
> > >
> > > On Tue, Dec 4, 2018 at 9:03 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> > > >
> > > > On Mon, Dec 3, 2018 at 1:50 PM Dongsheng Yang
> > > > <dongsheng.yang@easystack.cn> wrote:
> > > > >
> > > > > Hi all,
> > > > >    This is V1 to implement the journaling feature in kernel rbd, which makes mirroring in kubernetes possible.
> > > > > It passed the /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh, with a little change as below:
> > > > >
> > > > > ```
> > > > > [root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
> > > > > diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
> > > > > index e019de5..9d00d3e 100755
> > > > > --- a/qa/workunits/rbd/rbd_mirror_helpers.sh
> > > > > +++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
> > > > > @@ -854,9 +854,9 @@ write_image()
> > > > >
> > > > >      test -n "${size}" || size=4096
> > > > >
> > > > > -    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
> > > > > -       --io-size ${size} --io-threads 1 --io-total $((size * count)) \
> > > > > -       --io-pattern rand
> > > > > +    rbd --cluster ${cluster} -p ${pool} map ${image}
> > > > > +    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --group_reporting --eta-newline 1
> > > > > +    rbd --cluster ${cluster} -p ${pool} unmap ${image}
> > > > >  }
> > > > >
> > > > >  stress_write_image()
> > > > > ```
> > > > >
> > > > > Changelog from RFC:
> > > > >         1. error out if there is some unsupported event type in replaying
> > > >
> > > > So the journal is still replayed in the kernel.  Was there a design
> > > > discussion about this?
> > > >
> > > > Like I said in one of my replies to the RFC, I think we should avoid
> > > > replaying the journal in the kernel and try to come up with a design
> > > > where it's done by librbd.
> > >
> > > +1 to this. If "rbd [device] map" first replays the journal (by just
> > > acquiring the exclusive lock via the API), then krbd would only need
> > > to check that there are no events to replay. If there are one or more
> > > events that it needs to replay for some reason, it implies that it
> > > lost the exclusive lock to another client that changed the image *and*
> > > failed to commit the entries to the journal. I think it seems
> > > reasonable to just move the volume to R/O in that case since something
> > > odd was occurring.
> >
> > "rbd map" is easy -- we can fail it with a nice error message.  The
> > real issue is replay on reacquire.  Quoting myself:
> >
> >   The fundamental problem with replaying the journal in the kernel and
> >   therefore supporting only a couple of event types is that the journal
> >   has to be replayed not only at "rbd map" time, but also every time the
> >   exclusive lock is reacquired.  Whereas we can safely error out at "rbd
> >   map" time, I don't see a sane way to handle a case where an unsupported
> >   event type is encountered after reacquiring the lock when the image is
> >   already mapped.
> >
> > Consider the following scenario: the kernel gives up the lock for
> > creating a snapshot, librbd writes SNAP_CREATE event to the journal and
> > crashes.  Its watch times out, the kernel reacquires the lock but there
> > is an unsupported entry in the journal.  What should we do?
> >
> > Marking the device read-only doesn't feel right.  I wouldn't want my
> > storage to freeze just because a maintenance pod went down.  I think we
> > need to look into making it so that the kernel can ask librbd to replay
> > the journal on its behalf.  This will solve the general case, take care
> > of "rbd map" and also avoid introducing a hard dependency on the tool
> > for mapping images.
>
> What would be the best mechanism for doing such things? Have a daemon
> waiting in user-space listening to a netlink notification from krbd,
> or udev notification and user-space trigger, or ...?

As for the communication mechanism, the existing watch-notify on either
the header object or the journal object seems like the natural choice.

As for the execution context, I'm not sure...  If the only reason to
take the performance hit of journaling is to have mirroring and thus
a running rbd-mirror daemon for sending out journal entries, could we
perhaps piggyback on it?

Thanks,

                Ilya
Dongsheng Yang Dec. 5, 2018, 2:21 a.m. UTC | #6
On 12/05/2018 01:16 AM, Ilya Dryomov wrote:
> On Tue, Dec 4, 2018 at 5:01 PM Jason Dillaman <jdillama@redhat.com> wrote:
>> On Tue, Dec 4, 2018 at 10:16 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>>> On Tue, Dec 4, 2018 at 3:17 PM Jason Dillaman <jdillama@redhat.com> wrote:
>>>> On Tue, Dec 4, 2018 at 9:03 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>>>>> On Mon, Dec 3, 2018 at 1:50 PM Dongsheng Yang
>>>>> <dongsheng.yang@easystack.cn> wrote:
>>>>>> Hi all,
>>>>>>     This is V1 to implement the journaling feature in kernel rbd, which makes mirroring in kubernetes possible.
>>>>>> It passed the /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh, with a little change as below:
>>>>>>
>>>>>> ```
>>>>>> [root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
>>>>>> diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
>>>>>> index e019de5..9d00d3e 100755
>>>>>> --- a/qa/workunits/rbd/rbd_mirror_helpers.sh
>>>>>> +++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
>>>>>> @@ -854,9 +854,9 @@ write_image()
>>>>>>
>>>>>>       test -n "${size}" || size=4096
>>>>>>
>>>>>> -    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
>>>>>> -       --io-size ${size} --io-threads 1 --io-total $((size * count)) \
>>>>>> -       --io-pattern rand
>>>>>> +    rbd --cluster ${cluster} -p ${pool} map ${image}
>>>>>> +    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --group_reporting --eta-newline 1
>>>>>> +    rbd --cluster ${cluster} -p ${pool} unmap ${image}
>>>>>>   }
>>>>>>
>>>>>>   stress_write_image()
>>>>>> ```
>>>>>>
>>>>>> Changelog from RFC:
>>>>>>          1. error out if there is some unsupported event type in replaying
>>>>> So the journal is still replayed in the kernel.  Was there a design
>>>>> discussion about this?
>>>>>
>>>>> Like I said in one of my replies to the RFC, I think we should avoid
>>>>> replaying the journal in the kernel and try to come up with a design
>>>>> where it's done by librbd.
>>>> +1 to this. If "rbd [device] map" first replays the journal (by just
>>>> acquiring the exclusive lock via the API), then krbd would only need
>>>> to check that there are no events to replay. If there are one or more
>>>> events that it needs to replay for some reason, it implies that it
>>>> lost the exclusive lock to another client that changed the image *and*
>>>> failed to commit the entries to the journal. I think it seems
>>>> reasonable to just move the volume to R/O in that case since something
>>>> odd was occurring.
>>> "rbd map" is easy -- we can fail it with a nice error message.  The
>>> real issue is replay on reacquire.  Quoting myself:
>>>
>>>    The fundamental problem with replaying the journal in the kernel and
>>>    therefore supporting only a couple of event types is that the journal
>>>    has to be replayed not only at "rbd map" time, but also every time the
>>>    exclusive lock is reacquired.  Whereas we can safely error out at "rbd
>>>    map" time, I don't see a sane way to handle a case where an unsupported
>>>    event type is encountered after reacquiring the lock when the image is
>>>    already mapped.
>>>
>>> Consider the following scenario: the kernel gives up the lock for
>>> creating a snapshot, librbd writes SNAP_CREATE event to the journal and
>>> crashes.  Its watch times out, the kernel reacquires the lock but there
>>> is an unsupported entry in the journal.  What should we do?
>>>
>>> Marking the device read-only doesn't feel right.  I wouldn't want my
>>> storage to freeze just because a maintenance pod went down.  I think we
>>> need to look into making it so that the kernel can ask librbd to replay
>>> the journal on its behalf.  This will solve the general case, take care
>>> of "rbd map" and also avoid introducing a hard dependency on the tool
>>> for mapping images.
>> What would be the best mechanism for doing such things? Have a daemon
>> waiting in user-space listening to a netlink notification from krbd,
>> or udev notification and user-space trigger, or ...?
> As for the communication mechanism, the existing watch-notify on either
> the header object or the journal object seems like the natural choice.
>
> As for the execution context, I'm not sure...  If the only reason to
> take the performance hit of journaling is to have mirroring and thus
> a running rbd-mirror daemon for sending out journal entries, could we
> perhaps piggy back on it?

Thanx Ilya and Jason,
      Using librbd for replaying sounds like a good idea to me, but I am
not sure whether the rbd-mirror daemon is the right place for it. Or
should we introduce a new, generic daemon to handle image requests from
the kernel (or other clients) via watch-notify?
      In my use case we always have an rbd-mirror daemon when we use
journaling in krbd, so rbd-mirror does not sound bad to me. What's your
opinion, Jason? Is it proper to do this work in the rbd-mirror daemon
from the point of view of the design of rbd mirroring?

Thanx a lot.
>
> Thanks,
>
>                  Ilya
>
Dongsheng Yang Dec. 5, 2018, 2:46 a.m. UTC | #7
Hi Ilya and Jason,
     Maybe there is another option: umh (usermode helper):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/umh.h
We could provide a userspace subcommand for journal replaying, and check
the journal on "rbd map" and on exclusive-lock reacquisition; if we find
an uncommitted entry, we can call the userspace helper via umh to replay
it.

Thanx


On 12/05/2018 10:21 AM, Dongsheng Yang wrote:
>
>
> On 12/05/2018 01:16 AM, Ilya Dryomov wrote:
>> On Tue, Dec 4, 2018 at 5:01 PM Jason Dillaman <jdillama@redhat.com> 
>> wrote:
>>> On Tue, Dec 4, 2018 at 10:16 AM Ilya Dryomov <idryomov@gmail.com> 
>>> wrote:
>>>> On Tue, Dec 4, 2018 at 3:17 PM Jason Dillaman <jdillama@redhat.com> 
>>>> wrote:
>>>>> On Tue, Dec 4, 2018 at 9:03 AM Ilya Dryomov <idryomov@gmail.com> 
>>>>> wrote:
>>>>>> On Mon, Dec 3, 2018 at 1:50 PM Dongsheng Yang
>>>>>> <dongsheng.yang@easystack.cn> wrote:
>>>>>>> Hi all,
>>>>>>>     This is V1 to implement the journaling feature in kernel 
>>>>>>> rbd, which makes mirroring in kubernetes possible.
>>>>>>> It passed the /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh, with a 
>>>>>>> little change as below:
>>>>>>>
>>>>>>> ```
>>>>>>> [root@atest-guest build]# git diff 
>>>>>>> /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
>>>>>>> diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh 
>>>>>>> b/qa/workunits/rbd/rbd_mirror_helpers.sh
>>>>>>> index e019de5..9d00d3e 100755
>>>>>>> --- a/qa/workunits/rbd/rbd_mirror_helpers.sh
>>>>>>> +++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
>>>>>>> @@ -854,9 +854,9 @@ write_image()
>>>>>>>
>>>>>>>       test -n "${size}" || size=4096
>>>>>>>
>>>>>>> -    rbd --cluster ${cluster} -p ${pool} bench ${image} 
>>>>>>> --io-type write \
>>>>>>> -       --io-size ${size} --io-threads 1 --io-total $((size * 
>>>>>>> count)) \
>>>>>>> -       --io-pattern rand
>>>>>>> +    rbd --cluster ${cluster} -p ${pool} map ${image}
>>>>>>> +    fio --name=test --rw=randwrite --bs=${size} --runtime=60 
>>>>>>> --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 
>>>>>>> --direct=1 --group_reporting --size $((size * count)) 
>>>>>>> --group_reporting --eta-newline 1
>>>>>>> +    rbd --cluster ${cluster} -p ${pool} unmap ${image}
>>>>>>>   }
>>>>>>>
>>>>>>>   stress_write_image()
>>>>>>> ```
>>>>>>>
>>>>>>> Changelog from RFC:
>>>>>>>          1. error out if there is some unsupported event type in 
>>>>>>> replaying
>>>>>> So the journal is still replayed in the kernel.  Was there a design
>>>>>> discussion about this?
>>>>>>
>>>>>> Like I said in one of my replies to the RFC, I think we should avoid
>>>>>> replaying the journal in the kernel and try to come up with a design
>>>>>> where it's done by librbd.
>>>>> +1 to this. If "rbd [device] map" first replays the journal (by just
>>>>> acquiring the exclusive lock via the API), then krbd would only need
>>>>> to check that there are no events to replay. If there are one or more
>>>>> events that it needs to replay for some reason, it implies that it
>>>>> lost the exclusive lock to another client that changed the image 
>>>>> *and*
>>>>> failed to commit the entries to the journal. I think it seems
>>>>> reasonable to just move the volume to R/O in that case since 
>>>>> something
>>>>> odd was occurring.
>>>> "rbd map" is easy -- we can fail it with a nice error message.  The
>>>> real issue is replay on reacquire.  Quoting myself:
>>>>
>>>>    The fundamental problem with replaying the journal in the kernel 
>>>> and
>>>>    therefore supporting only a couple of event types is that the 
>>>> journal
>>>>    has to be replayed not only at "rbd map" time, but also every 
>>>> time the
>>>>    exclusive lock is reacquired.  Whereas we can safely error out 
>>>> at "rbd
>>>>    map" time, I don't see a sane way to handle a case where an 
>>>> unsupported
>>>>    event type is encountered after reacquiring the lock when the 
>>>> image is
>>>>    already mapped.
>>>>
>>>> Consider the following scenario: the kernel gives up the lock for
>>>> creating a snapshot, librbd writes SNAP_CREATE event to the journal 
>>>> and
>>>> crashes.  Its watch times out, the kernel reacquires the lock but 
>>>> there
>>>> is an unsupported entry in the journal.  What should we do?
>>>>
>>>> Marking the device read-only doesn't feel right.  I wouldn't want my
>>>> storage to freeze just because a maintenance pod went down. I think we
>>>> need to look into making it so that the kernel can ask librbd to 
>>>> replay
>>>> the journal on its behalf.  This will solve the general case, take 
>>>> care
>>>> of "rbd map" and also avoid introducing a hard dependency on the tool
>>>> for mapping images.
>>> What would be the best mechanism for doing such things? Have a daemon
>>> waiting in user-space listening to a netlink notification from krbd,
>>> or udev notification and user-space trigger, or ...?
>> As for the communication mechanism, the existing watch-notify on either
>> the header object or the journal object seems like the natural choice.
>>
>> As for the execution context, I'm not sure...  If the only reason to
>> take the performance hit of journaling is to have mirroring and thus
>> a running rbd-mirror daemon for sending out journal entries, could we
>> perhaps piggy back on it?
>
> Thanx Ilya and Jason,
>       Good idea to me to use librbd for replaying, but I am not sure is
> that good to use rbd-mirrror daemon. Or should we introduce a new
> daemon in generic way to handle image requests from kernel or others in
> watch-notify way.
>       In my use case, we always have a rbd-mirror daemon if we have
> to use journaling in krbd, so I think rbd-mirror sounds not bad to me.
> What's your opinion, Jason? Do you think it's proper to do this work
> in rbd-mirror daemon from the point of view on the design of rbd 
> mirroring?
>
> Thanx a lot.
>>
>> Thanks,
>>
>>                  Ilya
>>
>
>
>
Ilya Dryomov Dec. 5, 2018, 10:17 a.m. UTC | #8
On Wed, Dec 5, 2018 at 3:46 AM Dongsheng Yang
<dongsheng.yang@easystack.cn> wrote:
>
> Hi Ilya and Jason,
>      Maybe there is another option, umh (user mod helper):
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/umh.h
> We can provide a subcommand in userspace for journal replaying, and we
> can check the journal in rbd map and reaquire exclusive-lock, if we
> find there is uncommitted entry, we can call userspace helper by umh to
> replay it.

Yes, making an upcall from the kernel might be an option, but the
problem is that this can happen deep in the I/O path.  I'm not sure
it's safe wrt memory allocation deadlocks because the helper is run
out of a regular workqueue, etc.

Another option might be to daemonize the "rbd map" process.

Or maybe attempting a minimal replay in the kernel and going read-only
in case something is wrong is actually fine as a starting point...

Thanks,

                Ilya
Jason Dillaman Dec. 5, 2018, 4:16 p.m. UTC | #9
On Wed, Dec 5, 2018 at 5:17 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>
> On Wed, Dec 5, 2018 at 3:46 AM Dongsheng Yang
> <dongsheng.yang@easystack.cn> wrote:
> >
> > Hi Ilya and Jason,
> >      Maybe there is another option, umh (user mod helper):
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/umh.h
> > We can provide a subcommand in userspace for journal replaying, and we
> > can check the journal in rbd map and reaquire exclusive-lock, if we
> > find there is uncommitted entry, we can call userspace helper by umh to
> > replay it.
>
> Yes, making an upcall from the kernel might be an option, but the
> problem is that this can happen deep in the I/O path.  I'm not sure
> it's safe wrt memory allocation deadlocks because the helper is ran
> out of a regular workqueue, etc.
>
> Another option might be to daemonize "rbd map" process.
>
> Or maybe attempting a minimal replay in the kernel and going read-only
> in case something is wrong is actually fine as a starting point...

I'd vote for daemonizing "rbd map" (or a similar small, purpose-built
tool). The local "rbd-mirror" daemon process doesn't currently open a
watch on local primary images and in the case of one-way mirroring,
you would potentially not even have an "rbd-mirror" daemon running
locally.

> Thanks,
>
>                 Ilya
Dongsheng Yang Dec. 6, 2018, 5:33 a.m. UTC | #10
On 12/06/2018 12:16 AM, Jason Dillaman wrote:
> On Wed, Dec 5, 2018 at 5:17 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>> On Wed, Dec 5, 2018 at 3:46 AM Dongsheng Yang
>> <dongsheng.yang@easystack.cn> wrote:
>>> Hi Ilya and Jason,
>>>       Maybe there is another option, umh (user mod helper):
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/umh.h
>>> We can provide a subcommand in userspace for journal replaying, and we
>>> can check the journal in rbd map and reaquire exclusive-lock, if we
>>> find there is uncommitted entry, we can call userspace helper by umh to
>>> replay it.
>> Yes, making an upcall from the kernel might be an option, but the
>> problem is that this can happen deep in the I/O path.  I'm not sure
>> it's safe wrt memory allocation deadlocks because the helper is ran
>> out of a regular workqueue, etc.
>>
>> Another option might be to daemonize "rbd map" process.
>>
>> Or maybe attempting a minimal replay in the kernel and going read-only
>> in case something is wrong is actually fine as a starting point...
> I'd vote for daemonizing "rbd map" (or a similar small, purpose-built
> tool).

+1 for this.
> The local "rbd-mirror" daemon process doesn't currently open a
> watch on local primary images and in the case of one-way mirroring,
> you would potentially not even have an "rbd-mirror" daemon running
> locally.
>
>> Thanks,
>>
>>                  Ilya
>
>
Dongsheng Yang Dec. 6, 2018, 5:44 a.m. UTC | #11
On 12/06/2018 01:33 PM, Dongsheng Yang wrote:
>
>
> On 12/06/2018 12:16 AM, Jason Dillaman wrote:
>> On Wed, Dec 5, 2018 at 5:17 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>>> On Wed, Dec 5, 2018 at 3:46 AM Dongsheng Yang
>>> <dongsheng.yang@easystack.cn> wrote:
>>>> Hi Ilya and Jason,
>>>>       Maybe there is another option, umh (user mod helper):
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/umh.h 
>>>>
>>>> We can provide a subcommand in userspace for journal replaying, and we
>>>> can check the journal in rbd map and reaquire exclusive-lock, if we
>>>> find there is uncommitted entry, we can call userspace helper by 
>>>> umh to
>>>> replay it.
>>> Yes, making an upcall from the kernel might be an option, but the
>>> problem is that this can happen deep in the I/O path.  I'm not sure
>>> it's safe wrt memory allocation deadlocks because the helper is ran
>>> out of a regular workqueue, etc.
>>>
>>> Another option might be to daemonize "rbd map" process.
>>>
>>> Or maybe attempting a minimal replay in the kernel and going read-only
>>> in case something is wrong is actually fine as a starting point...
>> I'd vote for daemonizing "rbd map" (or a similar small, purpose-built
>> tool).
>
> +1 for this.
Can we introduce a new service to watch all images mapped on a node?
If we start a new process for each rbd map, that's a little expensive. We
can start a single service to watch all of them, registering in rbd map and
unregistering in rbd unmap.
>> The local "rbd-mirror" daemon process doesn't currently open a
>> watch on local primary images and in the case of one-way mirroring,
>> you would potentially not even have an "rbd-mirror" daemon running
>> locally.
>>
>>> Thanks,
>>>
>>>                  Ilya
>>
>>
>
>
>
Jason Dillaman Dec. 6, 2018, 2:35 p.m. UTC | #12
On Thu, Dec 6, 2018 at 12:44 AM Dongsheng Yang
<dongsheng.yang@easystack.cn> wrote:
>
>
>
> On 12/06/2018 01:33 PM, Dongsheng Yang wrote:
> >
> >
> > On 12/06/2018 12:16 AM, Jason Dillaman wrote:
> >> On Wed, Dec 5, 2018 at 5:17 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> >>> On Wed, Dec 5, 2018 at 3:46 AM Dongsheng Yang
> >>> <dongsheng.yang@easystack.cn> wrote:
> >>>> Hi Ilya and Jason,
> >>>>       Maybe there is another option, umh (user mod helper):
> >>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/umh.h
> >>>>
> >>>> We can provide a subcommand in userspace for journal replaying, and we
> >>>> can check the journal in rbd map and reaquire exclusive-lock, if we
> >>>> find there is uncommitted entry, we can call userspace helper by
> >>>> umh to
> >>>> replay it.
> >>> Yes, making an upcall from the kernel might be an option, but the
> >>> problem is that this can happen deep in the I/O path.  I'm not sure
> >>> it's safe wrt memory allocation deadlocks because the helper is ran
> >>> out of a regular workqueue, etc.
> >>>
> >>> Another option might be to daemonize "rbd map" process.
> >>>
> >>> Or maybe attempting a minimal replay in the kernel and going read-only
> >>> in case something is wrong is actually fine as a starting point...
> >> I'd vote for daemonizing "rbd map" (or a similar small, purpose-built
> >> tool).
> >
> > +1 for this.
> Can we introduce a new service to watch all images mapped on a node?
> If we start a new process for each rbd map, that's a little expensive. we
> can start a single service to watch all of them, register in rbd map and
> unregister in rbd unmap.

What's your concern re: the expense of one daemon per image? If you
have a single daemon per-node, what's your communication channel to
alert the daemon of map/unmap events?

> >> The local "rbd-mirror" daemon process doesn't currently open a
> >> watch on local primary images and in the case of one-way mirroring,
> >> you would potentially not even have an "rbd-mirror" daemon running
> >> locally.
> >>
> >>> Thanks,
> >>>
> >>>                  Ilya
> >>
> >>
> >
> >
> >
>
>
Dongsheng Yang Dec. 7, 2018, 1:56 a.m. UTC | #13
On 12/06/2018 10:35 PM, Jason Dillaman wrote:
> On Thu, Dec 6, 2018 at 12:44 AM Dongsheng Yang
> <dongsheng.yang@easystack.cn> wrote:
>>
>>
>> On 12/06/2018 01:33 PM, Dongsheng Yang wrote:
>>>
>>> On 12/06/2018 12:16 AM, Jason Dillaman wrote:
>>>> On Wed, Dec 5, 2018 at 5:17 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>>>>> On Wed, Dec 5, 2018 at 3:46 AM Dongsheng Yang
>>>>> <dongsheng.yang@easystack.cn> wrote:
>>>>>> Hi Ilya and Jason,
>>>>>>        Maybe there is another option, umh (user mod helper):
>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/umh.h
>>>>>>
>>>>>> We can provide a subcommand in userspace for journal replaying, and we
>>>>>> can check the journal in rbd map and reaquire exclusive-lock, if we
>>>>>> find there is uncommitted entry, we can call userspace helper by
>>>>>> umh to
>>>>>> replay it.
>>>>> Yes, making an upcall from the kernel might be an option, but the
>>>>> problem is that this can happen deep in the I/O path.  I'm not sure
>>>>> it's safe wrt memory allocation deadlocks because the helper is ran
>>>>> out of a regular workqueue, etc.
>>>>>
>>>>> Another option might be to daemonize "rbd map" process.
>>>>>
>>>>> Or maybe attempting a minimal replay in the kernel and going read-only
>>>>> in case something is wrong is actually fine as a starting point...
>>>> I'd vote for daemonizing "rbd map" (or a similar small, purpose-built
>>>> tool).
>>> +1 for this.
>> Can we introduce a new service to watch all images mapped on a node?
>> If we start a new process for each rbd map, that's a little expensive. we
>> can start a single service to watch all of them, register in rbd map and
>> unregister in rbd unmap.
> What's your concern re: the expense of one daemon per image? If you

We probably have lots of images mapped on a node when we are using
k8s with krbd, so each mapping would start a new process to watch an
image. But the work it needs to do (replaying entries) is very simple
and rare, so I think that's a little wasteful.
> have a single daemon per-node, what's your communication channel to
> alert the daemon of map/unmap events?

Maybe shared memory: we can define a unique shm_key for this shared
memory (what I did in another project was to pick a number from Pi,
such as "983367", the 501st ~ 506th digits of Pi).

Then the daemon can create the shared memory when starting and destroy
it when stopping.

The rbd map command can find the shared memory with the specified
shm_key. If found, register with it; otherwise, fail to map. We can also
share the lock in shared memory and solve the problem of a process
being killed without unlocking by using PTHREAD_MUTEX_ROBUST.

That's just an option; we can make it a future improvement if we
really find it's worth doing. :)

Thanx
>
>>>> The local "rbd-mirror" daemon process doesn't currently open a
>>>> watch on local primary images and in the case of one-way mirroring,
>>>> you would potentially not even have an "rbd-mirror" daemon running
>>>> locally.
>>>>
>>>>> Thanks,
>>>>>
>>>>>                   Ilya
>>>>
>>>
>>>
>>
>
Dongsheng Yang Jan. 7, 2019, 2:50 a.m. UTC | #14
On 12/06/2018 12:16 AM, Jason Dillaman wrote:
> On Wed, Dec 5, 2018 at 5:17 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>> On Wed, Dec 5, 2018 at 3:46 AM Dongsheng Yang
>> <dongsheng.yang@easystack.cn> wrote:
>>> Hi Ilya and Jason,
>>>       Maybe there is another option, umh (user mod helper):
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/umh.h
>>> We can provide a subcommand in userspace for journal replaying, and we
>>> can check the journal in rbd map and reaquire exclusive-lock, if we
>>> find there is uncommitted entry, we can call userspace helper by umh to
>>> replay it.
>> Yes, making an upcall from the kernel might be an option, but the
>> problem is that this can happen deep in the I/O path.  I'm not sure
>> it's safe wrt memory allocation deadlocks because the helper is ran
>> out of a regular workqueue, etc.
>>
>> Another option might be to daemonize "rbd map" process.
>>
>> Or maybe attempting a minimal replay in the kernel and going read-only
>> in case something is wrong is actually fine as a starting point...
> I'd vote for daemonizing "rbd map" (or a similar small, purpose-built
> tool). The local "rbd-mirror" daemon process doesn't currently open a
> watch on local primary images and in the case of one-way mirroring,
> you would potentially not even have an "rbd-mirror" daemon running
> locally.

Hi Ilya and Jason,
     When I went to implement it this way, I found a case I don't have
a good solution for: how do we notify the daemon to exit on rbd unmap?

In my design, two new messages will be introduced:
   NOTIFY_OP_JOURNAL_REPLAY_REQUEST      = 17,
   NOTIFY_OP_JOURNAL_REPLAY_COMPLETE     = 18,

and both messages carry a pair of (tag_tid, replay_id) to specify a
unique journal replay request. That makes sure the replay_complete we
receive is the one we are waiting for after our replay_request.

But for rbd unmap, if we use watch-notify on this image, all of the
rbd map daemons watching this image will get the notification.

Do you have any suggestion about this case?


Thanx

>
>> Thanks,
>>
>>                  Ilya
>
>
Ilya Dryomov Jan. 7, 2019, 12:24 p.m. UTC | #15
On Mon, Jan 7, 2019 at 3:51 AM Dongsheng Yang
<dongsheng.yang@easystack.cn> wrote:
>
>
>
> On 12/06/2018 12:16 AM, Jason Dillaman wrote:
> > On Wed, Dec 5, 2018 at 5:17 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> >> On Wed, Dec 5, 2018 at 3:46 AM Dongsheng Yang
> >> <dongsheng.yang@easystack.cn> wrote:
> >>> Hi Ilya and Jason,
> >>>       Maybe there is another option, umh (user mod helper):
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/umh.h
> >>> We can provide a subcommand in userspace for journal replaying, and we
> >>> can check the journal in rbd map and reaquire exclusive-lock, if we
> >>> find there is uncommitted entry, we can call userspace helper by umh to
> >>> replay it.
> >> Yes, making an upcall from the kernel might be an option, but the
> >> problem is that this can happen deep in the I/O path.  I'm not sure
> >> it's safe wrt memory allocation deadlocks because the helper is ran
> >> out of a regular workqueue, etc.
> >>
> >> Another option might be to daemonize "rbd map" process.
> >>
> >> Or maybe attempting a minimal replay in the kernel and going read-only
> >> in case something is wrong is actually fine as a starting point...
> > I'd vote for daemonizing "rbd map" (or a similar small, purpose-built
> > tool). The local "rbd-mirror" daemon process doesn't currently open a
> > watch on local primary images and in the case of one-way mirroring,
> > you would potentially not even have an "rbd-mirror" daemon running
> > locally.
>
> Hi Ilya and Jason,
>      When I want to go to implement in this way, I found a case I don't
> have a good solution, how to notify the deamon to exit in rbd unmap?
>
> In my design, there are two new messages will be introduced:
>    NOTIFY_OP_JOURNAL_REPLAY_REQUEST      = 17,
>    NOTIFY_OP_JOURNAL_REPLAY_COMPLETE     = 18,
>
> and both of the two messages have a pair of (tag_tid, replay_id) to specify
> a unique journal_replay_request. That can make sure we are receiving the
> replay_complete what we are waiting after replay_request.
>
> But about the rbd unmap, if we use the watch-notify on this image, all
> of the rbd map daemons watching on this image will get this notification.
>
> Do you have any suggestion about this case?

Hi Dongsheng,

(I'm not answering your question, but rather just writing up my
thoughts on this.)

I've thought about this and I'm not sure I like the "do everything in
the daemon" approach.  krbd is self-contained and I consider this one
of its strengths -- even mapping and unmapping images doesn't strictly
require any installed packages or dependencies.

What relying on the daemon (whether it's a daemonized "rbd map" process
or something else) for replay really means is an external dependency in
the I/O path.  A daemon can be killed at any time, OOMed, etc.  This is
a non-starter for a block device...

It seems like attempting a minimal replay in the kernel is the best we
can do in the general case.  If it fails, we return an error (map case)
or go read-only (general case).  And of course we can do a full replay
at "rbd map" time if the CLI tool is used for mapping the image.

A daemon that we can ask to do a replay for us might be an option in the
future, but only as a supplement and not as a hard dependency.

Thanks,

                Ilya
Dongsheng Yang Jan. 8, 2019, 1:28 a.m. UTC | #16
On 01/07/2019 08:24 PM, Ilya Dryomov wrote:
> On Mon, Jan 7, 2019 at 3:51 AM Dongsheng Yang
> <dongsheng.yang@easystack.cn> wrote:
>>
>>
>> On 12/06/2018 12:16 AM, Jason Dillaman wrote:
>>> On Wed, Dec 5, 2018 at 5:17 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>>>> On Wed, Dec 5, 2018 at 3:46 AM Dongsheng Yang
>>>> <dongsheng.yang@easystack.cn> wrote:
>>>>> Hi Ilya and Jason,
>>>>>        Maybe there is another option, umh (user mod helper):
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/umh.h
>>>>> We can provide a subcommand in userspace for journal replaying, and we
>>>>> can check the journal in rbd map and reaquire exclusive-lock, if we
>>>>> find there is uncommitted entry, we can call userspace helper by umh to
>>>>> replay it.
>>>> Yes, making an upcall from the kernel might be an option, but the
>>>> problem is that this can happen deep in the I/O path.  I'm not sure
>>>> it's safe wrt memory allocation deadlocks because the helper is ran
>>>> out of a regular workqueue, etc.
>>>>
>>>> Another option might be to daemonize "rbd map" process.
>>>>
>>>> Or maybe attempting a minimal replay in the kernel and going read-only
>>>> in case something is wrong is actually fine as a starting point...
>>> I'd vote for daemonizing "rbd map" (or a similar small, purpose-built
>>> tool). The local "rbd-mirror" daemon process doesn't currently open a
>>> watch on local primary images and in the case of one-way mirroring,
>>> you would potentially not even have an "rbd-mirror" daemon running
>>> locally.
>> Hi Ilya and Jason,
>>       When I want to go to implement in this way, I found a case I don't
>> have a good solution, how to notify the deamon to exit in rbd unmap?
>>
>> In my design, there are two new messages will be introduced:
>>     NOTIFY_OP_JOURNAL_REPLAY_REQUEST      = 17,
>>     NOTIFY_OP_JOURNAL_REPLAY_COMPLETE     = 18,
>>
>> and both of the two messages have a pair of (tag_tid, replay_id) to specify
>> a unique journal_replay_request. That can make sure we are receiving the
>> replay_complete what we are waiting after replay_request.
>>
>> But about the rbd unmap, if we use the watch-notify on this image, all
>> of the rbd map daemons watching on this image will get this notification.
>>
>> Do you have any suggestion about this case?
> Hi Dongsheng,
>
> (I'm not answering your question, but rather just writing up my
> thoughts on this.)
>
> I've thought about this and I'm not sure I like the "do everything in
> the daemon" approach.  krbd is self-contained and I consider this one
> of its strengths -- even mapping and unmapping images doesn't strictly
> require any installed packages or dependencies.
>
> What relying on the daemon (whether it's a daemonized "rbd map" process
> or something else) for replay really means is an external dependency in
> the I/O path.  A daemon can be killed at any time, OOMed, etc.  This is
> a non-starter for a block device...
>
> It seems like attempting a minimal replay in the kernel is the best we
> can do in the general case.  If it fails, we return an error (map case)
> or go read-only (general case).  And of course we can do a full replay
> at "rbd map" time if the CLI tool is used for mapping the image.

Hi Ilya,
     Thanx for your suggestion; this is the longest road but the best
solution in my opinion. I am glad we can take this way, starting from a
minimal replay, and I am glad to implement the full replay step by step
in the future.

Thanx
>
> A daemon that we can ask to do a reply for us might be an option in the
> future, but only as a supplement and not as a hard dependency.
>
> Thanks,
>
>                  Ilya
>

Patch

diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
index e019de5..9d00d3e 100755
--- a/qa/workunits/rbd/rbd_mirror_helpers.sh
+++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
@@ -854,9 +854,9 @@  write_image()

     test -n "${size}" || size=4096

-    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
-       --io-size ${size} --io-threads 1 --io-total $((size * count)) \
-       --io-pattern rand
+    rbd --cluster ${cluster} -p ${pool} map ${image}
+    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --size $((size * count)) --group_reporting --eta-newline 1
+    rbd --cluster ${cluster} -p ${pool} unmap ${image}
 }

 stress_write_image()