[RFC,0/4] rbd journaling feature
diff mbox

Message ID 1534399172-27610-1-git-send-email-dongsheng.yang@easystack.cn
State New
Headers show

Commit Message

Dongsheng Yang Aug. 16, 2018, 5:59 a.m. UTC
Hi all,
   This patchset implement the journaling feature in kernel rbd, which makes mirroring in kubernetes possible.

This is an RFC patchset, and it passed the /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh, with a little change
as below:

```
[root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
```

That means this patchset is working well in mirroring data. There are some TODOs in comments, but most
of them are about performance improvement.

So I think it's a good timing to ask for comments from all of you guys.

If you want to play with it, there is a simple script mirroring xfs below:

```
./mstart.sh remote -k  -l --bluestore
./mstart.sh local -k  -l --bluestore

rados -c ./run/local/ceph.conf rmpool rbd rbd --yes-i-really-really-mean-it
rados -c ./run/remote/ceph.conf rmpool rbd rbd --yes-i-really-really-mean-it

rados -c ./run/local/ceph.conf mkpool rbd
rados -c ./run/remote/ceph.conf mkpool rbd

rbd -c ./run/local/ceph.conf mirror pool enable rbd image
rbd -c ./run/remote/ceph.conf mirror pool enable rbd image

rbd -c ./run/local/ceph.conf mirror pool peer add rbd client.admin@remote
rbd -c ./run/remote/ceph.conf mirror pool peer add rbd client.admin@local

rbd -c ./run/remote/ceph.conf mirror pool info rbd
rbd -c ./run/local/ceph.conf mirror pool info rbd

rbd -c ./run/local/ceph.conf create test --image-feature layering --image-feature exclusive-lock --image-feature journaling -s 100M
rbd -c ./run/local/ceph.conf mirror image enable test

rbd -c ./run/remote/ceph.conf ls

rbd -c ./run/local/ceph.conf map test

mkfs.xfs /dev/rbd0

mount /dev/rbd0 /mnt/rbd0

dd if=/dev/urandom of=/mnt/rbd0/data bs=128K count=1
md5sum /mnt/rbd0/data
sync
data_1=`md5sum /mnt/rbd0/data|awk '{print $1}'`

umount /mnt/rbd0

rbd unmap /dev/rbd0

until rbd -c ./run/remote/ceph.conf ls |grep test; do
	sleep 1
done

rbd -c ./run/local/ceph.conf mirror image demote test
sleep 3

rbd -c ./run/remote/ceph.conf mirror image promote test


rbd -c ./run/remote/ceph.conf map test

mount /dev/rbd0 /mnt/rbd0
md5sum /mnt/rbd0/data
data_2=`md5sum /mnt/rbd0/data|awk '{print $1}'`

echo data_1: $data_1
echo data_2: $data_2

if (( "$data_1" != "$data_2" )); then
	echo "failed"
else
	echo "pass"
fi

umount /mnt/rbd0
rbd unmap /dev/rbd0

exit
```

Any comment is welcome!

Dongsheng Yang (4):
  libceph: support op append
  libceph: introduce cls_journaler_client
  libceph: introduce generic journaling
  rbd: enable journaling

 drivers/block/rbd.c                       |  478 +++++++++++-
 include/linux/ceph/cls_journaler_client.h |   87 +++
 include/linux/ceph/journaler.h            |  131 ++++
 net/ceph/Makefile                         |    3 +-
 net/ceph/cls_journaler_client.c           |  501 ++++++++++++
 net/ceph/journaler.c                      | 1208 +++++++++++++++++++++++++++++
 net/ceph/osd_client.c                     |   13 +-
 7 files changed, 2409 insertions(+), 12 deletions(-)
 create mode 100644 include/linux/ceph/cls_journaler_client.h
 create mode 100644 include/linux/ceph/journaler.h
 create mode 100644 net/ceph/cls_journaler_client.c
 create mode 100644 net/ceph/journaler.c

Comments

Ilya Dryomov Aug. 16, 2018, 1:45 p.m. UTC | #1
On Thu, Aug 16, 2018 at 8:00 AM Dongsheng Yang
<dongsheng.yang@easystack.cn> wrote:
>
> Hi all,
>    This patchset implement the journaling feature in kernel rbd, which makes mirroring in kubernetes possible.
>
> This is an RFC patchset, and it passed the /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh, with a little change
> as below:
>
> ```
> [root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
> diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
> index e019de5..9d00d3e 100755
> --- a/qa/workunits/rbd/rbd_mirror_helpers.sh
> +++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
> @@ -854,9 +854,9 @@ write_image()
>
>      test -n "${size}" || size=4096
>
> -    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
> -       --io-size ${size} --io-threads 1 --io-total $((size * count)) \
> -       --io-pattern rand
> +    rbd --cluster ${cluster} -p ${pool} map ${image}
> +    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --group_reporting --eta-newline
> +    rbd --cluster ${cluster} -p ${pool} unmap ${image}
>  }
>
>  stress_write_image()
> ```
>
> That means this patchset is working well in mirroring data. There are some TODOs in comments, but most
> of them are about performance improvement.
>
> So I think it's a good timing to ask for comments from all of you guys.

Hi Dongsheng,

First of all, thanks for taking this on!

I'll skip over the minor technical issues and style nits and focus on
the three high-level things that immediately jumped at me:

1. Have you thought about the consequences of supporting just WRITE and
   DISCARD journal entries in rbd_journal_replay()?  The fact that the
   kernel client would only ever emit those two types of entries doesn't
   prevent someone from attempting to map an image with a bunch of
   SNAP_CREATEs or other unsupported entries in the journal.

   I think we should avoid replaying the journal in the kernel and try
   to come up with a design where that is done by librbd in userspace.

2. You mention some performance-related TODOs, but I don't see anything
   about extra data copies.  rbd_journal_append_write_event() copies
   the data payload four (!) times before handing it to the messenger:

   - in rbd_journal_append_write_event(), from @bio to @data
   - in rbd_journal_append_write_event(), from @data to @buf
   - in ceph_journaler_object_append(), from @buf to another @buf
     (this time vmalloc'ed?)
   - in ceph_journaler_obj_write_sync(), from @buf to @pages

   Since the data is required to be durable in the journal before any
   writes to the image objects are issued, not even reference counting
   is necessary, let alone copying.  The messenger should be getting it
   straight from the bio.

3. There is a mutex held across network calls in journaler.c.  This is
   the case with some of the existing code as well (exclusive lock and
   refresh) and it is a huge pain.  Cancelling is almost impossible and
   as I know you know this is what is blocking "rbd unmap -o full-force"
   and rbd-level timeouts.  No new such code will be accepted upstream.

Thanks,

                Ilya
Dongsheng Yang Aug. 17, 2018, 7:59 a.m. UTC | #2
Hi Ilya,

On 08/16/2018 09:45 PM, Ilya Dryomov wrote:
> On Thu, Aug 16, 2018 at 8:00 AM Dongsheng Yang
> <dongsheng.yang@easystack.cn> wrote:
>> Hi all,
>>     This patchset implement the journaling feature in kernel rbd, which makes mirroring in kubernetes possible.
>>
>> This is an RFC patchset, and it passed the /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh, with a little change
>> as below:
>>
>> ```
>> [root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
>> diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
>> index e019de5..9d00d3e 100755
>> --- a/qa/workunits/rbd/rbd_mirror_helpers.sh
>> +++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
>> @@ -854,9 +854,9 @@ write_image()
>>
>>       test -n "${size}" || size=4096
>>
>> -    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
>> -       --io-size ${size} --io-threads 1 --io-total $((size * count)) \
>> -       --io-pattern rand
>> +    rbd --cluster ${cluster} -p ${pool} map ${image}
>> +    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --group_reporting --eta-newline
>> +    rbd --cluster ${cluster} -p ${pool} unmap ${image}
>>   }
>>
>>   stress_write_image()
>> ```
>>
>> That means this patchset is working well in mirroring data. There are some TODOs in comments, but most
>> of them are about performance improvement.
>>
>> So I think it's a good timing to ask for comments from all of you guys.
> Hi Dongsheng,
>
> First of all, thanks for taking this on!
>
> I'll skip over the minor technical issues and style nits and focus on
> the three high-level things that immediately jumped at me:

Great, that's what I want in the RFC round reviewing :)
>
> 1. Have you thought about the consequences of supporting just WRITE and
>     DISCARD journal entries in rbd_journal_replay()?  The fact that the
>     kernel client would only ever emit those two types of entries doesn't
>     prevent someone from attempting to map an image with a bunch of
>     SNAP_CREATEs or other unsupported entries in the journal.
>
>     I think we should avoid replaying the journal in the kernel and try
>     to come up with a design where that is done by librbd in userspace.

In my opinion,  we should prevent user to map an image with unsupported 
events in
his journal. Maybe return error in start_replay() and fail the map 
operation. But the start_replay()
is void now, I think we need an int return-value for it.

And at the same times, we can output a message to suggest user to do 
something before mapping,
such as rbd journal reset.
>
> 2. You mention some performance-related TODOs, but I don't see anything
>     about extra data copies.  rbd_journal_append_write_event() copies
>     the data payload four (!) times before handing it to the messenger:
>
>     - in rbd_journal_append_write_event(), from @bio to @data
>     - in rbd_journal_append_write_event(), from @data to @buf
>     - in ceph_journaler_object_append(), from @buf to another @buf
>       (this time vmalloc'ed?)
>     - in ceph_journaler_obj_write_sync(), from @buf to @pages

Yes, there are times of copies, and I really need put this thing in my 
TODO list. thanx for you suggestion.
>
>     Since the data is required to be durable in the journal before any
>     writes to the image objects are issued, not even reference counting
>     is necessary, let alone copying.  The messenger should be getting it
>     straight from the bio.

Let me think about how to avoid these copies.
>
> 3. There is a mutex held across network calls in journaler.c.  This is
>     the case with some of the existing code as well (exclusive lock and
>     refresh) and it is a huge pain.  Cancelling is almost impossible and
>     as I know you know this is what is blocking "rbd unmap -o full-force"
>     and rbd-level timeouts.  No new such code will be accepted upstream.

Ah, yes, that's what I was concerning. BTW, Ilya, do you already have a 
time-lines for rbd-level timeout feature?

Thanx Ilya, your comments are helpful (as always)
Dongsheng
>
> Thanks,
>
>                  Ilya
>
Ilya Dryomov Aug. 17, 2018, 9:30 a.m. UTC | #3
On Fri, Aug 17, 2018 at 10:00 AM Dongsheng Yang
<dongsheng.yang@easystack.cn> wrote:
>
> Hi Ilya,
>
> On 08/16/2018 09:45 PM, Ilya Dryomov wrote:
> > On Thu, Aug 16, 2018 at 8:00 AM Dongsheng Yang
> > <dongsheng.yang@easystack.cn> wrote:
> >> Hi all,
> >>     This patchset implement the journaling feature in kernel rbd, which makes mirroring in kubernetes possible.
> >>
> >> This is an RFC patchset, and it passed the /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh, with a little change
> >> as below:
> >>
> >> ```
> >> [root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
> >> diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
> >> index e019de5..9d00d3e 100755
> >> --- a/qa/workunits/rbd/rbd_mirror_helpers.sh
> >> +++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
> >> @@ -854,9 +854,9 @@ write_image()
> >>
> >>       test -n "${size}" || size=4096
> >>
> >> -    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
> >> -       --io-size ${size} --io-threads 1 --io-total $((size * count)) \
> >> -       --io-pattern rand
> >> +    rbd --cluster ${cluster} -p ${pool} map ${image}
> >> +    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --group_reporting --eta-newline
> >> +    rbd --cluster ${cluster} -p ${pool} unmap ${image}
> >>   }
> >>
> >>   stress_write_image()
> >> ```
> >>
> >> That means this patchset is working well in mirroring data. There are some TODOs in comments, but most
> >> of them are about performance improvement.
> >>
> >> So I think it's a good timing to ask for comments from all of you guys.
> > Hi Dongsheng,
> >
> > First of all, thanks for taking this on!
> >
> > I'll skip over the minor technical issues and style nits and focus on
> > the three high-level things that immediately jumped at me:
>
> Great, that's what I want in the RFC round reviewing :)
> >
> > 1. Have you thought about the consequences of supporting just WRITE and
> >     DISCARD journal entries in rbd_journal_replay()?  The fact that the
> >     kernel client would only ever emit those two types of entries doesn't
> >     prevent someone from attempting to map an image with a bunch of
> >     SNAP_CREATEs or other unsupported entries in the journal.
> >
> >     I think we should avoid replaying the journal in the kernel and try
> >     to come up with a design where that is done by librbd in userspace.
>
> In my opinion,  we should prevent user to map an image with unsupported
> events in
> his journal. Maybe return error in start_replay() and fail the map
> operation. But the start_replay()
> is void now, I think we need an int return-value for it.
>
> And at the same times, we can output a message to suggest user to do
> something before mapping,
> such as rbd journal reset.

"rbd journal reset" just discards all journal entries, doesn't it?
Suggesting something like that would be pretty bad...

> >
> > 2. You mention some performance-related TODOs, but I don't see anything
> >     about extra data copies.  rbd_journal_append_write_event() copies
> >     the data payload four (!) times before handing it to the messenger:
> >
> >     - in rbd_journal_append_write_event(), from @bio to @data
> >     - in rbd_journal_append_write_event(), from @data to @buf
> >     - in ceph_journaler_object_append(), from @buf to another @buf
> >       (this time vmalloc'ed?)
> >     - in ceph_journaler_obj_write_sync(), from @buf to @pages
>
> Yes, there are times of copies, and I really need put this thing in my
> TODO list. thanx for you suggestion.
> >
> >     Since the data is required to be durable in the journal before any
> >     writes to the image objects are issued, not even reference counting
> >     is necessary, let alone copying.  The messenger should be getting it
> >     straight from the bio.
>
> Let me think about how to avoid these copies.
> >
> > 3. There is a mutex held across network calls in journaler.c.  This is
> >     the case with some of the existing code as well (exclusive lock and
> >     refresh) and it is a huge pain.  Cancelling is almost impossible and
> >     as I know you know this is what is blocking "rbd unmap -o full-force"
> >     and rbd-level timeouts.  No new such code will be accepted upstream.
>
> Ah, yes, that's what I was concerning. BTW, Ilya, do you already have a
> time-lines for rbd-level timeout feature?

I made some headway on refactoring exclusive lock code, but it's
nowhere close to completion yet.

Thanks,

                Ilya
Dongsheng Yang Aug. 20, 2018, 1:31 a.m. UTC | #4
On 08/17/2018 05:30 PM, Ilya Dryomov wrote:
> On Fri, Aug 17, 2018 at 10:00 AM Dongsheng Yang
> <dongsheng.yang@easystack.cn> wrote:
>> Hi Ilya,
>>
>> On 08/16/2018 09:45 PM, Ilya Dryomov wrote:
>>> On Thu, Aug 16, 2018 at 8:00 AM Dongsheng Yang
>>> <dongsheng.yang@easystack.cn> wrote:
>>>> Hi all,
>>>>      This patchset implement the journaling feature in kernel rbd, which makes mirroring in kubernetes possible.
>>>>
>>>> This is an RFC patchset, and it passed the /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh, with a little change
>>>> as below:
>>>>
>>>> ```
>>>> [root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
>>>> diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
>>>> index e019de5..9d00d3e 100755
>>>> --- a/qa/workunits/rbd/rbd_mirror_helpers.sh
>>>> +++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
>>>> @@ -854,9 +854,9 @@ write_image()
>>>>
>>>>        test -n "${size}" || size=4096
>>>>
>>>> -    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
>>>> -       --io-size ${size} --io-threads 1 --io-total $((size * count)) \
>>>> -       --io-pattern rand
>>>> +    rbd --cluster ${cluster} -p ${pool} map ${image}
>>>> +    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --group_reporting --eta-newline
>>>> +    rbd --cluster ${cluster} -p ${pool} unmap ${image}
>>>>    }
>>>>
>>>>    stress_write_image()
>>>> ```
>>>>
>>>> That means this patchset is working well in mirroring data. There are some TODOs in comments, but most
>>>> of them are about performance improvement.
>>>>
>>>> So I think it's a good timing to ask for comments from all of you guys.
>>> Hi Dongsheng,
>>>
>>> First of all, thanks for taking this on!
>>>
>>> I'll skip over the minor technical issues and style nits and focus on
>>> the three high-level things that immediately jumped at me:
>> Great, that's what I want in the RFC round reviewing :)
>>> 1. Have you thought about the consequences of supporting just WRITE and
>>>      DISCARD journal entries in rbd_journal_replay()?  The fact that the
>>>      kernel client would only ever emit those two types of entries doesn't
>>>      prevent someone from attempting to map an image with a bunch of
>>>      SNAP_CREATEs or other unsupported entries in the journal.
>>>
>>>      I think we should avoid replaying the journal in the kernel and try
>>>      to come up with a design where that is done by librbd in userspace.
>> In my opinion,  we should prevent user to map an image with unsupported
>> events in
>> his journal. Maybe return error in start_replay() and fail the map
>> operation. But the start_replay()
>> is void now, I think we need an int return-value for it.
>>
>> And at the same times, we can output a message to suggest user to do
>> something before mapping,
>> such as rbd journal reset.
> "rbd journal reset" just discards all journal entries, doesn't it?
> Suggesting something like that would be pretty bad...

Actually, what we need to worry about is the uncommitted entries, in rbd 
journal reset command,
it will call start_replay() in librbd when open journal, so the all 
uncommitted entries will be replayed.

But I agree with not suggest something, but only error out with the 
problem of "there are some unsupported events uncommitted."
>
>>> 2. You mention some performance-related TODOs, but I don't see anything
>>>      about extra data copies.  rbd_journal_append_write_event() copies
>>>      the data payload four (!) times before handing it to the messenger:
>>>
>>>      - in rbd_journal_append_write_event(), from @bio to @data
>>>      - in rbd_journal_append_write_event(), from @data to @buf
>>>      - in ceph_journaler_object_append(), from @buf to another @buf
>>>        (this time vmalloc'ed?)
>>>      - in ceph_journaler_obj_write_sync(), from @buf to @pages
>> Yes, there are times of copies, and I really need put this thing in my
>> TODO list. thanx for you suggestion.
>>>      Since the data is required to be durable in the journal before any
>>>      writes to the image objects are issued, not even reference counting
>>>      is necessary, let alone copying.  The messenger should be getting it
>>>      straight from the bio.
>> Let me think about how to avoid these copies.
>>> 3. There is a mutex held across network calls in journaler.c.  This is
>>>      the case with some of the existing code as well (exclusive lock and
>>>      refresh) and it is a huge pain.  Cancelling is almost impossible and
>>>      as I know you know this is what is blocking "rbd unmap -o full-force"
>>>      and rbd-level timeouts.  No new such code will be accepted upstream.
>> Ah, yes, that's what I was concerning. BTW, Ilya, do you already have a
>> time-lines for rbd-level timeout feature?
> I made some headway on refactoring exclusive lock code, but it's
> nowhere close to completion yet.

okey, thanx for your information.


>
> Thanks,
>
>                  Ilya
>

Patch
diff mbox

diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
index e019de5..9d00d3e 100755
--- a/qa/workunits/rbd/rbd_mirror_helpers.sh
+++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
@@ -854,9 +854,9 @@  write_image()
 
     test -n "${size}" || size=4096
 
-    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
-       --io-size ${size} --io-threads 1 --io-total $((size * count)) \
-       --io-pattern rand
+    rbd --cluster ${cluster} -p ${pool} map ${image}
+    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --group_reporting --eta-newline 
+    rbd --cluster ${cluster} -p ${pool} unmap ${image}
 }
 
 stress_write_image()