Message ID | 1534399172-27610-1-git-send-email-dongsheng.yang@easystack.cn (mailing list archive) |
---|---|
State | New, archived |
On Thu, Aug 16, 2018 at 8:00 AM Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
>
> Hi all,
> This patchset implement the journaling feature in kernel rbd, which makes mirroring in kubernetes possible.
>
> This is an RFC patchset, and it passed the /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh, with a little change
> as below:
>
> ```
> [root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
> diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
> index e019de5..9d00d3e 100755
> --- a/qa/workunits/rbd/rbd_mirror_helpers.sh
> +++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
> @@ -854,9 +854,9 @@ write_image()
>
>      test -n "${size}" || size=4096
>
> -    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
> -        --io-size ${size} --io-threads 1 --io-total $((size * count)) \
> -        --io-pattern rand
> +    rbd --cluster ${cluster} -p ${pool} map ${image}
> +    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --group_reporting --eta-newline
> +    rbd --cluster ${cluster} -p ${pool} unmap ${image}
>  }
>
>  stress_write_image()
> ```
>
> That means this patchset is working well in mirroring data. There are some TODOs in comments, but most
> of them are about performance improvement.
>
> So I think it's a good timing to ask for comments from all of you guys.

Hi Dongsheng,

First of all, thanks for taking this on!

I'll skip over the minor technical issues and style nits and focus on
the three high-level things that immediately jumped at me:

1. Have you thought about the consequences of supporting just WRITE and
   DISCARD journal entries in rbd_journal_replay()? The fact that the
   kernel client would only ever emit those two types of entries doesn't
   prevent someone from attempting to map an image with a bunch of
   SNAP_CREATEs or other unsupported entries in the journal.

   I think we should avoid replaying the journal in the kernel and try
   to come up with a design where that is done by librbd in userspace.

2. You mention some performance-related TODOs, but I don't see anything
   about extra data copies. rbd_journal_append_write_event() copies
   the data payload four (!) times before handing it to the messenger:

   - in rbd_journal_append_write_event(), from @bio to @data
   - in rbd_journal_append_write_event(), from @data to @buf
   - in ceph_journaler_object_append(), from @buf to another @buf
     (this time vmalloc'ed?)
   - in ceph_journaler_obj_write_sync(), from @buf to @pages

   Since the data is required to be durable in the journal before any
   writes to the image objects are issued, not even reference counting
   is necessary, let alone copying. The messenger should be getting it
   straight from the bio.

3. There is a mutex held across network calls in journaler.c. This is
   the case with some of the existing code as well (exclusive lock and
   refresh) and it is a huge pain. Cancelling is almost impossible and
   as I know you know this is what is blocking "rbd unmap -o full-force"
   and rbd-level timeouts. No new such code will be accepted upstream.

Thanks,

Ilya
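As a rough userspace illustration of point 2 above (not code from the patchset), the copy chain can be avoided by describing an append as a gather list that references the caller's payload in place. The `struct entry_header` layout and the `build_append_iov()` helper are hypothetical names invented for this sketch; the kernel equivalent would hand the bio's pages to the messenger rather than use `struct iovec`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical fixed-size entry header; NOT the real journal encoding. */
struct entry_header {
    uint64_t offset;
    uint64_t length;
};

/*
 * Describe one journal append as a two-element gather list:
 * iov[0] is the small header, iov[1] points at the caller's payload.
 * No payload bytes are copied, so the caller must keep @payload alive
 * until the write has completed.
 * Returns the total number of bytes the gather list describes.
 */
static size_t build_append_iov(struct iovec iov[2], struct entry_header *hdr,
                               uint64_t offset, const void *payload,
                               size_t len)
{
    assert(iov != NULL && hdr != NULL && payload != NULL);

    hdr->offset = offset;
    hdr->length = len;

    iov[0].iov_base = hdr;
    iov[0].iov_len = sizeof(*hdr);
    iov[1].iov_base = (void *)payload; /* referenced in place, not copied */
    iov[1].iov_len = len;

    return iov[0].iov_len + iov[1].iov_len;
}
```

In userspace the resulting array could be passed to `writev(fd, iov, 2)`. Because the journal write has to be durable before the image write is issued, the payload buffer stays valid for the whole append, which is why no extra reference or copy is needed.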
Hi Ilya,

On 08/16/2018 09:45 PM, Ilya Dryomov wrote:
> On Thu, Aug 16, 2018 at 8:00 AM Dongsheng Yang
> <dongsheng.yang@easystack.cn> wrote:
>> Hi all,
>> This patchset implement the journaling feature in kernel rbd, which makes mirroring in kubernetes possible.
>>
>> This is an RFC patchset, and it passed the /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh, with a little change
>> as below:
>>
>> ```
>> [root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
>> diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
>> index e019de5..9d00d3e 100755
>> --- a/qa/workunits/rbd/rbd_mirror_helpers.sh
>> +++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
>> @@ -854,9 +854,9 @@ write_image()
>>
>>      test -n "${size}" || size=4096
>>
>> -    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
>> -        --io-size ${size} --io-threads 1 --io-total $((size * count)) \
>> -        --io-pattern rand
>> +    rbd --cluster ${cluster} -p ${pool} map ${image}
>> +    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --group_reporting --eta-newline
>> +    rbd --cluster ${cluster} -p ${pool} unmap ${image}
>>  }
>>
>>  stress_write_image()
>> ```
>>
>> That means this patchset is working well in mirroring data. There are some TODOs in comments, but most
>> of them are about performance improvement.
>>
>> So I think it's a good timing to ask for comments from all of you guys.
> Hi Dongsheng,
>
> First of all, thanks for taking this on!
>
> I'll skip over the minor technical issues and style nits and focus on
> the three high-level things that immediately jumped at me:

Great, that's what I want in the RFC round reviewing :)

>
> 1. Have you thought about the consequences of supporting just WRITE and
>    DISCARD journal entries in rbd_journal_replay()? The fact that the
>    kernel client would only ever emit those two types of entries doesn't
>    prevent someone from attempting to map an image with a bunch of
>    SNAP_CREATEs or other unsupported entries in the journal.
>
>    I think we should avoid replaying the journal in the kernel and try
>    to come up with a design where that is done by librbd in userspace.

In my opinion, we should prevent user to map an image with unsupported events in
his journal. Maybe return error in start_replay() and fail the map operation. But the start_replay()
is void now, I think we need an int return-value for it.

And at the same times, we can output a message to suggest user to do something before mapping,
such as rbd journal reset.

>
> 2. You mention some performance-related TODOs, but I don't see anything
>    about extra data copies. rbd_journal_append_write_event() copies
>    the data payload four (!) times before handing it to the messenger:
>
>    - in rbd_journal_append_write_event(), from @bio to @data
>    - in rbd_journal_append_write_event(), from @data to @buf
>    - in ceph_journaler_object_append(), from @buf to another @buf
>      (this time vmalloc'ed?)
>    - in ceph_journaler_obj_write_sync(), from @buf to @pages

Yes, there are times of copies, and I really need put this thing in my TODO list. thanx for you suggestion.

>
>    Since the data is required to be durable in the journal before any
>    writes to the image objects are issued, not even reference counting
>    is necessary, let alone copying. The messenger should be getting it
>    straight from the bio.

Let me think about how to avoid these copies.

>
> 3. There is a mutex held across network calls in journaler.c. This is
>    the case with some of the existing code as well (exclusive lock and
>    refresh) and it is a huge pain. Cancelling is almost impossible and
>    as I know you know this is what is blocking "rbd unmap -o full-force"
>    and rbd-level timeouts. No new such code will be accepted upstream.

Ah, yes, that's what I was concerning. BTW, Ilya, do you already have a time-lines for rbd-level timeout feature?

Thanx Ilya, your comments are helpful (as always)

Dongsheng

>
> Thanks,
>
> Ilya
>
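The idea floated above, giving start_replay() an int return value so that a journal with unsupported events fails the map, might look roughly like the following userspace sketch. Everything here is an assumption made for illustration: the `journal_event_type` enum does not reproduce librbd's real event type values, and `journal_check_replayable()` is a hypothetical helper, not a function from the patchset.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical event types; the real encoding is defined by librbd. */
enum journal_event_type {
    EVENT_WRITE,
    EVENT_DISCARD,
    EVENT_SNAP_CREATE,
    EVENT_OTHER,
};

struct journal_entry {
    enum journal_event_type type;
};

/* Return nonzero if this sketch's "kernel client" can replay the type. */
static int event_supported(enum journal_event_type type)
{
    return type == EVENT_WRITE || type == EVENT_DISCARD;
}

/*
 * Scan every uncommitted entry before replaying any of them.
 * Returns 0 if all entries are replayable and -EOPNOTSUPP otherwise,
 * so the caller can fail the map without a partial replay.
 */
static int journal_check_replayable(const struct journal_entry *entries,
                                    size_t count)
{
    size_t i;

    assert(entries != NULL || count == 0);

    for (i = 0; i < count; i++) {
        if (!event_supported(entries[i].type))
            return -EOPNOTSUPP;
    }
    return 0;
}
```

Checking the whole set of uncommitted entries up front, rather than failing midway through replay, keeps the image untouched when the map is rejected. It does not address the broader concern raised in the review that replay may belong in userspace.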
On Fri, Aug 17, 2018 at 10:00 AM Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
>
> Hi Ilya,
>
> On 08/16/2018 09:45 PM, Ilya Dryomov wrote:
> > On Thu, Aug 16, 2018 at 8:00 AM Dongsheng Yang
> > <dongsheng.yang@easystack.cn> wrote:
> >> Hi all,
> >> This patchset implement the journaling feature in kernel rbd, which makes mirroring in kubernetes possible.
> >>
> >> This is an RFC patchset, and it passed the /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh, with a little change
> >> as below:
> >>
> >> ```
> >> [root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
> >> diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
> >> index e019de5..9d00d3e 100755
> >> --- a/qa/workunits/rbd/rbd_mirror_helpers.sh
> >> +++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
> >> @@ -854,9 +854,9 @@ write_image()
> >>
> >>      test -n "${size}" || size=4096
> >>
> >> -    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
> >> -        --io-size ${size} --io-threads 1 --io-total $((size * count)) \
> >> -        --io-pattern rand
> >> +    rbd --cluster ${cluster} -p ${pool} map ${image}
> >> +    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --group_reporting --eta-newline
> >> +    rbd --cluster ${cluster} -p ${pool} unmap ${image}
> >>  }
> >>
> >>  stress_write_image()
> >> ```
> >>
> >> That means this patchset is working well in mirroring data. There are some TODOs in comments, but most
> >> of them are about performance improvement.
> >>
> >> So I think it's a good timing to ask for comments from all of you guys.
> > Hi Dongsheng,
> >
> > First of all, thanks for taking this on!
> >
> > I'll skip over the minor technical issues and style nits and focus on
> > the three high-level things that immediately jumped at me:
>
> Great, that's what I want in the RFC round reviewing :)
> >
> > 1. Have you thought about the consequences of supporting just WRITE and
> >    DISCARD journal entries in rbd_journal_replay()? The fact that the
> >    kernel client would only ever emit those two types of entries doesn't
> >    prevent someone from attempting to map an image with a bunch of
> >    SNAP_CREATEs or other unsupported entries in the journal.
> >
> >    I think we should avoid replaying the journal in the kernel and try
> >    to come up with a design where that is done by librbd in userspace.
>
> In my opinion, we should prevent user to map an image with unsupported
> events in
> his journal. Maybe return error in start_replay() and fail the map
> operation. But the start_replay()
> is void now, I think we need an int return-value for it.
>
> And at the same times, we can output a message to suggest user to do
> something before mapping,
> such as rbd journal reset.

"rbd journal reset" just discards all journal entries, doesn't it?
Suggesting something like that would be pretty bad...

> >
> > 2. You mention some performance-related TODOs, but I don't see anything
> >    about extra data copies. rbd_journal_append_write_event() copies
> >    the data payload four (!) times before handing it to the messenger:
> >
> >    - in rbd_journal_append_write_event(), from @bio to @data
> >    - in rbd_journal_append_write_event(), from @data to @buf
> >    - in ceph_journaler_object_append(), from @buf to another @buf
> >      (this time vmalloc'ed?)
> >    - in ceph_journaler_obj_write_sync(), from @buf to @pages
>
> Yes, there are times of copies, and I really need put this thing in my
> TODO list. thanx for you suggestion.
> >
> >    Since the data is required to be durable in the journal before any
> >    writes to the image objects are issued, not even reference counting
> >    is necessary, let alone copying. The messenger should be getting it
> >    straight from the bio.
>
> Let me think about how to avoid these copies.
> >
> > 3. There is a mutex held across network calls in journaler.c. This is
> >    the case with some of the existing code as well (exclusive lock and
> >    refresh) and it is a huge pain. Cancelling is almost impossible and
> >    as I know you know this is what is blocking "rbd unmap -o full-force"
> >    and rbd-level timeouts. No new such code will be accepted upstream.
>
> Ah, yes, that's what I was concerning. BTW, Ilya, do you already have a
> time-lines for rbd-level timeout feature?

I made some headway on refactoring exclusive lock code, but it's
nowhere close to completion yet.

Thanks,

Ilya
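For readers unfamiliar with the pattern objected to in point 3, one common way to avoid holding a mutex across a blocking call is to reserve state under the lock, drop it for the call, and retake it only to record the result. The sketch below is a userspace analogy using pthreads, not the approach taken in journaler.c or in the exclusive lock refactoring mentioned above. `struct journaler`, its fields, `journaler_append()` and `fake_send()` are all hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical journaler state; field names are illustrative only. */
struct journaler {
    pthread_mutex_t lock;
    int lock_held;               /* debugging aid: 1 while @lock is held */
    unsigned long next_tid;      /* protected by @lock */
    unsigned long committed_tid; /* protected by @lock */
};

static void journaler_lock(struct journaler *j)
{
    pthread_mutex_lock(&j->lock);
    j->lock_held = 1;
}

static void journaler_unlock(struct journaler *j)
{
    j->lock_held = 0;
    pthread_mutex_unlock(&j->lock);
}

typedef int (*send_fn)(struct journaler *j, unsigned long tid);

/*
 * Stand-in for a blocking network call. It fails if the journaler lock is
 * still held, which lets a single-threaded test observe the locking rule.
 */
static int fake_send(struct journaler *j, unsigned long tid)
{
    (void)tid;
    return j->lock_held ? -1 : 0;
}

/* Append one entry without holding the lock across the blocking send. */
static int journaler_append(struct journaler *j, send_fn send)
{
    unsigned long tid;
    int ret;

    assert(j != NULL && send != NULL);

    journaler_lock(j);
    tid = j->next_tid++;         /* reserve a tid under the lock */
    journaler_unlock(j);

    ret = send(j, tid);          /* blocking call, lock not held */

    journaler_lock(j);
    if (ret == 0 && tid > j->committed_tid)
        j->committed_tid = tid;  /* record the result under the lock */
    journaler_unlock(j);

    return ret;
}
```

Because the lock is free while the send is in flight, another context could take it to cancel or time out the operation, which is the property the review says is missing when a mutex spans network calls. A real implementation would also need to handle out-of-order completions, which this sketch ignores.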
On 08/17/2018 05:30 PM, Ilya Dryomov wrote:
> On Fri, Aug 17, 2018 at 10:00 AM Dongsheng Yang
> <dongsheng.yang@easystack.cn> wrote:
>> Hi Ilya,
>>
>> On 08/16/2018 09:45 PM, Ilya Dryomov wrote:
>>> On Thu, Aug 16, 2018 at 8:00 AM Dongsheng Yang
>>> <dongsheng.yang@easystack.cn> wrote:
>>>> Hi all,
>>>> This patchset implement the journaling feature in kernel rbd, which makes mirroring in kubernetes possible.
>>>>
>>>> This is an RFC patchset, and it passed the /ceph/ceph/qa/workunits/rbd/rbd_mirror.sh, with a little change
>>>> as below:
>>>>
>>>> ```
>>>> [root@atest-guest build]# git diff /ceph/ceph/qa/workunits/rbd/rbd_mirror_helpers.sh
>>>> diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
>>>> index e019de5..9d00d3e 100755
>>>> --- a/qa/workunits/rbd/rbd_mirror_helpers.sh
>>>> +++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
>>>> @@ -854,9 +854,9 @@ write_image()
>>>>
>>>>      test -n "${size}" || size=4096
>>>>
>>>> -    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
>>>> -        --io-size ${size} --io-threads 1 --io-total $((size * count)) \
>>>> -        --io-pattern rand
>>>> +    rbd --cluster ${cluster} -p ${pool} map ${image}
>>>> +    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --group_reporting --eta-newline
>>>> +    rbd --cluster ${cluster} -p ${pool} unmap ${image}
>>>>  }
>>>>
>>>>  stress_write_image()
>>>> ```
>>>>
>>>> That means this patchset is working well in mirroring data. There are some TODOs in comments, but most
>>>> of them are about performance improvement.
>>>>
>>>> So I think it's a good timing to ask for comments from all of you guys.
>>> Hi Dongsheng,
>>>
>>> First of all, thanks for taking this on!
>>>
>>> I'll skip over the minor technical issues and style nits and focus on
>>> the three high-level things that immediately jumped at me:
>> Great, that's what I want in the RFC round reviewing :)
>>> 1. Have you thought about the consequences of supporting just WRITE and
>>>    DISCARD journal entries in rbd_journal_replay()? The fact that the
>>>    kernel client would only ever emit those two types of entries doesn't
>>>    prevent someone from attempting to map an image with a bunch of
>>>    SNAP_CREATEs or other unsupported entries in the journal.
>>>
>>>    I think we should avoid replaying the journal in the kernel and try
>>>    to come up with a design where that is done by librbd in userspace.
>> In my opinion, we should prevent user to map an image with unsupported
>> events in
>> his journal. Maybe return error in start_replay() and fail the map
>> operation. But the start_replay()
>> is void now, I think we need an int return-value for it.
>>
>> And at the same times, we can output a message to suggest user to do
>> something before mapping,
>> such as rbd journal reset.
> "rbd journal reset" just discards all journal entries, doesn't it?
> Suggesting something like that would be pretty bad...

Actually, what we need to worry about is the uncommitted entries, in rbd journal reset command,
it will call start_replay() in librbd when open journal, so the all uncommitted entries will be replayed.

But I agree with not suggest something, but only error out with the problem of
"there are some unsupported events uncommitted."

>
>>> 2. You mention some performance-related TODOs, but I don't see anything
>>>    about extra data copies. rbd_journal_append_write_event() copies
>>>    the data payload four (!) times before handing it to the messenger:
>>>
>>>    - in rbd_journal_append_write_event(), from @bio to @data
>>>    - in rbd_journal_append_write_event(), from @data to @buf
>>>    - in ceph_journaler_object_append(), from @buf to another @buf
>>>      (this time vmalloc'ed?)
>>>    - in ceph_journaler_obj_write_sync(), from @buf to @pages
>> Yes, there are times of copies, and I really need put this thing in my
>> TODO list. thanx for you suggestion.
>>>    Since the data is required to be durable in the journal before any
>>>    writes to the image objects are issued, not even reference counting
>>>    is necessary, let alone copying. The messenger should be getting it
>>>    straight from the bio.
>> Let me think about how to avoid these copies.
>>> 3. There is a mutex held across network calls in journaler.c. This is
>>>    the case with some of the existing code as well (exclusive lock and
>>>    refresh) and it is a huge pain. Cancelling is almost impossible and
>>>    as I know you know this is what is blocking "rbd unmap -o full-force"
>>>    and rbd-level timeouts. No new such code will be accepted upstream.
>> Ah, yes, that's what I was concerning. BTW, Ilya, do you already have a
>> time-lines for rbd-level timeout feature?
> I made some headway on refactoring exclusive lock code, but it's
> nowhere close to completion yet.

okey, thanx for your information.

>
> Thanks,
>
> Ilya
>
diff --git a/qa/workunits/rbd/rbd_mirror_helpers.sh b/qa/workunits/rbd/rbd_mirror_helpers.sh
index e019de5..9d00d3e 100755
--- a/qa/workunits/rbd/rbd_mirror_helpers.sh
+++ b/qa/workunits/rbd/rbd_mirror_helpers.sh
@@ -854,9 +854,9 @@ write_image()
 
     test -n "${size}" || size=4096
 
-    rbd --cluster ${cluster} -p ${pool} bench ${image} --io-type write \
-        --io-size ${size} --io-threads 1 --io-total $((size * count)) \
-        --io-pattern rand
+    rbd --cluster ${cluster} -p ${pool} map ${image}
+    fio --name=test --rw=randwrite --bs=${size} --runtime=60 --ioengine=libaio --iodepth=1 --numjobs=1 --filename=/dev/rbd0 --direct=1 --group_reporting --size $((size * count)) --group_reporting --eta-newline
+    rbd --cluster ${cluster} -p ${pool} unmap ${image}
 }
 
 stress_write_image()