
mark rbd requiring stable pages

Message ID 201510151850.48348.ronny.hegewald@online.de (mailing list archive)
State New, archived

Commit Message

Ronny Hegewald Oct. 15, 2015, 6:50 p.m. UTC
rbd requires stable pages, as it performs a crc of the page data before it
is sent to the OSDs.

But since kernel 3.9 (patch 1d1d1a767206fbe5d4c69493b7e6d2a8d08cc0a0 "mm: only
enforce stable page writes if the backing device requires it") block devices
are no longer assumed to require stable pages.

This patch sets the necessary flag to get stable pages back for rbd.

In a ceph installation that provides multiple ext4-formatted rbd devices, "bad
crc" messages appeared regularly in the OSD logs before this patch (ca. 1
message every 1-2 minutes on every OSD that provided data for the rbd). After
this patch these messages are pretty much gone (only ca. 1-2 / month / OSD).

This patch also seems to fix data and filesystem corruptions on ext4-formatted
rbd devices that were previously seen on pretty much a daily basis. But it is
unknown at the moment why this is the case.

Signed-off-by: Ronny Hegewald <Ronny.Hegewald@online.de>

---

That the mentioned corruption issue is really affected by this patch, I could
verify through the system logs. Since installing this patch I have seen only
2-3 filesystem corruptions. But these could also just be leftovers of
corruptions that happened before the installation and were only noticed by
ext4 after the patched kernel was installed. This seems even more likely as
I have not seen a single data corruption issue since the patch.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Ilya Dryomov Oct. 16, 2015, 11:09 a.m. UTC | #1
On Thu, Oct 15, 2015 at 8:50 PM, Ronny Hegewald
<ronny.hegewald@online.de> wrote:
> rbd requires stable pages, as it performs a crc of the page data before they
> are send to the OSDs.
>
> But since kernel 3.9 (patch 1d1d1a767206fbe5d4c69493b7e6d2a8d08cc0a0 "mm: only
> enforce stable page writes if the backing device requires it") it is not
> assumed anymore that block devices require stable pages.
>
> This patch sets the necessary flag to get stable pages back for rbd.
>
> In a ceph installation that provides multiple ext4 formatted rbd devices "bad
> crc" messages appeared regularly (ca 1 message every 1-2 minutes on every OSD
> that provided the data for the rbd) in the OSD-logs before this patch. After
> this patch this messages are pretty much gone (only ca 1-2 / month / OSD).
>
> This patch seems also to fix data and filesystem corruptions on ext4 formatted
> rbd devices that were previously seen on pretty much a daily basis. But it is
> unknown at the moment why this is the case.
>
> Signed-off-by: Ronny Hegewald <Ronny.Hegewald@online.de>
>
> ---
>
> That the mentioned corruption issue is really affected through this patch
> i could verify through the system logs. Since installation of this patch i
> have seen only a 2-3 filesystem corruptions. But these could be also just
> leftovers of corruptions that happened before the installation but got noticed
> from ext4 only later after the patched kernel was installed. This seems even
> more likely as i have seen not a single data corruption issue since the patch.
>
> --- linux/drivers/block/rbd.c.org       2015-10-07 01:32:55.906666667 +0000
> +++ linux/drivers/block/rbd.c   2015-09-04 02:21:22.349999999 +0000
> @@ -3786,6 +3786,7 @@
>
>         blk_queue_merge_bvec(q, rbd_merge_bvec);
>         disk->queue = q;
> +       disk->queue->backing_dev_info.capabilities |= BDI_CAP_STABLE_WRITES;
>
>         q->queuedata = rbd_dev;

Hmm...  On the one hand, yes, we do compute CRCs, but that's optional,
so enabling this unconditionally is probably too harsh.  OTOH we are
talking to the network, which means all sorts of delays, retransmission
issues, etc, so I wonder how exactly "unstable" pages behave when, say,
added to an skb - you can't write anything to a page until networking
is fully done with it and expect it to work.  It's particularly
alarming that you've seen corruptions.

Currently the only users of this flag are block integrity stuff and
md-raid5, which makes me wonder what iscsi, nfs and others do in this
area.  There's an old ticket on this topic somewhere on the tracker, so
I'll need to research this.  Thanks for bringing this up!

                Ilya
Ilya Dryomov Oct. 21, 2015, 8:51 p.m. UTC | #2
On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
> Hmm...  On the one hand, yes, we do compute CRCs, but that's optional,
> so enabling this unconditionally is probably too harsh.  OTOH we are
> talking to the network, which means all sorts of delays, retransmission
> issues, etc, so I wonder how exactly "unstable" pages behave when, say,
> added to an skb - you can't write anything to a page until networking
> is fully done with it and expect it to work.  It's particularly
> alarming that you've seen corruptions.
>
> Currently the only users of this flag are block integrity stuff and
> md-raid5, which makes me wonder what iscsi, nfs and others do in this
> area.  There's an old ticket on this topic somewhere on the tracker, so
> I'll need to research this.  Thanks for bringing this up!

Hi Mike,

I was hoping to grab you for a few minutes, but you weren't there...

I spent a better part of today reading code and mailing lists on this
topic.  It is of course a bug that we use sendpage() which inlines
pages into an skb and do nothing to keep those pages stable.  We have
csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc
case is an obvious fix.
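A conditional version of the one-liner from the patch above might look something like this (a sketch only, not a compiled patch: the ceph_test_opt()/NOCRC test is an assumption about how "crcs enabled" would be queried; the real option name and API may differ):

```c
/* Sketch: force stable pages only when data crcs are in use.
 * ceph_test_opt()/NOCRC are assumed names, not verified API. */
if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC))
	q->backing_dev_info.capabilities |= BDI_CAP_STABLE_WRITES;
```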

I looked at drbd and iscsi and I think iscsi could do the same - ditch
the fallback to sock_no_sendpage() in the datadgst_en case (and get rid
of iscsi_sw_tcp_conn::sendpage member while at it).  Using stable pages
rather than having a roll-your-own implementation which doesn't close
the race but only narrows it sounds like a win, unless copying through
sendmsg() is for some reason cheaper than stable-waiting?

drbd still needs the non-zero-copy version for its async protocol for
when they free the pages before the NIC has a chance to put them on the
wire.  md-raid5 it turns out has an option to essentially disable most
of its stripe cache and so it sets BDI_CAP_STABLE_WRITES to compensate
if that option is enabled.

What I'm worried about is the !crc (!datadgst_en) case.  I'm failing to
convince myself that mucking with sendpage()ed pages while they sit in
the TCP queue (or anywhere in the networking stack, really), is safe -
there is nothing to prevent pages from being modified after sendpage()
returned and Ronny reports data corruptions that pretty much went away
with BDI_CAP_STABLE_WRITES set.  I may be, after prolonged staring at
this, starting to confuse fs with block, though.  How does that work in
iscsi land?

(There was/is also this [1] bug, which is kind of related and probably
worth looking into at some point later.  ceph shouldn't be that easily
affected - we've got state, but there is a ticket for it.)

[1] http://www.spinics.net/lists/linux-nfs/msg34913.html

Thanks,

                Ilya
Ilya Dryomov Oct. 21, 2015, 8:57 p.m. UTC | #3
On Wed, Oct 21, 2015 at 10:51 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
> On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
>> Hmm...  On the one hand, yes, we do compute CRCs, but that's optional,
>> so enabling this unconditionally is probably too harsh.  OTOH we are
>> talking to the network, which means all sorts of delays, retransmission
>> issues, etc, so I wonder how exactly "unstable" pages behave when, say,
>> added to an skb - you can't write anything to a page until networking
>> is fully done with it and expect it to work.  It's particularly
>> alarming that you've seen corruptions.
>>
>> Currently the only users of this flag are block integrity stuff and
>> md-raid5, which makes me wonder what iscsi, nfs and others do in this
>> area.  There's an old ticket on this topic somewhere on the tracker, so
>> I'll need to research this.  Thanks for bringing this up!
>
> Hi Mike,
>
> I was hoping to grab you for a few minutes, but you weren't there...
>
> I spent a better part of today reading code and mailing lists on this
> topic.  It is of course a bug that we use sendpage() which inlines
> pages into an skb and do nothing to keep those pages stable.  We have
> csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc
> case is an obvious fix.
>
> I looked at drbd and iscsi and I think iscsi could do the same - ditch
> the fallback to sock_no_sendpage() in the datadgst_en case (and get rid
> of iscsi_sw_tcp_conn::sendpage member while at it).  Using stable pages
> rather than having a roll-your-own implementation which doesn't close
> the race but only narrows it sounds like a win, unless copying through
> sendmsg() is for some reason cheaper than stable-waiting?
>
> drbd still needs the non-zero-copy version for its async protocol for
> when they free the pages before the NIC has chance to put them on the
> wire.  md-raid5 it turns out has an option to essentially disable most
> of its stripe cache and so it sets BDI_CAP_STABLE_WRITES to compensate
> if that option is enabled.
>
> What I'm worried about is the !crc (!datadgst_en) case.  I'm failing to
> convince myself that mucking with sendpage()ed pages while they sit in
> the TCP queue (or anywhere in the networking stack, really), is safe -
> there is nothing to prevent pages from being modified after sendpage()
> returned and Ronny reports data corruptions that pretty much went away
> with BDI_CAP_STABLE_WRITES set.  I may be, after prolonged staring at
> this, starting to confuse fs with block, though.  How does that work in
> iscsi land?
>
> (There was/is also this [1] bug, which is kind of related and probably
> worth looking into at some point later.  ceph shouldn't be that easily
> affected - we've got state, but there is a ticket for it.)
>
> [1] http://www.spinics.net/lists/linux-nfs/msg34913.html

And now with Mike on the CC and a mention that at least one scenario of
[1] got fixed in NFS by a6b31d18b02f ("SUNRPC: Fix a data corruption
issue when retransmitting RPC calls").

Thanks,

                Ilya
Mike Christie Oct. 22, 2015, 4:07 a.m. UTC | #4
On 10/21/2015 03:57 PM, Ilya Dryomov wrote:
> On Wed, Oct 21, 2015 at 10:51 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
>> On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
>>> Hmm...  On the one hand, yes, we do compute CRCs, but that's optional,
>>> so enabling this unconditionally is probably too harsh.  OTOH we are
>>> talking to the network, which means all sorts of delays, retransmission
>>> issues, etc, so I wonder how exactly "unstable" pages behave when, say,
>>> added to an skb - you can't write anything to a page until networking
>>> is fully done with it and expect it to work.  It's particularly
>>> alarming that you've seen corruptions.
>>>
>>> Currently the only users of this flag are block integrity stuff and
>>> md-raid5, which makes me wonder what iscsi, nfs and others do in this
>>> area.  There's an old ticket on this topic somewhere on the tracker, so
>>> I'll need to research this.  Thanks for bringing this up!
>>
>> Hi Mike,
>>
>> I was hoping to grab you for a few minutes, but you weren't there...
>>
>> I spent a better part of today reading code and mailing lists on this
>> topic.  It is of course a bug that we use sendpage() which inlines
>> pages into an skb and do nothing to keep those pages stable.  We have
>> csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc
>> case is an obvious fix.
>>
>> I looked at drbd and iscsi and I think iscsi could do the same - ditch
>> the fallback to sock_no_sendpage() in the datadgst_en case (and get rid
>> of iscsi_sw_tcp_conn::sendpage member while at it).  Using stable pages
>> rather than having a roll-your-own implementation which doesn't close
>> the race but only narrows it sounds like a win, unless copying through
>> sendmsg() is for some reason cheaper than stable-waiting?

Yeah, that is what I was saying on the call the other day, but the
reception was bad. We only have the sendmsg code path when digests are on,
because that code came before stable pages. When stable pages were
introduced, they were on by default but did not cover all the cases, so we
left the code in. It then handled most scenarios, but I just never got
around to removing the old code. However, it was set to off by default,
so I left it and made this patch for iscsi to turn on stable pages:

[this patch only enables stable pages when digests/crcs are on and did
not remove the code yet]
https://groups.google.com/forum/#!topic/open-iscsi/n4jvWK7BPYM

I did not really like the layering so I have not posted it for inclusion.



>>
>> drbd still needs the non-zero-copy version for its async protocol for
>> when they free the pages before the NIC has chance to put them on the
>> wire.  md-raid5 it turns out has an option to essentially disable most
>> of its stripe cache and so it sets BDI_CAP_STABLE_WRITES to compensate
>> if that option is enabled.
>>
>> What I'm worried about is the !crc (!datadgst_en) case.  I'm failing to
>> convince myself that mucking with sendpage()ed pages while they sit in
>> the TCP queue (or anywhere in the networking stack, really), is safe -
>> there is nothing to prevent pages from being modified after sendpage()
>> returned and Ronny reports data corruptions that pretty much went away
>> with BDI_CAP_STABLE_WRITES set.  I may be, after prolonged staring at
>> this, starting to confuse fs with block, though.  How does that work in
>> iscsi land?

This is what I was trying to ask about in the call the other day: where
is the corruption that Ronny was seeing? Was it checksum mismatches on
data being written, or incorrect metadata being written, etc.?

If we are just talking about if stable pages are not used, and someone
is re-writing data to a page after the page has already been submitted
to the block layer (I mean the page is on some bio which is on a request
which is on some request_queue scheduler list or basically anywhere in
the block layer), then I was saying this can occur with any block
driver. There is nothing that is preventing this from happening with a
FC driver or nvme or cciss or in dm or whatever. The app/user can
rewrite as late as when we are in the make_request_fn/request_fn.

I think I am misunderstanding your question because I thought this is
expected behavior, and there is nothing drivers can do if the app is not
doing a flush/sync between these types of write sequences.


>>
>> (There was/is also this [1] bug, which is kind of related and probably
>> worth looking into at some point later.  ceph shouldn't be that easily
>> affected - we've got state, but there is a ticket for it.)
>>
>> [1] http://www.spinics.net/lists/linux-nfs/msg34913.html
> 
> And now with Mike on the CC and a mention that at least one scenario of
> [1] got fixed in NFS by a6b31d18b02f ("SUNRPC: Fix a data corruption
> issue when retransmitting RPC calls").
> 

iSCSI handles timeouts/retries and sequence numbers/responses
differently so we are not affected. We go through some abort and
possibly reconnect process.

Ilya Dryomov Oct. 22, 2015, 11:20 a.m. UTC | #5
On Thu, Oct 22, 2015 at 6:07 AM, Mike Christie <michaelc@cs.wisc.edu> wrote:
> On 10/21/2015 03:57 PM, Ilya Dryomov wrote:
>> On Wed, Oct 21, 2015 at 10:51 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
>>> On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
>>>> Hmm...  On the one hand, yes, we do compute CRCs, but that's optional,
>>>> so enabling this unconditionally is probably too harsh.  OTOH we are
>>>> talking to the network, which means all sorts of delays, retransmission
>>>> issues, etc, so I wonder how exactly "unstable" pages behave when, say,
>>>> added to an skb - you can't write anything to a page until networking
>>>> is fully done with it and expect it to work.  It's particularly
>>>> alarming that you've seen corruptions.
>>>>
>>>> Currently the only users of this flag are block integrity stuff and
>>>> md-raid5, which makes me wonder what iscsi, nfs and others do in this
>>>> area.  There's an old ticket on this topic somewhere on the tracker, so
>>>> I'll need to research this.  Thanks for bringing this up!
>>>
>>> Hi Mike,
>>>
>>> I was hoping to grab you for a few minutes, but you weren't there...
>>>
>>> I spent a better part of today reading code and mailing lists on this
>>> topic.  It is of course a bug that we use sendpage() which inlines
>>> pages into an skb and do nothing to keep those pages stable.  We have
>>> csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc
>>> case is an obvious fix.
>>>
>>> I looked at drbd and iscsi and I think iscsi could do the same - ditch
>>> the fallback to sock_no_sendpage() in the datadgst_en case (and get rid
>>> of iscsi_sw_tcp_conn::sendpage member while at it).  Using stable pages
>>> rather than having a roll-your-own implementation which doesn't close
>>> the race but only narrows it sounds like a win, unless copying through
>>> sendmsg() is for some reason cheaper than stable-waiting?
>
> Yeah, that is what I was saying on the call the other day, but the
> reception was bad. We only have the sendmsg code path when digest are on
> because that code came before stable pages. When stable pages were
> created, it was on by default but did not cover all the cases, so we
> left the code. It then handled most scenarios, but I just never got
> around to removing old the code. However, it was set to off by default
> so I left it and made this patch for iscsi to turn on stable pages:
>
> [this patch only enabled stable pages when digests/crcs are on and dif
> not remove the code yet]
> https://groups.google.com/forum/#!topic/open-iscsi/n4jvWK7BPYM
>
> I did not really like the layering so I have not posted it for inclusion.

Good to know I got it right ;)

>
>
>
>>>
>>> drbd still needs the non-zero-copy version for its async protocol for
>>> when they free the pages before the NIC has chance to put them on the
>>> wire.  md-raid5 it turns out has an option to essentially disable most
>>> of its stripe cache and so it sets BDI_CAP_STABLE_WRITES to compensate
>>> if that option is enabled.
>>>
>>> What I'm worried about is the !crc (!datadgst_en) case.  I'm failing to
>>> convince myself that mucking with sendpage()ed pages while they sit in
>>> the TCP queue (or anywhere in the networking stack, really), is safe -
>>> there is nothing to prevent pages from being modified after sendpage()
>>> returned and Ronny reports data corruptions that pretty much went away
>>> with BDI_CAP_STABLE_WRITES set.  I may be, after prolonged staring at
>>> this, starting to confuse fs with block, though.  How does that work in
>>> iscsi land?
>
> This is what I was trying to ask about in the call the other day. Where
> is the corruption that Ronny was seeing. Was it checksum mismatches on
> data being written, or is incorrect meta data being written, etc?

Well, checksum mismatches are to be expected given what we are doing
now, but I wouldn't expect any data corruptions.  Ronny writes that he
saw frequent ext4 corruptions on krbd devices before he enabled stable
pages, which leads me to believe that the !crc case, for which we won't
be setting BDI_CAP_STABLE_WRITES, is going to be/remain broken.  Ronny,
could you describe it in more detail and maybe share some of those osd
logs with bad crc messages?

>
> If we are just talking about if stable pages are not used, and someone
> is re-writing data to a page after the page has already been submitted
> to the block layer (I mean the page is on some bio which is on a request
> which is on some request_queue scheduler list or basically anywhere in
> the block layer), then I was saying this can occur with any block
> driver. There is nothing that is preventing this from happening with a
> FC driver or nvme or cciss or in dm or whatever. The app/user can
> rewrite as late as when we are in the make_request_fn/request_fn.
>
> I think I am misunderstanding your question because I thought this is
> expected behavior, and there is nothing drivers can do if the app is not
> doing a flush/sync between these types of write sequences.

I don't see a problem with rewriting as late as when we are in
request_fn() (or in a wq after being put there by request_fn()).  Where
I thought there *might* be an issue is rewriting after sendpage(), if
sendpage() is used - perhaps some sneaky sequence similar to that
retransmit bug that would cause us to *transmit* incorrect bytes (as
opposed to *re*transmit) or something of that nature?

Another (and much more likely) explanation for the corruptions Ronny
was seeing is a bug in how rbd/libceph handle reconnects and revoke old
messages.  Given the number of crc errors ("1 message every 1-2 minutes
on every OSD"), each of which is a connection reset, the reconnect code
was being exercised pretty heavily...

>
>
>>>
>>> (There was/is also this [1] bug, which is kind of related and probably
>>> worth looking into at some point later.  ceph shouldn't be that easily
>>> affected - we've got state, but there is a ticket for it.)
>>>
>>> [1] http://www.spinics.net/lists/linux-nfs/msg34913.html
>>
>> And now with Mike on the CC and a mention that at least one scenario of
>> [1] got fixed in NFS by a6b31d18b02f ("SUNRPC: Fix a data corruption
>> issue when retransmitting RPC calls").
>>
>
> iSCSI handles timeouts/retries and sequence numbers/responses
> differently so we are not affected. We go through some abort and
> possibly reconnect process.

It's the same with ceph, but still something to look at, I guess.

Thanks,

                Ilya
Mike Christie Oct. 22, 2015, 3:37 p.m. UTC | #6
On 10/22/2015 06:20 AM, Ilya Dryomov wrote:
> 
>> >
>> > If we are just talking about if stable pages are not used, and someone
>> > is re-writing data to a page after the page has already been submitted
>> > to the block layer (I mean the page is on some bio which is on a request
>> > which is on some request_queue scheduler list or basically anywhere in
>> > the block layer), then I was saying this can occur with any block
>> > driver. There is nothing that is preventing this from happening with a
>> > FC driver or nvme or cciss or in dm or whatever. The app/user can
>> > rewrite as late as when we are in the make_request_fn/request_fn.
>> >
>> > I think I am misunderstanding your question because I thought this is
>> > expected behavior, and there is nothing drivers can do if the app is not
>> > doing a flush/sync between these types of write sequences.
> I don't see a problem with rewriting as late as when we are in
> request_fn() (or in a wq after being put there by request_fn()).  Where
> I thought there *might* be an issue is rewriting after sendpage(), if
> sendpage() is used - perhaps some sneaky sequence similar to that
> retransmit bug that would cause us to *transmit* incorrect bytes (as
> opposed to *re*transmit) or something of that nature?


Just to make sure we are on the same page.

Are you concerned about the tcp/net layer retransmitting due to it
detecting an issue as part of the tcp protocol, or are you concerned
about rbd/libceph initiating a retry like with the nfs issue?
Ilya Dryomov Oct. 22, 2015, 4:52 p.m. UTC | #7
On Thu, Oct 22, 2015 at 5:37 PM, Mike Christie <michaelc@cs.wisc.edu> wrote:
> On 10/22/2015 06:20 AM, Ilya Dryomov wrote:
>>
>>> >
>>> > If we are just talking about if stable pages are not used, and someone
>>> > is re-writing data to a page after the page has already been submitted
>>> > to the block layer (I mean the page is on some bio which is on a request
>>> > which is on some request_queue scheduler list or basically anywhere in
>>> > the block layer), then I was saying this can occur with any block
>>> > driver. There is nothing that is preventing this from happening with a
>>> > FC driver or nvme or cciss or in dm or whatever. The app/user can
>>> > rewrite as late as when we are in the make_request_fn/request_fn.
>>> >
>>> > I think I am misunderstanding your question because I thought this is
>>> > expected behavior, and there is nothing drivers can do if the app is not
>>> > doing a flush/sync between these types of write sequences.
>> I don't see a problem with rewriting as late as when we are in
>> request_fn() (or in a wq after being put there by request_fn()).  Where
>> I thought there *might* be an issue is rewriting after sendpage(), if
>> sendpage() is used - perhaps some sneaky sequence similar to that
>> retransmit bug that would cause us to *transmit* incorrect bytes (as
>> opposed to *re*transmit) or something of that nature?
>
>
> Just to make sure we are on the same page.
>
> Are you concerned about the tcp/net layer retransmitting due to it
> detecting a issue as part of the tcp protocol, or are you concerned
> about rbd/libceph initiating a retry like with the nfs issue?

The former, tcp/net layer.  I'm just conjecturing though.

(We don't have the nfs issue, because even if the client sends such
a retransmit (which it won't), the primary OSD will reject it as
a dup.)

Thanks,

                Ilya
Mike Christie Oct. 22, 2015, 5:22 p.m. UTC | #8
On 10/22/15, 11:52 AM, Ilya Dryomov wrote:
> On Thu, Oct 22, 2015 at 5:37 PM, Mike Christie <michaelc@cs.wisc.edu> wrote:
>> On 10/22/2015 06:20 AM, Ilya Dryomov wrote:
>>>
>>>>>
>>>>> If we are just talking about if stable pages are not used, and someone
>>>>> is re-writing data to a page after the page has already been submitted
>>>>> to the block layer (I mean the page is on some bio which is on a request
>>>>> which is on some request_queue scheduler list or basically anywhere in
>>>>> the block layer), then I was saying this can occur with any block
>>>>> driver. There is nothing that is preventing this from happening with a
>>>>> FC driver or nvme or cciss or in dm or whatever. The app/user can
>>>>> rewrite as late as when we are in the make_request_fn/request_fn.
>>>>>
>>>>> I think I am misunderstanding your question because I thought this is
>>>>> expected behavior, and there is nothing drivers can do if the app is not
>>>>> doing a flush/sync between these types of write sequences.
>>> I don't see a problem with rewriting as late as when we are in
>>> request_fn() (or in a wq after being put there by request_fn()).  Where
>>> I thought there *might* be an issue is rewriting after sendpage(), if
>>> sendpage() is used - perhaps some sneaky sequence similar to that
>>> retransmit bug that would cause us to *transmit* incorrect bytes (as
>>> opposed to *re*transmit) or something of that nature?
>>
>>
>> Just to make sure we are on the same page.
>>
>> Are you concerned about the tcp/net layer retransmitting due to it
>> detecting a issue as part of the tcp protocol, or are you concerned
>> about rbd/libceph initiating a retry like with the nfs issue?
>
> The former, tcp/net layer.  I'm just conjecturing though.
>

For iscsi, we normally use the sendpage path. Data digests are off by 
default and some distros do not even allow you to turn them on, so our 
sendpage path has got a lot of testing and we have not seen any 
corruptions. Not saying it is not possible, but just saying we have not 
seen any.

It could be due to a recent change. Ronny, tell us about the workload 
and I will check iscsi.

Oh yeah, for the tcp/net retransmission case, I had said offlist that I
thought there might be an issue with iscsi, but I guess I was wrong, so I
have not seen any issues with that either.

iSCSI just has that bug I mentioned offlist where we close the socket 
and fail commands upwards in the wrong order. That is an iscsi-specific
bug though.
Ilya Dryomov Oct. 22, 2015, 6:07 p.m. UTC | #9
On Thu, Oct 22, 2015 at 7:22 PM, Mike Christie <michaelc@cs.wisc.edu> wrote:
> On 10/22/15, 11:52 AM, Ilya Dryomov wrote:
>>
>> On Thu, Oct 22, 2015 at 5:37 PM, Mike Christie <michaelc@cs.wisc.edu>
>> wrote:
>>>
>>> On 10/22/2015 06:20 AM, Ilya Dryomov wrote:
>>>>
>>>>
>>>>>>
>>>>>> If we are just talking about if stable pages are not used, and someone
>>>>>> is re-writing data to a page after the page has already been submitted
>>>>>> to the block layer (I mean the page is on some bio which is on a
>>>>>> request
>>>>>> which is on some request_queue scheduler list or basically anywhere in
>>>>>> the block layer), then I was saying this can occur with any block
>>>>>> driver. There is nothing that is preventing this from happening with a
>>>>>> FC driver or nvme or cciss or in dm or whatever. The app/user can
>>>>>> rewrite as late as when we are in the make_request_fn/request_fn.
>>>>>>
>>>>>> I think I am misunderstanding your question because I thought this is
>>>>>> expected behavior, and there is nothing drivers can do if the app is
>>>>>> not
>>>>>> doing a flush/sync between these types of write sequences.
>>>>
>>>> I don't see a problem with rewriting as late as when we are in
>>>> request_fn() (or in a wq after being put there by request_fn()).  Where
>>>> I thought there *might* be an issue is rewriting after sendpage(), if
>>>> sendpage() is used - perhaps some sneaky sequence similar to that
>>>> retransmit bug that would cause us to *transmit* incorrect bytes (as
>>>> opposed to *re*transmit) or something of that nature?
>>>
>>>
>>>
>>> Just to make sure we are on the same page.
>>>
>>> Are you concerned about the tcp/net layer retransmitting due to it
>>> detecting a issue as part of the tcp protocol, or are you concerned
>>> about rbd/libceph initiating a retry like with the nfs issue?
>>
>>
>> The former, tcp/net layer.  I'm just conjecturing though.
>>
>
> For iscsi, we normally use the sendpage path. Data digests are off by
> default and some distros do not even allow you to turn them on, so our
> sendpage path has got a lot of testing and we have not seen any corruptions.
> Not saying it is not possible, but just saying we have not seen any.

Great, that's reassuring.

>
> It could be due to a recent change. Ronny, tell us about the workload and I
> will check iscsi.
>
> Oh yeah, for the tcp/net retransmission case, I had said offlist, I thought
> there might be an issue with iscsi but I guess I was wrong, so I have not
> seen any issues with that either.

I'll drop my concerns then.  Those corruptions could be a bug in ceph
reconnect code or something else - regardless, that's separate from the
issue at hand.

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ronny Hegewald Oct. 22, 2015, 11:56 p.m. UTC | #10
On Thursday 22 October 2015, Ilya Dryomov wrote:
> Well, checksum mismatches are to be expected given what we are doing
> now, but I wouldn't expect any data corruptions.  Ronny writes that he
> saw frequent ext4 corruptions on krbd devices before he enabled stable
> pages, which leads me to believe that the !crc case, for which we won't
> be setting BDI_CAP_STABLE_WRITES, is going to be/remain broken.  Ronny,
> could you describe it in more detail and maybe share some of those osd
> logs with bad crc messages?
> 
This is from a 10-minute period on one of the OSDs.

23:11:02.423728 ce5dfb70  0 bad crc in data 1657725429 != exp 496797267
23:11:37.586411 ce5dfb70  0 bad crc in data 1216602498 != exp 1888811161
23:12:07.805675 cc3ffb70  0 bad crc in data 3140625666 != exp 2614069504
23:12:44.485713 c96ffb70  0 bad crc in data 1712148977 != exp 3239079328
23:13:24.746217 ce5dfb70  0 bad crc in data 144620426 != exp 3156694286
23:13:52.792367 ce5dfb70  0 bad crc in data 4033880920 != exp 4159672481
23:14:22.958999 c96ffb70  0 bad crc in data 847688321 != exp 1551499144
23:16:35.015629 ce5dfb70  0 bad crc in data 2790209714 != exp 3779604715
23:17:48.482049 c96ffb70  0 bad crc in data 1563466764 != exp 528198494
23:19:28.925357 cc3ffb70  0 bad crc in data 1764275395 != exp 2075504274
23:19:59.039843 cc3ffb70  0 bad crc in data 2960172683 != exp 1215950691

The filesystem corruptions usually announce themselves with messages like

EXT4-fs error (device rbd4): ext4_mb_generate_buddy:757: group 155, block 
bitmap and bg descriptor inconsistent: 23625 vs 23660 free clusters

These were pretty common, at least every other day, often multiple times a 
day.

Sometimes there was an additional

JBD2: Spotted dirty metadata buffer (dev = rbd4, blocknr = 0). There's a risk 
of filesystem corruption in case of system crash.

Another type of filesystem corruption I experienced during kernel
compilations, which led to the following messages.

EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #282221) - 
no `.' or `..'
EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #273062) - 
no `.' or `..'
EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #272270) - 
no `.' or `..'
EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #282254) - 
no `.' or `..'
EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #273070) - 
no `.' or `..'
EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #272308) - 
no `.' or `..'
EXT4-fs error (device rbd3): ext4_lookup:1417: inode #270033: comm rm: deleted 
inode referenced: 270039
last message repeated 2 times
EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #271534) - 
no `.' or `..'
EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #271275) - 
no `.' or `..'
EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #282290) - 
no `.' or `..'
EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #281914) - 
no `.' or `..'
EXT4-fs error (device rbd3): ext4_lookup:1417: inode #270033: comm rm: deleted 
inode referenced: 270039
last message repeated 2 times
kernel: EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: 
deleted inode referenced: 282221
EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: deleted 
inode referenced: 282221
EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: deleted 
inode referenced: 281914
EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: deleted 
inode referenced: 281914
EXT4-fs error: 243 callbacks suppressed 
EXT4-fs error (device rbd3): ext4_lookup:1417: inode #282002: comm cp: deleted 
inode referenced: 45375
kernel: EXT4-fs error (device rbd3): ext4_lookup:1417: inode #282002: comm cp: 
deleted inode referenced: 45371

The result was that various files and directories in the kernel source dir
couldn't be accessed anymore, and even fsck couldn't repair them, so I
finally had to delete the tree. But such cases were pretty rare.

Another issue was data corruption within the files themselves, which
happened independently of the filesystem corruption. It occurred on most
days, sometimes only once, sometimes multiple times a day.

Newly written files that contained corrupted data always seemed to have it
in only one place. The corrupt data replaced the original data of the file
but never changed the file size. The position of the corruption within the
files was always different.

The interesting part is that the corrupted regions always followed the same
pattern: first a few hundred 0x0 bytes, then a few KB (10-30) of random
binary data, finished again with a few hundred bytes of 0x0.

In a few cases I could trace this data back to another file that was read
at the same time by the same program. But that might be coincidental,
because other corruptions that happened in the same scenario couldn't be
traced back this way.

In other cases the corrupt data originated from files that had been deleted
recently (a few minutes earlier).

Ronny Hegewald Oct. 23, 2015, 12:52 a.m. UTC | #11
On Thursday 22 October 2015, you wrote:
> It could be due to a recent change. Ronny, tell us about the workload
> and I will check iscsi.
 
I guess the best test case is a kernel compilation in a "make clean;
make -j<N>" (N > 1) loop. The data corruption usually happens in the
generated .cmd files, which breaks the build immediately and makes the
corruption easy to spot.

Besides that I have seen data corruption in other simple circumstances:
copying data from a non-rbd device to an rbd device, from one rbd device to
another, or scp'ing data from another machine to the rbd.

Also, I have mounted the rbds on the same machines I'm running the OSDs on,
which might be a contributing factor.

Unfortunately there seems to be nothing that increases the likelihood of
the corruption happening; I tried all kinds of things with no success.

Another contributing factor might have been the amount of free memory.
Before I added the flag for stable pages I regularly had warnings like the
one below. Since enabling stable pages for rbd these warnings are gone too.


kernel: swapper/1: page allocation failure: order:0, mode:0x20
kernel:  0000000000000000 ffff88012fc83b68 ffffffff8143f171 0000000000000000
kernel:  0000000000000020 ffff88012fc83bf8 ffffffff81127fda ffff88012fff9838
kernel:  ffff880109bc7100 01ff88012fc83be8 ffffffff8164aa40 0000002000000000
kernel: Call Trace:
kernel:  <IRQ>  [<ffffffff8143f171>] dump_stack+0x48/0x5f
kernel:  [<ffffffff81127fda>] warn_alloc_failed+0xea/0x130
kernel:  [<ffffffff8112918a>] __alloc_pages_nodemask+0x69a/0x910
kernel:  [<ffffffffa04ad060>] ? br_handle_frame_finish+0x500/0x500 [bridge]
kernel:  [<ffffffff81162827>] alloc_pages_current+0xa7/0x170
kernel:  [<ffffffffa03dbc4c>] atl1c_alloc_rx_buffer+0x36c/0x430 [atl1c]
kernel:  [<ffffffffa03ddc52>] atl1c_clean+0x212/0x3b0 [atl1c]
kernel:  [<ffffffff813a6fcf>] net_rx_action+0x15f/0x320
kernel:  [<ffffffff81069383>] __do_softirq+0x123/0x2e0
kernel:  [<ffffffff81069626>] irq_exit+0x96/0xc0
kernel:  [<ffffffff81446575>] do_IRQ+0x65/0x110
kernel:  [<ffffffff81444532>] common_interrupt+0x72/0x72
kernel:  <EOI>  [<ffffffff814445a4>] ? retint_restore_args+0x13/0x13
kernel:  [<ffffffff8101f4a2>] ? mwait_idle+0x72/0xb0
kernel:  [<ffffffff8101f499>] ? mwait_idle+0x69/0xb0
kernel:  [<ffffffff8101f24f>] arch_cpu_idle+0xf/0x20
kernel:  [<ffffffff8109ebeb>] cpu_startup_entry+0x22b/0x3e0
kernel:  [<ffffffff81047996>] start_secondary+0x156/0x180
kernel: Mem-Info:
kernel: Node 0 DMA per-cpu:
kernel: CPU    0: hi:    0, btch:   1 usd:   0
kernel: CPU    1: hi:    0, btch:   1 usd:   0
kernel: CPU    2: hi:    0, btch:   1 usd:   0
kernel: CPU    3: hi:    0, btch:   1 usd:   0
kernel: Node 0 DMA32 per-cpu:
kernel: CPU    0: hi:  186, btch:  31 usd: 182
kernel: CPU    1: hi:  186, btch:  31 usd: 179
kernel: CPU    2: hi:  186, btch:  31 usd: 156
kernel: CPU    3: hi:  186, btch:  31 usd: 170
kernel: Node 0 Normal per-cpu:
kernel: CPU    0: hi:  186, btch:  31 usd: 138
kernel: CPU    1: hi:  186, btch:  31 usd: 130
kernel: CPU    2: hi:  186, btch:  31 usd:  73
kernel: CPU    3: hi:  186, btch:  31 usd: 122
kernel: active_anon:499711 inactive_anon:128139 isolated_anon:0
kernel:  active_file:132181 inactive_file:145093 isolated_file:22
kernel:  unevictable:4083 dirty:1526 writeback:15597 unstable:0
kernel:  free:5225 slab_reclaimable:23735 slab_unreclaimable:29775
kernel:  mapped:11742 shmem:18846 pagetables:3946 bounce:0
kernel:  free_cma:0
kernel: Node 0 DMA free:15284kB min:32kB low:40kB high:48kB active_anon:0kB 
inactive_anon:96kB active_file:232kB inactive_file:80kB unevictable:0kB 
isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB 
mlocked:0kB dirty:0kB writeback:0kB mapped:12kB shmem:0kB 
slab_reclaimable:52kB slab_unreclaimable:80kB kernel_stack:16kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:88 
all_unreclaimable? no
kernel: lowmem_reserve[]: 0 3107 3818 3818
kernel: Node 0 DMA32 free:5064kB min:6420kB low:8024kB high:9628kB 
active_anon:1718524kB inactive_anon:365504kB active_file:418964kB 
inactive_file:469748kB unevictable:0kB isolated(anon):0kB isolated(file):88kB 
present:3257216kB managed:3183616kB mlocked:0kB dirty:5900kB writeback:48264kB 
mapped:39204kB shmem:54364kB slab_reclaimable:76256kB 
slab_unreclaimable:93456kB kernel_stack:6240kB pagetables:12280kB unstable:0kB 
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? 
no
kernel: lowmem_reserve[]: 0 0 710 710
kernel: Node 0 Normal free:552kB min:1468kB low:1832kB high:2200kB 
active_anon:280320kB inactive_anon:146956kB active_file:109528kB 
inactive_file:110544kB unevictable:16332kB isolated(anon):0kB 
isolated(file):0kB present:786432kB managed:728012kB mlocked:0kB dirty:204kB 
writeback:14124kB mapped:7752kB shmem:21020kB slab_reclaimable:18632kB 
slab_unreclaimable:25564kB kernel_stack:2432kB pagetables:3504kB unstable:0kB 
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:608 all_unreclaimable? 
no
kernel: lowmem_reserve[]: 0 0 0 0
kernel: Node 0 DMA: 4*4kB (UE) 4*8kB (UEM) 2*16kB (UE) 5*32kB (UEM) 3*64kB 
(UM) 2*128kB (UE) 1*256kB (E) 2*512kB (EM) 3*1024kB (UEM) 3*2048kB (UEM) 
1*4096kB (R) = 15280kB
kernel: Node 0 DMA32: 0*4kB 1*8kB (R) 0*16kB 0*32kB 1*64kB (R) 1*128kB (R) 
1*256kB (R) 3*512kB (R) 1*1024kB (R) 1*2048kB (R) 0*4096kB = 5064kB
kernel: Node 0 Normal: 0*4kB 1*8kB (R) 0*16kB 1*32kB (R) 1*64kB (R) 1*128kB 
(R) 1*256kB (R) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 488kB
kernel: 308023 total pagecache pages
kernel: 7793 pages in swap cache
kernel: Swap cache stats: add 320089, delete 312296, find 144121/183225
kernel: Free swap  = 1728464kB
kernel: Total swap = 2052092kB
kernel: 1014910 pages RAM
kernel: 0 pages HighMem/MovableOnly
kernel: 33026 pages reserved
 

Ilya Dryomov Oct. 23, 2015, 8:32 a.m. UTC | #12
On Fri, Oct 23, 2015 at 1:56 AM, Ronny Hegewald
<ronny.hegewald@online.de> wrote:
> On Thursday 22 October 2015, Ilya Dryomov wrote:
>> Well, checksum mismatches are to be expected given what we are doing
>> now, but I wouldn't expect any data corruptions.  Ronny writes that he
>> saw frequent ext4 corruptions on krbd devices before he enabled stable
>> pages, which leads me to believe that the !crc case, for which we won't
>> be setting BDI_CAP_STABLE_WRITES, is going to be/remain broken.  Ronny,
>> could you describe it in more detail and maybe share some of those osd
>> logs with bad crc messages?
>>
> This is from a 10-minute period on one of the OSDs.
>
> 23:11:02.423728 ce5dfb70  0 bad crc in data 1657725429 != exp 496797267
> 23:11:37.586411 ce5dfb70  0 bad crc in data 1216602498 != exp 1888811161
> 23:12:07.805675 cc3ffb70  0 bad crc in data 3140625666 != exp 2614069504
> 23:12:44.485713 c96ffb70  0 bad crc in data 1712148977 != exp 3239079328
> 23:13:24.746217 ce5dfb70  0 bad crc in data 144620426 != exp 3156694286
> 23:13:52.792367 ce5dfb70  0 bad crc in data 4033880920 != exp 4159672481
> 23:14:22.958999 c96ffb70  0 bad crc in data 847688321 != exp 1551499144
> 23:16:35.015629 ce5dfb70  0 bad crc in data 2790209714 != exp 3779604715
> 23:17:48.482049 c96ffb70  0 bad crc in data 1563466764 != exp 528198494
> 23:19:28.925357 cc3ffb70  0 bad crc in data 1764275395 != exp 2075504274
> 23:19:59.039843 cc3ffb70  0 bad crc in data 2960172683 != exp 1215950691

Could you share the entire log snippet for those 10 minutes?

>
> The filesystem corruptions usually announce themselves with messages like
>
> EXT4-fs error (device rbd4): ext4_mb_generate_buddy:757: group 155, block
> bitmap and bg descriptor inconsistent: 23625 vs 23660 free clusters
>
> These were pretty common, at least every other day, often multiple times a
> day.
>
> Sometimes there was an additional
>
> JBD2: Spotted dirty metadata buffer (dev = rbd4, blocknr = 0). There's a risk
> of filesystem corruption in case of system crash.
>
> Another type of filesystem corruption I experienced during kernel
> compilations, which led to the following messages.
>
> EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #282221) -
> no `.' or `..'
> EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #273062) -
> no `.' or `..'
> EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #272270) -
> no `.' or `..'
> EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #282254) -
> no `.' or `..'
> EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #273070) -
> no `.' or `..'
> EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #272308) -
> no `.' or `..'
> EXT4-fs error (device rbd3): ext4_lookup:1417: inode #270033: comm rm: deleted
> inode referenced: 270039
> last message repeated 2 times
> EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #271534) -
> no `.' or `..'
> EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #271275) -
> no `.' or `..'
> EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #282290) -
> no `.' or `..'
> EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #281914) -
> no `.' or `..'
> EXT4-fs error (device rbd3): ext4_lookup:1417: inode #270033: comm rm: deleted
> inode referenced: 270039
> last message repeated 2 times
> kernel: EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm:
> deleted inode referenced: 282221
> EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: deleted
> inode referenced: 282221
> EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: deleted
> inode referenced: 281914
> EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: deleted
> inode referenced: 281914
> EXT4-fs error: 243 callbacks suppressed
> EXT4-fs error (device rbd3): ext4_lookup:1417: inode #282002: comm cp: deleted
> inode referenced: 45375
> kernel: EXT4-fs error (device rbd3): ext4_lookup:1417: inode #282002: comm cp:
> deleted inode referenced: 45371
>
> The result was that various files and directories in the kernel source dir
> couldn't be accessed anymore, and even fsck couldn't repair them, so I
> finally had to delete the tree. But such cases were pretty rare.
>
> Another issue was data corruption within the files themselves, which
> happened independently of the filesystem corruption. It occurred on most
> days, sometimes only once, sometimes multiple times a day.
>
> Newly written files that contained corrupted data always seemed to have it
> in only one place. The corrupt data replaced the original data of the file
> but never changed the file size. The position of the corruption within the
> files was always different.
>
> The interesting part is that the corrupted regions always followed the
> same pattern: first a few hundred 0x0 bytes, then a few KB (10-30) of
> random binary data, finished again with a few hundred bytes of 0x0.
>
> In a few cases I could trace this data back to another file that was read
> at the same time by the same program. But that might be coincidental,
> because other corruptions that happened in the same scenario couldn't be
> traced back this way.
>
> In other cases the corrupt data originated from files that had been
> deleted recently (a few minutes earlier).

Which kernel was this on?

You really should have reported all of this as soon as you hit it - it
sounds like you've been dealing with this issue for a while.  There's
definitely more than stable pages in play here, I'll look into it.

Thanks,

                Ilya
Ronny Hegewald Oct. 23, 2015, 7 p.m. UTC | #13
> Could you share the entire log snippet for those 10 minutes?
 
That's all there is in the logs. But if more information would be useful,
tell me which logs to activate and I will give it another run. At least
this part is easy to reproduce.

> Which kernel was this on?

The latest kernel I used that produced the corruption was 3.19.8.
The earliest one was 3.11.

> You really should have reported all of this as soon as you hit it - it
> sounds like you've been dealing with this issue for a while. There's
> definitely more than stable pages in play here, I'll look into it.

I didn't report it as I didn't have anything conclusive to report. There
were a lot of other things that could have caused this in my configuration,
especially as there were no other reports of this kind. In the meantime
I have checked all of these variables and could verify that they don't
change anything.

Ilya Dryomov Oct. 23, 2015, 7:06 p.m. UTC | #14
On Fri, Oct 23, 2015 at 9:00 PM, ronny.hegewald@online.de
<ronny.hegewald@online.de> wrote:
>> Could you share the entire log snippet for those 10 minutes?
>
> Thats all in the logs.  But if more information would be useful tell me which logs
> to activate and i will give it another run. At least this part is easy to reproduce.
>
>> Which kernel was this on?
>
> The latest kernel i used which produced the corruption was 3.19.8.
> The earliest one was 3.11.

No need for now, I'll poke around and report back.

Thanks,

                Ilya
Ilya Dryomov Oct. 30, 2015, 11:36 a.m. UTC | #15
On Fri, Oct 23, 2015 at 9:06 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
> On Fri, Oct 23, 2015 at 9:00 PM, ronny.hegewald@online.de
> <ronny.hegewald@online.de> wrote:
>>> Could you share the entire log snippet for those 10 minutes?
>>
>> Thats all in the logs.  But if more information would be useful tell me which logs
>> to activate and i will give it another run. At least this part is easy to reproduce.
>>
>>> Which kernel was this on?
>>
>> The latest kernel i used which produced the corruption was 3.19.8.
>> The earliest one was 3.11.
>
> No need for now, I'll poke around and report back.

So the "bad crc" errors are of course easily reproducible, but
I haven't managed to reproduce ext4 corruptions.  I amended your patch
to only require stable pages in case we actually compute checksums, see
https://github.com/ceph/ceph-client/commit/4febcceb866822c1a1aee2836c9c810e3ef29bbf.

Any other data points you can share?  Can you describe your cluster
(boxes, OSDs, clients, rbds mapped - where, how many, ext4 mkfs and
mount options, etc) in more detail?  Is there anything special about
your setup that you can think of?

You've mentioned that the best test case in your experience is kernel
compilation.  What .config are you using, how many threads (make -jX)
and how long does it take to build a kernel with that .config and that
number of threads?  You have more than one rbd device mapped on the
same box - how many exactly, do you put any load on the rest while the
kernel is compiling on one of them?  What about rbd devices mapped on
other boxes?

You get the idea - every bit counts.

Thanks,

                Ilya
diff mbox

Patch

--- linux/drivers/block/rbd.c.org	2015-10-07 01:32:55.906666667 +0000
+++ linux/drivers/block/rbd.c	2015-09-04 02:21:22.349999999 +0000
@@ -3786,6 +3786,7 @@ 
 
 	blk_queue_merge_bvec(q, rbd_merge_bvec);
 	disk->queue = q;
+	disk->queue->backing_dev_info.capabilities |= BDI_CAP_STABLE_WRITES;
 
 	q->queuedata = rbd_dev;