Kernel 3.0.0 + ext4 + ceph == ...

Message ID Pine.LNX.4.64.1107310951550.2348@cobra.newdream.net (mailing list archive)
State New, archived

Commit Message

Sage Weil July 31, 2011, 5:04 p.m. UTC
On Sun, 31 Jul 2011, Fyodor Ustinov wrote:
> On 07/31/2011 07:54 AM, Sage Weil wrote:
> > On Sat, 30 Jul 2011, Ted Ts'o wrote:
> > > On Sat, Jul 30, 2011 at 10:21:13AM -0700, Sage Weil wrote:
> > > > We do use xattrs extensively, though; that was the last extN bug we
> > > > uncovered.  That's where my money is.
> > > Hmm, yes.  That could very well be.  How big are the xattrs, and are
> > > there cases where you:
> > > 
> > > a) start with a small xattr (where the total size is less than 128
> > > bytes, so it can be stored in the inode table), and then increase it
> > > something where it needs to be stored in an external block?
> > > 
> > > b) start with enough xattrs so it's large, and then delete all or most
> > > of them?
> > > 
> > > I could easily believe we might have some bugs as we transition from
> > > in-inode to external block storage, or vice versa.  I'll take a look
> > > at the code and try to create some reproduction cases, but if you
> > > could give me a handle on workload patterns of ceph around xattrs,
> > > that would be interesting.
> > I would guess a, but it could also be a+b.
> > 
> > Fyodor, can you take some of the corrupt inos that fsck complained about
> > and see what files/directories they are?  find /osd.0 -inum NNN.  (I'm
> > guessing the largest xattrs are on the collection directories, like
> > /osd.0/current/something_head/.)  Then grep that filename out of the log
> > to see exactly which operations took place.  The setattr log normally
> > includes xattr size.
> 
> 
> /etc/init.d/ceph stop
> umount /mnt/osd.0
> mke2fs -t ext4 -I 128 /dev/sdc1
> tune2fs -o journal_data_writeback /dev/sdc1
> mount -a
> mon getmap -o /tmp/monmap
> cosd --mkfs -i 0 --monmap /tmp/monmap
> /etc/init.d/ceph start
> sleep 300
> /etc/init.d/ceph stop
> umount /osd.0
> fsck.ext4 -f /dev/sdc1
> 
> Inode 99356878, i_blocks is 8208, should be 8200.
> 
> mount -a
> root@osd0:~# find /osd.0 -inum 99356878
> 
> /osd.0/current/0.2a4_head/10000000468.0000007e_head
> 
> root@osd0:~# grep "10000000468\.0000007e" /var/log/ceph/osd.0.log
> 2011-07-31 09:57:20.859834 7f624c82a700 filestore(/osd.0) remove
> temp/10000000468.0000007e/head = -1
> 2011-07-31 09:57:20.861166 7f624c82a700 filestore(/osd.0) write
> temp/10000000468.0000007e/head 0~1048576 = 1048576
> 2011-07-31 09:57:20.990464 7f624c029700 filestore(/osd.0) write
> temp/10000000468.0000007e/head 1048576~1048576 = 1048576
> 2011-07-31 09:57:21.121648 7f624c029700 filestore(/osd.0) write
> temp/10000000468.0000007e/head 2097152~1048576 = 1048576
> 2011-07-31 09:57:21.265879 7f624c029700 filestore(/osd.0) write
> temp/10000000468.0000007e/head 3145728~1048576 = 1048576
> 2011-07-31 09:57:21.265952 7f624c029700 filestore(/osd.0) remove
> 0.2a4_head/10000000468.0000007e/head = -1
> 2011-07-31 09:57:21.265995 7f624c029700 filestore(/osd.0) collection_add
> 0.2a4_head/10000000468.0000007e/head temp/10000000468.0000007e/head = 0
> 2011-07-31 09:57:21.266025 7f624c029700 filestore(/osd.0) collection_remove
> temp/10000000468.0000007e/head = 0
> 2011-07-31 09:57:21.266134 7f624c029700 filestore(/osd.0) setattrs
> 0.2a4_head/10000000468.0000007e/head = 26

Hrm, I was hoping it wouldn't be a setattrs call.  The patch below will
tell us which xattrs it is setting, but sadly you'll need to reproduce the
whole thing again.  The log above only tells us the size of the last xattr
of the set.

Ted, I'm assuming the i_blocks mismatch is likely on the same files that 
make the transition from/to in-inode xattrs?
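
For reading the e2fsck message quoted above, here is a minimal sketch. It
assumes i_blocks is counted in 512-byte units and that the filesystem uses
4 KiB blocks (the block size is not stated in this thread). The helper name
iblocks_delta is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read e2fsck "i_blocks is X, should be Y" lines on
# stdin and print "<inode> <difference in filesystem blocks>".
# Assumptions: i_blocks is in 512-byte units; filesystem block size is 4096.
iblocks_delta() {
    sed -n 's/^Inode \([0-9]*\), i_blocks is \([0-9]*\), should be \([0-9]*\)\..*/\1 \2 \3/p' |
    while read -r ino have want; do
        # Positive: the inode claims more blocks than e2fsck found.
        echo "$ino $(( (have - want) * 512 / 4096 ))"
    done
}
```

Fed the line "Inode 99356878, i_blocks is 8208, should be 8200." this prints
"99356878 1", i.e. the inode is off by one block under those assumptions.
The four 1 MiB writes account for 8192 units, so 8200 would be consistent
with the data plus one more 4 KiB block, possibly an external xattr block.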

sage




--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Fyodor Ustinov July 31, 2011, 5:32 p.m. UTC | #1
On 07/31/2011 08:04 PM, Sage Weil wrote:
>
> Hrm, I was hoping it wouldn't be a setattrs call.  The below will tell us
> what xattrs it is setting, but sadly you'll need to reproduce the whole
> thing again.  The above is only telling us the size of the last xattr of
> the set.

OK, I'll do it within an hour or two.

WBR,
Fyodor.

P.S. muttered under his breath: "? ??? ??? 15 ??? ?? ???????????. ? 
??????? ????????? ?????????????. ?? ?? ??? ??? ??? ??? ??????"

Fyodor Ustinov July 31, 2011, 8:16 p.m. UTC | #2
On 07/31/2011 08:04 PM, Sage Weil wrote:
> dout(10)<<  "setattrs "<<  cid<<  "/"<<  oid<<  " '"<<  p->first<<  "' len "<<  p->second.length()<<  " = "<<  r<<  dendl;
Well.

root@osd0:~# grep "10000000483\.000005d6" /var/log/ceph/osd.0.log
2011-07-31 23:06:49.172838 7f23c048c700 filestore(/osd.0) remove 
temp/10000000483.000005d6/head = -1
2011-07-31 23:06:49.173941 7f23c048c700 filestore(/osd.0) write 
temp/10000000483.000005d6/head 0~1048576 = 1048576
2011-07-31 23:06:49.900837 7f23c048c700 filestore(/osd.0) write 
temp/10000000483.000005d6/head 1048576~1048576 = 1048576
2011-07-31 23:06:49.985600 7f23c048c700 filestore(/osd.0) write 
temp/10000000483.000005d6/head 2097152~1048576 = 1048576
2011-07-31 23:06:50.114185 7f23c048c700 filestore(/osd.0) write 
temp/10000000483.000005d6/head 3145728~1048576 = 1048576
2011-07-31 23:06:50.114239 7f23c048c700 filestore(/osd.0) remove 
0.69_head/10000000483.000005d6/head = -1
2011-07-31 23:06:50.114287 7f23c048c700 filestore(/osd.0) collection_add 
0.69_head/10000000483.000005d6/head temp/10000000483.000005d6/head = 0
2011-07-31 23:06:50.114316 7f23c048c700 filestore(/osd.0) 
collection_remove temp/10000000483.000005d6/head = 0
2011-07-31 23:06:50.114384 7f23c048c700 filestore(/osd.0) setattrs 
0.69_head/10000000483.000005d6/head '_' len 161 = 161
2011-07-31 23:06:50.114406 7f23c048c700 filestore(/osd.0) setattrs 
0.69_head/10000000483.000005d6/head 'snapset' len 26 = 26
2011-07-31 23:06:50.114413 7f23c048c700 filestore(/osd.0) setattrs 
0.69_head/10000000483.000005d6/head = 26
root@osd0:~#
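
To pull just the attribute names and lengths out of log lines like the ones
above, something along these lines could be used. It is a sketch that
assumes each log entry is on a single line (the mail client wrapped them
here) and that the field layout matches the patched dout() format. The
function name xattr_sizes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read filestore log lines on stdin and print
# "<xattr name> <length>" for each per-attribute setattrs entry.
# The trailing summary line (no name, no "len") is skipped.
xattr_sizes() {
    awk -v q="'" '
        $5 == "setattrs" && $8 == "len" {
            name = $7
            gsub(q, "", name)   # strip the single quotes around the name
            print name, $9
        }'
}
```

Usage: grep "10000000483\.000005d6" /var/log/ceph/osd.0.log | xattr_sizes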


WBR,
     Fyodor.

Sage Weil July 31, 2011, 8:42 p.m. UTC | #3
On Sun, 31 Jul 2011, Fyodor Ustinov wrote:
> On 07/31/2011 08:04 PM, Sage Weil wrote:
> > dout(10)<<  "setattrs "<<  cid<<  "/"<<  oid<<  " '"<<  p->first<<  "' len
> > "<<  p->second.length()<<  " = "<<  r<<  dendl;
> Well.

Thanks!  Just to clarify a few things:
 
> root@osd0:~# grep "10000000483\.000005d6" /var/log/ceph/osd.0.log
> 2011-07-31 23:06:49.172838 7f23c048c700 filestore(/osd.0) remove
> temp/10000000483.000005d6/head = -1
       ^ The actual filename is 10000000483.000005d6_head; ignore that
last /.

> 2011-07-31 23:06:49.173941 7f23c048c700 filestore(/osd.0) write
> temp/10000000483.000005d6/head 0~1048576 = 1048576
> 2011-07-31 23:06:49.900837 7f23c048c700 filestore(/osd.0) write
> temp/10000000483.000005d6/head 1048576~1048576 = 1048576
> 2011-07-31 23:06:49.985600 7f23c048c700 filestore(/osd.0) write
> temp/10000000483.000005d6/head 2097152~1048576 = 1048576
> 2011-07-31 23:06:50.114185 7f23c048c700 filestore(/osd.0) write
> temp/10000000483.000005d6/head 3145728~1048576 = 1048576
> 2011-07-31 23:06:50.114239 7f23c048c700 filestore(/osd.0) remove
> 0.69_head/10000000483.000005d6/head = -1
> 2011-07-31 23:06:50.114287 7f23c048c700 filestore(/osd.0) collection_add
> 0.69_head/10000000483.000005d6/head temp/10000000483.000005d6/head = 0

This is link(2)

> 2011-07-31 23:06:50.114316 7f23c048c700 filestore(/osd.0) collection_remove
> temp/10000000483.000005d6/head = 0

This is unlink(2)

> 2011-07-31 23:06:50.114384 7f23c048c700 filestore(/osd.0) setattrs
> 0.69_head/10000000483.000005d6/head '_' len 161 = 161

The xattr names are prefixed with 'user.ceph.'

> 2011-07-31 23:06:50.114406 7f23c048c700 filestore(/osd.0) setattrs
> 0.69_head/10000000483.000005d6/head 'snapset' len 26 = 26
> 2011-07-31 23:06:50.114413 7f23c048c700 filestore(/osd.0) setattrs
> 0.69_head/10000000483.000005d6/head = 26

Does that tell you what you need, Ted?

Thanks!
sage
Theodore Ts'o Aug. 1, 2011, 10:53 a.m. UTC | #4
On Jul 31, 2011, at 4:42 PM, Sage Weil wrote:

> This is link(2)
> 
>> 2011-07-31 23:06:50.114316 7f23c048c700 filestore(/osd.0) collection_remove
>> temp/10000000483.000005d6/head = 0
> 
> This is unlink(2)
> 
>> 2011-07-31 23:06:50.114384 7f23c048c700 filestore(/osd.0) setattrs
>> 0.69_head/10000000483.000005d6/head '_' len 161 = 161
> 
> The xattr names are prefixed with 'user.ceph.'
> 
>> 2011-07-31 23:06:50.114406 7f23c048c700 filestore(/osd.0) setattrs
>> 0.69_head/10000000483.000005d6/head 'snapset' len 26 = 26
>> 2011-07-31 23:06:50.114413 7f23c048c700 filestore(/osd.0) setattrs
>> 0.69_head/10000000483.000005d6/head = 26
> 
> Does that tell you what to need, Ted?

So let me make sure I understand what happened.  Ceph made sure no file existed with this name, and created it with the first write, per this log entry:

2011-07-31 23:06:49.173941 7f23c048c700 filestore(/osd.0) write temp/10000000483.000005d6/head 0~1048576 = 1048576

It then linked it into this directory:

2011-07-31 23:06:50.114287 7f23c048c700 filestore(/osd.0) collection_add 0.69_head/10000000483.000005d6/head temp/10000000483.000005d6/head = 0

… and removed it from the temp directory:

2011-07-31 23:06:50.114316 7f23c048c700 filestore(/osd.0) collection_remove temp/10000000483.000005d6/head = 0

and then created three inodes that totaled approximately 240 bytes in length or so (including the name and headers)

2011-07-31 23:06:50.114384 7f23c048c700 filestore(/osd.0) setattrs 0.69_head/10000000483.000005d6/head '_' len 161 = 161
2011-07-31 23:06:50.114406 7f23c048c700 filestore(/osd.0) setattrs 0.69_head/10000000483.000005d6/head 'snapset' len 26 = 26
2011-07-31 23:06:50.114413 7f23c048c700 filestore(/osd.0) setattrs 0.69_head/10000000483.000005d6/head = 26

… and then presumably the file was closed and that would have been the end of this file being touched by ceph, correct?
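
As a rough check of the "approximately 240 bytes" figure, here is a sketch
of the arithmetic. It rests on assumptions about ext4's on-disk xattr layout
that are not stated in this thread: a 16-byte entry header per attribute,
names and values each rounded up to a multiple of 4 bytes, and the "user."
prefix stored as an index byte rather than in the name. The function names
are hypothetical. Note that the log names only two attributes ('_' and
'snapset'); the third setattrs line carries no name.

```shell
#!/bin/sh
# Round a byte count up to the next multiple of 4.
round4() { echo $(( ($1 + 3) & ~3 )); }

# Hypothetical estimate of the space one xattr takes on disk.
# $1 = attribute name without the "user." prefix, $2 = value length.
# Assumes a 16-byte entry header and 4-byte padding of name and value.
xattr_footprint() {
    name=$1
    vlen=$2
    echo $(( $(round4 $((16 + ${#name}))) + $(round4 "$vlen") ))
}
```

Under those assumptions, xattr_footprint ceph._ 161 gives 188 and
xattr_footprint ceph.snapset 26 gives 56, for a total of 244 bytes.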

Thanks!!

-- Ted
Sage Weil Aug. 1, 2011, 4:20 p.m. UTC | #5
On Mon, 1 Aug 2011, Theodore Tso wrote:
> On Jul 31, 2011, at 4:42 PM, Sage Weil wrote:
> 
> > This is link(2)
> > 
> >> 2011-07-31 23:06:50.114316 7f23c048c700 filestore(/osd.0) collection_remove
> >> temp/10000000483.000005d6/head = 0
> > 
> > This is unlink(2)
> > 
> >> 2011-07-31 23:06:50.114384 7f23c048c700 filestore(/osd.0) setattrs
> >> 0.69_head/10000000483.000005d6/head '_' len 161 = 161
> > 
> > The xattr names are prefixed with 'user.ceph.'
> > 
> >> 2011-07-31 23:06:50.114406 7f23c048c700 filestore(/osd.0) setattrs
> >> 0.69_head/10000000483.000005d6/head 'snapset' len 26 = 26
> >> 2011-07-31 23:06:50.114413 7f23c048c700 filestore(/osd.0) setattrs
> >> 0.69_head/10000000483.000005d6/head = 26
> > 
> > Does that tell you what to need, Ted?
> 
> So let me make sure I understand what happened.  Ceph made sure no file existed with this name, and created with the first write log entry:
> 
> 2011-07-31 23:06:49.173941 7f23c048c700 filestore(/osd.0) write temp/10000000483.000005d6/head 0~1048576 = 1048576

The fd was actually closed here.

> It then linked it into this directory:
> 
> 2011-07-31 23:06:50.114287 7f23c048c700 filestore(/osd.0) collection_add 0.69_head/10000000483.000005d6/head temp/10000000483.000005d6/head = 0
> 
> … and removed it from the temp directory:
> 
> 2011-07-31 23:06:50.114316 7f23c048c700 filestore(/osd.0) collection_remove temp/10000000483.000005d6/head = 0
> 
> and then created three inodes that totaled approximately 240 bytes in length or so (including the name and headers)
                         xattrs
> 
> 2011-07-31 23:06:50.114384 7f23c048c700 filestore(/osd.0) setattrs 0.69_head/10000000483.000005d6/head '_' len 161 = 161
> 2011-07-31 23:06:50.114406 7f23c048c700 filestore(/osd.0) setattrs 0.69_head/10000000483.000005d6/head 'snapset' len 26 = 26
> 2011-07-31 23:06:50.114413 7f23c048c700 filestore(/osd.0) setattrs 0.69_head/10000000483.000005d6/head = 26
> 
> … and then presumably the file was closed and that would have been the 
> end of this file being touched by ceph, correct?

Yep!

sage
Christian Brunner Aug. 3, 2011, 2:16 p.m. UTC | #6
2011/8/1 Sage Weil <sage@newdream.net>:
> On Mon, 1 Aug 2011, Theodore Tso wrote:
>> On Jul 31, 2011, at 4:42 PM, Sage Weil wrote:
>>
>> > This is link(2)
>> >
>> >> 2011-07-31 23:06:50.114316 7f23c048c700 filestore(/osd.0) collection_remove
>> >> temp/10000000483.000005d6/head = 0
>> >
>> > This is unlink(2)
>> >
>> >> 2011-07-31 23:06:50.114384 7f23c048c700 filestore(/osd.0) setattrs
>> >> 0.69_head/10000000483.000005d6/head '_' len 161 = 161
>> >
>> > The xattr names are prefixed with 'user.ceph.'
>> >
>> >> 2011-07-31 23:06:50.114406 7f23c048c700 filestore(/osd.0) setattrs
>> >> 0.69_head/10000000483.000005d6/head 'snapset' len 26 = 26
>> >> 2011-07-31 23:06:50.114413 7f23c048c700 filestore(/osd.0) setattrs
>> >> 0.69_head/10000000483.000005d6/head = 26
>> >
>> > Does that tell you what to need, Ted?
>>
>> So let me make sure I understand what happened.  Ceph made sure no file existed with this name, and created with the first write log entry:
>>
>> 2011-07-31 23:06:49.173941 7f23c048c700 filestore(/osd.0) write temp/10000000483.000005d6/head 0~1048576 = 1048576
>
> The fd was actually closed here.
>
>> It then linked it into this directory:
>>
>> 2011-07-31 23:06:50.114287 7f23c048c700 filestore(/osd.0) collection_add 0.69_head/10000000483.000005d6/head temp/10000000483.000005d6/head = 0
>>
>> … and removed it from the temp directory:
>>
>> 2011-07-31 23:06:50.114316 7f23c048c700 filestore(/osd.0) collection_remove temp/10000000483.000005d6/head = 0
>>
>> and then created three inodes that totaled approximately 240 bytes in length or so (including the name and headers)
>                         xattrs
>>
>> 2011-07-31 23:06:50.114384 7f23c048c700 filestore(/osd.0) setattrs 0.69_head/10000000483.000005d6/head '_' len 161 = 161
>> 2011-07-31 23:06:50.114406 7f23c048c700 filestore(/osd.0) setattrs 0.69_head/10000000483.000005d6/head 'snapset' len 26 = 26
>> 2011-07-31 23:06:50.114413 7f23c048c700 filestore(/osd.0) setattrs 0.69_head/10000000483.000005d6/head = 26
>>
>> … and then presumably the file was closed and that would have been the
>> end of this file being touched by ceph, correct?
>
> Yep!

I tried to reproduce this without ceph, but wasn't able to...

In the meantime it seems that I can also see the side effects on the
librbd side: I get a "librbd: data error!" when I do an "rbd copy".

When I look at the librbd code this is related to a sparse_read not
returning the right size of the object.

I don't know if it helps, but I think that the problem is also related
to sparse file usage.

Regards,
Christian
Yehuda Sadeh Weinraub Aug. 3, 2011, 3:41 p.m. UTC | #7
On Wed, Aug 3, 2011 at 7:16 AM, Christian Brunner <chb@muc.de> wrote:
...
> I tried to reproduce this without ceph, but wasn't able to...
>
> In the meantime it seams, that I can also see the side effects on the
> librbd side: I get an "librbd: data error!" when I do an "rbd copy".
>
> When I look at the librbd code this is related to a sparse_read not
> returning the right size of the object.
>
> I don't know if it helps, but I think that the problem is also related
> to sparse file usage.
>

There were a few sparse-read issues that we fixed not too long ago;
those fixes should be in at least the previous ceph version. I'm
not sure what version you're using.
There was an ext4 fiemap issue that I was hitting in specific
environments, but I couldn't determine whether it was fixed in later
kernel versions (I was using 2.6.32). Now is a good time to try and
get to the bottom of it. Here's a script I was using to reproduce it:

#!/bin/sh
dd if=/dev/urandom of=bla bs=1 seek=$((0x6f000)) count=$((0x1000)); sync
dd if=/dev/urandom of=bla bs=1 seek=$((0x70000)) count=$((0x1000)); sync
dd if=/dev/urandom of=bla bs=1 seek=$((0x71000)) count=$((0x1000)); sync
dd if=/dev/urandom of=bla bs=1 seek=$((0x72000)) count=$((0x1000)); sync
dd if=/dev/urandom of=bla bs=1 seek=$((0x73000)) count=$((0x1000)); sync
dd if=/dev/urandom of=bla bs=1 seek=$((0x74000)) count=$((0x2000)); sync
dd if=/dev/urandom of=bla bs=1 seek=$((0x2ae000)) count=$((0x2000)); sync

You can compile and run the following utility to dump all the extents:
http://pastebin.com/h2Cnpk2Q
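
For repeated runs, the same write pattern can be wrapped in functions. This
is a sketch, not the original reproducer: it writes 4 KiB blocks with
conv=notrunc instead of bs=1, which changes the I/O sizes seen by the
filesystem, and the names write_at and make_sparse are hypothetical.

```shell
#!/bin/sh
# Write $3 random bytes at byte offset $2 of file $1 without truncating it,
# then sync. Uses 4 KiB blocks, so offset and length are assumed to be
# multiples of 4096 (true for every value in the script above).
write_at() {
    dd if=/dev/urandom of="$1" bs=4096 conv=notrunc \
       seek=$(( $2 / 4096 )) count=$(( $3 / 4096 )) 2>/dev/null
    sync
}

# Recreate the write pattern of the script above in file $1.
make_sparse() {
    : > "$1"
    write_at "$1" $((0x6f000)) $((0x1000))
    write_at "$1" $((0x70000)) $((0x1000))
    write_at "$1" $((0x71000)) $((0x1000))
    write_at "$1" $((0x72000)) $((0x1000))
    write_at "$1" $((0x73000)) $((0x1000))
    write_at "$1" $((0x74000)) $((0x2000))
    write_at "$1" $((0x2ae000)) $((0x2000))
}
```

After "make_sparse bla" the file should be 0x2b0000 (2818048) bytes long
with holes before 0x6f000 and between 0x76000 and 0x2ae000. If e2fsprogs is
installed, "filefrag -v bla" is another way to list the extents that FIEMAP
reports.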

Thanks,
Yehuda

Oh, btw, you can effectively disable the use of fiemap by setting the
'filestore fiemap threshold' config option to a large enough value
(e.g., anything bigger than 4 MB should be enough for rbd).
Christian Brunner Aug. 8, 2011, 8:07 p.m. UTC | #8
I tried 3.0.1 today, which contains the commit Theodore suggested, and
was no longer able to reproduce the problem.

So I think the corruption we have seen is indeed related to:

commit 7132de744ba76930d13033061018ddd7e3e8cd91
Author: Maxim Patlasov <maxim.patlasov@gmail.com>
Date:   Sun Jul 10 19:37:48 2011 -0400

   ext4: fix i_blocks/quota accounting when extent insertion fails


I will now try to apply this patch to the RHEL6.1 kernel and see what
happens...

Thanks for your help.

Christian


Christian Brunner Aug. 18, 2011, 9:19 a.m. UTC | #9
I'm sorry that I have to correct this:

The problem still happens with 3.0.1, although it now seems to
happen only under high load.

I also did some tracing (with 3.0.0, as the problem is easier to
reproduce there). It may be worth noting that the corruption does
not occur when I run "strace -f cosd". (Maybe a race condition?)

To reproduce the problem I have now set up a ceph cluster on a single machine
with replication between /ceph/osd.000 and /ceph/osd.001.

My setup now has only two active placement groups with 2 objects.

The corruption happens when I start replication from osd.000 to
osd.001. It is reproducible most of the time (but not always) when I
do the following:

# mkfs.ext4 -T largefile /dev/sdb1
# mount -o noatime,user_xattr /dev/sdb1 /ceph/osd.001/
# cosd -i 001 --mkjournal --mkfs --monmap /tmp/monmap
# /usr/bin/cosd -d -i 001 -c /etc/ceph/ceph.conf


### wait until replication has finished and then stop the cosd

# umount /dev/sdb1
# fsck.ext4 -f /dev/sdb1
e2fsck 1.41.12 (17-May-2010)
Pass 1: Checking inodes, blocks, and sizes
Inode 43, i_blocks is 8, should be 16.  Fix<y>? no

Inode 2078, i_blocks is 24, should be 16.  Fix<y>? no



I can also provide an e2image with the metadata and the strace output
of the cosd, if this would be helpful.

Regards,
Christian


Patch

diff --git a/src/os/FileStore.cc b/src/os/FileStore.cc
index 8bdee6b..fb00cff 100644
--- a/src/os/FileStore.cc
+++ b/src/os/FileStore.cc
@@ -3570,6 +3570,7 @@  int FileStore::_setattrs(coll_t cid, const sobject_t& oid, map<string,bufferptr>
       val = "";
     // ??? Why do we skip setting all the other attrs if one fails?
     r = lfn_setxattr(cid, oid, n, val, p->second.length());
+    dout(10) << "setattrs " << cid << "/" << oid << " '" << p->first << "' len " << p->second.length() << " = " << r << dendl;
     if (r < 0) {
       derr << "FileStore::_setattrs: do_setxattr returned " << r << dendl;
       break;