Message ID | Pine.LNX.4.64.1107310951550.2348@cobra.newdream.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 07/31/2011 08:04 PM, Sage Weil wrote: > > Hrm, I was hoping it wouldn't be a setattrs call. The below will tell us > what xattrs it is setting, but sadly you'll need to reproduce the whole > thing again. The above is only telling us the size of the last xattr of > the set. ok, until hour or two I do it. WBR, Fyodor. P.S. muttered under his breath: "? ??? ??? 15 ??? ?? ???????????. ? ??????? ????????? ?????????????. ?? ?? ??? ??? ??? ??? ??????" -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/31/2011 08:04 PM, Sage Weil wrote:
> dout(10)<< "setattrs "<< cid<< "/"<< oid<< " '"<< p->first<< "' len "<< p->second.length()<< " = "<< r<< dendl;
Well.
root@osd0:~# grep "10000000483\.000005d6" /var/log/ceph/osd.0.log
2011-07-31 23:06:49.172838 7f23c048c700 filestore(/osd.0) remove
temp/10000000483.000005d6/head = -1
2011-07-31 23:06:49.173941 7f23c048c700 filestore(/osd.0) write
temp/10000000483.000005d6/head 0~1048576 = 1048576
2011-07-31 23:06:49.900837 7f23c048c700 filestore(/osd.0) write
temp/10000000483.000005d6/head 1048576~1048576 = 1048576
2011-07-31 23:06:49.985600 7f23c048c700 filestore(/osd.0) write
temp/10000000483.000005d6/head 2097152~1048576 = 1048576
2011-07-31 23:06:50.114185 7f23c048c700 filestore(/osd.0) write
temp/10000000483.000005d6/head 3145728~1048576 = 1048576
2011-07-31 23:06:50.114239 7f23c048c700 filestore(/osd.0) remove
0.69_head/10000000483.000005d6/head = -1
2011-07-31 23:06:50.114287 7f23c048c700 filestore(/osd.0) collection_add
0.69_head/10000000483.000005d6/head temp/10000000483.000005d6/head = 0
2011-07-31 23:06:50.114316 7f23c048c700 filestore(/osd.0)
collection_remove temp/10000000483.000005d6/head = 0
2011-07-31 23:06:50.114384 7f23c048c700 filestore(/osd.0) setattrs
0.69_head/10000000483.000005d6/head '_' len 161 = 161
2011-07-31 23:06:50.114406 7f23c048c700 filestore(/osd.0) setattrs
0.69_head/10000000483.000005d6/head 'snapset' len 26 = 26
2011-07-31 23:06:50.114413 7f23c048c700 filestore(/osd.0) setattrs
0.69_head/10000000483.000005d6/head = 26
root@osd0:~#
WBR,
Fyodor.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, 31 Jul 2011, Fyodor Ustinov wrote: > On 07/31/2011 08:04 PM, Sage Weil wrote: > > dout(10)<< "setattrs "<< cid<< "/"<< oid<< " '"<< p->first<< "' len > > "<< p->second.length()<< " = "<< r<< dendl; > Well. Thanks! Just to clarify a few things: > root@osd0:~# grep "10000000483\.000005d6" /var/log/ceph/osd.0.log > 2011-07-31 23:06:49.172838 7f23c048c700 filestore(/osd.0) remove > temp/10000000483.000005d6/head = -1 ^ The actual filename is a 10000000483.000005d6_head, ignore that last /. > 2011-07-31 23:06:49.173941 7f23c048c700 filestore(/osd.0) write > temp/10000000483.000005d6/head 0~1048576 = 1048576 > 2011-07-31 23:06:49.900837 7f23c048c700 filestore(/osd.0) write > temp/10000000483.000005d6/head 1048576~1048576 = 1048576 > 2011-07-31 23:06:49.985600 7f23c048c700 filestore(/osd.0) write > temp/10000000483.000005d6/head 2097152~1048576 = 1048576 > 2011-07-31 23:06:50.114185 7f23c048c700 filestore(/osd.0) write > temp/10000000483.000005d6/head 3145728~1048576 = 1048576 > 2011-07-31 23:06:50.114239 7f23c048c700 filestore(/osd.0) remove > 0.69_head/10000000483.000005d6/head = -1 > 2011-07-31 23:06:50.114287 7f23c048c700 filestore(/osd.0) collection_add > 0.69_head/10000000483.000005d6/head temp/10000000483.000005d6/head = 0 This is link(2) > 2011-07-31 23:06:50.114316 7f23c048c700 filestore(/osd.0) collection_remove > temp/10000000483.000005d6/head = 0 This is unlink(2) > 2011-07-31 23:06:50.114384 7f23c048c700 filestore(/osd.0) setattrs > 0.69_head/10000000483.000005d6/head '_' len 161 = 161 The xattr names are prefixed with 'user.ceph.' > 2011-07-31 23:06:50.114406 7f23c048c700 filestore(/osd.0) setattrs > 0.69_head/10000000483.000005d6/head 'snapset' len 26 = 26 > 2011-07-31 23:06:50.114413 7f23c048c700 filestore(/osd.0) setattrs > 0.69_head/10000000483.000005d6/head = 26 Does that tell you what to need, Ted? Thanks! sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Jul 31, 2011, at 4:42 PM, Sage Weil wrote: > This is link(2) > >> 2011-07-31 23:06:50.114316 7f23c048c700 filestore(/osd.0) collection_remove >> temp/10000000483.000005d6/head = 0 > > This is unlink(2) > >> 2011-07-31 23:06:50.114384 7f23c048c700 filestore(/osd.0) setattrs >> 0.69_head/10000000483.000005d6/head '_' len 161 = 161 > > The xattr names are prefixed with 'user.ceph.' > >> 2011-07-31 23:06:50.114406 7f23c048c700 filestore(/osd.0) setattrs >> 0.69_head/10000000483.000005d6/head 'snapset' len 26 = 26 >> 2011-07-31 23:06:50.114413 7f23c048c700 filestore(/osd.0) setattrs >> 0.69_head/10000000483.000005d6/head = 26 > > Does that tell you what to need, Ted? So let me make sure I understand what happened. Ceph made sure no file existed with this name, and created with the first write log entry: 2011-07-31 23:06:49.173941 7f23c048c700 filestore(/osd.0) write temp/10000000483.000005d6/head 0~1048576 = 1048576 It then linked it into this directory: 2011-07-31 23:06:50.114287 7f23c048c700 filestore(/osd.0) collection_add 0.69_head/10000000483.000005d6/head temp/10000000483.000005d6/head = 0 … and removed it from the temp directory: 2011-07-31 23:06:50.114316 7f23c048c700 filestore(/osd.0) collection_remove temp/10000000483.000005d6/head = 0 and then created three inodes that totaled approximately 240 bytes in length or so (including the name and headers) 2011-07-31 23:06:50.114384 7f23c048c700 filestore(/osd.0) setattrs 0.69_head/10000000483.000005d6/head '_' len 161 = 161 2011-07-31 23:06:50.114406 7f23c048c700 filestore(/osd.0) setattrs 0.69_head/10000000483.000005d6/head 'snapset' len 26 = 26 2011-07-31 23:06:50.114413 7f23c048c700 filestore(/osd.0) setattrs 0.69_head/10000000483.000005d6/head = 26 … and then presumably the file was closed and that would have been the end of this file being touched by ceph, correct? Thanks!! -- Ted-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 1 Aug 2011, Theodore Tso wrote: > On Jul 31, 2011, at 4:42 PM, Sage Weil wrote: > > > This is link(2) > > > >> 2011-07-31 23:06:50.114316 7f23c048c700 filestore(/osd.0) collection_remove > >> temp/10000000483.000005d6/head = 0 > > > > This is unlink(2) > > > >> 2011-07-31 23:06:50.114384 7f23c048c700 filestore(/osd.0) setattrs > >> 0.69_head/10000000483.000005d6/head '_' len 161 = 161 > > > > The xattr names are prefixed with 'user.ceph.' > > > >> 2011-07-31 23:06:50.114406 7f23c048c700 filestore(/osd.0) setattrs > >> 0.69_head/10000000483.000005d6/head 'snapset' len 26 = 26 > >> 2011-07-31 23:06:50.114413 7f23c048c700 filestore(/osd.0) setattrs > >> 0.69_head/10000000483.000005d6/head = 26 > > > > Does that tell you what to need, Ted? > > So let me make sure I understand what happened. Ceph made sure no file existed with this name, and created with the first write log entry: > > 2011-07-31 23:06:49.173941 7f23c048c700 filestore(/osd.0) write temp/10000000483.000005d6/head 0~1048576 = 1048576 The fd was actually closed here. > It then linked it into this directory: > > 2011-07-31 23:06:50.114287 7f23c048c700 filestore(/osd.0) collection_add 0.69_head/10000000483.000005d6/head temp/10000000483.000005d6/head = 0 > > ? and removed it from the temp directory: > > 2011-07-31 23:06:50.114316 7f23c048c700 filestore(/osd.0) collection_remove temp/10000000483.000005d6/head = 0 > > and then created three inodes that totaled approximately 240 bytes in length or so (including the name and headers) xattrs > > 2011-07-31 23:06:50.114384 7f23c048c700 filestore(/osd.0) setattrs 0.69_head/10000000483.000005d6/head '_' len 161 = 161 > 2011-07-31 23:06:50.114406 7f23c048c700 filestore(/osd.0) setattrs 0.69_head/10000000483.000005d6/head 'snapset' len 26 = 26 > 2011-07-31 23:06:50.114413 7f23c048c700 filestore(/osd.0) setattrs 0.69_head/10000000483.000005d6/head = 26 > > ? and then presumably the file was closed and that would have been the > end of this file being touched by ceph, correct? Yep! sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
2011/8/1 Sage Weil <sage@newdream.net>: > On Mon, 1 Aug 2011, Theodore Tso wrote: >> On Jul 31, 2011, at 4:42 PM, Sage Weil wrote: >> >> > This is link(2) >> > >> >> 2011-07-31 23:06:50.114316 7f23c048c700 filestore(/osd.0) collection_remove >> >> temp/10000000483.000005d6/head = 0 >> > >> > This is unlink(2) >> > >> >> 2011-07-31 23:06:50.114384 7f23c048c700 filestore(/osd.0) setattrs >> >> 0.69_head/10000000483.000005d6/head '_' len 161 = 161 >> > >> > The xattr names are prefixed with 'user.ceph.' >> > >> >> 2011-07-31 23:06:50.114406 7f23c048c700 filestore(/osd.0) setattrs >> >> 0.69_head/10000000483.000005d6/head 'snapset' len 26 = 26 >> >> 2011-07-31 23:06:50.114413 7f23c048c700 filestore(/osd.0) setattrs >> >> 0.69_head/10000000483.000005d6/head = 26 >> > >> > Does that tell you what to need, Ted? >> >> So let me make sure I understand what happened. Ceph made sure no file existed with this name, and created with the first write log entry: >> >> 2011-07-31 23:06:49.173941 7f23c048c700 filestore(/osd.0) write temp/10000000483.000005d6/head 0~1048576 = 1048576 > > The fd was actually closed here. > >> It then linked it into this directory: >> >> 2011-07-31 23:06:50.114287 7f23c048c700 filestore(/osd.0) collection_add 0.69_head/10000000483.000005d6/head temp/10000000483.000005d6/head = 0 >> >> ? and removed it from the temp directory: >> >> 2011-07-31 23:06:50.114316 7f23c048c700 filestore(/osd.0) collection_remove temp/10000000483.000005d6/head = 0 >> >> and then created three inodes that totaled approximately 240 bytes in length or so (including the name and headers) > xattrs >> >> 2011-07-31 23:06:50.114384 7f23c048c700 filestore(/osd.0) setattrs 0.69_head/10000000483.000005d6/head '_' len 161 = 161 >> 2011-07-31 23:06:50.114406 7f23c048c700 filestore(/osd.0) setattrs 0.69_head/10000000483.000005d6/head 'snapset' len 26 = 26 >> 2011-07-31 23:06:50.114413 7f23c048c700 filestore(/osd.0) setattrs 0.69_head/10000000483.000005d6/head = 26 >> >> ? and then presumably the file was closed and that would have been the >> end of this file being touched by ceph, correct? > > Yep! I tried to reproduce this without ceph, but wasn't able to... In the meantime it seams, that I can also see the side effects on the librbd side: I get an "librbd: data error!" when I do an "rbd copy". When I look at the librbd code this is related to a sparse_read not returning the right size of the object. I don't know if it helps, but I think that the problem is also related to sparse file usage. Regards, Christian -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Aug 3, 2011 at 7:16 AM, Christian Brunner <chb@muc.de> wrote: ... > I tried to reproduce this without ceph, but wasn't able to... > > In the meantime it seams, that I can also see the side effects on the > librbd side: I get an "librbd: data error!" when I do an "rbd copy". > > When I look at the librbd code this is related to a sparse_read not > returning the right size of the object. > > I don't know if it helps, but I think that the problem is also related > to sparse file usage. > There were a few sparse-read issues that we fixed not too long ago, but should have been fixed for at least the previous ceph version. I'm not sure what version you're using. There was a ext4 fiemap issue that I was hitting on specific environments but couldn't determine whether it was fixed in later kernel versions (I was using 2.6.32). Now is a good time to try and get to the bottom of it. Here's a script I was using to reproduce it: #!/bin/sh dd if=/dev/urandom of=bla bs=1 seek=$((0x6f000)) count=$((0x1000)); sync dd if=/dev/urandom of=bla bs=1 seek=$((0x70000)) count=$((0x1000)); sync dd if=/dev/urandom of=bla bs=1 seek=$((0x71000)) count=$((0x1000)); sync dd if=/dev/urandom of=bla bs=1 seek=$((0x72000)) count=$((0x1000)); sync dd if=/dev/urandom of=bla bs=1 seek=$((0x73000)) count=$((0x1000)); sync dd if=/dev/urandom of=bla bs=1 seek=$((0x74000)) count=$((0x2000)); sync dd if=/dev/urandom of=bla bs=1 seek=$((0x2ae000)) count=$((0x2000)); sync You can compile and run the following utility to dump all the extents: http://pastebin.com/h2Cnpk2Q Thanks, Yehuda Oh, btw, You can effectively disable the use of fiemap by setting the 'filestore fiemap threshold' config option with large enough value (e.g., anything bigger than 4 MB should be enough for rbd). -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
I tried 3.0.1 today, which contains the commit Theodore suggested and was no longer able to reproduce the problem. So I think the corruption we have seen is indeed related to: commit 7132de744ba76930d13033061018ddd7e3e8cd91 Author: Maxim Patlasov <maxim.patlasov@gmail.com> Date: Sun Jul 10 19:37:48 2011 -0400 ext4: fix i_blocks/quota accounting when extent insertion fails I will now try to apply this patch to the RHEL6.1 kernel and see what happens... Thanks for your help. Christian 2011/8/3 Yehuda Sadeh Weinraub <yehuda.sadeh@dreamhost.com>: > On Wed, Aug 3, 2011 at 7:16 AM, Christian Brunner <chb@muc.de> wrote: > ... >> I tried to reproduce this without ceph, but wasn't able to... >> >> In the meantime it seams, that I can also see the side effects on the >> librbd side: I get an "librbd: data error!" when I do an "rbd copy". >> >> When I look at the librbd code this is related to a sparse_read not >> returning the right size of the object. >> >> I don't know if it helps, but I think that the problem is also related >> to sparse file usage. >> > > There were a few sparse-read issues that we fixed not too long ago, > but should have been fixed for at least the previous ceph version. I'm > not sure what version you're using. > There was a ext4 fiemap issue that I was hitting on specific > environments but couldn't determine whether it was fixed in later > kernel versions (I was using 2.6.32). Now is a good time to try and > get to the bottom of it. Here's a script I was using to reproduce it: > > #!/bin/sh > dd if=/dev/urandom of=bla bs=1 seek=$((0x6f000)) count=$((0x1000)); sync > dd if=/dev/urandom of=bla bs=1 seek=$((0x70000)) count=$((0x1000)); sync > dd if=/dev/urandom of=bla bs=1 seek=$((0x71000)) count=$((0x1000)); sync > dd if=/dev/urandom of=bla bs=1 seek=$((0x72000)) count=$((0x1000)); sync > dd if=/dev/urandom of=bla bs=1 seek=$((0x73000)) count=$((0x1000)); sync > dd if=/dev/urandom of=bla bs=1 seek=$((0x74000)) count=$((0x2000)); sync > dd if=/dev/urandom of=bla bs=1 seek=$((0x2ae000)) count=$((0x2000)); sync > > You can compile and run the following utility to dump all the extents: > http://pastebin.com/h2Cnpk2Q > > Thanks, > Yehuda > > Oh, btw, You can effectively disable the use of fiemap by setting the > 'filestore fiemap threshold' config option with large enough value > (e.g., anything bigger than 4 MB should be enough for rbd). > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
I'm sorry, that I have to correct this: The problem is still happening with 3.0.1. Although it only seems to happen under high load now. I also did some tracing (with 3.0.0 as the problem is easier to reproduce here). What might be interesting to note is, that the corruption does not occur, when I do an "strace -f cosd". (Maybe a race condition?). To reproduce the problem I have now setup a ceph cluster on a single machine with replication between /ceph/osd.000 and /ceph/osd.001. My setup now has only two active placement groups with 2 objects. The corruption is happening, when I start replication from osd.000 to osd.001. It is reproducible most of the time (but not allways), when I do the following: # mkfs.ext4 -T largefile /dev/sdb1 # mount -o noatime,user_xattr /dev/sdb1 /ceph/osd.001/ # cosd -i 001 --mkjournal --mkfs --monmap /tmp/monmap # /usr/bin/cosd -d -i 001 -c /etc/ceph/ceph.conf ### wait until replication has finished and then stop the cosd # umount /dev/sdb1 # fsck.ext4 -f /dev/sdb e2fsck 1.41.12 (17-May-2010) Pass 1: Checking inodes, blocks, and sizes Inode 43, i_blocks is 8, should be 16. Fix<y>? no Inode 2078, i_blocks is 24, should be 16. Fix<y>? no I can also provide an e2image with the metadata and the strace output of the cosd, if this would be helpful. Regards, Christian 2011/8/8 Christian Brunner <chb@muc.de>: > I tried 3.0.1 today, which contains the commit Theodore suggested and > was no longer able to reproduce the problem. > > So I think the corruption we have seen is indeed related to: > > commit 7132de744ba76930d13033061018ddd7e3e8cd91 > Author: Maxim Patlasov <maxim.patlasov@gmail.com> > Date: Sun Jul 10 19:37:48 2011 -0400 > > ext4: fix i_blocks/quota accounting when extent insertion fails > > > I will now try to apply this patch to the RHEL6.1 kernel and see what > happens... > > Thanks for your help. > > Christian > > > 2011/8/3 Yehuda Sadeh Weinraub <yehuda.sadeh@dreamhost.com>: >> On Wed, Aug 3, 2011 at 7:16 AM, Christian Brunner <chb@muc.de> wrote: >> ... >>> I tried to reproduce this without ceph, but wasn't able to... >>> >>> In the meantime it seams, that I can also see the side effects on the >>> librbd side: I get an "librbd: data error!" when I do an "rbd copy". >>> >>> When I look at the librbd code this is related to a sparse_read not >>> returning the right size of the object. >>> >>> I don't know if it helps, but I think that the problem is also related >>> to sparse file usage. >>> >> >> There were a few sparse-read issues that we fixed not too long ago, >> but should have been fixed for at least the previous ceph version. I'm >> not sure what version you're using. >> There was a ext4 fiemap issue that I was hitting on specific >> environments but couldn't determine whether it was fixed in later >> kernel versions (I was using 2.6.32). Now is a good time to try and >> get to the bottom of it. Here's a script I was using to reproduce it: >> >> #!/bin/sh >> dd if=/dev/urandom of=bla bs=1 seek=$((0x6f000)) count=$((0x1000)); sync >> dd if=/dev/urandom of=bla bs=1 seek=$((0x70000)) count=$((0x1000)); sync >> dd if=/dev/urandom of=bla bs=1 seek=$((0x71000)) count=$((0x1000)); sync >> dd if=/dev/urandom of=bla bs=1 seek=$((0x72000)) count=$((0x1000)); sync >> dd if=/dev/urandom of=bla bs=1 seek=$((0x73000)) count=$((0x1000)); sync >> dd if=/dev/urandom of=bla bs=1 seek=$((0x74000)) count=$((0x2000)); sync >> dd if=/dev/urandom of=bla bs=1 seek=$((0x2ae000)) count=$((0x2000)); sync >> >> You can compile and run the following utility to dump all the extents: >> http://pastebin.com/h2Cnpk2Q >> >> Thanks, >> Yehuda >> >> Oh, btw, You can effectively disable the use of fiemap by setting the >> 'filestore fiemap threshold' config option with large enough value >> (e.g., anything bigger than 4 MB should be enough for rbd). >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/src/os/FileStore.cc b/src/os/FileStore.cc index 8bdee6b..fb00cff 100644 --- a/src/os/FileStore.cc +++ b/src/os/FileStore.cc @@ -3570,6 +3570,7 @@ int FileStore::_setattrs(coll_t cid, const sobject_t& oid, map<string,bufferptr> val = ""; // ??? Why do we skip setting all the other attrs if one fails? r = lfn_setxattr(cid, oid, n, val, p->second.length()); + dout(10) << "setattrs " << cid << "/" << oid << " '" << p->first << "' len " << p->second.length() << " = " << r << dendl; if (r < 0) { derr << "FileStore::_setattrs: do_setxattr returned " << r << dendl; break;