diff mbox

Kernel 3.0.0 + ext4 + ceph == ...

Message ID 20110730165001.GI7361@thunk.org (mailing list archive)
State New, archived
Headers show

Commit Message

Theodore Ts'o July 30, 2011, 4:50 p.m. UTC
On Sat, Jul 30, 2011 at 07:36:12PM +0300, Fyodor Ustinov wrote:
> As it is written in subject - 3.0.0 release.
> 
> It's Ubuntu 11.04 with custom kernel

Right, sorry, I missed that.  And just to be clear this wasn't an -rc
kernel but 3.0 final, right?

Hmm, looking through recent commits which will shortly be merged into
3.1, this one leaps out, but I'm not sure it's the cause --- how full
was your disk at the end of this exercise?

I haven't looked at Ceph in quite a while.  As I recall it was
primarily doing Direct I/O writes, correct?  Or does it use buffered
I/O?  And does it use the new "punch" ioctl to release blocks from the
middle of a file?  Ext4 added punch support in 3.0, and there are some
bug fixes that are going into 3.1, but I don't think there were any
that would lead to the failure mode you are seeing.

					- Ted

commit 7132de744ba76930d13033061018ddd7e3e8cd91
Author: Maxim Patlasov <maxim.patlasov@gmail.com>
Date:   Sun Jul 10 19:37:48 2011 -0400

    ext4: fix i_blocks/quota accounting when extent insertion fails
    
    The current implementation of ext4_free_blocks() always calls
    dquot_free_block This looks quite sensible in the most cases: blocks
    to be freed are associated with inode and were accounted in quota and
    i_blocks some time ago.
    
    However, there is a case when blocks to free were not accounted by the
    time calling ext4_free_blocks() yet:
    
    1. delalloc is on, write_begin pre-allocated some space in quota
    2. write-back happens, ext4 allocates some blocks in ext4_ext_map_blocks()
    3. then ext4_ext_map_blocks() gets an error (e.g.  ENOSPC) from
       ext4_ext_insert_extent() and calls ext4_free_blocks().
    
    In this scenario, ext4_free_blocks() calls dquot_free_block() who, in
    turn, decrements i_blocks for blocks which were not accounted yet (due
    to delalloc) After clean umount, e2fsck reports something like:
    
    > Inode 21, i_blocks is 5080, should be 5128.  Fix<y>?
    because i_blocks was erroneously decremented as explained above.
    
    The patch fixes the problem by passing the new flag
    EXT4_FREE_BLOCKS_NO_QUOT_UPDATE to ext4_free_blocks(), to request
    that the dquot_free_block() call be skipped.
    
    Signed-off-by: Maxim Patlasov <maxim.patlasov@gmail.com>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    Cc: stable@kernel.org

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Fyodor Ustinov July 30, 2011, 5:16 p.m. UTC | #1
Ted Ts'o <tytso <at> mit.edu> writes:


 > Right, sorry, I missed that.  And just to be clear this wasn't an -rc
 > kernel but 3.0 final, right?
Yes.

 >
 > Hmm, looking through recent commits which will shortly be merged into
 > 3.1, this one leaps out, but I'm not sure it's the cause --- how full
 > was your disk at the end of this exercise?
root@osd0:~# df -h /osd.0/
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdc1             3.6T  1.3T  2.2T  37% /osd.0
root@osd0:~#

 >
 > I haven't looked at Ceph in quite a while.  As I recall it was
 > primarily doing Direct I/O writes, correct?  Or does it use buffered
 > I/O?  And does it use the new "punch" ioctl to release blocks from the
 > middle of a file?  Ext4 added punch support in 3.0, and there are some
 > bug fixes that are going into 3.1, but I don't think there were any
 > that would lead to the failure mode you are seeing.

Ted, I'm sorry, I'm not a developer. I can not answer these questions, 
but I can try build 3.1

WBR,
     Fyodor

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sage Weil July 30, 2011, 5:21 p.m. UTC | #2
On Sat, 30 Jul 2011, Ted Ts'o wrote:
> On Sat, Jul 30, 2011 at 07:36:12PM +0300, Fyodor Ustinov wrote:
> > As it is written in subject - 3.0.0 release.
> > 
> > It's Ubuntu 11.04 with custom kernel
> 
> Right, sorry, I missed that.  And just to be clear this wasn't an -rc
> kernel but 3.0 final, right?
> 
> Hmm, looking through recent commits which will shortly be merged into
> 3.1, this one leaps out, but I'm not sure it's the cause --- how full
> was your disk at the end of this exercise?
> 
> I haven't looked at Ceph in quite a while.  As I recall it was
> primarily doing Direct I/O writes, correct?  Or does it use buffered
> I/O?  And does it use the new "punch" ioctl to release blocks from the
> middle of a file?  Ext4 added punch support in 3.0, and there are some
> bug fixes that are going into 3.1, but I don't think there were any
> that would lead to the failure mode you are seeing.

Direct-io is used for the osd journal only; is that on the ext4 partition, 
Fyodor?  Everything else is buffered io.

We don't use the new punch ioctl.

We do use xattrs extensively, though; that was the last extN bug we 
uncovered.  That's where my money is.

Fyodor, if you set 'debug filestore = 10' you'll get a log of every 
operation on the fs in the osd log.  (Or close to it; there may be a few 
that we missed, but to a first approximation at least it'll describe the 
workload pretty well.)

sage

(BTW we'll be really happy if/when the large xattr patches from the Lustre 
guys make it into mainline!  The (4k?) limit on total xattrs is a problem 
for us.)


> 
> 					- Ted
> 
> commit 7132de744ba76930d13033061018ddd7e3e8cd91
> Author: Maxim Patlasov <maxim.patlasov@gmail.com>
> Date:   Sun Jul 10 19:37:48 2011 -0400
> 
>     ext4: fix i_blocks/quota accounting when extent insertion fails
>     
>     The current implementation of ext4_free_blocks() always calls
>     dquot_free_block This looks quite sensible in the most cases: blocks
>     to be freed are associated with inode and were accounted in quota and
>     i_blocks some time ago.
>     
>     However, there is a case when blocks to free were not accounted by the
>     time calling ext4_free_blocks() yet:
>     
>     1. delalloc is on, write_begin pre-allocated some space in quota
>     2. write-back happens, ext4 allocates some blocks in ext4_ext_map_blocks()
>     3. then ext4_ext_map_blocks() gets an error (e.g.  ENOSPC) from
>        ext4_ext_insert_extent() and calls ext4_free_blocks().
>     
>     In this scenario, ext4_free_blocks() calls dquot_free_block() who, in
>     turn, decrements i_blocks for blocks which were not accounted yet (due
>     to delalloc) After clean umount, e2fsck reports something like:
>     
>     > Inode 21, i_blocks is 5080, should be 5128.  Fix<y>?
>     because i_blocks was erroneously decremented as explained above.
>     
>     The patch fixes the problem by passing the new flag
>     EXT4_FREE_BLOCKS_NO_QUOT_UPDATE to ext4_free_blocks(), to request
>     that the dquot_free_block() call be skipped.
>     
>     Signed-off-by: Maxim Patlasov <maxim.patlasov@gmail.com>
>     Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>     Cc: stable@kernel.org
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 49d2cea..d13f3b5 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -526,6 +526,7 @@ struct ext4_new_group_data {
>  #define EXT4_FREE_BLOCKS_METADATA	0x0001
>  #define EXT4_FREE_BLOCKS_FORGET		0x0002
>  #define EXT4_FREE_BLOCKS_VALIDATED	0x0004
> +#define EXT4_FREE_BLOCKS_NO_QUOT_UPDATE	0x0008
>  
>  /*
>   * ioctl commands
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 31ae5fb..a862138 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3565,12 +3565,14 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
>  
>  	err = ext4_ext_insert_extent(handle, inode, path, &newex, flags);
>  	if (err) {
> +		int fb_flags = flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE ?
> +			EXT4_FREE_BLOCKS_NO_QUOT_UPDATE : 0;
>  		/* free data blocks we just allocated */
>  		/* not a good idea to call discard here directly,
>  		 * but otherwise we'd need to call it every free() */
>  		ext4_discard_preallocations(inode);
>  		ext4_free_blocks(handle, inode, NULL, ext4_ext_pblock(&newex),
> -				 ext4_ext_get_actual_len(&newex), 0);
> +				 ext4_ext_get_actual_len(&newex), fb_flags);
>  		goto out2;
>  	}
>  
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 389386b..1900ec7 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4637,7 +4637,7 @@ do_more:
>  	}
>  	ext4_mark_super_dirty(sb);
>  error_return:
> -	if (freed)
> +	if (freed && !(flags & EXT4_FREE_BLOCKS_NO_QUOT_UPDATE))
>  		dquot_free_block(inode, freed);
>  	brelse(bitmap_bh);
>  	ext4_std_error(sb, err);
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Fyodor Ustinov July 30, 2011, 5:27 p.m. UTC | #3
On 07/30/2011 08:21 PM, Sage Weil wrote:
>
>> Hmm, looking through recent commits which will shortly be merged into
>> 3.1, this one leaps out, but I'm not sure it's the cause --- how full
>> was your disk at the end of this exercise?
>>
>> I haven't looked at Ceph in quite a while.  As I recall it was
>> primarily doing Direct I/O writes, correct?  Or does it use buffered
>> I/O?  And does it use the new "punch" ioctl to release blocks from the
>> middle of a file?  Ext4 added punch support in 3.0, and there are some
>> bug fixes that are going into 3.1, but I don't think there were any
>> that would lead to the failure mode you are seeing.
> Direct-io is used for the osd journal only; is that on the ext4 partition,
> Fyodor?  Everything else is buffered io.
No, journal placed on tempfs.
> We don't use the new punch ioctl.
>
> We do use xattrs extensively, though; that was the last extN bug we
> uncovered.  That's where my money is.
>
> Fyodor, if you set 'debug filestore = 10' you'll get a log of every
> operation on the fs in the osd log.  (Or close to it; there may be a few
> that we missed, but to a first approximation at least it'll describe the
> workload pretty well.)
Ok, I will try it. But my system fs have only 16G. I'm not sure that it 
fits.

WBR,
     Fyodor.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Fyodor Ustinov July 30, 2011, 5:54 p.m. UTC | #4
On 07/30/2011 08:21 PM, Sage Weil wrote:
>
> Fyodor, if you set 'debug filestore = 10' you'll get a log of every
> operation on the fs in the osd log.  (Or close to it; there may be a few
> that we missed, but to a first approximation at least it'll describe the
> workload pretty well.)
>
Well, I do it. No need to wait for complete synchronisation.

Who wants to get this small file? :)

-rw------- 1 root root 412M 2011-07-30 20:51 /var/log/ceph/osd.0.log

WBR,
     Fyodor.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o July 30, 2011, 10:19 p.m. UTC | #5
On Sat, Jul 30, 2011 at 10:21:13AM -0700, Sage Weil wrote:
> 
> We do use xattrs extensively, though; that was the last extN bug we 
> uncovered.  That's where my money is.

Hmm, yes.  That could very well be.  How big are the xattrs, and are
there cases where you:

a) start with a small xattr (where the total size is less than 128
bytes, so it can be stored in the inode table), and then increase it
something where it needs to be stored in an external block?

b) start with enough xattrs so it's large, and then delete all or most
of them?

I could easily believe we might have some bugs as we transition from
in-inode to external block storage, or vice versa.  I'll take a look
at the code and try to create some reproduction cases, but if you
could give me a handle on workload patterns of ceph around xattrs,
that would be interesting.

Another thing to try might be to format the disk with 128 byte inodes
(mke2fs -t ext4 -I 128 /dev/hdXX) and see if you can reproduce the
problem that way.  The support for in-inode xattrs is a new feature
(to ext4), and so it's a bit more likely that if there is a bug, it's
related to our in-inode xattr handling --- and using a 128 byte inode
would suppress that feature.  I don't recommend running that way, of
course, but it might help tell us if that's where we should be looking
for a bug.

> (BTW we'll be really happy if/when the large xattr patches from the Lustre 
> guys make it into mainline!  The (4k?) limit on total xattrs is a problem 
> for us.)

OK, good to know.  It hadn't been high priority for the ext4 team
(since I thought it was only the Lustre folks that really needed it),
but I'll escalate the priority of that on our todo list.

Thanks, regards,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sage Weil July 31, 2011, 4:54 a.m. UTC | #6
On Sat, 30 Jul 2011, Ted Ts'o wrote:
> On Sat, Jul 30, 2011 at 10:21:13AM -0700, Sage Weil wrote:
> > 
> > We do use xattrs extensively, though; that was the last extN bug we 
> > uncovered.  That's where my money is.
> 
> Hmm, yes.  That could very well be.  How big are the xattrs, and are
> there cases where you:
> 
> a) start with a small xattr (where the total size is less than 128
> bytes, so it can be stored in the inode table), and then increase it
> something where it needs to be stored in an external block?
> 
> b) start with enough xattrs so it's large, and then delete all or most
> of them?
> 
> I could easily believe we might have some bugs as we transition from
> in-inode to external block storage, or vice versa.  I'll take a look
> at the code and try to create some reproduction cases, but if you
> could give me a handle on workload patterns of ceph around xattrs,
> that would be interesting.

I would guess a, but it could also be a+b. 

Fyodor, can you take some of the corrupt inos that fsck complained about 
and see what files/directories they are?  find /osd.0 -inum NNN.  (I'm 
guessing the largest xattrs are on the collection directories, like 
/osd.0/current/something_head/.)  Then grep that filename out of the log 
to see exactly which operations took place.  The setattr log normally 
includes xattr size.

> Another thing to try might be to format the disk with 128 byte inodes
> (mke2fs -t ext4 -I 128 /dev/hdXX) and see if you can reproduce the
> problem that way.  The support for in-inode xattrs is a new feature
> (to ext4), and so it's a bit more likely that if there is a bug, it's
> related to our in-inode xattr handling --- and using a 128 byte inode
> would suppress that feature.  I don't recommend running that way, of
> course, but it might help tell us if that's where we should be looking
> for a bug.
>
> > (BTW we'll be really happy if/when the large xattr patches from the Lustre 
> > guys make it into mainline!  The (4k?) limit on total xattrs is a problem 
> > for us.)
> 
> OK, good to know.  It hadn't been high priority for the ext4 team
> (since I thought it was only the Lustre folks that really needed it),
> but I'll escalate the priority of that on our todo list.

Wonderful.

Thanks, Ted!
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Fyodor Ustinov July 31, 2011, 11:33 a.m. UTC | #7
On 07/31/2011 07:54 AM, Sage Weil wrote:
> On Sat, 30 Jul 2011, Ted Ts'o wrote:
>> On Sat, Jul 30, 2011 at 10:21:13AM -0700, Sage Weil wrote:
>>> We do use xattrs extensively, though; that was the last extN bug we
>>> uncovered.  That's where my money is.
>> Hmm, yes.  That could very well be.  How big are the xattrs, and are
>> there cases where you:
>>
>> a) start with a small xattr (where the total size is less than 128
>> bytes, so it can be stored in the inode table), and then increase it
>> something where it needs to be stored in an external block?
>>
>> b) start with enough xattrs so it's large, and then delete all or most
>> of them?
>>
>> I could easily believe we might have some bugs as we transition from
>> in-inode to external block storage, or vice versa.  I'll take a look
>> at the code and try to create some reproduction cases, but if you
>> could give me a handle on workload patterns of ceph around xattrs,
>> that would be interesting.
> I would guess a, but it could also be a+b.
>
> Fyodor, can you take some of the corrupt inos that fsck complained about
> and see what files/directories they are?  find /osd.0 -inum NNN.  (I'm
> guessing the largest xattrs are on the collection directories, like
> /osd.0/current/something_head/.)  Then grep that filename out of the log
> to see exactly which operations took place.  The setattr log normally
> includes xattr size.


/etc/init.d/ceph stop
umount /mnt/osd.0
mke2fs -t ext4 -I 128 /dev/sdc1
tune2fs -o journal_data_writeback /dev/sdc1
mount -a
mon getmap -o /tmp/monmap
cosd --mkfs -i 0 --monmap /tmp/monmap
/etc/init.d/ceph start
sleep 300
/etc/init.d/ceph stop
umount /osd.0
fsck.ext4 -f /dev/sdc1

Inode 99356878, i_blocks is 8208, should be 8200.

mount -a
root@osd0:~# find /osd.0 -inum 99356878

/osd.0/current/0.2a4_head/10000000468.0000007e_head

root@osd0:~# grep "10000000468\.0000007e" /var/log/ceph/osd.0.log
2011-07-31 09:57:20.859834 7f624c82a700 filestore(/osd.0) remove 
temp/10000000468.0000007e/head = -1
2011-07-31 09:57:20.861166 7f624c82a700 filestore(/osd.0) write 
temp/10000000468.0000007e/head 0~1048576 = 1048576
2011-07-31 09:57:20.990464 7f624c029700 filestore(/osd.0) write 
temp/10000000468.0000007e/head 1048576~1048576 = 1048576
2011-07-31 09:57:21.121648 7f624c029700 filestore(/osd.0) write 
temp/10000000468.0000007e/head 2097152~1048576 = 1048576
2011-07-31 09:57:21.265879 7f624c029700 filestore(/osd.0) write 
temp/10000000468.0000007e/head 3145728~1048576 = 1048576
2011-07-31 09:57:21.265952 7f624c029700 filestore(/osd.0) remove 
0.2a4_head/10000000468.0000007e/head = -1
2011-07-31 09:57:21.265995 7f624c029700 filestore(/osd.0) collection_add 
0.2a4_head/10000000468.0000007e/head temp/10000000468.0000007e/head = 0
2011-07-31 09:57:21.266025 7f624c029700 filestore(/osd.0) 
collection_remove temp/10000000468.0000007e/head = 0
2011-07-31 09:57:21.266134 7f624c029700 filestore(/osd.0) setattrs 
0.2a4_head/10000000468.0000007e/head = 26


WBR,
     Fyodor.


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 49d2cea..d13f3b5 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -526,6 +526,7 @@  struct ext4_new_group_data {
 #define EXT4_FREE_BLOCKS_METADATA	0x0001
 #define EXT4_FREE_BLOCKS_FORGET		0x0002
 #define EXT4_FREE_BLOCKS_VALIDATED	0x0004
+#define EXT4_FREE_BLOCKS_NO_QUOT_UPDATE	0x0008
 
 /*
  * ioctl commands
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 31ae5fb..a862138 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3565,12 +3565,14 @@  int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 
 	err = ext4_ext_insert_extent(handle, inode, path, &newex, flags);
 	if (err) {
+		int fb_flags = flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE ?
+			EXT4_FREE_BLOCKS_NO_QUOT_UPDATE : 0;
 		/* free data blocks we just allocated */
 		/* not a good idea to call discard here directly,
 		 * but otherwise we'd need to call it every free() */
 		ext4_discard_preallocations(inode);
 		ext4_free_blocks(handle, inode, NULL, ext4_ext_pblock(&newex),
-				 ext4_ext_get_actual_len(&newex), 0);
+				 ext4_ext_get_actual_len(&newex), fb_flags);
 		goto out2;
 	}
 
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 389386b..1900ec7 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4637,7 +4637,7 @@  do_more:
 	}
 	ext4_mark_super_dirty(sb);
 error_return:
-	if (freed)
+	if (freed && !(flags & EXT4_FREE_BLOCKS_NO_QUOT_UPDATE))
 		dquot_free_block(inode, freed);
 	brelse(bitmap_bh);
 	ext4_std_error(sb, err);