diff mbox

Btrfs: Don't allocate inode that is already in use

Message ID 71e7d4e789d0d2d0cc9f1138cb4411a1d0ec5125.1381860425.git.sbehrens@giantdisaster.de (mailing list archive)
State Accepted, archived
Headers show

Commit Message

Stefan Behrens Oct. 15, 2013, 6:08 p.m. UTC
Due to an off-by-one error, it is possible to reproduce a bug
when the inode cache is used.

The same inode number is assigned twice, the second time this
leads to an EEXIST in btrfs_insert_empty_items().

The issue can happen when a file is removed right after a subvolume
is created and then a new inode number is created before the
inodes in free_inode_pinned are processed.
unlink() calls btrfs_return_ino() which calls start_caching() in this
case which adds [highest_ino + 1, BTRFS_LAST_FREE_OBJECTID] by
searching for the highest inode (which already cannot find the
unlinked one anymore in btrfs_find_free_objectid()). So if this
unlinked inode's number is equal to the highest_ino + 1 (or >= this value
instead of > this value which was the off-by-one error), we mustn't add
the inode number to free_ino_pinned (caching_thread() does it right).
In this case we need to try directly to add the number to the inode_cache
which will fail in this case.

When this inode number is allocated while it is still in free_ino_pinned,
it is allocated and still added to the free inode cache when the
pinned inodes are processed, thus one of the following inode number
allocations will get an inode that is already in use and fail with EEXIST
in btrfs_insert_empty_items().

One example which was created with the reproducer below:
Create a snapshot, work in the newly created snapshot for the rest.
In unlink(inode 34284) call btrfs_return_ino() which calls start_caching().
start_caching() calls add_free_space [34284, 18446744073709517077].
In btrfs_return_ino(), call start_caching pinned [34284, 1] which is wrong.
mkdir() call btrfs_find_ino_for_alloc() which returns the number 34284.
btrfs_unpin_free_ino calls add_free_space [34284, 1].
mkdir() call btrfs_find_ino_for_alloc() which returns the number 34284.
EEXIST when the new inode is inserted.

One possible reproducer is this one:
 #!/bin/sh
 # preparation
TEST_DEV=/dev/sdc1
TEST_MNT=/mnt
umount ${TEST_MNT} 2>/dev/null || true
mkfs.btrfs -f ${TEST_DEV}
mount ${TEST_DEV} ${TEST_MNT} -o \
 rw,relatime,compress=lzo,space_cache,inode_cache
btrfs subv create ${TEST_MNT}/s1
for i in `seq 34027`; do touch ${TEST_MNT}/s1/${i}; done
btrfs subv snap ${TEST_MNT}/s1 ${TEST_MNT}/s2
FILENAME=`find ${TEST_MNT}/s1/ -inum 4085 | sed 's|^.*/\([^/]*\)$|\1|'`
rm ${TEST_MNT}/s2/$FILENAME
touch ${TEST_MNT}/s2/$FILENAME
 # the following steps can be repeated to reproduce the issue again and again
[ -e ${TEST_MNT}/s3 ] && btrfs subv del ${TEST_MNT}/s3
btrfs subv snap ${TEST_MNT}/s2 ${TEST_MNT}/s3
rm ${TEST_MNT}/s3/$FILENAME
touch ${TEST_MNT}/s3/$FILENAME
ls -alFi ${TEST_MNT}/s?/$FILENAME
touch ${TEST_MNT}/s3/_1 || logger FAILED
ls -alFi ${TEST_MNT}/s?/_1
touch ${TEST_MNT}/s3/_2 || logger FAILED
ls -alFi ${TEST_MNT}/s?/_2
touch ${TEST_MNT}/s3/__1 || logger FAILED
ls -alFi ${TEST_MNT}/s?/__1
touch ${TEST_MNT}/s3/__2 || logger FAILED
ls -alFi ${TEST_MNT}/s?/__2
 # if the above is not enough, add the following loop:
for i in `seq 3 9`; do touch ${TEST_MNT}/s3/__${i} || logger FAILED; done
 #for i in `seq 3 34027`; do touch ${TEST_MNT}/s3/__${i} || logger FAILED; done
 # one of the touch(1) calls in s3 fail due to EEXIST because the inode is
 # already in use that btrfs_find_ino_for_alloc() returns.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/inode-map.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Jan Schmidt Oct. 15, 2013, 6:54 p.m. UTC | #1
On Tue, October 15, 2013 at 20:08 (+0200), Stefan Behrens wrote:
> Due to an off-by-one error, it is possible to reproduce a bug
> when the inode cache is used.
> 
> The same inode number is assigned twice, the second time this
> leads to an EEXIST in btrfs_insert_empty_items().
> 
> The issue can happen when a file is removed right after a subvolume
> is created and then a new inode number is created before the
> inodes in free_inode_pinned are processed.
> unlink() calls btrfs_return_ino() which calls start_caching() in this
> case which adds [highest_ino + 1, BTRFS_LAST_FREE_OBJECTID] by
> searching for the highest inode (which already cannot find the
> unlinked one anymore in btrfs_find_free_objectid()). So if this
> unlinked inode's number is equal to the highest_ino + 1 (or >= this value
> instead of > this value which was the off-by-one error), we mustn't add
> the inode number to free_ino_pinned (caching_thread() does it right).
> In this case we need to try directly to add the number to the inode_cache
> which will fail in this case.
> 
> When this inode number is allocated while it is still in free_ino_pinned,
> it is allocated and still added to the free inode cache when the
> pinned inodes are processed, thus one of the following inode number
> allocations will get an inode that is already in use and fail with EEXIST
> in btrfs_insert_empty_items().
> 
> One example which was created with the reproducer below:
> Create a snapshot, work in the newly created snapshot for the rest.
> In unlink(inode 34284) call btrfs_return_ino() which calls start_caching().
> start_caching() calls add_free_space [34284, 18446744073709517077].
> In btrfs_return_ino(), call start_caching pinned [34284, 1] which is wrong.
> mkdir() call btrfs_find_ino_for_alloc() which returns the number 34284.
> btrfs_unpin_free_ino calls add_free_space [34284, 1].
> mkdir() call btrfs_find_ino_for_alloc() which returns the number 34284.
> EEXIST when the new inode is inserted.
> 
> One possible reproducer is this one:
>  #!/bin/sh
>  # preparation
> TEST_DEV=/dev/sdc1
> TEST_MNT=/mnt
> umount ${TEST_MNT} 2>/dev/null || true
> mkfs.btrfs -f ${TEST_DEV}
> mount ${TEST_DEV} ${TEST_MNT} -o \
>  rw,relatime,compress=lzo,space_cache,inode_cache
> btrfs subv create ${TEST_MNT}/s1
> for i in `seq 34027`; do touch ${TEST_MNT}/s1/${i}; done
> btrfs subv snap ${TEST_MNT}/s1 ${TEST_MNT}/s2
> FILENAME=`find ${TEST_MNT}/s1/ -inum 4085 | sed 's|^.*/\([^/]*\)$|\1|'`
> rm ${TEST_MNT}/s2/$FILENAME
> touch ${TEST_MNT}/s2/$FILENAME
>  # the following steps can be repeated to reproduce the issue again and again
> [ -e ${TEST_MNT}/s3 ] && btrfs subv del ${TEST_MNT}/s3
> btrfs subv snap ${TEST_MNT}/s2 ${TEST_MNT}/s3
> rm ${TEST_MNT}/s3/$FILENAME
> touch ${TEST_MNT}/s3/$FILENAME
> ls -alFi ${TEST_MNT}/s?/$FILENAME
> touch ${TEST_MNT}/s3/_1 || logger FAILED
> ls -alFi ${TEST_MNT}/s?/_1
> touch ${TEST_MNT}/s3/_2 || logger FAILED
> ls -alFi ${TEST_MNT}/s?/_2
> touch ${TEST_MNT}/s3/__1 || logger FAILED
> ls -alFi ${TEST_MNT}/s?/__1
> touch ${TEST_MNT}/s3/__2 || logger FAILED
> ls -alFi ${TEST_MNT}/s?/__2
>  # if the above is not enough, add the following loop:
> for i in `seq 3 9`; do touch ${TEST_MNT}/s3/__${i} || logger FAILED; done
>  #for i in `seq 3 34027`; do touch ${TEST_MNT}/s3/__${i} || logger FAILED; done
>  # one of the touch(1) calls in s3 fail due to EEXIST because the inode is
>  # already in use that btrfs_find_ino_for_alloc() returns.

Probably a bit too obscure to turn this into  an xfstest? At least nobody
complained so far, and this reproducer takes me 1m57 to run, so nothing I want
in each xfstest cycle.

If we ever introduce a similar problem, this reproducer probably won't find it
(at least if it's really dependent on the exact number of files and the exact
inode number), unless we're effectively reversing this patch. So no real use for
a regression test in my opinion, I'm okay with just fixing it.

> Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
> ---
>  fs/btrfs/inode-map.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c
> index 014de49..ec08004 100644
> --- a/fs/btrfs/inode-map.c
> +++ b/fs/btrfs/inode-map.c
> @@ -237,7 +237,7 @@ again:
>  		start_caching(root);
>  
>  		if (objectid <= root->cache_progress ||
> -		    objectid > root->highest_objectid)
> +		    objectid >= root->highest_objectid)
>  			__btrfs_add_free_space(ctl, objectid, 1);
>  		else
>  			__btrfs_add_free_space(pinned, objectid, 1);
> 

Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>

... although this is not the most beautiful commit message I've ever seen ;-)

-Jan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zach Brown Oct. 15, 2013, 8:41 p.m. UTC | #2
> Probably a bit too obscure to turn this into  an xfstest? At least nobody
> complained so far, and this reproducer takes me 1m57 to run, so nothing I want
> in each xfstest cycle.

I disagree.  The entire point of regression tests is to trigger bugs
that the usual processes failed to find, like this one.

If you think that two minutes is too long for a test to run then mark it
as "stress" (is that the xfstests group for boring long running tests?)
or take the time to make a tighter test.

Don't just skip regression testing.  Please.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Schmidt Oct. 16, 2013, 12:19 p.m. UTC | #3
On Tue, October 15, 2013 at 22:41 (+0200), Zach Brown wrote:
>> Probably a bit too obscure to turn this into  an xfstest? At least nobody
>> complained so far, and this reproducer takes me 1m57 to run, so nothing I want
>> in each xfstest cycle.
> 
> I disagree.  The entire point of regression tests is to trigger bugs
> that the usual processes failed to find, like this one.
> 
> If you think that two minutes is too long for a test to run then mark it
> as "stress" (is that the xfstests group for boring long running tests?)
> or take the time to make a tighter test.
> 
> Don't just skip regression testing.  Please.

You are mixing up my points. The first argument you're quoting is not against
regression testing in this case, and it deserves the "stress" answer, I agree.

You don't quote my second argument, which is not "just skip regression testing".
I'll try again in other words: A regression test only makes sense if it can
prevent us from making the same mistake again. As far as I see, the reproducer
script is so specific, that the only thing it can prevent is an exact revert of
Stefan's patch. If you argue that we should have a test for just this, fair
enough, then we could use exactly Stefan's script. I don't think that gains us
anything. We're not normally reverting bugfix patches deliberately, especially
not for very short patches with very long descriptions.

I'd very much like to see a more generic test to avoid similar regressions, if
that can be created. I don't have a good plan how to trigger such a situation
(i.e. know which inodes are on the free_inode_pinned list) in a more general way.

-Jan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zach Brown Oct. 16, 2013, 4:46 p.m. UTC | #4
> > Don't just skip regression testing.  Please.
> 
> You are mixing up my points. The first argument you're quoting is not against
> regression testing in this case, and it deserves the "stress" answer, I agree.

Great, then we're done.

(Yes, I'm very much ignoring your other argument that's based on your
predictions of what bugs will exist in the future and whether or not
tests will find them.)

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tim Landscheidt Oct. 16, 2013, 5:26 p.m. UTC | #5
Jan Schmidt <list.btrfs@jan-o-sch.net> wrote:

> [...]

> You don't quote my second argument, which is not "just skip regression testing".
> I'll try again in other words: A regression test only makes sense if it can
> prevent us from making the same mistake again. As far as I see, the reproducer
> script is so specific, that the only thing it can prevent is an exact revert of
> Stefan's patch. If you argue that we should have a test for just this, fair
> enough, then we could use exactly Stefan's script. I don't think that gains us
> anything. We're not normally reverting bugfix patches deliberately, especially
> not for very short patches with very long descriptions.

> [...]

The presence of the bug in the current code indicates that
someone in the past made an error, and that suggests that
this error can be repeated (by someone else) for example in
a rewrite months or years in the future.  The purpose of a
regression test is to spare anyone who touches the code to
go through all the commits to see if they're unintentionally
reverting a bug fix, but give them a nice "FAIL"/"PASS"
traffic light.

Tim

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c
index 014de49..ec08004 100644
--- a/fs/btrfs/inode-map.c
+++ b/fs/btrfs/inode-map.c
@@ -237,7 +237,7 @@  again:
 		start_caching(root);
 
 		if (objectid <= root->cache_progress ||
-		    objectid > root->highest_objectid)
+		    objectid >= root->highest_objectid)
 			__btrfs_add_free_space(ctl, objectid, 1);
 		else
 			__btrfs_add_free_space(pinned, objectid, 1);