
[1/2,v2] fs: add SEEK_HOLE and SEEK_DATA flags

Message ID 1304531920-2890-1-git-send-email-josef@redhat.com (mailing list archive)
State Not Applicable, archived

Commit Message

Josef Bacik May 4, 2011, 5:58 p.m. UTC
This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags.  Turns out
using fiemap in things like cp causes more problems than it solves, so let's try
and give userspace an interface that doesn't suck.  So we have

-SEEK_HOLE: this moves the file pos to the nearest hole in the file from the
given position.  If the given position is a hole then pos won't move.  A "hole"
is defined by whatever the fs feels like defining it to be.  In simple things
like ext2/3 it will simply mean an unallocated space in the file.  For more
complex things where you have preallocated space then that is left up to the
filesystem.  Since preallocated space is supposed to return all 0's it is
perfectly legitimate to have SEEK_HOLE dump you out at the start of a
preallocated extent, but then again if this is not something you can do and be
sure the extent isn't in the middle of being converted to a real extent then it
is also perfectly legitimate to skip preallocated extents and only park f_pos at
a truly unallocated section.

-SEEK_DATA: this is obviously a little more self-explanatory.  Again the only
ambiguity comes in with preallocated extents.  If you have an fs that can't
reliably tell that the preallocated extent is in the process of turning into a
real extent, it is correct for SEEK_DATA to park you at a preallocated extent.

In the generic case we will just assume the entire file is data and there is a
virtual hole at i_size.  So SEEK_DATA will return -ENXIO unless you provide an
offset of 0 and the file size is larger than 0, and SEEK_HOLE will put you at
i_size unless pos is i_size or larger, in which case it returns -ENXIO.
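
The generic fallback described above can be modelled in a few lines of
userspace C.  This is only a sketch of the behaviour as the changelog words it,
not the kernel code: generic_seek_model() and the MODEL_SEEK_* constants are
made-up names for illustration, and i_size is passed in as a plain argument
instead of being read from an inode.  Note that the SEEK_DATA test in the posted
hunk checks i_size_read(inode) without negation, which looks like it would
return -ENXIO for non-empty files; the sketch follows the changelog wording
instead.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical constants, used only by this model. */
#define MODEL_SEEK_DATA	0
#define MODEL_SEEK_HOLE	1

/*
 * Userspace model of the generic case: the whole file is data and there
 * is a virtual hole at i_size.  Returns the new offset or -ENXIO.
 */
static long long generic_seek_model(long long offset, long long i_size,
				    int whence)
{
	assert(whence == MODEL_SEEK_DATA || whence == MODEL_SEEK_HOLE);

	if (whence == MODEL_SEEK_DATA) {
		/* Data only "starts" at 0, and only if the file is non-empty. */
		if (offset != 0 || i_size == 0)
			return -ENXIO;
		return 0;
	}

	/* MODEL_SEEK_HOLE: the only hole is the virtual one at i_size. */
	if (offset >= i_size)
		return -ENXIO;
	return i_size;
}
```

For a 100 byte file this gives 0 for SEEK_DATA at offset 0, -ENXIO for
SEEK_DATA at any other offset, and 100 for SEEK_HOLE at any offset below 100.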

Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
---
v1->v2: Make the generic case assume that the entire file is data and there is a
virtual hole at the end of the file.
 fs/read_write.c    |   22 ++++++++++++++++++++++
 include/linux/fs.h |    4 +++-
 2 files changed, 25 insertions(+), 1 deletions(-)

Comments

Valdis Klētnieks May 4, 2011, 7:04 p.m. UTC | #1
On Wed, 04 May 2011 13:58:39 EDT, Josef Bacik said:

> -SEEK_HOLE: this moves the file pos to the nearest hole in the file from the
> given position. 

Nearest, or next? Solaris defines it as "next", for a good reason - otherwise
you can get stuck in a case where the "nearest" hole is back towards the
start of the file - and "seek data" will bounce back to the next byte at
the other end of the hole.

Consider a file with this layout:

< 40K of data>  A  < 32K hole> B < 32K data> C < 8K hole> D <32K data> E ....

If you're in the range between "8K-1 before C" and "8K-1 after D", there's no
application of seeks to "nearest" data/hole that doesn't leave you oscillating
between C and D, and unable to reach B or E.  If you're at C, "nearest hole" is
where you are, and "nearest data" is at D, not B. Similarly for D - nearest
data is C, not E.

However, this is easily dealt with if you define it as "next", as then it is
simple to discover exactly where A/B/C/D/E are.
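
To make the difference concrete, here is a small userspace sketch that models
the example layout above as an array of data extents and walks it with "next"
semantics.  Nothing in it comes from the patch: struct extent, next_data(),
next_hole() and walk_boundaries() are made-up names, the walk starts at offset
0, and the end of the file is treated as a virtual hole.

```c
#include <assert.h>
#include <stddef.h>

#define K 1024LL

struct extent { long long start, end; };	/* data covers [start, end) */

/* <40K data> A <32K hole> B <32K data> C <8K hole> D <32K data> E */
static const struct extent layout[] = {
	{ 0,       40 * K  },
	{ 72 * K,  104 * K },
	{ 112 * K, 144 * K },
};
static const size_t nr_extents = sizeof(layout) / sizeof(layout[0]);
static const long long file_size = 144 * K;

/* "Next" data at or after pos, or -1 if there is none. */
static long long next_data(long long pos)
{
	for (size_t i = 0; i < nr_extents; i++)
		if (pos < layout[i].end)
			return pos > layout[i].start ? pos : layout[i].start;
	return -1;
}

/* "Next" hole at or after pos, or -1 if pos is at or past the end. */
static long long next_hole(long long pos)
{
	if (pos >= file_size)
		return -1;
	for (size_t i = 0; i < nr_extents; i++)
		if (pos >= layout[i].start && pos < layout[i].end)
			return layout[i].end;
	return pos;	/* already inside a hole */
}

/* Alternate data/hole seeks from offset 0, recording each boundary. */
static size_t walk_boundaries(long long *out, size_t max)
{
	size_t n = 0;
	long long pos = 0;

	assert(max >= 2);
	while (n + 2 <= max) {
		long long data = next_data(pos);
		if (data < 0)
			break;
		long long hole = next_hole(data);
		out[n++] = data;
		out[n++] = hole;
		pos = hole;
	}
	return n;
}
```

With this layout the walk reports 0, A (40K), B (72K), C (104K), D (112K) and
E (144K) in order, and never gets stuck bouncing between C and D.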
Josef Bacik May 4, 2011, 7:10 p.m. UTC | #2
On 05/04/2011 03:04 PM, Valdis.Kletnieks@vt.edu wrote:
> On Wed, 04 May 2011 13:58:39 EDT, Josef Bacik said:
>
>> -SEEK_HOLE: this moves the file pos to the nearest hole in the file from the
>> given position.
>
> Nearest, or next? Solaris defines it as "next", for a good reason - otherwise
> you can get stuck in a case where the "nearest" hole is back towards the
> start of the file - and "seek data" will bounce back to the next byte at
> the other end of the hole.
>

Yeah sorry the log says "nearest" but the code says "next", if you look 
at it that's how it works.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Valdis Klētnieks May 4, 2011, 7:20 p.m. UTC | #3
On Wed, 04 May 2011 15:10:20 EDT, Josef Bacik said:

> Yeah sorry the log says "nearest" but the code says "next", if you look 
> at it that's how it works.  Thanks,

Oh good - the changelog is usually easier to fix than the code is. :)

Probably want to fix the changelog before it gets committed, as there's a fair
chance that text will end up being used as the basis for a manpage or other
documentation.
Josef Bacik May 4, 2011, 7:22 p.m. UTC | #4
On 05/04/2011 03:20 PM, Valdis.Kletnieks@vt.edu wrote:
> On Wed, 04 May 2011 15:10:20 EDT, Josef Bacik said:
>
>> Yeah sorry the log says "nearest" but the code says "next", if you look
>> at it that's how it works.  Thanks,
>
> Oh good - the changelog is usually easier to fix than the code is. :)
>
> Probably want to fix the changelog before it gets committed, as there's a fair
> chance that text will end up being used as the basis for a manpage or other
> documentation.
>

Yeah agreed I meant to change it this time around but forgot, I will 
make the log all nice and pretty next time around, as I doubt this will 
be the last iteration of these patches ;).  Thanks,

Josef
Valdis Klētnieks May 4, 2011, 7:31 p.m. UTC | #5
On Wed, 04 May 2011 13:58:39 EDT, Josef Bacik said:

> +#define SEEK_HOLE	3	/* seek to the closest hole */
> +#define SEEK_DATA	4	/* seek to the closest data */

Comments here need nearest/next fixing as well - otherwise the ext[34] crew may
actually implement the commented semantics. ;)

Other than that, patch 1/2 looks OK to me (not that there's much code to
review), and 2/2 *seems* sane and implements the "next" semantics, though I only
examined the while/if structure and am assuming the btrfs innards are done
correctly.  In particular, that 'while (1)' looks like it can be painful for a
sufficiently large and fragmented file (think a gigabyte file in 4K chunks,
producing a million extents), but I'll let a btrfs expert analyse that
performance issue ;)
Josef Bacik May 4, 2011, 7:33 p.m. UTC | #6
On 05/04/2011 03:31 PM, Valdis.Kletnieks@vt.edu wrote:
> On Wed, 04 May 2011 13:58:39 EDT, Josef Bacik said:
>
>> +#define SEEK_HOLE	3	/* seek to the closest hole */
>> +#define SEEK_DATA	4	/* seek to the closest data */
>
> Comments here need nearest/next fixing as well - otherwise the ext[34] crew may
> actually implement the commented semantics. ;)
>

Balls, thanks I'll fix that.

> Other than that, patch 1/2 looks OK to me (not that there's much code to
> review), and 2/2 *seems* sane and implements the "next" semantics, though I only
> examined the while/if structure and am assuming the btrfs innards are done
> correctly.  In particular, that 'while (1)' looks like it can be painful for a
> sufficiently large and fragmented file (think a gigabyte file in 4K chunks,
> producing a million extents), but I'll let a btrfs expert analyse that
> performance issue ;)
>

Heh well we do while (1) in btrfs _everywhere_, so this isn't anything 
new, tho I should probably throw a cond_resched() in there.  Thanks,

Josef
Dave Kleikamp May 4, 2011, 9:54 p.m. UTC | #7
On 05/04/2011 02:10 PM, Josef Bacik wrote:
> On 05/04/2011 03:04 PM, Valdis.Kletnieks@vt.edu wrote:
>> On Wed, 04 May 2011 13:58:39 EDT, Josef Bacik said:
>>
>>> -SEEK_HOLE: this moves the file pos to the nearest hole in the file
>>> from the
>>> given position.
>>
>> Nearest, or next? Solaris defines it as "next", for a good reason -
>> otherwise
>> you can get stuck in a case where the "nearest" hole is back towards the
>> start of the file - and "seek data" will bounce back to the next byte at
>> the other end of the hole.
>>
>
> Yeah sorry the log says "nearest" but the code says "next", if you look
> at it that's how it works. Thanks,

The comments in fs.h say "closest".  You may want to change them to 
"next" as well.

Thanks,
Shaggy
Dave Kleikamp May 4, 2011, 9:55 p.m. UTC | #8
On 05/04/2011 04:54 PM, Dave Kleikamp wrote:

> The comments in fs.h say "closest". You may want to change them to
> "next" as well.

Sorry.  Missed some of the replies before I responded.  Already addressed.

Shaggy
Marco May 5, 2011, 6:54 p.m. UTC | #9
On 04/05/2011 19:58, Josef Bacik wrote:
> +		if (offset>= i_size_read(inode)) {
> +			mutex_unlock(&inode->i_mutex);
> +			return -ENXIO;
> +		}
> +		offset = i_size_read(inode);
> +		break;

Here maybe it's possible to use offset bigger than i_size, because 
i_size_read is "atomic" but something can happen between two calls, 
isn't it?

Marco
Marco May 5, 2011, 6:58 p.m. UTC | #10
On 05/05/2011 21:01, Josef Bacik wrote:
> On 05/05/2011 02:54 PM, Marco Stornelli wrote:
>> On 04/05/2011 19:58, Josef Bacik wrote:
>>> + if (offset>= i_size_read(inode)) {
>>> + mutex_unlock(&inode->i_mutex);
>>> + return -ENXIO;
>>> + }
>>> + offset = i_size_read(inode);
>>> + break;
>>
>> Here maybe it's possible to use offset bigger than i_size, because
>> i_size_read is "atomic" but something can happen between two calls,
>> isn't it?
>>
>
> We're holding the i_mutex so we are safe, i_size_read is used just for
> consistency sake. Thanks,
>
> Josef
>

Oh, I'm sorry, I misread the patch, ok. Maybe we can use i_size at this 
point without i_size_read.

Marco
Josef Bacik May 5, 2011, 7:01 p.m. UTC | #11
On 05/05/2011 02:54 PM, Marco Stornelli wrote:
> On 04/05/2011 19:58, Josef Bacik wrote:
>> + if (offset>= i_size_read(inode)) {
>> + mutex_unlock(&inode->i_mutex);
>> + return -ENXIO;
>> + }
>> + offset = i_size_read(inode);
>> + break;
>
> Here maybe it's possible to use offset bigger than i_size, because
> i_size_read is "atomic" but something can happen between two calls,
> isn't it?
>

We're holding the i_mutex so we are safe, i_size_read is used just for 
consistency sake.  Thanks,

Josef
Marco May 5, 2011, 7:19 p.m. UTC | #12
On 04/05/2011 19:58, Josef Bacik wrote:
> +		if (offset>= i_size_read(inode)) {
> +			mutex_unlock(&inode->i_mutex);
> +			return -ENXIO;
> +		}
> +		offset = i_size_read(inode);
> +		break;

I can add that generic_file_llseek_unlocked means *unlocked* so you 
shouldn't unlock any mutex but only return a value. The current version, 
in case of SEEK_END uses directly i_size indeed, so maybe I'm missing 
something.

Marco
Josef Bacik May 5, 2011, 7:35 p.m. UTC | #13
On 05/05/2011 03:19 PM, Marco Stornelli wrote:
> On 04/05/2011 19:58, Josef Bacik wrote:
>> + if (offset>= i_size_read(inode)) {
>> + mutex_unlock(&inode->i_mutex);
>> + return -ENXIO;
>> + }
>> + offset = i_size_read(inode);
>> + break;
>
> I can add that generic_file_llseek_unlocked means *unlocked* so you
> shouldn't unlock any mutex but only return a value. The current version,
> in case of SEEK_END uses directly i_size indeed, so maybe I'm missing
> something.

Yeah this was a copy+paste mistake, ext4 has its own llseek that I 
modified to run my tests against and then I just copied and pasted it 
over to the generic things.  I've fixed this earlier, I'll be sending a 
refreshed set out soon.  Thanks,

Josef

Patch

diff --git a/fs/read_write.c b/fs/read_write.c
index 5520f8a..6ee63a4 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -64,6 +64,28 @@  generic_file_llseek_unlocked(struct file *file, loff_t offset, int origin)
 			return file->f_pos;
 		offset += file->f_pos;
 		break;
+	case SEEK_DATA:
+		/*
+		 * In the generic case the entire file is data, so data only
+		 * starts at position 0 provided the file has an i_size,
+		 * otherwise it's an empty file and will always be ENXIO.
+		 */
+		if (offset != 0 || i_size_read(inode)) {
+			mutex_unlock(&inode->i_mutex);
+			return -ENXIO;
+		}
+		break;
+	case SEEK_HOLE:
+		/*
+		 * There is a virtual hole at the end of the file, so as long as
+		 * offset isn't i_size or larger, return i_size.
+		 */
+		if (offset >= i_size_read(inode)) {
+			mutex_unlock(&inode->i_mutex);
+			return -ENXIO;
+		}
+		offset = i_size_read(inode);
+		break;
 	}
 
 	if (offset < 0 && !unsigned_offsets(file))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dbd860a..1b72e0c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -31,7 +31,9 @@ 
 #define SEEK_SET	0	/* seek relative to beginning of file */
 #define SEEK_CUR	1	/* seek relative to current file position */
 #define SEEK_END	2	/* seek relative to end of file */
-#define SEEK_MAX	SEEK_END
+#define SEEK_HOLE	3	/* seek to the closest hole */
+#define SEEK_DATA	4	/* seek to the closest data */
+#define SEEK_MAX	SEEK_DATA
 
 struct fstrim_range {
 	__u64 start;