Message ID | 20230418-btrfs-extent-reads-v1-2-47ba9839f0cc@codewreck.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: fix and improve read code | expand |
On 2023/4/18 09:17, Dominique Martinet wrote: > From: Dominique Martinet <dominique.martinet@atmark-techno.com> > > btrfs_file_read main 'aligned read' loop would limit the last read to > the aligned end even if the data is present in the extent: there is no > reason not to read the data that we know will be presented correctly. > > If that somehow fails cur only advances up to the aligned_end anyway and > the file tail is read through the old code unchanged; this could be > required if e.g. the end data is inlined. > > Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com> > --- > fs/btrfs/inode.c | 17 +++++++++-------- > 1 file changed, 9 insertions(+), 8 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 3d6e39e6544d..efffec0f2e68 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -663,7 +663,8 @@ int btrfs_file_read(struct btrfs_root *root, u64 ino, u64 file_offset, u64 len, > struct btrfs_path path; > struct btrfs_key key; > u64 aligned_start = round_down(file_offset, fs_info->sectorsize); > - u64 aligned_end = round_down(file_offset + len, fs_info->sectorsize); > + u64 end = file_offset + len; > + u64 aligned_end = round_down(end, fs_info->sectorsize); > u64 next_offset; > u64 cur = aligned_start; > int ret = 0; > @@ -743,26 +744,26 @@ int btrfs_file_read(struct btrfs_root *root, u64 ino, u64 file_offset, u64 len, > extent_num_bytes = btrfs_file_extent_num_bytes(path.nodes[0], > fi); > ret = btrfs_read_extent_reg(&path, fi, cur, > - min(extent_num_bytes, aligned_end - cur), > + min(extent_num_bytes, end - cur), > dest + cur - file_offset); > if (ret < 0) > goto out; > - cur += min(extent_num_bytes, aligned_end - cur); > + cur += min(extent_num_bytes, end - cur); > } > > /* Read the tailing unaligned part*/ Can we remove this part completely? IIRC if we read until the target end, the unaligned end part can be completely removed then. Thanks, Qu > - if (file_offset + len != aligned_end) { > + if (file_offset + len != cur) { > btrfs_release_path(&path); > - ret = lookup_data_extent(root, &path, ino, aligned_end, > + ret = lookup_data_extent(root, &path, ino, cur, > &next_offset); > /* <0 is error, >0 means no extent */ > if (ret) > goto out; > fi = btrfs_item_ptr(path.nodes[0], path.slots[0], > struct btrfs_file_extent_item); > - ret = read_and_truncate_page(&path, fi, aligned_end, > - file_offset + len - aligned_end, > - dest + aligned_end - file_offset); > + ret = read_and_truncate_page(&path, fi, cur, > + file_offset + len - cur, > + dest + cur - file_offset); > } > out: > btrfs_release_path(&path); >
Qu Wenruo wrote on Tue, Apr 18, 2023 at 10:02:00AM +0800: > > /* Read the tailing unaligned part*/ > > Can we remove this part completely? > > IIRC if we read until the target end, the unaligned end part can be > completely removed then. The "Read the aligned part" loop stops at aligned_end: > while (cur < aligned_end) So it should be possible that the last aligned extent we consider does not contain data until the end, e.g. an offset that ends with the aligned end: 0 4096 4123 [extent1-----|extent2] In this case the main loop will only read extent1 and we need the "trailing unaligned part" if for extent2. I have a feeling the loop could just be updated to go to the end `while (cur < end)` as it doesn't seem to care about the end alignment... Should I update v2 to do that instead? This made me look at the "Read out the leading unaligned part" initial part, and its check only looks at the sector size but it probably has the same problem -- we want to make sure we read any leftover from a previous extent e.g. this file: 0 4096 8192 [extent1------------------|extent2] and reading from 4k with a sectorsize of 4k is probably bad will enter the aligned main loop right away... And I think that'll fail?... Actually not quite sure what's expecting what to be aligned in there, but I just tried some partial reads from non-zero offsets and my board resets almost all the time so I guess I've found something else to dig into. This isn't a priority for me right now but I'll look a bit more when I have more time if you haven't first.
On 2023/4/18 10:41, Dominique Martinet wrote: > Qu Wenruo wrote on Tue, Apr 18, 2023 at 10:02:00AM +0800: >>> /* Read the tailing unaligned part*/ >> >> Can we remove this part completely? >> >> IIRC if we read until the target end, the unaligned end part can be >> completely removed then. > > The "Read the aligned part" loop stops at aligned_end: >> while (cur < aligned_end) > > So it should be possible that the last aligned extent we consider does > not contain data until the end, e.g. an offset that ends with the > aligned end: > > 0 4096 4123 > [extent1-----|extent2] > > In this case the main loop will only read extent1 and we need the > "trailing unaligned part" if for extent2. > > I have a feeling the loop could just be updated to go to the end > `while (cur < end)` as it doesn't seem to care about the end > alignment... Should I update v2 to do that instead? Yeah, it would be very awesome if you can remove the tailing part completely. > > > > This made me look at the "Read out the leading unaligned part" initial > part, and its check only looks at the sector size but it probably has > the same problem -- we want to make sure we read any leftover from a > previous extent e.g. this file: If you're talking about the same problem mentioned in patch 1, then yes, any read in the middle of an extent would cause problems. No matter if it's aligned or not, as btrfs would convert any unaligned part into aligned read. But if we fix the bug you mentioned, I see nothing can going wrong. Unaligned read is converted into aligned one (then copy the target range into the dest buf), and aligned part (including the tailing unaligned part) would be properly handled then. > 0 4096 8192 > [extent1------------------|extent2] > and reading from 4k with a sectorsize of 4k is probably bad will enter > the aligned main loop right away... > And I think that'll fail?... > Actually not quite sure what's expecting what to be aligned in there, > but I just tried some partial reads from non-zero offsets and my board > resets almost all the time so I guess I've found something else to dig > into. > This isn't a priority for me right now but I'll look a bit more when I > have more time if you haven't first. > My initial plan is to merge the tailing part into the aligned loop, but since you're already working in this part, feel free to do it. And I'm very happy to provide any help on this. Thanks, Qu
Qu Wenruo wrote on Tue, Apr 18, 2023 at 10:53:41AM +0800: > > I have a feeling the loop could just be updated to go to the end > > `while (cur < end)` as it doesn't seem to care about the end > > alignment... Should I update v2 to do that instead? > > Yeah, it would be very awesome if you can remove the tailing part > completely. Ok, will give it a try. I'll want to test this a bit so might take a day or two as I have other work to finish first. > > This made me look at the "Read out the leading unaligned part" initial > > part, and its check only looks at the sector size but it probably has > > the same problem -- we want to make sure we read any leftover from a > > previous extent e.g. this file: > > If you're talking about the same problem mentioned in patch 1, then yes, any > read in the middle of an extent would cause problems. No, was just thinking the leading part being a separate loop doesn't seem to make sense either as the code shouldn't care about sector size alignemnt but about full extents. If the main loop handles everything correctly then the leading if can also be removed along with "read_and_truncate_page" that would no longer be used. I'll give this a try as well and report back. > No matter if it's aligned or not, as btrfs would convert any unaligned part > into aligned read. Yes I don't see where it would fail (except that my board does crash), I guess that at this point I should spend some time building a qemu u-boot and hooking up gdb will be faster.. > My initial plan is to merge the tailing part into the aligned loop, but > since you're already working in this part, feel free to do it. Yes, sure -- removing the if is easy, I'd just rather not make it fail for someone else :)
On 2023/4/18 11:07, Dominique Martinet wrote: > Qu Wenruo wrote on Tue, Apr 18, 2023 at 10:53:41AM +0800: >>> I have a feeling the loop could just be updated to go to the end >>> `while (cur < end)` as it doesn't seem to care about the end >>> alignment... Should I update v2 to do that instead? >> >> Yeah, it would be very awesome if you can remove the tailing part >> completely. > > Ok, will give it a try. > I'll want to test this a bit so might take a day or two as I have other > work to finish first. > >>> This made me look at the "Read out the leading unaligned part" initial >>> part, and its check only looks at the sector size but it probably has >>> the same problem -- we want to make sure we read any leftover from a >>> previous extent e.g. this file: >> >> If you're talking about the same problem mentioned in patch 1, then yes, any >> read in the middle of an extent would cause problems. > > No, was just thinking the leading part being a separate loop doesn't > seem to make sense either as the code shouldn't care about sector size > alignemnt but about full extents. The main concern related to the leading unaligned part is, we need to skip something unaligned from the beginning , while all other situations never need to skip such case (they at most skip the tailing part). Maybe we can do some extra calculation to handle it, but I'm not in a hurry to address the leading part. Another thing is, for all other fs-ish interface (kernel/fuse etc), they all read/write in certain block size, then handle unaligned part from a higher layer. Thus I prefer to have most things handled in an aligned fashion, and only handle the unaligned part specially. But if you can find an elegant way to handle all cases, that would be really awesome! Thanks, Qu > If the main loop handles everything correctly then the leading if can > also be removed along with "read_and_truncate_page" that would no longer > be used. > > I'll give this a try as well and report back. > >> No matter if it's aligned or not, as btrfs would convert any unaligned part >> into aligned read. > > Yes I don't see where it would fail (except that my board does crash), > I guess that at this point I should spend some time building a qemu > u-boot and hooking up gdb will be faster.. > >> My initial plan is to merge the tailing part into the aligned loop, but >> since you're already working in this part, feel free to do it. > > Yes, sure -- removing the if is easy, I'd just rather not make it fail > for someone else :) >
Qu Wenruo wrote on Tue, Apr 18, 2023 at 11:21:00AM +0800: > > No, was just thinking the leading part being a separate loop doesn't > > seem to make sense either as the code shouldn't care about sector size > > alignemnt but about full extents. > > The main concern related to the leading unaligned part is, we need to skip > something unaligned from the beginning , while all other situations never > need to skip such case (they at most skip the tailing part). Ok, there is one exception for inline extents apparently.. But I'm not still not convinced the `aligned_start != file_offset` check is enough for that either; I'd say it's unlikely but the inline part can be compressed, so we could have a file which has > 4k (sectorsize) of expanded data, so a read from the 4k offset would skip the special handling and fail (reading the whole extent in dest) Even if that's not possible, reading just the first 10 bytes of an inline extent will be aligned and go through the main loop which just reads the whole extent, so it'll need the same handling as the regular btrfs_read_extent_reg handling at which point it might just as well handle start offset too. That aside taking the loop in order: - lookup_data_extent doesn't care (used in heading/tail) - skipping holes don't care as they explicitely put cursor at start of next extent (or bail out if nothing next) - inline needs fixing anyway as said above - PREALLOC or nothing on disk also goes straight to next and is ok - ah, I see what you meant now, we need to substract the current position within the extent to extent_num_bytes... That's also already a problem, though; to take the same example: 0 8k 16k [extent1 | extent2 ... ] reading from 4k onwards will try to read min(extent_num_bytes, end-cur) = min(8k, 12k) = 8k from the 4k offset which goes over the end of the extent. That could actually be my resets from the previous mail. So either the first check should just lookup the extent and check that extent start matches current offset instead of checking for sectorsize alignment, or we can just fix the loop and remove the first if.
Dominique Martinet wrote on Tue, Apr 18, 2023 at 12:53:35PM +0900: > Ok, there is one exception for inline extents apparently.. But I'm not > still not convinced the `aligned_start != file_offset` check is enough > for that either; I'd say it's unlikely but the inline part can be > compressed, so we could have a file which has > 4k (sectorsize) of > expanded data, so a read from the 4k offset would skip the special > handling and fail (reading the whole extent in dest) (Wasn't able to create such a file, I assume that means the uncompressed data must fits in a page -- if we deal with arm machines with 16k or 64k page size that'll probably change things again but I'll just pretend sector size will magically match in this case..) > Even if that's not possible, reading just the first 10 bytes of an > inline extent will be aligned and go through the main loop which just > reads the whole extent, so it'll need the same handling as the regular > btrfs_read_extent_reg handling at which point it might just as well > handle start offset too. Question regarding inline extent: the main loop goes straight out after reading an inline extent, assuming it is always alone (or last) -- is that correct? By playing with `filefrag -v` I could sometimes see files that temporarily have inline + a delalloc extent, but wasn't able to make a file that kept the inline extent after appending more data, so I guess it is also sane enough... Just making it continue and letting the loop end seems just as simple now there is no trailing if, but if inline extents cannot be mixed I'll be happy to keep it going out. back to btrfs_read_extent_reg: merging the tail back into the main loop breaks the first assert (IS_ALIGNED(len, fs_info->sectorsize) in particular). The old loop invocation made sure it was aligned and read_and_truncate_page used to take care of calling it with a bigger buffer when it was not. I was only looking at the compressed path and that does not care about 'len' alignment because it makes an intermediate copy for decompression, but the BTRFS_COMPRESS_NONE's read_extent_data might care? I didn't see anything that actually requires alignment there (len in particular should be ok, but even offset->logical seems to properly be used as "being part of a range" in lookups so alignment doesn't actually matter), but if this isn't tested I can understand wanting to be more careful there. Ok so this is rightfully less obvious than I had first assumed -- sorry for rushing in. Let's do baby steps: - I'll resend just the first patch shortly, it's a real fix and I'll be using it right away. - I'll double check 'len' doesn't need to be aligned in btrfs_read_extent_reg and just rework the main loop to allow removing the tail exception, that'll avoid a double read in many cases and I think it's worth doing. Might take a bit more time as I want to finish some other work, let's say a few days. - That leaves the two issues I brought up with the main loop: * inline case ignoring end; this is minor enough but easy to fix * btrfs_read_extent_reg() assuming start of extent in its length agument; I believe it'll be easier to stop trying to set a upper limit in the main loop and just have btrfs_read_extent_reg() do it itself, we can use its return value to know how much was actually read. (Probably just substract offset - key.offset to extent_num_bytes to have a new cap, but I still don't understand btrfs_file_extent_offset() in all of this as it's always 0 when I looked) Will try to do that over the next few weeks, but if you want to look at it feel free to do this before me. - At this point it might be worth considering removing the initial check as that also makes an extra small read in the unaligned case, but it's not a bug and can wait. What do you think?
On 2023/4/18 11:53, Dominique Martinet wrote: > Qu Wenruo wrote on Tue, Apr 18, 2023 at 11:21:00AM +0800: >>> No, was just thinking the leading part being a separate loop doesn't >>> seem to make sense either as the code shouldn't care about sector size >>> alignemnt but about full extents. >> >> The main concern related to the leading unaligned part is, we need to skip >> something unaligned from the beginning , while all other situations never >> need to skip such case (they at most skip the tailing part). > > Ok, there is one exception for inline extents apparently.. But I'm not > still not convinced the `aligned_start != file_offset` check is enough > for that either; I'd say it's unlikely but the inline part can be > compressed, so we could have a file which has > 4k (sectorsize) of > expanded data, so a read from the 4k offset would skip the special > handling and fail (reading the whole extent in dest) Btrfs inline has a limit to sectorsize. That means, inlined compressed extent can at most be 4K sized (if 4K is our sector size). So that won't be a problem. > > Even if that's not possible, reading just the first 10 bytes of an > inline extent will be aligned and go through the main loop which just > reads the whole extent, so it'll need the same handling as the regular > btrfs_read_extent_reg handling at which point it might just as well > handle start offset too. If we just read 10 bytes, the aligned part should handle it well. My real concern is what if we read 10 bytes at offset 10 bytes. If this can be handled in the same way of aligned read (and still be reasonable readable), then it would be awesome. > > > That aside taking the loop in order: > - lookup_data_extent doesn't care (used in heading/tail) > - skipping holes don't care as they explicitely put cursor at start of > next extent (or bail out if nothing next) > - inline needs fixing anyway as said above > - PREALLOC or nothing on disk also goes straight to next and is ok > - ah, I see what you meant now, we need to substract the current > position within the extent to extent_num_bytes... > That's also already a problem, though; to take the same example: > 0 8k 16k > [extent1 | extent2 ... ] > reading from 4k onwards will try to read > min(extent_num_bytes, end-cur) = min(8k, 12k) = 8k > from the 4k offset which goes over the end of the extent. That's indeed a problem. As most of the Uboot fs drivers only care read the whole file, never really utilize the ability to read part of the file, that path is not properly tested. (Some driver, IIRC ubifs?, doesn't even allow read with non-zero offset) Thanks, Qu > > That could actually be my resets from the previous mail. > > > So either the first check should just lookup the extent and check that > extent start matches current offset instead of checking for sectorsize > alignment, or we can just fix the loop and remove the first if. >
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 3d6e39e6544d..efffec0f2e68 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -663,7 +663,8 @@ int btrfs_file_read(struct btrfs_root *root, u64 ino, u64 file_offset, u64 len, struct btrfs_path path; struct btrfs_key key; u64 aligned_start = round_down(file_offset, fs_info->sectorsize); - u64 aligned_end = round_down(file_offset + len, fs_info->sectorsize); + u64 end = file_offset + len; + u64 aligned_end = round_down(end, fs_info->sectorsize); u64 next_offset; u64 cur = aligned_start; int ret = 0; @@ -743,26 +744,26 @@ int btrfs_file_read(struct btrfs_root *root, u64 ino, u64 file_offset, u64 len, extent_num_bytes = btrfs_file_extent_num_bytes(path.nodes[0], fi); ret = btrfs_read_extent_reg(&path, fi, cur, - min(extent_num_bytes, aligned_end - cur), + min(extent_num_bytes, end - cur), dest + cur - file_offset); if (ret < 0) goto out; - cur += min(extent_num_bytes, aligned_end - cur); + cur += min(extent_num_bytes, end - cur); } /* Read the tailing unaligned part*/ - if (file_offset + len != aligned_end) { + if (file_offset + len != cur) { btrfs_release_path(&path); - ret = lookup_data_extent(root, &path, ino, aligned_end, + ret = lookup_data_extent(root, &path, ino, cur, &next_offset); /* <0 is error, >0 means no extent */ if (ret) goto out; fi = btrfs_item_ptr(path.nodes[0], path.slots[0], struct btrfs_file_extent_item); - ret = read_and_truncate_page(&path, fi, aligned_end, - file_offset + len - aligned_end, - dest + aligned_end - file_offset); + ret = read_and_truncate_page(&path, fi, cur, + file_offset + len - cur, + dest + cur - file_offset); } out: btrfs_release_path(&path);