| Message ID | 158272447616.281342.14858371265376818660.stgit@localhost.localdomain (mailing list archive) |
|---|---|
| State | New, archived |
| Series | fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag. |
On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote: > This adds a support of physical hint for fallocate2() syscall. > In case of @physical argument is set for ext4_fallocate(), > we try to allocate blocks only from [@phisical, @physical + len] > range, while other blocks are not used. Sorry, but this is a complete bullshit interface. Userspace has absolutely no business even thinking of physical placement. If you want to align allocations to physical block granularity boundaries that is the file systems job, not the applications job.
On 26.02.2020 18:55, Christoph Hellwig wrote: > On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote: >> This adds a support of physical hint for fallocate2() syscall. >> In case of @physical argument is set for ext4_fallocate(), >> we try to allocate blocks only from [@phisical, @physical + len] >> range, while other blocks are not used. > > Sorry, but this is a complete bullshit interface. Userspace has > absolutely no business even thinking of physical placement. If you > want to align allocations to physical block granularity boundaries > that is the file systems job, not the applications job. Why? There are two contradictory actions that a filesystem can't do at the same time: 1) place files at a distance from each other to minimize the number of extents on possible future growth; 2) place small files in the same big block of the block device. At initial allocation time you never know which file will stop growing at some point in the future, i.e. which file is suitable for compaction. This knowledge only becomes available some time later. Say, if a file has not been changed for a month, it is suitable for compaction with other files like it. If at allocation time you can determine which files won't grow in the future, don't be afraid, just share your algorithm here. In Virtuozzo we tried to compact ext4 with the existing kernel interface: https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c But it does not work well in many situations, and the main problem is that allocating blocks at a desired place is not possible. The block allocator can't behave excellently for everything. If this interface is bad, can you suggest another interface to let the block allocator know the behavior expected of it in this specific case? Kirill
On Feb 26, 2020, at 1:05 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote: > > On 26.02.2020 18:55, Christoph Hellwig wrote: >> On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote: >>> This adds a support of physical hint for fallocate2() syscall. >>> In case of @physical argument is set for ext4_fallocate(), >>> we try to allocate blocks only from [@phisical, @physical + len] >>> range, while other blocks are not used. >> >> Sorry, but this is a complete bullshit interface. Userspace has >> absolutely no business even thinking of physical placement. If you >> want to align allocations to physical block granularity boundaries >> that is the file systems job, not the applications job. > > Why? There are two contradictory actions that filesystem can't do at the same time: > > 1)place files on a distance from each other to minimize number of extents > on possible future growth; > 2)place small files in the same big block of block device. > > At initial allocation time you never know, which file will stop grow in some > future, i.e. which file is suitable for compaction. This knowledge becomes > available some time later. Say, if a file has not been changed for a month, > it is suitable for compaction with another files like it. > > If at allocation time you can determine a file, which won't grow in the future, > don't be afraid, and just share your algorithm here. Very few files grow after they are initially written/closed. Those that do are almost always opened with O_APPEND (e.g. log files). It would be reasonable to have O_APPEND cause the filesystem to reserve blocks (in memory at least, maybe some small amount on disk like 1/4 of the current file size) for the file to grow after it is closed. We might use the same heuristic for directories that grow long after initial creation. The main exception there is VM images, because they are not really "files" in the normal sense, but containers aggregating a lot of different files, each created with patterns that are not visible to the VM host. In that case, it would be better to have the VM host tell the filesystem that the IO pattern is "random" and not try to optimize until the VM is cold. > In Virtuozzo we tried to compact ext4 with existing kernel interface: > > https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c > > But it does not work well in many situations, and the main problem is blocks allocation in desired place is not possible. Block allocator can't behave > excellent for everything. > > If this interface bad, can you suggest another interface to make block > allocator to know the behavior expected from him in this specific case? In ext4 there is already the "group" allocator, which combines multiple small files together into a single preallocation group, so that the IO to disk is large/contiguous. The theory is that files written at the same time will have similar lifespans, but that isn't always true. If the files are large and still being written, the allocator will reserve additional blocks (default 8MB I think) on the expectation that it will continue to write until it is closed. I think (correct me if I'm wrong) that your issue is with defragmenting small files to free up contiguous space in the filesystem? I think once the free space is freed of small files that defragmenting large files is easily done. Anything with more than 8-16MB extents will max out most storage anyway (seek rate * IO size). 
In that case, an interesting userspace interface would be an array of inode numbers (64-bit please) that should be packed together densely in the order they are provided (maybe a flag for that). That allows the filesystem the freedom to find the physical blocks for the allocation, while userspace can tell which files are related to each other. Tools like "readahead" could also leverage this to "perfectly" allocate the files used during boot into a single stream of reads from the disk. Cheers, Andreas
On 26/02/2020 23.05, Kirill Tkhai wrote: > On 26.02.2020 18:55, Christoph Hellwig wrote: >> On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote: >>> This adds a support of physical hint for fallocate2() syscall. >>> In case of @physical argument is set for ext4_fallocate(), >>> we try to allocate blocks only from [@phisical, @physical + len] >>> range, while other blocks are not used. >> >> Sorry, but this is a complete bullshit interface. Userspace has >> absolutely no business even thinking of physical placement. If you >> want to align allocations to physical block granularity boundaries >> that is the file systems job, not the applications job. > > Why? There are two contradictory actions that filesystem can't do at the same time: > > 1)place files on a distance from each other to minimize number of extents > on possible future growth; > 2)place small files in the same big block of block device. > > At initial allocation time you never know, which file will stop grow in some future, > i.e. which file is suitable for compaction. This knowledge becomes available some time later. > Say, if a file has not been changed for a month, it is suitable for compaction with > another files like it. > > If at allocation time you can determine a file, which won't grow in the future, don't be afraid, > and just share your algorithm here. > > In Virtuozzo we tried to compact ext4 with existing kernel interface: > > https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c > > But it does not work well in many situations, and the main problem is blocks allocation > in desired place is not possible. Block allocator can't behave excellent for everything. > > If this interface bad, can you suggest another interface to make block allocator to know > the behavior expected from him in this specific case? Controlling the exact place is odd. I suppose the main reason for this is that the defragmentation process wants to control fragmentation while allocating new space. Maybe a flag FALLOC_FL_DONT_FRAGMENT (allocate exactly one extent or fail) could solve that problem? A defragmenter could try allocating different sizes and automatically balance the fragmentation factor without controlling exact disk offsets. It could also reserve space for expected file growth.
On Wed, Feb 26, 2020 at 11:05:23PM +0300, Kirill Tkhai wrote: > On 26.02.2020 18:55, Christoph Hellwig wrote: > > On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote: > >> This adds a support of physical hint for fallocate2() syscall. > >> In case of @physical argument is set for ext4_fallocate(), > >> we try to allocate blocks only from [@phisical, @physical + len] > >> range, while other blocks are not used. > > > > Sorry, but this is a complete bullshit interface. Userspace has > > absolutely no business even thinking of physical placement. If you > > want to align allocations to physical block granularity boundaries > > that is the file systems job, not the applications job. > > Why? There are two contradictory actions that filesystem can't do at the same time: > > 1)place files on a distance from each other to minimize number of extents > on possible future growth; Speculative EOF preallocation at delayed allocation reservation time provides this. > 2)place small files in the same big block of block device. Delayed allocation during writeback packs files smaller than the stripe unit of the filesystem tightly. So, yes, we do both of these things at the same time in XFS, and have for the past 10 years. > At initial allocation time you never know, which file will stop grow in some future, > i.e. which file is suitable for compaction. This knowledge becomes available some time later. > Say, if a file has not been changed for a month, it is suitable for compaction with > another files like it. > > If at allocation time you can determine a file, which won't grow in the future, don't be afraid, > and just share your algorithm here. > > In Virtuozzo we tried to compact ext4 with existing kernel interface: > > https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c > > But it does not work well in many situations, and the main problem is blocks allocation > in desired place is not possible. Block allocator can't behave excellent for everything. > > If this interface bad, can you suggest another interface to make block allocator to know > the behavior expected from him in this specific case? Write once, long term data: fcntl(fd, F_SET_RW_HINT, RWH_WRITE_LIFE_EXTREME); That will allow the the storage stack to group all data with the same hint together, both in software and in hardware. Cheers, Dave.
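For readers following along, a minimal sketch of the hint API Dave refers to: F_SET_RW_HINT is an fcntl() command whose argument is a pointer to a uint64_t lifetime value. This assumes a repacker that rewrites cold data itself; the fallback defines mirror include/uapi/linux/fcntl.h and the file name is only an example, not something from this patch series.

```c
/*
 * Sketch: tag a file's data as write-once/long-lived before writing
 * it, so a hint-aware allocator (and hardware) can group it with
 * similar data. Path and constants-fallbacks are illustrative.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT (1024 + 12)   /* F_LINUX_SPECIFIC_BASE + 12 */
#endif
#ifndef RWH_WRITE_LIFE_EXTREME
#define RWH_WRITE_LIFE_EXTREME 5
#endif

int main(void)
{
	uint64_t hint = RWH_WRITE_LIFE_EXTREME;
	int fd = open("cold-data.bin", O_WRONLY | O_CREAT, 0644);

	if (fd < 0 || fcntl(fd, F_SET_RW_HINT, &hint) < 0) {
		perror("set write hint");
		return 1;
	}
	/* ... write the long-lived data here ... */
	close(fd);
	return 0;
}
```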
On 27.02.2020 09:59, Konstantin Khlebnikov wrote: > On 26/02/2020 23.05, Kirill Tkhai wrote: >> On 26.02.2020 18:55, Christoph Hellwig wrote: >>> On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote: >>>> This adds a support of physical hint for fallocate2() syscall. >>>> In case of @physical argument is set for ext4_fallocate(), >>>> we try to allocate blocks only from [@phisical, @physical + len] >>>> range, while other blocks are not used. >>> >>> Sorry, but this is a complete bullshit interface. Userspace has >>> absolutely no business even thinking of physical placement. If you >>> want to align allocations to physical block granularity boundaries >>> that is the file systems job, not the applications job. >> >> Why? There are two contradictory actions that filesystem can't do at the same time: >> >> 1)place files on a distance from each other to minimize number of extents >> on possible future growth; >> 2)place small files in the same big block of block device. >> >> At initial allocation time you never know, which file will stop grow in some future, >> i.e. which file is suitable for compaction. This knowledge becomes available some time later. >> Say, if a file has not been changed for a month, it is suitable for compaction with >> another files like it. >> >> If at allocation time you can determine a file, which won't grow in the future, don't be afraid, >> and just share your algorithm here. >> >> In Virtuozzo we tried to compact ext4 with existing kernel interface: >> >> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c >> >> But it does not work well in many situations, and the main problem is blocks allocation >> in desired place is not possible. Block allocator can't behave excellent for everything. >> >> If this interface bad, can you suggest another interface to make block allocator to know >> the behavior expected from him in this specific case? > > Controlling exact place is odd. I suppose main reason for this that defragmentation > process wants to control fragmentation during allocating new space. > > Maybe flag FALLOC_FL_DONT_FRAGMENT (allocate exactly one extent or fail) could solve that problem? > > Defragmentator could try allocate different sizes and automatically balance fragmentation factor > without controlling exact disk offsets. Also it could reserve space for expected file growth. I don't think this will help. The problem is not in allocating a single extent (fallocate() already tries to allocate as few extents as possible), but in that it's impossible to allocate it within the desired bounds. Say, you have 1Mb discard granularity on the block device and two files in different block device clusters: one is 4Kb long, the other is 1Mb-4Kb. The bigger file sits at the start of its block device cluster: [ 1Mb cluster0 ][ 1Mb cluster1 ] [****BIG_FILE****|free 4Kb][small_file|free 1Mb-4Kb] The best defragmentation would be to move small_file into the free 4Kb of cluster0. Allocating a single extent does not help here: you would have to allocate a very large number of such extents in a loop before the allocator returns the desired block, and then hand all the others back. This has very bad performance.
On 27.02.2020 10:33, Dave Chinner wrote: > On Wed, Feb 26, 2020 at 11:05:23PM +0300, Kirill Tkhai wrote: >> On 26.02.2020 18:55, Christoph Hellwig wrote: >>> On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote: >>>> This adds a support of physical hint for fallocate2() syscall. >>>> In case of @physical argument is set for ext4_fallocate(), >>>> we try to allocate blocks only from [@phisical, @physical + len] >>>> range, while other blocks are not used. >>> >>> Sorry, but this is a complete bullshit interface. Userspace has >>> absolutely no business even thinking of physical placement. If you >>> want to align allocations to physical block granularity boundaries >>> that is the file systems job, not the applications job. >> >> Why? There are two contradictory actions that filesystem can't do at the same time: >> >> 1)place files on a distance from each other to minimize number of extents >> on possible future growth; > > Speculative EOF preallocation at delayed allocation reservation time > provides this. > >> 2)place small files in the same big block of block device. > > Delayed allocation during writeback packs files smaller than the > stripe unit of the filesystem tightly. > > So, yes, we do both of these things at the same time in XFS, and > have for the past 10 years. > >> At initial allocation time you never know, which file will stop grow in some future, >> i.e. which file is suitable for compaction. This knowledge becomes available some time later. >> Say, if a file has not been changed for a month, it is suitable for compaction with >> another files like it. >> >> If at allocation time you can determine a file, which won't grow in the future, don't be afraid, >> and just share your algorithm here. >> >> In Virtuozzo we tried to compact ext4 with existing kernel interface: >> >> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c >> >> But it does not work well in many situations, and the main problem is blocks allocation >> in desired place is not possible. Block allocator can't behave excellent for everything. >> >> If this interface bad, can you suggest another interface to make block allocator to know >> the behavior expected from him in this specific case? > > Write once, long term data: > > fcntl(fd, F_SET_RW_HINT, RWH_WRITE_LIFE_EXTREME); > > That will allow the the storage stack to group all data with the > same hint together, both in software and in hardware. This is interesting option, but it only applicable before write is made. And it's only applicable on your own applications. My usecase is defragmentation of containers, where any applications may run. Most of applications never care whether long or short-term data they write. Maybe, we can make fallocate() care about F_SET_RW_HINT? Say, if RWH_WRITE_LIFE_EXTREME is set, fallocate() tries to allocate space around another inodes with the same hint. E.g., we have 1Mb discard granuality on block device and two files in different block device clusters: one is 4Kb of length, another's size is 1Mb-4Kb. The biggest file is situated on the start of block device cluster: [ 1Mb cluster0 ][ 1Mb cluster1 ] [****BIG_FILE****|free 4Kb][small_file|free 1Mb-4Kb] defrag util wants to move small file into free space in cluster0. To do that it opens BIG_FILE and sets F_SET_RW_HINT for its inode. 
Then it opens a tmp file, sets the hint and calls fallocate(): fd1 = open("BIG_FILE", O_RDONLY); fcntl(fd1, F_SET_RW_HINT, RWH_WRITE_LIFE_EXTREME); fd2 = open("donor", O_WRONLY|O_TMPFILE|O_CREAT); fcntl(fd2, F_SET_RW_HINT, RWH_WRITE_LIFE_EXTREME); fallocate(fd2, 0, 0, 4Kb); // first seeks space around files with the RWH_WRITE_LIFE_EXTREME hint How about this? Kirill
On 27.02.2020 00:51, Andreas Dilger wrote: > On Feb 26, 2020, at 1:05 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote: >> >> On 26.02.2020 18:55, Christoph Hellwig wrote: >>> On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote: >>>> This adds a support of physical hint for fallocate2() syscall. >>>> In case of @physical argument is set for ext4_fallocate(), >>>> we try to allocate blocks only from [@phisical, @physical + len] >>>> range, while other blocks are not used. >>> >>> Sorry, but this is a complete bullshit interface. Userspace has >>> absolutely no business even thinking of physical placement. If you >>> want to align allocations to physical block granularity boundaries >>> that is the file systems job, not the applications job. >> >> Why? There are two contradictory actions that filesystem can't do at the same time: >> >> 1)place files on a distance from each other to minimize number of extents >> on possible future growth; >> 2)place small files in the same big block of block device. >> >> At initial allocation time you never know, which file will stop grow in some >> future, i.e. which file is suitable for compaction. This knowledge becomes >> available some time later. Say, if a file has not been changed for a month, >> it is suitable for compaction with another files like it. >> >> If at allocation time you can determine a file, which won't grow in the future, >> don't be afraid, and just share your algorithm here. > > Very few files grow after they are initially written/closed. Those that > do are almost always opened with O_APPEND (e.g. log files). It would be > reasonable to have O_APPEND cause the filesystem to reserve blocks (in > memory at least, maybe some small amount on disk like 1/4 of the current > file size) for the file to grow after it is closed. We might use the > same heuristic for directories that grow long after initial creation. 1)Lets see on a real example. I created a new ext4 and started the test below: https://gist.github.com/tkhai/afd8458c0a3cc082a1230370c7d89c99 Here are two files written. One file is 4Kb. One file is 1Mb-4Kb. $filefrag -e test1.tmp test2.tmp Filesystem type is: ef53 File size of test1.tmp is 4096 (1 block of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 0: 33793.. 33793: 1: last,eof test1.tmp: 1 extent found File size of test2.tmp is 1044480 (255 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 254: 33536.. 33790: 255: last,eof test2.tmp: 1 extent found $debugfs: testb 33791 Block 33791 not in use test2.tmp started from 131Mb. In case of discard granuality is 1Mb, test1.tmp placement prevents us from discarding next 1Mb block. 2)Another example. Let write two files: 1Mb-4Kb and 1Mb+4Kb: # filefrag -e test3.tmp test4.tmp Filesystem type is: ef53 File size of test3.tmp is 1052672 (257 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 256: 35840.. 36096: 257: last,eof test3.tmp: 1 extent found File size of test4.tmp is 1044480 (255 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 254: 35072.. 35326: 255: last,eof test4.tmp: 1 extent found They don't go sequentially, and here is fragmentation starts. After both the tests: $df -h /dev/loop0 2.0G 11M 1.8G 1% /root/mnt Filesystem is free, all last block groups are free. 
E.g., Group 15: (Blocks 491520-524287) csum 0x3ef5 [INODE_UNINIT, ITABLE_ZEROED] Block bitmap at 272 (bg #0 + 272), csum 0xd52c1f66 Inode bitmap at 288 (bg #0 + 288), csum 0x00000000 Inode table at 7969-8480 (bg #0 + 7969) 32768 free blocks, 8192 free inodes, 0 directories, 8192 unused inodes Free blocks: 491520-524287 Free inodes: 122881-131072 but the two files are not packed together. So, the ext4 block allocator does not work well for my workload. It doesn't even know anything about the discard granularity of the underlying block device. Does it? I assume no fs knows. Should I tell it? > The main exception there is VM images, because they are not really "files" > in the normal sense, but containers aggregating a lot of different files, > each created with patterns that are not visible to the VM host. In that > case, it would be better to have the VM host tell the filesystem that the > IO pattern is "random" and not try to optimize until the VM is cold. > >> In Virtuozzo we tried to compact ext4 with existing kernel interface: >> >> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c >> >> But it does not work well in many situations, and the main problem is blocks allocation in desired place is not possible. Block allocator can't behave >> excellent for everything. >> >> If this interface bad, can you suggest another interface to make block >> allocator to know the behavior expected from him in this specific case? > > In ext4 there is already the "group" allocator, which combines multiple > small files together into a single preallocation group, so that the IO > to disk is large/contiguous. The theory is that files written at the > same time will have similar lifespans, but that isn't always true. > > If the files are large and still being written, the allocator will reserve > additional blocks (default 8MB I think) on the expectation that it will > continue to write until it is closed. > > I think (correct me if I'm wrong) that your issue is with defragmenting > small files to free up contiguous space in the filesystem? I think once > the free space is freed of small files that defragmenting large files is > easily done. Anything with more than 8-16MB extents will max out most > storage anyway (seek rate * IO size). My issue is mostly with files < 1Mb, because the underlying device's discard granularity is 1Mb. The result of fragmentation is that the amount of occupied 1Mb device blocks is 1.5 times bigger than the amount of data actually written (say, df -h). And this is the problem. > In that case, an interesting userspace interface would be an array of > inode numbers (64-bit please) that should be packed together densely in > the order they are provided (maybe a flag for that). That allows the > filesystem the freedom to find the physical blocks for the allocation, > while userspace can tell which files are related to each other. So, this interface is 3-in-1: 1) finds a placement for the inodes' extents; 2) assigns this space to some temporary donor inode; 3) calls ext4_move_extents() for each of them. Do I understand you right? If so, then IMO it's better to start with two inodes, because the placement of many inodes may hide a very complex algorithm, which may require much memory. Is this OK? Can we introduce a flag marking an inode as unmovable? Can this interface use knowledge about the underlying device's discard granularity? In my answer to Dave, I proposed making fallocate() care about i_write_hint. Could you please comment on what you think about that too? Thanks, Kirill
On Thu, Feb 27, 2020 at 02:12:53PM +0300, Kirill Tkhai wrote: > On 27.02.2020 10:33, Dave Chinner wrote: > > On Wed, Feb 26, 2020 at 11:05:23PM +0300, Kirill Tkhai wrote: > >> On 26.02.2020 18:55, Christoph Hellwig wrote: > >>> On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote: > >>>> This adds a support of physical hint for fallocate2() syscall. > >>>> In case of @physical argument is set for ext4_fallocate(), > >>>> we try to allocate blocks only from [@phisical, @physical + len] > >>>> range, while other blocks are not used. > >>> > >>> Sorry, but this is a complete bullshit interface. Userspace has > >>> absolutely no business even thinking of physical placement. If you > >>> want to align allocations to physical block granularity boundaries > >>> that is the file systems job, not the applications job. > >> > >> Why? There are two contradictory actions that filesystem can't do at the same time: > >> > >> 1)place files on a distance from each other to minimize number of extents > >> on possible future growth; > > > > Speculative EOF preallocation at delayed allocation reservation time > > provides this. > > > >> 2)place small files in the same big block of block device. > > > > Delayed allocation during writeback packs files smaller than the > > stripe unit of the filesystem tightly. > > > > So, yes, we do both of these things at the same time in XFS, and > > have for the past 10 years. > > > >> At initial allocation time you never know, which file will stop grow in some future, > >> i.e. which file is suitable for compaction. This knowledge becomes available some time later. > >> Say, if a file has not been changed for a month, it is suitable for compaction with > >> another files like it. > >> > >> If at allocation time you can determine a file, which won't grow in the future, don't be afraid, > >> and just share your algorithm here. > >> > >> In Virtuozzo we tried to compact ext4 with existing kernel interface: > >> > >> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c > >> > >> But it does not work well in many situations, and the main problem is blocks allocation > >> in desired place is not possible. Block allocator can't behave excellent for everything. > >> > >> If this interface bad, can you suggest another interface to make block allocator to know > >> the behavior expected from him in this specific case? > > > > Write once, long term data: > > > > fcntl(fd, F_SET_RW_HINT, RWH_WRITE_LIFE_EXTREME); > > > > That will allow the the storage stack to group all data with the > > same hint together, both in software and in hardware. > > This is interesting option, but it only applicable before write is made. And it's only > applicable on your own applications. My usecase is defragmentation of containers, where > any applications may run. Most of applications never care whether long or short-term > data they write. Why is that a problem? They'll be using the default write hint (i.e. NONE) and so a hint aware allocation policy will be separating that data from all the other data written with specific hints... And you've mentioned that your application has specific *never write again* selection criteria for data it is repacking. And that involves rewriting that data. IOWs, you know exactly what policy you want to apply before you rewrite the data, and so what other applications do is completely irrelevant for your repacker... > Maybe, we can make fallocate() care about F_SET_RW_HINT? 
Say, if RWH_WRITE_LIFE_EXTREME > is set, fallocate() tries to allocate space around another inodes with the same hint. That's exactly what I said: > > That will allow the the storage stack to group all data with the > > same hint together, both in software and in hardware. What the filesystem does with the hint is up to the filesystem and the policies that it's developers decide are appropriate. If your filesystem doesn't do what you need, talk to the filesystem developers about implementing the policy you require. -Dave.
On 28.02.2020 00:56, Dave Chinner wrote: > On Thu, Feb 27, 2020 at 02:12:53PM +0300, Kirill Tkhai wrote: >> On 27.02.2020 10:33, Dave Chinner wrote: >>> On Wed, Feb 26, 2020 at 11:05:23PM +0300, Kirill Tkhai wrote: >>>> On 26.02.2020 18:55, Christoph Hellwig wrote: >>>>> On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote: >>>>>> This adds a support of physical hint for fallocate2() syscall. >>>>>> In case of @physical argument is set for ext4_fallocate(), >>>>>> we try to allocate blocks only from [@phisical, @physical + len] >>>>>> range, while other blocks are not used. >>>>> >>>>> Sorry, but this is a complete bullshit interface. Userspace has >>>>> absolutely no business even thinking of physical placement. If you >>>>> want to align allocations to physical block granularity boundaries >>>>> that is the file systems job, not the applications job. >>>> >>>> Why? There are two contradictory actions that filesystem can't do at the same time: >>>> >>>> 1)place files on a distance from each other to minimize number of extents >>>> on possible future growth; >>> >>> Speculative EOF preallocation at delayed allocation reservation time >>> provides this. >>> >>>> 2)place small files in the same big block of block device. >>> >>> Delayed allocation during writeback packs files smaller than the >>> stripe unit of the filesystem tightly. >>> >>> So, yes, we do both of these things at the same time in XFS, and >>> have for the past 10 years. >>> >>>> At initial allocation time you never know, which file will stop grow in some future, >>>> i.e. which file is suitable for compaction. This knowledge becomes available some time later. >>>> Say, if a file has not been changed for a month, it is suitable for compaction with >>>> another files like it. >>>> >>>> If at allocation time you can determine a file, which won't grow in the future, don't be afraid, >>>> and just share your algorithm here. >>>> >>>> In Virtuozzo we tried to compact ext4 with existing kernel interface: >>>> >>>> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c >>>> >>>> But it does not work well in many situations, and the main problem is blocks allocation >>>> in desired place is not possible. Block allocator can't behave excellent for everything. >>>> >>>> If this interface bad, can you suggest another interface to make block allocator to know >>>> the behavior expected from him in this specific case? >>> >>> Write once, long term data: >>> >>> fcntl(fd, F_SET_RW_HINT, RWH_WRITE_LIFE_EXTREME); >>> >>> That will allow the the storage stack to group all data with the >>> same hint together, both in software and in hardware. >> >> This is interesting option, but it only applicable before write is made. And it's only >> applicable on your own applications. My usecase is defragmentation of containers, where >> any applications may run. Most of applications never care whether long or short-term >> data they write. > > Why is that a problem? They'll be using the default write hint (i.e. > NONE) and so a hint aware allocation policy will be separating that > data from all the other data written with specific hints... > > And you've mentioned that your application has specific *never write > again* selection criteria for data it is repacking. And that > involves rewriting that data. IOWs, you know exactly what policy > you want to apply before you rewrite the data, and so what other > applications do is completely irrelevant for your repacker... 
It is not rewriting data; it is moving data to a new place with EXT4_IOC_MOVE_EXT. At this point the extent is already allocated and its place is known. But if >> Maybe, we can make fallocate() care about F_SET_RW_HINT? Say, if RWH_WRITE_LIFE_EXTREME >> is set, fallocate() tries to allocate space around another inodes with the same hint. > > That's exactly what I said: > >>> That will allow the the storage stack to group all data with the >>> same hint together, both in software and in hardware. ... and fallocate() cares about the hint, this should work. > What the filesystem does with the hint is up to the filesystem > and the policies that it's developers decide are appropriate. If > your filesystem doesn't do what you need, talk to the filesystem > developers about implementing the policy you require. Do XFS kernel defrag interfaces allow packing some randomly chosen small files into 1Mb blocks? Do they allow packing a small 4Kb file into the free space after a big file, like in this example: before: BIG file Small file [single 16 Mb extent - 4Kb][unused 4Kb][4Kb extent] after: BIG file Small file [single 16 Mb extent - 4Kb][4Kb extent][unused 4Kb] ? Kirill
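For context, the EXT4_IOC_MOVE_EXT flow Kirill's defragmenter relies on looks roughly like the sketch below: blocks are first allocated to a donor file (where they land is decided by the block allocator, which is the point of contention in this thread), then the kernel copies the data and swaps the extent pointers. struct move_extent is not exported in a uapi header, so userspace tools such as e4defrag carry their own copy; the paths and sizes here are only examples.

```c
/* Sketch of a donor-based move; struct layout mirrors fs/ext4/ext4.h. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <unistd.h>

struct move_extent {
	__u32 reserved;		/* must be zero */
	__u32 donor_fd;		/* donor file descriptor */
	__u64 orig_start;	/* logical start block in the original file */
	__u64 donor_start;	/* logical start block in the donor file */
	__u64 len;		/* number of blocks to move */
	__u64 moved_len;	/* filled in by the kernel */
};
#define EXT4_IOC_MOVE_EXT	_IOWR('f', 15, struct move_extent)

int main(void)
{
	int orig = open("small_file", O_RDWR);
	int donor = open(".", O_RDWR | O_TMPFILE, 0600);
	struct move_extent me = { .donor_fd = donor, .len = 1 };

	if (orig < 0 || donor < 0)
		return 1;
	/* Preallocate the donor blocks, then swap them with the victim's. */
	if (fallocate(donor, 0, 0, 4096) < 0 ||
	    ioctl(orig, EXT4_IOC_MOVE_EXT, &me) < 0) {
		perror("move extent");
		return 1;
	}
	printf("moved %llu blocks\n", (unsigned long long)me.moved_len);
	return 0;
}
```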
On Feb 27, 2020, at 5:24 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote: > > On 27.02.2020 00:51, Andreas Dilger wrote: >> On Feb 26, 2020, at 1:05 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote: >>> Why? There are two contradictory actions that filesystem can't do at the same time: >>> >>> 1)place files on a distance from each other to minimize number of extents >>> on possible future growth; >>> 2)place small files in the same big block of block device. >>> >>> At initial allocation time you never know, which file will stop grow in some >>> future, i.e. which file is suitable for compaction. This knowledge becomes >>> available some time later. Say, if a file has not been changed for a month, >>> it is suitable for compaction with another files like it. >>> >>> If at allocation time you can determine a file, which won't grow in the future, >>> don't be afraid, and just share your algorithm here. >> >> Very few files grow after they are initially written/closed. Those that >> do are almost always opened with O_APPEND (e.g. log files). It would be >> reasonable to have O_APPEND cause the filesystem to reserve blocks (in >> memory at least, maybe some small amount on disk like 1/4 of the current >> file size) for the file to grow after it is closed. We might use the >> same heuristic for directories that grow long after initial creation. > > 1)Lets see on a real example. I created a new ext4 and started the test below: > https://gist.github.com/tkhai/afd8458c0a3cc082a1230370c7d89c99 > > Here are two files written. One file is 4Kb. One file is 1Mb-4Kb. > > $filefrag -e test1.tmp test2.tmp > Filesystem type is: ef53 > File size of test1.tmp is 4096 (1 block of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 0: 33793.. 33793: 1: last,eof > test1.tmp: 1 extent found > File size of test2.tmp is 1044480 (255 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 254: 33536.. 33790: 255: last,eof > test2.tmp: 1 extent found The alignment of blocks in the filesystem is much easier to see if you use "filefrag -e -x ..." to print the values in hex. In this case, 33536 = 0x8300 so it is properly aligned on disk IMHO. > $debugfs: testb 33791 > Block 33791 not in use > > test2.tmp started from 131Mb. In case of discard granuality is 1Mb, test1.tmp > placement prevents us from discarding next 1Mb block. For most filesystem uses, aligning the almost 1MB file on a 1MB boundary is good. That allows a full-stripe read/write for RAID, and is more likely to align with the erase block for flash. If it were to be allocated after the 4KB block, then it may be that each 1MB-aligned read/write of a large file would need to read/write two unaligned chunks per syscall. > 2)Another example. Let write two files: 1Mb-4Kb and 1Mb+4Kb: > > # filefrag -e test3.tmp test4.tmp > Filesystem type is: ef53 > File size of test3.tmp is 1052672 (257 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 256: 35840.. 36096: 257: last,eof > test3.tmp: 1 extent found > File size of test4.tmp is 1044480 (255 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 254: 35072.. 35326: 255: last,eof > test4.tmp: 1 extent found Here again, "filefrag -e -x" would be helpful. 35840 = 0x8c00, and 35072 = 0x8900, so IMHO they are allocated properly for most uses. Packing all files together sequentially on disk is what FAT did and it always got very fragmented in the end. 
> They don't go sequentially, and here is fragmentation starts. > > After both the tests: > $df -h > /dev/loop0 2.0G 11M 1.8G 1% /root/mnt > > Filesystem is free, all last block groups are free. E.g., > > Group 15: (Blocks 491520-524287) csum 0x3ef5 [INODE_UNINIT, ITABLE_ZEROED] > Block bitmap at 272 (bg #0 + 272), csum 0xd52c1f66 > Inode bitmap at 288 (bg #0 + 288), csum 0x00000000 > Inode table at 7969-8480 (bg #0 + 7969) > 32768 free blocks, 8192 free inodes, 0 directories, 8192 unused inodes > Free blocks: 491520-524287 > Free inodes: 122881-131072 > > but two files are not packed together. > > So, ext4 block allocator does not work good for my workload. It even does not > know anything about discard granuality of underlining block device. Does it? > I assume no fs knows. Should I tell it? You can tune the alignment of allocations via s_raid_stripe and s_raid_stride in the ext4 superblock. I believe these are also set by mke2fs by libdisk, but I don't know if it takes flash erase block geometry into account. >> The main exception there is VM images, because they are not really "files" >> in the normal sense, but containers aggregating a lot of different files, >> each created with patterns that are not visible to the VM host. In that >> case, it would be better to have the VM host tell the filesystem that the >> IO pattern is "random" and not try to optimize until the VM is cold. >> >>> In Virtuozzo we tried to compact ext4 with existing kernel interface: >>> >>> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c >>> >>> But it does not work well in many situations, and the main problem is blocks allocation in desired place is not possible. Block allocator can't behave >>> excellent for everything. >>> >>> If this interface bad, can you suggest another interface to make block >>> allocator to know the behavior expected from him in this specific case? >> >> In ext4 there is already the "group" allocator, which combines multiple >> small files together into a single preallocation group, so that the IO >> to disk is large/contiguous. The theory is that files written at the >> same time will have similar lifespans, but that isn't always true. >> >> If the files are large and still being written, the allocator will reserve >> additional blocks (default 8MB I think) on the expectation that it will >> continue to write until it is closed. >> >> I think (correct me if I'm wrong) that your issue is with defragmenting >> small files to free up contiguous space in the filesystem? I think once >> the free space is freed of small files that defragmenting large files is >> easily done. Anything with more than 8-16MB extents will max out most >> storage anyway (seek rate * IO size). > > My issue is mostly with files < 1Mb, because underlining device discard > granuality is 1Mb. The result of fragmentation is that size of occupied > 1Mb blocks of device is 1.5 times bigger, than size of really written > data (say, df -h). And this is the problem. Sure, and the group allocator will aggregate writes << prealloc size of 8MB by default. If it is 1MB that doesn't qualify for group prealloc. I think under 64KB does qualify for aggregation and unaligned writes. >> In that case, an interesting userspace interface would be an array of >> inode numbers (64-bit please) that should be packed together densely in >> the order they are provided (maybe a flag for that). 
That allows the >> filesystem the freedom to find the physical blocks for the allocation, >> while userspace can tell which files are related to each other. > > So, this interface is 3-in-1: > > 1)finds a placement for inodes extents; The target allocation size would be sum(size of inodes), which should be relatively small in your case). > 2)assigns this space to some temporary donor inode; Maybe yes, or just reserves that space from being allocated by anyone. > 3)calls ext4_move_extents() for each of them. ... using the target space that was reserved earlier > Do I understand you right? Correct. That is my "5 minutes thinking about an interface for grouping small files together without exposing kernel internals" proposal for this. > If so, then IMO it's good to start from two inodes, because here may code > a very difficult algorithm of placement of many inodes, which may require > much memory. Is this OK? Well, if the files are small then it won't be a lot of memory. Even so, the kernel would only need to copy a few MB at a time in order to get any decent performance, so I don't think that is a huge problem to have several MB of dirty data in flight. > Can we introduce a flag, that some of inode is unmovable? There are very few flags left in the ext4_inode->i_flags for use. You could use "IMMUTABLE" or "APPEND_ONLY" to mean that, but they also have other semantics. The EXT4_NOTAIL_FL is for not merging the tail of a file, but ext4 doesn't have tails (that was in Reiserfs), so we might consider it a generic "do not merge" flag if set? > Can this interface use a knowledge about underlining device discard granuality? As I wrote above, ext4+mballoc has a very good appreciation for alignment. That was written for RAID storage devices, but it doesn't matter what the reason is. It isn't clear if flash discard alignment is easily used (it may not be a power-of-two value or similar), but wouldn't be harmful to try. > In the answer to Dave, I wrote a proposition to make fallocate() care about > i_write_hint. Could you please comment what you think about that too? I'm not against that. How the two interact would need to be documented first and discussed to see if that makes sene, and then implemented. Cheers, Andreas
On Fri, Feb 28, 2020 at 08:35:19AM -0700, Andreas Dilger wrote: > On Feb 27, 2020, at 5:24 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote: > > On 27.02.2020 00:51, Andreas Dilger wrote: > >> On Feb 26, 2020, at 1:05 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote: > >> In that case, an interesting userspace interface would be an array of > >> inode numbers (64-bit please) that should be packed together densely in > >> the order they are provided (maybe a flag for that). That allows the > >> filesystem the freedom to find the physical blocks for the allocation, > >> while userspace can tell which files are related to each other. > > > > So, this interface is 3-in-1: > > > > 1)finds a placement for inodes extents; > > The target allocation size would be sum(size of inodes), which should > be relatively small in your case). > > > 2)assigns this space to some temporary donor inode; > > Maybe yes, or just reserves that space from being allocated by anyone. > > > 3)calls ext4_move_extents() for each of them. > > ... using the target space that was reserved earlier > > > Do I understand you right? > > Correct. That is my "5 minutes thinking about an interface for grouping > small files together without exposing kernel internals" proposal for this. You don't need any special kernel interface with XFS for this. It is simply: mkdir tmpdir create O_TMPFILEs in tmpdir Now all the tmpfiles you create and their data will be co-located around the location of the tmpdir inode. This is the natural placement policy of the filesystem. i..e the filesystem assumes that files in the same directory are all related, so will be accessed together and so should be located in relatively close proximity to each other. This is a locality optimisation technique that is older than XFS. It works remarkably well when the filesystem can spread directories effectively across it's address space. It also allows userspace to use simple techniques to group (or separate) data files as desired. Indeed, this is how xfs_fsr directs locality for it's tmpfiles when relocating/defragmenting data.... > > If so, then IMO it's good to start from two inodes, because here may code > > a very difficult algorithm of placement of many inodes, which may require > > much memory. Is this OK? > > Well, if the files are small then it won't be a lot of memory. Even so, > the kernel would only need to copy a few MB at a time in order to get > any decent performance, so I don't think that is a huge problem to have > several MB of dirty data in flight. > > > Can we introduce a flag, that some of inode is unmovable? > > There are very few flags left in the ext4_inode->i_flags for use. > You could use "IMMUTABLE" or "APPEND_ONLY" to mean that, but they > also have other semantics. The EXT4_NOTAIL_FL is for not merging the > tail of a file, but ext4 doesn't have tails (that was in Reiserfs), > so we might consider it a generic "do not merge" flag if set? We've had that in XFS for as long as I can remember. Many applications were sensitive to the exact layout of the files they created themselves, so having xfs_fsr defrag/move them about would cause performance SLAs to be broken. Indeed, thanks to XFS, ext4 already has an interface that can be used to set/clear a "no defrag" flag such as you are asking for. It's the FS_XFLAG_NODEFRAG bit in the FS_IOC_FS[GS]ETXATTR ioctl. In XFS, that manages the XFS_DIFLAG_NODEFRAG on-disk inode flag, and it has special meaning for directories. 
From the 'man 3 xfsctl' man page where this interface came from: Bit 13 (0x2000) - XFS_XFLAG_NODEFRAG No defragment file bit - the file should be skipped during a defragmentation operation. When applied to a directory, new files and directories created will inherit the no-defrag bit. > > Can this interface use a knowledge about underlining device discard granuality? > > As I wrote above, ext4+mballoc has a very good appreciation for alignment. > That was written for RAID storage devices, but it doesn't matter what > the reason is. It isn't clear if flash discard alignment is easily > used (it may not be a power-of-two value or similar), but wouldn't be > harmful to try. Yup, XFS has the similar (but more complex) alignment controls for directing allocation to match the underlying storage characteristics. e.g. stripe unit is also the "small file size threshold" where the allocation policy changes from packing to aligning and separating. > > In the answer to Dave, I wrote a proposition to make fallocate() care about > > i_write_hint. Could you please comment what you think about that too? > > I'm not against that. How the two interact would need to be documented > first and discussed to see if that makes sene, and then implemented. Individual filesystems can make their own choices as to what they do with write hints, including ignoring them and leaving it for the storage device to decide where to physically place the data. Which, in many cases, ignoring the hint is the right thing for the filesystem to do... Cheers, Dave.
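For reference, the generic interface Dave points at is driven roughly as below. This is only a sketch: whether FS_XFLAG_NODEFRAG actually changes behaviour is up to the individual filesystem (XFS honours it), and the file name is illustrative.

```c
/* Sketch: mark a file "skip me during defragmentation" via FS_IOC_FS[GS]ETXATTR. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <unistd.h>

int main(void)
{
	struct fsxattr fsx;
	int fd = open("precious.db", O_RDONLY);

	if (fd < 0 || ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSGETXATTR");
		return 1;
	}
	fsx.fsx_xflags |= FS_XFLAG_NODEFRAG;	/* inherited by new children on directories */
	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSSETXATTR");
		return 1;
	}
	close(fd);
	return 0;
}
```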
On Feb 28, 2020, at 2:16 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Feb 28, 2020 at 08:35:19AM -0700, Andreas Dilger wrote: >> On Feb 27, 2020, at 5:24 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote: >>> >>> So, this interface is 3-in-1: >>> >>> 1)finds a placement for inodes extents; >> >> The target allocation size would be sum(size of inodes), which should >> be relatively small in your case). >> >>> 2)assigns this space to some temporary donor inode; >> >> Maybe yes, or just reserves that space from being allocated by anyone. >> >>> 3)calls ext4_move_extents() for each of them. >> >> ... using the target space that was reserved earlier >> >>> Do I understand you right? >> >> Correct. That is my "5 minutes thinking about an interface for grouping >> small files together without exposing kernel internals" proposal for this. > > You don't need any special kernel interface with XFS for this. It is > simply: > > mkdir tmpdir > create O_TMPFILEs in tmpdir > > Now all the tmpfiles you create and their data will be co-located > around the location of the tmpdir inode. This is the natural > placement policy of the filesystem. i..e the filesystem assumes that > files in the same directory are all related, so will be accessed > together and so should be located in relatively close proximity to > each other. Sure, this will likely get inodes allocate _close_ to each other on ext4 as well (the new directory will preferentially be located in a group that has free space), but it doesn't necessarily result in all of the files being packed densely. For 1MB+4KB and 1MB-4KB files they will still prefer to be aligned on 1MB boundaries rather than packed together. >>> Can we introduce a flag, that some of inode is unmovable? >> >> There are very few flags left in the ext4_inode->i_flags for use. >> You could use "IMMUTABLE" or "APPEND_ONLY" to mean that, but they >> also have other semantics. The EXT4_NOTAIL_FL is for not merging the >> tail of a file, but ext4 doesn't have tails (that was in Reiserfs), >> so we might consider it a generic "do not merge" flag if set? > > Indeed, thanks to XFS, ext4 already has an interface that can be > used to set/clear a "no defrag" flag such as you are asking for. > It's the FS_XFLAG_NODEFRAG bit in the FS_IOC_FS[GS]ETXATTR ioctl. > In XFS, that manages the XFS_DIFLAG_NODEFRAG on-disk inode flag, > and it has special meaning for directories. From the 'man 3 xfsctl' > man page where this interface came from: > > Bit 13 (0x2000) - XFS_XFLAG_NODEFRAG > No defragment file bit - the file should be skipped during a > defragmentation operation. When applied to a directory, > new files and directories created will inherit the no-defrag > bit. The interface is not the limiting factor here, but rather the number of flags available in the inode. Since chattr/lsattr from e2fsprogs was used as "common ground" for a few years, there are a number of flags in the namespace that don't actually have any meaning for ext4. One of those flags is: #define EXT4_NOTAIL_FL 0x00008000 /* file tail should not be merged */ This was added for Reiserfs, but it is not used by any other filesystem, so generalizing it slightly to mean "no migrate" is reasonable. That doesn't affect Reiserfs in any way, and it would still be possible to also wire up the XFS_XFLAG_NODEFRAG bit to be stored as that flag. 
It wouldn't be any issue at all to chose an arbitrary unused flag to store this in ext4 inode internally, except that chattr/lsattr are used by a variety of different filesystems, so whatever flag is chosen will immediately also apply to any other filesystem that users use those tools on. Cheers, Andreas
On Fri, Feb 28, 2020 at 03:41:51PM +0300, Kirill Tkhai wrote: > On 28.02.2020 00:56, Dave Chinner wrote: > > On Thu, Feb 27, 2020 at 02:12:53PM +0300, Kirill Tkhai wrote: > >> On 27.02.2020 10:33, Dave Chinner wrote: > >>> On Wed, Feb 26, 2020 at 11:05:23PM +0300, Kirill Tkhai wrote: > >>>> On 26.02.2020 18:55, Christoph Hellwig wrote: > >>>>> On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote: > >>>>>> This adds a support of physical hint for fallocate2() syscall. > >>>>>> In case of @physical argument is set for ext4_fallocate(), > >>>>>> we try to allocate blocks only from [@phisical, @physical + len] > >>>>>> range, while other blocks are not used. > >>>>> > >>>>> Sorry, but this is a complete bullshit interface. Userspace has > >>>>> absolutely no business even thinking of physical placement. If you > >>>>> want to align allocations to physical block granularity boundaries > >>>>> that is the file systems job, not the applications job. > >>>> > >>>> Why? There are two contradictory actions that filesystem can't do at the same time: > >>>> > >>>> 1)place files on a distance from each other to minimize number of extents > >>>> on possible future growth; > >>> > >>> Speculative EOF preallocation at delayed allocation reservation time > >>> provides this. > >>> > >>>> 2)place small files in the same big block of block device. > >>> > >>> Delayed allocation during writeback packs files smaller than the > >>> stripe unit of the filesystem tightly. > >>> > >>> So, yes, we do both of these things at the same time in XFS, and > >>> have for the past 10 years. > >>> > >>>> At initial allocation time you never know, which file will stop grow in some future, > >>>> i.e. which file is suitable for compaction. This knowledge becomes available some time later. > >>>> Say, if a file has not been changed for a month, it is suitable for compaction with > >>>> another files like it. > >>>> > >>>> If at allocation time you can determine a file, which won't grow in the future, don't be afraid, > >>>> and just share your algorithm here. > >>>> > >>>> In Virtuozzo we tried to compact ext4 with existing kernel interface: > >>>> > >>>> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c > >>>> > >>>> But it does not work well in many situations, and the main problem is blocks allocation > >>>> in desired place is not possible. Block allocator can't behave excellent for everything. > >>>> > >>>> If this interface bad, can you suggest another interface to make block allocator to know > >>>> the behavior expected from him in this specific case? > >>> > >>> Write once, long term data: > >>> > >>> fcntl(fd, F_SET_RW_HINT, RWH_WRITE_LIFE_EXTREME); > >>> > >>> That will allow the the storage stack to group all data with the > >>> same hint together, both in software and in hardware. > >> > >> This is interesting option, but it only applicable before write is made. And it's only > >> applicable on your own applications. My usecase is defragmentation of containers, where > >> any applications may run. Most of applications never care whether long or short-term > >> data they write. > > > > Why is that a problem? They'll be using the default write hint (i.e. > > NONE) and so a hint aware allocation policy will be separating that > > data from all the other data written with specific hints... > > > > And you've mentioned that your application has specific *never write > > again* selection criteria for data it is repacking. And that > > involves rewriting that data. 
IOWs, you know exactly what policy > > you want to apply before you rewrite the data, and so what other > > applications do is completely irrelevant for your repacker... > > It is not a rewriting data, there is moving data to new place with EXT4_IOC_MOVE_EXT. "rewriting" is a technical term for reading data at rest and writing it again, whether it be to the same location or to some other location. Changing physical location of data, by definition, requires rewriting data. EXT4_IOC_MOVE_EXT = data rewrite + extent swap to update the metadata in the original file to point at the new data. Hence it appears to "move" from userspace perspective (hence the name) but under the covers it is rewriting data and fiddling pointers... > > What the filesystem does with the hint is up to the filesystem > > and the policies that it's developers decide are appropriate. If > > your filesystem doesn't do what you need, talk to the filesystem > > developers about implementing the policy you require. > > Do XFS kernel defrag interfaces allow to pack some randomly chosen > small files in 1Mb blocks? Do they allow to pack small 4Kb file into > free space after a big file like in example: No. Randomly selecting small holes for small file writes is a terrible idea from a performance perspective. Hence filling tiny holes (not randomly!) is often only done for metadata allocation (e.g. extent map blocks, which are largely random access anyway) or if there is no other choice for data (e.g. at ENOSPC). Cheers, Dave.
On Sat, Feb 29, 2020 at 01:12:52PM -0700, Andreas Dilger wrote: > On Feb 28, 2020, at 2:16 PM, Dave Chinner <david@fromorbit.com> wrote: > > > > On Fri, Feb 28, 2020 at 08:35:19AM -0700, Andreas Dilger wrote: > >> On Feb 27, 2020, at 5:24 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote: > >>> > >>> So, this interface is 3-in-1: > >>> > >>> 1)finds a placement for inodes extents; > >> > >> The target allocation size would be sum(size of inodes), which should > >> be relatively small in your case). > >> > >>> 2)assigns this space to some temporary donor inode; > >> > >> Maybe yes, or just reserves that space from being allocated by anyone. > >> > >>> 3)calls ext4_move_extents() for each of them. > >> > >> ... using the target space that was reserved earlier > >> > >>> Do I understand you right? > >> > >> Correct. That is my "5 minutes thinking about an interface for grouping > >> small files together without exposing kernel internals" proposal for this. > > > > You don't need any special kernel interface with XFS for this. It is > > simply: > > > > mkdir tmpdir > > create O_TMPFILEs in tmpdir > > > > Now all the tmpfiles you create and their data will be co-located > > around the location of the tmpdir inode. This is the natural > > placement policy of the filesystem. i..e the filesystem assumes that > > files in the same directory are all related, so will be accessed > > together and so should be located in relatively close proximity to > > each other. > > Sure, this will likely get inodes allocate _close_ to each other on > ext4 as well (the new directory will preferentially be located in a > group that has free space), but it doesn't necessarily result in > all of the files being packed densely. For 1MB+4KB and 1MB-4KB files > they will still prefer to be aligned on 1MB boundaries rather than > packed together. Userspace can control that, too, simply by choosing to relocate only small files into a single directory, then relocating the large files in a separate set of operations after flushing the small files and having the packed tightly. Seriously, userspace has a *lot* of control of how data is located and packed simply by grouping the IO it wants to be written together into the same logical groups (i.e. directories) in the same temporal IO domain... > >>> Can we introduce a flag, that some of inode is unmovable? > >> > >> There are very few flags left in the ext4_inode->i_flags for use. > >> You could use "IMMUTABLE" or "APPEND_ONLY" to mean that, but they > >> also have other semantics. The EXT4_NOTAIL_FL is for not merging the > >> tail of a file, but ext4 doesn't have tails (that was in Reiserfs), > >> so we might consider it a generic "do not merge" flag if set? > > > > Indeed, thanks to XFS, ext4 already has an interface that can be > > used to set/clear a "no defrag" flag such as you are asking for. > > It's the FS_XFLAG_NODEFRAG bit in the FS_IOC_FS[GS]ETXATTR ioctl. > > In XFS, that manages the XFS_DIFLAG_NODEFRAG on-disk inode flag, > > and it has special meaning for directories. From the 'man 3 xfsctl' > > man page where this interface came from: > > > > Bit 13 (0x2000) - XFS_XFLAG_NODEFRAG > > No defragment file bit - the file should be skipped during a > > defragmentation operation. When applied to a directory, > > new files and directories created will inherit the no-defrag > > bit. > > The interface is not the limiting factor here, but rather the number > of flags available in the inode. Yes, that's an internal ext4 issue, not a userspace API problem. 
> Since chattr/lsattr from e2fsprogs > was used as "common ground" for a few years, there are a number of > flags in the namespace that don't actually have any meaning for ext4. Yes, that's a shitty API bed that extN made for itself, isn't it? We've sucked at API design for a long, long time. :/ But the chattr userspace application is also irrelevant to the problem at hand: it already uses the FS_IOC_FS[GS]ETXATTR ioctl interface for changing project quota IDs and the per-inode inheritance flag. Hence how it manages the new flag is irrelevant, but it also means we can't change the definition or behaviour of existing flags it controls regardless of what filesystem those flags act on. > One of those flags is: > > #define EXT4_NOTAIL_FL 0x00008000 /* file tail should not be merged */ > > This was added for Reiserfs, but it is not used by any other filesystem, > so generalizing it slightly to mean "no migrate" is reasonable. That > doesn't affect Reiserfs in any way, and it would still be possible to > also wire up the XFS_XFLAG_NODEFRAG bit to be stored as that flag. Yes, ext4 could do that, but we are not allowed to redefine long standing userspace API flags to mean something completely different. That's effectively what you are proposing here if you allow ext4 to manipulate the same on-disk flag via both FS_NOTAIL_FL and FS_XFLAG_NODEFRAG. ie. the FS_NOTAIL_FL flag is manipulated by FS_IOC_[GS]ETFLAGS and is marked both as user visible and modifiable by ext4 even though ti does nothing. IOWs, to redefine this on-disk flag we would also need to have EXT4_IOC_GETFLAGS / EXT4_IOC_SETFLAGS reject attempts to set/clear FS_NOTAIL_FL with EOPNOTSUPP or EINVAL. Which we then risk breaking applications that use this flag even though ext4 does not implement anything other than setting/clearing the flag on demand. IOWs, we cannot change the meaning of the EXT4_NOTAIL_FL on disk flag, because that either changes the user visible behaviour of the on-disk flag or it changes the definition of a userspace API flag to mean something it was never meant to mean. Neither of those things are acceptible changes to make to a generic userspace API. > It wouldn't be any issue at all to chose an arbitrary unused flag to > store this in ext4 inode internally, except that chattr/lsattr are used > by a variety of different filesystems, so whatever flag is chosen will > immediately also apply to any other filesystem that users use those > tools on. The impact on userspace is only a problem if you re-use a flag ext4 already exposes to userspace. And that is not allowed if it causes the userspace API to be globally redefined for everyone. Which, clearly, it would. Cheers, Dave.
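[Editor's note: for concreteness, this is roughly what flipping the bit through the generic interface discussed above looks like from userspace. struct fsxattr, FS_IOC_FSGETXATTR/FSSETXATTR and FS_XFLAG_NODEFRAG all come from <linux/fs.h>; whether a given filesystem honours the bit is, as Dave says, up to that filesystem.]

    #include <sys/ioctl.h>
    #include <linux/fs.h>   /* struct fsxattr, FS_IOC_FS[GS]ETXATTR, FS_XFLAG_NODEFRAG */

    /* Mark an open file as "skip me during defragmentation". */
    static int set_nodefrag(int fd)
    {
            struct fsxattr fsx;

            if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
                    return -1;
            fsx.fsx_xflags |= FS_XFLAG_NODEFRAG;
            return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
    }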
On 01.03.2020 01:41, Dave Chinner wrote: > On Fri, Feb 28, 2020 at 03:41:51PM +0300, Kirill Tkhai wrote: >> On 28.02.2020 00:56, Dave Chinner wrote: >>> On Thu, Feb 27, 2020 at 02:12:53PM +0300, Kirill Tkhai wrote: >>>> On 27.02.2020 10:33, Dave Chinner wrote: >>>>> On Wed, Feb 26, 2020 at 11:05:23PM +0300, Kirill Tkhai wrote: >>>>>> On 26.02.2020 18:55, Christoph Hellwig wrote: >>>>>>> On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote: >>>>>>>> This adds a support of physical hint for fallocate2() syscall. >>>>>>>> In case of @physical argument is set for ext4_fallocate(), >>>>>>>> we try to allocate blocks only from [@phisical, @physical + len] >>>>>>>> range, while other blocks are not used. >>>>>>> >>>>>>> Sorry, but this is a complete bullshit interface. Userspace has >>>>>>> absolutely no business even thinking of physical placement. If you >>>>>>> want to align allocations to physical block granularity boundaries >>>>>>> that is the file systems job, not the applications job. >>>>>> >>>>>> Why? There are two contradictory actions that filesystem can't do at the same time: >>>>>> >>>>>> 1)place files on a distance from each other to minimize number of extents >>>>>> on possible future growth; >>>>> >>>>> Speculative EOF preallocation at delayed allocation reservation time >>>>> provides this. >>>>> >>>>>> 2)place small files in the same big block of block device. >>>>> >>>>> Delayed allocation during writeback packs files smaller than the >>>>> stripe unit of the filesystem tightly. >>>>> >>>>> So, yes, we do both of these things at the same time in XFS, and >>>>> have for the past 10 years. >>>>> >>>>>> At initial allocation time you never know, which file will stop grow in some future, >>>>>> i.e. which file is suitable for compaction. This knowledge becomes available some time later. >>>>>> Say, if a file has not been changed for a month, it is suitable for compaction with >>>>>> another files like it. >>>>>> >>>>>> If at allocation time you can determine a file, which won't grow in the future, don't be afraid, >>>>>> and just share your algorithm here. >>>>>> >>>>>> In Virtuozzo we tried to compact ext4 with existing kernel interface: >>>>>> >>>>>> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c >>>>>> >>>>>> But it does not work well in many situations, and the main problem is blocks allocation >>>>>> in desired place is not possible. Block allocator can't behave excellent for everything. >>>>>> >>>>>> If this interface bad, can you suggest another interface to make block allocator to know >>>>>> the behavior expected from him in this specific case? >>>>> >>>>> Write once, long term data: >>>>> >>>>> fcntl(fd, F_SET_RW_HINT, RWH_WRITE_LIFE_EXTREME); >>>>> >>>>> That will allow the the storage stack to group all data with the >>>>> same hint together, both in software and in hardware. >>>> >>>> This is interesting option, but it only applicable before write is made. And it's only >>>> applicable on your own applications. My usecase is defragmentation of containers, where >>>> any applications may run. Most of applications never care whether long or short-term >>>> data they write. >>> >>> Why is that a problem? They'll be using the default write hint (i.e. >>> NONE) and so a hint aware allocation policy will be separating that >>> data from all the other data written with specific hints... >>> >>> And you've mentioned that your application has specific *never write >>> again* selection criteria for data it is repacking. 
And that >>> involves rewriting that data. IOWs, you know exactly what policy >>> you want to apply before you rewrite the data, and so what other >>> applications do is completely irrelevant for your repacker... >> >> It is not a rewriting data, there is moving data to new place with EXT4_IOC_MOVE_EXT. > > "rewriting" is a technical term for reading data at rest and writing > it again, whether it be to the same location or to some other > location. Changing physical location of data, by definition, > requires rewriting data. > > EXT4_IOC_MOVE_EXT = data rewrite + extent swap to update the > metadata in the original file to point at the new data. Hence it > appears to "move" from userspace perspective (hence the name) but > under the covers it is rewriting data and fiddling pointers... Yeah, I understand this. I mean that the file remains accessible to external users, external reads/writes are handled properly, and the state of the file remains consistent. >>> What the filesystem does with the hint is up to the filesystem >>> and the policies that it's developers decide are appropriate. If >>> your filesystem doesn't do what you need, talk to the filesystem >>> developers about implementing the policy you require. >> >> Do XFS kernel defrag interfaces allow to pack some randomly chosen >> small files in 1Mb blocks? Do they allow to pack small 4Kb file into >> free space after a big file like in example: > > No. Randomly selecting small holes for small file writes is a > terrible idea from a performance perspective. Hence filling tiny > holes (not randomly!) is often only done for metadata allocation > (e.g. extent map blocks, which are largely random access anyway) or > if there is no other choice for data (e.g. at ENOSPC). I'm speaking more about the possibility. "Random" is from the block allocator's point of view; from the user's point of view these files are not random, they are unmodifiable. Say, the static content of a website never changes, and those files may be packed together to decrease the number of occupied 1Mb blocks on the device. Packing all files on a disc together is a terrible idea; I agree with you 100%. Kirill
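[Editor's note: the write-lifetime hint quoted earlier in this exchange is set with a plain fcntl(); a small sketch follows for reference. F_SET_RW_HINT takes a pointer to a uint64_t and needs Linux 4.13+; on older glibc the F_SET_RW_HINT and RWH_* definitions may have to come from <linux/fcntl.h>.]

    #include <fcntl.h>
    #include <stdint.h>

    /* Hint that this inode's data is write-once / long-lived, so the
     * filesystem and device may group it with similar data. */
    static int mark_long_lived(int fd)
    {
            uint64_t hint = RWH_WRITE_LIFE_EXTREME;

            return fcntl(fd, F_SET_RW_HINT, &hint);
    }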
On 29.02.2020 00:16, Dave Chinner wrote: > On Fri, Feb 28, 2020 at 08:35:19AM -0700, Andreas Dilger wrote: >> On Feb 27, 2020, at 5:24 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote: >>> On 27.02.2020 00:51, Andreas Dilger wrote: >>>> On Feb 26, 2020, at 1:05 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote: >>>> In that case, an interesting userspace interface would be an array of >>>> inode numbers (64-bit please) that should be packed together densely in >>>> the order they are provided (maybe a flag for that). That allows the >>>> filesystem the freedom to find the physical blocks for the allocation, >>>> while userspace can tell which files are related to each other. >>> >>> So, this interface is 3-in-1: >>> >>> 1)finds a placement for inodes extents; >> >> The target allocation size would be sum(size of inodes), which should >> be relatively small in your case). >> >>> 2)assigns this space to some temporary donor inode; >> >> Maybe yes, or just reserves that space from being allocated by anyone. >> >>> 3)calls ext4_move_extents() for each of them. >> >> ... using the target space that was reserved earlier >> >>> Do I understand you right? >> >> Correct. That is my "5 minutes thinking about an interface for grouping >> small files together without exposing kernel internals" proposal for this. > > You don't need any special kernel interface with XFS for this. It is > simply: > > mkdir tmpdir > create O_TMPFILEs in tmpdir > > Now all the tmpfiles you create and their data will be co-located > around the location of the tmpdir inode. This is the natural > placement policy of the filesystem. i..e the filesystem assumes that > files in the same directory are all related, so will be accessed > together and so should be located in relatively close proximity to > each other. Hm, but does this help for my problem? 1)allocate two files in the same directory and then 2)move source files there? In case of I have two 512K files ext4 allows the same: 1)fallocate() 1M continuous space (this should ends with success in case of disc is not almost full); 2)move both files into newly allocated space. But this doubles IO, since both of files have to be moved. The ideal solution would be to allocate space around one of them and to move the second file there. > This is a locality optimisation technique that is older than XFS. It > works remarkably well when the filesystem can spread directories > effectively across it's address space. It also allows userspace to > use simple techniques to group (or separate) data files as desired. > Indeed, this is how xfs_fsr directs locality for it's tmpfiles when > relocating/defragmenting data.... > >>> If so, then IMO it's good to start from two inodes, because here may code >>> a very difficult algorithm of placement of many inodes, which may require >>> much memory. Is this OK? >> >> Well, if the files are small then it won't be a lot of memory. Even so, >> the kernel would only need to copy a few MB at a time in order to get >> any decent performance, so I don't think that is a huge problem to have >> several MB of dirty data in flight. >> >>> Can we introduce a flag, that some of inode is unmovable? >> >> There are very few flags left in the ext4_inode->i_flags for use. >> You could use "IMMUTABLE" or "APPEND_ONLY" to mean that, but they >> also have other semantics. The EXT4_NOTAIL_FL is for not merging the >> tail of a file, but ext4 doesn't have tails (that was in Reiserfs), >> so we might consider it a generic "do not merge" flag if set? 
> > We've had that in XFS for as long as I can remember. Many > applications were sensitive to the exact layout of the files they > created themselves, so having xfs_fsr defrag/move them about would > cause performance SLAs to be broken. > > Indeed, thanks to XFS, ext4 already has an interface that can be > used to set/clear a "no defrag" flag such as you are asking for. > It's the FS_XFLAG_NODEFRAG bit in the FS_IOC_FS[GS]ETXATTR ioctl. > In XFS, that manages the XFS_DIFLAG_NODEFRAG on-disk inode flag, > and it has special meaning for directories. From the 'man 3 xfsctl' > man page where this interface came from: > > Bit 13 (0x2000) - XFS_XFLAG_NODEFRAG > No defragment file bit - the file should be skipped during a > defragmentation operation. When applied to a directory, > new files and directories created will inherit the no-defrag > bit. > >>> Can this interface use a knowledge about underlining device discard granuality? >> >> As I wrote above, ext4+mballoc has a very good appreciation for alignment. >> That was written for RAID storage devices, but it doesn't matter what >> the reason is. It isn't clear if flash discard alignment is easily >> used (it may not be a power-of-two value or similar), but wouldn't be >> harmful to try. > > Yup, XFS has the similar (but more complex) alignment controls for > directing allocation to match the underlying storage > characteristics. e.g. stripe unit is also the "small file size > threshold" where the allocation policy changes from packing to > aligning and separating. > >>> In the answer to Dave, I wrote a proposition to make fallocate() care about >>> i_write_hint. Could you please comment what you think about that too? >> >> I'm not against that. How the two interact would need to be documented >> first and discussed to see if that makes sene, and then implemented. > > Individual filesystems can make their own choices as to what they do > with write hints, including ignoring them and leaving it for the > storage device to decide where to physically place the data. Which, > in many cases, ignoring the hint is the right thing for the > filesystem to do... > > Cheers, > > Dave. >
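[Editor's note: to make the tmpdir technique quoted above concrete, here is a rough userspace sketch: repacked copies are created as O_TMPFILEs under one scratch directory so the filesystem co-locates them, then published at their final names via /proc/self/fd. The repack_one() helper and its paths are illustrative only, not an existing tool, and whether this achieves dense packing depends on the filesystem's placement policy.]

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Write @len bytes from @buf as an anonymous tmpfile in @tmpdir,
     * then link it into place at @dst. Placement follows the tmpdir. */
    static int repack_one(const char *tmpdir, const char *dst,
                          const void *buf, size_t len)
    {
            char proc[64];
            int fd = open(tmpdir, O_TMPFILE | O_WRONLY, 0600);

            if (fd < 0)
                    return -1;
            if (write(fd, buf, len) != (ssize_t)len)
                    goto fail;
            snprintf(proc, sizeof(proc), "/proc/self/fd/%d", fd);
            if (linkat(AT_FDCWD, proc, AT_FDCWD, dst, AT_SYMLINK_FOLLOW) < 0)
                    goto fail;
            close(fd);
            return 0;
    fail:
            close(fd);
            return -1;
    }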
On 28.02.2020 18:35, Andreas Dilger wrote: > On Feb 27, 2020, at 5:24 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote: >> >> On 27.02.2020 00:51, Andreas Dilger wrote: >>> On Feb 26, 2020, at 1:05 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote: >>>> Why? There are two contradictory actions that filesystem can't do at the same time: >>>> >>>> 1)place files on a distance from each other to minimize number of extents >>>> on possible future growth; >>>> 2)place small files in the same big block of block device. >>>> >>>> At initial allocation time you never know, which file will stop grow in some >>>> future, i.e. which file is suitable for compaction. This knowledge becomes >>>> available some time later. Say, if a file has not been changed for a month, >>>> it is suitable for compaction with another files like it. >>>> >>>> If at allocation time you can determine a file, which won't grow in the future, >>>> don't be afraid, and just share your algorithm here. >>> >>> Very few files grow after they are initially written/closed. Those that >>> do are almost always opened with O_APPEND (e.g. log files). It would be >>> reasonable to have O_APPEND cause the filesystem to reserve blocks (in >>> memory at least, maybe some small amount on disk like 1/4 of the current >>> file size) for the file to grow after it is closed. We might use the >>> same heuristic for directories that grow long after initial creation. >> >> 1)Lets see on a real example. I created a new ext4 and started the test below: >> https://gist.github.com/tkhai/afd8458c0a3cc082a1230370c7d89c99 >> >> Here are two files written. One file is 4Kb. One file is 1Mb-4Kb. >> >> $filefrag -e test1.tmp test2.tmp >> Filesystem type is: ef53 >> File size of test1.tmp is 4096 (1 block of 4096 bytes) >> ext: logical_offset: physical_offset: length: expected: flags: >> 0: 0.. 0: 33793.. 33793: 1: last,eof >> test1.tmp: 1 extent found >> File size of test2.tmp is 1044480 (255 blocks of 4096 bytes) >> ext: logical_offset: physical_offset: length: expected: flags: >> 0: 0.. 254: 33536.. 33790: 255: last,eof >> test2.tmp: 1 extent found > > The alignment of blocks in the filesystem is much easier to see if you use > "filefrag -e -x ..." to print the values in hex. In this case, 33536 = 0x8300 > so it is properly aligned on disk IMHO. > >> $debugfs: testb 33791 >> Block 33791 not in use >> >> test2.tmp started from 131Mb. In case of discard granuality is 1Mb, test1.tmp >> placement prevents us from discarding next 1Mb block. > > For most filesystem uses, aligning the almost 1MB file on a 1MB boundary > is good. That allows a full-stripe read/write for RAID, and is more > likely to align with the erase block for flash. If it were to be allocated > after the 4KB block, then it may be that each 1MB-aligned read/write of a > large file would need to read/write two unaligned chunks per syscall. > >> 2)Another example. Let write two files: 1Mb-4Kb and 1Mb+4Kb: >> >> # filefrag -e test3.tmp test4.tmp >> Filesystem type is: ef53 >> File size of test3.tmp is 1052672 (257 blocks of 4096 bytes) >> ext: logical_offset: physical_offset: length: expected: flags: >> 0: 0.. 256: 35840.. 36096: 257: last,eof >> test3.tmp: 1 extent found >> File size of test4.tmp is 1044480 (255 blocks of 4096 bytes) >> ext: logical_offset: physical_offset: length: expected: flags: >> 0: 0.. 254: 35072.. 35326: 255: last,eof >> test4.tmp: 1 extent found > > Here again, "filefrag -e -x" would be helpful. 
35840 = 0x8c00, and > 35072 = 0x8900, so IMHO they are allocated properly for most uses. > Packing all files together sequentially on disk is what FAT did and > it always got very fragmented in the end. > >> They don't go sequentially, and here is fragmentation starts. >> >> After both the tests: >> $df -h >> /dev/loop0 2.0G 11M 1.8G 1% /root/mnt >> >> Filesystem is free, all last block groups are free. E.g., >> >> Group 15: (Blocks 491520-524287) csum 0x3ef5 [INODE_UNINIT, ITABLE_ZEROED] >> Block bitmap at 272 (bg #0 + 272), csum 0xd52c1f66 >> Inode bitmap at 288 (bg #0 + 288), csum 0x00000000 >> Inode table at 7969-8480 (bg #0 + 7969) >> 32768 free blocks, 8192 free inodes, 0 directories, 8192 unused inodes >> Free blocks: 491520-524287 >> Free inodes: 122881-131072 >> >> but two files are not packed together. >> >> So, ext4 block allocator does not work good for my workload. It even does not >> know anything about discard granuality of underlining block device. Does it? >> I assume no fs knows. Should I tell it? > > You can tune the alignment of allocations via s_raid_stripe and s_raid_stride > in the ext4 superblock. I believe these are also set by mke2fs by libdisk, > but I don't know if it takes flash erase block geometry into account. > >>> The main exception there is VM images, because they are not really "files" >>> in the normal sense, but containers aggregating a lot of different files, >>> each created with patterns that are not visible to the VM host. In that >>> case, it would be better to have the VM host tell the filesystem that the >>> IO pattern is "random" and not try to optimize until the VM is cold. >>> >>>> In Virtuozzo we tried to compact ext4 with existing kernel interface: >>>> >>>> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c >>>> >>>> But it does not work well in many situations, and the main problem is blocks allocation in desired place is not possible. Block allocator can't behave >>>> excellent for everything. >>>> >>>> If this interface bad, can you suggest another interface to make block >>>> allocator to know the behavior expected from him in this specific case? >>> >>> In ext4 there is already the "group" allocator, which combines multiple >>> small files together into a single preallocation group, so that the IO >>> to disk is large/contiguous. The theory is that files written at the >>> same time will have similar lifespans, but that isn't always true. >>> >>> If the files are large and still being written, the allocator will reserve >>> additional blocks (default 8MB I think) on the expectation that it will >>> continue to write until it is closed. >>> >>> I think (correct me if I'm wrong) that your issue is with defragmenting >>> small files to free up contiguous space in the filesystem? I think once >>> the free space is freed of small files that defragmenting large files is >>> easily done. Anything with more than 8-16MB extents will max out most >>> storage anyway (seek rate * IO size). >> >> My issue is mostly with files < 1Mb, because underlining device discard >> granuality is 1Mb. The result of fragmentation is that size of occupied >> 1Mb blocks of device is 1.5 times bigger, than size of really written >> data (say, df -h). And this is the problem. > > > Sure, and the group allocator will aggregate writes << prealloc size of > 8MB by default. If it is 1MB that doesn't qualify for group prealloc. > I think under 64KB does qualify for aggregation and unaligned writes. 
> >>> In that case, an interesting userspace interface would be an array of >>> inode numbers (64-bit please) that should be packed together densely in >>> the order they are provided (maybe a flag for that). That allows the >>> filesystem the freedom to find the physical blocks for the allocation, >>> while userspace can tell which files are related to each other. >> >> So, this interface is 3-in-1: >> >> 1)finds a placement for inodes extents; > > The target allocation size would be sum(size of inodes), which should > be relatively small in your case). > >> 2)assigns this space to some temporary donor inode; > > Maybe yes, or just reserves that space from being allocated by anyone. > >> 3)calls ext4_move_extents() for each of them. > > ... using the target space that was reserved earlier > >> Do I understand you right? > > Correct. That is my "5 minutes thinking about an interface for grouping > small files together without exposing kernel internals" proposal for this. Ok. I'll think about the prototype and then public to the mailing list. >> If so, then IMO it's good to start from two inodes, because here may code >> a very difficult algorithm of placement of many inodes, which may require >> much memory. Is this OK? > > Well, if the files are small then it won't be a lot of memory. Even so, > the kernel would only need to copy a few MB at a time in order to get > any decent performance, so I don't think that is a huge problem to have > several MB of dirty data in flight. I mean not in-flight IO, but memory for all logic of files placement. Userspace may build multi-step algoritm, which is hidden for kernel: pack two files together, then decrease number of extents of some third file, then pack something else. Also, files related to different directories should be packed together, but it does not look good for kernel to look for files directories by inodes (our interface is about 64-bit inodes numbers, sure?). For me it does not look good, kernel iterates over all files and looks for a placement for a specific file, since this is just excess work for kernel. Usually, both the files are chosen by userspace, and the userspace does not want to move more then one of them at time. >> Can we introduce a flag, that some of inode is unmovable? > > There are very few flags left in the ext4_inode->i_flags for use. > You could use "IMMUTABLE" or "APPEND_ONLY" to mean that, but they > also have other semantics. The EXT4_NOTAIL_FL is for not merging the > tail of a file, but ext4 doesn't have tails (that was in Reiserfs), > so we might consider it a generic "do not merge" flag if set? > >> Can this interface use a knowledge about underlining device discard granuality? > > As I wrote above, ext4+mballoc has a very good appreciation for alignment. > That was written for RAID storage devices, but it doesn't matter what > the reason is. It isn't clear if flash discard alignment is easily > used (it may not be a power-of-two value or similar), but wouldn't be > harmful to try. > >> In the answer to Dave, I wrote a proposition to make fallocate() care about >> i_write_hint. Could you please comment what you think about that too? > > I'm not against that. How the two interact would need to be documented > first and discussed to see if that makes sene, and then implemented. Thanks, Kirill
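[Editor's note: purely as a strawman for the proposal discussed above, the argument such an interface might take could look like the sketch below. Nothing like this exists in the kernel; the struct, flag and ioctl name are invented here only to illustrate the "ordered array of 64-bit inode numbers, packed densely" idea, with the filesystem left to pick the physical location.]

    #include <linux/types.h>

    /* Hypothetical: pack the listed inodes' blocks densely, in order. */
    struct ext4_pack_inodes {
            __u64 flags;    /* e.g. a hypothetical EXT4_PACK_KEEP_ORDER */
            __u64 count;    /* number of entries in inodes[] */
            __u64 inodes[]; /* 64-bit inode numbers, in packing order */
    };

    /* The ioctl itself is deliberately left undefined here:
     * #define EXT4_IOC_PACK_INODES _IOW('f', ..., struct ext4_pack_inodes)
     */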
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 5a98081c5369..299fbb8350ac 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -181,6 +181,7 @@ struct ext4_allocation_request {
 struct ext4_map_blocks {
 	ext4_fsblk_t m_pblk;
 	ext4_lblk_t m_lblk;
+	ext4_fsblk_t m_goal_pblk;
 	unsigned int m_len;
 	unsigned int m_flags;
 };
@@ -621,6 +622,8 @@ enum {
 	/* Caller will submit data before dropping transaction handle. This
 	 * allows jbd2 to avoid submitting data before commit. */
 #define EXT4_GET_BLOCKS_IO_SUBMIT		0x0400
+	/* Caller wants blocks from provided physical offset */
+#define EXT4_GET_BLOCKS_FROM_GOAL		0x0800
 
 /*
  * The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 10d0188a712d..5f2790c1c4fb 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4412,7 +4412,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 
 	/* allocate new block */
 	ar.inode = inode;
-	ar.goal = ext4_ext_find_goal(inode, path, map->m_lblk);
 	ar.logical = map->m_lblk;
 	/*
 	 * We calculate the offset from the beginning of the cluster
@@ -4437,6 +4436,13 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		ar.flags |= EXT4_MB_DELALLOC_RESERVED;
 	if (flags & EXT4_GET_BLOCKS_METADATA_NOFAIL)
 		ar.flags |= EXT4_MB_USE_RESERVED;
+	if (flags & EXT4_GET_BLOCKS_FROM_GOAL) {
+		ar.flags |= EXT4_MB_HINT_TRY_GOAL|EXT4_MB_HINT_GOAL_ONLY;
+		ar.goal = map->m_goal_pblk;
+	} else {
+		ar.goal = ext4_ext_find_goal(inode, path, map->m_lblk);
+	}
+
 	newblock = ext4_mb_new_blocks(handle, &ar, &err);
 	if (!newblock)
 		goto out2;
@@ -4580,8 +4586,8 @@ int ext4_ext_truncate(handle_t *handle, struct inode *inode)
 }
 
 static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
-				  ext4_lblk_t len, loff_t new_size,
-				  int flags)
+				  ext4_lblk_t len, ext4_fsblk_t goal_pblk,
+				  loff_t new_size, int flags)
 {
 	struct inode *inode = file_inode(file);
 	handle_t *handle;
@@ -4603,6 +4609,10 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 	 */
 	if (len <= EXT_UNWRITTEN_MAX_LEN)
 		flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;
+	if (goal_pblk != (ext4_fsblk_t)-1) {
+		map.m_goal_pblk = goal_pblk;
+		flags |= EXT4_GET_BLOCKS_FROM_GOAL;
+	}
 
 	/*
 	 * credits to insert 1 extent into extent tree
@@ -4637,6 +4647,7 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 			break;
 		}
 		map.m_lblk += ret;
+		map.m_goal_pblk += ret;
 		map.m_len = len = len - ret;
 		epos = (loff_t)map.m_lblk << inode->i_blkbits;
 		inode->i_ctime = current_time(inode);
@@ -4746,6 +4757,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 				round_down(offset, 1 << blkbits) >> blkbits,
 				(round_up((offset + len), 1 << blkbits) -
 				 round_down(offset, 1 << blkbits)) >> blkbits,
+				(ext4_fsblk_t)-1,
 				new_size, flags);
 		if (ret)
 			goto out_mutex;
@@ -4778,8 +4790,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 		truncate_pagecache_range(inode, start, end - 1);
 		inode->i_mtime = inode->i_ctime = current_time(inode);
 
-		ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
-					     flags);
+		ret = ext4_alloc_file_blocks(file, lblk, max_blocks,
+					     (ext4_fsblk_t)-1, new_size, flags);
 		up_write(&EXT4_I(inode)->i_mmap_sem);
 		if (ret)
 			goto out_mutex;
@@ -4839,10 +4851,12 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len,
 		    u64 physical)
 {
 	struct inode *inode = file_inode(file);
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	loff_t new_size = 0;
 	unsigned int max_blocks;
 	int ret = 0;
 	int flags;
+	ext4_fsblk_t pblk;
 	ext4_lblk_t lblk;
 	unsigned int blkbits = inode->i_blkbits;
 
@@ -4862,7 +4876,8 @@ long ext4_fallocate(struct file *file, int mode,
 		     FALLOC_FL_INSERT_RANGE))
 		return -EOPNOTSUPP;
 
-	if (physical != (u64)-1)
+	if (((mode & ~FALLOC_FL_KEEP_SIZE) || sbi->s_cluster_ratio > 1) &&
+	    physical != (u64)-1)
 		return -EOPNOTSUPP;
 
 	if (mode & FALLOC_FL_PUNCH_HOLE)
@@ -4883,6 +4898,7 @@ long ext4_fallocate(struct file *file, int mode,
 	trace_ext4_fallocate_enter(inode, offset, len, mode);
 	lblk = offset >> blkbits;
+	pblk = physical == (u64)-1 ? (ext4_fsblk_t)-1 : physical >> blkbits;
 
 	max_blocks = EXT4_MAX_BLOCKS(len, offset, blkbits);
 	flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT;
@@ -4911,7 +4927,8 @@ long ext4_fallocate(struct file *file, int mode,
 	/* Wait all existing dio workers, newcomers will block on i_mutex */
 	inode_dio_wait(inode);
 
-	ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size, flags);
+	ret = ext4_alloc_file_blocks(file, lblk, max_blocks, pblk,
+				     new_size, flags);
 	if (ret)
 		goto out;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index fa0ff78dc033..1054ba65cc1b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -580,6 +580,10 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		return ret;
 	}
 
+	if (retval > 0 && flags & EXT4_GET_BLOCKS_FROM_GOAL &&
+	    map->m_pblk != map->m_goal_pblk)
+		return -EEXIST;
+
 	/* If it is only a block(s) look up */
 	if ((flags & EXT4_GET_BLOCKS_CREATE) == 0)
 		return retval;
@@ -672,6 +676,16 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		}
 	}
 
+	/*
+	 * Concurrent thread could allocate extent with other m_pblk,
+	 * and we got it during second call of ext4_ext_map_blocks().
+	 */
+	if (retval > 0 && flags & EXT4_GET_BLOCKS_FROM_GOAL &&
+	    map->m_pblk != map->m_goal_pblk) {
+		retval = -EEXIST;
+		goto out_sem;
+	}
+
 	/*
 	 * If the extent has been zeroed out, we don't need to update
 	 * extent status tree.
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index b1b3c5526d1a..ed25f47748a0 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3426,6 +3426,8 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 	struct ext4_prealloc_space *pa, *cpa = NULL;
 	ext4_fsblk_t goal_block;
 
+	goal_block = ext4_grp_offs_to_block(ac->ac_sb, &ac->ac_g_ex);
+
 	/* only data can be preallocated */
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
 		return 0;
@@ -3436,7 +3438,11 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 
 		/* all fields in this condition don't change,
 		 * so we can skip locking for them */
-		if (ac->ac_o_ex.fe_logical < pa->pa_lstart ||
+		if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY) &&
+		    (goal_block < pa->pa_pstart ||
+		     goal_block >= pa->pa_pstart + pa->pa_len))
+			continue;
+		else if (ac->ac_o_ex.fe_logical < pa->pa_lstart ||
 		    ac->ac_o_ex.fe_logical >= (pa->pa_lstart +
 					       EXT4_C2B(sbi, pa->pa_len)))
 			continue;
@@ -3465,6 +3471,9 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 	if (!(ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC))
 		return 0;
 
+	if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
+		return 0;
+
 	/* inode may have no locality group for some reason */
 	lg = ac->ac_lg;
 	if (lg == NULL)
@@ -3474,7 +3483,6 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 	/* The max size of hash table is PREALLOC_TB_SIZE */
 	order = PREALLOC_TB_SIZE - 1;
 
-	goal_block = ext4_grp_offs_to_block(ac->ac_sb, &ac->ac_g_ex);
 	/*
 	 * search for the prealloc space that is having
 	 * minimal distance from the goal block.
@@ -4261,8 +4269,11 @@ ext4_mb_initialize_context(struct ext4_allocation_context *ac,
 	/* start searching from the goal */
 	goal = ar->goal;
 	if (goal < le32_to_cpu(es->s_first_data_block) ||
-			goal >= ext4_blocks_count(es))
+			goal >= ext4_blocks_count(es)) {
+		if (ar->flags & EXT4_MB_HINT_GOAL_ONLY)
+			return -EINVAL;
 		goal = le32_to_cpu(es->s_first_data_block);
+	}
 	ext4_get_group_no_and_offset(sb, goal, &group, &block);
 
 	/* set up allocation goals */
This adds support for a physical placement hint to the fallocate2() syscall. When the @physical argument is set for ext4_fallocate(), blocks are allocated only from the [@physical, @physical + len] range; no other blocks are used:

ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len, u64 physical)

If some blocks in that range are already occupied, the syscall returns an error. This is the only difference from fallocate(). As with fallocate(), fewer than @len blocks may be allocated when an error is returned.

Hint blocks are searched for both in preallocations and in ordinary free blocks. Note that ext4_mb_use_preallocated() looks for the hint only in the inode's preallocations. If the desired blocks are not found there, ext4_mb_discard_preallocations() then tries to release group preallocations.

Note that this patch puts the EXT4_MB_HINT_GOAL_ONLY flag to use; it had been unused for years. A new EXT4_GET_BLOCKS_FROM_GOAL flag for ext4_map_blocks() is added; it indicates that struct ext4_map_blocks::m_goal_pblk is valid.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 fs/ext4/ext4.h    |  3 +++
 fs/ext4/extents.c | 31 ++++++++++++++++++++++++-------
 fs/ext4/inode.c   | 14 ++++++++++++++
 fs/ext4/mballoc.c | 17 ++++++++++++++---
 4 files changed, 55 insertions(+), 10 deletions(-)
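[Editor's note: a hedged sketch of how userspace might call the proposed interface follows. __NR_fallocate2 is a placeholder for whatever syscall number the rest of this series assigns, not an existing number, so this will not build against a mainline kernel; @physical is a byte offset on the underlying device, and per the patch the hint is only accepted for mode 0 or FALLOC_FL_KEEP_SIZE on filesystems without bigalloc.]

    #define _GNU_SOURCE
    #include <fcntl.h>      /* FALLOC_FL_KEEP_SIZE */
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Placeholder wrapper for the proposed syscall. */
    static int fallocate2(int fd, int mode, off_t offset, off_t len,
                          uint64_t physical)
    {
            return syscall(__NR_fallocate2, fd, mode, offset, len, physical);
    }

    /* Ask for the file's first 1MiB to land at byte offset 128MiB on disk;
     * the call fails if any block in that range is already in use:
     *
     *     fallocate2(fd, FALLOC_FL_KEEP_SIZE, 0, 1 << 20, 128ULL << 20);
     */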