Message ID | 20230225010927.813929-1-slava@dubeyko.com (mailing list archive)
---|---
Series | SSDFS: flash-friendly LFS file system for ZNS SSD
> Benchmarking results show that SSDFS is capable:

Is there performance data showing IOPS?

These comparisons include file systems that don't support zoned devices
natively, maybe that's why IOPS comparisons cannot be made?

> (3) decrease the write amplification factor compared with:
> 1.3x - 116x (ext4),
> 14x - 42x (xfs),
> 6x - 9x (btrfs),
> 1.5x - 50x (f2fs),
> 1.2x - 20x (nilfs2);
> (4) prolong SSD lifetime compared with:

Is this measuring how many times blocks are erased? I guess this
measurement includes the background I/O from ssdfs migration and moving?

> 1.4x - 7.8x (ext4),
> 15x - 60x (xfs),
> 6x - 12x (btrfs),
> 1.5x - 7x (f2fs),
> 1x - 4.6x (nilfs2).

Thanks,
Stefan
> On Feb 27, 2023, at 5:53 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
>> Benchmarking results show that SSDFS is capable:
>
> Is there performance data showing IOPS?
>

Yeah, I completely see your point. :) Everybody would like to see the
performance estimation. My first goal was to check how much SSDFS can
prolong SSD lifetime and decrease write amplification. So, I used blktrace
output to estimate the write amplification factor and lifetime prolongation
and to compare various file systems.

The blktrace output also contains timestamp information, and while comparing
timestamps I realized that, currently, the policy of distributing the offset
translation table among logs hurts the read performance of SSDFS. So, if I
compare the whole cycle between mount and unmount (read + write I/O), then
SSDFS can be faster than NILFS2 and XFS but slower than ext4, btrfs, and
F2FS. However, the write I/O path should be faster because SSDFS can
decrease the number of write I/O requests. I am changing the offset
translation table's distribution policy (the patch is under testing) and I
expect it to improve SSDFS read performance significantly. So, performance
benchmarking will make sense after this fix.

> These comparisons include file systems that don't support zoned devices
> natively, maybe that's why IOPS comparisons cannot be made?
>

A performance comparison can be made for conventional SSD devices. Of
course, ZNS SSD has some peculiarities (limited number of open/active zones,
zone size, write pointer, strict append-only mode) and it requires a fair
comparison, because these peculiarities/restrictions can help as much as
they can make life more difficult. However, even if we compare file systems
on the same type of storage device, various configuration options (logical
block size, erase block size, segment size, and so on) or a particular
workload can significantly change a file system's behavior. It is never a
simple statement that one file system is faster than another.

>> (3) decrease the write amplification factor compared with:
>> 1.3x - 116x (ext4),
>> 14x - 42x (xfs),
>> 6x - 9x (btrfs),
>> 1.5x - 50x (f2fs),
>> 1.2x - 20x (nilfs2);
>> (4) prolong SSD lifetime compared with:
>
> Is this measuring how many times blocks are erased? I guess this
> measurement includes the background I/O from ssdfs migration and moving?
>

First of all, I need to explain the testing methodology. Testing included:
(1) create file (empty, 64 bytes, 16K, 100K), (2) update file, (3) delete
file. Every test-case was executed as a sequence of multiple mount/unmount
operations. For example, the total number of file creation operations was
1000 or 10000, but one mount cycle included 10, 100, or 1000 file creation,
file update, or file delete operations. Finally, the file system must flush
all dirty metadata and user data during the unmount operation.

The blktrace tool registers the LBA and size of every I/O request. These
data are the basis for estimating how many erase blocks have been involved
in the operations. SSDFS volumes were created with 128KB, 512KB, and 8MB
erase block sizes, so I used these erase block sizes for the estimation.
Generally speaking, we can estimate the total number of erase blocks
involved in file system operations for a particular use-case by summing the
number of bytes of all I/O requests and dividing by the erase block size.
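For illustration, a minimal sketch of that calculation in Python (not SSDFS
tooling; the writes list and helper name below are hypothetical stand-ins
for blktrace records reduced to (LBA, size-in-bytes) pairs):

# Estimate how many erase blocks' worth of data a test run produced,
# for each of the erase block sizes used in the comparison.
ERASE_BLOCK_SIZES = {"128KB": 128 * 1024, "512KB": 512 * 1024, "8MB": 8 * 1024 * 1024}

def erase_blocks_involved(writes, erase_block_size):
    # writes: iterable of (lba, size_in_bytes) records taken from blktrace output
    total_bytes = sum(size for _, size in writes)
    return total_bytes / erase_block_size

writes = [(32, 131072), (288, 65536), (1024, 4096)]  # hypothetical example records
for name, ebs in ERASE_BLOCK_SIZES.items():
    print(name, erase_blocks_involved(writes, ebs))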
If a file system uses in-place updates, then it is possible to estimate how
many times the same erase block (we know the LBA numbers) has been
completely re-written. For example, if an erase block (starting from LBA
#32) received 1310720 bytes of write I/O requests, then a 128KB erase block
has been re-written 10 times. It means that the FTL needs to store all these
data into 10 x 128KB erase blocks in the background, or execute around 9
erase operations to keep the actual state of the data in one 128KB erase
block. So, this is the estimation of the FTL GC responsibility. However, if
we would like to estimate the total number of erase operations, then we need
to take into account:

E_total = E(FTL GC) + E(TRIM) + E(FS GC) + E(read disturbance) + E(retention)
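A toy sketch of this part of the model, assuming blktrace LBAs in 512-byte
sectors and aligned erase blocks; the helper only illustrates the
arithmetic, it is not how the published numbers were produced:

SECTOR_SIZE = 512  # assumption: blktrace reports LBAs in 512-byte sectors

def rewrites_per_erase_block(writes, erase_block_size):
    # writes: iterable of (lba, size_in_bytes) write records for one device
    bytes_per_block = {}
    for lba, size in writes:
        idx = (lba * SECTOR_SIZE) // erase_block_size
        bytes_per_block[idx] = bytes_per_block.get(idx, 0) + size
    # full re-writes of each erase block's worth of data
    return {idx: total // erase_block_size for idx, total in bytes_per_block.items()}

# The example above: 1310720 bytes hitting one 128KB erase block gives
# 10 full re-writes, i.e. roughly 9 background erase operations (the FTL GC term).
rewrites = rewrites_per_erase_block([(32, 1310720)], 128 * 1024)
e_ftl_gc = sum(max(n - 1, 0) for n in rewrites.values())

# The remaining terms of E_total (TRIM, FS GC, read disturbance, retention)
# are added separately; the last two are negligible for such short tests,
# per the discussion below.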
The estimation of erase operations caused by the retention issue is tricky,
and it shows a negligibly small number for such short testing, so we can
ignore it. However, the retention issue is an important factor in decreasing
SSD lifetime. I executed the estimation of this factor and made a comparison
for various file systems. But this factor deeply depends on time, workload,
and payload size, so it is really hard to share any stable and reasonable
numbers for it; especially, it heavily depends on the FTL implementation.

It is possible to estimate read disturbance but, again, it heavily depends
on the NAND flash type, organization, and FTL algorithms. Also, this
estimation shows really small numbers that can be ignored for short testing.
I have made this estimation and I can see that, currently, SSDFS has a
read-intensive nature because of the offset translation table distribution
policy. I am testing the fix and I hope to remove this issue.

SSDFS has an efficient TRIM/erase policy, so I can see TRIM/erase operations
even for such "short" test-cases. As far as I can see, no other file system
issues discard operations for the same test-cases. I included TRIM/erase
operations in the calculation of the total number of erase operations.

Estimation of GC operations on the FS side (F2FS, NILFS2) is the most
speculative one. I have estimated the number of erase operations that the FS
GC can generate. However, as far as I can see, even without taking the FS GC
erase operations into account, SSDFS looks better compared with F2FS and
NILFS2. I need to add here that SSDFS uses a migration scheme and doesn't
need classical GC. But even for such "short" test-cases the migration scheme
shows a really efficient TRIM/erase policy.

So, the write amplification factor was estimated on the basis of comparing
write I/O requests, and SSD lifetime prolongation has been estimated and
compared by using the model explained above. I hope I explained it clearly
enough. Feel free to ask additional questions if I missed something.

The measurement includes all operations (foreground and background) that the
file system initiates, because of the mount/unmount model. However, the
migration scheme requires additional explanation. Generally speaking, the
migration scheme doesn't generate additional I/O requests; on the contrary,
it decreases the number of I/O requests. It can be tricky to follow. SSDFS
uses compression, delta-encoding, a compaction scheme, and migration
stimulation. It means that regular file system update operations are the
main vehicle of the migration scheme.

Imagine that an application updates a 4KB logical block. SSDFS tries to
compress (or delta-encode) this piece of data. Let's say compression gives
us a 1KB compressed piece of data (4KB uncompressed size). It means that we
can place this 1KB into a 4KB memory page and still have 3KB of free space.
So, the migration logic checks whether the exhausted (completely full) old
erase block that received the update operation has other valid logical
block(s). If it has such valid logical blocks, then we can compress these
logical blocks and store them in the free space of the 4KB memory page. So,
we can finally store, for example, 4 compressed logical blocks (1KB in size
each) into one 4KB memory page. It means that SSDFS issues one I/O request
for 4 logical blocks instead of 4 requests. I simplify the explanation, but
the idea remains the same. I hope I clarified the point. Feel free to ask
additional questions if I missed something.
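To make the packing idea concrete, here is a toy sketch; the 4KB page, the
1KB compressed sizes, and the greedy packing below are illustrative only and
are not the actual SSDFS on-disk logic:

PAGE_SIZE = 4096  # one memory page == one write I/O request in this toy model

def pack_compressed_blocks(compressed_blocks, page_size=PAGE_SIZE):
    # Greedily pack compressed logical blocks into pages; each page is one write I/O.
    pages, current, used = [], [], 0
    for blk in compressed_blocks:
        if used + len(blk) > page_size:
            pages.append(current)
            current, used = [], 0
        current.append(blk)
        used += len(blk)
    if current:
        pages.append(current)
    return pages

# Four logical blocks, each compressed from 4KB down to about 1KB,
# fit into a single page: one write I/O request instead of four.
blocks = [b"x" * 1024 for _ in range(4)]
assert len(pack_compressed_blocks(blocks)) == 1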
Thanks,
Slava.

>> 1.4x - 7.8x (ext4),
>> 15x - 60x (xfs),
>> 6x - 12x (btrfs),
>> 1.5x - 7x (f2fs),
>> 1x - 4.6x (nilfs2).
>
> Thanks,
> Stefan

On Mon, Feb 27, 2023 at 02:59:08PM -0800, Viacheslav A.Dubeyko wrote:
> > On Feb 27, 2023, at 5:53 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > These comparisons include file systems that don't support zoned devices
> > natively, maybe that's why IOPS comparisons cannot be made?
>
> A performance comparison can be made for conventional SSD devices.
> [...]

I incorrectly assumed ssdfs was only for zoned devices.

> [...]
> I hope I clarified the point. Feel free to ask additional questions if I
> missed something.
Thanks for these explanations, that clarifies things!

Stefan