Message ID: 20120925164335.GZ14582@twin.jikos.cz (mailing list archive)
State: New, archived
On Tue, Sep 25, 2012 at 10:43:36AM -0600, David Sterba wrote:
> On Thu, Sep 20, 2012 at 03:03:06PM -0400, Josef Bacik wrote:
> > I'm going to look at fixing some of the performance issues that crop up
> > because of our reservation system. Before I go and do a whole lot of
> > work I want some feedback. I've done a brain dump here
> > https://btrfs.wiki.kernel.org/index.php/ENOSPC
>
> Thanks for writing it down, much appreciated.
>
> My first and probably naive approach is described in the page, quoting
> here:
>
> "Attempt to address how to flush less stated below. The over-reservation
> of a 4k block can go up to 96k as the worst-case calculation (see above).
> This accounts for splitting the full tree path from the 8th-level root
> down to the leaf, plus the node splits. My question: how often do we need
> to go up to level N+1 from current level N? For levels 0 and 1 it may
> happen within one transaction, maybe not so often for level 2, and with
> exponentially decreasing frequency for the higher levels. Therefore, is
> it possible to check the tree level first and adapt the calculation
> according to that? Let's say we can reduce the 4k reservation size from
> 96k to 32k on average (for a many-gigabyte filesystem), thus increasing
> the space available for reservations by some factor. The expected gain is
> less pressure on the flusher, because more reservations will succeed
> immediately.
> The idea behind this is to make the initial reservation more accurate to
> the current state than blindly overcommitting by some random factor (1/2).
> Another hint to the tree root level may be the usage of the root node:
> e.g. if the root is less than half full, splitting will not happen unless
> there are K concurrent reservations running, where K is proportional to
> overwriting the whole subtree (same exponential decrease with increasing
> level), and this will not be possible within one transaction, or there
> will not be enough space to satisfy all reservations. (This attempts to
> fine-tune the currently hardcoded level of 8 to the best value.) The safe
> value for the level in the calculations would be N+1, i.e. as if all the
> possible splits happen with respect to the current tree height."
>
> implemented as follows on top of next/master, in short:
> * disable overcommit completely
> * do the optimistically best guess for the metadata and reserve only up
>   to the current tree height

So I had tried to do this before; the problem is that when the height changes,
our reserve changes. For things like delalloc we say we have X number of
extents and we reserve that much space, but then when we run delalloc we
re-calculate the metadata size for the X extents we've removed, and that
number could come out differently because the height of the tree would have
changed in the meantime. One thing we could do is store the actual reservation
with the extent in the io_tree, but I think we already use the private field
for something else, so we'd have to add it somewhere else.

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
When testing, please also do something like this:

# create big squashfs image somewhere:
# mksquashfs / /big.img -noappend -no-sparse -e big.img

# then unpack into fresh filesystem with (and no) compression:
# unsquashfs -f -d /subvol /big.img

This is how I was always able to trigger ENOSPC while trying to make a full
system installation from a squashfs image.

You should also try different compression algos (I only use lzo).

Btw: I was able to trigger ENOSPC with for-linus on 3.5.4 on an i686
Pentium M notebook with only 1GB of memory and a fresh FS this way;
otherwise I haven't seen ENOSPC for a long time.

Ahmet

On Tue, Sep 25, 2012 at 7:02 PM, Josef Bacik <jbacik@fusionio.com> wrote:
> [...]
On Wed, Sep 26, 2012 at 01:55:47AM -0600, Ahmet Inan wrote:
> when testing, please also do something like this:
> [...]

Thanks, I will try that now,

Josef
On Wed, Sep 26, 2012 at 01:55:47AM -0600, Ahmet Inan wrote:
> btw: i was able to trigger ENOSPC with for-linus on 3.5.4 on a
> i686 Pentium M notebook with only 1GB of memory and a
> fresh FS this way, otherwise haven't seen ENOSPC for a long time.

Have you actually seen this problem on 3.5 with no compression? I fixed a
problem in the 3.5 timeframe that should have fixed this for the
no-compression case, and I've since fixed the compression part of it in
btrfs-next. Can you retest with btrfs-next and see if you still see the
problem? In the meantime I'm running mksquashfs, but it's going to take a
few years to complete ;).

Thanks,

Josef
Josef,

> Have you actually seen this problem on 3.5 with no compression?

You're right, no problems with no compression.

> problem in the 3.5 timeframe that should have fixed this for no-compression, and
> then I've since fixed the compression part of it in btrfs-next. Can you retest

I cherry-picked on top of 3.5.4 + for-linus all the enospc patches and
dependencies from your btrfs-next, in this order:

ddabdd663aa819f206aaaf689b12475b1b11d71e Btrfs: do not allocate chunks as agressively
dd314c3ef057dde30d1c8ae67c98019a71713ea4 Btrfs: turbo charge fsync
4d9f18a35d456c27e81c8e0de29c858f85d35c91 Btrfs: do not needlessly restart the transaction for enospc
82b8aa2bef5c5267d74ad8e5a82572d4fa85bf89 Btrfs: wait on async pages when shrinking delalloc
4af89eb8ce93a1ec64fc698f82f67a0e3bd8cbef Btrfs: delay block group item insertion

I hope those are the patches you wanted me to test.

> with btrfs-next and see if you still see the problem?

I did, and got some nice results: no problems, and it's faster! These results
are from an old i686 Pentium M notebook with a 30GB HDD and 1.2GB RAM:

3.5.4 + for-linus + lzo: 6x enospc problems at 30%

unsquashfs *snip*
[================|                    ] 269645/910412  29%
Write on output file failed because No space left on device
writer: failed to write data block 0
Failed to write /mnt/point/something, skipping
*snip*

# df -h /mnt/point/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        28G   12G   16G  43% /mnt/point

one run:
real    28m44.882s
user    7m4.542s
sys     4m18.353s

3.5.4 + for-linus + no compression: no problems

# df -h /mnt/point/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        28G   18G  9.0G  67% /mnt/point

one run:
real    20m57.157s
user    5m37.915s
sys     3m23.913s

3.5.4 + for-linus + enospc patches + lzo: no problems

# df -h /mnt/point/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        28G   12G   17G  42% /mnt/point

first run:
real    19m35.852s
user    4m33.359s
sys     3m2.808s

second run:
real    23m12.371s
user    5m37.398s
sys     3m47.525s

3.5.4 + for-linus + enospc patches + no compression: no problems

# df -h /mnt/point/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        28G   18G  9.5G  66% /mnt/point

one run:
real    22m56.940s
user    4m53.571s
sys     3m4.188s

Great work. I'm going to keep those patches and deploy them on my systems
here soon, after some more testing.

Ahmet
On Tue, Sep 25, 2012 at 01:02:39PM -0400, Josef Bacik wrote:
> > implemented as follows on top of next/master, in short:
> > * disable overcommit completely
> > * do the optimistically best guess for the metadata and reserve only up
> >   to the current tree height
>
> So I had tried to do this before, the problem is when height changes our
> reserve changes. So for things like delalloc we say we have X number of
> extents and we reserve that much space, but then when we run delalloc we
> re-calculate the metadata size for X number extents we've removed and that
> number could come out differently since the height of the tree would have
> changed.

Understood. I did not come across this during testing because I did not
reproduce the required conditions, which do not seem frequent but definitely
are possible. Let me proceed with V2:

* disable overcommit completely
* reserve only up to the current tree height + 1

You can object that the tree height may grow by 2, namely at the early stage
from 0 -> 2, if the number of operations is quite high and all of this
finishes within one transaction. So, an improvement that also covers that:

* reserve up to MIN(BTRFS_MAX_LEVEL, MAX(tree->height + 1, 3))

with the assumption that growing the tree from 3 -> 4 will take more
operations than could fit into one transaction period, even on a fast device.
(More accurate analysis required, but this should give the idea.)

The estimated savings in metadata block reservations start at 3/8 (and
decrease as the filesystem fills up). This estimate does not have the
property of being 'almost exact' like the V1 proposal, so some sort of
overcommit math should also take place. I leave this out for now.

thanks,
david
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2794,7 +2794,11 @@ static inline gfp_t btrfs_alloc_write_mask(struct address_space *mapping)
 static inline u64 btrfs_calc_trans_metadata_size(struct btrfs_root *root,
 						 unsigned num_items)
 {
-	return (root->leafsize + root->nodesize * (BTRFS_MAX_LEVEL - 1)) *
+	int level = btrfs_header_level(root->node);
+
+	level = min(level, BTRFS_MAX_LEVEL);
+
+	return (root->leafsize + root->nodesize * (level - 1)) *
 		3 * num_items;
 }
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index efb044e..c9fa7ed 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3649,6 +3649,8 @@ static int can_overcommit(struct btrfs_root *root,
 	u64 avail;
 	u64 used;
 
+	return 0;
+
 	used = space_info->bytes_used + space_info->bytes_reserved +
 		space_info->bytes_pinned + space_info->bytes_readonly +
 		space_info->bytes_may_use;