
[5/8] btrfs: zoned: activate metadata block group at write time

Message ID 2f8e30f966cfd9e06b0745a691bd1a5566aab780.1690171333.git.naohiro.aota@wdc.com (mailing list archive)
State New, archived
Series btrfs: zoned: write-time activation of metadata block group

Commit Message

Naohiro Aota July 24, 2023, 4:18 a.m. UTC
In the current implementation, block groups are activated at reservation
time to ensure that all reserved bytes can be written to an active metadata
block group. However, this approach has proven inefficient, as it
activates block groups more frequently than necessary, putting pressure on
the active zone resource and leading to issues such as early ENOSPC or
hung_task.

Another drawback of the current method is that it hampers metadata
over-commit and necessitates additional flush operations and block group
allocations, reducing overall performance.

To address these issues, this commit introduces write-time activation of
metadata and system block groups. This involves reserving at least one
active block group each for metadata and for system use.

Since metadata write-out is always allocated sequentially, when we need to
write to a non-active block group, we can wait for the ongoing IOs to
complete, activate a new block group, and then proceed with writing to the
new block group.
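
In rough pseudo-code, the write path then behaves like this (a simplified
sketch of the logic added to btrfs_check_meta_write_pointer() below;
locking and the tree-log special case are omitted, and old_bg/new_bg are
placeholder names):

	/* Sketch: pivot to a new active block group at write time. */
	if (old_bg) {
		/* Wait for in-flight extent buffer writeback on the old group. */
		wait_eb_writebacks(old_bg);
		/* Finish its zone, releasing one active zone resource. */
		do_zone_finish(old_bg, true);
	}
	/* Now there is an active zone available for the new group. */
	if (!btrfs_zone_activate(new_bg))
		return false;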

Fixes: b09315139136 ("btrfs: zoned: activate metadata block group on flush_space")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c | 11 +++++++
 fs/btrfs/extent_io.c   |  2 +-
 fs/btrfs/fs.h          |  3 ++
 fs/btrfs/zoned.c       | 72 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h       |  1 +
 5 files changed, 88 insertions(+), 1 deletion(-)

Comments

kernel test robot July 24, 2023, 6:24 a.m. UTC | #1
Hi Naohiro,

kernel test robot noticed the following build errors:

[auto build test ERROR on kdave/for-next]
[also build test ERROR on linus/master v6.5-rc3 next-20230721]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Naohiro-Aota/btrfs-zoned-introduce-block_group-context-for-submit_eb_page/20230724-122053
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
patch link:    https://lore.kernel.org/r/2f8e30f966cfd9e06b0745a691bd1a5566aab780.1690171333.git.naohiro.aota%40wdc.com
patch subject: [PATCH 5/8] btrfs: zoned: activate metadata block group on write time
config: parisc-randconfig-r023-20230724 (https://download.01.org/0day-ci/archive/20230724/202307241420.4nG9x5F7-lkp@intel.com/config)
compiler: hppa-linux-gcc (GCC) 12.3.0
reproduce: (https://download.01.org/0day-ci/archive/20230724/202307241420.4nG9x5F7-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202307241420.4nG9x5F7-lkp@intel.com/

All errors (new ones prefixed by >>):

   fs/btrfs/extent_io.c: In function 'submit_eb_page':
>> fs/btrfs/extent_io.c:1827:58: error: passing argument 2 of 'btrfs_check_meta_write_pointer' from incompatible pointer type [-Werror=incompatible-pointer-types]
    1827 |         if (!btrfs_check_meta_write_pointer(eb->fs_info, wbc, eb, bg_context)) {
         |                                                          ^~~
         |                                                          |
         |                                                          struct writeback_control *
   In file included from fs/btrfs/extent_io.c:30:
   fs/btrfs/zoned.h:192:54: note: expected 'struct extent_buffer *' but argument is of type 'struct writeback_control *'
     192 |                                struct extent_buffer *eb,
         |                                ~~~~~~~~~~~~~~~~~~~~~~^~
   fs/btrfs/extent_io.c:1827:63: error: passing argument 3 of 'btrfs_check_meta_write_pointer' from incompatible pointer type [-Werror=incompatible-pointer-types]
    1827 |         if (!btrfs_check_meta_write_pointer(eb->fs_info, wbc, eb, bg_context)) {
         |                                                               ^~
         |                                                               |
         |                                                               struct extent_buffer *
   fs/btrfs/zoned.h:193:59: note: expected 'struct btrfs_block_group **' but argument is of type 'struct extent_buffer *'
     193 |                                struct btrfs_block_group **cache_ret)
         |                                ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~
>> fs/btrfs/extent_io.c:1827:14: error: too many arguments to function 'btrfs_check_meta_write_pointer'
    1827 |         if (!btrfs_check_meta_write_pointer(eb->fs_info, wbc, eb, bg_context)) {
         |              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/btrfs/zoned.h:191:20: note: declared here
     191 | static inline bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
         |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +/btrfs_check_meta_write_pointer +1827 fs/btrfs/extent_io.c

  1766	
  1767	/*
  1768	 * Submit all page(s) of one extent buffer.
  1769	 *
  1770	 * @page:	the page of one extent buffer
  1771	 * @eb_context:	to determine if we need to submit this page, if current page
  1772	 *		belongs to this eb, we don't need to submit
  1773	 *
  1774	 * The caller should pass each page in their bytenr order, and here we use
  1775	 * @eb_context to determine if we have submitted pages of one extent buffer.
  1776	 *
  1777	 * If we have, we just skip until we hit a new page that doesn't belong to
  1778	 * current @eb_context.
  1779	 *
  1780	 * If not, we submit all the page(s) of the extent buffer.
  1781	 *
  1782	 * Return >0 if we have submitted the extent buffer successfully.
  1783	 * Return 0 if we don't need to submit the page, as it's already submitted by
  1784	 * previous call.
  1785	 * Return <0 for fatal error.
  1786	 */
  1787	static int submit_eb_page(struct page *page, struct writeback_control *wbc,
  1788				  struct extent_buffer **eb_context,
  1789				  struct btrfs_block_group **bg_context)
  1790	{
  1791		struct address_space *mapping = page->mapping;
  1792		struct extent_buffer *eb;
  1793		int ret;
  1794	
  1795		if (!PagePrivate(page))
  1796			return 0;
  1797	
  1798		if (btrfs_sb(page->mapping->host->i_sb)->nodesize < PAGE_SIZE)
  1799			return submit_eb_subpage(page, wbc);
  1800	
  1801		spin_lock(&mapping->private_lock);
  1802		if (!PagePrivate(page)) {
  1803			spin_unlock(&mapping->private_lock);
  1804			return 0;
  1805		}
  1806	
  1807		eb = (struct extent_buffer *)page->private;
  1808	
  1809		/*
  1810		 * Shouldn't happen and normally this would be a BUG_ON but no point
  1811		 * crashing the machine for something we can survive anyway.
  1812		 */
  1813		if (WARN_ON(!eb)) {
  1814			spin_unlock(&mapping->private_lock);
  1815			return 0;
  1816		}
  1817	
  1818		if (eb == *eb_context) {
  1819			spin_unlock(&mapping->private_lock);
  1820			return 0;
  1821		}
  1822		ret = atomic_inc_not_zero(&eb->refs);
  1823		spin_unlock(&mapping->private_lock);
  1824		if (!ret)
  1825			return 0;
  1826	
> 1827		if (!btrfs_check_meta_write_pointer(eb->fs_info, wbc, eb, bg_context)) {
  1828			/*
  1829			 * If for_sync, this hole will be filled with
  1830			 * trasnsaction commit.
  1831			 */
  1832			if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync)
  1833				ret = -EAGAIN;
  1834			else
  1835				ret = 0;
  1836			free_extent_buffer(eb);
  1837			return ret;
  1838		}
  1839	
  1840		*eb_context = eb;
  1841	
  1842		if (!lock_extent_buffer_for_io(eb, wbc)) {
  1843			free_extent_buffer(eb);
  1844			return 0;
  1845		}
  1846		if (*bg_context) {
  1847			/* Implies write in zoned mode. */
  1848			struct btrfs_block_group *bg = *bg_context;
  1849	
  1850			/* Mark the last eb in the block group. */
  1851			btrfs_schedule_zone_finish_bg(bg, eb);
  1852			bg->meta_write_pointer += eb->len;
  1853		}
  1854		write_one_eb(eb, wbc);
  1855		free_extent_buffer(eb);
  1856		return 1;
  1857	}
  1858
kernel test robot July 24, 2023, 7:36 a.m. UTC | #2
Hi Naohiro,

kernel test robot noticed the following build errors:

[auto build test ERROR on kdave/for-next]
[also build test ERROR on linus/master v6.5-rc3 next-20230724]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Naohiro-Aota/btrfs-zoned-introduce-block_group-context-for-submit_eb_page/20230724-122053
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
patch link:    https://lore.kernel.org/r/2f8e30f966cfd9e06b0745a691bd1a5566aab780.1690171333.git.naohiro.aota%40wdc.com
patch subject: [PATCH 5/8] btrfs: zoned: activate metadata block group on write time
config: x86_64-rhel-8.3-rust (https://download.01.org/0day-ci/archive/20230724/202307241530.1g5O8gwF-lkp@intel.com/config)
compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project.git 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
reproduce: (https://download.01.org/0day-ci/archive/20230724/202307241530.1g5O8gwF-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202307241530.1g5O8gwF-lkp@intel.com/

All errors (new ones prefixed by >>):

   fs/btrfs/extent_io.c:214:16: warning: variable 'pages_processed' set but not used [-Wunused-but-set-variable]
           unsigned long pages_processed = 0;
                         ^
>> fs/btrfs/extent_io.c:1827:60: error: too many arguments to function call, expected 3, have 4
           if (!btrfs_check_meta_write_pointer(eb->fs_info, wbc, eb, bg_context)) {
                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                       ^~~~~~~~~~
   fs/btrfs/zoned.h:191:20: note: 'btrfs_check_meta_write_pointer' declared here
   static inline bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
                      ^
   1 warning and 1 error generated.


kernel test robot July 24, 2023, 7:57 a.m. UTC | #3
Hi Naohiro,

kernel test robot noticed the following build errors:

[auto build test ERROR on kdave/for-next]
[also build test ERROR on linus/master v6.5-rc3 next-20230724]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Naohiro-Aota/btrfs-zoned-introduce-block_group-context-for-submit_eb_page/20230724-122053
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
patch link:    https://lore.kernel.org/r/2f8e30f966cfd9e06b0745a691bd1a5566aab780.1690171333.git.naohiro.aota%40wdc.com
patch subject: [PATCH 5/8] btrfs: zoned: activate metadata block group on write time
config: arm-randconfig-r002-20230724 (https://download.01.org/0day-ci/archive/20230724/202307241541.8w52nEnt-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce: (https://download.01.org/0day-ci/archive/20230724/202307241541.8w52nEnt-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202307241541.8w52nEnt-lkp@intel.com/

All errors (new ones prefixed by >>):

   fs/btrfs/extent_io.c:214:16: warning: variable 'pages_processed' set but not used [-Wunused-but-set-variable]
     214 |         unsigned long pages_processed = 0;
         |                       ^
>> fs/btrfs/extent_io.c:1827:60: error: too many arguments to function call, expected 3, have 4
    1827 |         if (!btrfs_check_meta_write_pointer(eb->fs_info, wbc, eb, bg_context)) {
         |              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                       ^~~~~~~~~~
   fs/btrfs/zoned.h:191:20: note: 'btrfs_check_meta_write_pointer' declared here
     191 | static inline bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
         |                    ^
   1 warning and 1 error generated.


Christoph Hellwig July 24, 2023, 3:12 p.m. UTC | #4
On Mon, Jul 24, 2023 at 01:18:34PM +0900, Naohiro Aota wrote:
> Since metadata write-out is always allocated sequentially, when we need to
> write to a non-active block group, we can wait for the ongoing IOs to
> complete, activate a new block group, and then proceed with writing to the
> new block group.

Somewhat unrelated, but isn't the same true for data as well, and maybe
in a follow-on the same activation policy should apply there as well?
I guess it won't quite work without much work as reservations are tied
to a BG, but it would seem like the better scheme.

> +	if (test_bit(BTRFS_FS_ACTIVE_ZONE_TRACKING, &fs_info->flags)) {

Just a more or less cosmetic nit:  there is a lot of code in this
branch, and I wonder if the active zone tracking code would benefit
from being split out into one or more helpers.

> +		bool is_system = cache->flags & BTRFS_BLOCK_GROUP_SYSTEM;
> +
> +		spin_lock(&cache->lock);
> +		if (test_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
> +			     &cache->runtime_flags)) {
> +			spin_unlock(&cache->lock);
> +			return true;
> +		}
> +
> +		spin_unlock(&cache->lock);

What is the point of the cache->lock critical section here?  The
information can be out of date as soon as you drop the lock, so it
looks superfluous to me.

> +		if (fs_info->treelog_bg == cache->start) {
> +			if (!btrfs_zone_activate(cache)) {
> +				int ret_fin = btrfs_zone_finish_one_bg(fs_info);
> +
> +				if (ret_fin != 1 || !btrfs_zone_activate(cache))
> +					return false;

The ret_fin variable here doesn't seem to be actually needed.
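
E.g., folding the call into the condition (untested sketch):

	if (!btrfs_zone_activate(cache)) {
		if (btrfs_zone_finish_one_bg(fs_info) != 1 ||
		    !btrfs_zone_activate(cache))
			return false;
	}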

> +			}
> +		} else if ((!is_system && fs_info->active_meta_bg != cache) ||
> +			   (is_system && fs_info->active_system_bg != cache)) {
> +			struct btrfs_block_group *tgt = is_system ?
> +				fs_info->active_system_bg : fs_info->active_meta_bg;

There's a lot of checks for is_system here and later on in the
logic.  If we had a helper for this, you could just pass in a bg double
pointer argument that the callers sets to &fs_info->active_system_bg or
&fs_info->active_meta_bg and simplify a lot of the logic.
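
For illustration, such a helper might look roughly like this (untested
sketch; the name activate_tracked_bg is made up, and the unsent-IO check
and the zoned_meta_io_lock dance around the wait/finish are omitted):

	static bool activate_tracked_bg(struct btrfs_fs_info *fs_info,
					struct btrfs_block_group *cache,
					struct btrfs_block_group **active_bg)
	{
		struct btrfs_block_group *tgt = *active_bg;

		if (tgt == cache)
			return true;
		if (tgt) {
			/* Pivot: drain and finish the currently tracked group. */
			wait_eb_writebacks(tgt);
			do_zone_finish(tgt, true);
			btrfs_put_block_group(tgt);
			*active_bg = NULL;
		}
		if (!btrfs_zone_activate(cache))
			return false;
		btrfs_get_block_group(cache);
		*active_bg = cache;
		return true;
	}

	/* The caller picks the pointer, removing most is_system branching: */
	if (!activate_tracked_bg(fs_info, cache,
				 is_system ? &fs_info->active_system_bg
					   : &fs_info->active_meta_bg))
		return false;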
Naohiro Aota July 25, 2023, 9:28 a.m. UTC | #5
On Mon, Jul 24, 2023 at 08:12:28AM -0700, Christoph Hellwig wrote:
> On Mon, Jul 24, 2023 at 01:18:34PM +0900, Naohiro Aota wrote:
> > Since metadata write-out is always allocated sequentially, when we need to
> > write to a non-active block group, we can wait for the ongoing IOs to
> > complete, activate a new block group, and then proceed with writing to the
> > new block group.
> 
> Somewhat unrelated, but isn't the same true for data as well, and maybe
> in a follow-on the same activation policy should apply there as well?
> I guess it won't quite work without much work as reservations are tied
> to a BG, but it would seem like the better scheme.

Yeah, we may be able to apply the write-time activation to data as
well. But, there are some differences between data and metadata which make
it less useful there:

- Data allocation and write-out happen in the same context under
  run_delalloc_zoned(), so they are close together in time
  - Metadata, by contrast, is allocated while building a B-tree and
    written out on a transaction commit
- There is no over-commit for data in the first place

> > +	if (test_bit(BTRFS_FS_ACTIVE_ZONE_TRACKING, &fs_info->flags)) {
> 
> Just a more or less cosmetic nit:  there is a lot of code in this
> branch, and I wonder if the active zone tracking code would benefit
> from being split out into one or more helpers.

Yeah, I'll create a helper.

> > +		bool is_system = cache->flags & BTRFS_BLOCK_GROUP_SYSTEM;
> > +
> > +		spin_lock(&cache->lock);
> > +		if (test_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
> > +			     &cache->runtime_flags)) {
> > +			spin_unlock(&cache->lock);
> > +			return true;
> > +		}
> > +
> > +		spin_unlock(&cache->lock);
> 
> What is the point of the cache->lock critical section here?  The
> information can be out of date as soon as you drop the lock, so it
> looks superfluous to me.

A block group's activeness moves from non-active to active, and finally to
finished. So, if it is "ACTIVE" here, the block group is still going to be
either active or finished. It is fine if it is active. If it is finished,
the IO will fail anyway, so that is a bug in itself.

> > +		if (fs_info->treelog_bg == cache->start) {
> > +			if (!btrfs_zone_activate(cache)) {
> > +				int ret_fin = btrfs_zone_finish_one_bg(fs_info);
> > +
> > +				if (ret_fin != 1 || !btrfs_zone_activate(cache))
> > +					return false;
> 
> The ret_fin variable here doesn't seem to be actually needed.

Indeed.

> > +			}
> > +		} else if ((!is_system && fs_info->active_meta_bg != cache) ||
> > +			   (is_system && fs_info->active_system_bg != cache)) {
> > +			struct btrfs_block_group *tgt = is_system ?
> > +				fs_info->active_system_bg : fs_info->active_meta_bg;
> 
> There's a lot of checks for is_system here and later on in the
> logic.  If we had a helper for this, you could just pass in a bg double
> pointer argument that the callers sets to &fs_info->active_system_bg or
> &fs_info->active_meta_bg and simplify a lot of the logic.
> 

Sure.
Christoph Hellwig July 26, 2023, 1:10 p.m. UTC | #6
On Tue, Jul 25, 2023 at 09:28:09AM +0000, Naohiro Aota wrote:
> > > +		bool is_system = cache->flags & BTRFS_BLOCK_GROUP_SYSTEM;
> > > +
> > > +		spin_lock(&cache->lock);
> > > +		if (test_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
> > > +			     &cache->runtime_flags)) {
> > > +			spin_unlock(&cache->lock);
> > > +			return true;
> > > +		}
> > > +
> > > +		spin_unlock(&cache->lock);
> > 
> > What is the point of the cache->lock critical section here?  The
> > information can be out of date as soon as you drop the lock, so it
> > looks superfluous to me.
> 
> The block group's activeness starts from non-active, then activated, and
> finally finished. So, if its "ACTIVE" here, the block group is going to be
> still active or finished. It's OK it is active. If it is finished, the IO
> will fail anyway, so it is a bug itself.

Yes, but what I mean is that there is no point in taking cache->lock
here.  The only thing that's done under it is testing a single bit,
and the information from that bit is only used after dropping the
lock.  So there should be no need to take the lock in the first place.
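
I.e., the whole critical section could be reduced to a plain bit test
(sketch):

	/*
	 * No lock needed: the zone state only moves forward (non-active ->
	 * active -> finished), and the result is acted on after the lock
	 * would have been dropped anyway.
	 */
	if (test_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &cache->runtime_flags))
		return true;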

Patch

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 91d38f38c1e7..75f8482f45e5 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -4274,6 +4274,17 @@  int btrfs_free_block_groups(struct btrfs_fs_info *info)
 	struct btrfs_caching_control *caching_ctl;
 	struct rb_node *n;
 
+	if (btrfs_is_zoned(info)) {
+		if (info->active_meta_bg) {
+			btrfs_put_block_group(info->active_meta_bg);
+			info->active_meta_bg = NULL;
+		}
+		if (info->active_system_bg) {
+			btrfs_put_block_group(info->active_system_bg);
+			info->active_system_bg = NULL;
+		}
+	}
+
 	write_lock(&info->block_group_cache_lock);
 	while (!list_empty(&info->caching_block_groups)) {
 		caching_ctl = list_entry(info->caching_block_groups.next,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 46a0b5357009..3f104559c0cc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1894,7 +1894,7 @@  static int submit_eb_page(struct page *page, struct writeback_control *wbc,
 	if (!ret)
 		return 0;
 
-	if (!btrfs_check_meta_write_pointer(eb->fs_info, eb, bg_context)) {
+	if (!btrfs_check_meta_write_pointer(eb->fs_info, wbc, eb, bg_context)) {
 		/*
 		 * If for_sync, this hole will be filled with
 		 * trasnsaction commit.
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 203d2a267828..1f2d33112106 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -766,6 +766,9 @@  struct btrfs_fs_info {
 	u64 data_reloc_bg;
 	struct mutex zoned_data_reloc_io_lock;
 
+	struct btrfs_block_group *active_meta_bg;
+	struct btrfs_block_group *active_system_bg;
+
 	u64 nr_global_roots;
 
 	spinlock_t zone_active_bgs_lock;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index dbfa49c70c1a..f440853dff1c 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -65,6 +65,9 @@ 
 
 #define SUPER_INFO_SECTORS	((u64)BTRFS_SUPER_INFO_SIZE >> SECTOR_SHIFT)
 
+static void wait_eb_writebacks(struct btrfs_block_group *block_group);
+static int do_zone_finish(struct btrfs_block_group *block_group, bool fully_written);
+
 static inline bool sb_zone_is_full(const struct blk_zone *zone)
 {
 	return (zone->cond == BLK_ZONE_COND_FULL) ||
@@ -1748,6 +1751,7 @@  void btrfs_finish_ordered_zoned(struct btrfs_ordered_extent *ordered)
 }
 
 bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
+				    struct writeback_control *wbc,
 				    struct extent_buffer *eb,
 				    struct btrfs_block_group **cache_ret)
 {
@@ -1779,6 +1783,74 @@  bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 	if (cache->meta_write_pointer != eb->start)
 		return false;
 
+	if (test_bit(BTRFS_FS_ACTIVE_ZONE_TRACKING, &fs_info->flags)) {
+		bool is_system = cache->flags & BTRFS_BLOCK_GROUP_SYSTEM;
+
+		spin_lock(&cache->lock);
+		if (test_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE,
+			     &cache->runtime_flags)) {
+			spin_unlock(&cache->lock);
+			return true;
+		}
+
+		spin_unlock(&cache->lock);
+		if (fs_info->treelog_bg == cache->start) {
+			if (!btrfs_zone_activate(cache)) {
+				int ret_fin = btrfs_zone_finish_one_bg(fs_info);
+
+				if (ret_fin != 1 || !btrfs_zone_activate(cache))
+					return false;
+			}
+		} else if ((!is_system && fs_info->active_meta_bg != cache) ||
+			   (is_system && fs_info->active_system_bg != cache)) {
+			struct btrfs_block_group *tgt = is_system ?
+				fs_info->active_system_bg : fs_info->active_meta_bg;
+
+			/*
+			 * zoned_meta_io_lock protects
+			 * fs_info->active_{meta,system}_bg.
+			 */
+			lockdep_assert_held(&fs_info->zoned_meta_io_lock);
+
+			if (tgt) {
+				/*
+				 * If there is an unsent IO left in the
+				 * allocated area, we cannot wait for them
+				 * as it may cause a deadlock.
+				 */
+				if (tgt->meta_write_pointer < tgt->start + tgt->alloc_offset) {
+					if (wbc->sync_mode == WB_SYNC_NONE ||
+					    (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync))
+						return false;
+				}
+
+				/* Pivot active metadata/system block group. */
+				btrfs_zoned_meta_io_unlock(fs_info);
+				wait_eb_writebacks(tgt);
+				do_zone_finish(tgt, true);
+				btrfs_zoned_meta_io_lock(fs_info);
+				if (is_system && fs_info->active_system_bg == tgt) {
+					btrfs_put_block_group(tgt);
+					fs_info->active_system_bg = NULL;
+				} else if (!is_system && fs_info->active_meta_bg == tgt) {
+					btrfs_put_block_group(tgt);
+					fs_info->active_meta_bg = NULL;
+				}
+			}
+			if (!btrfs_zone_activate(cache))
+				return false;
+			if (is_system && fs_info->active_system_bg != cache) {
+				ASSERT(fs_info->active_system_bg == NULL);
+				fs_info->active_system_bg = cache;
+				btrfs_get_block_group(cache);
+			} else if (!is_system && fs_info->active_meta_bg != cache) {
+				ASSERT(fs_info->active_meta_bg == NULL);
+				fs_info->active_meta_bg = cache;
+				btrfs_get_block_group(cache);
+			}
+		}
+	}
+
 	return true;
 }
 
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 42a4df94dc29..6935e04effdd 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -59,6 +59,7 @@  void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 bool btrfs_use_zone_append(struct btrfs_bio *bbio);
 void btrfs_record_physical_zoned(struct btrfs_bio *bbio);
 bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
+				    struct writeback_control *wbc,
 				    struct extent_buffer *eb,
 				    struct btrfs_block_group **cache_ret);
 int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, u64 length);