diff mbox series

[V3] raid1: ensure write behind bio has less than BIO_MAX_VECS sectors

Message ID 20210823074513.3208278-1-guoqing.jiang@linux.dev (mailing list archive)
State New, archived
Headers show
Series [V3] raid1: ensure write behind bio has less than BIO_MAX_VECS sectors | expand

Commit Message

Guoqing Jiang Aug. 23, 2021, 7:45 a.m. UTC
From: Guoqing Jiang <jiangguoqing@kylinos.cn>

We can't split write behind bio with more than BIO_MAX_VECS sectors,
otherwise the below call trace was triggered because we could allocate
oversized write behind bio later.

[ 8.097936] bvec_alloc+0x90/0xc0
[ 8.098934] bio_alloc_bioset+0x1b3/0x260
[ 8.099959] raid1_make_request+0x9ce/0xc50 [raid1]
[ 8.100988] ? __bio_clone_fast+0xa8/0xe0
[ 8.102008] md_handle_request+0x158/0x1d0 [md_mod]
[ 8.103050] md_submit_bio+0xcd/0x110 [md_mod]
[ 8.104084] submit_bio_noacct+0x139/0x530
[ 8.105127] submit_bio+0x78/0x1d0
[ 8.106163] ext4_io_submit+0x48/0x60 [ext4]
[ 8.107242] ext4_writepages+0x652/0x1170 [ext4]
[ 8.108300] ? do_writepages+0x41/0x100
[ 8.109338] ? __ext4_mark_inode_dirty+0x240/0x240 [ext4]
[ 8.110406] do_writepages+0x41/0x100
[ 8.111450] __filemap_fdatawrite_range+0xc5/0x100
[ 8.112513] file_write_and_wait_range+0x61/0xb0
[ 8.113564] ext4_sync_file+0x73/0x370 [ext4]
[ 8.114607] __x64_sys_fsync+0x33/0x60
[ 8.115635] do_syscall_64+0x33/0x40
[ 8.116670] entry_SYSCALL_64_after_hwframe+0x44/0xae

Thanks for the comment from Christoph.

[1]. https://bugs.archlinux.org/task/70992

Reported-by: Jens Stutte <jens@chianterastutte.eu>
Tested-by: Jens Stutte <jens@chianterastutte.eu>
Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
---
V3 change:
1. add comment before test WriteMostly.
2. reduce line length.

V2 change:
1. add checking for write-behind case and relevant comments from Christoph.

 drivers/md/raid1.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

Comments

Christoph Hellwig Aug. 23, 2021, 7:49 a.m. UTC | #1
Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>
kernel test robot Aug. 23, 2021, 12:19 p.m. UTC | #2
Hi Guoqing,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on song-md/md-next]
[also build test ERROR on v5.14-rc7 next-20210820]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Guoqing-Jiang/raid1-ensure-write-behind-bio-has-less-than-BIO_MAX_VECS-sectors/20210823-154653
base:   git://git.kernel.org/pub/scm/linux/kernel/git/song/md.git md-next
config: sh-allmodconfig (attached as .config)
compiler: sh4-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/302a06a55fac4a9fb10f57dc96f6a618f3e853b4
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Guoqing-Jiang/raid1-ensure-write-behind-bio-has-less-than-BIO_MAX_VECS-sectors/20210823-154653
        git checkout 302a06a55fac4a9fb10f57dc96f6a618f3e853b4
        # save the attached .config to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=sh SHELL=/bin/bash drivers/md/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   drivers/md/raid1.c: In function 'raid1_write_request':
>> drivers/md/raid1.c:1393:44: error: 'mirror' undeclared (first use in this function); did you mean 'md_error'?
    1393 |                 if (test_bit(WriteMostly, &mirror->rdev->flags))
         |                                            ^~~~~~
         |                                            md_error
   drivers/md/raid1.c:1393:44: note: each undeclared identifier is reported only once for each function it appears in
   In file included from include/linux/kernel.h:16,
                    from include/linux/list.h:9,
                    from include/linux/preempt.h:11,
                    from include/linux/spinlock.h:51,
                    from include/linux/mmzone.h:8,
                    from include/linux/gfp.h:6,
                    from include/linux/slab.h:15,
                    from drivers/md/raid1.c:26:
>> drivers/md/raid1.c:1477:52: error: 'PAGE_SECTORS' undeclared (first use in this function)
    1477 |                                     BIO_MAX_VECS * PAGE_SECTORS);
         |                                                    ^~~~~~~~~~~~
   include/linux/minmax.h:20:46: note: in definition of macro '__typecheck'
      20 |         (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
         |                                              ^
   include/linux/minmax.h:36:31: note: in expansion of macro '__safe_cmp'
      36 |         __builtin_choose_expr(__safe_cmp(x, y), \
         |                               ^~~~~~~~~~
   include/linux/minmax.h:104:33: note: in expansion of macro '__careful_cmp'
     104 | #define min_t(type, x, y)       __careful_cmp((type)(x), (type)(y), <)
         |                                 ^~~~~~~~~~~~~
   drivers/md/raid1.c:1476:31: note: in expansion of macro 'min_t'
    1476 |                 max_sectors = min_t(int, max_sectors,
         |                               ^~~~~
>> include/linux/minmax.h:36:9: error: first argument to '__builtin_choose_expr' not a constant
      36 |         __builtin_choose_expr(__safe_cmp(x, y), \
         |         ^~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:104:33: note: in expansion of macro '__careful_cmp'
     104 | #define min_t(type, x, y)       __careful_cmp((type)(x), (type)(y), <)
         |                                 ^~~~~~~~~~~~~
   drivers/md/raid1.c:1476:31: note: in expansion of macro 'min_t'
    1476 |                 max_sectors = min_t(int, max_sectors,
         |                               ^~~~~

Kconfig warnings: (for reference only)
   WARNING: unmet direct dependencies detected for SND_ATMEL_SOC_PDC
   Depends on SOUND && !UML && SND && SND_SOC && SND_ATMEL_SOC && HAS_DMA
   Selected by
   - SND_ATMEL_SOC_SSC && SOUND && !UML && SND && SND_SOC && SND_ATMEL_SOC
   - SND_ATMEL_SOC_SSC_PDC && SOUND && !UML && SND && SND_SOC && SND_ATMEL_SOC && ATMEL_SSC


vim +1393 drivers/md/raid1.c

  1320	
  1321	static void raid1_write_request(struct mddev *mddev, struct bio *bio,
  1322					int max_write_sectors)
  1323	{
  1324		struct r1conf *conf = mddev->private;
  1325		struct r1bio *r1_bio;
  1326		int i, disks;
  1327		struct bitmap *bitmap = mddev->bitmap;
  1328		unsigned long flags;
  1329		struct md_rdev *blocked_rdev;
  1330		struct blk_plug_cb *cb;
  1331		struct raid1_plug_cb *plug = NULL;
  1332		int first_clone;
  1333		int max_sectors;
  1334		bool write_behind = false;
  1335	
  1336		if (mddev_is_clustered(mddev) &&
  1337		     md_cluster_ops->area_resyncing(mddev, WRITE,
  1338			     bio->bi_iter.bi_sector, bio_end_sector(bio))) {
  1339	
  1340			DEFINE_WAIT(w);
  1341			for (;;) {
  1342				prepare_to_wait(&conf->wait_barrier,
  1343						&w, TASK_IDLE);
  1344				if (!md_cluster_ops->area_resyncing(mddev, WRITE,
  1345								bio->bi_iter.bi_sector,
  1346								bio_end_sector(bio)))
  1347					break;
  1348				schedule();
  1349			}
  1350			finish_wait(&conf->wait_barrier, &w);
  1351		}
  1352	
  1353		/*
  1354		 * Register the new request and wait if the reconstruction
  1355		 * thread has put up a bar for new requests.
  1356		 * Continue immediately if no resync is active currently.
  1357		 */
  1358		wait_barrier(conf, bio->bi_iter.bi_sector);
  1359	
  1360		r1_bio = alloc_r1bio(mddev, bio);
  1361		r1_bio->sectors = max_write_sectors;
  1362	
  1363		if (conf->pending_count >= max_queued_requests) {
  1364			md_wakeup_thread(mddev->thread);
  1365			raid1_log(mddev, "wait queued");
  1366			wait_event(conf->wait_barrier,
  1367				   conf->pending_count < max_queued_requests);
  1368		}
  1369		/* first select target devices under rcu_lock and
  1370		 * inc refcount on their rdev.  Record them by setting
  1371		 * bios[x] to bio
  1372		 * If there are known/acknowledged bad blocks on any device on
  1373		 * which we have seen a write error, we want to avoid writing those
  1374		 * blocks.
  1375		 * This potentially requires several writes to write around
  1376		 * the bad blocks.  Each set of writes gets it's own r1bio
  1377		 * with a set of bios attached.
  1378		 */
  1379	
  1380		disks = conf->raid_disks * 2;
  1381	 retry_write:
  1382		blocked_rdev = NULL;
  1383		rcu_read_lock();
  1384		max_sectors = r1_bio->sectors;
  1385		for (i = 0;  i < disks; i++) {
  1386			struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);
  1387	
  1388			/*
  1389			 * The write-behind io is only attempted on drives marked as
  1390			 * write-mostly, which means we could allocated write behind
  1391			 * bio later.
  1392			 */
> 1393			if (test_bit(WriteMostly, &mirror->rdev->flags))
  1394				write_behind = true;
  1395	
  1396			if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
  1397				atomic_inc(&rdev->nr_pending);
  1398				blocked_rdev = rdev;
  1399				break;
  1400			}
  1401			r1_bio->bios[i] = NULL;
  1402			if (!rdev || test_bit(Faulty, &rdev->flags)) {
  1403				if (i < conf->raid_disks)
  1404					set_bit(R1BIO_Degraded, &r1_bio->state);
  1405				continue;
  1406			}
  1407	
  1408			atomic_inc(&rdev->nr_pending);
  1409			if (test_bit(WriteErrorSeen, &rdev->flags)) {
  1410				sector_t first_bad;
  1411				int bad_sectors;
  1412				int is_bad;
  1413	
  1414				is_bad = is_badblock(rdev, r1_bio->sector, max_sectors,
  1415						     &first_bad, &bad_sectors);
  1416				if (is_bad < 0) {
  1417					/* mustn't write here until the bad block is
  1418					 * acknowledged*/
  1419					set_bit(BlockedBadBlocks, &rdev->flags);
  1420					blocked_rdev = rdev;
  1421					break;
  1422				}
  1423				if (is_bad && first_bad <= r1_bio->sector) {
  1424					/* Cannot write here at all */
  1425					bad_sectors -= (r1_bio->sector - first_bad);
  1426					if (bad_sectors < max_sectors)
  1427						/* mustn't write more than bad_sectors
  1428						 * to other devices yet
  1429						 */
  1430						max_sectors = bad_sectors;
  1431					rdev_dec_pending(rdev, mddev);
  1432					/* We don't set R1BIO_Degraded as that
  1433					 * only applies if the disk is
  1434					 * missing, so it might be re-added,
  1435					 * and we want to know to recover this
  1436					 * chunk.
  1437					 * In this case the device is here,
  1438					 * and the fact that this chunk is not
  1439					 * in-sync is recorded in the bad
  1440					 * block log
  1441					 */
  1442					continue;
  1443				}
  1444				if (is_bad) {
  1445					int good_sectors = first_bad - r1_bio->sector;
  1446					if (good_sectors < max_sectors)
  1447						max_sectors = good_sectors;
  1448				}
  1449			}
  1450			r1_bio->bios[i] = bio;
  1451		}
  1452		rcu_read_unlock();
  1453	
  1454		if (unlikely(blocked_rdev)) {
  1455			/* Wait for this device to become unblocked */
  1456			int j;
  1457	
  1458			for (j = 0; j < i; j++)
  1459				if (r1_bio->bios[j])
  1460					rdev_dec_pending(conf->mirrors[j].rdev, mddev);
  1461			r1_bio->state = 0;
  1462			allow_barrier(conf, bio->bi_iter.bi_sector);
  1463			raid1_log(mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
  1464			md_wait_for_blocked_rdev(blocked_rdev, mddev);
  1465			wait_barrier(conf, bio->bi_iter.bi_sector);
  1466			goto retry_write;
  1467		}
  1468	
  1469		/*
  1470		 * When using a bitmap, we may call alloc_behind_master_bio below.
  1471		 * alloc_behind_master_bio allocates a copy of the data payload a page
  1472		 * at a time and thus needs a new bio that can fit the whole payload
  1473		 * this bio in page sized chunks.
  1474		 */
  1475		if (write_behind && bitmap)
  1476			max_sectors = min_t(int, max_sectors,
> 1477					    BIO_MAX_VECS * PAGE_SECTORS);
  1478		if (max_sectors < bio_sectors(bio)) {
  1479			struct bio *split = bio_split(bio, max_sectors,
  1480						      GFP_NOIO, &conf->bio_split);
  1481			bio_chain(split, bio);
  1482			submit_bio_noacct(bio);
  1483			bio = split;
  1484			r1_bio->master_bio = bio;
  1485			r1_bio->sectors = max_sectors;
  1486		}
  1487	
  1488		if (blk_queue_io_stat(bio->bi_bdev->bd_disk->queue))
  1489			r1_bio->start_time = bio_start_io_acct(bio);
  1490		atomic_set(&r1_bio->remaining, 1);
  1491		atomic_set(&r1_bio->behind_remaining, 0);
  1492	
  1493		first_clone = 1;
  1494	
  1495		for (i = 0; i < disks; i++) {
  1496			struct bio *mbio = NULL;
  1497			struct md_rdev *rdev = conf->mirrors[i].rdev;
  1498			if (!r1_bio->bios[i])
  1499				continue;
  1500	
  1501			if (first_clone) {
  1502				/* do behind I/O ?
  1503				 * Not if there are too many, or cannot
  1504				 * allocate memory, or a reader on WriteMostly
  1505				 * is waiting for behind writes to flush */
  1506				if (bitmap &&
  1507				    (atomic_read(&bitmap->behind_writes)
  1508				     < mddev->bitmap_info.max_write_behind) &&
  1509				    !waitqueue_active(&bitmap->behind_wait)) {
  1510					alloc_behind_master_bio(r1_bio, bio);
  1511				}
  1512	
  1513				md_bitmap_startwrite(bitmap, r1_bio->sector, r1_bio->sectors,
  1514						     test_bit(R1BIO_BehindIO, &r1_bio->state));
  1515				first_clone = 0;
  1516			}
  1517	
  1518			if (r1_bio->behind_master_bio)
  1519				mbio = bio_clone_fast(r1_bio->behind_master_bio,
  1520						      GFP_NOIO, &mddev->bio_set);
  1521			else
  1522				mbio = bio_clone_fast(bio, GFP_NOIO, &mddev->bio_set);
  1523	
  1524			if (r1_bio->behind_master_bio) {
  1525				if (test_bit(CollisionCheck, &rdev->flags))
  1526					wait_for_serialization(rdev, r1_bio);
  1527				if (test_bit(WriteMostly, &rdev->flags))
  1528					atomic_inc(&r1_bio->behind_remaining);
  1529			} else if (mddev->serialize_policy)
  1530				wait_for_serialization(rdev, r1_bio);
  1531	
  1532			r1_bio->bios[i] = mbio;
  1533	
  1534			mbio->bi_iter.bi_sector	= (r1_bio->sector +
  1535					   conf->mirrors[i].rdev->data_offset);
  1536			bio_set_dev(mbio, conf->mirrors[i].rdev->bdev);
  1537			mbio->bi_end_io	= raid1_end_write_request;
  1538			mbio->bi_opf = bio_op(bio) | (bio->bi_opf & (REQ_SYNC | REQ_FUA));
  1539			if (test_bit(FailFast, &conf->mirrors[i].rdev->flags) &&
  1540			    !test_bit(WriteMostly, &conf->mirrors[i].rdev->flags) &&
  1541			    conf->raid_disks - mddev->degraded > 1)
  1542				mbio->bi_opf |= MD_FAILFAST;
  1543			mbio->bi_private = r1_bio;
  1544	
  1545			atomic_inc(&r1_bio->remaining);
  1546	
  1547			if (mddev->gendisk)
  1548				trace_block_bio_remap(mbio, disk_devt(mddev->gendisk),
  1549						      r1_bio->sector);
  1550			/* flush_pending_writes() needs access to the rdev so...*/
  1551			mbio->bi_bdev = (void *)conf->mirrors[i].rdev;
  1552	
  1553			cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
  1554			if (cb)
  1555				plug = container_of(cb, struct raid1_plug_cb, cb);
  1556			else
  1557				plug = NULL;
  1558			if (plug) {
  1559				bio_list_add(&plug->pending, mbio);
  1560				plug->pending_cnt++;
  1561			} else {
  1562				spin_lock_irqsave(&conf->device_lock, flags);
  1563				bio_list_add(&conf->pending_bio_list, mbio);
  1564				conf->pending_count++;
  1565				spin_unlock_irqrestore(&conf->device_lock, flags);
  1566				md_wakeup_thread(mddev->thread);
  1567			}
  1568		}
  1569	
  1570		r1_bio_write_done(r1_bio);
  1571	
  1572		/* In case raid1d snuck in to freeze_array */
  1573		wake_up(&conf->wait_barrier);
  1574	}
  1575	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Guoqing Jiang Aug. 24, 2021, 12:43 a.m. UTC | #3
Hi,

Thanks for the report!

On 8/23/21 8:19 PM, kernel test robot wrote:
> Hi Guoqing,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on song-md/md-next]
> [also build test ERROR on v5.14-rc7 next-20210820]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
>
> url:    https://github.com/0day-ci/linux/commits/Guoqing-Jiang/raid1-ensure-write-behind-bio-has-less-than-BIO_MAX_VECS-sectors/20210823-154653
> base:   git://git.kernel.org/pub/scm/linux/kernel/git/song/md.git md-next
> config: sh-allmodconfig (attached as .config)
> compiler: sh4-linux-gcc (GCC) 11.2.0
> reproduce (this is a W=1 build):
>          wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>          chmod +x ~/bin/make.cross
>          # https://github.com/0day-ci/linux/commit/302a06a55fac4a9fb10f57dc96f6a618f3e853b4
>          git remote add linux-review https://github.com/0day-ci/linux
>          git fetch --no-tags linux-review Guoqing-Jiang/raid1-ensure-write-behind-bio-has-less-than-BIO_MAX_VECS-sectors/20210823-154653
>          git checkout 302a06a55fac4a9fb10f57dc96f6a618f3e853b4
>          # save the attached .config to linux build tree
>          mkdir build_dir
>          COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=sh SHELL=/bin/bash drivers/md/
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
>
> All errors (new ones prefixed by >>):
>
>     drivers/md/raid1.c: In function 'raid1_write_request':
>>> drivers/md/raid1.c:1393:44: error: 'mirror' undeclared (first use in this function); did you mean 'md_error'?
>      1393 |                 if (test_bit(WriteMostly, &mirror->rdev->flags))
>           |                                            ^~~~~~
>           |                                            md_error

Oops, will fix it.

> drivers/md/raid1.c:1477:52: error: 'PAGE_SECTORS' undeclared (first use in this function)
>      1477 |                                     BIO_MAX_VECS * PAGE_SECTORS);
>           |                                                    ^~~~~~~~~~~~

This patch is supposed to be merged by block tree due to dependency.

Thanks,
Guoqing
diff mbox series

Patch

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 3c44c4bb40fc..4612c0ba5123 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1329,6 +1329,7 @@  static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	struct raid1_plug_cb *plug = NULL;
 	int first_clone;
 	int max_sectors;
+	bool write_behind = false;
 
 	if (mddev_is_clustered(mddev) &&
 	     md_cluster_ops->area_resyncing(mddev, WRITE,
@@ -1381,6 +1382,15 @@  static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	max_sectors = r1_bio->sectors;
 	for (i = 0;  i < disks; i++) {
 		struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);
+
+		/*
+		 * The write-behind io is only attempted on drives marked as
+		 * write-mostly, which means we could allocated write behind
+		 * bio later.
+		 */
+		if (test_bit(WriteMostly, &mirror->rdev->flags))
+			write_behind = true;
+
 		if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
 			atomic_inc(&rdev->nr_pending);
 			blocked_rdev = rdev;
@@ -1454,6 +1464,15 @@  static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 		goto retry_write;
 	}
 
+	/*
+	 * When using a bitmap, we may call alloc_behind_master_bio below.
+	 * alloc_behind_master_bio allocates a copy of the data payload a page
+	 * at a time and thus needs a new bio that can fit the whole payload
+	 * this bio in page sized chunks.
+	 */
+	if (write_behind && bitmap)
+		max_sectors = min_t(int, max_sectors,
+				    BIO_MAX_VECS * PAGE_SECTORS);
 	if (max_sectors < bio_sectors(bio)) {
 		struct bio *split = bio_split(bio, max_sectors,
 					      GFP_NOIO, &conf->bio_split);