
[5/6] f2fs: add a kernel thread to issue discard commands asynchronously

Message ID 20170112224407.54026-5-jaegeuk@kernel.org (mailing list archive)
State New, archived

Commit Message

Jaegeuk Kim Jan. 12, 2017, 10:44 p.m. UTC
This patch adds a kernel thread to issue discard commands.
It introduces three states, D_PREP, D_SUBMIT, and D_DONE, to identify the
current bio status.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
---
 fs/f2fs/f2fs.h    |  11 +++++
 fs/f2fs/segment.c | 128 +++++++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 108 insertions(+), 31 deletions(-)
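
As a reading aid (distilled from the patch below, not part of the submission),
the intended lifecycle of a discard command is:

enum {
	D_PREP,		/* built by __add_discard_cmd(); bio not yet submitted */
	D_SUBMIT,	/* submit_bio() called, either by the new issue thread or
			 * by f2fs_wait_discard_bio() when a caller must flush */
	D_DONE,		/* f2fs_submit_discard_endio() ran; the command is reaped
			 * via __remove_discard_cmd() */
};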

Comments

Christoph Hellwig Jan. 13, 2017, 8:01 a.m. UTC | #1
On Thu, Jan 12, 2017 at 02:44:06PM -0800, Jaegeuk Kim wrote:
> This patch adds a kernel thread to issue discard commands.
> It introduces three states, D_PREP, D_SUBMIT, and D_DONE, to identify the
> current bio status.

Why?  Instead of creating the complexity of a thread you should look
into issuing the discard bios asynchronously, which should help to
simplify a lot of the discard logic in f2fs.
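
For reference, a minimal sketch of the suggested pattern: build the discard
bio with __blkdev_issue_discard(), attach an end_io callback, and return
without waiting for the device. The helper names (issue_discard_async,
discard_endio) are illustrative, not from any tree:

/* Illustrative sketch: issue one discard range without blocking the
 * caller; completion is signalled from the bio end_io callback. */
static void discard_endio(struct bio *bio)
{
	struct completion *done = bio->bi_private;

	complete(done);			/* device finished the discard */
	bio_put(bio);
}

static int issue_discard_async(struct block_device *bdev, sector_t sector,
			       sector_t nr_sects, struct completion *done)
{
	struct bio *bio = NULL;
	int err = __blkdev_issue_discard(bdev, sector, nr_sects,
					 GFP_NOFS, 0, &bio);

	if (!err && bio) {
		bio->bi_private = done;
		bio->bi_end_io = discard_endio;
		submit_bio(bio);	/* returns without waiting */
	}
	return err;
}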
Jaegeuk Kim Jan. 13, 2017, 7:12 p.m. UTC | #2
On 01/13, Christoph Hellwig wrote:
> On Thu, Jan 12, 2017 at 02:44:06PM -0800, Jaegeuk Kim wrote:
> > This patch adds a kernel thread to issue discard commands.
> > It introduces three states, D_PREP, D_SUBMIT, and D_DONE, to identify the
> > current bio status.
> 
> Why?  Instead of creating the complexity of a thread you should look
> into issuing the discard bios asynchronously, which should help to
> simplify a lot of the discard logic in f2fs.

Previously, I tried issuing the discard bios asynchronously, but that was not
enough. When testing an NVMe SSD with the noop IO scheduler, submit_bio() was
blocked after every 8 async discard bios, resulting in a very slow checkpoint
process which blocks most other FS operations.

Thanks,
Christoph Hellwig Jan. 16, 2017, 5:32 p.m. UTC | #3
On Fri, Jan 13, 2017 at 11:12:11AM -0800, Jaegeuk Kim wrote:
> Previously, I tried issuing the discard bios asynchronously, but that was not
> enough. When testing an NVMe SSD with the noop IO scheduler, submit_bio() was
> blocked after every 8 async discard bios, resulting in a very slow checkpoint
> process which blocks most other FS operations.

Where does it block?  Are you running out of requests?  What driver is
this on top of?
Christoph Hellwig Feb. 5, 2017, 8:59 a.m. UTC | #4
On Mon, Jan 16, 2017 at 09:32:20AM -0800, Christoph Hellwig wrote:
> On Fri, Jan 13, 2017 at 11:12:11AM -0800, Jaegeuk Kim wrote:
> > Previously, I tried issuing the discard bios asynchronously, but that was not
> > enough. When testing an NVMe SSD with the noop IO scheduler, submit_bio() was
> > blocked after every 8 async discard bios, resulting in a very slow checkpoint
> > process which blocks most other FS operations.
> 
> Where does it block?  Are you running out of requests?  What driver is
> this on top of?

Ping?  I'm currently spending a lot of effort on fs and block discard
code, and I'd like to make sure we get common infrastructure instead
of local hacks.
Jaegeuk Kim Feb. 7, 2017, 3:44 a.m. UTC | #5
On 02/05, Christoph Hellwig wrote:
> On Mon, Jan 16, 2017 at 09:32:20AM -0800, Christoph Hellwig wrote:
> > On Fri, Jan 13, 2017 at 11:12:11AM -0800, Jaegeuk Kim wrote:
> > > Previously, I tried issuing the discard bios asynchronously, but that was not
> > > enough. When testing an NVMe SSD with the noop IO scheduler, submit_bio() was
> > > blocked after every 8 async discard bios, resulting in a very slow checkpoint
> > > process which blocks most other FS operations.
> > 
> > Where does it block?  Are you running out of requests?  What driver is
> > this on top of?
> 
> Ping?  I'm currently spending a lot of effort on fs and block discard
> code, and I'd like to make sure we get common infrastructure instead
> of local hacks.

Sorry for the late response; I was traveling.

When doing fstrim with a fresh f2fs image formatted on an Intel NVMe SSD whose
model name is SSDPE2MW012T4, I got the following trace.

...
fstrim-12620 [000] .... 334572.907534: f2fs_issue_discard: dev = (259,1), blkstart = 0x902900, blklen = 0x400
fstrim-12620 [000] .... 334572.907535: block_bio_remap: 259,0 D 75583488 + 8192 <- (259,1) 75581440
fstrim-12620 [000] .... 334572.907535: block_bio_queue: 259,0 D 75583488 + 8192 [fstrim]
fstrim-12620 [000] .... 334572.907535: block_getrq: 259,0 D 75583488 + 8192 [fstrim]
fstrim-12620 [000] .... 334572.907536: block_unplug: [fstrim] 1
fstrim-12620 [000] .... 334572.907536: block_rq_insert: 259,0 D 0 () 75583488 + 8192 [fstrim]
fstrim-12620 [000] .... 334572.907536: block_rq_issue: 259,0 D 0 () 75583488 + 8192 [fstrim]
 < repeat 6 times >
fstrim-12620 [000] .... 334572.907620: f2fs_issue_discard: dev = (259,1), blkstart = 0x904500, blklen = 0x400
fstrim-12620 [000] .... 334572.907620: block_bio_remap: 259,0 D 75640832 + 8192 <- (259,1) 75638784
fstrim-12620 [000] .... 334572.907620: block_bio_queue: 259,0 D 75640832 + 8192 [fstrim]
fstrim-12620 [000] .... 334572.907621: block_getrq: 259,0 D 75640832 + 8192 [fstrim]
<idle>-0     [000] d.h. 334572.907723: block_rq_complete: 259,0 D () 67260416 + 8192 [0]
<idle>-0     [000] d.h. 334572.907942: block_rq_complete: 259,0 D () 67268608 + 8192 [0]
<idle>-0     [000] d.h. 334572.908155: block_rq_complete: 259,0 D () 67276800 + 8192 [0]
<idle>-0     [000] d.h. 334572.908374: block_rq_complete: 259,0 D () 67284992 + 8192 [0]
<idle>-0     [000] d.h. 334572.908597: block_rq_complete: 259,0 D () 67293184 + 8192 [0]
<idle>-0     [000] d.h. 334572.908823: block_rq_complete: 259,0 D () 67301376 + 8192 [0]
<idle>-0     [000] d.h. 334572.909033: block_rq_complete: 259,0 D () 67309568 + 8192 [0]
<idle>-0     [000] d.h. 334572.909216: block_rq_complete: 259,0 D () 67317760 + 8192 [0]
fstrim-12620 [000] .... 334572.909222: block_unplug: [fstrim] 1
fstrim-12620 [000] .... 334572.909223: block_rq_insert: 259,0 D 0 () 75640832 + 8192 [fstrim]
fstrim-12620 [000] .... 334572.909224: block_rq_issue: 259,0 D 0 () 75640832 + 8192 [fstrim]
fstrim-12620 [000] .... 334572.909240: f2fs_issue_discard: dev = (259,1), blkstart = 0x904900, blklen = 0x400
fstrim-12620 [000] .... 334572.909241: block_bio_remap: 259,0 D 75649024 + 8192 <- (259,1) 75646976
fstrim-12620 [000] .... 334572.909241: block_bio_queue: 259,0 D 75649024 + 8192 [fstrim]
fstrim-12620 [000] .... 334572.909241: block_getrq: 259,0 D 75649024 + 8192 [fstrim]
fstrim-12620 [000] .... 334572.909242: block_unplug: [fstrim] 1
fstrim-12620 [000] .... 334572.909242: block_rq_insert: 259,0 D 0 () 75649024 + 8192 [fstrim]
fstrim-12620 [000] .... 334572.909242: block_rq_issue: 259,0 D 0 () 75649024 + 8192 [fstrim]
 < repeat >

So, I investigated in more detail why block_rq_complete() happened at that point.

The call path at the root of the problem looks like:
 - submit_bio
  - generic_make_request
   - q->make_request_fn
    - blk_mq_make_request
     - blk_mq_map_request
      - blk_mq_alloc_request
       - blk_mq_get_tag
        - __blk_mq_get_tag
         - bt_get
          - blk_mq_run_hw_queue
          - finish_wait
          --> this waits for pending 8 discard bios!
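
For context, the wait at the bottom of that call path is the blk-mq tag
allocator's slow path. Heavily simplified and paraphrased from
block/blk-mq-tag.c of that era (not verbatim, and details vary by kernel
version), bt_get() behaves roughly like:

	do {
		prepare_to_wait(&bs->wait, &wait, TASK_UNINTERRUPTIBLE);

		tag = __bt_get(hctx, bt, last_tag, tags);  /* retry the tag bitmap */
		if (tag != -1)
			break;

		/* Kick the hardware queue so in-flight requests (here, the
		 * eight pending discards) can complete and free their tags. */
		blk_mq_run_hw_queue(hctx, false);

		tag = __bt_get(hctx, bt, last_tag, tags);
		if (tag != -1)
			break;

		io_schedule();	/* sleep until a completion releases a tag */
	} while (1);

	finish_wait(&bs->wait, &wait);	/* the frame seen in the call path above */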

It seems the problem comes from the storage processing discard commands much
more slowly than normal read/write IOs.

Any thoughts?

Thanks,
Christoph Hellwig Feb. 8, 2017, 4:02 p.m. UTC | #6
On Mon, Feb 06, 2017 at 07:44:03PM -0800, Jaegeuk Kim wrote:
> Sorry for the late response; I was traveling.
> 
> When doing fstrim with a fresh f2fs image formatted on an Intel NVMe SSD whose
> model name is SSDPE2MW012T4, I got the following trace.

<snip>

> So, I investigated in more detail why block_rq_complete() happened at that point.
> 
> The call path at the root of the problem looks like:
>  - submit_bio
>   - generic_make_request
>    - q->make_request_fn
>     - blk_mq_make_request
>      - blk_mq_map_request
>       - blk_mq_alloc_request
>        - blk_mq_get_tag
>         - __blk_mq_get_tag
>          - bt_get
>           - blk_mq_run_hw_queue
>           - finish_wait
>           --> this waits for pending 8 discard bios!

You're blocking on tag allocation.  How many tags per queue does
your device have?  E.g. do a

cat /sys/block/nvme0n1/mq/0/nr_tags

> It seems the problem comes from the storage processing discard commands much
> more slowly than normal read/write IOs.
> 
> Any thoughts?

Deallocate is always going to be an exception path compared to normal
read/write, but just how much slower it is will be device dependent.

One option would be to reduce the number of discards; to that end, can you
try the series here to support vectored discards:

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/vectored-discard-for-axboe
Jaegeuk Kim Feb. 8, 2017, 10:05 p.m. UTC | #7
On 02/08, Christoph Hellwig wrote:
> On Mon, Feb 06, 2017 at 07:44:03PM -0800, Jaegeuk Kim wrote:
> > Sorry for the late response; I was traveling.
> > 
> > When doing fstrim with a fresh f2fs image formatted on an Intel NVMe SSD whose
> > model name is SSDPE2MW012T4, I got the following trace.
> 
> <snip>
> 
> > So, I investigated in more detail why block_rq_complete() happened at that point.
> > 
> > The call path at the root of the problem looks like:
> >  - submit_bio
> >   - generic_make_request
> >    - q->make_request_fn
> >     - blk_mq_make_request
> >      - blk_mq_map_request
> >       - blk_mq_alloc_request
> >        - blk_mq_get_tag
> >         - __blk_mq_get_tag
> >          - bt_get
> >           - blk_mq_run_hw_queue
> >           - finish_wait
> >           --> this waits for pending 8 discard bios!
> 
> You're blocking on tag allocation.  How many tags per queue does
> your device have?  E.g. do a
> 
> cat /sys/block/nvme0n1/mq/0/nr_tags

It shows 1023.

> > It seems the problem comes from the storage processing discard commands much
> > more slowly than normal read/write IOs.
> > 
> > Any thoughts?
> 
> Deallocate is always going to be an exception path compared to normal
> read/write, but just how much slower it is will be device dependent.
> 
> One option would be to reduce the number of discards; to that end, can you
> try the series here to support vectored discards:
> 
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/vectored-discard-for-axboe

I tried this, but couldn't see any difference.

Thanks,
Chao Yu Feb. 22, 2017, 9:40 a.m. UTC | #8
On 2017/1/13 6:44, Jaegeuk Kim wrote:
> This patch adds a kernel thread to issue discard commands.
> It introduces three states, D_PREP, D_SUBMIT, and D_DONE, to identify the
> current bio status.
> 
> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

Reviewed-by: Chao Yu <yuchao0@huawei.com>

Patch

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index e7b403cbd70f..ef5e3709c161 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -129,6 +129,7 @@  enum {
 };
 
 #define MAX_DISCARD_BLOCKS(sbi) (1 << (sbi)->log_blocks_per_seg)
+#define DISCARD_ISSUE_RATE	8
 #define DEF_CP_INTERVAL			60	/* 60 secs */
 #define DEF_IDLE_INTERVAL		5	/* 5 secs */
 
@@ -177,18 +178,28 @@  struct discard_entry {
 	int len;		/* # of consecutive blocks of the discard */
 };
 
+enum {
+	D_PREP,
+	D_SUBMIT,
+	D_DONE,
+};
+
 struct discard_cmd {
 	struct list_head list;		/* command list */
 	struct completion wait;		/* completion */
 	block_t lstart;			/* logical start address */
 	block_t len;			/* length */
 	struct bio *bio;		/* bio */
+	int state;			/* state */
 };
 
 struct discard_cmd_control {
+	struct task_struct *f2fs_issue_discard;	/* discard thread */
 	struct list_head discard_entry_list;	/* 4KB discard entry list */
 	int nr_discards;			/* # of discards in the list */
 	struct list_head discard_cmd_list;	/* discard cmd list */
+	wait_queue_head_t discard_wait_queue;	/* waiting queue for wake-up */
+	struct mutex cmd_lock;
 	int max_discards;			/* max. discards to be issued */
 };
 
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index e3bec31de961..74364006bfa6 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -628,7 +628,7 @@  static void locate_dirty_segment(struct f2fs_sb_info *sbi, unsigned int segno)
 	mutex_unlock(&dirty_i->seglist_lock);
 }
 
-static struct discard_cmd *__add_discard_cmd(struct f2fs_sb_info *sbi,
+static void __add_discard_cmd(struct f2fs_sb_info *sbi,
 			struct bio *bio, block_t lstart, block_t len)
 {
 	struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info;
@@ -638,12 +638,30 @@  static struct discard_cmd *__add_discard_cmd(struct f2fs_sb_info *sbi,
 	dc = f2fs_kmem_cache_alloc(discard_cmd_slab, GFP_NOFS);
 	INIT_LIST_HEAD(&dc->list);
 	dc->bio = bio;
+	bio->bi_private = dc;
 	dc->lstart = lstart;
 	dc->len = len;
+	dc->state = D_PREP;
 	init_completion(&dc->wait);
+
+	mutex_lock(&dcc->cmd_lock);
 	list_add_tail(&dc->list, cmd_list);
+	mutex_unlock(&dcc->cmd_lock);
+}
+
+static void __remove_discard_cmd(struct f2fs_sb_info *sbi, struct discard_cmd *dc)
+{
+	int err = dc->bio->bi_error;
 
-	return dc;
+	if (err == -EOPNOTSUPP)
+		err = 0;
+
+	if (err)
+		f2fs_msg(sbi->sb, KERN_INFO,
+				"Issue discard failed, ret: %d", err);
+	bio_put(dc->bio);
+	list_del(&dc->list);
+	kmem_cache_free(discard_cmd_slab, dc);
 }
 
 /* This should be covered by global mutex, &sit_i->sentry_lock */
@@ -653,31 +671,28 @@  void f2fs_wait_discard_bio(struct f2fs_sb_info *sbi, block_t blkaddr)
 	struct list_head *wait_list = &(dcc->discard_cmd_list);
 	struct discard_cmd *dc, *tmp;
 
+	mutex_lock(&dcc->cmd_lock);
 	list_for_each_entry_safe(dc, tmp, wait_list, list) {
-		struct bio *bio = dc->bio;
-		int err;
 
-		if (!completion_done(&dc->wait)) {
-			if ((dc->lstart <= blkaddr &&
-					blkaddr < dc->lstart + dc->len) ||
-					blkaddr == NULL_ADDR)
+		if (blkaddr == NULL_ADDR) {
+			if (dc->state == D_PREP) {
+				dc->state = D_SUBMIT;
+				submit_bio(dc->bio);
+			}
+			wait_for_completion_io(&dc->wait);
+
+			__remove_discard_cmd(sbi, dc);
+			continue;
+		}
+
+		if (dc->lstart <= blkaddr && blkaddr < dc->lstart + dc->len) {
+			if (dc->state == D_SUBMIT)
 				wait_for_completion_io(&dc->wait);
 			else
-				continue;
+				__remove_discard_cmd(sbi, dc);
 		}
-
-		err = bio->bi_error;
-		if (err == -EOPNOTSUPP)
-			err = 0;
-
-		if (err)
-			f2fs_msg(sbi->sb, KERN_INFO,
-				"Issue discard failed, ret: %d", err);
-
-		bio_put(bio);
-		list_del(&dc->list);
-		kmem_cache_free(discard_cmd_slab, dc);
 	}
+	mutex_unlock(&dcc->cmd_lock);
 }
 
 static void f2fs_submit_discard_endio(struct bio *bio)
@@ -685,8 +700,48 @@  static void f2fs_submit_discard_endio(struct bio *bio)
 	struct discard_cmd *dc = (struct discard_cmd *)bio->bi_private;
 
 	complete(&dc->wait);
+	dc->state = D_DONE;
 }
 
+static int issue_discard_thread(void *data)
+{
+	struct f2fs_sb_info *sbi = data;
+	struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info;
+	wait_queue_head_t *q = &dcc->discard_wait_queue;
+	struct list_head *cmd_list = &dcc->discard_cmd_list;
+	struct discard_cmd *dc, *tmp;
+	struct blk_plug plug;
+	int iter = 0;
+repeat:
+	if (kthread_should_stop())
+		return 0;
+
+	blk_start_plug(&plug);
+
+	mutex_lock(&dcc->cmd_lock);
+	list_for_each_entry_safe(dc, tmp, cmd_list, list) {
+		if (dc->state == D_PREP) {
+			dc->state = D_SUBMIT;
+			submit_bio(dc->bio);
+			if (iter++ > DISCARD_ISSUE_RATE)
+				break;
+		} else if (dc->state == D_DONE) {
+			__remove_discard_cmd(sbi, dc);
+		}
+	}
+	mutex_unlock(&dcc->cmd_lock);
+
+	blk_finish_plug(&plug);
+
+	iter = 0;
+	congestion_wait(BLK_RW_SYNC, HZ/50);
+
+	wait_event_interruptible(*q,
+		kthread_should_stop() || !list_empty(&dcc->discard_cmd_list));
+	goto repeat;
+}
+
+
 /* this function is copied from blkdev_issue_discard from block/blk-lib.c */
 static int __f2fs_issue_discard_async(struct f2fs_sb_info *sbi,
 		struct block_device *bdev, block_t blkstart, block_t blklen)
@@ -707,13 +762,11 @@  static int __f2fs_issue_discard_async(struct f2fs_sb_info *sbi,
 				SECTOR_FROM_BLOCK(blklen),
 				GFP_NOFS, 0, &bio);
 	if (!err && bio) {
-		struct discard_cmd *dc = __add_discard_cmd(sbi, bio,
-						lblkstart, blklen);
-
-		bio->bi_private = dc;
 		bio->bi_end_io = f2fs_submit_discard_endio;
 		bio->bi_opf |= REQ_SYNC;
-		submit_bio(bio);
+
+		__add_discard_cmd(sbi, bio, lblkstart, blklen);
+		wake_up(&SM_I(sbi)->dcc_info->discard_wait_queue);
 	}
 	return err;
 }
@@ -920,14 +973,11 @@  void clear_prefree_segments(struct f2fs_sb_info *sbi, struct cp_control *cpc)
 	struct list_head *head = &(SM_I(sbi)->dcc_info->discard_entry_list);
 	struct discard_entry *entry, *this;
 	struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
-	struct blk_plug plug;
 	unsigned long *prefree_map = dirty_i->dirty_segmap[PRE];
 	unsigned int start = 0, end = -1;
 	unsigned int secno, start_segno;
 	bool force = (cpc->reason == CP_DISCARD);
 
-	blk_start_plug(&plug);
-
 	mutex_lock(&dirty_i->seglist_lock);
 
 	while (1) {
@@ -980,12 +1030,11 @@  void clear_prefree_segments(struct f2fs_sb_info *sbi, struct cp_control *cpc)
 		SM_I(sbi)->dcc_info->nr_discards -= entry->len;
 		kmem_cache_free(discard_entry_slab, entry);
 	}
-
-	blk_finish_plug(&plug);
 }
 
 int create_discard_cmd_control(struct f2fs_sb_info *sbi)
 {
+	dev_t dev = sbi->sb->s_bdev->bd_dev;
 	struct discard_cmd_control *dcc;
 	int err = 0;
 
@@ -1000,11 +1049,22 @@  int create_discard_cmd_control(struct f2fs_sb_info *sbi)
 
 	INIT_LIST_HEAD(&dcc->discard_entry_list);
 	INIT_LIST_HEAD(&dcc->discard_cmd_list);
+	mutex_init(&dcc->cmd_lock);
 	dcc->nr_discards = 0;
 	dcc->max_discards = 0;
 
+	init_waitqueue_head(&dcc->discard_wait_queue);
 	SM_I(sbi)->dcc_info = dcc;
 init_thread:
+	dcc->f2fs_issue_discard = kthread_run(issue_discard_thread, sbi,
+				"f2fs_discard-%u:%u", MAJOR(dev), MINOR(dev));
+	if (IS_ERR(dcc->f2fs_issue_discard)) {
+		err = PTR_ERR(dcc->f2fs_issue_discard);
+		kfree(dcc);
+		SM_I(sbi)->dcc_info = NULL;
+		return err;
+	}
+
 	return err;
 }
 
@@ -1012,6 +1072,12 @@  void destroy_discard_cmd_control(struct f2fs_sb_info *sbi, bool free)
 {
 	struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info;
 
+	if (dcc && dcc->f2fs_issue_discard) {
+		struct task_struct *discard_thread = dcc->f2fs_issue_discard;
+
+		dcc->f2fs_issue_discard = NULL;
+		kthread_stop(discard_thread);
+	}
 	if (free) {
 		kfree(dcc);
 		SM_I(sbi)->dcc_info = NULL;