dm-crypt: limit the number of allocated pages

Message ID	alpine.LRH.2.02.1708132240130.18210@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
State	Accepted, archived
Delegated to:	Mike Snitzer
Headers	show Return-Path: <dm-devel-bounces@redhat.com> DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com 03C672A4ADD Date: Sun, 13 Aug 2017 22:45:08 -0400 (EDT) From: Mikulas Patocka <mpatocka@redhat.com> To: dm-devel@redhat.com Message-ID: <alpine.LRH.2.02.1708132240130.18210@file01.intranet.prod.int.rdu2.redhat.com> User-Agent: Alpine 2.02 (LRH 1266 2009-07-14) MIME-Version: 1.0 Cc: Ondrej Kozina <okozina@redhat.com>, Mike Snitzer <msnitzer@redhat.com>, Tom Yan <tom.ty89@gmail.com>, hch@lst.de, Milan Broz <mbroz@redhat.com> Subject: [dm-devel] [PATCH] dm-crypt: limit the number of allocated pages Precedence: junk Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: dm-devel-bounces@redhat.com Errors-To: dm-devel-bounces@redhat.com

Mikulas Patocka Aug. 14, 2017, 2:45 a.m. UTC

dm-crypt consumes excessive amount memory when the user attempts to zero
a dm-crypt device with "blkdiscard -z". The command "blkdiscard -z" calls
the BLKZEROOUT ioctl, it goes to the function __blkdev_issue_zeroout,
__blkdev_issue_zeroout sends large amount of write bios that contain the
zero page as their payload.

For each incoming page, dm-crypt allocates another page that holds the
encrypted data, so when processing "blkdiscard -z", dm-crypt tries to
allocate the amount of memory that is equal to the size of the device.
This can trigger OOM killer or cause system crash.

This patch fixes the bug by limiting the amount of memory that dm-crypt
allocates to 2% of total system memory.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org

---
 drivers/md/dm-crypt.c |   60 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 59 insertions(+), 1 deletion(-)


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

John Stoffel Aug. 14, 2017, 8:19 p.m. UTC | #1

>>>>> "Mikulas" == Mikulas Patocka <mpatocka@redhat.com> writes:

Mikulas> dm-crypt consumes excessive amount memory when the user attempts to zero
Mikulas> a dm-crypt device with "blkdiscard -z". The command "blkdiscard -z" calls
Mikulas> the BLKZEROOUT ioctl, it goes to the function __blkdev_issue_zeroout,
Mikulas> __blkdev_issue_zeroout sends large amount of write bios that contain the
Mikulas> zero page as their payload.

Mikulas> For each incoming page, dm-crypt allocates another page that holds the
Mikulas> encrypted data, so when processing "blkdiscard -z", dm-crypt tries to
Mikulas> allocate the amount of memory that is equal to the size of the device.
Mikulas> This can trigger OOM killer or cause system crash.

Mikulas> This patch fixes the bug by limiting the amount of memory that dm-crypt
Mikulas> allocates to 2% of total system memory.

Is this limit per-device or system wide?  it's not clear to me if the
minimum is just one page, or more so it might be nice to clarify
that.

Also, for large memory systems, that's still alot of pages.  Maybe it
should be more exponential in it's clamping as memory sizes go up?  2%
of 2T is 4G, which is pretty darn big...


Mikulas> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Mikulas> Cc: stable@vger.kernel.org

Mikulas> ---
Mikulas>  drivers/md/dm-crypt.c |   60 +++++++++++++++++++++++++++++++++++++++++++++++++-
Mikulas>  1 file changed, 59 insertions(+), 1 deletion(-)

Mikulas> Index: linux-2.6/drivers/md/dm-crypt.c
Mikulas> ===================================================================
Mikulas> --- linux-2.6.orig/drivers/md/dm-crypt.c
Mikulas> +++ linux-2.6/drivers/md/dm-crypt.c
Mikulas> @@ -148,6 +148,8 @@ struct crypt_config {
Mikulas>  	mempool_t *tag_pool;
Mikulas>  	unsigned tag_pool_max_sectors;
 
Mikulas> +	struct percpu_counter n_allocated_pages;
Mikulas> +
Mikulas>  	struct bio_set *bs;
Mikulas>  	struct mutex bio_alloc_lock;
 
Mikulas> @@ -219,6 +221,12 @@ struct crypt_config {
Mikulas>  #define MAX_TAG_SIZE	480
Mikulas>  #define POOL_ENTRY_SIZE	512
 
Mikulas> +static DEFINE_SPINLOCK(dm_crypt_clients_lock);
Mikulas> +static unsigned dm_crypt_clients_n = 0;
Mikulas> +static volatile unsigned long dm_crypt_pages_per_client;
Mikulas> +#define DM_CRYPT_MEMORY_PERCENT			2
Mikulas> +#define DM_CRYPT_MIN_PAGES_PER_CLIENT		(BIO_MAX_PAGES * 16)
Mikulas> +
Mikulas>  static void clone_init(struct dm_crypt_io *, struct bio *);
Mikulas>  static void kcryptd_queue_crypt(struct dm_crypt_io *io);
Mikulas>  static struct scatterlist *crypt_get_sg_data(struct crypt_config *cc,
Mikulas> @@ -2158,6 +2166,37 @@ static int crypt_wipe_key(struct crypt_c
Mikulas>  	return r;
Mikulas>  }
 
Mikulas> +static void crypt_calculate_pages_per_client(void)
Mikulas> +{
Mikulas> +	unsigned long pages = (totalram_pages - totalhigh_pages) * DM_CRYPT_MEMORY_PERCENT / 100;
Mikulas> +	if (!dm_crypt_clients_n)
Mikulas> +		return;
Mikulas> +	pages /= dm_crypt_clients_n;

Would it make sense to use a shift here, or does this only run once on
setup?  And shouldn't you return a value here, if only to make it
clear what you're defaulting to?  

Mikulas> +	if (pages < DM_CRYPT_MIN_PAGES_PER_CLIENT)
Mikulas> +		pages = DM_CRYPT_MIN_PAGES_PER_CLIENT;
Mikulas> +	dm_crypt_pages_per_client = pages;
Mikulas> +}
Mikulas> +
Mikulas> +static void *crypt_page_alloc(gfp_t gfp_mask, void *pool_data)
Mikulas> +{
Mikulas> +	struct crypt_config *cc = pool_data;
Mikulas> +	struct page *page;
Mikulas> +	if (unlikely(percpu_counter_compare(&cc->n_allocated_pages, dm_crypt_pages_per_client) >= 0) &&
Mikulas> +	    likely(gfp_mask & __GFP_NORETRY))
Mikulas> +		return NULL;
Mikulas> +	page = alloc_page(gfp_mask);
Mikulas> +	if (likely(page != NULL))
Mikulas> +		percpu_counter_add(&cc->n_allocated_pages, 1);
Mikulas> +	return page;
Mikulas> +}
Mikulas> +
Mikulas> +static void crypt_page_free(void *page, void *pool_data)
Mikulas> +{
Mikulas> +	struct crypt_config *cc = pool_data;
Mikulas> +	__free_page(page);
Mikulas> +	percpu_counter_sub(&cc->n_allocated_pages, 1);
Mikulas> +}
Mikulas> +
Mikulas>  static void crypt_dtr(struct dm_target *ti)
Mikulas>  {
Mikulas>  	struct crypt_config *cc = ti->private;
Mikulas> @@ -2184,6 +2223,10 @@ static void crypt_dtr(struct dm_target *
Mikulas>  	mempool_destroy(cc->req_pool);
Mikulas>  	mempool_destroy(cc->tag_pool);
 
Mikulas> +	if (cc->page_pool)
Mikulas> +		WARN_ON(percpu_counter_sum(&cc->n_allocated_pages) != 0);
Mikulas> +	percpu_counter_destroy(&cc->n_allocated_pages);
Mikulas> +
Mikulas>  	if (cc->iv_gen_ops && cc->iv_gen_ops->dtr)
cc-> iv_gen_ops->dtr(cc);
 
Mikulas> @@ -2198,6 +2241,12 @@ static void crypt_dtr(struct dm_target *
 
Mikulas>  	/* Must zero key material before freeing */
Mikulas>  	kzfree(cc);
Mikulas> +
Mikulas> +	spin_lock(&dm_crypt_clients_lock);
Mikulas> +	WARN_ON(!dm_crypt_clients_n);
Mikulas> +	dm_crypt_clients_n--;
Mikulas> +	crypt_calculate_pages_per_client();
Mikulas> +	spin_unlock(&dm_crypt_clients_lock);
Mikulas>  }
 
Mikulas>  static int crypt_ctr_ivmode(struct dm_target *ti, const char *ivmode)
Mikulas> @@ -2636,6 +2685,15 @@ static int crypt_ctr(struct dm_target *t
 
ti-> private = cc;
 
Mikulas> +	spin_lock(&dm_crypt_clients_lock);
Mikulas> +	dm_crypt_clients_n++;
Mikulas> +	crypt_calculate_pages_per_client();
Mikulas> +	spin_unlock(&dm_crypt_clients_lock);
Mikulas> +
Mikulas> +	ret = percpu_counter_init(&cc->n_allocated_pages, 0, GFP_KERNEL);
Mikulas> +	if (ret < 0)
Mikulas> +		goto bad;
Mikulas> +
Mikulas>  	/* Optional parameters need to be read before cipher constructor */
Mikulas>  	if (argc > 5) {
Mikulas>  		ret = crypt_ctr_optional(ti, argc - 5, &argv[5]);
Mikulas> @@ -2690,7 +2748,7 @@ static int crypt_ctr(struct dm_target *t
Mikulas>  		ALIGN(sizeof(struct dm_crypt_io) + cc->dmreq_start + additional_req_size,
Mikulas>  		      ARCH_KMALLOC_MINALIGN);
 
Mikulas> -	cc->page_pool = mempool_create_page_pool(BIO_MAX_PAGES, 0);
Mikulas> +	cc->page_pool = mempool_create(BIO_MAX_PAGES, crypt_page_alloc, crypt_page_free, cc);
Mikulas>  	if (!cc->page_pool) {
ti-> error = "Cannot allocate page mempool";
Mikulas>  		goto bad;

Mikulas> --
Mikulas> dm-devel mailing list
Mikulas> dm-devel@redhat.com
Mikulas> https://www.redhat.com/mailman/listinfo/dm-devel

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Tom Yan Aug. 14, 2017, 8:22 p.m. UTC | #2

Just tested the patch with kernel 4.12.6. Well it sort of worked. No
more OOM or kernel panic. Memory takeup is around ~250M on a machine
with 8G RAM. However I keep getting this:

Aug 15 04:04:10 archlinux kernel: INFO: task blkdiscard:538 blocked
for more than 120 seconds.
Aug 15 04:04:10 archlinux kernel:       Tainted: P         C O
4.12.6-1-ARCH #1
Aug 15 04:04:10 archlinux kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 15 04:04:10 archlinux kernel: blkdiscard      D    0   538    537 0x00000000
Aug 15 04:04:10 archlinux kernel: Call Trace:
Aug 15 04:04:10 archlinux kernel:  __schedule+0x236/0x870
Aug 15 04:04:10 archlinux kernel:  schedule+0x3d/0x90
Aug 15 04:04:10 archlinux kernel:  schedule_timeout+0x21f/0x330
Aug 15 04:04:10 archlinux kernel:  io_schedule_timeout+0x1e/0x50
Aug 15 04:04:10 archlinux kernel:  ? io_schedule_timeout+0x1e/0x50
Aug 15 04:04:10 archlinux kernel:  wait_for_completion_io+0xa5/0x120
Aug 15 04:04:10 archlinux kernel:  ? wake_up_q+0x80/0x80
Aug 15 04:04:10 archlinux kernel:  submit_bio_wait+0x68/0x90
Aug 15 04:04:10 archlinux kernel:  blkdev_issue_zeroout+0x80/0xc0
Aug 15 04:04:10 archlinux kernel:  blkdev_ioctl+0x707/0x940
Aug 15 04:04:10 archlinux kernel:  ? blkdev_ioctl+0x707/0x940
Aug 15 04:04:10 archlinux kernel:  block_ioctl+0x3d/0x50
Aug 15 04:04:10 archlinux kernel:  do_vfs_ioctl+0xa5/0x600
Aug 15 04:04:10 archlinux kernel:  ? SYSC_newfstat+0x44/0x70
Aug 15 04:04:10 archlinux kernel:  ? getrawmonotonic64+0x36/0xc0
Aug 15 04:04:10 archlinux kernel:  SyS_ioctl+0x79/0x90
Aug 15 04:04:10 archlinux kernel:  entry_SYSCALL_64_fastpath+0x1a/0xa5
Aug 15 04:04:10 archlinux kernel: RIP: 0033:0x7f2b463378b7
Aug 15 04:04:10 archlinux kernel: RSP: 002b:00007fffb2dad8b8 EFLAGS:
00000246 ORIG_RAX: 0000000000000010
Aug 15 04:04:10 archlinux kernel: RAX: ffffffffffffffda RBX:
000000568a3922c8 RCX: 00007f2b463378b7
Aug 15 04:04:10 archlinux kernel: RDX: 00007fffb2dad910 RSI:
000000000000127f RDI: 0000000000000003
Aug 15 04:04:10 archlinux kernel: RBP: 0000000000000000 R08:
0000000000000200 R09: 0000000000000000
Aug 15 04:04:10 archlinux kernel: R10: 00007fffb2dad870 R11:
0000000000000246 R12: 0000000000000000
Aug 15 04:04:10 archlinux kernel: R13: 0000000000000003 R14:
00007fffb2dadae8 R15: 0000000000000000

which I do not get if I do `blkdiscard -z` on the underlying device
(which does not support SCSI WRITE SAME) instead of the dm-crypt
container.

In the first trial I got some more lower-level errors. blkdiscard
could not exit after the job was seemingly finished (no more write
according to iostat). The container could not be closed either so I
had to just disconnect the drive. I could not reproduce it in second
trial though, so I am not sure if it was just some coincidental
hardware hiccup. My systemd journal happened to be broken as well so
the log of that trial was lost.

I also have doubt in the approach. Can't we split the bio chain as per
how it was chained and allocate memory bio per bio, and if it's not
enough, also limit the memory allocation with a maximum number
(arbitrary or not) of bios?

On 14 August 2017 at 10:45, Mikulas Patocka <mpatocka@redhat.com> wrote:
> dm-crypt consumes excessive amount memory when the user attempts to zero
> a dm-crypt device with "blkdiscard -z". The command "blkdiscard -z" calls
> the BLKZEROOUT ioctl, it goes to the function __blkdev_issue_zeroout,
> __blkdev_issue_zeroout sends large amount of write bios that contain the
> zero page as their payload.
>
> For each incoming page, dm-crypt allocates another page that holds the
> encrypted data, so when processing "blkdiscard -z", dm-crypt tries to
> allocate the amount of memory that is equal to the size of the device.
> This can trigger OOM killer or cause system crash.
>
> This patch fixes the bug by limiting the amount of memory that dm-crypt
> allocates to 2% of total system memory.
>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Cc: stable@vger.kernel.org
>
> ---
>  drivers/md/dm-crypt.c |   60 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 59 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/drivers/md/dm-crypt.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm-crypt.c
> +++ linux-2.6/drivers/md/dm-crypt.c
> @@ -148,6 +148,8 @@ struct crypt_config {
>         mempool_t *tag_pool;
>         unsigned tag_pool_max_sectors;
>
> +       struct percpu_counter n_allocated_pages;
> +
>         struct bio_set *bs;
>         struct mutex bio_alloc_lock;
>
> @@ -219,6 +221,12 @@ struct crypt_config {
>  #define MAX_TAG_SIZE   480
>  #define POOL_ENTRY_SIZE        512
>
> +static DEFINE_SPINLOCK(dm_crypt_clients_lock);
> +static unsigned dm_crypt_clients_n = 0;
> +static volatile unsigned long dm_crypt_pages_per_client;
> +#define DM_CRYPT_MEMORY_PERCENT                        2
> +#define DM_CRYPT_MIN_PAGES_PER_CLIENT          (BIO_MAX_PAGES * 16)
> +
>  static void clone_init(struct dm_crypt_io *, struct bio *);
>  static void kcryptd_queue_crypt(struct dm_crypt_io *io);
>  static struct scatterlist *crypt_get_sg_data(struct crypt_config *cc,
> @@ -2158,6 +2166,37 @@ static int crypt_wipe_key(struct crypt_c
>         return r;
>  }
>
> +static void crypt_calculate_pages_per_client(void)
> +{
> +       unsigned long pages = (totalram_pages - totalhigh_pages) * DM_CRYPT_MEMORY_PERCENT / 100;
> +       if (!dm_crypt_clients_n)
> +               return;
> +       pages /= dm_crypt_clients_n;
> +       if (pages < DM_CRYPT_MIN_PAGES_PER_CLIENT)
> +               pages = DM_CRYPT_MIN_PAGES_PER_CLIENT;
> +       dm_crypt_pages_per_client = pages;
> +}
> +
> +static void *crypt_page_alloc(gfp_t gfp_mask, void *pool_data)
> +{
> +       struct crypt_config *cc = pool_data;
> +       struct page *page;
> +       if (unlikely(percpu_counter_compare(&cc->n_allocated_pages, dm_crypt_pages_per_client) >= 0) &&
> +           likely(gfp_mask & __GFP_NORETRY))
> +               return NULL;
> +       page = alloc_page(gfp_mask);
> +       if (likely(page != NULL))
> +               percpu_counter_add(&cc->n_allocated_pages, 1);
> +       return page;
> +}
> +
> +static void crypt_page_free(void *page, void *pool_data)
> +{
> +       struct crypt_config *cc = pool_data;
> +       __free_page(page);
> +       percpu_counter_sub(&cc->n_allocated_pages, 1);
> +}
> +
>  static void crypt_dtr(struct dm_target *ti)
>  {
>         struct crypt_config *cc = ti->private;
> @@ -2184,6 +2223,10 @@ static void crypt_dtr(struct dm_target *
>         mempool_destroy(cc->req_pool);
>         mempool_destroy(cc->tag_pool);
>
> +       if (cc->page_pool)
> +               WARN_ON(percpu_counter_sum(&cc->n_allocated_pages) != 0);
> +       percpu_counter_destroy(&cc->n_allocated_pages);
> +
>         if (cc->iv_gen_ops && cc->iv_gen_ops->dtr)
>                 cc->iv_gen_ops->dtr(cc);
>
> @@ -2198,6 +2241,12 @@ static void crypt_dtr(struct dm_target *
>
>         /* Must zero key material before freeing */
>         kzfree(cc);
> +
> +       spin_lock(&dm_crypt_clients_lock);
> +       WARN_ON(!dm_crypt_clients_n);
> +       dm_crypt_clients_n--;
> +       crypt_calculate_pages_per_client();
> +       spin_unlock(&dm_crypt_clients_lock);
>  }
>
>  static int crypt_ctr_ivmode(struct dm_target *ti, const char *ivmode)
> @@ -2636,6 +2685,15 @@ static int crypt_ctr(struct dm_target *t
>
>         ti->private = cc;
>
> +       spin_lock(&dm_crypt_clients_lock);
> +       dm_crypt_clients_n++;
> +       crypt_calculate_pages_per_client();
> +       spin_unlock(&dm_crypt_clients_lock);
> +
> +       ret = percpu_counter_init(&cc->n_allocated_pages, 0, GFP_KERNEL);
> +       if (ret < 0)
> +               goto bad;
> +
>         /* Optional parameters need to be read before cipher constructor */
>         if (argc > 5) {
>                 ret = crypt_ctr_optional(ti, argc - 5, &argv[5]);
> @@ -2690,7 +2748,7 @@ static int crypt_ctr(struct dm_target *t
>                 ALIGN(sizeof(struct dm_crypt_io) + cc->dmreq_start + additional_req_size,
>                       ARCH_KMALLOC_MINALIGN);
>
> -       cc->page_pool = mempool_create_page_pool(BIO_MAX_PAGES, 0);
> +       cc->page_pool = mempool_create(BIO_MAX_PAGES, crypt_page_alloc, crypt_page_free, cc);
>         if (!cc->page_pool) {
>                 ti->error = "Cannot allocate page mempool";
>                 goto bad;

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mikulas Patocka Aug. 15, 2017, 12:07 a.m. UTC | #3

On Mon, 14 Aug 2017, John Stoffel wrote:

> >>>>> "Mikulas" == Mikulas Patocka <mpatocka@redhat.com> writes:
> 
> Mikulas> dm-crypt consumes excessive amount memory when the user attempts to zero
> Mikulas> a dm-crypt device with "blkdiscard -z". The command "blkdiscard -z" calls
> Mikulas> the BLKZEROOUT ioctl, it goes to the function __blkdev_issue_zeroout,
> Mikulas> __blkdev_issue_zeroout sends large amount of write bios that contain the
> Mikulas> zero page as their payload.
> 
> Mikulas> For each incoming page, dm-crypt allocates another page that holds the
> Mikulas> encrypted data, so when processing "blkdiscard -z", dm-crypt tries to
> Mikulas> allocate the amount of memory that is equal to the size of the device.
> Mikulas> This can trigger OOM killer or cause system crash.
> 
> Mikulas> This patch fixes the bug by limiting the amount of memory that dm-crypt
> Mikulas> allocates to 2% of total system memory.
> 
> Is this limit per-device or system wide?  it's not clear to me if the
> minimum is just one page, or more so it might be nice to clarify
> that.

The limit is system-wide. The limit is divided by the number of active 
dm-crypt devices and each device receives an equal share.

There is a lower bound of BIO_MAX_PAGES * 16. The per-device limit is 
never lower than this.

> Also, for large memory systems, that's still alot of pages.  Maybe it
> should be more exponential in it's clamping as memory sizes go up?  2%
> of 2T is 4G, which is pretty darn big...

The more we restrict it, the less parallelism is used in dm-crypt, so it 
makes sense to have a big value on machines with many cores.

> Mikulas> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Mikulas> Cc: stable@vger.kernel.org
> 
> Mikulas> ---
> Mikulas>  drivers/md/dm-crypt.c |   60 +++++++++++++++++++++++++++++++++++++++++++++++++-
> Mikulas>  1 file changed, 59 insertions(+), 1 deletion(-)
> 
> Mikulas> Index: linux-2.6/drivers/md/dm-crypt.c
> Mikulas> ===================================================================
> Mikulas> --- linux-2.6.orig/drivers/md/dm-crypt.c
> Mikulas> +++ linux-2.6/drivers/md/dm-crypt.c
> Mikulas> @@ -148,6 +148,8 @@ struct crypt_config {
> Mikulas>  	mempool_t *tag_pool;
> Mikulas>  	unsigned tag_pool_max_sectors;
>  
> Mikulas> +	struct percpu_counter n_allocated_pages;
> Mikulas> +
> Mikulas>  	struct bio_set *bs;
> Mikulas>  	struct mutex bio_alloc_lock;
>  
> Mikulas> @@ -219,6 +221,12 @@ struct crypt_config {
> Mikulas>  #define MAX_TAG_SIZE	480
> Mikulas>  #define POOL_ENTRY_SIZE	512
>  
> Mikulas> +static DEFINE_SPINLOCK(dm_crypt_clients_lock);
> Mikulas> +static unsigned dm_crypt_clients_n = 0;
> Mikulas> +static volatile unsigned long dm_crypt_pages_per_client;
> Mikulas> +#define DM_CRYPT_MEMORY_PERCENT			2
> Mikulas> +#define DM_CRYPT_MIN_PAGES_PER_CLIENT		(BIO_MAX_PAGES * 16)
> Mikulas> +
> Mikulas>  static void clone_init(struct dm_crypt_io *, struct bio *);
> Mikulas>  static void kcryptd_queue_crypt(struct dm_crypt_io *io);
> Mikulas>  static struct scatterlist *crypt_get_sg_data(struct crypt_config *cc,
> Mikulas> @@ -2158,6 +2166,37 @@ static int crypt_wipe_key(struct crypt_c
> Mikulas>  	return r;
> Mikulas>  }
>  
> Mikulas> +static void crypt_calculate_pages_per_client(void)
> Mikulas> +{
> Mikulas> +	unsigned long pages = (totalram_pages - totalhigh_pages) * DM_CRYPT_MEMORY_PERCENT / 100;
> Mikulas> +	if (!dm_crypt_clients_n)
> Mikulas> +		return;
> Mikulas> +	pages /= dm_crypt_clients_n;
> 
> Would it make sense to use a shift here, or does this only run once on
> setup?  And shouldn't you return a value here, if only to make it
> clear what you're defaulting to?  

This piece of code is executed only when a dm-crypt device is created or 
deactivated, so performance doesn't matter.

> Mikulas> +	if (pages < DM_CRYPT_MIN_PAGES_PER_CLIENT)
> Mikulas> +		pages = DM_CRYPT_MIN_PAGES_PER_CLIENT;
> Mikulas> +	dm_crypt_pages_per_client = pages;
> Mikulas> +}
> Mikulas> +
> Mikulas> +static void *crypt_page_alloc(gfp_t gfp_mask, void *pool_data)
> Mikulas> +{
> Mikulas> +	struct crypt_config *cc = pool_data;
> Mikulas> +	struct page *page;
> Mikulas> +	if (unlikely(percpu_counter_compare(&cc->n_allocated_pages, dm_crypt_pages_per_client) >= 0) &&
> Mikulas> +	    likely(gfp_mask & __GFP_NORETRY))
> Mikulas> +		return NULL;
> Mikulas> +	page = alloc_page(gfp_mask);
> Mikulas> +	if (likely(page != NULL))
> Mikulas> +		percpu_counter_add(&cc->n_allocated_pages, 1);
> Mikulas> +	return page;
> Mikulas> +}
> Mikulas> +
> Mikulas> +static void crypt_page_free(void *page, void *pool_data)
> Mikulas> +{
> Mikulas> +	struct crypt_config *cc = pool_data;
> Mikulas> +	__free_page(page);
> Mikulas> +	percpu_counter_sub(&cc->n_allocated_pages, 1);
> Mikulas> +}
> Mikulas> +
> Mikulas>  static void crypt_dtr(struct dm_target *ti)
> Mikulas>  {
> Mikulas>  	struct crypt_config *cc = ti->private;
> Mikulas> @@ -2184,6 +2223,10 @@ static void crypt_dtr(struct dm_target *
> Mikulas>  	mempool_destroy(cc->req_pool);
> Mikulas>  	mempool_destroy(cc->tag_pool);
>  
> Mikulas> +	if (cc->page_pool)
> Mikulas> +		WARN_ON(percpu_counter_sum(&cc->n_allocated_pages) != 0);
> Mikulas> +	percpu_counter_destroy(&cc->n_allocated_pages);
> Mikulas> +
> Mikulas>  	if (cc->iv_gen_ops && cc->iv_gen_ops->dtr)
> cc-> iv_gen_ops->dtr(cc);
>  
> Mikulas> @@ -2198,6 +2241,12 @@ static void crypt_dtr(struct dm_target *
>  
> Mikulas>  	/* Must zero key material before freeing */
> Mikulas>  	kzfree(cc);
> Mikulas> +
> Mikulas> +	spin_lock(&dm_crypt_clients_lock);
> Mikulas> +	WARN_ON(!dm_crypt_clients_n);
> Mikulas> +	dm_crypt_clients_n--;
> Mikulas> +	crypt_calculate_pages_per_client();
> Mikulas> +	spin_unlock(&dm_crypt_clients_lock);
> Mikulas>  }
>  
> Mikulas>  static int crypt_ctr_ivmode(struct dm_target *ti, const char *ivmode)
> Mikulas> @@ -2636,6 +2685,15 @@ static int crypt_ctr(struct dm_target *t
>  
> ti-> private = cc;
>  
> Mikulas> +	spin_lock(&dm_crypt_clients_lock);
> Mikulas> +	dm_crypt_clients_n++;
> Mikulas> +	crypt_calculate_pages_per_client();
> Mikulas> +	spin_unlock(&dm_crypt_clients_lock);
> Mikulas> +
> Mikulas> +	ret = percpu_counter_init(&cc->n_allocated_pages, 0, GFP_KERNEL);
> Mikulas> +	if (ret < 0)
> Mikulas> +		goto bad;
> Mikulas> +
> Mikulas>  	/* Optional parameters need to be read before cipher constructor */
> Mikulas>  	if (argc > 5) {
> Mikulas>  		ret = crypt_ctr_optional(ti, argc - 5, &argv[5]);
> Mikulas> @@ -2690,7 +2748,7 @@ static int crypt_ctr(struct dm_target *t
> Mikulas>  		ALIGN(sizeof(struct dm_crypt_io) + cc->dmreq_start + additional_req_size,
> Mikulas>  		      ARCH_KMALLOC_MINALIGN);
>  
> Mikulas> -	cc->page_pool = mempool_create_page_pool(BIO_MAX_PAGES, 0);
> Mikulas> +	cc->page_pool = mempool_create(BIO_MAX_PAGES, crypt_page_alloc, crypt_page_free, cc);
> Mikulas>  	if (!cc->page_pool) {
> ti-> error = "Cannot allocate page mempool";
> Mikulas>  		goto bad;
> 
> Mikulas> --
> Mikulas> dm-devel mailing list
> Mikulas> dm-devel@redhat.com
> Mikulas> https://www.redhat.com/mailman/listinfo/dm-devel
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mikulas Patocka Aug. 15, 2017, 12:20 a.m. UTC | #4

On Tue, 15 Aug 2017, Tom Yan wrote:

> Just tested the patch with kernel 4.12.6. Well it sort of worked. No
> more OOM or kernel panic. Memory takeup is around ~250M on a machine
> with 8G RAM. However I keep getting this:
> 
> Aug 15 04:04:10 archlinux kernel: INFO: task blkdiscard:538 blocked
> for more than 120 seconds.
> Aug 15 04:04:10 archlinux kernel:       Tainted: P         C O
> 4.12.6-1-ARCH #1
> Aug 15 04:04:10 archlinux kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Aug 15 04:04:10 archlinux kernel: blkdiscard      D    0   538    537 0x00000000
> Aug 15 04:04:10 archlinux kernel: Call Trace:
> Aug 15 04:04:10 archlinux kernel:  __schedule+0x236/0x870
> Aug 15 04:04:10 archlinux kernel:  schedule+0x3d/0x90
> Aug 15 04:04:10 archlinux kernel:  schedule_timeout+0x21f/0x330
> Aug 15 04:04:10 archlinux kernel:  io_schedule_timeout+0x1e/0x50
> Aug 15 04:04:10 archlinux kernel:  ? io_schedule_timeout+0x1e/0x50
> Aug 15 04:04:10 archlinux kernel:  wait_for_completion_io+0xa5/0x120
> Aug 15 04:04:10 archlinux kernel:  ? wake_up_q+0x80/0x80
> Aug 15 04:04:10 archlinux kernel:  submit_bio_wait+0x68/0x90
> Aug 15 04:04:10 archlinux kernel:  blkdev_issue_zeroout+0x80/0xc0
> Aug 15 04:04:10 archlinux kernel:  blkdev_ioctl+0x707/0x940
> Aug 15 04:04:10 archlinux kernel:  ? blkdev_ioctl+0x707/0x940
> Aug 15 04:04:10 archlinux kernel:  block_ioctl+0x3d/0x50
> Aug 15 04:04:10 archlinux kernel:  do_vfs_ioctl+0xa5/0x600
> Aug 15 04:04:10 archlinux kernel:  ? SYSC_newfstat+0x44/0x70
> Aug 15 04:04:10 archlinux kernel:  ? getrawmonotonic64+0x36/0xc0
> Aug 15 04:04:10 archlinux kernel:  SyS_ioctl+0x79/0x90
> Aug 15 04:04:10 archlinux kernel:  entry_SYSCALL_64_fastpath+0x1a/0xa5
> Aug 15 04:04:10 archlinux kernel: RIP: 0033:0x7f2b463378b7
> Aug 15 04:04:10 archlinux kernel: RSP: 002b:00007fffb2dad8b8 EFLAGS:
> 00000246 ORIG_RAX: 0000000000000010
> Aug 15 04:04:10 archlinux kernel: RAX: ffffffffffffffda RBX:
> 000000568a3922c8 RCX: 00007f2b463378b7
> Aug 15 04:04:10 archlinux kernel: RDX: 00007fffb2dad910 RSI:
> 000000000000127f RDI: 0000000000000003
> Aug 15 04:04:10 archlinux kernel: RBP: 0000000000000000 R08:
> 0000000000000200 R09: 0000000000000000
> Aug 15 04:04:10 archlinux kernel: R10: 00007fffb2dad870 R11:
> 0000000000000246 R12: 0000000000000000
> Aug 15 04:04:10 archlinux kernel: R13: 0000000000000003 R14:
> 00007fffb2dadae8 R15: 0000000000000000
> 
> which I do not get if I do `blkdiscard -z` on the underlying device
> (which does not support SCSI WRITE SAME) instead of the dm-crypt
> container.

It happens because encryption is too slow, it takes more than 120 seconds 
to clear the device. I don't know what else it should do. It is not easy 
to interrupt the operation because you would then have to deal with 
dangling bios.

> In the first trial I got some more lower-level errors. blkdiscard
> could not exit after the job was seemingly finished (no more write
> according to iostat). The container could not be closed either so I
> had to just disconnect the drive. I could not reproduce it in second
> trial though, so I am not sure if it was just some coincidental
> hardware hiccup. My systemd journal happened to be broken as well so
> the log of that trial was lost.

So, it's probably bug in error handling in the underlying device (or in 
dm-crypt). Was the device that experienced errors the same as the root 
device of the system?

> I also have doubt in the approach. Can't we split the bio chain as per
> how it was chained and allocate memory bio per bio, and if it's not
> enough, also limit the memory allocation with a maximum number
> (arbitrary or not) of bios?

The number of in-flight bios could be also limited in next_bio (because if 
you have really small memory and large device, the memory could be 
exhausted by the allocation of bio structures). But this is not your case.

Note that no pages are allocated in function that does the zeroing 
(__blkdev_issue_zeroout). The function creates large number of bios and 
all the bios point to the same page containing all zeroes. The memory 
allocation happens when dm-crypt attempts to allocate pages that hold the 
encrypted data.

There are other cases where dm-crypt can cause memory exhaustion, so the 
fix in dm-crypt is appropriate. Another case where it blows the memory is 
when you create a device that has an odd number of sectors (so block size 
512 is used), use dm-crypt on this device and write to the encrypted 
device using the block device's page cache. In this case, dm-crypt 
receives large amount of 512-byte bios, allocates a full 4k page for each 
bio and it also exhausts memory. This patch fixes this problem as well.

> On 14 August 2017 at 10:45, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > dm-crypt consumes excessive amount memory when the user attempts to zero
> > a dm-crypt device with "blkdiscard -z". The command "blkdiscard -z" calls
> > the BLKZEROOUT ioctl, it goes to the function __blkdev_issue_zeroout,
> > __blkdev_issue_zeroout sends large amount of write bios that contain the
> > zero page as their payload.
> >
> > For each incoming page, dm-crypt allocates another page that holds the
> > encrypted data, so when processing "blkdiscard -z", dm-crypt tries to
> > allocate the amount of memory that is equal to the size of the device.
> > This can trigger OOM killer or cause system crash.
> >
> > This patch fixes the bug by limiting the amount of memory that dm-crypt
> > allocates to 2% of total system memory.
> >
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > Cc: stable@vger.kernel.org
> >
> > ---
> >  drivers/md/dm-crypt.c |   60 +++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 59 insertions(+), 1 deletion(-)
> >
> > Index: linux-2.6/drivers/md/dm-crypt.c
> > ===================================================================
> > --- linux-2.6.orig/drivers/md/dm-crypt.c
> > +++ linux-2.6/drivers/md/dm-crypt.c
> > @@ -148,6 +148,8 @@ struct crypt_config {
> >         mempool_t *tag_pool;
> >         unsigned tag_pool_max_sectors;
> >
> > +       struct percpu_counter n_allocated_pages;
> > +
> >         struct bio_set *bs;
> >         struct mutex bio_alloc_lock;
> >
> > @@ -219,6 +221,12 @@ struct crypt_config {
> >  #define MAX_TAG_SIZE   480
> >  #define POOL_ENTRY_SIZE        512
> >
> > +static DEFINE_SPINLOCK(dm_crypt_clients_lock);
> > +static unsigned dm_crypt_clients_n = 0;
> > +static volatile unsigned long dm_crypt_pages_per_client;
> > +#define DM_CRYPT_MEMORY_PERCENT                        2
> > +#define DM_CRYPT_MIN_PAGES_PER_CLIENT          (BIO_MAX_PAGES * 16)
> > +
> >  static void clone_init(struct dm_crypt_io *, struct bio *);
> >  static void kcryptd_queue_crypt(struct dm_crypt_io *io);
> >  static struct scatterlist *crypt_get_sg_data(struct crypt_config *cc,
> > @@ -2158,6 +2166,37 @@ static int crypt_wipe_key(struct crypt_c
> >         return r;
> >  }
> >
> > +static void crypt_calculate_pages_per_client(void)
> > +{
> > +       unsigned long pages = (totalram_pages - totalhigh_pages) * DM_CRYPT_MEMORY_PERCENT / 100;
> > +       if (!dm_crypt_clients_n)
> > +               return;
> > +       pages /= dm_crypt_clients_n;
> > +       if (pages < DM_CRYPT_MIN_PAGES_PER_CLIENT)
> > +               pages = DM_CRYPT_MIN_PAGES_PER_CLIENT;
> > +       dm_crypt_pages_per_client = pages;
> > +}
> > +
> > +static void *crypt_page_alloc(gfp_t gfp_mask, void *pool_data)
> > +{
> > +       struct crypt_config *cc = pool_data;
> > +       struct page *page;
> > +       if (unlikely(percpu_counter_compare(&cc->n_allocated_pages, dm_crypt_pages_per_client) >= 0) &&
> > +           likely(gfp_mask & __GFP_NORETRY))
> > +               return NULL;
> > +       page = alloc_page(gfp_mask);
> > +       if (likely(page != NULL))
> > +               percpu_counter_add(&cc->n_allocated_pages, 1);
> > +       return page;
> > +}
> > +
> > +static void crypt_page_free(void *page, void *pool_data)
> > +{
> > +       struct crypt_config *cc = pool_data;
> > +       __free_page(page);
> > +       percpu_counter_sub(&cc->n_allocated_pages, 1);
> > +}
> > +
> >  static void crypt_dtr(struct dm_target *ti)
> >  {
> >         struct crypt_config *cc = ti->private;
> > @@ -2184,6 +2223,10 @@ static void crypt_dtr(struct dm_target *
> >         mempool_destroy(cc->req_pool);
> >         mempool_destroy(cc->tag_pool);
> >
> > +       if (cc->page_pool)
> > +               WARN_ON(percpu_counter_sum(&cc->n_allocated_pages) != 0);
> > +       percpu_counter_destroy(&cc->n_allocated_pages);
> > +
> >         if (cc->iv_gen_ops && cc->iv_gen_ops->dtr)
> >                 cc->iv_gen_ops->dtr(cc);
> >
> > @@ -2198,6 +2241,12 @@ static void crypt_dtr(struct dm_target *
> >
> >         /* Must zero key material before freeing */
> >         kzfree(cc);
> > +
> > +       spin_lock(&dm_crypt_clients_lock);
> > +       WARN_ON(!dm_crypt_clients_n);
> > +       dm_crypt_clients_n--;
> > +       crypt_calculate_pages_per_client();
> > +       spin_unlock(&dm_crypt_clients_lock);
> >  }
> >
> >  static int crypt_ctr_ivmode(struct dm_target *ti, const char *ivmode)
> > @@ -2636,6 +2685,15 @@ static int crypt_ctr(struct dm_target *t
> >
> >         ti->private = cc;
> >
> > +       spin_lock(&dm_crypt_clients_lock);
> > +       dm_crypt_clients_n++;
> > +       crypt_calculate_pages_per_client();
> > +       spin_unlock(&dm_crypt_clients_lock);
> > +
> > +       ret = percpu_counter_init(&cc->n_allocated_pages, 0, GFP_KERNEL);
> > +       if (ret < 0)
> > +               goto bad;
> > +
> >         /* Optional parameters need to be read before cipher constructor */
> >         if (argc > 5) {
> >                 ret = crypt_ctr_optional(ti, argc - 5, &argv[5]);
> > @@ -2690,7 +2748,7 @@ static int crypt_ctr(struct dm_target *t
> >                 ALIGN(sizeof(struct dm_crypt_io) + cc->dmreq_start + additional_req_size,
> >                       ARCH_KMALLOC_MINALIGN);
> >
> > -       cc->page_pool = mempool_create_page_pool(BIO_MAX_PAGES, 0);
> > +       cc->page_pool = mempool_create(BIO_MAX_PAGES, crypt_page_alloc, crypt_page_free, cc);
> >         if (!cc->page_pool) {
> >                 ti->error = "Cannot allocate page mempool";
> >                 goto bad;
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Tom Yan Aug. 15, 2017, 4:51 p.m. UTC | #5

On 15 August 2017 at 08:20, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> It happens because encryption is too slow, it takes more than 120 seconds
> to clear the device. I don't know what else it should do. It is not easy
> to interrupt the operation because you would then have to deal with
> dangling bios.
>

What I find suspicious is `blkdiscard -z` the underlying device does
not lead to those warning (while it takes the same amount of time), it
makes me wonder if it's like, the dm layer is not doing certain thing
it should do to inform the block layer or whatsoever that the action
is ongoing as expected (like, notify about the completion of each
"part" of the request). There must be a reason that doing it to a SCSI
device directly does not trigger such warning, no?

Also when cat or cp /dev/zero to the crypt does cause any problem,
doesn't that to some extent show that the approach
__blkdev_issue_zeroout taken isn't proper/general enough?

>
> So, it's probably bug in error handling in the underlying device (or in
> dm-crypt). Was the device that experienced errors the same as the root
> device of the system?
>

I am not sure. It was two separate drives. The root device was a
usb-storage thumb drive and the drive used for blkdiscard trial was a
SATA drive on a uas adapter. Both of them are connected to a USB 2.0
hub. It might be just the hub having hiccup.

>
> The number of in-flight bios could be also limited in next_bio (because if
> you have really small memory and large device, the memory could be
> exhausted by the allocation of bio structures). But this is not your case.
>

Is the dm layer "aware of" next_bio? I mean, apparently what next_bio
does seems to be "chaining up" bios (of the whole request, which could
be the size of a disk or so), and when the chain reaches the dm layer,
it seems that it wasn't splited (in the "processing" sense, not
"alignment" sense, if you know what I mean) again properly.

>
> Note that no pages are allocated in function that does the zeroing
> (__blkdev_issue_zeroout). The function creates large number of bios and
> all the bios point to the same page containing all zeroes. The memory
> allocation happens when dm-crypt attempts to allocate pages that hold the
> encrypted data.
>

Yeah I know that. And that's where comes the problem. Shouldn't
dm-crypt decide how much to hold base on the capability of the
underlying device instead of the available memory? (e.g. blocks per
command, command per queue, maximum numbers of queues...)

>
> There are other cases where dm-crypt can cause memory exhaustion, so the
> fix in dm-crypt is appropriate. Another case where it blows the memory is
> when you create a device that has an odd number of sectors (so block size
> 512 is used), use dm-crypt on this device and write to the encrypted
> device using the block device's page cache. In this case, dm-crypt
> receives large amount of 512-byte bios, allocates a full 4k page for each
> bio and it also exhausts memory. This patch fixes this problem as well.
>

I am not saying that this patch is like totally wrong though. It's
just that it seems to be more of a supplement / precaution in case
things go terribly wrong, but not a fix that deals with the problem
under the hood.

P.S. I am sorry if any of these are non-sensical. I don't really know
much about how the block layer or the dm layer work. What I said are
pretty much base on instinct so don't mind me if I sounded naive.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

John Stoffel Aug. 17, 2017, 6:50 p.m. UTC | #6

>>>>> "Mikulas" == Mikulas Patocka <mpatocka@redhat.com> writes:

Mikulas> On Mon, 14 Aug 2017, John Stoffel wrote:

>> >>>>> "Mikulas" == Mikulas Patocka <mpatocka@redhat.com> writes:
>> 
Mikulas> dm-crypt consumes excessive amount memory when the user attempts to zero
Mikulas> a dm-crypt device with "blkdiscard -z". The command "blkdiscard -z" calls
Mikulas> the BLKZEROOUT ioctl, it goes to the function __blkdev_issue_zeroout,
Mikulas> __blkdev_issue_zeroout sends large amount of write bios that contain the
Mikulas> zero page as their payload.
>> 
Mikulas> For each incoming page, dm-crypt allocates another page that holds the
Mikulas> encrypted data, so when processing "blkdiscard -z", dm-crypt tries to
Mikulas> allocate the amount of memory that is equal to the size of the device.
Mikulas> This can trigger OOM killer or cause system crash.
>> 
Mikulas> This patch fixes the bug by limiting the amount of memory that dm-crypt
Mikulas> allocates to 2% of total system memory.
>> 
>> Is this limit per-device or system wide?  it's not clear to me if the
>> minimum is just one page, or more so it might be nice to clarify
>> that.

Mikulas> The limit is system-wide. The limit is divided by the number
Mikulas> of active dm-crypt devices and each device receives an equal
Mikulas> share.

Ok, thanks for the clarification.  Can it be expanded dynamically?  I
could see that for large systems, it might not scale well.  

Mikulas> There is a lower bound of BIO_MAX_PAGES * 16. The per-device
Mikulas> limit is never lower than this.

Nice.

>> Also, for large memory systems, that's still alot of pages.  Maybe it
>> should be more exponential in it's clamping as memory sizes go up?  2%
>> of 2T is 4G, which is pretty darn big...

Mikulas> The more we restrict it, the less parallelism is used in dm-crypt, so it 
Mikulas> makes sense to have a big value on machines with many cores.

True... but it's still a concern that it's hardcoded.

Thanks for your replies.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mikulas Patocka Aug. 18, 2017, 4:58 p.m. UTC | #7

On Thu, 17 Aug 2017, John Stoffel wrote:

> >>>>> "Mikulas" == Mikulas Patocka <mpatocka@redhat.com> writes:
> 
> Mikulas> On Mon, 14 Aug 2017, John Stoffel wrote:
> 
> >> >>>>> "Mikulas" == Mikulas Patocka <mpatocka@redhat.com> writes:
> >> 
> Mikulas> dm-crypt consumes excessive amount memory when the user attempts to zero
> Mikulas> a dm-crypt device with "blkdiscard -z". The command "blkdiscard -z" calls
> Mikulas> the BLKZEROOUT ioctl, it goes to the function __blkdev_issue_zeroout,
> Mikulas> __blkdev_issue_zeroout sends large amount of write bios that contain the
> Mikulas> zero page as their payload.
> >> 
> Mikulas> For each incoming page, dm-crypt allocates another page that holds the
> Mikulas> encrypted data, so when processing "blkdiscard -z", dm-crypt tries to
> Mikulas> allocate the amount of memory that is equal to the size of the device.
> Mikulas> This can trigger OOM killer or cause system crash.
> >> 
> Mikulas> This patch fixes the bug by limiting the amount of memory that dm-crypt
> Mikulas> allocates to 2% of total system memory.
> >> 
> >> Is this limit per-device or system wide?  it's not clear to me if the
> >> minimum is just one page, or more so it might be nice to clarify
> >> that.
> 
> Mikulas> The limit is system-wide. The limit is divided by the number
> Mikulas> of active dm-crypt devices and each device receives an equal
> Mikulas> share.
> 
> Ok, thanks for the clarification.  Can it be expanded dynamically?  I
> could see that for large systems, it might not scale well.  
> 
> Mikulas> There is a lower bound of BIO_MAX_PAGES * 16. The per-device
> Mikulas> limit is never lower than this.
> 
> Nice.
> 
> >> Also, for large memory systems, that's still alot of pages.  Maybe it
> >> should be more exponential in it's clamping as memory sizes go up?  2%
> >> of 2T is 4G, which is pretty darn big...
> 
> Mikulas> The more we restrict it, the less parallelism is used in dm-crypt, so it 
> Mikulas> makes sense to have a big value on machines with many cores.
> 
> True... but it's still a concern that it's hardcoded.
> 
> Thanks for your replies.

The limit can't be changed. I expect that no one will run a system with 
such a tight requirements that the memory is used up to the last 2%.

If the user is near hitting the memory limit, he can just add swap space 
to solve the problem instead of tuning dm-crypt.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

John Stoffel Aug. 18, 2017, 7:02 p.m. UTC | #8

>>>>> "Mikulas" == Mikulas Patocka <mpatocka@redhat.com> writes:

Mikulas> On Thu, 17 Aug 2017, John Stoffel wrote:

>> >>>>> "Mikulas" == Mikulas Patocka <mpatocka@redhat.com> writes:
>> 
Mikulas> On Mon, 14 Aug 2017, John Stoffel wrote:
>> 
>> >> >>>>> "Mikulas" == Mikulas Patocka <mpatocka@redhat.com> writes:
>> >> 
Mikulas> dm-crypt consumes excessive amount memory when the user attempts to zero
Mikulas> a dm-crypt device with "blkdiscard -z". The command "blkdiscard -z" calls
Mikulas> the BLKZEROOUT ioctl, it goes to the function __blkdev_issue_zeroout,
Mikulas> __blkdev_issue_zeroout sends large amount of write bios that contain the
Mikulas> zero page as their payload.
>> >> 
Mikulas> For each incoming page, dm-crypt allocates another page that holds the
Mikulas> encrypted data, so when processing "blkdiscard -z", dm-crypt tries to
Mikulas> allocate the amount of memory that is equal to the size of the device.
Mikulas> This can trigger OOM killer or cause system crash.
>> >> 
Mikulas> This patch fixes the bug by limiting the amount of memory that dm-crypt
Mikulas> allocates to 2% of total system memory.
>> >> 
>> >> Is this limit per-device or system wide?  it's not clear to me if the
>> >> minimum is just one page, or more so it might be nice to clarify
>> >> that.
>> 
Mikulas> The limit is system-wide. The limit is divided by the number
Mikulas> of active dm-crypt devices and each device receives an equal
Mikulas> share.
>> 
>> Ok, thanks for the clarification.  Can it be expanded dynamically?  I
>> could see that for large systems, it might not scale well.  
>> 
Mikulas> There is a lower bound of BIO_MAX_PAGES * 16. The per-device
Mikulas> limit is never lower than this.
>> 
>> Nice.
>> 
>> >> Also, for large memory systems, that's still alot of pages.  Maybe it
>> >> should be more exponential in it's clamping as memory sizes go up?  2%
>> >> of 2T is 4G, which is pretty darn big...
>> 
Mikulas> The more we restrict it, the less parallelism is used in dm-crypt, so it 
Mikulas> makes sense to have a big value on machines with many cores.
>> 
>> True... but it's still a concern that it's hardcoded.
>> 
>> Thanks for your replies.

Mikulas> The limit can't be changed. I expect that no one will run a system with 
Mikulas> such a tight requirements that the memory is used up to the last 2%.

Mikulas> If the user is near hitting the memory limit, he can just add swap space 
Mikulas> to solve the problem instead of tuning dm-crypt.

My real concern is that we build up too many outstanding BIOs so that
they take along time to process.  I'm already seeing some reports
about timeouts after 120 seconds due to stacking of dm-crypt with
other stuff.

What I'm trying to say is that as memory grows, you need to slow down
the growth of the amount of memory that dm-crypt can allocate.  It's
more important to scale by the speed of the IO sub-system than the
size of memory.

But hey, you really addresed my concern with smaller memory systems
already.

John

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mikulas Patocka Aug. 19, 2017, 12:18 a.m. UTC | #9

On Fri, 18 Aug 2017, John Stoffel wrote:

> Mikulas> The limit can't be changed. I expect that no one will run a system with 
> Mikulas> such a tight requirements that the memory is used up to the last 2%.
> 
> Mikulas> If the user is near hitting the memory limit, he can just add swap space 
> Mikulas> to solve the problem instead of tuning dm-crypt.
> 
> My real concern is that we build up too many outstanding BIOs so that
> they take along time to process.  I'm already seeing some reports
> about timeouts after 120 seconds due to stacking of dm-crypt with
> other stuff.
> 
> What I'm trying to say is that as memory grows, you need to slow down
> the growth of the amount of memory that dm-crypt can allocate.  It's
> more important to scale by the speed of the IO sub-system than the
> size of memory.
> 
> But hey, you really addresed my concern with smaller memory systems
> already.
> 
> John

This patch doesn't restrict the number of in-flight bios. Lowering the 
memory limit has no effect on the number of bios being allocated by some 
other part of the kernel and submitted for dm-crypt.

Some times ago I made a patch that limits the number of in-flight bios in 
device mapper - see this:
http://people.redhat.com/~mpatocka/patches/kernel/dm-limit-outstanding-bios/

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Bart Van Assche Aug. 19, 2017, 12:59 a.m. UTC | #10

On Fri, 2017-08-18 at 20:18 -0400, Mikulas Patocka wrote:
> Some times ago I made a patch that limits the number of in-flight bios in 
> device mapper - see this:
> http://people.redhat.com/~mpatocka/patches/kernel/dm-limit-outstanding-bios/

It would be great to have this functionality upstream ... Do you have any
plans to submit this patch to the device mapper maintainer, Mike Snitzer?

Bart.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mikulas Patocka Aug. 19, 2017, 5:34 p.m. UTC | #11

On Wed, 16 Aug 2017, Tom Yan wrote:

> What I find suspicious is `blkdiscard -z` the underlying device does
> not lead to those warning (while it takes the same amount of time), it
> makes me wonder if it's like, the dm layer is not doing certain thing
> it should do to inform the block layer or whatsoever that the action
> is ongoing as expected (like, notify about the completion of each
> "part" of the request). There must be a reason that doing it to a SCSI
> device directly does not trigger such warning, no?

If you issue a single ioctl that takes extreme amount of time - the kernel 
warns about being blocked for extreme amount of time - what else should it 
do?

You can change submit_bio_wait to call wait_for_completion_io_timeout 
instead of wait_for_completion_io (and retry if the timeout was hit) - 
that will shut the warning up, but it won't speed up the operation.

The zeroing operation can't be made interruptible easily because if you 
the operation were interrupted, there would be pending bios left.

> Also when cat or cp /dev/zero to the crypt does cause any problem,

cp /dev/zero generates large amount of small write syscalls (so every 
single syscall completes quickly). blkdiscard -z generates one 
long-lasting syscall that clears the whole device.

> doesn't that to some extent show that the approach
> __blkdev_issue_zeroout taken isn't proper/general enough?
> 
> >
> > So, it's probably bug in error handling in the underlying device (or in
> > dm-crypt). Was the device that experienced errors the same as the root
> > device of the system?
> >
> 
> I am not sure. It was two separate drives. The root device was a
> usb-storage thumb drive and the drive used for blkdiscard trial was a
> SATA drive on a uas adapter. Both of them are connected to a USB 2.0
> hub. It might be just the hub having hiccup.
> 
> >
> > The number of in-flight bios could be also limited in next_bio (because if
> > you have really small memory and large device, the memory could be
> > exhausted by the allocation of bio structures). But this is not your case.
> >
> 
> Is the dm layer "aware of" next_bio? I mean, apparently what next_bio
> does seems to be "chaining up" bios (of the whole request, which could
> be the size of a disk or so), and when the chain reaches the dm layer,
> it seems that it wasn't splited (in the "processing" sense, not
> "alignment" sense, if you know what I mean) again properly.

The chained bios arrive independently to the dm layer and the dm layer 
doesn't care about how they are changed.

> > Note that no pages are allocated in function that does the zeroing
> > (__blkdev_issue_zeroout). The function creates large number of bios and
> > all the bios point to the same page containing all zeroes. The memory
> > allocation happens when dm-crypt attempts to allocate pages that hold the
> > encrypted data.
> >
> 
> Yeah I know that. And that's where comes the problem. Shouldn't
> dm-crypt decide how much to hold base on the capability of the
> underlying device instead of the available memory? (e.g. blocks per
> command, command per queue, maximum numbers of queues...)

Changing the memory limit in dm-crypt doesn't change the number of bios 
that dm-crypt receives. If you lower the memory limit, you decrease 
performance (because there would be less parallelism) and you gain 
nothing. It won't solve the long-wait problem.

> > There are other cases where dm-crypt can cause memory exhaustion, so the
> > fix in dm-crypt is appropriate. Another case where it blows the memory is
> > when you create a device that has an odd number of sectors (so block size
> > 512 is used), use dm-crypt on this device and write to the encrypted
> > device using the block device's page cache. In this case, dm-crypt
> > receives large amount of 512-byte bios, allocates a full 4k page for each
> > bio and it also exhausts memory. This patch fixes this problem as well.
> >
> 
> I am not saying that this patch is like totally wrong though. It's
> just that it seems to be more of a supplement / precaution in case
> things go terribly wrong, but not a fix that deals with the problem
> under the hood.
> 
> P.S. I am sorry if any of these are non-sensical. I don't really know
> much about how the block layer or the dm layer work. What I said are
> pretty much base on instinct so don't mind me if I sounded naive.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

John Stoffel Aug. 21, 2017, 2:06 p.m. UTC | #12

>>>>> "Mikulas" == Mikulas Patocka <mpatocka@redhat.com> writes:

Mikulas> On Fri, 18 Aug 2017, John Stoffel wrote:

Mikulas> The limit can't be changed. I expect that no one will run a system with 
Mikulas> such a tight requirements that the memory is used up to the last 2%.
>> 
Mikulas> If the user is near hitting the memory limit, he can just add swap space 
Mikulas> to solve the problem instead of tuning dm-crypt.
>> 
>> My real concern is that we build up too many outstanding BIOs so that
>> they take along time to process.  I'm already seeing some reports
>> about timeouts after 120 seconds due to stacking of dm-crypt with
>> other stuff.
>> 
>> What I'm trying to say is that as memory grows, you need to slow down
>> the growth of the amount of memory that dm-crypt can allocate.  It's
>> more important to scale by the speed of the IO sub-system than the
>> size of memory.
>> 
>> But hey, you really addresed my concern with smaller memory systems
>> already.
>> 
>> John

Mikulas> This patch doesn't restrict the number of in-flight bios. Lowering the 
Mikulas> memory limit has no effect on the number of bios being allocated by some 
Mikulas> other part of the kernel and submitted for dm-crypt.

Good to know.  

Mikulas> Some times ago I made a patch that limits the number of in-flight bios in 
Mikulas> device mapper - see this:
Mikulas> http://people.redhat.com/~mpatocka/patches/kernel/dm-limit-outstanding-bios/

Interesting.  I'll try to find some time to look this over. 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Tom Yan Aug. 25, 2017, 4:58 a.m. UTC | #13

On 20 August 2017 at 01:34, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> If you issue a single ioctl that takes extreme amount of time - the kernel
> warns about being blocked for extreme amount of time - what else should it
> do?
>

But as I said, it does NOT warn about being blocked if the ioctl is
issued for a SCSI device (i.e. blkdiscard -z /dev/sdX), so why the
warning occurs / cannot be avoided in the case of a dm(-crypt)
container?

One thing I can see is, when blkdiscard -z is running on a dm-crypt
container, iostat list it with a big avgqu-sz (not to be mixed up with
avgrq-sz), starting from ~40000 (the underlying device is a 37G
partition), gradually dropping to 0 (when the job is finished) with a
step of ~40; all other items are 0 for the container (except %util,
also 100% for most of the time). While for the underlying device, no
matter if it is written to through a dm-crypt container, its lines
stays at something like these:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00   85.00     0.00 43520.00
1024.00   144.71 1664.89    0.00 1664.89  11.76 100.00
sdb               0.00     0.00    0.00   86.00     0.00 44032.00
1024.00   142.34 1694.70    0.00 1694.70  11.63 100.00

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Mikulas Patocka Sept. 11, 2017, 11:57 p.m. UTC | #14

On Fri, 25 Aug 2017, Tom Yan wrote:

> On 20 August 2017 at 01:34, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> > If you issue a single ioctl that takes extreme amount of time - the kernel
> > warns about being blocked for extreme amount of time - what else should it
> > do?
> >
> 
> But as I said, it does NOT warn about being blocked if the ioctl is
> issued for a SCSI device (i.e. blkdiscard -z /dev/sdX), so why the
> warning occurs / cannot be avoided in the case of a dm(-crypt)
> container?

The SCSI disk blocks in the make_request_fn routine, dm-crypt accepts the 
requests without blocking there. If you set high nr_requests value for the 
SCSI disk, it would not block and the same warning would occur.

> One thing I can see is, when blkdiscard -z is running on a dm-crypt
> container, iostat list it with a big avgqu-sz (not to be mixed up with
> avgrq-sz), starting from ~40000 (the underlying device is a 37G

There is large nubmer of requests in flight, so it reports large number of 
requests in flight. This isn't problem. The disk has 37G, average request 
size is 1024kB, 37G/1024k=37888.

Mikulas

> partition), gradually dropping to 0 (when the job is finished) with a
> step of ~40; all other items are 0 for the container (except %util,
> also 100% for most of the time). While for the underlying device, no
> matter if it is written to through a dm-crypt container, its lines
> stays at something like these:
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdb               0.00     0.00    0.00   85.00     0.00 43520.00
> 1024.00   144.71 1664.89    0.00 1664.89  11.76 100.00
> sdb               0.00     0.00    0.00   86.00     0.00 44032.00
> 1024.00   142.34 1694.70    0.00 1694.70  11.63 100.00
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

dm-crypt: limit the number of allocated pages

Commit Message

Comments

Patch