Message ID | 20180517205343.8401-1-qing.huang@oracle.com (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
On 05/17/2018 01:53 PM, Qing Huang wrote:
> When a system is under memory pressure (high usage with fragments),
> the original 256KB ICM chunk allocations will likely trigger kernel
> memory management to enter slow path doing memory compact/migration
> ops in order to complete high order memory allocations.
>
> When that happens, user processes calling uverb APIs may get stuck
> for more than 120s easily even though there are a lot of free pages
> in smaller chunks available in the system.
>
> Syslog:
> ...
> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
> oracle_205573_e:205573 blocked for more than 120 seconds.
> ...
>

NACK on this patch.

You have been asked repeatedly to use kvmalloc().

This is not a minor suggestion.

Take a look at
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d

And you'll understand some people care about this.

Strongly.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 5/17/2018 2:14 PM, Eric Dumazet wrote:
> On 05/17/2018 01:53 PM, Qing Huang wrote:
>> When a system is under memory pressure (high usage with fragments),
>> the original 256KB ICM chunk allocations will likely trigger kernel
>> memory management to enter slow path doing memory compact/migration
>> ops in order to complete high order memory allocations.
>>
>> When that happens, user processes calling uverb APIs may get stuck
>> for more than 120s easily even though there are a lot of free pages
>> in smaller chunks available in the system.
>>
>> Syslog:
>> ...
>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>> oracle_205573_e:205573 blocked for more than 120 seconds.
>> ...
>>
> NACK on this patch.
>
> You have been asked repeatedly to use kvmalloc()
>
> This is not a minor suggestion.
>
> Take a look at
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d

Would you please take a look at how table->icm is being used in the mlx4
driver? It's metadata used for referencing individual chunk pointers,
not a data fragment or an in/out buffer. It has no need for contiguous
physical memory.

Thanks.

> And you'll understand some people care about this.
>
> Strongly.
>
> Thanks.
On 18/05/2018 12:45 AM, Qing Huang wrote:
>
> On 5/17/2018 2:14 PM, Eric Dumazet wrote:
>> On 05/17/2018 01:53 PM, Qing Huang wrote:
>>> When a system is under memory pressure (high usage with fragments),
>>> the original 256KB ICM chunk allocations will likely trigger kernel
>>> memory management to enter slow path doing memory compact/migration
>>> ops in order to complete high order memory allocations.
>>>
>>> When that happens, user processes calling uverb APIs may get stuck
>>> for more than 120s easily even though there are a lot of free pages
>>> in smaller chunks available in the system.
>>>
>>> Syslog:
>>> ...
>>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>>> oracle_205573_e:205573 blocked for more than 120 seconds.
>>> ...
>>>
>> NACK on this patch.
>>
>> You have been asked repeatedly to use kvmalloc()
>>
>> This is not a minor suggestion.
>>
>> Take a look at
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d
>>
>
> Would you please take a look at how table->icm is being used in the
> mlx4 driver? It's metadata used for referencing individual chunk
> pointers, not a data fragment or an in/out buffer. It has no need for
> contiguous physical memory.
>
> Thanks.
>

NACK.

This would cause a degradation when iterating the entries of table->icm,
for example in mlx4_table_get_range().

Thanks,
Tariq

>> And you'll understand some people care about this.
>>
>> Strongly.
>>
>> Thanks.
>>
>
On 5/22/2018 8:33 AM, Tariq Toukan wrote:
>
> On 18/05/2018 12:45 AM, Qing Huang wrote:
>>
>> On 5/17/2018 2:14 PM, Eric Dumazet wrote:
>>> On 05/17/2018 01:53 PM, Qing Huang wrote:
>>>> When a system is under memory pressure (high usage with fragments),
>>>> the original 256KB ICM chunk allocations will likely trigger kernel
>>>> memory management to enter slow path doing memory compact/migration
>>>> ops in order to complete high order memory allocations.
>>>>
>>>> When that happens, user processes calling uverb APIs may get stuck
>>>> for more than 120s easily even though there are a lot of free pages
>>>> in smaller chunks available in the system.
>>>>
>>>> Syslog:
>>>> ...
>>>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>>>> oracle_205573_e:205573 blocked for more than 120 seconds.
>>>> ...
>>>>
>>> NACK on this patch.
>>>
>>> You have been asked repeatedly to use kvmalloc()
>>>
>>> This is not a minor suggestion.
>>>
>>> Take a look at
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d
>>>
>>
>> Would you please take a look at how table->icm is being used in the
>> mlx4 driver? It's metadata used for referencing individual chunk
>> pointers, not a data fragment or an in/out buffer. It has no need for
>> contiguous physical memory.
>>
>> Thanks.
>>
>
> NACK.
>
> This would cause a degradation when iterating the entries of table->icm,
> for example in mlx4_table_get_range().

E.g.

int mlx4_table_get_range(struct mlx4_dev *dev, struct mlx4_icm_table *table,
			 u32 start, u32 end)
{
	int inc = MLX4_TABLE_CHUNK_SIZE / table->obj_size;
	int err;
	u32 i;

	for (i = start; i <= end; i += inc) {
		err = mlx4_table_get(dev, table, i);
		if (err)
			goto fail;
	}

	return 0;
...
}

E.g. an mtt object is 8 bytes, so a 4KB ICM chunk holds 512 mtt objects.
You would have to allocate 512 more mtt objects before the table->icm
index increments by 1 to fetch the next chunk pointer.
So 256K mtt objects are needed in order to traverse the table->icm
pointer array across a page boundary in these call stacks. Considering
that mlx4_table_get_range() is only used in the control path, there is
no significant gain from using kvzalloc vs. vzalloc for table->icm.

Anyway, if a user makes sure the mlx4 driver is loaded very early and
doesn't remove and reload it afterwards, there should be enough (and no
wasted) contiguous physical memory for the table->icm allocation. I will
use kvzalloc to replace vzalloc and send a V4 patch.

Thanks,
Qing

> Thanks,
> Tariq
>
>>> And you'll understand some people care about this.
>>>
>>> Strongly.
>>>
>>> Thanks.
>>>
>>
>
diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..3705f4e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -33,6 +33,7 @@
 
 #include <linux/errno.h>
 #include <linux/mm.h>
+#include <linux/vmalloc.h>
 #include <linux/scatterlist.h>
 #include <linux/slab.h>
 
@@ -43,12 +44,12 @@
 #include "fw.h"
 
 /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
  */
 enum {
-	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
-	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
+	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
+	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
 };
 
 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +401,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
 	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
 
-	table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+	table->icm = vzalloc(num_icm * sizeof(*table->icm));
 	if (!table->icm)
 		return -ENOMEM;
 	table->virt = virt;
@@ -446,7 +447,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 			mlx4_free_icm(dev, table->icm[i], use_coherent);
 	}
 
-	kfree(table->icm);
+	vfree(table->icm);
 
 	return -ENOMEM;
 }
@@ -462,5 +463,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
 			mlx4_free_icm(dev, table->icm[i], table->coherent);
 	}
 
-	kfree(table->icm);
+	vfree(table->icm);
 }