diff mbox

One liner patch for prd to avoid "kernel BUG at drivers/block/pmem.c:176!"

Message ID 54E07788.8060907@plexistor.com (mailing list archive)
State New, archived
Headers show

Commit Message

Boaz Harrosh Feb. 15, 2015, 10:40 a.m. UTC
On 02/13/2015 03:50 AM, Roger C. Pao (Enmotus) wrote:
> <hugh.daschbach@enmotus.com> wrote:
>      The biovec can present a size greater that PAGE_SIZE if an I/O 
> buffer contains physically contiguous pages.  This may be unusual for 
> user space pages, as the virtual to physical memory map gets fragmented. 
>   But for buffers allocated by the kernel with kmalloc, physical memory 
> will be contiguous.
>      Even if a single I/O request does not contain two contiguous pages, 
> the block layer may merge two requests that have contiguous pages.  It 
> will then attempt to coalesce biovecs.  You probably won't see that if 
> you avoid the I/O scheduler by capturing requests at make_request.  But 
> it is still a good idea to declare your devices segment size limitation 
> with blk_queue_max_segment_size.  There are a couple drivers in 
> drivers/block/ that do just that to limit segments to PAGE_SIZE.

Ye only I cannot find this code in block layer except of course as I said
in blk_rq_map_sg which only SCSI layer uses (and maybe some other block

In fact I know that the Kernel goes through great pain to call split_huge_pages
when it needs to SWAP anon HUGE pages through the block layer.

The one thing that looks very weird in what you are saying is the end_io API
it is called per bio each bio has biovec of a single page. If the block layer
goes on a coalescing spree than that would be unexpected for a caller who
expect a notification per bio per page, for example to call put_page() on each
page or page_unlock(). So the block layer must present back a one-to-one pages to
submitted pages. What are you saying that the block layer has code to coalesce
contiguous pages then un-coalesce them before the call to end_io. Where is that
info stored? (This is exactly what happens with  blk_rq_map_sg() coalescing is
done on the extra allocated sg_list and the bio is left as is for end_io

I'm not saying it is not there, only that I cannot find it.

>      We caught this because the Enmotus tiering engine issues rather 
> large I/O requests to buffers that were allocated with kmalloc.  It is 
> fairly common for the tiering engine to allocate I/O buffers of 64KB or 
> greater.  If the underlying block device supports it, we will submit a 
> bio with a biovec mapping many contiguous pages.  The entire buffer will 
> possibly be mapped by a single biovec.  

OK So  you have a special Kernel modules that maps these biovecs not through
the use of bio_add_page, what API do you use? do you hack in biovec members

> The tiering engine uses 
> max_segment_size to determine how to build it's biovec list.

That said I don't mind supporting this, in pmem. Do you know of an API
parallel to now used kmap_atomic(PAGE) that will return a valid pointer
for more than a PAGE_SIZE.

So OK I could easily use page_address + page_to_pfn, pfn_to_page, but that
is only good for 64bit. Now we are already very limited on 32bit platforms
anyway (Sub giga size NvDIMM), and I do not mind at all to limit pmem.c to
only compile on 64bit.

And actually in 64bit kmap_atomic(PAGE) will give me a valid virtual-pointer
for larger than PAGE_SIZE, since as you say these are pfn contiguous and the
kernel mapping is flat.

But what are the API's that we are suppose to use with such pages-pointers
that actually point to a more-then-one-page.
I do not want to have to crash and fix this code for ARCHs other than x86_64

>      Thanks for considering the change.
>      Hugh

Here below is a simple patch for pmem to support such pages on x86_64
NOTE: This is based and is only good on my tree, with the simplified code.

All we need to do is remove the BUG_ON.

[Which is badly placed because the limiting condition is the call to
kmap_atomic(page) not in the below function but in pmem_do_bvec() so
the BUG_ON should have been there.]


diff mbox


diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 5eced12..26d61da 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -120,7 +120,6 @@  static void pmem_make_request(struct request_queue *q, struct bio *bio)
 	bio_for_each_segment(bvec, bio, iter) {
 		unsigned int len = bvec.bv_len;
-		BUG_ON(len > PAGE_SIZE);
 		pmem_do_bvec(pmem, bvec.bv_page, len,
 			    bvec.bv_offset, rw, sector);
 		sector += len >> SECTOR_SHIFT;