
[v2] dm: add topology support

Message ID 1244524492-2292-1-git-send-email-snitzer@redhat.com (mailing list archive)
State Accepted, archived
Delegated to: Alasdair Kergon

Commit Message

Mike Snitzer June 9, 2009, 5:14 a.m. UTC
Use blk_stack_limits() to stack block limits (including topology) rather
than duplicate the equivalent within Device Mapper.

This patch depends on commit 77634f33d4078542cf1087995cced0ffdae25aa2
from linux-2.6-block +for-next

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-table.c         |  145 +++++++++++-------------------------------
 include/linux/device-mapper.h |   16 ----
 2 files changed, 43 insertions(+), 118 deletions(-)
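For anyone not yet familiar with the new interface, here is a rough
sketch of what blk_stack_limits() does for a few representative fields
(simplified from block/blk-settings.c in linux-2.6-block; the real
function also handles io_opt, alignment_offset, bounce_pfn and more,
and min_not_zero() is the same macro dm-table.c uses):

static int blk_stack_limits_sketch(struct queue_limits *t,
				   struct queue_limits *b, sector_t offset)
{
	/* "ceiling" limits stack by taking the smallest non-zero value */
	t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
	t->max_hw_sectors = min_not_zero(t->max_hw_sectors,
					 b->max_hw_sectors);
	t->max_segment_size = min_not_zero(t->max_segment_size,
					   b->max_segment_size);

	/* "granularity" limits stack by taking the largest value */
	t->logical_block_size = max(t->logical_block_size,
				    b->logical_block_size);
	t->physical_block_size = max(t->physical_block_size,
				     b->physical_block_size);
	t->io_min = max(t->io_min, b->io_min);

	/* a component that starts off a physical-block boundary is flagged
	 * (offset is the component's start within the stacked device) */
	if (offset & (b->physical_block_size - 1))
		return -1;

	return 0;
}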


Comments

Martin K. Petersen June 10, 2009, 6:51 a.m. UTC | #1
>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> Use blk_stack_limits() to stack block limits (including topology)
Mike> rather than duplicate the equivalent within Device Mapper.

Great!  So how are the userland bits coming along?
Mike Snitzer June 10, 2009, 2 p.m. UTC | #2
On Wed, Jun 10 2009 at  2:51am -0400,
Martin K. Petersen <martin.petersen@oracle.com> wrote:

> >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:
> 
> Mike> Use blk_stack_limits() to stack block limits (including topology)
> Mike> rather than duplicate the equivalent within Device Mapper.
> 
> Great!  So how are the userland bits coming along?

They should be pretty well sorted out with the patches I posted to
lvm-devel early Monday morning (code review still needed):
https://www.redhat.com/archives/lvm-devel/2009-June/msg00037.html

As for the DM changes: Alasdair reviewed v2 of the patch and found an
issue that I still need a final answer on.

Your change to dm-table.c:dm_table_set_restrictions() in linux-2.6-block
+for-next (commit ae03bf639a5027d27270123f5f6e3ee6a412781d) introduced
calls to the blk_queue_*() setters.

My v2 of the DM topology support patch does away with those and just
uses blk_stack_limits().  In the process we go back to _not_ using the
blk_queue_*() setters (note the additional checks that those setters
have).
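
For context, this is the kind of extra validation those setters carry
and a bare blk_stack_limits() call would bypass (paraphrased from the
current block/blk-settings.c, not quoted verbatim):

void blk_queue_max_sectors(struct request_queue *q, unsigned int max_sectors)
{
	/* clamp nonsensically small values up to one page and warn */
	if ((max_sectors << 9) < PAGE_CACHE_SIZE) {
		max_sectors = 1 << (PAGE_CACHE_SHIFT - 9);
		printk(KERN_INFO "%s: set to minimum %d\n",
		       __func__, max_sectors);
	}

	/* cap the default IO size while preserving the hardware limit */
	if (max_sectors < BLK_DEF_MAX_SECTORS)
		q->limits.max_hw_sectors = q->limits.max_sectors = max_sectors;
	else {
		q->limits.max_sectors = BLK_DEF_MAX_SECTORS;
		q->limits.max_hw_sectors = max_sectors;
	}
}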

So the question: is _not_ using the blk_queue_*() setters perfectly
fine?  Given that DM has always _not_ used them, the quick answer is
"seems fine".

But I need to dig a bit more to understand if the additional logic in
the blk_queue_*() setters is something DM shouldn't be circumventing.

But we're almost done with these DM/LVM2 topology bits... really :)

Mike

Alasdair G Kergon June 10, 2009, 2:36 p.m. UTC | #3
On Wed, Jun 10, 2009 at 10:00:29AM -0400, Mike Snitzer wrote:
> My v2 of the DM topology support patch does away with those and just
> uses blk_stack_limits().  In the process we go back to _not_ using the
> blk_queue_*() setters (note the additional checks that those setters
> have).
 
That was a secondary point, which could be cleared up by reviewing the
checks to ensure they all automatically hold and so never need checking
at that point.

The main problem was that, previously, applying the combined limits for
the table - correctly - paid no regard to the existing queue limits
on the dm device.  The new code seems to combine the limits between the
live and inactive tables, which is nonsense, as in dm terms they are
totally independent.

Alasdair

Martin K. Petersen June 10, 2009, 5:58 p.m. UTC | #4
>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> So the question: is _not_ using the blk_queue_*() setters
Mike> perfectly fine?  Given that DM has always _not_ used them the
Mike> quick answer is "seems fine".

Mike> But I need to dig a bit more to understand if the additional logic
Mike> in the blk_queue_*() setters is something DM shouldn't be
Mike> circumventing.

The original intent was that drivers like DM and MD would seed their
limits using the blk_queue* calls before adding any component devices.
blk_stack_limits() would then scale accordingly for every new device
added.

Is there any reason in particular that this approach wouldn't work for
DM?  I.e. set defaults ahead of time instead of doing it upon table
completion using check_for_valid_limits()?
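
Roughly this hypothetical shape (the dm_* helper names here are
invented; only the blk_* calls and the constants are real interfaces):

/* phase 0: seed sane defaults through the setters, before any stacking */
static void dm_init_queue_limits(struct request_queue *q)
{
	blk_queue_max_sectors(q, SAFE_MAX_SECTORS);
	blk_queue_logical_block_size(q, 1 << SECTOR_SHIFT);
	blk_queue_segment_boundary(q, BLK_SEG_BOUNDARY_MASK);
}

/* then every component device scales the seeded limits up or down */
static void dm_stack_one_device(struct request_queue *q,
				struct block_device *bdev)
{
	blk_stack_limits(&q->limits, &bdev_get_queue(bdev)->limits, 0);
}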
Mike Snitzer June 10, 2009, 9:12 p.m. UTC | #5
On Wed, Jun 10 2009 at  1:58pm -0400,
Martin K. Petersen <martin.petersen@oracle.com> wrote:

> >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:
> 
> Mike> So the question: is _not_ using the blk_queue_*() setters
> Mike> perfectly fine?  Given that DM has always _not_ used them the
> Mike> quick answer is "seems fine".
> 
> Mike> But I need to dig a bit more to understand if the additional logic
> Mike> in the blk_queue_*() setters is something DM shouldn't be
> Mike> circumventing.
> 
> The original intent was that drivers like DM and MD would seed their
> limits using the blk_queue* calls before adding any component devices.
> blk_stack_limits() would then scale accordingly for every new device
> added.
> 
> Is there any reason in particular that this approach wouldn't work for
> DM?  I.e. set defaults ahead of time instead of doing it upon table
> completion using check_for_valid_limits()?

When a table is pushed to DM, each target device in the table may have
different limits.  There is no one-size-fits-all default.

With DM, the underlying device that you're sending IO to is a function
of offset into the DM device.  Therefore, the associated IO limits
should really be a function of offset.

Understanding that this is a more complicated proposition, we're
working with the topology infrastructure that you've put together as
best we can.

That means we use a device's alignment_offset in userland LVM2 to push
down a data area whose start+size is aligned.  This gives us the
guarantee that each device in a given DM table is aligned.  From there
DM will use blk_stack_limits() to combine all the other topology limits
(alignment_offset is no longer an issue, though we could still get some
warnings; we may look to silence them in dm-table.c).
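
For illustration, the userland calculation amounts to something like
this (a sketch of the idea only, not the LVM2 implementation; the real
code is in the lvm-devel patches linked above):

#include <stdint.h>

/*
 * Place a data area so it starts on an io_min boundary, shifted by
 * alignment_offset to undo the device's LBA-0 skew.
 */
static uint64_t aligned_data_area_start(uint64_t default_start,
					uint64_t io_min,
					uint64_t alignment_offset)
{
	uint64_t start = default_start;

	/* round the default start up to the next io_min boundary */
	if (io_min)
		start = ((start + io_min - 1) / io_min) * io_min;

	/* then add alignment_offset so physical boundaries line up */
	return start + alignment_offset;
}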

But blk_stack_limits() leads to a situation where the combined limits
(io_min, logical_block_size) are not ideal for all offsets into the
resulting DM device (e.g. issuing larger IOs to some target devices than
would otherwise be needed).
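
A concrete (made-up) example: a linear table that concatenates a
512-byte-sector disk with a 4KB-sector disk ends up advertising
logical_block_size=4096 and io_min=4096 for the whole DM device, so the
512-byte half now receives 4KB-granularity IO it never required.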

This is not new for DM (historically DM stacked hardsect_size and other
limits in the same fashion), but we are keeping in mind that limits as
a function of offset would be more appropriate.

Mike

Martin K. Petersen June 10, 2009, 10:06 p.m. UTC | #6
>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> When a table is pushed to DM each target device in the table may
Mike> have different limits.  There is no one-size-fits-all default.

But incrementally finding the one size that does fit all (DM dev as well
as targets) is the raison d'être for blk_stack_limits().

The default limits should all be set by the block layer when setting up
the request queue.  So my reason for inquiring was to figure out whether
check_for_valid_limits() actually makes any difference.


Mike> With DM, the underlying device that you're sending IO to is a
Mike> function of offset into the DM device.  Therefore, the associated
Mike> IO limits should really be a function of offset.

This was one of the discussion topics at the Storage and Filesystems
workshop a couple of months ago.  My original topology code actually
permitted a list of topologies for a block device, with all the
attendant nightmares of splicing and dicing lists in the stacking
function.  It was not pretty, and the general consensus at the workshop
was that this was way too complex.  I agreed and the code was gutted.

And even having a topology list failed to correctly describe some valid
configurations.  One pathological case is a 512-byte logical/physical
drive in a mirror with a 512/4KB, odd-aligned one.  How do you define a
topology for that?  The solution is to adjust the alignment and scale
up io_min to match the 4KB drive, i.e. to change things for the drive
that isn't the "problem".
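
To put illustrative numbers on that case (not taken from any real
hardware):

  drive A: logical 512, physical  512, io_min  512, alignment_offset    0
  drive B: logical 512, physical 4096, io_min 4096, alignment_offset 3584

  stacked: logical 512, physical 4096, io_min 4096, alignment_offset 3584

Drive A is simply treated as if it were a 4KB drive, and the data area
is shifted by drive B's alignment_offset so that every io_min-sized
write also lands on one of B's 4KB boundaries.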


Mike> That means we use a device's alignment_offset in userland LVM2 to
Mike> push down a data area whose start+size is aligned.  This gives us
Mike> the guarantee that each device in a given DM table is aligned.

*nod*


Mike> But blk_stack_limits() leads to a situation where the combined
Mike> limits (io_min, logical_block_size) are not ideal for all offsets
Mike> into the resulting DM device (e.g. issuing larger IOs to some
Mike> target devices than would otherwise be needed).

Yup.  But not smaller.  That's the whole point: making sure we align to
the lowest common denominator and never incur a read-modify-write cycle
on any storage device.
Alasdair G Kergon June 10, 2009, 11:08 p.m. UTC | #7
On Wed, Jun 10, 2009 at 06:06:36PM -0400, Martin K. Petersen wrote:
> The default limits should all be set by the block layer when setting up
> the request queue.  So my reason for inquiring was to figure out whether
> check_for_valid_limits() actually makes any difference?
 
I renamed that badly-named function earlier to:
  init_valid_queue_limits()

It should also really be shared with block/.

What's going on is that LVM prepares, in advance, the new tables it
wants in order to build up a device stack; then, if everything has
worked, it makes them all go live.

The validation has to happen during the first phase: backing out the
change to the device stack upon a failure is easier then, as we have
not yet reached the commit point of the transaction.  The operation
making the new stack live must not fail if at all possible, because it
comes within the commit logic, where a failure would make recovery much
trickier.

In dm terms, this means we have two tables - called 'live' and
'inactive'.  The first phase sets up inactive tables on all the stacked
devices involved in the transaction and that is when all the memory
needed is allocated and the validation occurs.  The second phase then
makes the inactive table live and discards the previously-live table.
The two tables are independent: the old queue limits on the dm device
are discarded and replaced by the newly-calculated ones.
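
In outline, the two phases look something like this (every name except
dm_table_destroy() is invented for the purpose of the sketch):

/* phase one: load -- allocate and validate; failure is cheap to undo */
static int table_load_sketch(struct mapped_device *md, struct dm_table *t)
{
	int r = validate_table(t);	/* hypothetical helper */

	if (r)
		return r;		/* live table is untouched on error */

	set_inactive_table(md, t);	/* hypothetical helper */
	return 0;
}

/* phase two: commit -- must not fail; just swap the table pointers */
static void table_commit_sketch(struct mapped_device *md)
{
	struct dm_table *old = get_live_table(md);	/* hypothetical */

	make_live(md, get_inactive_table(md));		/* hypothetical */
	dm_table_destroy(old);	/* old table and its limits are discarded */
}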

Currently those limits are calculated in phase one, but we should see
about delaying this limit combination code (which should always succeed)
until phase two (which gives userspace code more freedom in its use of
the interface).

Alasdair


Patch

Index: linux-2.6/drivers/md/dm-table.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-table.c
+++ linux-2.6/drivers/md/dm-table.c
@@ -66,7 +66,7 @@  struct dm_table {
 	 * These are optimistic limits taken from all the
 	 * targets, some targets will need smaller limits.
 	 */
-	struct io_restrictions limits;
+	struct queue_limits limits;
 
 	/* events get handed up using this callback */
 	void (*event_fn)(void *);
@@ -89,43 +89,6 @@  static unsigned int int_log(unsigned int
 }
 
 /*
- * Returns the minimum that is _not_ zero, unless both are zero.
- */
-#define min_not_zero(l, r) (l == 0) ? r : ((r == 0) ? l : min(l, r))
-
-/*
- * Combine two io_restrictions, always taking the lower value.
- */
-static void combine_restrictions_low(struct io_restrictions *lhs,
-				     struct io_restrictions *rhs)
-{
-	lhs->max_sectors =
-		min_not_zero(lhs->max_sectors, rhs->max_sectors);
-
-	lhs->max_phys_segments =
-		min_not_zero(lhs->max_phys_segments, rhs->max_phys_segments);
-
-	lhs->max_hw_segments =
-		min_not_zero(lhs->max_hw_segments, rhs->max_hw_segments);
-
-	lhs->logical_block_size = max(lhs->logical_block_size,
-				      rhs->logical_block_size);
-
-	lhs->max_segment_size =
-		min_not_zero(lhs->max_segment_size, rhs->max_segment_size);
-
-	lhs->max_hw_sectors =
-		min_not_zero(lhs->max_hw_sectors, rhs->max_hw_sectors);
-
-	lhs->seg_boundary_mask =
-		min_not_zero(lhs->seg_boundary_mask, rhs->seg_boundary_mask);
-
-	lhs->bounce_pfn = min_not_zero(lhs->bounce_pfn, rhs->bounce_pfn);
-
-	lhs->no_cluster |= rhs->no_cluster;
-}
-
-/*
  * Calculate the index of the child node of the n'th node k'th key.
  */
 static inline unsigned int get_child(unsigned int n, unsigned int k)
@@ -513,10 +476,14 @@  static int __table_get_device(struct dm_
 	return 0;
 }
 
+/*
+ * Returns the minimum that is _not_ zero, unless both are zero.
+ */
+#define min_not_zero(l, r) (l == 0) ? r : ((r == 0) ? l : min(l, r))
+
 void dm_set_device_limits(struct dm_target *ti, struct block_device *bdev)
 {
 	struct request_queue *q = bdev_get_queue(bdev);
-	struct io_restrictions *rs = &ti->limits;
 	char b[BDEVNAME_SIZE];
 
 	if (unlikely(!q)) {
@@ -525,15 +492,9 @@  void dm_set_device_limits(struct dm_targ
 		return;
 	}
 
-	/*
-	 * Combine the device limits low.
-	 *
-	 * FIXME: if we move an io_restriction struct
-	 *        into q this would just be a call to
-	 *        combine_restrictions_low()
-	 */
-	rs->max_sectors =
-		min_not_zero(rs->max_sectors, queue_max_sectors(q));
+	if (blk_stack_limits(&ti->limits, &q->limits, 0) < 0)
+		DMWARN("%s: target device %s is misaligned",
+		       dm_device_name(ti->table->md), bdevname(bdev, b));
 
 	/*
 	 * Check if merge fn is supported.
@@ -542,33 +503,9 @@  void dm_set_device_limits(struct dm_targ
 	 */
 
 	if (q->merge_bvec_fn && !ti->type->merge)
-		rs->max_sectors =
-			min_not_zero(rs->max_sectors,
+		ti->limits.max_sectors =
+			min_not_zero(ti->limits.max_sectors,
 				     (unsigned int) (PAGE_SIZE >> 9));
-
-	rs->max_phys_segments =
-		min_not_zero(rs->max_phys_segments,
-			     queue_max_phys_segments(q));
-
-	rs->max_hw_segments =
-		min_not_zero(rs->max_hw_segments, queue_max_hw_segments(q));
-
-	rs->logical_block_size = max(rs->logical_block_size,
-				     queue_logical_block_size(q));
-
-	rs->max_segment_size =
-		min_not_zero(rs->max_segment_size, queue_max_segment_size(q));
-
-	rs->max_hw_sectors =
-		min_not_zero(rs->max_hw_sectors, queue_max_hw_sectors(q));
-
-	rs->seg_boundary_mask =
-		min_not_zero(rs->seg_boundary_mask,
-			     queue_segment_boundary(q));
-
-	rs->bounce_pfn = min_not_zero(rs->bounce_pfn, queue_bounce_pfn(q));
-
-	rs->no_cluster |= !test_bit(QUEUE_FLAG_CLUSTER, &q->queue_flags);
 }
 EXPORT_SYMBOL_GPL(dm_set_device_limits);
 
@@ -706,24 +643,28 @@  int dm_split_args(int *argc, char ***arg
 	return 0;
 }
 
-static void check_for_valid_limits(struct io_restrictions *rs)
+static void check_for_valid_limits(struct queue_limits *limits)
 {
-	if (!rs->max_sectors)
-		rs->max_sectors = SAFE_MAX_SECTORS;
-	if (!rs->max_hw_sectors)
-		rs->max_hw_sectors = SAFE_MAX_SECTORS;
-	if (!rs->max_phys_segments)
-		rs->max_phys_segments = MAX_PHYS_SEGMENTS;
-	if (!rs->max_hw_segments)
-		rs->max_hw_segments = MAX_HW_SEGMENTS;
-	if (!rs->logical_block_size)
-		rs->logical_block_size = 1 << SECTOR_SHIFT;
-	if (!rs->max_segment_size)
-		rs->max_segment_size = MAX_SEGMENT_SIZE;
-	if (!rs->seg_boundary_mask)
-		rs->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
-	if (!rs->bounce_pfn)
-		rs->bounce_pfn = -1;
+	if (!limits->max_sectors)
+		limits->max_sectors = SAFE_MAX_SECTORS;
+	if (!limits->max_hw_sectors)
+		limits->max_hw_sectors = SAFE_MAX_SECTORS;
+	if (!limits->max_phys_segments)
+		limits->max_phys_segments = MAX_PHYS_SEGMENTS;
+	if (!limits->max_hw_segments)
+		limits->max_hw_segments = MAX_HW_SEGMENTS;
+	if (!limits->logical_block_size)
+		limits->logical_block_size = 1 << SECTOR_SHIFT;
+	if (!limits->physical_block_size)
+		limits->physical_block_size = 1 << SECTOR_SHIFT;
+	if (!limits->io_min)
+		limits->io_min = 1 << SECTOR_SHIFT;
+	if (!limits->max_segment_size)
+		limits->max_segment_size = MAX_SEGMENT_SIZE;
+	if (!limits->seg_boundary_mask)
+		limits->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
+	if (!limits->bounce_pfn)
+		limits->bounce_pfn = -1;
 }
 
 /*
@@ -843,9 +784,12 @@  int dm_table_add_target(struct dm_table 
 
 	t->highs[t->num_targets++] = tgt->begin + tgt->len - 1;
 
-	/* FIXME: the plan is to combine high here and then have
-	 * the merge fn apply the target level restrictions. */
-	combine_restrictions_low(&t->limits, &tgt->limits);
+	if (blk_stack_limits(&t->limits, &tgt->limits, 0) < 0)
+		DMWARN("%s: target device (start sect %llu len %llu) "
+		       "is misaligned",
+		       dm_device_name(t->md),
+		       (unsigned long long) tgt->begin,
+		       (unsigned long long) tgt->len);
 	return 0;
 
  bad:
@@ -1010,17 +954,10 @@  no_integrity:
 void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q)
 {
 	/*
-	 * Make sure we obey the optimistic sub devices
-	 * restrictions.
+	 * blk_stack_limits() returns an error if the table is misaligned; but
+	 * dm_table_add_target() already checked the table's alignment
 	 */
-	blk_queue_max_sectors(q, t->limits.max_sectors);
-	blk_queue_max_phys_segments(q, t->limits.max_phys_segments);
-	blk_queue_max_hw_segments(q, t->limits.max_hw_segments);
-	blk_queue_logical_block_size(q, t->limits.logical_block_size);
-	blk_queue_max_segment_size(q, t->limits.max_segment_size);
-	blk_queue_max_hw_sectors(q, t->limits.max_hw_sectors);
-	blk_queue_segment_boundary(q, t->limits.seg_boundary_mask);
-	blk_queue_bounce_limit(q, t->limits.bounce_pfn);
+	(void)blk_stack_limits(&q->limits, &t->limits, 0);
 
 	if (t->limits.no_cluster)
 		queue_flag_clear_unlocked(QUEUE_FLAG_CLUSTER, q);
Index: linux-2.6/include/linux/device-mapper.h
===================================================================
--- linux-2.6.orig/include/linux/device-mapper.h
+++ linux-2.6/include/linux/device-mapper.h
@@ -143,18 +143,6 @@  struct target_type {
 	struct list_head list;
 };
 
-struct io_restrictions {
-	unsigned long bounce_pfn;
-	unsigned long seg_boundary_mask;
-	unsigned max_hw_sectors;
-	unsigned max_sectors;
-	unsigned max_segment_size;
-	unsigned short logical_block_size;
-	unsigned short max_hw_segments;
-	unsigned short max_phys_segments;
-	unsigned char no_cluster; /* inverted so that 0 is default */
-};
-
 struct dm_target {
 	struct dm_table *table;
 	struct target_type *type;
@@ -163,7 +151,7 @@  struct dm_target {
 	sector_t begin;
 	sector_t len;
 
-	/* FIXME: turn this into a mask, and merge with io_restrictions */
+	/* FIXME: turn this into a mask, and merge with queue_limits */
 	/* Always a power of 2 */
 	sector_t split_io;
 
@@ -171,7 +159,7 @@  struct dm_target {
 	 * These are automatically filled in by
 	 * dm_table_get_device.
 	 */
-	struct io_restrictions limits;
+	struct queue_limits limits;
 
 	/* target specific data */
 	void *private;