
[1/3] xfs: Add rtdefault mount option

Message ID 25856B28-A65C-4C5B-890D-159F8822393D@fb.com (mailing list archive)
State New, archived
Headers show

Commit Message

Richard Wareing Sept. 1, 2017, 1 a.m. UTC
Hello all, 

It turns out XFS real-time volumes are actually a very useful/cool feature, and I am wondering if there is support in the community to make this feature a bit more user friendly and easier to operate and interact with. To kick things off, I bring patches to the table :).

For those who aren't familiar with real-time XFS volumes, they are basically a method of storing the data blocks of some files on a separate device. In our specific application, we are using real-time devices to store large files (>256KB) on HDDs, while all metadata & journal updates go to an SSD of suitable endurance & capacity. We also see use-cases for this in distributed storage systems such as GlusterFS, which are heavy in metadata operations (80%+ of IOPs). By using real-time devices to tier your XFS filesystem storage, you can dramatically reduce HDD IOPs (50% in our case) and dramatically improve metadata and small file latency (HDD->SSD like reductions).

Here are the features in the proposed patch set:

1. rtdefault - Default block allocations to the real-time device via a mount flag, rtdefault, instead of using an inheritance flag or ioctls. This option gives users tiering of their metadata out of the box with ease, and in a manner more users are familiar with (mount flags), versus having to set inheritance bits or use ioctls (many distributed storage developers are resistant to including FS-specific code in their stacks).

2. rtstatfs - Return the real-time block device's free space instead of the non-realtime device's via the "rtstatfs" flag. This creates an experience/semantics which is a bit more familiar to users if they use real-time in a tiering configuration: "df" reports the space on your HDDs, and the metadata space can be returned by a tool like xfs_info (I have patches for this too if there is interest) or xfs_io. I think this might be a bit more intuitive for the masses than the reverse (having to go to xfs_io for the HDD space, and df for the SSD metadata).

3. rtfallocmin - This option can be combined with rtdefault or used standalone. When combined with rtdefault, it uses fallocate as a "signal" to *exempt* files from storage on the real-time device, automatically promoting small fallocations to the SSD while directing larger ones (or fallocation-less creations) to the HDD. This option also works really well with tools like "rsync" which support fallocate (the --preallocate flag), so users can easily promote/demote files to/from the SSD (see the usage sketch just below).
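
As a usage sketch (device names are hypothetical, and only rtdefault is implemented in patch 1/3 below; the rtstatfs and rtfallocmin spellings here are assumptions for illustration):

    # mkfs.xfs -r rtdev=/dev/hdd /dev/ssd
    # mount -o rtdev=/dev/hdd,rtdefault,rtstatfs,rtfallocmin=262144 /dev/ssd /mnt
    # fallocate -l 128k /mnt/small      # below rtfallocmin, so the data stays on the SSD
    # rsync --preallocate src/ /mnt/    # files are placed according to their fallocated size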

Ideally, I'd like to help build out more tiering features in XFS if there is interest in the community, but figured I'd start with these patches first.  Other ideas/improvements: automatic eviction from the SSD once a file grows beyond rtfallocmin, automatic fallback to the real-time device if the non-RT device (SSD) is out of blocks, and support for the more sophisticated AG-based block allocator on RT (the bitmapped version works well for us, but multi-threaded use-cases might not do as well).

Looking forward to getting feedback!

Richard Wareing

Note: The patches should apply cleanly against the XFS kernel master branch @ https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git (SHA: 6f7da290413ba713f0cdd9ff1a2a9bb129ef4f6c).

Comments

Darrick J. Wong Sept. 1, 2017, 4:26 a.m. UTC | #1
On Thu, Aug 31, 2017 at 06:00:21PM -0700, Richard Wareing wrote:
> Hello all, 
> 
> It turns out, XFS real-time volumes are actually a very useful/cool
> feature, I am wondering if there is support in the community to make
> this feature a bit more user friendly, easier to operate and interact
> with. To kick things off I bring patches table :).
> 
> For those who aren't familiar with real-time XFS volumes, they are
> basically a method of storing the data blocks of some files on a
> separate device. In our specific application, are using real-time
> devices to store large files (>256KB) on HDDS, while all metadata &
> journal updates goto an SSD of suitable endurance & capacity. We also
> see use-cases for this for distributed storage systems such as
> GlusterFS which are heavy in metadata operations (80%+ of IOPs). By
> using real-time devices to tier your XFS filesystem storage, you can
> dramatically reduce HDD IOPs (50% in our case) and dramatically
> improve metadata and small file latency (HDD->SSD like reductions).
> 
> Here are the features in the proposed patch set:
> 
> 1. rtdefault  - Defaulting block allocations to the real-time device
> via a mount flag rtdefault, vs using an inheritance flag or ioctl's.
> This options gives users tier'ing of their metadata out of the box
> with ease, and in a manner more users are familiar with (mount flags),
> vs having to set inheritance bits or use ioctls (many distributed
> storage developers are resistant to including FS specific code into
> their stacks).

The ioctl to set RTINHERIT/REALTIME is a VFS level ioctl now.  I can
think of a couple problems with the mount option -- first, more mount
options to test (or not ;)); what happens if you actually want your file
to end up on the data device; and won't this surprise all the existing
programs that are accustomed to the traditional way of handling rt
devices?
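
(For reference, a minimal sketch of using that VFS interface, FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR from linux/fs.h, with most error handling trimmed:)

	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>

	/* Set the rt-inherit bit on a directory so new files allocate from the rt device. */
	int set_rtinherit(const char *dir)
	{
		struct fsxattr fsx;
		int fd = open(dir, O_RDONLY);

		if (fd < 0)
			return -1;
		if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) == 0) {
			fsx.fsx_xflags |= FS_XFLAG_RTINHERIT;
			ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
		}
		close(fd);
		return 0;
	}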

I mean, you /could/ just mkfs.xfs -d rtinherit=1 and that would get you
mostly the same results, right?

(Yeah, I know, undocumented mkfs option... <grumble>)

> 2. rtstatfs  - Returning real-time block device free space instead of
> the non-realtime device via the "rtstatfs" flag. This creates an
> experience/semantics which is a bit more familiar to users if they use
> real-time in a tiering configuration. "df" reports the space on your
> HDDs, and the metadata space can be returned by a tool like xfs_info
> (I have patches for this too if there is interest) or xfs_io. I think
> this might be a bit more intuitive for the masses than the reverse
> (having to goto xfs_io for the HDD space, and df for the SSD
> metadata).

I was a little surprised we don't just add up the data+rt space counters
for statfs; how /does/ one figure out how much space is free on the rt
device?

(Will research this tomorrow if nobody pipes up in the mean time.)
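
(One way to read it from userspace today, as a sketch: free rt space should be frextents * rextsize * blocksize, which can be pulled from the superblock; the device name below is hypothetical.)

	# xfs_db -r -c 'sb 0' -c 'print frextents rextsize blocksize' /dev/datadev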

> 3. rtfallocmin - This option can be combined with either rtdefault or
> standalone. When combined with rtdefault, it uses fallocate as
> "signal" to *exempt* storage on the real-time device, automatically
> promoting small fallocations to the SSD, while directing larger ones
> (or fallocation-less creations) to the HDD. This option also works
> really well with tools like "rsync" which support fallocate
> (--preallocate flag) so users can easily promote/demote files to/from
> the SSD.

I see where you're coming from, but I don't think it's a good idea to
overload the existing fallocate interface to have it decide device
placement too.  The side effects of the existing mode flags are well
known and it's hard to get everyone on board with a semantic change to
an existing mode.

> Ideally, I'd like to help build-out more tiering features into XFS if
> there is interest in the community, but figured I'd start with these
> patches first.  Other ideas/improvements: automatic eviction from SSD
> once file grows beyond rtfallocmin, automatic fall-back to real-time
> device if non-RT device (SSD) is out of blocks, add support for the
> more sophisticated AG based block allocator to RT (bitmapped version
> works well for us, but multi-threaded use-cases might not do as well).
> 
> Looking forward to getting feedback!
> 
> Richard Wareing
> 
> Note: The patches should patch clean against the XFS Kernel master
> branch @ https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git (SHA:
> 6f7da290413ba713f0cdd9ff1a2a9bb129ef4f6c).

Needs a Signed-off-by...

--D

> 
> =======
> 
> - Adds rtdefault mount option to default writes to real-time device.
> This removes the need for ioctl calls or inheritance bits to get files
> to flow to real-time device.
> - Enables XFS to store FS metadata on non-RT device (e.g. SSD) while
> storing data blocks on real-time device.  Negates any code changes by
> application, install kernel, format, mount and profit.
> ---
> fs/xfs/xfs_inode.c |  8 ++++++++
> fs/xfs/xfs_mount.h |  5 +++++
> fs/xfs/xfs_super.c | 13 ++++++++++++-
> 3 files changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index ec9826c..1611195 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -873,6 +873,14 @@ xfs_ialloc(
> 		break;
> 	case S_IFREG:
> 	case S_IFDIR:
> +		/* Set flags if we are defaulting to real-time device */
> +		if (mp->m_rtdev_targp != NULL &&
> +		   mp->m_flags & XFS_MOUNT_RTDEFAULT) {
> +			if (S_ISDIR(mode))
> +				ip->i_d.di_flags |= XFS_DIFLAG_RTINHERIT;
> +			else if (S_ISREG(mode))
> +				ip->i_d.di_flags |= XFS_DIFLAG_REALTIME;
> +		}
> 		if (pip && (pip->i_d.di_flags & XFS_DIFLAG_ANY)) {
> 			uint64_t	di_flags2 = 0;
> 			uint		di_flags = 0;
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 9fa312a..da25398 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -243,6 +243,11 @@ typedef struct xfs_mount {
> 						   allocator */
> #define XFS_MOUNT_NOATTR2	(1ULL << 25)	/* disable use of attr2 format */
> 
> +/* FB Real-time device options */
> +#define XFS_MOUNT_RTDEFAULT	(1ULL << 61)	/* Always allocate blocks from
> +						 * RT device
> +						 */
> +
> #define XFS_MOUNT_DAX		(1ULL << 62)	/* TEST ONLY! */
> 
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 455a575..e4f85a9 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -83,7 +83,7 @@ enum {
> 	Opt_quota, Opt_noquota, Opt_usrquota, Opt_grpquota, Opt_prjquota,
> 	Opt_uquota, Opt_gquota, Opt_pquota,
> 	Opt_uqnoenforce, Opt_gqnoenforce, Opt_pqnoenforce, Opt_qnoenforce,
> -	Opt_discard, Opt_nodiscard, Opt_dax, Opt_err,
> +	Opt_discard, Opt_nodiscard, Opt_dax, Opt_rtdefault, Opt_err,
> };
> 
> static const match_table_t tokens = {
> @@ -133,6 +133,9 @@ static const match_table_t tokens = {
> 
> 	{Opt_dax,	"dax"},		/* Enable direct access to bdev pages */
> 
> +#ifdef CONFIG_XFS_RT
> +	{Opt_rtdefault,	"rtdefault"},	/* Default to real-time device */
> +#endif
> 	/* Deprecated mount options scheduled for removal */
> 	{Opt_barrier,	"barrier"},	/* use writer barriers for log write and
> 					 * unwritten extent conversion */
> @@ -367,6 +370,11 @@ xfs_parseargs(
> 		case Opt_nodiscard:
> 			mp->m_flags &= ~XFS_MOUNT_DISCARD;
> 			break;
> +#ifdef CONFIG_XFS_RT
> +		case Opt_rtdefault:
> +			mp->m_flags |= XFS_MOUNT_RTDEFAULT;
> +			break;
> +#endif
> #ifdef CONFIG_FS_DAX
> 		case Opt_dax:
> 			mp->m_flags |= XFS_MOUNT_DAX;
> @@ -492,6 +500,9 @@ xfs_showargs(
> 		{ XFS_MOUNT_DISCARD,		",discard" },
> 		{ XFS_MOUNT_SMALL_INUMS,	",inode32" },
> 		{ XFS_MOUNT_DAX,		",dax" },
> +#ifdef CONFIG_XFS_RT
> +		{ XFS_MOUNT_RTDEFAULT,          ",rtdefault" },
> +#endif
> 		{ 0, NULL }
> 	};
> 	static struct proc_xfs_info xfs_info_unset[] = {
> -- 
> 2.9.3
Dave Chinner Sept. 1, 2017, 4:31 a.m. UTC | #2
Hi Richard,

On Thu, Aug 31, 2017 at 06:00:21PM -0700, Richard Wareing wrote:
> Hello all, 
> 
> It turns out, XFS real-time volumes are actually a very
> useful/cool feature, I am wondering if there is support in the
> community to make this feature a bit more user friendly, easier to
> operate and interact with. To kick things off I bring patches
> table :).
> 
> For those who aren't familiar with real-time XFS volumes, they are
> basically a method of storing the data blocks of some files on a
> separate device. In our specific application, are using real-time
> devices to store large files (>256KB) on HDDS, while all metadata
> & journal updates goto an SSD of suitable endurance & capacity.

Well that's interesting. How widely deployed is this? We don't do a
whole lot of upstream testing on the rt device, so I'm curious to
know about what problems you've had to fix to get it to work
reliably....

> We
> also see use-cases for this for distributed storage systems such
> as GlusterFS which are heavy in metadata operations (80%+ of
> IOPs). By using real-time devices to tier your XFS filesystem
> storage, you can dramatically reduce HDD IOPs (50% in our case)
> and dramatically improve metadata and small file latency (HDD->SSD
> like reductions).

IMO, this isn't really what I'd call "tiered" storage - it's just a
data/metadata separation. Same as putting your log on an external
device to separate the journal IO from user IO isn't tiering.... :)

FWIW, using an external log for fsync heavy workloads reduces data
device IOPS by roughly 50%, and that seems to match what you are
saying occurs in your workloads by moving data IO to a separate
device from the log.  So now I'm wondering - is the reduction in
IOPS on your HDDs reflecting the impact of separating journal
commits from data writes? If not, where is the 50% of the IOPS that
aren't data and aren't journal going in your workloads?

> Here are the features in the proposed patch set:
> 
> 1. rtdefault  - Defaulting block allocations to the real-time
> device via a mount flag rtdefault, vs using an inheritance flag or
> ioctl's. This options gives users tier'ing of their metadata out
> of the box with ease,

As you've stated, we already have per-inode flags for this, but from
what you've said I'm not sure you realise that we don't need a mount
option for "out of the box" data-on-rtdev support. i.e.  mkfs
already provides "out of the box" data-on-rtdev support:

# mkfs.xfs -r rtdev=/dev/rt -d rtinherit=1 /dev/data

> and in a manner more users are familiar with
> (mount flags), vs having to set inheritance bits or use ioctls
> (many distributed storage developers are resistant to including FS
> specific code into their stacks).

Even with a rtdefault mount option, admins would still have to use
'chattr -R -r' to turn off use of the rt device by default because
removing the mount option doesn't get rid of the on-disk inode flags
that control this behaviour.

Maybe I'm missing something, but I don't see what this mount option
makes simpler or easier for users....

> 2. rtstatfs  - Returning real-time block device free space instead
> of the non-realtime device via the "rtstatfs" flag. This creates
> an experience/semantics which is a bit more familiar to users if
> they use real-time in a tiering configuration. "df" reports the
> space on your HDDs, and the metadata space can be returned by a
> tool like xfs_info (I have patches for this too if there is
> interest) or xfs_io. I think this might be a bit more intuitive
> for the masses than the reverse (having to goto xfs_io for the HDD
> space, and df for the SSD metadata).

Yep, useful idea. We already have a mechanism for reporting
different information to statfs depending on what is passed to it.
We use that to report directory tree quota information instead of
filesystem wide information. See the project id inode flag hooks at
the end of xfs_fs_statfs().

Similar could be done here - if statfs is pointed at a RT
file/directory, report rt device usage. If it's pointed at the
root directory, report data device information.
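
(A minimal sketch of that shape, assuming the usual locals at the end of xfs_fs_statfs(), sbp = &mp->m_sb and ip for the target inode; this is not the patch under review:)

	if (XFS_IS_REALTIME_INODE(ip)) {
		statp->f_blocks = sbp->sb_rblocks;
		statp->f_bavail = statp->f_bfree =
			sbp->sb_frextents * sbp->sb_rextsize;
	}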

> 3. rtfallocmin - This option can be combined with either rtdefault
> or standalone. When combined with rtdefault, it uses fallocate as
> "signal" to *exempt* storage on the real-time device,
> automatically promoting small fallocations to the SSD, while
> directing larger ones (or fallocation-less creations) to the HDD.

Hmmmm. Abusing fallocate to control allocation policy is kinda
nasty. I'd much prefer we work towards a usable allocation policy
framework rather than encode one-off hacks like this into the
filesystem behaviour.

> This option also works really well with tools like "rsync" which
> support fallocate (--preallocate flag) so users can easily
> promote/demote files to/from the SSD.

That's a neat hack, but it's not a viable administration policy
interface :/

> Ideally, I'd like to help build-out more tiering features into XFS
> if there is interest in the community, but figured I'd start with
> these patches first.  Other ideas/improvements: automatic eviction
> from SSD once file grows beyond rtfallocmin,

You could probably already do that right now with fanotify +
userspace-based atomic data mover (i.e. xfs_fsr).

Keep in mind any time you say "move data around when ..." I'll
probably reply "you can use xfs_fsr for that". "fsr" = "file system
reorganiser" and it's sole purpose in life is to transparently move
data around the filesystem....

> automatic fall-back
> to real-time device if non-RT device (SSD) is out of blocks,

If you run the data device out of blocks, you can't allocate blocks
for the new metadata that has to be allocated to track the data held
in the RT device.  i.e.  running the data device out of space is
a filesystem wide ENOSPC condition even if there's still space
in the rt device for the data.

> add
> support for the more sophisticated AG based block allocator to RT
> (bitmapped version works well for us, but multi-threaded use-cases
> might not do as well).

That's a great big can of worms - not sure we want to open it. The
simplicity of the rt allocator is one of its major benefits to
workloads that require deterministic allocation behaviour...

Cheers,

Dave.
Richard Wareing Sept. 1, 2017, 6:39 p.m. UTC | #3
Thanks for the quick feedback Dave!  My comments are in-line below.


> On Aug 31, 2017, at 9:31 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> Hi Richard,
> 
> On Thu, Aug 31, 2017 at 06:00:21PM -0700, Richard Wareing wrote:
>> Hello all, 
>> 
>> It turns out, XFS real-time volumes are actually a very
>> useful/cool feature, I am wondering if there is support in the
>> community to make this feature a bit more user friendly, easier to
>> operate and interact with. To kick things off I bring patches
>> table :).
>> 
>> For those who aren't familiar with real-time XFS volumes, they are
>> basically a method of storing the data blocks of some files on a
>> separate device. In our specific application, are using real-time
>> devices to store large files (>256KB) on HDDS, while all metadata
>> & journal updates goto an SSD of suitable endurance & capacity.
> 
> Well that's interesting. How widely deployed is this? We don't do a
> whole lot of upstream testing on the rt device, so I'm curious to
> know about what problems you've had to fix to get it to wrok
> reliably....

So far we have a modest deployment of 13 machines; we are going to step this up to 30 pretty soon.  As for problems, only one really: I originally started my experiments on kernel 4.0.9 and ran into a kernel panic (I can't recall the exact bug), but after consulting the mailing list there was a patch in a later kernel version (4.6) which resolved the problem.  Since then I've moved my work to 4.11.

Our use-case is pretty straightforward: we open -> fallocate -> write and never again write to the files.  From there it's just reads and unlinks.  Multi-threaded IO is kept to a minimum (typically a single thread, but perhaps 3-4 under high load), which probably avoids tempting fate on hard-to-track-down race bugs.
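
(In code, the pattern is roughly the sketch below; the fallocate() size is the hint rtfallocmin keys off.  Names are illustrative only.)

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>

	/* open -> fallocate -> write once; afterwards the file is only read or unlinked */
	static int write_once(const char *path, const char *buf, size_t len)
	{
		int fd = open(path, O_CREAT | O_WRONLY | O_EXCL, 0644);

		if (fd < 0)
			return -1;
		if (fallocate(fd, 0, 0, len) == 0)	/* declare the final size up front */
			(void)pwrite(fd, buf, len, 0);
		close(fd);
		return 0;
	}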


> 
>> We
>> also see use-cases for this for distributed storage systems such
>> as GlusterFS which are heavy in metadata operations (80%+ of
>> IOPs). By using real-time devices to tier your XFS filesystem
>> storage, you can dramatically reduce HDD IOPs (50% in our case)
>> and dramatically improve metadata and small file latency (HDD->SSD
>> like reductions).
> 
> IMO, this isn't really what I'd call "tiered" storage - it's just a
> data/metadata separation. Same as putting your log on an external
> device to separate the journal IO from user IO isn't tiering.... :)

The tiering comes into play with rtfallocmin: because we always fallocate to "declare" our intention to write data, XFS can automatically direct the data to the correct tier of storage (SSD for small files <= 256k & HDD for larger ones); in our distributed storage system this has the ultimate effect of making the IO path for files <= ~2MB SSD backed.  Storage systems such as GlusterFS can leverage this as well (and that's where I plan to focus my attention once I'm done with my present use-case).

> 
> FWIW, using an external log for fsync heavy workloads reduces data
> device IOPS by roughly 50%, and that seems to match what you are
> saying occurs in your workloads by moving data IO to a separate
> device from the log.  So now I'm wondering - is the reduction in
> IOPS on your HDDs reflecting the impact of separating journal
> commits from data writes? If not, where is the 50% of the IOPS that
> aren't data and aren't journal going in your workloads?
> 


>> Here are the features in the proposed patch set:
>> 
>> 1. rtdefault  - Defaulting block allocations to the real-time
>> device via a mount flag rtdefault, vs using an inheritance flag or
>> ioctl's. This options gives users tier'ing of their metadata out
>> of the box with ease,
> 
> As you've stated, we already have per-inode flags for this, but from
> what you've said I'm not sure you realise that we don't need a mount
> option for "out of the box" data-on-rtdev support. i.e.  mkfs
> already provides "out of the box" data-on-rtdev support:
> 
> # mkfs.xfs -r rtdev=/dev/rt -d rtinherit=1 /dev/data
> 
>> and in a manner more users are familiar with
>> (mount flags), vs having to set inheritance bits or use ioctls
>> (many distributed storage developers are resistant to including FS
>> specific code into their stacks).
> 
> Even with a rtdefault mount option, admins would still have to use
> 'chattr -R -r' to turn off use of the rt device by default because
> removing the mount option doesn't get rid of the on-disk inode flags
> that control this behaviour.
> 
> Maybe I'm missing something, but I don't see what this mount option
> makes simpler or easier for users....

You are correct, I wasn't aware of the "rtinherit" mkfs-time option :).  However, it functions much the same as setting the inheritance bit on the directory manually, which is subtly different (and less intuitive, as I hope to convince you).  Inheritance bits are problematic for a couple of reasons.  First, it's not super obvious (via common mechanisms such as xfs_info or /proc/mounts) to the admin that this is in place.  Imagine you are taking over an existing setup; it might take you many moons to discover the mechanism by which files are defaulting to the real-time device.

Second, you bring up a really good point that the rtdefault flag would still require users to strip the inheritance bits from the directories, but this actually points out the second problem with inheritance bits: you have them all over your FS, and stripping them requires a full FS walk as a user (see the sketch below).  I think the change I can make here is to simply not set the inheritance bits on directories, since rtdefault takes over that function; that way, when you remove the mount flag, the behavior is intuitive: files no longer default to the RT device.  Users then get the added benefit of not having to walk the entire FS to strip the inheritance bits from the directories, plus a more intuitive behavior.
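
(For illustration, that walk would be something like the line below, assuming xfs_io's chattr letter 't' maps to the rtinherit bit:)

	# find /mnt -type d -exec xfs_io -c 'chattr -t' {} \;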


> 
>> 2. rtstatfs  - Returning real-time block device free space instead
>> of the non-realtime device via the "rtstatfs" flag. This creates
>> an experience/semantics which is a bit more familiar to users if
>> they use real-time in a tiering configuration. "df" reports the
>> space on your HDDs, and the metadata space can be returned by a
>> tool like xfs_info (I have patches for this too if there is
>> interest) or xfs_io. I think this might be a bit more intuitive
>> for the masses than the reverse (having to goto xfs_io for the HDD
>> space, and df for the SSD metadata).
> 
> Yep, useful idea. We already have a mechanism for reporting
> different information to statfs depending on what is passed to it.
> We use that to report directory tree quota information instead of
> filesystem wide information. See the project id inode flag hooks at
> the end of xfs_fs_statfs().
> 
> Similar could be done here - if statfs is pointed at a RT
> file/directory, report rt device usage. If it's pointed at a the
> root directory, report data device information.
> 

I'll re-work the patch to fix this.

>> 3. rtfallocmin - This option can be combined with either rtdefault
>> or standalone. When combined with rtdefault, it uses fallocate as
>> "signal" to *exempt* storage on the real-time device,
>> automatically promoting small fallocations to the SSD, while
>> directing larger ones (or fallocation-less creations) to the HDD.
> 
> Hmmmm. Abusing fallocate to control allocation policy is kinda
> nasty. I'd much prefer we work towards a usable allocation policy
> framework rather than encode one-off hacks like this into the
> filesystem behaviour.
> 

I'm completely open to suggestions here, though it's been amazingly useful to have fallocate as the signal, as there's a pile of userland tools which use fallocate prior to writing data.  You can use these without any modification to do all sorts of operational tasks (e.g. file promotion to the non-RT device, restoration of backups with/without sending small files to the SSD, shell scripts which use the "fallocate" utility, etc.).  This really puts the control in the hands of the administrator, who can then use their imagination to come up with all sorts of utilities & scripts which make their life easier.  Contrast this with having to patch xfs_fsr or another xfs tool, which would be daunting for most admins.


>> This option also works really well with tools like "rsync" which
>> support fallocate (--preallocate flag) so users can easily
>> promote/demote files to/from the SSD.
> 
> That's a neat hack, but it's not a viable adminisitration policy
> interface :/
> 
>> Ideally, I'd like to help build-out more tiering features into XFS
>> if there is interest in the community, but figured I'd start with
>> these patches first.  Other ideas/improvements: automatic eviction
>> from SSD once file grows beyond rtfallocmin,
> 
> You could probably already do that right now with fanotify +
> userspace-based atomic data mover (i.e. xfs_fsr).
> 
> Keep in mind any time you say "move data around when ..." I'll
> probably reply "you can use xfs_fsr for that". "fsr" = "file system
> reorganiser" and it's sole purpose in life is to transparently move
> data around the filesystem....

Agreed, but it's not really tenable to add piles of use-cases to xfs_fsr versus leveraging the existing utilities out there.  I really want to unlock the potential for admins to dream up new tools or leverage existing utilities for their operational needs.

> 
>> automatic fall-back
>> to real-time device if non-RT device (SSD) is out of blocks,
> 
> If you run the data device out of blocks, you can't allocate blocks
> for the new metadata that has to be allocated to track the data held
> in the RT device.  i.e.  running the data device out of space is
> a filesystem wide ENOSPC condition even if there's still space
> in the rt device for the data.
> 

Wrt metadata, my plan here was to reserve (or does inode reservation handle this?) some percentage of non-RT blocks for metadata.  This way data would overflow reliably.  I'm still tweaking this patch, so nothing to show yet.

>> add
>> support for the more sophisticated AG based block allocator to RT
>> (bitmapped version works well for us, but multi-threaded use-cases
>> might not do as well).
> 
> That's a great big can of worms - not sure we want to open it. The
> simplicity of the rt allocator is one of it's major benefits to
> workloads that require deterministic allocation behaviour...

Agreed, I took a quick look at what it might take and came to a similar conclusion, but I can dream :).

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Richard Wareing Sept. 1, 2017, 6:53 p.m. UTC | #4
Thanks for the feedback Darrick, comments in-line.


> On Aug 31, 2017, at 9:26 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> 
> On Thu, Aug 31, 2017 at 06:00:21PM -0700, Richard Wareing wrote:
>> Hello all, 
>> 
>> It turns out, XFS real-time volumes are actually a very useful/cool
>> feature, I am wondering if there is support in the community to make
>> this feature a bit more user friendly, easier to operate and interact
>> with. To kick things off I bring patches table :).
>> 
>> For those who aren't familiar with real-time XFS volumes, they are
>> basically a method of storing the data blocks of some files on a
>> separate device. In our specific application, are using real-time
>> devices to store large files (>256KB) on HDDS, while all metadata &
>> journal updates goto an SSD of suitable endurance & capacity. We also
>> see use-cases for this for distributed storage systems such as
>> GlusterFS which are heavy in metadata operations (80%+ of IOPs). By
>> using real-time devices to tier your XFS filesystem storage, you can
>> dramatically reduce HDD IOPs (50% in our case) and dramatically
>> improve metadata and small file latency (HDD->SSD like reductions).
>> 
>> Here are the features in the proposed patch set:
>> 
>> 1. rtdefault  - Defaulting block allocations to the real-time device
>> via a mount flag rtdefault, vs using an inheritance flag or ioctl's.
>> This options gives users tier'ing of their metadata out of the box
>> with ease, and in a manner more users are familiar with (mount flags),
>> vs having to set inheritance bits or use ioctls (many distributed
>> storage developers are resistant to including FS specific code into
>> their stacks).
> 
> The ioctl to set RTINHERIT/REALTIME is a VFS level ioctl now.  I can
> think of a couple problems with the mount option -- first, more mount
> options to test (or not ;)); what happens if you actually want your file
> to end up on the data device; and won't this surprise all the existing
> programs that are accustomed to the traditional way of handling rt
> devices?
> 
> I mean, you /could/ just mkfs.xfs -d rtinherit=1 and that would get you
> mostly the same results, right?
> 
> (Yeah, I know, undocumented mkfs option... <grumble>)

Check out my reply to Dave on this :).

> 
>> 2. rtstatfs  - Returning real-time block device free space instead of
>> the non-realtime device via the "rtstatfs" flag. This creates an
>> experience/semantics which is a bit more familiar to users if they use
>> real-time in a tiering configuration. "df" reports the space on your
>> HDDs, and the metadata space can be returned by a tool like xfs_info
>> (I have patches for this too if there is interest) or xfs_io. I think
>> this might be a bit more intuitive for the masses than the reverse
>> (having to goto xfs_io for the HDD space, and df for the SSD
>> metadata).
> 
> I was a little surprised we don't just add up the data+rt space counters
> for statfs; how /does/ one figure out how much space is free on the rt
> device?
> 
> (Will research this tomorrow if nobody pipes up in the mean time.)
> 

I was as well!

>> 3. rtfallocmin - This option can be combined with either rtdefault or
>> standalone. When combined with rtdefault, it uses fallocate as
>> "signal" to *exempt* storage on the real-time device, automatically
>> promoting small fallocations to the SSD, while directing larger ones
>> (or fallocation-less creations) to the HDD. This option also works
>> really well with tools like "rsync" which support fallocate
>> (--preallocate flag) so users can easily promote/demote files to/from
>> the SSD.
> 
> I see where you're coming from, but I don't think it's a good idea to
> overload the existing fallocate interface to have it decide device
> placement too.  The side effects of the existing mode flags are well
> known and it's hard to get everyone on board with a semantic change to
> an existing mode.

I'm not proposing we remove the flags, just adding this option to give admins a slightly easier way to interact with them and leverage existing utilities.


> 
>> Ideally, I'd like to help build-out more tiering features into XFS if
>> there is interest in the community, but figured I'd start with these
>> patches first.  Other ideas/improvements: automatic eviction from SSD
>> once file grows beyond rtfallocmin, automatic fall-back to real-time
>> device if non-RT device (SSD) is out of blocks, add support for the
>> more sophisticated AG based block allocator to RT (bitmapped version
>> works well for us, but multi-threaded use-cases might not do as well).
>> 
>> Looking forward to getting feedback!
>> 
>> Richard Wareing
>> 
>> Note: The patches should patch clean against the XFS Kernel master
>> branch @ https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git (SHA:
>> 6f7da290413ba713f0cdd9ff1a2a9bb129ef4f6c).
> 
> Needs a Signed-off-by...
> 
> --D
> 
>> 
>> =======
>> 
>> - Adds rtdefault mount option to default writes to real-time device.
>> This removes the need for ioctl calls or inheritance bits to get files
>> to flow to real-time device.
>> - Enables XFS to store FS metadata on non-RT device (e.g. SSD) while
>> storing data blocks on real-time device.  Negates any code changes by
>> application, install kernel, format, mount and profit.
>> ---
>> fs/xfs/xfs_inode.c |  8 ++++++++
>> fs/xfs/xfs_mount.h |  5 +++++
>> fs/xfs/xfs_super.c | 13 ++++++++++++-
>> 3 files changed, 25 insertions(+), 1 deletion(-)
>> 
>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> index ec9826c..1611195 100644
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -873,6 +873,14 @@ xfs_ialloc(
>> 		break;
>> 	case S_IFREG:
>> 	case S_IFDIR:
>> +		/* Set flags if we are defaulting to real-time device */
>> +		if (mp->m_rtdev_targp != NULL &&
>> +		   mp->m_flags & XFS_MOUNT_RTDEFAULT) {
>> +			if (S_ISDIR(mode))
>> +				ip->i_d.di_flags |= XFS_DIFLAG_RTINHERIT;
>> +			else if (S_ISREG(mode))
>> +				ip->i_d.di_flags |= XFS_DIFLAG_REALTIME;
>> +		}
>> 		if (pip && (pip->i_d.di_flags & XFS_DIFLAG_ANY)) {
>> 			uint64_t	di_flags2 = 0;
>> 			uint		di_flags = 0;
>> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
>> index 9fa312a..da25398 100644
>> --- a/fs/xfs/xfs_mount.h
>> +++ b/fs/xfs/xfs_mount.h
>> @@ -243,6 +243,11 @@ typedef struct xfs_mount {
>> 						   allocator */
>> #define XFS_MOUNT_NOATTR2	(1ULL << 25)	/* disable use of attr2 format */
>> 
>> +/* FB Real-time device options */
>> +#define XFS_MOUNT_RTDEFAULT	(1ULL << 61)	/* Always allocate blocks from
>> +						 * RT device
>> +						 */
>> +
>> #define XFS_MOUNT_DAX		(1ULL << 62)	/* TEST ONLY! */
>> 
>> 
>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>> index 455a575..e4f85a9 100644
>> --- a/fs/xfs/xfs_super.c
>> +++ b/fs/xfs/xfs_super.c
>> @@ -83,7 +83,7 @@ enum {
>> 	Opt_quota, Opt_noquota, Opt_usrquota, Opt_grpquota, Opt_prjquota,
>> 	Opt_uquota, Opt_gquota, Opt_pquota,
>> 	Opt_uqnoenforce, Opt_gqnoenforce, Opt_pqnoenforce, Opt_qnoenforce,
>> -	Opt_discard, Opt_nodiscard, Opt_dax, Opt_err,
>> +	Opt_discard, Opt_nodiscard, Opt_dax, Opt_rtdefault, Opt_err,
>> };
>> 
>> static const match_table_t tokens = {
>> @@ -133,6 +133,9 @@ static const match_table_t tokens = {
>> 
>> 	{Opt_dax,	"dax"},		/* Enable direct access to bdev pages */
>> 
>> +#ifdef CONFIG_XFS_RT
>> +	{Opt_rtdefault,	"rtdefault"},	/* Default to real-time device */
>> +#endif
>> 	/* Deprecated mount options scheduled for removal */
>> 	{Opt_barrier,	"barrier"},	/* use writer barriers for log write and
>> 					 * unwritten extent conversion */
>> @@ -367,6 +370,11 @@ xfs_parseargs(
>> 		case Opt_nodiscard:
>> 			mp->m_flags &= ~XFS_MOUNT_DISCARD;
>> 			break;
>> +#ifdef CONFIG_XFS_RT
>> +		case Opt_rtdefault:
>> +			mp->m_flags |= XFS_MOUNT_RTDEFAULT;
>> +			break;
>> +#endif
>> #ifdef CONFIG_FS_DAX
>> 		case Opt_dax:
>> 			mp->m_flags |= XFS_MOUNT_DAX;
>> @@ -492,6 +500,9 @@ xfs_showargs(
>> 		{ XFS_MOUNT_DISCARD,		",discard" },
>> 		{ XFS_MOUNT_SMALL_INUMS,	",inode32" },
>> 		{ XFS_MOUNT_DAX,		",dax" },
>> +#ifdef CONFIG_XFS_RT
>> +		{ XFS_MOUNT_RTDEFAULT,          ",rtdefault" },
>> +#endif
>> 		{ 0, NULL }
>> 	};
>> 	static struct proc_xfs_info xfs_info_unset[] = {
>> -- 
>> 2.9.3
Brian Foster Sept. 1, 2017, 7:32 p.m. UTC | #5
On Fri, Sep 01, 2017 at 06:39:09PM +0000, Richard Wareing wrote:
> Thanks for the quick feedback Dave!  My comments are in-line below.
> 
> 
> > On Aug 31, 2017, at 9:31 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > Hi Richard,
> > 
> > On Thu, Aug 31, 2017 at 06:00:21PM -0700, Richard Wareing wrote:
...
> >> add
> >> support for the more sophisticated AG based block allocator to RT
> >> (bitmapped version works well for us, but multi-threaded use-cases
> >> might not do as well).
> > 
> > That's a great big can of worms - not sure we want to open it. The
> > simplicity of the rt allocator is one of it's major benefits to
> > workloads that require deterministic allocation behaviour...
> 
> Agreed, I took a quick look at what it might take and came to a similar conclusion, but I can dream :).
> 

Just a side point based on the discussion so far... I kind of get the
impression that the primary reason for using realtime support here is
for the simple fact that it's a separate physical device. That provides
a basic mechanism to split files across fast and slow physical storage
based on some up-front heuristic. The fact that the realtime feature
uses a separate allocation algorithm is actually irrelevant (and
possibly a problem in the future).

Is that an accurate assessment? If so, it makes me wonder whether it's
worth thinking about if there are ways to get the same behavior using
traditional functionality. This ignores Dave's question about how much
of the performance actually comes from simply separating out the log,
but for example suppose we had a JBOD block device made up of a
combination of spinning and solid state disks via device-mapper with the
requirement that a boundary from fast -> slow and vice versa was always
at something like a 100GB alignment. Then if you formatted that device
with XFS using 100GB AGs (or whatever to make them line up), and could
somehow tag each AG as "fast" or "slow" based on the known underlying
device mapping, could you potentially get the same results by using the
same heuristics to direct files to particular sets of AGs rather than
between two physical devices? Obviously there are some differences like
metadata being spread across the fast/slow devices (though I think we
had such a thing as metadata only AGs), etc. I'm just handwaving here to
try and better understand the goal.
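
(For concreteness, a sketch of the kind of concatenation described above; device names are hypothetical and the table entries are in 512-byte sectors:)

	# SSD=$(blockdev --getsz /dev/ssd); HDD=$(blockdev --getsz /dev/hdd)
	# printf '0 %s linear /dev/ssd 0\n%s %s linear /dev/hdd 0\n' $SSD $SSD $HDD | dmsetup create tiered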

Brian

> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> 
Richard Wareing Sept. 1, 2017, 8:36 p.m. UTC | #6
> On Sep 1, 2017, at 12:32 PM, Brian Foster <bfoster@redhat.com> wrote:
> 
> On Fri, Sep 01, 2017 at 06:39:09PM +0000, Richard Wareing wrote:
>> Thanks for the quick feedback Dave!  My comments are in-line below.
>> 
>> 
>>> On Aug 31, 2017, at 9:31 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> 
>>> Hi Richard,
>>> 
>>> On Thu, Aug 31, 2017 at 06:00:21PM -0700, Richard Wareing wrote:
> ...
>>>> add
>>>> support for the more sophisticated AG based block allocator to RT
>>>> (bitmapped version works well for us, but multi-threaded use-cases
>>>> might not do as well).
>>> 
>>> That's a great big can of worms - not sure we want to open it. The
>>> simplicity of the rt allocator is one of it's major benefits to
>>> workloads that require deterministic allocation behaviour...
>> 
>> Agreed, I took a quick look at what it might take and came to a similar conclusion, but I can dream :).
>> 
> 
> Just a side point based on the discussion so far... I kind of get the
> impression that the primary reason for using realtime support here is
> for the simple fact that it's a separate physical device. That provides
> a basic mechanism to split files across fast and slow physical storage
> based on some up-front heuristic. The fact that the realtime feature
> uses a separate allocation algorithm is actually irrelevant (and
> possibly a problem in the future).
> 
> Is that an accurate assessment? If so, it makes me wonder whether it's
> worth thinking about if there are ways to get the same behavior using
> traditional functionality. This ignores Dave's question about how much
> of the performance actually comes from simply separating out the log,
> but for example suppose we had a JBOD block device made up of a
> combination of spinning and solid state disks via device-mapper with the
> requirement that a boundary from fast -> slow and vice versa was always
> at something like a 100GB alignment. Then if you formatted that device
> with XFS using 100GB AGs (or whatever to make them line up), and could
> somehow tag each AG as "fast" or "slow" based on the known underlying
> device mapping, could you potentially get the same results by using the
> same heuristics to direct files to particular sets of AGs rather than
> between two physical devices? Obviously there are some differences like
> metadata being spread across the fast/slow devices (though I think we
> had such a thing as metadata only AGs), etc. I'm just handwaving here to
> try and better understand the goal.
> 


Sorry, I forgot to clarify the origins of the performance wins here.  This is obviously very workload dependent (e.g. write/flush/inode-update-heavy workloads benefit the most), but for our use case about ~65% of the IOP savings come from the journal (~1/3) plus the sync of metadata from the journal (slightly less than 1/3, as some journal entries get canceled); the remaining ~1/3 of the win comes from reading small files from the SSD instead of the HDDs (about 25-30% of our file population is <=256k, depending on the cluster).  To be clear, we don't split files: we store all the data blocks of a file either entirely on the SSD (small files <=256k) or entirely on the real-time HDD device.  The basic principle here is that larger files MIGHT have small IOPs to them (in our use-case this happens to be rare, but not impossible), but small files always do, and when 25-30% of your population is small... that's a big chunk of your IOPs.

The AG-based approach could work, though it's going to be a very hard sell to use device-mapper; this isn't code we have ever used in our storage stack.  At our scale, there are important operational reasons we need to keep the storage stack simple (fewer bugs to hit), so keeping the solution contained within XFS is a necessary requirement for us.

Richard


> Brian
> 
>>> 
>>> Cheers,
>>> 
>>> Dave.
>>> -- 
>>> Dave Chinner
>>> david@fromorbit.com
>> 
Dave Chinner Sept. 1, 2017, 10:55 p.m. UTC | #7
[Saturday morning here, so just a quick comment]

On Fri, Sep 01, 2017 at 08:36:53PM +0000, Richard Wareing wrote:
> > On Sep 1, 2017, at 12:32 PM, Brian Foster <bfoster@redhat.com> wrote:
> > 
> > On Fri, Sep 01, 2017 at 06:39:09PM +0000, Richard Wareing wrote:
> >> Thanks for the quick feedback Dave!  My comments are in-line below.
> >> 
> >> 
> >>> On Aug 31, 2017, at 9:31 PM, Dave Chinner <david@fromorbit.com> wrote:
> >>> 
> >>> Hi Richard,
> >>> 
> >>> On Thu, Aug 31, 2017 at 06:00:21PM -0700, Richard Wareing wrote:
> > ...
> >>>> add
> >>>> support for the more sophisticated AG based block allocator to RT
> >>>> (bitmapped version works well for us, but multi-threaded use-cases
> >>>> might not do as well).
> >>> 
> >>> That's a great big can of worms - not sure we want to open it. The
> >>> simplicity of the rt allocator is one of it's major benefits to
> >>> workloads that require deterministic allocation behaviour...
> >> 
> >> Agreed, I took a quick look at what it might take and came to a similar conclusion, but I can dream :).
> >> 
> > 
> > Just a side point based on the discussion so far... I kind of get the
> > impression that the primary reason for using realtime support here is
> > for the simple fact that it's a separate physical device. That provides
> > a basic mechanism to split files across fast and slow physical storage
> > based on some up-front heuristic. The fact that the realtime feature
> > uses a separate allocation algorithm is actually irrelevant (and
> > possibly a problem in the future).
> > 
> > Is that an accurate assessment? If so, it makes me wonder whether it's
> > worth thinking about if there are ways to get the same behavior using
> > traditional functionality. This ignores Dave's question about how much
> > of the performance actually comes from simply separating out the log,
> > but for example suppose we had a JBOD block device made up of a
> > combination of spinning and solid state disks via device-mapper with the
> > requirement that a boundary from fast -> slow and vice versa was always
> > at something like a 100GB alignment. Then if you formatted that device
> > with XFS using 100GB AGs (or whatever to make them line up), and could
> > somehow tag each AG as "fast" or "slow" based on the known underlying
> > device mapping,

Not a new idea. :)

I've got old xfs_spaceman patches sitting around somewhere for
ioctls to add such information to individual AGs. I think I called
them "concat groups" to allow multiple AGs to sit inside a single
concatenation, and they added a policy layer over the top of AGs
to control things like metadata placement....

> > could you potentially get the same results by using the
> > same heuristics to direct files to particular sets of AGs rather than
> > between two physical devices?

That's pretty much what I was working on back at SGI in 2007. i.e.
providing a method for configuring AGs with different
characteristics and a userspace policy interface to configure and
make use of it....

http://oss.sgi.com/archives/xfs/2009-02/msg00250.html


> > Obviously there are some differences like
> > metadata being spread across the fast/slow devices (though I think we
> > had such a thing as metadata only AGs), etc.

We have "metadata preferred" AGs, and that is what the inode32
policy uses to place all the inodes and directory/attribute metadata
in the 32bit inode address space. It doesn't get used for data
unless the rest of the filesystem is ENOSPC.

> > I'm just handwaving here to
> > try and better understand the goal.

We've been down these paths many times - the problem has always been
that the people who want complex, configurable allocation policies
for their workload have never provided the resources needed to
implement past "here's a mount option hack that works for us".....

> Sorry I forgot to clarify the origins of the performance wins
> here.   This is obviously very workload dependent (e.g.
> write/flush/inode updatey workloads benefit the most) but for our
> use case about ~65% of the IOP savings (~1/3 journal + slightly
> less than 1/3 sync of metadata from journal, slightly less as some
> journal entries get canceled), the remainder 1/3 of the win comes
> from reading small files from the SSD vs. HDDs (about 25-30% of
> our file population is <=256k; depending on the cluster).  To be
> clear, we don't split files, we store all data blocks of the files
> either entirely on the SSD (e.g. small files <=256k) and the rest
> on the real-time HDD device.  The basic principal here being that,
> larger files MIGHT have small IOPs to them (in our use-case this
> happens to be rare, but not impossible), but small files always
> do, and when 25-30% of your population is small...that's a big
> chunk of your IOPs.

So here's a test for you. Make a device with a SSD as the first 1TB,
and your HDD as the rest (use dm to do this). Then use the inode32
allocator (mount option) to split metadata from data. The filesystem
will keep inodes/directories on the SSD and file data on the HDD
automatically.
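
(i.e., something like the following sketch on top of a dm concatenation of the two devices; names are hypothetical:)

	# mkfs.xfs /dev/mapper/tiered
	# mount -o inode32 /dev/mapper/tiered /mnt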

Better yet: have data allocations smaller than stripe units target
metadata preferred AGs (i.e. the SSD region) and allocations larger
than stripe unit target the data-preferred AGs. Set the stripe unit
to match your SSD/HDD threshold....

[snip]

> The AG based could work, though it's going to be a very hard sell
> to use dm mapper, this isn't code we have ever used in our storage
> stack.  At our scale, there are important operational reasons we
> need to keep the storage stack simple (less bugs to hit), so
> keeping the solution contained within XFS is a necessary
> requirement for us.

Modifying the filesystem on-disk format is far more complex than
adding dm to your stack. Filesystem modifications are difficult and
time consuming because if we screw up, users lose all their data.

If you can solve the problem with DM and a little bit of additional
in-memory kernel code to categorise and select which AG to use for
what (i.e. policy stuff that can be held in userspace), then that is
the pretty much the only answer that makes sense from a filesystem
developer's point of view....

Start by thinking about exposing AG behaviour controls through sysfs
objects and configuring them at mount time through udev event
notifications.

Cheers,

Dave.
Richard Wareing Sept. 1, 2017, 11:37 p.m. UTC | #8
> On Sep 1, 2017, at 3:55 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> [satuday morning here, so just a quick comment]
> 
> On Fri, Sep 01, 2017 at 08:36:53PM +0000, Richard Wareing wrote:
>>> On Sep 1, 2017, at 12:32 PM, Brian Foster <bfoster@redhat.com> wrote:
>>> 
>>> On Fri, Sep 01, 2017 at 06:39:09PM +0000, Richard Wareing wrote:
>>>> Thanks for the quick feedback Dave!  My comments are in-line below.
>>>> 
>>>> 
>>>>> On Aug 31, 2017, at 9:31 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>>> 
>>>>> Hi Richard,
>>>>> 
>>>>> On Thu, Aug 31, 2017 at 06:00:21PM -0700, Richard Wareing wrote:
>>> ...
>>>>>> add
>>>>>> support for the more sophisticated AG based block allocator to RT
>>>>>> (bitmapped version works well for us, but multi-threaded use-cases
>>>>>> might not do as well).
>>>>> 
>>>>> That's a great big can of worms - not sure we want to open it. The
>>>>> simplicity of the rt allocator is one of it's major benefits to
>>>>> workloads that require deterministic allocation behaviour...
>>>> 
>>>> Agreed, I took a quick look at what it might take and came to a similar conclusion, but I can dream :).
>>>> 
>>> 
>>> Just a side point based on the discussion so far... I kind of get the
>>> impression that the primary reason for using realtime support here is
>>> for the simple fact that it's a separate physical device. That provides
>>> a basic mechanism to split files across fast and slow physical storage
>>> based on some up-front heuristic. The fact that the realtime feature
>>> uses a separate allocation algorithm is actually irrelevant (and
>>> possibly a problem in the future).
>>> 
>>> Is that an accurate assessment? If so, it makes me wonder whether it's
>>> worth thinking about if there are ways to get the same behavior using
>>> traditional functionality. This ignores Dave's question about how much
>>> of the performance actually comes from simply separating out the log,
>>> but for example suppose we had a JBOD block device made up of a
>>> combination of spinning and solid state disks via device-mapper with the
>>> requirement that a boundary from fast -> slow and vice versa was always
>>> at something like a 100GB alignment. Then if you formatted that device
>>> with XFS using 100GB AGs (or whatever to make them line up), and could
>>> somehow tag each AG as "fast" or "slow" based on the known underlying
>>> device mapping,
> 
> Not a new idea. :)
> 
> I've got old xfs_spaceman patches sitting around somewhere for
> ioctls to add such information to individual AGs. I think I called
> them "concat groups" to allow multiple AGs to sit inside a single
> concatenation, and they added a policy layer over the top of AGs
> to control things like metadata placement....
> 
>>> could you potentially get the same results by using the
>>> same heuristics to direct files to particular sets of AGs rather than
>>> between two physical devices?
> 
> That's pretty much what I was working on back at SGI in 2007. i.e.
> providing a method for configuring AGs with difference
> characteristics and a userspace policy interface to configure and
> make use of it....
> 
> http://oss.sgi.com/archives/xfs/2009-02/msg00250.html
> 
> 
>>> Obviously there are some differences like
>>> metadata being spread across the fast/slow devices (though I think we
>>> had such a thing as metadata only AGs), etc.
> 
> We have "metadata preferred" AGs, and that is what the inode32
> policy uses to place all the inodes and directory/atribute metadata
> in the 32bit inode address space. It doesn't get used for data
> unless the rest of the filesystem is ENOSPC.
> 
>>> I'm just handwaving here to
>>> try and better understand the goal.
> 
> We've been down these paths many times - the problem has always been
> that the people who want complex, configurable allocation policies
> for their workload have never provided the resources needed to
> implement past "here's a mount option hack that works for us".....
> 
>> Sorry I forgot to clarify the origins of the performance wins
>> here.   This is obviously very workload dependent (e.g.
>> write/flush/inode updatey workloads benefit the most) but for our
>> use case about ~65% of the IOP savings (~1/3 journal + slightly
>> less than 1/3 sync of metadata from journal, slightly less as some
>> journal entries get canceled), the remainder 1/3 of the win comes
>> from reading small files from the SSD vs. HDDs (about 25-30% of
>> our file population is <=256k; depending on the cluster).  To be
>> clear, we don't split files, we store all data blocks of the files
>> either entirely on the SSD (e.g. small files <=256k) and the rest
>> on the real-time HDD device.  The basic principal here being that,
>> larger files MIGHT have small IOPs to them (in our use-case this
>> happens to be rare, but not impossible), but small files always
>> do, and when 25-30% of your population is small...that's a big
>> chunk of your IOPs.
> 
> So here's a test for you. Make a device with a SSD as the first 1TB,
> and you HDD as the rest (use dm to do this). Then use the inode32
> allocator (mount option) to split metadata from data. The filesysetm
> will keep inodes/directories on the SSD and file data on the HDD
> automatically.
> 
> Better yet: have data allocations smaller than stripe units target
> metadata prefferred AGs (i.e. the SSD region) and allocations larger
> than stripe unit target the data-preferred AGs. Set the stripe unit
> to match your SSD/HDD threshold....
> 
> [snip]
> 
>> The AG based could work, though it's going to be a very hard sell
>> to use dm mapper, this isn't code we have ever used in our storage
>> stack.  At our scale, there are important operational reasons we
>> need to keep the storage stack simple (less bugs to hit), so
>> keeping the solution contained within XFS is a necessary
>> requirement for us.
> 
> Modifying the filesystem on-disk format is far more complex than
> adding dm to your stack. Filesystem modifications are difficult and
> time consuming because if we screw up, users lose all their data.
> 
> If you can solve the problem with DM and a little bit of additional
> in-memory kernel code to categorise and select which AG to use for
> what (i.e. policy stuff that can be held in userspace), then that is
> the pretty much the only answer that makes sense from a filesystem
> developer's point of view....
> 
> Start by thinking about exposing AG behaviour controls through sysfs
> objects and configuring them at mount time through udev event
> notifications.
> 

Very cool idea.  A detail which I left out which might complicate this is that we only use 17GB of SSD for each ~8-10TB HDD (we share just a small 256G SSD for about 15 drives), and even then we don't use 50% of the SSD for these partitions.  We also want to be very selective about what data we let touch the SSD: we don't want folks who write large files by doing small IO to touch the SSD, only IO to small files (which are immutable in our use-case).

On an unrelated note, after talking to Omar Sandoval & Chris Mason over here, I'm reworking rtdefault to change it to "rtdisable" which gives the same operational outcome vs. rtdefault w/o setting inheritance bits (see prior e-mail).  This way folks have a kill switch of sorts, yet otherwise maintains the existing "persistent" behavior.


> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Brian Foster Sept. 2, 2017, 11:55 a.m. UTC | #9
On Fri, Sep 01, 2017 at 11:37:37PM +0000, Richard Wareing wrote:
> 
> > On Sep 1, 2017, at 3:55 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > [Saturday morning here, so just a quick comment]
> > 
> > On Fri, Sep 01, 2017 at 08:36:53PM +0000, Richard Wareing wrote:
> >>> On Sep 1, 2017, at 12:32 PM, Brian Foster <bfoster@redhat.com> wrote:
> >>> 
> >>> On Fri, Sep 01, 2017 at 06:39:09PM +0000, Richard Wareing wrote:
> >>>> Thanks for the quick feedback Dave!  My comments are in-line below.
> >>>> 
> >>>> 
> >>>>> On Aug 31, 2017, at 9:31 PM, Dave Chinner <david@fromorbit.com> wrote:
> >>>>> 
> >>>>> Hi Richard,
> >>>>> 
> >>>>> On Thu, Aug 31, 2017 at 06:00:21PM -0700, Richard Wareing wrote:
> >>> ...
> >>>>>> add
> >>>>>> support for the more sophisticated AG based block allocator to RT
> >>>>>> (bitmapped version works well for us, but multi-threaded use-cases
> >>>>>> might not do as well).
> >>>>> 
> >>>>> That's a great big can of worms - not sure we want to open it. The
> >>>>> simplicity of the rt allocator is one of its major benefits to
> >>>>> workloads that require deterministic allocation behaviour...
> >>>> 
> >>>> Agreed, I took a quick look at what it might take and came to a similar conclusion, but I can dream :).
> >>>> 
> >>> 
> >>> Just a side point based on the discussion so far... I kind of get the
> >>> impression that the primary reason for using realtime support here is
> >>> for the simple fact that it's a separate physical device. That provides
> >>> a basic mechanism to split files across fast and slow physical storage
> >>> based on some up-front heuristic. The fact that the realtime feature
> >>> uses a separate allocation algorithm is actually irrelevant (and
> >>> possibly a problem in the future).
> >>> 
> >>> Is that an accurate assessment? If so, it makes me wonder whether it's
> >>> worth thinking about if there are ways to get the same behavior using
> >>> traditional functionality. This ignores Dave's question about how much
> >>> of the performance actually comes from simply separating out the log,
> >>> but for example suppose we had a JBOD block device made up of a
> >>> combination of spinning and solid state disks via device-mapper with the
> >>> requirement that a boundary from fast -> slow and vice versa was always
> >>> at something like a 100GB alignment. Then if you formatted that device
> >>> with XFS using 100GB AGs (or whatever to make them line up), and could
> >>> somehow tag each AG as "fast" or "slow" based on the known underlying
> >>> device mapping,
> > 
> > Not a new idea. :)
> > 

Yeah (what ever is? :P).. I know we've discussed having more controls or
attributes of AGs for various things in the past. I'm not trying to
propose a particular design here, but rather trying to step back from
the focus on RT and understand what the general requirements are
(multi-device, tiering, etc.). I've not seen the pluggable allocation
stuff before, but it sounds like that could suit this use case perfectly.

> > I've got old xfs_spaceman patches sitting around somewhere for
> > ioctls to add such information to individual AGs. I think I called
> > them "concat groups" to allow multiple AGs to sit inside a single
> > concatenation, and they added a policy layer over the top of AGs
> > to control things like metadata placement....
> > 

Yeah, the alignment thing is just the first thing that popped in my head
for a thought experiment. Programmatic knobs on AGs via ioctl() or sysfs
is certainly a more legitimate solution.

> >>> could you potentially get the same results by using the
> >>> same heuristics to direct files to particular sets of AGs rather than
> >>> between two physical devices?
> > 
> > That's pretty much what I was working on back at SGI in 2007. i.e.
> > providing a method for configuring AGs with different
> > characteristics and a userspace policy interface to configure and
> > make use of it....
> > 
> > http://oss.sgi.com/archives/xfs/2009-02/msg00250.html
> > 
> > 
> >>> Obviously there are some differences like
> >>> metadata being spread across the fast/slow devices (though I think we
> >>> had such a thing as metadata only AGs), etc.
> > 
> > We have "metadata preferred" AGs, and that is what the inode32
> > policy uses to place all the inodes and directory/attribute metadata
> > in the 32bit inode address space. It doesn't get used for data
> > unless the rest of the filesystem is ENOSPC.
> > 

Ah, right. Thanks.

> >>> I'm just handwaving here to
> >>> try and better understand the goal.
> > 
> > We've been down these paths many times - the problem has always been
> > that the people who want complex, configurable allocation policies
> > for their workload have never provided the resources needed to
> > implement past "here's a mount option hack that works for us".....
> > 

Yep. To be fair, I think what Richard is doing is an interesting and
useful experiment. If one wants to determine whether there's value in
directing files across separate devices via file size in a constrained
workload, it makes sense to hack up things like RT and fallocate()
because they provide the basic mechanisms you'd want to take advantage
of without having to reimplement that stuff just to prove a concept.

The challenge of course is then realizing when you're done that this is
not a generic solution. It abuses features/interfaces in ways they were
not designed for, disrupts traditional functionality, makes assumptions
that may not be valid for all users (i.e., file size based filtering,
number of devices, device to device ratios), etc. So we have to step
back and try to piece together a more generic, upstream-worthy approach.
To your point, it would be nice if those exploring these kind of hacks
would contribute more to that upstream process rather than settle on
running the "custom fit" hack until upstream comes around with something
better on its own. ;) (Though sending it out is still better than not,
so thanks for that. :)

> >> Sorry I forgot to clarify the origins of the performance wins
> >> here.   This is obviously very workload dependent (e.g.
> >> write/flush/inode updatey workloads benefit the most) but for our
> >> use case about ~65% of the IOP savings (~1/3 journal + slightly
> >> less than 1/3 sync of metadata from journal, slightly less as some
> >> journal entries get canceled), the remainder 1/3 of the win comes
> >> from reading small files from the SSD vs. HDDs (about 25-30% of
> >> our file population is <=256k; depending on the cluster).  To be
> >> clear, we don't split files, we store all data blocks of the files
> >> either entirely on the SSD (e.g. small files <=256k) or entirely
> >> on the real-time HDD device.  The basic principle here being that
> >> larger files MIGHT have small IOPs to them (in our use-case this
> >> happens to be rare, but not impossible), but small files always
> >> do, and when 25-30% of your population is small...that's a big
> >> chunk of your IOPs.
> > 
> > So here's a test for you. Make a device with a SSD as the first 1TB,
> > and your HDD as the rest (use dm to do this). Then use the inode32
> > allocator (mount option) to split metadata from data. The filesystem
> > will keep inodes/directories on the SSD and file data on the HDD
> > automatically.
> > 
> > Better yet: have data allocations smaller than stripe units target
> > metadata preferred AGs (i.e. the SSD region) and allocations larger
> > than stripe unit target the data-preferred AGs. Set the stripe unit
> > to match your SSD/HDD threshold....
> > 
> > [snip]
> > 
> >> The AG based could work, though it's going to be a very hard sell
> >> to use dm mapper, this isn't code we have ever used in our storage
> >> stack.  At our scale, there are important operational reasons we
> >> need to keep the storage stack simple (less bugs to hit), so
> >> keeping the solution contained within XFS is a necessary
> >> requirement for us.
> > 

I am obviously not at all familiar with your storage stack and the
requirements of your environment and whatnot. It's certainly possible
that there's some technical reason you can't use dm, but I find it very
hard to believe that reason is "there might be bugs" if you're instead
willing to hack up and deploy a barely tested feature such as XFS RT.
Using dm for basic linear mapping (i.e., partitioning) seems pretty much
ubiquitous in the Linux world these days.

> > Modifying the filesystem on-disk format is far more complex than
> > adding dm to your stack. Filesystem modifications are difficult and
> > time consuming because if we screw up, users lose all their data.
> > 
> > If you can solve the problem with DM and a little bit of additional
> > in-memory kernel code to categorise and select which AG to use for
> > what (i.e. policy stuff that can be held in userspace), then that is
> > the pretty much the only answer that makes sense from a filesystem
> > developer's point of view....
> > 

Yep, agreed.

> > Start by thinking about exposing AG behaviour controls through sysfs
> > objects and configuring them at mount time through udev event
> > notifications.
> > 
> 
> Very cool idea.  A detail which I left out which might complicate this is that we only use 17GB of SSD for each ~8-10TB HDD (we share just a small 256G SSD for about 15 drives), and even then we don't use 50% of the SSD for these partitions.  We also want to be very selective about what data we let touch the SSD: we don't want folks who write large files by doing small IO to touch the SSD, only IO to small files (which are immutable in our use-case).
> 

I think Dave's more after the data point of how much basic metadata/data
separation helps your workload. This is an experiment you can run to get
that behavior without having to write any code (maybe a little for the
stripe unit thing ;). If there's a physical device size limitation,
perhaps you can do something crazy like create a sparse 1TB file on the
SSD, map that to a block device over loop or something and proceed from
there.
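
A minimal sketch of that sparse-file-over-loop idea, with a made-up path and size just for illustration:

$ truncate -s 1T /ssd/fastdev.img         # sparse file, only consumes SSD space as it's written
$ losetup --find --show /ssd/fastdev.img  # attaches it and prints the loop device, e.g. /dev/loopN

The loop device can then stand in for the "fast" leg of the concatenated test device.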

Though I guess that since this is a performance experiment, a better
idea may be to find a bigger SSD or concat 4 of the 256GB devices into
1TB and use that, assuming you're able to procure enough devices to run
an informative test.

Brian

> On an unrelated note, after talking to Omar Sandoval & Chris Mason over here, I'm reworking rtdefault to change it to "rtdisable" which gives the same operational outcome vs. rtdefault w/o setting inheritance bits (see prior e-mail).  This way folks have a kill switch of sorts, yet otherwise maintains the existing "persistent" behavior.
> 
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> 
Dave Chinner Sept. 2, 2017, 10:56 p.m. UTC | #10
On Sat, Sep 02, 2017 at 07:55:45AM -0400, Brian Foster wrote:
> On Fri, Sep 01, 2017 at 11:37:37PM +0000, Richard Wareing wrote:
> > 
> > > On Sep 1, 2017, at 3:55 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > 
> > > [Saturday morning here, so just a quick comment]
> > > 
> > > On Fri, Sep 01, 2017 at 08:36:53PM +0000, Richard Wareing wrote:
> > >>> On Sep 1, 2017, at 12:32 PM, Brian Foster <bfoster@redhat.com> wrote:
> > >>> 
> > >>> On Fri, Sep 01, 2017 at 06:39:09PM +0000, Richard Wareing wrote:
> > >>>> Thanks for the quick feedback Dave!  My comments are in-line below.
> > >>>> 
> > >>>> 
> > >>>>> On Aug 31, 2017, at 9:31 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >>>>> 
> > >>>>> Hi Richard,
> > >>>>> 
> > >>>>> On Thu, Aug 31, 2017 at 06:00:21PM -0700, Richard Wareing wrote:
> > >>> ...
> > >>>>>> add
> > >>>>>> support for the more sophisticated AG based block allocator to RT
> > >>>>>> (bitmapped version works well for us, but multi-threaded use-cases
> > >>>>>> might not do as well).
> > >>>>> 
> > >>>>> That's a great big can of worms - not sure we want to open it. The
> > >>>>> simplicity of the rt allocator is one of its major benefits to
> > >>>>> workloads that require deterministic allocation behaviour...
> > >>>> 
> > >>>> Agreed, I took a quick look at what it might take and came to a similar conclusion, but I can dream :).
> > >>>> 
> > >>> 
> > >>> Just a side point based on the discussion so far... I kind of get the
> > >>> impression that the primary reason for using realtime support here is
> > >>> for the simple fact that it's a separate physical device. That provides
> > >>> a basic mechanism to split files across fast and slow physical storage
> > >>> based on some up-front heuristic. The fact that the realtime feature
> > >>> uses a separate allocation algorithm is actually irrelevant (and
> > >>> possibly a problem in the future).
> > >>> 
> > >>> Is that an accurate assessment? If so, it makes me wonder whether it's
> > >>> worth thinking about if there are ways to get the same behavior using
> > >>> traditional functionality. This ignores Dave's question about how much
> > >>> of the performance actually comes from simply separating out the log,
> > >>> but for example suppose we had a JBOD block device made up of a
> > >>> combination of spinning and solid state disks via device-mapper with the
> > >>> requirement that a boundary from fast -> slow and vice versa was always
> > >>> at something like a 100GB alignment. Then if you formatted that device
> > >>> with XFS using 100GB AGs (or whatever to make them line up), and could
> > >>> somehow tag each AG as "fast" or "slow" based on the known underlying
> > >>> device mapping,
> > > 
> > > Not a new idea. :)
> > > 
> 
> Yeah (what ever is? :P).. I know we've discussed having more controls or
> attributes of AGs for various things in the past. I'm not trying to
> propose a particular design here, but rather trying to step back from
> the focus on RT and understand what the general requirements are
> (multi-device, tiering, etc.).

Same here :P

> I've not seen the pluggable allocation
> stuff before, but it sounds like that could suit this use case perfectly.

Yup, there's plenty of use cases for it, but not enough resources to
go round...

> > > I've got old xfs_spaceman patches sitting around somewhere for
> > > ioctls to add such information to individual AGs. I think I called
> > > them "concat groups" to allow multiple AGs to sit inside a single
> > > concatenation, and they added a policy layer over the top of AGs
> > > to control things like metadata placement....
> > > 
> 
> Yeah, the alignment thing is just the first thing that popped in my head
> for a thought experiment. Programmatic knobs on AGs via ioctl() or sysfs
> is certainly a more legitimate solution.

Yeah, it matches nicely with the configurable error handling via
sysfs - mount the filesystem, get a uevent, read the config file,
punch in the customised allocation config via sysfs knobs...

> > >>> I'm just handwaving here to
> > >>> try and better understand the goal.
> > > 
> > > We've been down these paths many times - the problem has always been
> > > that the people who want complex, configurable allocation policies
> > > for their workload have never provided the resources needed to
> > > implement past "here's a mount option hack that works for us".....
> > > 
> 
> Yep. To be fair, I think what Richard is doing is an interesting and
> useful experiment. If one wants to determine whether there's value in
> directing files across separate devices via file size in a constrained
> workload, it makes sense to hack up things like RT and fallocate()
> because they provide the basic mechanisms you'd want to take advantage
> of without having to reimplement that stuff just to prove a concept.

Yes, that's how one prototypes and tests a hypothesis quickly... :)

> The challenge of course is then realizing when you're done that this is
> not a generic solution. It abuses features/interfaces in ways they were
> not designed for, disrupts traditional functionality, makes assumptions
> that may not be valid for all users (i.e., file size based filtering,
> number of devices, device to device ratios), etc. So we have to step
> back and try to piece together a more generic, upstream-worthy approach.

*nod*

> To your point, it would be nice if those exploring these kind of hacks
> would contribute more to that upstream process rather than settle on
> running the "custom fit" hack until upstream comes around with something
> better on its own. ;) (Though sending it out is still better than not,
> so thanks for that. :)

Yes, we do tend to set the bar quite high for new functionality.
Years of carrying around complex, one-off problem solutions that
don't quite work properly except in the original environment they
were designed for and are pretty much unused by anyone else (*cough*
filestreams *cough*) makes me want to avoid more one-off allocation
hacks and instead find a more generic solution...

> > >> The AG based could work, though it's going to be a very hard sell
> > >> to use dm mapper, this isn't code we have ever used in our storage
> > >> stack.  At our scale, there are important operational reasons we
> > >> need to keep the storage stack simple (less bugs to hit), so
> > >> keeping the solution contained within XFS is a necessary
> > >> requirement for us.
> > > 
> 
> I am obviously not at all familiar with your storage stack and the
> requirements of your environment and whatnot. It's certainly possible
> that there's some technical reason you can't use dm, but I find it very
> hard to believe that reason is "there might be bugs" if you're instead
> willing to hack up and deploy a barely tested feature such as XFS RT.
> Using dm for basic linear mapping (i.e., partitioning) seems pretty much
> ubiquitous in the Linux world these days.

Yup, my thoughts exactly.

> > > Modifying the filesystem on-disk format is far more complex than
> > > adding dm to your stack. Filesystem modifications are difficult and
> > > time consuming because if we screw up, users lose all their data.
> > > 
> > > If you can solve the problem with DM and a little bit of additional
> > > in-memory kernel code to categorise and select which AG to use for
> > > what (i.e. policy stuff that can be held in userspace), then that is
> > > the pretty much the only answer that makes sense from a filesystem
> > > developer's point of view....
> > > 
> 
> Yep, agreed.
> 
> > > Start by thinking about exposing AG behaviour controls through sysfs
> > > objects and configuring them at mount time through udev event
> > > notifications.
> > > 
> > 
> > Very cool idea.  A detail which I left out which might complicate this is that we only use 17GB of SSD for each ~8-10TB HDD (we share just a small 256G SSD for about 15 drives), and even then we don't use 50% of the SSD for these partitions.  We also want to be very selective about what data we let touch the SSD: we don't want folks who write large files by doing small IO to touch the SSD, only IO to small files (which are immutable in our use-case).
> > 
> 
> I think Dave's more after the data point of how much basic metadata/data
> separation helps your workload. This is an experiment you can run to get
> that behavior without having to write any code (maybe a little for the
> stripe unit thing ;).

Yup - the "select AG based on initial allocation size" criteria
would need some extra code in xfs_bmap_btalloc() to handle properly.

> If there's a physical device size limitation,
> perhaps you can do something crazy like create a sparse 1TB file on the
> SSD, map that to a block device over loop or something and proceed from
> there.

That'd work, but probably wouldn't perform all that well given the
added latency of the loop device...

> Though I guess that since this is a performance experiment, a better
> idea may be to find a bigger SSD or concat 4 of the 256GB devices into
> 1TB and use that, assuming you're able to procure enough devices to run
> an informative test.

I think it would be simpler to just use xfs_db to remove all the
space beyond 17GB in the first AG by modifying the freespace record
that mkfs lays down. i.e.  just shorten it to 17GB from "all of AG",
and the first AG will only have 17GB of space to play with.  Some
AGF and SB free space accounting would need to be modified as well
(to account for the lost space), but the result would be 1TB AGs and
AG 0 only having the first 17GB available for use, which matches the
SSD partitions exactly. It would also need mkfs to place the log in
ag 0, too (mkfs -l agnum=0 ....).

Again, this is a hack you could use for testing - the moment you run
xfs_repair it'll return the "lost space" in AG 0 to the free pool,
and it won't work unless you modify the freespace records/accounting
again. It would, however, largely tell us whether we can achieve the
same performance outcome without needing the RT device....
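
To make the shape of that experiment concrete, here is a rough sketch of the dm/mkfs/mount side of it; the device names, sizes and 256k threshold are made up for illustration, the AG 0 free-space surgery described above isn't shown, and none of this is a tested recipe:

# concatenate ~1TB of fast device followed by the HDD into one linear device
# (dmsetup table units are 512-byte sectors)
$ FAST=$(blockdev --getsz /dev/nvme0n1p1)
$ SLOW=$(blockdev --getsz /dev/sdb)
$ dmsetup create tiered <<EOF
0 $FAST linear /dev/nvme0n1p1 0
$FAST $SLOW linear /dev/sdb 0
EOF

# 1TB AGs so AG 0 lines up with the fast region, log forced into AG 0,
# stripe unit set to the small/large allocation threshold
$ mkfs.xfs -d agsize=1t,su=256k,sw=1 -l agnum=0 /dev/mapper/tiered

# inode32 keeps inodes/directories in the low address space (the fast region)
$ mount -o inode32 /dev/mapper/tiered /mnt/test

As noted above, actually steering sub-stripe-unit data allocations at the metadata-preferred AGs would still need a little kernel code in xfs_bmap_btalloc(); the commands only lay out the devices and AGs.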

Cheers,

Dave.
Richard Wareing Sept. 3, 2017, 12:43 a.m. UTC | #11
On 9/2/17, 4:55 AM, "Brian Foster" <bfoster@redhat.com> wrote:

    On Fri, Sep 01, 2017 at 11:37:37PM +0000, Richard Wareing wrote:
    > 

    > > On Sep 1, 2017, at 3:55 PM, Dave Chinner <david@fromorbit.com> wrote:

    > > 

    > > [satuday morning here, so just a quick comment]

    > > 

    > > On Fri, Sep 01, 2017 at 08:36:53PM +0000, Richard Wareing wrote:

    > >>> On Sep 1, 2017, at 12:32 PM, Brian Foster <bfoster@redhat.com> wrote:

    > >>> 

    > >>> On Fri, Sep 01, 2017 at 06:39:09PM +0000, Richard Wareing wrote:

    > >>>> Thanks for the quick feedback Dave!  My comments are in-line below.

    > >>>> 

    > >>>> 

    > >>>>> On Aug 31, 2017, at 9:31 PM, Dave Chinner <david@fromorbit.com> wrote:

    > >>>>> 

    > >>>>> Hi Richard,

    > >>>>> 

    > >>>>> On Thu, Aug 31, 2017 at 06:00:21PM -0700, Richard Wareing wrote:

    > >>> ...

    > >>>>>> add

    > >>>>>> support for the more sophisticated AG based block allocator to RT

    > >>>>>> (bitmapped version works well for us, but multi-threaded use-cases

    > >>>>>> might not do as well).

    > >>>>> 

    > >>>>> That's a great big can of worms - not sure we want to open it. The

    > >>>>> simplicity of the rt allocator is one of it's major benefits to

    > >>>>> workloads that require deterministic allocation behaviour...

    > >>>> 

    > >>>> Agreed, I took a quick look at what it might take and came to a similar conclusion, but I can dream :).

    > >>>> 

    > >>> 

    > >>> Just a side point based on the discussion so far... I kind of get the

    > >>> impression that the primary reason for using realtime support here is

    > >>> for the simple fact that it's a separate physical device. That provides

    > >>> a basic mechanism to split files across fast and slow physical storage

    > >>> based on some up-front heuristic. The fact that the realtime feature

    > >>> uses a separate allocation algorithm is actually irrelevant (and

    > >>> possibly a problem in the future).

    > >>> 

    > >>> Is that an accurate assessment? If so, it makes me wonder whether it's

    > >>> worth thinking about if there are ways to get the same behavior using

    > >>> traditional functionality. This ignores Dave's question about how much

    > >>> of the performance actually comes from simply separating out the log,

    > >>> but for example suppose we had a JBOD block device made up of a

    > >>> combination of spinning and solid state disks via device-mapper with the

    > >>> requirement that a boundary from fast -> slow and vice versa was always

    > >>> at something like a 100GB alignment. Then if you formatted that device

    > >>> with XFS using 100GB AGs (or whatever to make them line up), and could

    > >>> somehow tag each AG as "fast" or "slow" based on the known underlying

    > >>> device mapping,

    > > 

    > > Not a new idea. :)

    > > 

    
    Yeah (what ever is? :P).. I know we've discussed having more controls or
    attributes of AGs for various things in the past. I'm not trying to
    propose a particular design here, but rather trying to step back from
    the focus on RT and understand what the general requirements are
    (multi-device, tiering, etc.). I've not seen the pluggable allocation
    stuff before, but it sounds like that could suit this use case perfectly.
    
    > > I've got old xfs_spaceman patches sitting around somewhere for

    > > ioctls to add such information to individual AGs. I think I called

    > > them "concat groups" to allow multiple AGs to sit inside a single

    > > concatenation, and they added a policy layer over the top of AGs

    > > to control things like metadata placement....

    > > 

    
    Yeah, the alignment thing is just the first thing that popped in my head
    for a thought experiment. Programmatic knobs on AGs via ioctl() or sysfs
    is certainly a more legitimate solution.
    
    > >>> could you potentially get the same results by using the

    > >>> same heuristics to direct files to particular sets of AGs rather than

    > >>> between two physical devices?

    > > 

    > > That's pretty much what I was working on back at SGI in 2007. i.e.

    > > providing a method for configuring AGs with difference

    > > characteristics and a userspace policy interface to configure and

    > > make use of it....

    > > 

    > > https://urldefense.proofpoint.com/v2/url?u=http-3A__oss.sgi.com_archives_xfs_2009-2D02_msg00250.html&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=qJ8Lp7ySfpQklq3QZr44Iw&m=2aGIpGJVnKOtPKDQQRfM52Rv5NTAwoK15WHcQIodIG4&s=bAOVWOrDuWm92j4tTCxZnZOQxhUP1EVlj-JSHpC1yoA&e= 

    > > 

    > > 

    > >>> Obviously there are some differences like

    > >>> metadata being spread across the fast/slow devices (though I think we

    > >>> had such a thing as metadata only AGs), etc.

    > > 

    > > We have "metadata preferred" AGs, and that is what the inode32

    > > policy uses to place all the inodes and directory/atribute metadata

    > > in the 32bit inode address space. It doesn't get used for data

    > > unless the rest of the filesystem is ENOSPC.

    > > 

    
    Ah, right. Thanks.
    
    > >>> I'm just handwaving here to

    > >>> try and better understand the goal.

    > > 

    > > We've been down these paths many times - the problem has always been

    > > that the people who want complex, configurable allocation policies

    > > for their workload have never provided the resources needed to

    > > implement past "here's a mount option hack that works for us".....

    > > 

    
    Yep. To be fair, I think what Richard is doing is an interesting and
    useful experiment. If one wants to determine whether there's value in
    directing files across separate devices via file size in a constrained
    workload, it makes sense to hack up things like RT and fallocate()
    because they provide the basic mechanisms you'd want to take advantage
    of without having to reimplement that stuff just to prove a concept.
    
    The challenge of course is then realizing when you're done that this is
    not a generic solution. It abuses features/interfaces in ways they were
    not designed for, disrupts traditional functionality, makes assumptions
    that may not be valid for all users (i.e., file size based filtering,
    number of devices, device to device ratios), etc. So we have to step
    back and try to piece together a more generic, upstream-worthy approach.
    To your point, it would be nice if those exploring these kind of hacks
    would contribute more to that upstream process rather than settle on
    running the "custom fit" hack until upstream comes around with something
    better on its own. ;) (Though sending it out is still better than not,
    so thanks for that. :)
    
    > >> Sorry I forgot to clarify the origins of the performance wins

    > >> here.   This is obviously very workload dependent (e.g.

    > >> write/flush/inode updatey workloads benefit the most) but for our

    > >> use case about ~65% of the IOP savings (~1/3 journal + slightly

    > >> less than 1/3 sync of metadata from journal, slightly less as some

    > >> journal entries get canceled), the remainder 1/3 of the win comes

    > >> from reading small files from the SSD vs. HDDs (about 25-30% of

    > >> our file population is <=256k; depending on the cluster).  To be

    > >> clear, we don't split files, we store all data blocks of the files

    > >> either entirely on the SSD (e.g. small files <=256k) and the rest

    > >> on the real-time HDD device.  The basic principal here being that,

    > >> larger files MIGHT have small IOPs to them (in our use-case this

    > >> happens to be rare, but not impossible), but small files always

    > >> do, and when 25-30% of your population is small...that's a big

    > >> chunk of your IOPs.

    > > 

    > > So here's a test for you. Make a device with a SSD as the first 1TB,

    > > and you HDD as the rest (use dm to do this). Then use the inode32

    > > allocator (mount option) to split metadata from data. The filesysetm

    > > will keep inodes/directories on the SSD and file data on the HDD

    > > automatically.

    > > 

    > > Better yet: have data allocations smaller than stripe units target

    > > metadata prefferred AGs (i.e. the SSD region) and allocations larger

    > > than stripe unit target the data-preferred AGs. Set the stripe unit

    > > to match your SSD/HDD threshold....

    > > 

    > > [snip]

    > > 

    > >> The AG based could work, though it's going to be a very hard sell

    > >> to use dm mapper, this isn't code we have ever used in our storage

    > >> stack.  At our scale, there are important operational reasons we

    > >> need to keep the storage stack simple (less bugs to hit), so

    > >> keeping the solution contained within XFS is a necessary

    > >> requirement for us.

    > > 

    
    I am obviously not at all familiar with your storage stack and the
    requirements of your environment and whatnoat. It's certainly possible
    that there's some technical reason you can't use dm, but I find it very
    hard to believe that reason is "there might be bugs" if you're instead
    willing to hack up and deploy a barely tested feature such as XFS RT.
    Using dm for basic linear mapping (i.e., partitioning) seems pretty much
    ubiquitous in the Linux world these days.
    
Bugs aren’t the only reason of course, but we’ve been working on this for a number of months, we also have thousands of production hours (* >10 FSes per system == >1M hours on the real-time code) on this setup, I’m also doing more testing with dm-flaky + dm-log w/ xfs-tests along with this.  In any event, large deviations (or starting over from scratch) on our setup isn’t something we’d like to do.  At this point I trust the RT allocator a good amount, and its sheer simplicity is something of an asset for us.

To be honest, if an AG allocator solution were available, I’d have to think carefully about whether it would make sense for us (though I’d be willing to help test/create it).  Once you have the small files filtered out to an SSD, you can dramatically increase the extent sizes on the RT FS (you don’t waste space on small allocations), yielding very dependable/contiguous read/write IOs (we want multi-MB average IOs), and the dependable latencies mesh well with the needs of a distributed FS.  I’d need to make sure these characteristics were achievable with the AG-based allocator (yes, there is the “allocsize” option, but it’s more of a suggestion than the hard guarantee of the RT extents); its complexity also makes developers prone to treating it as a “black box” and ending up with less than stellar IO efficiencies.
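
For reference, the two knobs being contrasted there are the fixed realtime extent size set at mkfs time versus the allocsize= mount option on a normal data device; roughly (sizes are just for illustration):

# RT section with a fixed 4MB allocation unit: every RT file allocation is
# done in multiples of this, a hard guarantee rather than a hint
$ mkfs.xfs -r rtdev=/dev/sdc,extsize=4m /dev/sdb

# whereas on a plain data device, allocsize= only biases speculative
# preallocation beyond EOF; it doesn't guarantee extent sizes
$ mount -o allocsize=4m /dev/sdb /mnt/test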

> [snip]
>
> I think Dave's more after the data point of how much basic metadata/data
> separation helps your workload. This is an experiment you can run to get
> that behavior without having to write any code (maybe a little for the
> stripe unit thing ;). If there's a physical device size limitation,
> perhaps you can do something crazy like create a sparse 1TB file on the
> SSD, map that to a block device over loop or something and proceed from
> there.

We have a very good idea on this already; we also have data for a 7 day period when we simply did MD offload to SSD alone.  Prior to even doing this setup, we used blktrace and examined all the metadata IO requests (e.g. per the RWBS field).  It’s about 60-65% of the IO savings; the remaining ~35% is from the small file IO.  For us, it’s worth saving.
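
For anyone wanting to reproduce that kind of accounting, the RWBS column blkparse emits carries an 'M' for metadata I/O, so something like the following (device name and runtime are illustrative) gives a rough metadata vs. data breakdown:

# trace request issues on the HDD for 60s, then bucket them by RWBS flags
$ blktrace -d /dev/sdb -a issue -w 60 -o trace
$ blkparse -i trace -f "%d\n" | sort | uniq -c | sort -rn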

With regard to performance, we observe average 50%+ drops in latency for nearly all IO requests; the smaller IO requests should improve quite a bit more, but we need to change our threading model a bit to take advantage of the fact that the small files are on the SSDs (and therefore don’t need to wait behind other requests coming from the HDDs).
    
Richard Wareing Sept. 3, 2017, 3:31 a.m. UTC | #12
Quick correction: it should be >100k hours on the RT code, not >1M (maths is hard); we’ll get to 1M soon, but not there yet ;).

Richard


On 9/2/17, 5:44 PM, "linux-xfs-owner@vger.kernel.org on behalf of Richard Wareing" <linux-xfs-owner@vger.kernel.org on behalf of rwareing@fb.com> wrote:

    
    
> [snip]
>
> Bugs aren’t the only reason of course, but we’ve been working on this
> for a number of months, we also have thousands of production hours
> (* >10 FSes per system == >1M hours on the real-time code) on this
> setup, I’m also doing more testing with dm-flaky + dm-log w/
> xfs-tests along with this.
>
> [snip]


Dave Chinner Sept. 4, 2017, 1:17 a.m. UTC | #13
[ Richard, can you please fix your quoting and line wrapping to
work like everyone else's mail clients?]

On Sun, Sep 03, 2017 at 12:43:57AM +0000, Richard Wareing wrote:
> On 9/2/17, 4:55 AM, "Brian Foster" <bfoster@redhat.com> wrote:
>  >  I am obviously not at all familiar with your storage stack and
>  >  the requirements of your environment and whatnoat. It's
>  >  certainly possible that there's some technical reason you
>  >  can't use dm, but I find it very hard to believe that reason
>  >  is "there might be bugs" if you're instead willing to hack up
>  >  and deploy a barely tested feature such as XFS RT.  Using dm
>  >  for basic linear mapping (i.e., partitioning) seems pretty
>  >  much ubiquitous in the Linux world these days.
>     
> Bugs aren’t the only reason of course, but we’ve been
> working on this for a number of months, we also have thousands of
> production hours (* >10 FSes per system == >1M hours on the
> real-time code) on this setup, I’m also doing more testing
> with dm-flaky + dm-log w/ xfs-tests along with this.  In any
> event, large deviations (or starting over from scratch) on our
> setup isn’t something we’d like to do.  At this point I
> trust the RT allocator a good amount, and its sheer simplicity is
> something of an asset for us.

I'm just going to address the "rt dev is stable and well tested"
claim here.


I have my doubts you're actually testing what you think you are
testing with xfstests. Just configuring a rtdev doesn't mean
xfstests runs all it's tests on the rtdev. All it means is it runs
the very few tests that require a rtdev in addition to all the other
tests it runs against the normal data device.

If you really want to test rtdev functionality, you need to use the
"-d rtinherit" mkfs option to force all file data to be targetted at
the rtdev, not the data dev.

And when you do that, the rtdev blows up in 3 different ways in
under 30s, the third being a fatal kernel OOPS....

i.e.: Test device setup:

$ mkfs.xfs -f -r rtdev=/dev/ram0 -d rtinherit=1 /dev/pmem0

xfstests config section:

[xfs_rt]
FSTYP=xfs
TEST_DIR=/mnt/test
TEST_DEV=/dev/pmem0
TEST_RTDEV=/dev/ram0
SCRATCH_MNT=/mnt/scratch
SCRATCH_DEV=/dev/pmem1
SCRATCH_RTDEV=/dev/ram1
MKFS_OPTIONS="-d rtinherit=1"


And the result of running:

# ./check -g quick -s xfs_rt
SECTION       -- xfs_rt
FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 test4 4.13.0-rc7-dgc
MKFS_OPTIONS  -- -f -d rtinherit=1 /dev/pmem1
MOUNT_OPTIONS -- /dev/pmem1 /mnt/scratch

generic/001 3s ... 3s
generic/002 0s ... 1s
generic/003 10s ... - output mismatch (see /home/dave/src/xfstests-dev/results//xfs_rt/generic/003.out.bad)
    --- tests/generic/003.out   2014-02-24 09:58:09.505184325 +1100
    +++ /home/dave/src/xfstests-dev/results//xfs_rt/generic/003.out.bad 2017-09-04 10:19:07.609694351 +1000
    @@ -1,2 +1,27 @@
     QA output created by 003
    +./tests/generic/003: line 93: echo: write error: No space left on device
    +stat: cannot stat '/mnt/scratch/dir1/file1': Structure needs cleaning
    +ERROR: access time has changed for file1 after remount
    +ERROR: modify time has changed for file1 after remount
    +ERROR: change time has changed for file1 after remount
    +./tests/generic/003: line 120: echo: write error: No space left on device
    ...
    (Run 'diff -u tests/generic/003.out /home/dave/src/xfstests-dev/results//xfs_rt/generic/003.out.bad'  to see the entire diff)
_check_xfs_filesystem: filesystem on /dev/pmem1 is inconsistent (r)
(see /home/dave/src/xfstests-dev/results//xfs_rt/generic/003.full for details)
_check_dmesg: something found in dmesg (see /home/dave/src/xfstests-dev/results//xfs_rt/generic/003.dmesg)

[352996.421261] run fstests generic/003 at 2017-09-04 10:18:57
[352996.669490] XFS (pmem1): Unmounting Filesystem
[352996.714422] XFS (pmem1): Mounting V5 Filesystem
[352996.718122] XFS (pmem1): Ending clean mount
[352996.745512] XFS (pmem1): Unmounting Filesystem
[352996.780789] XFS (pmem1): Mounting V5 Filesystem
[352996.783980] XFS (pmem1): Ending clean mount
[352998.825234] XFS (pmem1): Unmounting Filesystem
[352998.839376] XFS (pmem1): Mounting V5 Filesystem
[352998.842762] XFS (pmem1): Ending clean mount
[352998.847718] XFS (pmem1): corrupt dinode 100, has realtime flag set.
[352998.848716] ffff88013b348800: 49 4e 81 a4 03 02 00 00 00 00 00 00 00 00 00 00  IN..............
[352998.851393] ffff88013b348810: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
[352998.852738] ffff88013b348820: 59 ac 9b f2 2e cf 4e 87 59 ac 9b f1 2e 91 a7 2b  Y.....N.Y......+
[352998.854168] ffff88013b348830: 59 ac 9b f1 2e 91 a7 2b 00 00 00 00 00 00 00 00  Y......+........
[352998.855514] XFS (pmem1): Internal error xfs_iformat(realtime) at line 94 of file fs/xfs/libxfs/xfs_inode_fork.c.  Caller xfs_iread+0x1cf/0x230
[352998.857637] CPU: 3 PID: 7470 Comm: stat Tainted: G        W       4.13.0-rc7-dgc #45
[352998.858833] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[352998.860092] Call Trace:
[352998.860492]  dump_stack+0x63/0x8f
[352998.861052]  xfs_corruption_error+0x87/0x90
[352998.861711]  ? xfs_iread+0x1cf/0x230
[352998.862270]  xfs_iformat_fork+0x390/0x690
[352998.862896]  ? xfs_iread+0x1cf/0x230
[352998.863454]  ? xfs_inode_from_disk+0x35/0x230
[352998.864132]  xfs_iread+0x1cf/0x230
[352998.864672]  xfs_iget+0x518/0xa40
[352998.865221]  xfs_lookup+0xd6/0x100
[352998.865755]  xfs_vn_lookup+0x4c/0x90
[352998.866316]  lookup_slow+0x96/0x150
[352998.866860]  walk_component+0x19a/0x330
[352998.867454]  ? path_init+0x1dc/0x330
[352998.868011]  path_lookupat+0x64/0x1f0
[352998.868581]  filename_lookup+0xa9/0x170
[352998.869192]  ? filemap_map_pages+0x152/0x290
[352998.869853]  user_path_at_empty+0x36/0x40
[352998.870474]  ? user_path_at_empty+0x36/0x40
[352998.871130]  vfs_statx+0x67/0xc0
[352998.871635]  SYSC_newlstat+0x2e/0x50
[352998.872200]  ? trace_do_page_fault+0x41/0x140
[352998.872871]  SyS_newlstat+0xe/0x10
[352998.873423]  entry_SYSCALL_64_fastpath+0x1a/0xa5
[352998.874140] RIP: 0033:0x7f75730690e5
[352998.874699] RSP: 002b:00007ffdcad5e878 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
[352998.875856] RAX: ffffffffffffffda RBX: 00007ffdcad5ea68 RCX: 00007f75730690e5
[352998.876975] RDX: 00007ffdcad5e8b0 RSI: 00007ffdcad5e8b0 RDI: 00007ffdcad5fc9a
[352998.878072] RBP: 0000000000000004 R08: 0000000000000100 R09: 0000000000000000
[352998.879154] R10: 00000000000001cb R11: 0000000000000246 R12: 000056423451cc80
[352998.880233] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[352998.881548] XFS (pmem1): Corruption detected. Unmount and run xfs_repair
[352998.882581] XFS (pmem1): xfs_iread: xfs_iformat() returned error -117
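
(For reference, the "corrupt dinode ... has realtime flag set" error
above comes from the inode fork verification in
fs/xfs/libxfs/xfs_inode_fork.c: an inode carrying the REALTIME flag is
rejected as corrupt if the mount has no realtime device attached. A
rough sketch of that check, approximating the 4.13-era code rather
than quoting it exactly:

	/*
	 * Approximate sketch of the check that fires above: an inode
	 * flagged as realtime is only valid if the mount actually has
	 * a realtime device; otherwise treat the dinode as corrupt.
	 */
	if (unlikely((ip->i_d.di_flags & XFS_DIFLAG_REALTIME) &&
		     !ip->i_mount->m_rtdev_targp)) {
		xfs_warn(ip->i_mount,
			"corrupt dinode %Lu, has realtime flag set.",
			ip->i_ino);
		XFS_CORRUPTION_ERROR("xfs_iformat(realtime)",
				     XFS_ERRLEVEL_LOW, ip->i_mount, dip);
		return -EFSCORRUPTED;
	}

This is the same inconsistency xfs_repair complains about further down
with "has RT flag set but there is no RT device".)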

The second blowup is:

generic/015 1s ... [failed, exit status 1] - output mismatch (see /home/dave/src/xfstests-dev/results//xfs_rt/generic/015.out.bad)
    --- tests/generic/015.out   2014-01-20 16:57:33.965658221 +1100
    +++ /home/dave/src/xfstests-dev/results//xfs_rt/generic/015.out.bad 2017-09-04 10:19:17.998113907 +1000
    @@ -2,6 +2,5 @@
     fill disk:
        !!! disk full (expected)
     check free space:
    -delete fill:
    -check free space:
    -   !!! free space is in range
    +   *** file created with zero length
    ...
    (Run 'diff -u tests/generic/015.out /home/dave/src/xfstests-dev/results//xfs_rt/generic/015.out.bad'  to see the entire diff)
_check_xfs_filesystem: filesystem on /dev/pmem1 is inconsistent (r)
(see /home/dave/src/xfstests-dev/results//xfs_rt/generic/015.full for details)

Which may or may not be an xfstests problem, because repair blows
up with:

.....
inode 96 has RT flag set but there is no RT device
inode 99 has RT flag set but there is no RT device
inode 96 has RT flag set but there is no RT device
would fix bad flags.
inode 99 has RT flag set but there is no RT device
would fix bad flags.
found inode 99 claiming to be a real-time file
.....

And the third is:

[353017.737976] run fstests generic/018 at 2017-09-04 10:19:18
[353017.956902] XFS (pmem1): Mounting V5 Filesystem
[353017.960672] XFS (pmem1): Ending clean mount
[353017.982836] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[353017.984077] IP: xfs_find_bdev_for_inode+0x2b/0x30
[353017.984873] PGD 0 
[353017.984874] P4D 0 

[353017.985788] Oops: 0000 [#1] PREEMPT SMP
[353017.986412] CPU: 9 PID: 15847 Comm: xfs_io Tainted: G        W       4.13.0-rc7-dgc #45
[353017.987641] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[353017.988932] task: ffff880236955740 task.stack: ffffc90007878000
[353017.989853] RIP: 0010:xfs_find_bdev_for_inode+0x2b/0x30
[353017.990666] RSP: 0018:ffffc9000787bc88 EFLAGS: 00010202
[353017.991466] RAX: 0000000000000000 RBX: ffffc9000787bd70 RCX: 000000000000000c
[353017.992584] RDX: 0000000000000001 RSI: fffffffffffffffe RDI: ffff8808280891e8
[353017.993657] RBP: ffffc9000787bcb0 R08: 0000000000000009 R09: ffff8808280890c8
[353017.994726] R10: 000000000000034e R11: ffff880236955740 R12: ffff880828089080
[353017.995808] R13: ffffc9000787bd08 R14: ffff88080a8de000 R15: ffff88080a8de000
[353017.996905] FS:  00007ff336cb21c0(0000) GS:ffff88023fd00000(0000) knlGS:0000000000000000
[353017.998114] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[353017.998984] CR2: 0000000000000008 CR3: 000000022dc09000 CR4: 00000000000406e0
[353018.000049] Call Trace:
[353018.000465]  ? xfs_bmbt_to_iomap+0x78/0xb0
[353018.001097]  xfs_file_iomap_begin+0x265/0x990
[353018.001770]  iomap_apply+0x48/0xe0
[353018.002300]  ? iomap_write_end+0x70/0x70
[353018.002909]  iomap_fiemap+0x9e/0x100
[353018.003471]  ? iomap_write_end+0x70/0x70
[353018.004085]  xfs_vn_fiemap+0x5c/0x80
[353018.004668]  do_vfs_ioctl+0x450/0x5c0
[353018.005233]  SyS_ioctl+0x79/0x90
[353018.005735]  entry_SYSCALL_64_fastpath+0x1a/0xa5
[353018.006440] RIP: 0033:0x7ff336390dc7
[353018.007000] RSP: 002b:00007fff1b806b38 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[353018.008154] RAX: ffffffffffffffda RBX: 0000000000000063 RCX: 00007ff336390dc7
[353018.009241] RDX: 0000558a334476a0 RSI: 00000000c020660b RDI: 0000000000000003
[353018.010314] RBP: 0000000000002710 R08: 0000000000000003 R09: 000000000000001d
[353018.011396] R10: 000000000000034e R11: 0000000000000246 R12: 0000000000001010
[353018.012479] R13: 00007ff336647b58 R14: 0000558a33447dc0 R15: 00007ff336647b00
[353018.013554] Code: 66 66 66 66 90 f6 47 da 01 55 48 89 e5 48 8b 87 98 fe ff ff 75 0d 48 8b 80 38 02 00 00 5d 48 8b 40 08 c3 48 8b 80 48 02 00 00 5d <48> 8b 40 08 c3 66 66 66 66 90 55 48 89 e5 41 57 41 56 41 55 41 
[353018.016404] RIP: xfs_find_bdev_for_inode+0x2b/0x30 RSP: ffffc9000787bc88
[353018.017420] CR2: 0000000000000008
[353018.018024] ---[ end trace af08c2af09ff5975 ]---


A null pointer dereference in generic/018, at which point the system
needs rebooting to recover.
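
(The crash is consistent with the realtime branch of
xfs_find_bdev_for_inode() dereferencing a NULL mp->m_rtdev_targp when
an RT-flagged inode is mapped on a mount that has no rtdev configured.
Roughly, approximating the 4.13-era helper rather than quoting it
verbatim:

struct block_device *
xfs_find_bdev_for_inode(
	struct inode		*inode)
{
	struct xfs_inode	*ip = XFS_I(inode);
	struct xfs_mount	*mp = ip->i_mount;

	if (XFS_IS_REALTIME_INODE(ip))
		/* NULL pointer dereference if no rtdev was ever set up */
		return mp->m_rtdev_targp->bt_bdev;
	else
		return mp->m_ddev_targp->bt_bdev;
}

That would also line up with the faulting address of
0000000000000008 above: bt_bdev sitting a few bytes into a NULL
xfs_buftarg.)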

So, yeah, the rtdev is not stable, not robust and not very well
maintained at this point. If you want to focus new development on
the RT device, then the first thing we need is fixes for all its
obvious problems. Get it working reliably upstream first so we have
a good baseline from which we can evaluate enhancements sanely...

Cheers,

Dave.

Patch

=======

- Adds the rtdefault mount option to default data block allocations to
the real-time device. This removes the need for ioctl calls or
inheritance bits to get files to flow to the real-time device.
- Enables XFS to store FS metadata on the non-RT device (e.g. SSD)
while storing data blocks on the real-time device. No application code
changes are required: install the kernel, format, mount and profit.
---
fs/xfs/xfs_inode.c |  8 ++++++++
fs/xfs/xfs_mount.h |  5 +++++
fs/xfs/xfs_super.c | 13 ++++++++++++-
3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ec9826c..1611195 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -873,6 +873,14 @@  xfs_ialloc(
		break;
	case S_IFREG:
	case S_IFDIR:
+		/* Set flags if we are defaulting to real-time device */
+		if (mp->m_rtdev_targp != NULL &&
+		   mp->m_flags & XFS_MOUNT_RTDEFAULT) {
+			if (S_ISDIR(mode))
+				ip->i_d.di_flags |= XFS_DIFLAG_RTINHERIT;
+			else if (S_ISREG(mode))
+				ip->i_d.di_flags |= XFS_DIFLAG_REALTIME;
+		}
		if (pip && (pip->i_d.di_flags & XFS_DIFLAG_ANY)) {
			uint64_t	di_flags2 = 0;
			uint		di_flags = 0;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 9fa312a..da25398 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -243,6 +243,11 @@  typedef struct xfs_mount {
						   allocator */
#define XFS_MOUNT_NOATTR2	(1ULL << 25)	/* disable use of attr2 format */

+/* FB Real-time device options */
+#define XFS_MOUNT_RTDEFAULT	(1ULL << 61)	/* Always allocate blocks from
+						 * RT device
+						 */
+
#define XFS_MOUNT_DAX		(1ULL << 62)	/* TEST ONLY! */


diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 455a575..e4f85a9 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -83,7 +83,7 @@  enum {
	Opt_quota, Opt_noquota, Opt_usrquota, Opt_grpquota, Opt_prjquota,
	Opt_uquota, Opt_gquota, Opt_pquota,
	Opt_uqnoenforce, Opt_gqnoenforce, Opt_pqnoenforce, Opt_qnoenforce,
-	Opt_discard, Opt_nodiscard, Opt_dax, Opt_err,
+	Opt_discard, Opt_nodiscard, Opt_dax, Opt_rtdefault, Opt_err,
};

static const match_table_t tokens = {
@@ -133,6 +133,9 @@  static const match_table_t tokens = {

	{Opt_dax,	"dax"},		/* Enable direct access to bdev pages */

+#ifdef CONFIG_XFS_RT
+	{Opt_rtdefault,	"rtdefault"},	/* Default to real-time device */
+#endif
	/* Deprecated mount options scheduled for removal */
	{Opt_barrier,	"barrier"},	/* use writer barriers for log write and
					 * unwritten extent conversion */
@@ -367,6 +370,11 @@  xfs_parseargs(
		case Opt_nodiscard:
			mp->m_flags &= ~XFS_MOUNT_DISCARD;
			break;
+#ifdef CONFIG_XFS_RT
+		case Opt_rtdefault:
+			mp->m_flags |= XFS_MOUNT_RTDEFAULT;
+			break;
+#endif
#ifdef CONFIG_FS_DAX
		case Opt_dax:
			mp->m_flags |= XFS_MOUNT_DAX;
@@ -492,6 +500,9 @@  xfs_showargs(
		{ XFS_MOUNT_DISCARD,		",discard" },
		{ XFS_MOUNT_SMALL_INUMS,	",inode32" },
		{ XFS_MOUNT_DAX,		",dax" },
+#ifdef CONFIG_XFS_RT
+		{ XFS_MOUNT_RTDEFAULT,          ",rtdefault" },
+#endif
		{ 0, NULL }
	};
	static struct proc_xfs_info xfs_info_unset[] = {