Message ID | 20211025192746.66445-1-krisman@collabora.com (mailing list archive) |
---|---|
Headers | show |
Series | file system-wide error monitoring | expand |
On Mon, Oct 25, 2021 at 10:27 PM Gabriel Krisman Bertazi <krisman@collabora.com> wrote: > > Hi, > > This is the 9th version of this patch series. Thank you, Amir, Jan and > Ted, for the feedback in the previous versions. > > The main difference in this version is that the pool is no longer > resizeable nor limited in number of marks, even though we only > pre-allocate 32 slots. In addition, ext4 was modified to always return > non-zero errno, and the documentation was fixed accordingly (No longer > suggests we return EXT4_ERR* values. > > I also droped the Reviewed-by tags from the ext4 patch, due to the > changes above. > > Please let me know what you think. > All good on my end. I've made a couple of minor comments that could be addressed on commit if no other issues are found. Thanks, Amir.
On Tue 26-10-21 12:12:38, Amir Goldstein wrote: > On Mon, Oct 25, 2021 at 10:27 PM Gabriel Krisman Bertazi > <krisman@collabora.com> wrote: > > > > Hi, > > > > This is the 9th version of this patch series. Thank you, Amir, Jan and > > Ted, for the feedback in the previous versions. > > > > The main difference in this version is that the pool is no longer > > resizeable nor limited in number of marks, even though we only > > pre-allocate 32 slots. In addition, ext4 was modified to always return > > non-zero errno, and the documentation was fixed accordingly (No longer > > suggests we return EXT4_ERR* values. > > > > I also droped the Reviewed-by tags from the ext4 patch, due to the > > changes above. > > > > Please let me know what you think. > > > > All good on my end. > I've made a couple of minor comments that > could be addressed on commit if no other issues are found. All good on my end as well. I've applied all the minor updates, tested the result and pushed it out to fsnotify branch of my tree. WRT to your new FS_ERROR LTP tests, I've noticed that the testcases 1 and 3 from test fanotify20 fail for me. After a bit of debugging this seems to be a bug in ext4 where it calls ext4_abort() with EXT4_ERR_ESHUTDOWN instead of plain ESHUTDOWN. Not sure if you have that fixed or how come the tests passed for you. After fixing that ext4 bug everything passes for me. Honza
On Wed, Oct 27, 2021 at 2:22 PM Jan Kara <jack@suse.cz> wrote: > > On Tue 26-10-21 12:12:38, Amir Goldstein wrote: > > On Mon, Oct 25, 2021 at 10:27 PM Gabriel Krisman Bertazi > > <krisman@collabora.com> wrote: > > > > > > Hi, > > > > > > This is the 9th version of this patch series. Thank you, Amir, Jan and > > > Ted, for the feedback in the previous versions. > > > > > > The main difference in this version is that the pool is no longer > > > resizeable nor limited in number of marks, even though we only > > > pre-allocate 32 slots. In addition, ext4 was modified to always return > > > non-zero errno, and the documentation was fixed accordingly (No longer > > > suggests we return EXT4_ERR* values. > > > > > > I also droped the Reviewed-by tags from the ext4 patch, due to the > > > changes above. > > > > > > Please let me know what you think. > > > > > > > All good on my end. > > I've made a couple of minor comments that > > could be addressed on commit if no other issues are found. > > All good on my end as well. I've applied all the minor updates, tested the > result and pushed it out to fsnotify branch of my tree. WRT to your new > FS_ERROR LTP tests, I've noticed that the testcases 1 and 3 from test > fanotify20 fail for me. After a bit of debugging this seems to be a bug in > ext4 where it calls ext4_abort() with EXT4_ERR_ESHUTDOWN instead of plain > ESHUTDOWN. Not sure if you have that fixed or how come the tests passed for > you. After fixing that ext4 bug everything passes for me. > Gabriel mentioned that bug in the cover letter of the LTP series :-) https://lore.kernel.org/linux-ext4/20211026173302.84000-1-krisman@collabora.com/T/#u Thanks, Amir.
Amir Goldstein <amir73il@gmail.com> writes: > On Wed, Oct 27, 2021 at 2:22 PM Jan Kara <jack@suse.cz> wrote: >> >> On Tue 26-10-21 12:12:38, Amir Goldstein wrote: >> > On Mon, Oct 25, 2021 at 10:27 PM Gabriel Krisman Bertazi >> > <krisman@collabora.com> wrote: >> > > >> > > Hi, >> > > >> > > This is the 9th version of this patch series. Thank you, Amir, Jan and >> > > Ted, for the feedback in the previous versions. >> > > >> > > The main difference in this version is that the pool is no longer >> > > resizeable nor limited in number of marks, even though we only >> > > pre-allocate 32 slots. In addition, ext4 was modified to always return >> > > non-zero errno, and the documentation was fixed accordingly (No longer >> > > suggests we return EXT4_ERR* values. >> > > >> > > I also droped the Reviewed-by tags from the ext4 patch, due to the >> > > changes above. >> > > >> > > Please let me know what you think. >> > > >> > >> > All good on my end. >> > I've made a couple of minor comments that >> > could be addressed on commit if no other issues are found. >> >> All good on my end as well. I've applied all the minor updates, tested the >> result and pushed it out to fsnotify branch of my tree. WRT to your new >> FS_ERROR LTP tests, I've noticed that the testcases 1 and 3 from test >> fanotify20 fail for me. After a bit of debugging this seems to be a bug in >> ext4 where it calls ext4_abort() with EXT4_ERR_ESHUTDOWN instead of plain >> ESHUTDOWN. Not sure if you have that fixed or how come the tests passed for >> you. After fixing that ext4 bug everything passes for me. >> > > Gabriel mentioned that bug in the cover letter of the LTP series :-) > https://lore.kernel.org/linux-ext4/20211026173302.84000-1-krisman@collabora.com/T/#u Yes :) Also, thank you both for the extensive review and ideas during the development of this series. It was really appreciated! I'm sending out the new version for tests + man pages today.
> Also, thank you both for the extensive review and ideas during the > development of this series. It was really appreciated! > Thank you for your appreciated effort! It was a wild journey through some interesting experiments, but you survived it well ;-) Would you be interested in pursuing FAN_WB_ERROR after a due rest and after all the dust on FAN_FS_ERROR has settled? FAN_WB_ERROR can use the same info record and same internal error event struct as FAN_FS_ERROR. A call to fsnotify_wb_error(sb, inode, err) can be placed inside mapping_set_error() and maybe for other sporadic callers of errseq_set(). For wb error, we can consider storing a snapshot of errseq of the sb/inode in the sb/inode mark and compute error_count from the errseq diff instead of counting it when merging events. This will keep a more accurate report even when error reports are dropped due to allocation failure or event queue overflow. I have a feeling that the Postgres project would find this functionality useful (?). Thanks, Amir.
Amir Goldstein <amir73il@gmail.com> writes: >> Also, thank you both for the extensive review and ideas during the >> development of this series. It was really appreciated! >> > > Thank you for your appreciated effort! > It was a wild journey through some interesting experiments, but > you survived it well ;-) > > Would you be interested in pursuing FAN_WB_ERROR after a due rest > and after all the dust on FAN_FS_ERROR has settled? I think it would make sense for me to continue working on it, yes. But, before that, I think I still have some support to add to FAN_FS_ERROR, like a detailed, fs-specific, info record, and an error location info record, which has a use-case in Google Cloud environments. I have to discuss priorities internally, but we (collabora) do have an interest in supporting WB_ERROR too. For the detailed error report, fanotify could have a new info record that carries a structure sent out by the file system. fanotify could handle the lifetime of this object, by keeping a larger mempool, or delegate its allocation/destruction to the filesystem. Like I proposed in an earlier version of FAN_FS_ERROR, the format could be as simple as: struct fanotify_error_data_info { struct fanotify_event_info_header hdr; char data[]; } I think xfs, at least, would be able to make good use of this record with xfs_scrub, as the xfs maintainers mentioned.
On Sat, Oct 30, 2021 at 1:24 AM Gabriel Krisman Bertazi <krisman@collabora.com> wrote: > > Amir Goldstein <amir73il@gmail.com> writes: > > >> Also, thank you both for the extensive review and ideas during the > >> development of this series. It was really appreciated! > >> > > > > Thank you for your appreciated effort! > > It was a wild journey through some interesting experiments, but > > you survived it well ;-) > > > > Would you be interested in pursuing FAN_WB_ERROR after a due rest > > and after all the dust on FAN_FS_ERROR has settled? > > I think it would make sense for me to continue working on it, yes. But, > before that, I think I still have some support to add to FAN_FS_ERROR, > like a detailed, fs-specific, info record, and an error location info > record, which has a use-case in Google Cloud environments. I have to > discuss priorities internally, but we (collabora) do have an interest in > supporting WB_ERROR too. > > For the detailed error report, fanotify could have a new info record > that carries a structure sent out by the file system. fanotify could > handle the lifetime of this object, by keeping a larger mempool, or > delegate its allocation/destruction to the filesystem. > Before you try anything radical, please check the size of prospect fs-specific data. My hunch says that in most cases fs-specific data could fit cozy along side the file handle within MAX_HANDLE_SZ and if this is true, then we do not need to worry about extreme cases right now. If there comes a time when we have a justified case of a filesystem that needs to report much bigger fs-specific data, we can consider it then. Until that time, we simply drop the over sized fs-specific data same as we do if filesystem passed in a file handle larger than MAX_HANDLE_SZ. > Like I proposed in an earlier version of FAN_FS_ERROR, the format could > be as simple as: > > struct fanotify_error_data_info { > struct fanotify_event_info_header hdr; > char data[]; > } > We can add char data[] field to the end of struct fanotify_event_info_error. It does not change the layout nor size of the structure and the info record is variable size per definition anyway. I know Jan didn't like this so much at the time and contemplated a separate info record for filename, but eventually, fanotify_event_info_fid also has an optional name following the unsigned char handle[]. > I think xfs, at least, would be able to make good use of this record with > xfs_scrub, as the xfs maintainers mentioned. I am not sure if that was the final conclusion. xfs_scrub is proactive and should have no problem reporting its own findings, but I have no objections to fs-specific details in FS_ERROR event. Thanks, Amir.