Discussion:
[PATCH] kernel: audit_tree: Fix a sleep-in-atomic-context bug
Matthew Wilcox
2018-06-21 04:29:12 UTC
The kernel may sleep while holding a spinlock.
[FUNC] kmem_cache_alloc(GFP_KERNEL)
kmem_cache_alloc in fsnotify_attach_connector_to_object
fsnotify_attach_connector_to_object in fsnotify_add_mark_list
fsnotify_add_mark_list in fsnotify_add_mark_locked
fsnotify_add_mark_locked in tag_chunk
spin_lock in tag_chunk
There are several locks here; your report would be improved by saying
which one is the problem. I'm assuming it's old_entry->lock.

spin_lock(&old_entry->lock);
...
if (fsnotify_add_inode_mark_locked(chunk_entry,
                                   old_entry->connector->inode, 1)) {
...
return fsnotify_add_mark_locked(mark, inode, NULL, allow_dups);
...
ret = fsnotify_add_mark_list(mark, inode, mnt, allow_dups);
...
if (inode)
        connp = &inode->i_fsnotify_marks;
conn = fsnotify_grab_connector(connp);
if (!conn) {
        err = fsnotify_attach_connector_to_object(connp, inode, mnt);

It seems to me that this is safe because old_entry is looked up from
fsnotify_find_mark, and it can't be removed while its lock is held.
Therefore there's always a 'conn' returned from fsnotify_grab_connector(),
and so this path will never be taken.

But this code path is confusing to me, and I could be wrong. Jan, can you
confirm that my analysis is correct?
[FUNC] kmem_cache_alloc(GFP_KERNEL)
kmem_cache_alloc in fsnotify_attach_connector_to_object
fsnotify_attach_connector_to_object in fsnotify_add_mark_list
fsnotify_add_mark_list in fsnotify_add_mark_locked
fsnotify_add_mark_locked in untag_chunk
spin_lock in untag_chunk
I'm just going to assume this one is the same.
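For reference, the class of bug the report above describes is a GFP_KERNEL
allocation reached with a spinlock held: GFP_KERNEL allocations may sleep,
and sleeping is not allowed in atomic context. A minimal, hypothetical
sketch of the pattern and the conventional fix (illustrative code only, not
the audit_tree/fsnotify code; struct foo and the function names are made up):

#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct foo {
        spinlock_t lock;
        void *data;
};

/* Buggy: kmem_cache_alloc(GFP_KERNEL) may sleep, but holding the spinlock
 * puts us in atomic context. */
static int foo_attach_buggy(struct foo *f, struct kmem_cache *cache)
{
        void *obj;

        spin_lock(&f->lock);
        obj = kmem_cache_alloc(cache, GFP_KERNEL);  /* may sleep under lock */
        if (!obj) {
                spin_unlock(&f->lock);
                return -ENOMEM;
        }
        f->data = obj;
        spin_unlock(&f->lock);
        return 0;
}

/* Conventional fix: allocate before taking the lock (or use GFP_ATOMIC
 * under the lock if preallocation is not possible). */
static int foo_attach_fixed(struct foo *f, struct kmem_cache *cache)
{
        void *obj = kmem_cache_alloc(cache, GFP_KERNEL);

        if (!obj)
                return -ENOMEM;
        spin_lock(&f->lock);
        f->data = obj;
        spin_unlock(&f->lock);
        return 0;
}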
Paul Moore
2018-06-22 18:56:09 UTC
Post by Matthew Wilcox
The kernel may sleep while holding a spinlock.
[FUNC] kmem_cache_alloc(GFP_KERNEL)
kmem_cache_alloc in fsnotify_attach_connector_to_object
fsnotify_attach_connector_to_object in fsnotify_add_mark_list
fsnotify_add_mark_list in fsnotify_add_mark_locked
fsnotify_add_mark_locked in tag_chunk
spin_lock in tag_chunk
There are several locks here; your report would be improved by saying
which one is the problem. I'm assuming it's old_entry->lock.
spin_lock(&old_entry->lock);
...
if (fsnotify_add_inode_mark_locked(chunk_entry,
old_entry->connector->inode, 1)) {
...
return fsnotify_add_mark_locked(mark, inode, NULL, allow_dups);
...
ret = fsnotify_add_mark_list(mark, inode, mnt, allow_dups);
...
if (inode)
connp = &inode->i_fsnotify_marks;
conn = fsnotify_grab_connector(connp);
if (!conn) {
err = fsnotify_attach_connector_to_object(connp, inode, mnt);
It seems to me that this is safe because old_entry is looked up from
fsnotify_find_mark, and it can't be removed while its lock is held.
Therefore there's always a 'conn' returned from fsnotify_grab_connector(),
and so this path will never be taken.
But this code path is confusing to me, and I could be wrong. Jan, please
confirm my analysis is correct?
Yes, you are correct. The presence of another mark in the list (and the
fact we pin it there using refcount & mark_mutex) guarantees we won't need
to allocate the connector. I agree the audit code's use of fsnotify would
deserve some cleanup.
I'm always open to suggestions and patches (hint, hint) from the
fsnotify experts ;)
--
paul moore
www.paul-moore.com
Jan Kara
2018-06-25 09:22:57 UTC
Post by Paul Moore
Post by Matthew Wilcox
The kernel may sleep while holding a spinlock.
[FUNC] kmem_cache_alloc(GFP_KERNEL)
kmem_cache_alloc in fsnotify_attach_connector_to_object
fsnotify_attach_connector_to_object in fsnotify_add_mark_list
fsnotify_add_mark_list in fsnotify_add_mark_locked
fsnotify_add_mark_locked in tag_chunk
spin_lock in tag_chunk
There are several locks here; your report would be improved by saying
which one is the problem. I'm assuming it's old_entry->lock.
spin_lock(&old_entry->lock);
...
if (fsnotify_add_inode_mark_locked(chunk_entry,
old_entry->connector->inode, 1)) {
...
return fsnotify_add_mark_locked(mark, inode, NULL, allow_dups);
...
ret = fsnotify_add_mark_list(mark, inode, mnt, allow_dups);
...
if (inode)
connp = &inode->i_fsnotify_marks;
conn = fsnotify_grab_connector(connp);
if (!conn) {
err = fsnotify_attach_connector_to_object(connp, inode, mnt);
It seems to me that this is safe because old_entry is looked up from
fsnotify_find_mark, and it can't be removed while its lock is held.
Therefore there's always a 'conn' returned from fsnotify_grab_connector(),
and so this path will never be taken.
But this code path is confusing to me, and I could be wrong. Jan, please
confirm my analysis is correct?
Yes, you are correct. The presence of another mark in the list (and the
fact we pin it there using refcount & mark_mutex) guarantees we won't need
to allocate the connector. I agree the audit code's use of fsnotify would
deserve some cleanup.
I'm always open to suggestions and patches (hint, hint) from the
fsnotify experts ;)
Yeah, I was looking into it on Friday and today :). Currently I'm a bit
stuck because I think I've found some races in the audit_tree code and I
haven't yet decided how to fix them. E.g., am I right that the following
can happen?

CPU1                                        CPU2
tag_chunk(inode, tree1)                     tag_chunk(inode, tree2)
  old_entry = fsnotify_find_mark();           old_entry = fsnotify_find_mark();
  old = container_of(old_entry);              old = container_of(old_entry);
  chunk = alloc_chunk(old->count + 1);        chunk = alloc_chunk(old->count + 1);
  mutex_lock(&group->mark_mutex);
  adds new mark
  replaces chunk
  old->dead = 1;
  mutex_unlock(&group->mark_mutex);
                                              mutex_lock(&group->mark_mutex);
                                              if (!(old_entry->flags &
                                                    FSNOTIFY_MARK_FLAG_ATTACHED)) {
                                                check fails as old_entry is
                                                  not yet destroyed
                                              adds new mark
                                              replaces old chunk again ->
                                                list corruption, lost refs, ...
                                              mutex_unlock(&group->mark_mutex);

More generally, there's a bigger problem: the audit_tree code can have
multiple marks attached to one inode, but only one of them is the "valid"
one (i.e., the one embedded in the latest chunk). This is only a temporary
state, until fsnotify_destroy_mark() detaches the mark and we really remove
it from the inode's list on the last reference drop, but during that window
it is undefined which mark fsnotify_find_mark() returns...

So am I right that the above can really happen, or is there some
higher-level synchronization I'm missing? If it can really happen, I think
I'll need to rework the code so that audit_tree has just one mark attached
and have it point to the current chunk.
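A rough, purely illustrative sketch of that direction (the structure and
field names below are hypothetical, not the actual kernel data structures):

#include <linux/fsnotify_backend.h>

struct audit_chunk;

/*
 * Hypothetical layout: exactly one fsnotify mark stays attached to the
 * inode, and it points at whatever the current chunk is.  Replacing a
 * chunk then just updates the pointer (under the group's mark_mutex)
 * instead of detaching one mark and attaching another, so a lookup of
 * the mark can never return a stale one.
 */
struct audit_tree_mark {
        struct fsnotify_mark mark;      /* the single per-inode mark */
        struct audit_chunk *chunk;      /* current chunk for this inode */
};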

Honza
--
Jan Kara <***@suse.com>
SUSE Labs, CR
Jan Kara
2018-06-25 12:55:40 UTC
Post by Jan Kara
Post by Paul Moore
Post by Matthew Wilcox
The kernel may sleep while holding a spinlock.
[FUNC] kmem_cache_alloc(GFP_KERNEL)
kmem_cache_alloc in fsnotify_attach_connector_to_object
fsnotify_attach_connector_to_object in fsnotify_add_mark_list
fsnotify_add_mark_list in fsnotify_add_mark_locked
fsnotify_add_mark_locked in tag_chunk
spin_lock in tag_chunk
There are several locks here; your report would be improved by saying
which one is the problem. I'm assuming it's old_entry->lock.
spin_lock(&old_entry->lock);
...
if (fsnotify_add_inode_mark_locked(chunk_entry,
old_entry->connector->inode, 1)) {
...
return fsnotify_add_mark_locked(mark, inode, NULL, allow_dups);
...
ret = fsnotify_add_mark_list(mark, inode, mnt, allow_dups);
...
if (inode)
connp = &inode->i_fsnotify_marks;
conn = fsnotify_grab_connector(connp);
if (!conn) {
err = fsnotify_attach_connector_to_object(connp, inode, mnt);
It seems to me that this is safe because old_entry is looked up from
fsnotify_find_mark, and it can't be removed while its lock is held.
Therefore there's always a 'conn' returned from fsnotify_grab_connector(),
and so this path will never be taken.
But this code path is confusing to me, and I could be wrong. Jan, please
confirm my analysis is correct?
Yes, you are correct. The presence of another mark in the list (and the
fact we pin it there using refcount & mark_mutex) guarantees we won't need
to allocate the connector. I agree the audit code's use of fsnotify would
deserve some cleanup.
I'm always open to suggestions and patches (hint, hint) from the
fsnotify experts ;)
Yeah, I was looking into it on Friday and today :). Currently I've got a
bit stuck because I think I've found some races in audit_tree code and I
haven't yet decided how to fix them. E.g. am I right the following can
happen?
CPU1                                        CPU2
tag_chunk(inode, tree1)                     tag_chunk(inode, tree2)
  old_entry = fsnotify_find_mark();           old_entry = fsnotify_find_mark();
  old = container_of(old_entry);              old = container_of(old_entry);
  chunk = alloc_chunk(old->count + 1);        chunk = alloc_chunk(old->count + 1);
  mutex_lock(&group->mark_mutex);
  adds new mark
  replaces chunk
  old->dead = 1;
  mutex_unlock(&group->mark_mutex);
                                              mutex_lock(&group->mark_mutex);
                                              if (!(old_entry->flags &
                                                    FSNOTIFY_MARK_FLAG_ATTACHED)) {
                                                check fails as old_entry is
                                                  not yet destroyed
                                              adds new mark
                                              replaces old chunk again ->
                                                list corruption, lost refs, ...
                                              mutex_unlock(&group->mark_mutex);
Generally there's a bigger problem that audit_tree code can have multiple
marks attached to one inode but only one of them is the "valid" one (i.e.,
the one embedded in the latest chunk). This is only a temporary state until
fsnotify_destroy_mark() detaches the mark and then on last reference drop
we really remove the mark from inode's list but during that window it is
undefined which mark is returned from fsnotify_find_mark()...
So am I right the above can really happen or is there some higher level
synchronization I'm missing? If this can really happen, I think I'll need
to rework the code so that audit_tree has just one mark attached and
let it probably point to the current chunk.
Also, am I right to assume that if two tag_chunk() calls race, both try to
add a new fsnotify mark in create_chunk(), and one of them fails, then the
resulting ENOSPC error from create_chunk() is actually a bug? From looking
at the code, it seems the desired functionality is that tag_chunk() should
add 'tree' to the chunk, expanding the chunk as needed.
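In other words, the expectation would be a "lose the creation race and
retry" pattern rather than surfacing an error to the caller. A generic,
hypothetical sketch of that pattern (lookup_thing() and create_thing() are
made-up helpers, not audit_tree functions):

#include <linux/err.h>
#include <linux/errno.h>

struct thing;

/* Hypothetical helpers: find an existing object / create a new one.
 * create_thing() returns ERR_PTR(-EEXIST) if it loses the race. */
struct thing *lookup_thing(unsigned long id);
struct thing *create_thing(unsigned long id);

static struct thing *get_or_create_thing(unsigned long id)
{
        struct thing *t;

        for (;;) {
                t = lookup_thing(id);
                if (t)
                        return t;

                t = create_thing(id);
                if (!IS_ERR(t) || PTR_ERR(t) != -EEXIST)
                        return t;       /* success, or a genuine error */
                /* -EEXIST: a concurrent creator won; retry the lookup. */
        }
}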

Honza
--
Jan Kara <***@suse.com>
SUSE Labs, CR
Paul Moore
2018-06-25 22:25:12 UTC
Post by Jan Kara
Post by Paul Moore
Post by Matthew Wilcox
The kernel may sleep while holding a spinlock.
[FUNC] kmem_cache_alloc(GFP_KERNEL)
kmem_cache_alloc in fsnotify_attach_connector_to_object
fsnotify_attach_connector_to_object in fsnotify_add_mark_list
fsnotify_add_mark_list in fsnotify_add_mark_locked
fsnotify_add_mark_locked in tag_chunk
spin_lock in tag_chunk
There are several locks here; your report would be improved by saying
which one is the problem. I'm assuming it's old_entry->lock.
spin_lock(&old_entry->lock);
...
if (fsnotify_add_inode_mark_locked(chunk_entry,
old_entry->connector->inode, 1)) {
...
return fsnotify_add_mark_locked(mark, inode, NULL, allow_dups);
...
ret = fsnotify_add_mark_list(mark, inode, mnt, allow_dups);
...
if (inode)
connp = &inode->i_fsnotify_marks;
conn = fsnotify_grab_connector(connp);
if (!conn) {
err = fsnotify_attach_connector_to_object(connp, inode, mnt);
It seems to me that this is safe because old_entry is looked up from
fsnotify_find_mark, and it can't be removed while its lock is held.
Therefore there's always a 'conn' returned from fsnotify_grab_connector(),
and so this path will never be taken.
But this code path is confusing to me, and I could be wrong. Jan, please
confirm my analysis is correct?
Yes, you are correct. The presence of another mark in the list (and the
fact we pin it there using refcount & mark_mutex) guarantees we won't need
to allocate the connector. I agree the audit code's use of fsnotify would
deserve some cleanup.
I'm always open to suggestions and patches (hint, hint) from the
fsnotify experts ;)
Yeah, I was looking into it on Friday and today :). Currently I've got a
bit stuck because I think I've found some races in audit_tree code and I
haven't yet decided how to fix them. E.g. am I right the following can
happen?
CPU1                                        CPU2
tag_chunk(inode, tree1)                     tag_chunk(inode, tree2)
  old_entry = fsnotify_find_mark();           old_entry = fsnotify_find_mark();
  old = container_of(old_entry);              old = container_of(old_entry);
  chunk = alloc_chunk(old->count + 1);        chunk = alloc_chunk(old->count + 1);
  mutex_lock(&group->mark_mutex);
  adds new mark
  replaces chunk
  old->dead = 1;
  mutex_unlock(&group->mark_mutex);
                                              mutex_lock(&group->mark_mutex);
                                              if (!(old_entry->flags &
                                                    FSNOTIFY_MARK_FLAG_ATTACHED)) {
                                                check fails as old_entry is
                                                  not yet destroyed
                                              adds new mark
                                              replaces old chunk again ->
                                                list corruption, lost refs, ...
                                              mutex_unlock(&group->mark_mutex);
Generally there's a bigger problem that audit_tree code can have multiple
marks attached to one inode but only one of them is the "valid" one (i.e.,
the one embedded in the latest chunk). This is only a temporary state until
fsnotify_destroy_mark() detaches the mark and then on last reference drop
we really remove the mark from inode's list but during that window it is
undefined which mark is returned from fsnotify_find_mark()...
So am I right the above can really happen or is there some higher level
synchronization I'm missing? If this can really happen, I think I'll need
to rework the code so that audit_tree has just one mark attached and
let it probably point to the current chunk.
Full disclosure: I inherited almost all of this code, and aside from
work by Richard to add some new functionality early in my tenure, and
the fixes you've sent, I've not ventured too far into this
audit/fsnotify code. This means I don't have a quick answer for you;
in fact, since you've been digging through the code recently, you're
probably more knowledgeable than I am about this particular corner of
the audit code.

I've added Richard to the CC line; he was the last audit person to
spend any quality time with this code, so he may have some insight,
although it has been many years if the git log is to be believed.

As far as the "bigger problem" is concerned, we actually are aware of
this - partially anyway - we just haven't had the cycles to deal with
it:

* https://github.com/linux-audit/audit-kernel/issues/12
--
paul moore
www.paul-moore.com