audit_backlog_limit messages

Discussion:

Debora Velarde

2005-06-29 22:53:41 UTC

(decided it was best to move this discussion to the list)

We're hitting a system hang that repeatedly displays this to the terminal:
audit: audit_backlog=258 > audit_backlog_limit=256
audit: audit_lost=58 audit_rate_limit=0 audit_backlog_limit=256
audit: audit_backlog_limit exceeded

The systems (we're seeing it on multiple platforms) were running simple
testcases that used this audit rule:
auditctl -a exit,always -F auid=<tester_auid>

I was able to reproduce the hang on my system. Here's some info about my
environment before running the test:
# auditctl -s
AUDIT_STATUS: enabled=1 flag=1 pid=1143 rate_limit=0 backlog_limit=256
lost=0 backlog=0

auditctl version 0.9.14
Linux 2.6.9-11.EL.audit.71 SMP ppc64

Is 'auditctl -a exit,always -F auid=<tester_auid>' not a reasonable

filter

rule, and therefore we shouldn't worry about this?

This is a reasonable rule. However, I don't know anything else about your
environment. What do you have for flush? How big is your backlog queue?

These

matter more than the rule.

David Woodhouse

2005-06-29 23:11:31 UTC

Permalink

Post by Debora Velarde
(decided it was best to move this discussion to the list)

You have been asked before _not_ to send HTML mail.

Post by Debora Velarde
audit: audit_backlog=258 > audit_backlog_limit=256
audit: audit_lost=58 audit_rate_limit=0 audit_backlog_limit=256
audit: audit_backlog_limit exceeded

As discussed, the system isn't hung; it's just going slowly because
every auditable action is waiting 1 minute for space on the backlog
queue. In fact from Steve's reports it looks like auditd itself is
getting audited again -- I'm not sure how. I wasn't able to reproduce it
using Steve's method; I'll try yours first thing in the morning.

I'll do a kernel that reverts to immediate failure after audit_panic()
has been called, to prevent the appearance of a 'hang'. But for the
purposes of our own testing, we generally shouldn't be getting into a
situation where audit_panic() is called in the first place.

--
dwmw2

Steve Grubb

2005-06-30 11:01:25 UTC

Permalink

Post by David Woodhouse
As discussed, the system isn't hung; it's just going slowly because
every auditable action is waiting 1 minute for space on the backlog
queue. In fact from Steve's reports it looks like auditd itself is
getting audited again -- I'm not sure how. I wasn't able to reproduce it
using Steve's method; I'll try yours first thing in the morning.

Just a guess. Its using ctx->pid. Maybe tsk->pid is better? I would suggest 2
changes, though. The first is to plug the hole so that auditd doesn't get
audited. The other step is to inspect the pid when adding to the backlog wait
queue to make sure auditd doesn't get added to it. This way if there is
another sneak path, auditd won't get added to wait queue.

Post by David Woodhouse
But for the purposes of our own testing, we generally shouldn't be getting
into a situation where audit_panic() is called in the first place.

Well, they should check that panic really does work. I know that I don't run
my system like that. :)

Also, the backlog limit of 256 is low. This is the default set for people who
are not doing auditing. You should bump it up higher to maybe 1024 or 4096.
The default config is for people collecting the occasional avc denial
message.

-Steve

David Woodhouse

2005-06-30 17:11:03 UTC

Permalink

Post by Steve Grubb
Just a guess. Its using ctx->pid. Maybe tsk->pid is better?

I had another look through but I couldn't see anywhere where this should
actually matter, and we do use tsk->pid wherever it's available.

Post by Steve Grubb
I would suggest 2 changes, though. The first is to plug the hole so
that auditd doesn't get audited.

Well yes, if I could see it :)

Post by Steve Grubb
The other step is to inspect the pid when adding to the backlog wait
queue to make sure auditd doesn't get added to it. This way if there is
another sneak path, auditd won't get added to wait queue.

That's a bandaid which I've implemented for now in order to find the
offending code, but which probably shouldn't be included long-term...

--- linux-2.6.9/kernel/auditsc.c~ 2005-06-24 17:36:04.000000000 +0100
+++ linux-2.6.9/kernel/auditsc.c 2005-06-30 17:51:02.000000000 +0100
@@ -810,6 +810,16 @@ static void audit_log_exit(struct audit_
struct audit_watch_info *winfo;
struct hlist_node *pos;

+ if (context->pid == audit_pid) {
+ static int argh = 10;
+ if (argh) {
+ argh--;
+ printk(KERN_ERR "Eep. Trying to audit auditd.");
+ WARN_ON(1);
+ }
+ else return;
+ /* Let it continue anyway the first ten times, so we see why it was getting audited */
+ }
ab = audit_log_start(context, GFP_KERNEL, AUDIT_SYSCALL);
if (!ab)
return; /* audit_panic has been called */

--
dwmw2

Steve Grubb

2005-06-30 19:23:47 UTC

Permalink

Post by David Woodhouse
That's a bandaid which I've implemented for now in order to find the
offending code, but which probably shouldn't be included long-term...

yes bandaid...but this seems to have a history of rearing its ugly head. If it
auditd ever does get on the wait queue, the game is over. You can add this as
a bugon or runtime check, but I'd keep it because the consequences are bad
when it does happen.

-Steve

David Woodhouse

2005-06-30 20:35:50 UTC

Permalink

Post by Steve Grubb
Just a guess. Its using ctx->pid. Maybe tsk->pid is better?

You're right. We weren't setting ctx->pid when the context was
allocated, so checking it in audit_filter_syscall() wasn't having the
desired effect. This one really ought to fix it for real now...

--- linux-2.6.9/kernel/auditsc.c~ 2005-06-30 17:51:02.000000000 +0100
+++ linux-2.6.9/kernel/auditsc.c 2005-06-30 21:29:48.000000000 +0100
@@ -527,7 +527,7 @@ static enum audit_state audit_filter_sys
int word = AUDIT_WORD(ctx->major);
int bit = AUDIT_BIT(ctx->major);

- if (audit_pid && ctx->pid == audit_pid)
+ if (audit_pid && tsk->pid == audit_pid)
return AUDIT_DISABLED;

rcu_read_lock();
@@ -612,7 +612,7 @@ static inline struct audit_context *audi
context->return_valid = return_valid;
context->return_code = return_code;

- if (context->in_syscall && !context->auditable && tsk->pid != audit_pid) {
+ if (context->in_syscall && !context->auditable) {
enum audit_state state;
state = audit_filter_syscall(tsk, context, &audit_filter_list[AUDIT_FILTER_EXIT]);
if (state == AUDIT_RECORD_CONTEXT)

--
dwmw2

David Woodhouse

2005-06-30 18:42:26 UTC

Permalink

Post by Debora Velarde
I was able to reproduce the hang on my system.

Please could you try to reproduce it on the audit.73 kernel. If auditd
gets audited, please show the audit messages and the kernel messages.

--
dwmw2

Debora Velarde

2005-06-30 19:06:00 UTC

Permalink

Post by David Woodhouse

Post by Debora Velarde
I was able to reproduce the hang on my system.

Please could you try to reproduce it on the audit.73 kernel. If auditd
gets audited, please show the audit messages and the kernel messages.

The test case that caused the problem works fine on audit.73 kernel.
Thanks!
debbie

David Woodhouse

2005-06-30 19:35:55 UTC

Permalink

Post by Debora Velarde
The test case that caused the problem works fine on audit.73 kernel.
Thanks!

Hrm. That isn't a good thing. I only added debugging to try to _catch_
the problem, and made it stop waiting after audit_panic() was called. I
didn't do anything which is known to fix the problem.

Did audit_panic() get called? Did you see a backtrace when auditd got
audited?

I did clean up the waiting code a little bit but I hadn't noticed that
it was actually buggy (in important ways).

--
dwmw2