Discussion:
[PATCH V6 00/10] namespaces: log namespaces per task
Richard Guy Briggs
2015-04-17 07:35:47 UTC
The purpose is to track namespace instances in use by logged processes from the
perspective of init_*_ns by logging the namespace IDs (device ID and namespace
inode - offset).


1/10 exposes proc's ns entries structure which lists a number of useful
operations per namespace type for other subsystems to use.

2/10 proc_ns: define PROC_*_INIT_INO in terms of PROC_DYNAMIC_FIRST

3/10 provides an example of usage for audit_log_task_info() which is used by
syscall audits, among others. audit_log_task() and audit_common_recv_message()
would be other potential use cases.

Proposed output format:
This differs slightly from Aristeu's patch because of the label conflict with
"pid=" due to including it in existing records rather than it being a separate
record. It has now returned to being a separate record. The proc device
major/minor are listed in hexadecimal and namespace IDs are the proc inode
minus the base offset.
type=NS_INFO msg=audit(1408577535.306:82): dev=00:03 netns=3 utsns=-2 ipcns=-1 pidns=-4 userns=-3 mntns=0
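
As a rough userspace illustration of the dev= field in the record above, here is a minimal sketch that assumes the same two-digit hexadecimal major:minor layout. format_dev_field() is a hypothetical helper and is not part of the patchset; it uses the glibc major()/minor() macros rather than the kernel's MAJOR()/MINOR().

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysmacros.h>

/*
 * Hypothetical helper (not from the patchset): render a device number
 * as "dev=MM:mm" with two-digit hexadecimal major and minor.
 * Returns the number of characters snprintf() would have written.
 */
static int format_dev_field(char *buf, size_t len, dev_t dev)
{
	return snprintf(buf, len, "dev=%02x:%02x",
			(unsigned int)major(dev), (unsigned int)minor(dev));
}
```

Calling format_dev_field() with makedev(0, 3) yields the string "dev=00:03", matching the sample record.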

4/10 change audit startup from __initcall to subsys_initcall to get it started
earlier to be able to receive initial namespace log messages.

5/10 tracks the creation and deletion of namespaces, listing the type of
namespace instance, proc device ID, related namespace id if there is one and
the newly minted namespace ID.

Proposed output format for initial namespace creation:
type=AUDIT_NS_INIT_UTS msg=audit(1408577534.868:5): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_utsns=(none) utsns=-2 res=1
type=AUDIT_NS_INIT_USER msg=audit(1408577534.868:6): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_userns=(none) userns=-3 res=1
type=AUDIT_NS_INIT_PID msg=audit(1408577534.868:7): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_pidns=(none) pidns=-4 res=1
type=AUDIT_NS_INIT_MNT msg=audit(1408577534.868:8): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_mntns=(none) mntns=0 res=1
type=AUDIT_NS_INIT_IPC msg=audit(1408577534.868:9): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_ipcns=(none) ipcns=-1 res=1
type=AUDIT_NS_INIT_NET msg=audit(1408577533.500:10): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_netns=(none) netns=2 res=1

And a CLONE action would result in:
type=AUDIT_NS_INIT_NET msg=audit(1408577535.306:81): pid=481 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 dev=00:03 old_netns=2 netns=3 res=1

While deleting a namespace would result in:
type=AUDIT_NS_DEL_MNT msg=audit(1408577552.221:85): pid=481 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 dev=00:03 mntns=4 res=1

6/10 accepts a PID from userspace and requests logging an AUDIT_NS_INFO record
type (CAP_AUDIT_CONTROL required).

7/10 is a macro for CLONE_NEW_* flags.

8/10 adds auditing on creation of namespace(s) in fork.

9/10 adds auditing a change of namespace on setns.

10/10 attaches an AUDIT_NS_INFO record to AUDIT_VIRT_CONTROL records
(CAP_AUDIT_WRITE required).


v5 -> v6:
Switch to using namespace ID based on namespace proc inode minus base offset
Added proc device ID to qualify proc inode reference
Eliminate exposed /proc interface

v4 -> v5:
Clean up prototypes for dependencies on CONFIG_NAMESPACES.
Add AUDIT_NS_INFO record type to AUDIT_VIRT_CONTROL record.
Log AUDIT_NS_INFO with PID.
Move /proc/<pid>/ns_* patches to end of patchset to deprecate them.
Log on changing ns (setns).
Log on creating new namespaces when forking.
Added a macro for CLONE_NEW*.

v3 -> v4:
Separate out the NS_INFO message from the SYSCALL message.
Moved audit_log_namespace_info() out of audit_log_task_info().
Use a separate message type per namespace type for each of INIT/DEL.
Make ns= easier to search across NS_INFO and NS_INIT/DEL_XXX msg types.
Add /proc/<pid>/ns/ documentation.
Fix dynamic initial ns logging.

v2 -> v3:
Use atomic64_t in ns_serial to simplify it.
Avoid function duplication in proc, keying on dentry.
Squash down audit patch to avoid rcu sleep issues.
Add tracking for creation and deletion of namespace instances.

v1 -> v2:
Avoid rollover by switching from an int to a long long.
Change rollover behaviour from simply avoiding zero to raising a BUG.
Expose serial numbers in /proc/<pid>/ns/*_snum.
Expose ns_entries and use it in audit.


Notes:
As for CAP_AUDIT_READ, a patchset has been accepted upstream to check
capabilities of userspace processes that try to join netlink broadcast groups.

This set does not try to solve the non-init namespace audit messages and
auditd problem yet. That will come later, likely with additional auditd
instances running in another namespace with a limited ability to influence the
master auditd. I echo Eric B's idea that messages destined for different
namespaces would have to be tailored for that namespace with references that
make sense (such as the right pid number reported to that pid namespace, and
not leaking info about parents or peers).

Questions:
Is there a way to link serial numbers of namespaces involved in migration of a
container to another kernel? It sounds like what is needed is a part of a
management application that is able to pull the audit records from constituent
hosts to build an audit trail of a container.

What additional events should list this information?

Does this present any problematic information leaks? Only CAP_AUDIT_CONTROL
(and now CAP_AUDIT_READ) in init_user_ns can get to this information in
the init namespace at the moment from audit.


Richard Guy Briggs (10):
namespaces: expose ns_entries
proc_ns: define PROC_*_INIT_INO in terms of PROC_DYNAMIC_FIRST
audit: log namespace ID numbers
audit: initialize at subsystem time rather than device time
audit: log creation and deletion of namespace instances
audit: dump namespace IDs for pid on receipt of AUDIT_NS_INFO
sched: add a macro to ref all CLONE_NEW* flags
fork: audit on creation of new namespace(s)
audit: log on switching namespace (setns)
audit: emit AUDIT_NS_INFO record with AUDIT_VIRT_CONTROL record

fs/namespace.c | 13 +++
fs/proc/generic.c | 3 +-
fs/proc/namespaces.c | 2 +-
include/linux/audit.h | 20 +++++
include/linux/proc_ns.h | 10 ++-
include/uapi/linux/audit.h | 21 +++++
include/uapi/linux/sched.h | 6 ++
ipc/namespace.c | 12 +++
kernel/audit.c | 169 +++++++++++++++++++++++++++++++++++++-
kernel/auditsc.c | 2 +
kernel/fork.c | 3 +
kernel/nsproxy.c | 4 +
kernel/pid_namespace.c | 13 +++
kernel/user_namespace.c | 13 +++
kernel/utsname.c | 12 +++
net/core/net_namespace.c | 12 +++
security/integrity/ima/ima_api.c | 2 +
17 files changed, 309 insertions(+), 8 deletions(-)
Richard Guy Briggs
2015-04-17 07:35:48 UTC
Expose ns_entries so subsystems other than proc can use this set of namespace
operations.

Signed-off-by: Richard Guy Briggs <***@redhat.com>
---
fs/proc/namespaces.c | 2 +-
include/linux/proc_ns.h | 1 +
2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..310da74 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -15,7 +15,7 @@
#include "internal.h"


-static const struct proc_ns_operations *ns_entries[] = {
+const struct proc_ns_operations *ns_entries[] = {
#ifdef CONFIG_NET_NS
&netns_operations,
#endif
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 34a1e10..09ff93c 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -27,6 +27,7 @@ extern const struct proc_ns_operations ipcns_operations;
extern const struct proc_ns_operations pidns_operations;
extern const struct proc_ns_operations userns_operations;
extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations *ns_entries[];

/*
* We always define these enumerators
--
1.7.1
Richard Guy Briggs
2015-04-17 07:35:49 UTC
Since PROC_*_INIT_INO are all defined relative to PROC_DYNAMIC_FIRST, make it
explicit.

Signed-off-by: Richard Guy Briggs <***@redhat.com>
---
fs/proc/generic.c | 3 +--
include/linux/proc_ns.h | 9 +++++----
2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index b7f268e..9f7726a 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -11,6 +11,7 @@
#include <linux/errno.h>
#include <linux/time.h>
#include <linux/proc_fs.h>
+#include <linux/proc_ns.h>
#include <linux/stat.h>
#include <linux/mm.h>
#include <linux/module.h>
@@ -121,8 +122,6 @@ static int xlate_proc_name(const char *name, struct proc_dir_entry **ret,
static DEFINE_IDA(proc_inum_ida);
static DEFINE_SPINLOCK(proc_inum_lock); /* protects the above */

-#define PROC_DYNAMIC_FIRST 0xF0000000U
-
/*
* Return an inode number between PROC_DYNAMIC_FIRST and
* 0xffffffff, or zero on failure.
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 09ff93c..340372b 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -32,12 +32,13 @@ extern const struct proc_ns_operations *ns_entries[];
/*
* We always define these enumerators
*/
+#define PROC_DYNAMIC_FIRST 0xF0000000U
enum {
PROC_ROOT_INO = 1,
- PROC_IPC_INIT_INO = 0xEFFFFFFFU,
- PROC_UTS_INIT_INO = 0xEFFFFFFEU,
- PROC_USER_INIT_INO = 0xEFFFFFFDU,
- PROC_PID_INIT_INO = 0xEFFFFFFCU,
+ PROC_IPC_INIT_INO = PROC_DYNAMIC_FIRST - 1,
+ PROC_UTS_INIT_INO = PROC_DYNAMIC_FIRST - 2,
+ PROC_USER_INIT_INO = PROC_DYNAMIC_FIRST - 3,
+ PROC_PID_INIT_INO = PROC_DYNAMIC_FIRST - 4,
};

#ifdef CONFIG_PROC_FS
--
1.7.1
Richard Guy Briggs
2015-04-17 07:35:50 UTC
Log the namespace identifiers (device ID and proc inode minus base offset) of a
task in a new record type (1329). The record usually accompanies the
type=SYSCALL record produced via audit_log_task_info(), which is used by
syscall audits, among others.

Idea first presented:
https://www.redhat.com/archives/linux-audit/2013-March/msg00020.html

Typical output format would look something like:
type=NS_INFO msg=audit(1408577535.306:82): pid=374 dev=00:03 netns=116 utsns=-2 ipcns=-1 pidns=-4 userns=-3 mntns=0

The namespace identifier values are printed relative to PROC_DYNAMIC_FIRST
(0xF0000000) (so the first 4 are negative).
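
The arithmetic behind those values can be sketched in userspace as follows. The constants mirror the proc_ns.h hunk from patch 2/10. ns_id_from_inum() is a hypothetical helper name, and the signed result relies on the usual two's-complement conversion done by gcc and clang.

```c
#include <stdint.h>

/* Values mirrored from the proc_ns.h hunk in patch 2/10. */
#define PROC_DYNAMIC_FIRST 0xF0000000U
#define PROC_IPC_INIT_INO  (PROC_DYNAMIC_FIRST - 1)
#define PROC_UTS_INIT_INO  (PROC_DYNAMIC_FIRST - 2)
#define PROC_USER_INIT_INO (PROC_DYNAMIC_FIRST - 3)
#define PROC_PID_INIT_INO  (PROC_DYNAMIC_FIRST - 4)

/*
 * Hypothetical helper (not from the patchset): turn a proc inode number
 * into the ID that is logged. The unsigned subtraction wraps for inodes
 * below PROC_DYNAMIC_FIRST, and the cast reinterprets the result as a
 * signed value, much as printing the difference with %d does.
 */
static int32_t ns_id_from_inum(uint32_t inum)
{
	return (int32_t)(inum - PROC_DYNAMIC_FIRST);
}
```

With these definitions the four fixed initial inodes map to -1 (ipc), -2 (uts), -3 (user) and -4 (pid), while dynamically allocated inodes count up from 0.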

Suggested-by: Aristeu Rozanski <***@redhat.com>
Signed-off-by: Richard Guy Briggs <***@redhat.com>
Acked-by: Serge Hallyn <***@canonical.com>
---
include/linux/audit.h | 8 ++++++++
include/uapi/linux/audit.h | 1 +
kernel/audit.c | 35 +++++++++++++++++++++++++++++++++++
kernel/auditsc.c | 2 ++
security/integrity/ima/ima_api.c | 2 ++
5 files changed, 48 insertions(+), 0 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index b481779..71698ec 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -478,6 +478,12 @@ static inline void audit_log_secctx(struct audit_buffer *ab, u32 secid)
extern int audit_log_task_context(struct audit_buffer *ab);
extern void audit_log_task_info(struct audit_buffer *ab,
struct task_struct *tsk);
+#ifdef CONFIG_NAMESPACES
+extern void audit_log_ns_info(struct task_struct *tsk);
+#else
+static inline void audit_log_ns_info(struct task_struct *tsk)
+{ }
+#endif

extern int audit_update_lsm_rules(void);

@@ -534,6 +540,8 @@ static inline int audit_log_task_context(struct audit_buffer *ab)
static inline void audit_log_task_info(struct audit_buffer *ab,
struct task_struct *tsk)
{ }
+static inline void audit_log_ns_info(struct task_struct *tsk)
+{ }
#define audit_enabled 0
#endif /* CONFIG_AUDIT */
static inline void audit_log_string(struct audit_buffer *ab, const char *buf)
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 2ccf19e..1ffb151 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -110,6 +110,7 @@
#define AUDIT_SECCOMP 1326 /* Secure Computing event */
#define AUDIT_PROCTITLE 1327 /* Proctitle emit event */
#define AUDIT_FEATURE_CHANGE 1328 /* audit log listing feature changes */
+#define AUDIT_NS_INFO 1329 /* Record process namespace IDs */

#define AUDIT_AVC 1400 /* SE Linux avc denial or grant */
#define AUDIT_SELINUX_ERR 1401 /* Internal SE Linux Errors */
diff --git a/kernel/audit.c b/kernel/audit.c
index d5a1220..68200ce 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -66,7 +66,9 @@
#include <linux/freezer.h>
#include <linux/tty.h>
#include <linux/pid_namespace.h>
+#include <linux/proc_ns.h>
#include <net/netns/generic.h>
+#include <linux/mount.h> /* struct vfsmount */

#include "audit.h"

@@ -745,6 +747,8 @@ static void audit_log_feature_change(int which, u32 old_feature, u32 new_feature
audit_feature_names[which], !!old_feature, !!new_feature,
!!old_lock, !!new_lock, res);
audit_log_end(ab);
+
+ audit_log_ns_info(current);
}

static int audit_set_feature(struct sk_buff *skb)
@@ -1653,6 +1657,35 @@ void audit_log_session_info(struct audit_buffer *ab)
audit_log_format(ab, " auid=%u ses=%u", auid, sessionid);
}

+#ifdef CONFIG_NAMESPACES
+void audit_log_ns_info(struct task_struct *tsk)
+{
+ const struct proc_ns_operations **entry;
+ bool end = false;
+ struct audit_buffer *ab;
+ struct vfsmount *mnt = task_active_pid_ns(tsk)->proc_mnt;
+ struct super_block *sb = mnt->mnt_sb;
+
+ if (!tsk)
+ return;
+ ab = audit_log_start(tsk->audit_context, GFP_KERNEL,
+ AUDIT_NS_INFO);
+ if (!ab)
+ return;
+ audit_log_format(ab, "pid=%d", task_pid_nr(tsk));
+ audit_log_format(ab, " dev=%02x:%02x", MAJOR(sb->s_dev), MINOR(sb->s_dev));
+ for (entry = ns_entries; !end; entry++) {
+ void *ns = (*entry)->get(tsk);
+
+ audit_log_format(ab, " %sns=%d", (*entry)->name,
+ (*entry)->inum(ns) - PROC_DYNAMIC_FIRST);
+ (*entry)->put(ns);
+ end = (*entry)->type == CLONE_NEWNS;
+ }
+ audit_log_end(ab);
+}
+#endif /* CONFIG_NAMESPACES */
+
void audit_log_key(struct audit_buffer *ab, char *key)
{
audit_log_format(ab, " key=");
@@ -1935,6 +1968,8 @@ void audit_log_link_denied(const char *operation, struct path *link)
audit_log_format(ab, " res=0");
audit_log_end(ab);

+ audit_log_ns_info(current);
+
/* Generate AUDIT_PATH record with object. */
name->type = AUDIT_TYPE_NORMAL;
audit_copy_inode(name, link->dentry, link->dentry->d_inode);
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 4b89f7f..3ca3416 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1378,6 +1378,8 @@ static void audit_log_exit(struct audit_context *context, struct task_struct *ts
audit_log_key(ab, context->filterkey);
audit_log_end(ab);

+ audit_log_ns_info(tsk);
+
for (aux = context->aux; aux; aux = aux->next) {

ab = audit_log_start(context, GFP_KERNEL, aux->type);
diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c
index d9cd5ce..58ac695 100644
--- a/security/integrity/ima/ima_api.c
+++ b/security/integrity/ima/ima_api.c
@@ -323,6 +323,8 @@ void ima_audit_measurement(struct integrity_iint_cache *iint,
audit_log_task_info(ab, current);
audit_log_end(ab);

+ audit_log_ns_info(current);
+
iint->flags |= IMA_AUDITED;
}
--
1.7.1
Richard Guy Briggs
2015-04-17 07:35:53 UTC
When a task with CAP_AUDIT_CONTROL sends a NETLINK_AUDIT message of type
AUDIT_NS_INFO with a PID of interest, dump the namespace IDs of that task to
the audit log.
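
For illustration only, a userspace request of this kind might be assembled roughly as below. This is a sketch under assumptions: AUDIT_NS_INFO (1329) is taken from patch 3/10 and is not in mainline headers, build_ns_info_request() is a hypothetical helper, and the payload is a bare pid_t because the kernel side reads it as *(pid_t *)data.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* From patch 3/10 of this series; an assumption outside of it. */
#define AUDIT_NS_INFO 1329

/*
 * Hypothetical helper (not from the patchset): fill buf with a netlink
 * message of type AUDIT_NS_INFO whose payload is the PID of interest.
 * buf must be suitably aligned for struct nlmsghdr.
 * Returns the number of bytes to send, or 0 if buf is too small.
 */
static size_t build_ns_info_request(void *buf, size_t len, pid_t pid,
				    __u32 seq)
{
	struct nlmsghdr *nlh = buf;
	size_t total = NLMSG_SPACE(sizeof(pid));

	if (len < total)
		return 0;
	memset(buf, 0, total);
	nlh->nlmsg_len = NLMSG_LENGTH(sizeof(pid));
	nlh->nlmsg_type = AUDIT_NS_INFO;
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	nlh->nlmsg_seq = seq;
	memcpy(NLMSG_DATA(nlh), &pid, sizeof(pid));
	return total;
}
```

The resulting buffer would then be sent with sendto() on a socket(AF_NETLINK, SOCK_RAW, NETLINK_AUDIT) socket addressed to the kernel (nl_pid 0). Per the patch, the sender needs CAP_AUDIT_CONTROL and must be in the initial pid namespace.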

Signed-off-by: Richard Guy Briggs <***@redhat.com>
---
kernel/audit.c | 14 ++++++++++++++
1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/kernel/audit.c b/kernel/audit.c
index e6230c4..b7f10e9 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -674,6 +674,7 @@ static int audit_netlink_ok(struct sk_buff *skb, u16 msg_type)
case AUDIT_TTY_SET:
case AUDIT_TRIM:
case AUDIT_MAKE_EQUIV:
+ case AUDIT_NS_INFO:
/* Only support auditd and auditctl in initial pid namespace
* for now. */
if (task_active_pid_ns(current) != &init_pid_ns)
@@ -1070,6 +1071,19 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
audit_log_end(ab);
break;
}
+ case AUDIT_NS_INFO:
+#ifdef CONFIG_NAMESPACES
+ {
+ struct task_struct *tsk;
+
+ rcu_read_lock();
+ tsk = find_task_by_vpid(*(pid_t *)data);
+ rcu_read_unlock();
+ audit_log_ns_info(tsk);
+ }
+#else /* CONFIG_NAMESPACES */
+ err = -EOPNOTSUPP;
+#endif /* CONFIG_NAMESPACES */
default:
err = -EINVAL;
break;
--
1.7.1
Richard Guy Briggs
2015-04-17 07:35:51 UTC
The audit subsystem should be initialized a bit earlier so that it is in place
in time for initial namespace ID number logging.

Signed-off-by: Richard Guy Briggs <***@redhat.com>
---
kernel/audit.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/audit.c b/kernel/audit.c
index 68200ce..63f32f4 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -1188,7 +1188,7 @@ static int __init audit_init(void)

return 0;
}
-__initcall(audit_init);
+subsys_initcall(audit_init);

/* Process kernel command-line parameter at boot time. audit=0 or audit=1. */
static int __init audit_enable(char *str)
--
1.7.1
Richard Guy Briggs
2015-04-17 07:35:57 UTC
Signed-off-by: Richard Guy Briggs <***@redhat.com>
---
include/uapi/linux/audit.h | 2 ++
kernel/audit.c | 2 ++
2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 567b45f..b6a55fe 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -163,6 +163,8 @@

#define AUDIT_KERNEL 2000 /* Asynchronous audit record. NOT A REQUEST. */

+#define AUDIT_VIRT_CONTROL 2500 /* Start, Pause, Stop VM */
+
/* Rule flags */
#define AUDIT_FILTER_USER 0x00 /* Apply rule to user-generated messages */
#define AUDIT_FILTER_TASK 0x01 /* Apply rule at task creation (not syscall) */
diff --git a/kernel/audit.c b/kernel/audit.c
index a7b1b61..8a01d88 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -943,6 +943,8 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
}
audit_set_portid(ab, NETLINK_CB(skb).portid);
audit_log_end(ab);
+ if (msg_type == AUDIT_VIRT_CONTROL)
+ audit_log_ns_info(NULL);
mutex_lock(&audit_cmd_mutex);
}
break;
--
1.7.1
Richard Guy Briggs
2015-04-17 07:35:55 UTC
When clone(2) is called to fork a new process creating one or more namespaces,
audit the event to tie the new pid with the namespace IDs.

Signed-off-by: Richard Guy Briggs <***@redhat.com>
---
kernel/fork.c | 3 +++
kernel/nsproxy.c | 1 +
2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 6a13c46..2ea1225 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1624,6 +1624,9 @@ long do_fork(unsigned long clone_flags,
get_task_struct(p);
}

+ if (unlikely(clone_flags & CLONE_NEW_MASK_ALL))
+ audit_log_ns_info(p);
+
wake_up_new_task(p);

/* forking complete and child started to run, tell ptracer */
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 8e78110..d5353c2 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
#include <linux/proc_ns.h>
#include <linux/file.h>
#include <linux/syscalls.h>
+#include <linux/audit.h>

static struct kmem_cache *nsproxy_cachep;
--
1.7.1
Richard Guy Briggs
2015-04-17 07:35:56 UTC
Added six new audit message types, AUDIT_NS_SET_*, and the function
audit_log_ns_set() to log a switch of namespace.
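
The mapping from namespace type to message type could also be expressed as a lookup table. The following is a standalone sketch, not the approach taken in the patch (which uses a switch). The AUDIT_NS_SET_* numbers are the ones proposed in this series, and struct ns_set_map and ns_set_msg_type() are hypothetical names.

```c
#include <stddef.h>

/* Standard CLONE_NEW* values, defined only if no header supplied them. */
#ifndef CLONE_NEWNS
#define CLONE_NEWNS   0x00020000
#endif
#ifndef CLONE_NEWUTS
#define CLONE_NEWUTS  0x04000000
#endif
#ifndef CLONE_NEWIPC
#define CLONE_NEWIPC  0x08000000
#endif
#ifndef CLONE_NEWUSER
#define CLONE_NEWUSER 0x10000000
#endif
#ifndef CLONE_NEWPID
#define CLONE_NEWPID  0x20000000
#endif
#ifndef CLONE_NEWNET
#define CLONE_NEWNET  0x40000000
#endif

/* Message type numbers as proposed in this patch (not in mainline). */
#define AUDIT_NS_SET_MNT  1342
#define AUDIT_NS_SET_UTS  1343
#define AUDIT_NS_SET_IPC  1344
#define AUDIT_NS_SET_USER 1345
#define AUDIT_NS_SET_PID  1346
#define AUDIT_NS_SET_NET  1347

/* Hypothetical table type pairing a namespace flag with its record type. */
struct ns_set_map {
	unsigned long clone_flag;
	int msg_type;
};

static const struct ns_set_map ns_set_table[] = {
	{ CLONE_NEWNS,   AUDIT_NS_SET_MNT  },
	{ CLONE_NEWUTS,  AUDIT_NS_SET_UTS  },
	{ CLONE_NEWIPC,  AUDIT_NS_SET_IPC  },
	{ CLONE_NEWUSER, AUDIT_NS_SET_USER },
	{ CLONE_NEWPID,  AUDIT_NS_SET_PID  },
	{ CLONE_NEWNET,  AUDIT_NS_SET_NET  },
};

/* Return the record type for one namespace flag, or -1 if unknown. */
static int ns_set_msg_type(unsigned long clone_flag)
{
	size_t i;

	for (i = 0; i < sizeof(ns_set_table) / sizeof(ns_set_table[0]); i++)
		if (ns_set_table[i].clone_flag == clone_flag)
			return ns_set_table[i].msg_type;
	return -1;
}
```

A table keeps the flag and record numbers side by side, at the cost of a short linear scan instead of a compiled switch.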

Signed-off-by: Richard Guy Briggs <***@redhat.com>
---
include/linux/audit.h | 4 +++
include/uapi/linux/audit.h | 6 +++++
kernel/audit.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
kernel/nsproxy.c | 3 ++
4 files changed, 65 insertions(+), 0 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index b28dfb0..c71c819 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -26,6 +26,7 @@
#include <linux/sched.h>
#include <linux/ptrace.h>
#include <uapi/linux/audit.h>
+#include <linux/proc_ns.h>

struct audit_sig_info {
uid_t uid;
@@ -487,6 +488,7 @@ static inline void audit_log_ns_info(struct task_struct *tsk)
extern void audit_log_ns_init(int type, unsigned int old_inum,
unsigned int inum);
extern void audit_log_ns_del(int type, unsigned int inum);
+extern void audit_log_ns_set(const struct proc_ns_operations *ops, void *ns);

extern int audit_update_lsm_rules(void);

@@ -550,6 +552,8 @@ static inline int audit_log_ns_init(int type, unsigned int old_inum,
{ }
static inline int audit_log_ns_del(int type, unsigned int inum)
{ }
+static inline void audit_log_ns_set(const struct proc_ns_operations *ops, void *ns)
+{ }
#define audit_enabled 0
#endif /* CONFIG_AUDIT */
static inline void audit_log_string(struct audit_buffer *ab, const char *buf)
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 487cad6..567b45f 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -123,6 +123,12 @@
#define AUDIT_NS_DEL_USER 1339 /* Record USER namespace instance deletion */
#define AUDIT_NS_DEL_PID 1340 /* Record PID namespace instance deletion */
#define AUDIT_NS_DEL_NET 1341 /* Record NET namespace instance deletion */
+#define AUDIT_NS_SET_MNT 1342 /* Record mount namespace instance switch */
+#define AUDIT_NS_SET_UTS 1343 /* Record UTS namespace instance switch */
+#define AUDIT_NS_SET_IPC 1344 /* Record IPC namespace instance switch */
+#define AUDIT_NS_SET_USER 1345 /* Record USER namespace instance switch */
+#define AUDIT_NS_SET_PID 1346 /* Record PID namespace instance switch */
+#define AUDIT_NS_SET_NET 1347 /* Record NET namespace instance switch */

#define AUDIT_AVC 1400 /* SE Linux avc denial or grant */
#define AUDIT_SELINUX_ERR 1401 /* Internal SE Linux Errors */
diff --git a/kernel/audit.c b/kernel/audit.c
index b7f10e9..a7b1b61 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -2054,6 +2054,58 @@ void audit_log_ns_del(int type, unsigned int inum)
inum - PROC_DYNAMIC_FIRST);
audit_log_end(ab);
}
+
+/**
+ * audit_log_ns_set - report a namespace set change
+ * @ops: the ops structure for the namespace to be changed
+ * @ns: the new namespace
+ */
+void audit_log_ns_set(const struct proc_ns_operations *ops, void *ns)
+{
+ struct audit_buffer *ab;
+ void *old_ns;
+ int msg_type;
+ struct vfsmount *mnt = task_active_pid_ns(current)->proc_mnt;
+ struct super_block *sb = mnt->mnt_sb;
+ char old_ns_s[16];
+
+ switch (ops->type) {
+ case CLONE_NEWNS:
+ msg_type = AUDIT_NS_SET_MNT;
+ break;
+ case CLONE_NEWUTS:
+ msg_type = AUDIT_NS_SET_UTS;
+ break;
+ case CLONE_NEWIPC:
+ msg_type = AUDIT_NS_SET_IPC;
+ break;
+ case CLONE_NEWUSER:
+ msg_type = AUDIT_NS_SET_USER;
+ break;
+ case CLONE_NEWPID:
+ msg_type = AUDIT_NS_SET_PID;
+ break;
+ case CLONE_NEWNET:
+ msg_type = AUDIT_NS_SET_NET;
+ break;
+ default:
+ return;
+ }
+ audit_log_common_recv_msg(&ab, msg_type);
+ if (!ab)
+ return;
+ old_ns = ops->get(current);
+ if (!ops->inum(old_ns))
+ sprintf(old_ns_s, "(none)");
+ else
+ sprintf(old_ns_s, "%d", ops->inum(old_ns) - PROC_DYNAMIC_FIRST);
+ audit_log_format(ab, " dev=%02x:%02x old_%sns=%s %sns=%d res=1",
+ MAJOR(sb->s_dev), MINOR(sb->s_dev),
+ ops->name, old_ns_s,
+ ops->name, ops->inum(ns) - PROC_DYNAMIC_FIRST);
+ ops->put(old_ns);
+ audit_log_end(ab);
+}
#endif /* CONFIG_NAMESPACES */

/**
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index d5353c2..2ca86cf 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -257,6 +257,9 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
goto out;
}
switch_task_namespaces(tsk, new_nsproxy);
+
+ audit_log_ns_set(ops, ei->ns);
+
out:
fput(file);
return err;
--
1.7.1
Richard Guy Briggs
2015-04-17 07:35:54 UTC
Added the macro CLONE_NEW_MASK_ALL to refer to all CLONE_NEW* flags.
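
As a standalone sketch of how such a mask can be used, the snippet below tests whether a set of clone flags requests any new namespace, and derives a five-flag variant by clearing one bit. CLONE_NEW_MASK_NO_USER and creates_namespace() are hypothetical names for illustration and are not part of this patch.

```c
/* Standard CLONE_NEW* values, defined only if no header supplied them. */
#ifndef CLONE_NEWNS
#define CLONE_NEWNS   0x00020000
#endif
#ifndef CLONE_NEWUTS
#define CLONE_NEWUTS  0x04000000
#endif
#ifndef CLONE_NEWIPC
#define CLONE_NEWIPC  0x08000000
#endif
#ifndef CLONE_NEWUSER
#define CLONE_NEWUSER 0x10000000
#endif
#ifndef CLONE_NEWPID
#define CLONE_NEWPID  0x20000000
#endif
#ifndef CLONE_NEWNET
#define CLONE_NEWNET  0x40000000
#endif

/* Same definition as the macro added by this patch. */
#define CLONE_NEW_MASK_ALL (CLONE_NEWNS \
			  | CLONE_NEWUTS \
			  | CLONE_NEWIPC \
			  | CLONE_NEWUSER \
			  | CLONE_NEWPID \
			  | CLONE_NEWNET)

/* Hypothetical: every namespace flag except the user namespace. */
#define CLONE_NEW_MASK_NO_USER (CLONE_NEW_MASK_ALL & ~CLONE_NEWUSER)

/* Hypothetical helper: nonzero if any new namespace is requested. */
static int creates_namespace(unsigned long clone_flags)
{
	return (clone_flags & CLONE_NEW_MASK_ALL) != 0;
}
```

Clearing a single bit from the full mask is one way to cover call sites that handle five of the six flags without listing them individually.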

Signed-off-by: Richard Guy Briggs <***@redhat.com>
---
include/uapi/linux/sched.h | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..21ed8f4 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -28,6 +28,12 @@
#define CLONE_NEWUSER 0x10000000 /* New user namespace */
#define CLONE_NEWPID 0x20000000 /* New pid namespace */
#define CLONE_NEWNET 0x40000000 /* New network namespace */
+#define CLONE_NEW_MASK_ALL (CLONE_NEWNS \
+ | CLONE_NEWUTS \
+ | CLONE_NEWIPC \
+ | CLONE_NEWUSER \
+ | CLONE_NEWPID \
+ | CLONE_NEWNET) /* mask of all namespace type flags */
#define CLONE_IO 0x80000000 /* Clone io context */

/*
--
1.7.1
Peter Zijlstra
2015-04-17 08:18:43 UTC
Post by Richard Guy Briggs
Added the macro CLONE_NEW_MASK_ALL to refer to all CLONE_NEW* flags.
A wee bit about why might be nice..
Richard Guy Briggs
2015-04-17 15:42:50 UTC
Post by Peter Zijlstra
Post by Richard Guy Briggs
Added the macro CLONE_NEW_MASK_ALL to refer to all CLONE_NEW* flags.
A wee bit about why might be nice..
It makes the following patch much cleaner to read:
[PATCH V6 08/10] fork: audit on creation of new namespace(s)
https://lkml.org/lkml/2015/4/17/50

I was hoping it might also make a lot of other code cleaner, but in most of
the other places where multiple CLONE_NEW* flags are used, only five of the
six are used together. OK, so it is helpful in 1 of 3:

It would actually be useful in check_unshare_flags():
https://github.com/torvalds/linux/blob/v3.17/kernel/fork.c#L1791

but not in copy_namespaces() or unshare_nsproxy_namespaces():
https://github.com/torvalds/linux/blob/v3.17/kernel/nsproxy.c#L130
https://github.com/torvalds/linux/blob/v3.17/kernel/nsproxy.c#L183

- RGB

--
Richard Guy Briggs <***@redhat.com>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545
Peter Zijlstra
2015-04-17 17:41:31 UTC
Post by Richard Guy Briggs
Post by Peter Zijlstra
Post by Richard Guy Briggs
Added the macro CLONE_NEW_MASK_ALL to refer to all CLONE_NEW* flags.
A wee bit about why might be nice..
[PATCH V6 08/10] fork: audit on creation of new namespace(s)
https://lkml.org/lkml/2015/4/17/50
I was hoping it might also make a lot of other code cleaner, but most of
the other places where multiple CLONE_NEW* flags are used, not all six
https://github.com/torvalds/linux/blob/v3.17/kernel/fork.c#L1791
https://github.com/torvalds/linux/blob/v3.17/kernel/nsproxy.c#L130
https://github.com/torvalds/linux/blob/v3.17/kernel/nsproxy.c#L183
Right, so no objections from me on this, it's just that I only saw this
one patch in isolation without context and the changelog failed on
rationale.

Does it perchance make sense to fold this patch into the next patch that
actually makes use of it?
Richard Guy Briggs
2015-04-17 22:00:04 UTC
Post by Peter Zijlstra
Post by Richard Guy Briggs
Post by Peter Zijlstra
Post by Richard Guy Briggs
Added the macro CLONE_NEW_MASK_ALL to refer to all CLONE_NEW* flags.
A wee bit about why might be nice..
[PATCH V6 08/10] fork: audit on creation of new namespace(s)
https://lkml.org/lkml/2015/4/17/50
I was hoping it might also make a lot of other code cleaner, but most of
the other places where multiple CLONE_NEW* flags are used, not all six
https://github.com/torvalds/linux/blob/v3.17/kernel/fork.c#L1791
https://github.com/torvalds/linux/blob/v3.17/kernel/nsproxy.c#L130
https://github.com/torvalds/linux/blob/v3.17/kernel/nsproxy.c#L183
Right, so no objections from me on this, it's just that I only saw this
one patch in isolation without context and the changelog failed on
rationale.
I realize you only saw a small window of this patchset, but this feels
like bike shedding about the main objective of the set...

I'll add a bit more justification and context if/when I respin for the
rest of the set.
Post by Peter Zijlstra
Does it perchance make sense to fold this patch into the next patch that
actually makes use of it?
It would if it were the only potential user. I don't want to bury a
surprise in something bigger. Is there a preferred way to use such a
macro to make the other three examples cleaner, or is that just useless
churn and obfuscation? Would there be a concise way to express all
CLONE_NEW* flags *except* user?

- RGB

Richard Guy Briggs
2015-04-17 07:35:52 UTC
Log the creation and deletion of namespace instances in all 6 types of
namespaces.

Twelve new audit message types have been introduced:
AUDIT_NS_INIT_MNT 1330 /* Record mount namespace instance creation */
AUDIT_NS_INIT_UTS 1331 /* Record UTS namespace instance creation */
AUDIT_NS_INIT_IPC 1332 /* Record IPC namespace instance creation */
AUDIT_NS_INIT_USER 1333 /* Record USER namespace instance creation */
AUDIT_NS_INIT_PID 1334 /* Record PID namespace instance creation */
AUDIT_NS_INIT_NET 1335 /* Record NET namespace instance creation */
AUDIT_NS_DEL_MNT 1336 /* Record mount namespace instance deletion */
AUDIT_NS_DEL_UTS 1337 /* Record UTS namespace instance deletion */
AUDIT_NS_DEL_IPC 1338 /* Record IPC namespace instance deletion */
AUDIT_NS_DEL_USER 1339 /* Record USER namespace instance deletion */
AUDIT_NS_DEL_PID 1340 /* Record PID namespace instance deletion */
AUDIT_NS_DEL_NET 1341 /* Record NET namespace instance deletion */

As suggested by Eric Paris, there are 12 message types: one per namespace type
for each of creation and deletion. This makes text searches easier in
conjunction with the AUDIT_NS_INFO message type (for example, searching all
records for "netns=4 ") and avoids fields disappearing per message type, which
makes ausearch more efficient.
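
To illustrate the kind of field search this enables, here is a minimal sketch of pulling one integer field out of a record line. audit_field_int() is a hypothetical helper, not an ausearch or libaudit API. It requires the key to start the record or follow a space, so that "netns" does not match inside "old_netns".

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of any audit library): look for
 * "<key>=<integer>" in one audit record and store the integer in *out.
 * Returns 1 if found, 0 otherwise (including values such as "(none)").
 */
static int audit_field_int(const char *record, const char *key, long *out)
{
	size_t klen = strlen(key);
	const char *p = record;

	if (klen == 0)
		return 0;
	while ((p = strstr(p, key)) != NULL) {
		/* Word boundary: start of record or preceded by a space. */
		int at_start = (p == record) || (p[-1] == ' ');

		if (at_start && p[klen] == '=') {
			char *end;
			long v = strtol(p + klen + 1, &end, 10);

			if (end != p + klen + 1) {
				*out = v;
				return 1;
			}
		}
		p += klen;
	}
	return 0;
}
```

Because every record type uses the same "netns=" style keys, one helper like this can be run over NS_INFO and NS_INIT/DEL records alike.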

A typical startup would look roughly like:

type=AUDIT_NS_INIT_UTS msg=audit(1408577534.868:5): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_utsns=(none) utsns=-2 res=1
type=AUDIT_NS_INIT_USER msg=audit(1408577534.868:6): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_userns=(none) userns=-3 res=1
type=AUDIT_NS_INIT_PID msg=audit(1408577534.868:7): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_pidns=(none) pidns=-4 res=1
type=AUDIT_NS_INIT_MNT msg=audit(1408577534.868:8): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_mntns=(none) mntns=0 res=1
type=AUDIT_NS_INIT_IPC msg=audit(1408577534.868:9): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_ipcns=(none) ipcns=-1 res=1
type=AUDIT_NS_INIT_NET msg=audit(1408577533.500:10): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_netns=(none) netns=2 res=1

And a CLONE action would result in:
type=AUDIT_NS_INIT_NET msg=audit(1408577535.306:81): pid=481 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 dev=00:03 old_netns=2 netns=3 res=1

While deleting a namespace would result in:
type=AUDIT_NS_DEL_MNT msg=audit(1408577552.221:85): pid=481 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 dev=00:03 mntns=4 res=1

If not "(none)", old_XXXns lists the namespace from which it was cloned.

Signed-off-by: Richard Guy Briggs <***@redhat.com>
---
fs/namespace.c | 13 +++++++++
include/linux/audit.h | 8 +++++
include/uapi/linux/audit.h | 12 ++++++++
ipc/namespace.c | 12 ++++++++
kernel/audit.c | 64 ++++++++++++++++++++++++++++++++++++++++++++
kernel/pid_namespace.c | 13 +++++++++
kernel/user_namespace.c | 13 +++++++++
kernel/utsname.c | 12 ++++++++
net/core/net_namespace.c | 12 ++++++++
9 files changed, 159 insertions(+), 0 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 182bc41..7b62543 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -24,6 +24,7 @@
#include <linux/proc_ns.h>
#include <linux/magic.h>
#include <linux/bootmem.h>
+#include <linux/audit.h>
#include "pnode.h"
#include "internal.h"

@@ -2459,6 +2460,7 @@ dput_out:

static void free_mnt_ns(struct mnt_namespace *ns)
{
+ audit_log_ns_del(AUDIT_NS_DEL_MNT, ns->proc_inum);
proc_free_inum(ns->proc_inum);
put_user_ns(ns->user_ns);
kfree(ns);
@@ -2518,6 +2520,7 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns,
new_ns = alloc_mnt_ns(user_ns);
if (IS_ERR(new_ns))
return new_ns;
+ audit_log_ns_init(AUDIT_NS_INIT_MNT, ns->proc_inum, new_ns->proc_inum);

namespace_lock();
/* First pass: copy the tree topology */
@@ -2830,6 +2833,16 @@ static void __init init_mount_tree(void)
set_fs_root(current->fs, &root);
}

+/* log the ID of init mnt namespace after audit service starts */
+static int __init mnt_ns_init_log(void)
+{
+ struct mnt_namespace *init_mnt_ns = init_task.nsproxy->mnt_ns;
+
+ audit_log_ns_init(AUDIT_NS_INIT_MNT, 0, init_mnt_ns->proc_inum);
+ return 0;
+}
+late_initcall(mnt_ns_init_log);
+
void __init mnt_init(void)
{
unsigned u;
diff --git a/include/linux/audit.h b/include/linux/audit.h
index 71698ec..b28dfb0 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -484,6 +484,9 @@ extern void audit_log_ns_info(struct task_struct *tsk);
static inline void audit_log_ns_info(struct task_struct *tsk)
{ }
#endif
+extern void audit_log_ns_init(int type, unsigned int old_inum,
+ unsigned int inum);
+extern void audit_log_ns_del(int type, unsigned int inum);

extern int audit_update_lsm_rules(void);

@@ -542,6 +545,11 @@ static inline void audit_log_task_info(struct audit_buffer *ab,
{ }
static inline void audit_log_ns_info(struct task_struct *tsk)
{ }
+static inline void audit_log_ns_init(int type, unsigned int old_inum,
+ unsigned int inum)
+{ }
+static inline void audit_log_ns_del(int type, unsigned int inum)
+{ }
#define audit_enabled 0
#endif /* CONFIG_AUDIT */
static inline void audit_log_string(struct audit_buffer *ab, const char *buf)
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 1ffb151..487cad6 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -111,6 +111,18 @@
#define AUDIT_PROCTITLE 1327 /* Proctitle emit event */
#define AUDIT_FEATURE_CHANGE 1328 /* audit log listing feature changes */
#define AUDIT_NS_INFO 1329 /* Record process namespace IDs */
+#define AUDIT_NS_INIT_MNT 1330 /* Record mount namespace instance creation */
+#define AUDIT_NS_INIT_UTS 1331 /* Record UTS namespace instance creation */
+#define AUDIT_NS_INIT_IPC 1332 /* Record IPC namespace instance creation */
+#define AUDIT_NS_INIT_USER 1333 /* Record USER namespace instance creation */
+#define AUDIT_NS_INIT_PID 1334 /* Record PID namespace instance creation */
+#define AUDIT_NS_INIT_NET 1335 /* Record NET namespace instance creation */
+#define AUDIT_NS_DEL_MNT 1336 /* Record mount namespace instance deletion */
+#define AUDIT_NS_DEL_UTS 1337 /* Record UTS namespace instance deletion */
+#define AUDIT_NS_DEL_IPC 1338 /* Record IPC namespace instance deletion */
+#define AUDIT_NS_DEL_USER 1339 /* Record USER namespace instance deletion */
+#define AUDIT_NS_DEL_PID 1340 /* Record PID namespace instance deletion */
+#define AUDIT_NS_DEL_NET 1341 /* Record NET namespace instance deletion */

#define AUDIT_AVC 1400 /* SE Linux avc denial or grant */
#define AUDIT_SELINUX_ERR 1401 /* Internal SE Linux Errors */
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 59451c1..73727ce 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -13,6 +13,7 @@
#include <linux/mount.h>
#include <linux/user_namespace.h>
#include <linux/proc_ns.h>
+#include <linux/audit.h>

#include "util.h"

@@ -41,6 +42,8 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
}
atomic_inc(&nr_ipc_ns);

+ audit_log_ns_init(AUDIT_NS_INIT_IPC, old_ns->proc_inum, ns->proc_inum);
+
sem_init_ns(ns);
msg_init_ns(ns);
shm_init_ns(ns);
@@ -119,6 +122,7 @@ static void free_ipc_ns(struct ipc_namespace *ns)
*/
ipcns_notify(IPCNS_REMOVED);
put_user_ns(ns->user_ns);
+ audit_log_ns_del(AUDIT_NS_DEL_IPC, ns->proc_inum);
proc_free_inum(ns->proc_inum);
kfree(ns);
}
@@ -197,3 +201,11 @@ const struct proc_ns_operations ipcns_operations = {
.install = ipcns_install,
.inum = ipcns_inum,
};
+
+/* log the ID of init IPC namespace after audit service starts */
+static int __init ipc_namespaces_init(void)
+{
+ audit_log_ns_init(AUDIT_NS_INIT_IPC, 0, init_ipc_ns.proc_inum);
+ return 0;
+}
+late_initcall(ipc_namespaces_init);
diff --git a/kernel/audit.c b/kernel/audit.c
index 63f32f4..e6230c4 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -1978,6 +1978,70 @@ out:
kfree(name);
}

+#ifdef CONFIG_NAMESPACES
+static char *ns_name[] = {
+	"mnt",
+	"uts",
+	"ipc",
+	"user",
+	"pid",
+	"net",
+};
+
+/**
+ * audit_log_ns_init - report a namespace instance creation
+ * @type: type of audit namespace instance created message
+ * @old_inum: the ID number of the cloned namespace instance
+ * @inum: the ID number of the new namespace instance
+ */
+void audit_log_ns_init(int type, unsigned int old_inum, unsigned int inum)
+{
+	struct audit_buffer *ab;
+	char *audit_ns_name = ns_name[type - AUDIT_NS_INIT_MNT];
+	struct vfsmount *mnt = task_active_pid_ns(current)->proc_mnt;
+	struct super_block *sb = mnt->mnt_sb;
+	char old_ns[16];
+
+	if (type < AUDIT_NS_INIT_MNT || type > AUDIT_NS_INIT_NET) {
+		WARN(1, "audit_log_ns_init: type:%d out of range", type);
+		return;
+	}
+	if (!old_inum)
+		sprintf(old_ns, "(none)");
+	else
+		sprintf(old_ns, "%d", old_inum - PROC_DYNAMIC_FIRST);
+	audit_log_common_recv_msg(&ab, type);
+	audit_log_format(ab, " dev=%02x:%02x old_%sns=%s %sns=%d res=1",
+			 MAJOR(sb->s_dev), MINOR(sb->s_dev),
+			 audit_ns_name, old_ns,
+			 audit_ns_name, inum - PROC_DYNAMIC_FIRST);
+	audit_log_end(ab);
+}
+
+/**
+ * audit_log_ns_del - report a namespace instance deleted
+ * @type: type of audit namespace instance deleted message
+ * @inum: the ID number of the namespace instance
+ */
+void audit_log_ns_del(int type, unsigned int inum)
+{
+	struct audit_buffer *ab;
+	char *audit_ns_name = ns_name[type - AUDIT_NS_DEL_MNT];
+	struct vfsmount *mnt = task_active_pid_ns(current)->proc_mnt;
+	struct super_block *sb = mnt->mnt_sb;
+
+	if (type < AUDIT_NS_DEL_MNT || type > AUDIT_NS_DEL_NET) {
+		WARN(1, "audit_log_ns_del: type:%d out of range", type);
+		return;
+	}
+	audit_log_common_recv_msg(&ab, type);
+	audit_log_format(ab, " dev=%02x:%02x %sns=%d res=1",
+			 MAJOR(sb->s_dev), MINOR(sb->s_dev), audit_ns_name,
+			 inum - PROC_DYNAMIC_FIRST);
+	audit_log_end(ab);
+}
+#endif /* CONFIG_NAMESPACES */
+
/**
* audit_log_end - end one audit record
* @ab: the audit_buffer
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index db95d8e..d28fd14 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -18,6 +18,7 @@
#include <linux/proc_ns.h>
#include <linux/reboot.h>
#include <linux/export.h>
+#include <linux/audit.h>

struct pid_cache {
int nr_ids;
@@ -109,6 +110,9 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
if (err)
goto out_free_map;

+ audit_log_ns_init(AUDIT_NS_INIT_PID, parent_pid_ns->proc_inum,
+ ns->proc_inum);
+
kref_init(&ns->kref);
ns->level = level;
ns->parent = get_pid_ns(parent_pid_ns);
@@ -142,6 +146,7 @@ static void destroy_pid_namespace(struct pid_namespace *ns)
{
int i;

+ audit_log_ns_del(AUDIT_NS_DEL_PID, ns->proc_inum);
proc_free_inum(ns->proc_inum);
for (i = 0; i < PIDMAP_ENTRIES; i++)
kfree(ns->pidmap[i].page);
@@ -388,3 +393,11 @@ static __init int pid_namespaces_init(void)
}

__initcall(pid_namespaces_init);
+
+/* log the ID of init PID namespace after audit service starts */
+static __init int pid_namespaces_late_init(void)
+{
+ audit_log_ns_init(AUDIT_NS_INIT_PID, 0, init_pid_ns.proc_inum);
+ return 0;
+}
+late_initcall(pid_namespaces_late_init);
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index fcc0256..89c2517 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -22,6 +22,7 @@
#include <linux/ctype.h>
#include <linux/projid.h>
#include <linux/fs_struct.h>
+#include <linux/audit.h>

static struct kmem_cache *user_ns_cachep __read_mostly;

@@ -92,6 +93,9 @@ int create_user_ns(struct cred *new)
return ret;
}

+ audit_log_ns_init(AUDIT_NS_INIT_USER, parent_ns->proc_inum,
+ ns->proc_inum);
+
atomic_set(&ns->count, 1);
/* Leave the new->user_ns reference with the new user namespace. */
ns->parent = parent_ns;
@@ -136,6 +140,7 @@ void free_user_ns(struct user_namespace *ns)
#ifdef CONFIG_PERSISTENT_KEYRINGS
key_put(ns->persistent_keyring_register);
#endif
+ audit_log_ns_del(AUDIT_NS_DEL_USER, ns->proc_inum);
proc_free_inum(ns->proc_inum);
kmem_cache_free(user_ns_cachep, ns);
ns = parent;
@@ -909,3 +914,11 @@ static __init int user_namespaces_init(void)
return 0;
}
subsys_initcall(user_namespaces_init);
+
+/* log the ID of init user namespace after audit service starts */
+static __init int user_namespaces_late_init(void)
+{
+ audit_log_ns_init(AUDIT_NS_INIT_USER, 0, init_user_ns.proc_inum);
+ return 0;
+}
+late_initcall(user_namespaces_late_init);
diff --git a/kernel/utsname.c b/kernel/utsname.c
index fd39312..fa21e8d 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -16,6 +16,7 @@
#include <linux/slab.h>
#include <linux/user_namespace.h>
#include <linux/proc_ns.h>
+#include <linux/audit.h>

static struct uts_namespace *create_uts_ns(void)
{
@@ -48,6 +49,8 @@ static struct uts_namespace *clone_uts_ns(struct user_namespace *user_ns,
return ERR_PTR(err);
}

+ audit_log_ns_init(AUDIT_NS_INIT_UTS, old_ns->proc_inum, ns->proc_inum);
+
down_read(&uts_sem);
memcpy(&ns->name, &old_ns->name, sizeof(ns->name));
ns->user_ns = get_user_ns(user_ns);
@@ -84,6 +87,7 @@ void free_uts_ns(struct kref *kref)

ns = container_of(kref, struct uts_namespace, kref);
put_user_ns(ns->user_ns);
+ audit_log_ns_del(AUDIT_NS_DEL_UTS, ns->proc_inum);
proc_free_inum(ns->proc_inum);
kfree(ns);
}
@@ -138,3 +142,11 @@ const struct proc_ns_operations utsns_operations = {
.install = utsns_install,
.inum = utsns_inum,
};
+
+/* log the ID of init UTS namespace after audit service starts */
+static int __init uts_namespaces_init(void)
+{
+ audit_log_ns_init(AUDIT_NS_INIT_UTS, 0, init_uts_ns.proc_inum);
+ return 0;
+}
+late_initcall(uts_namespaces_init);
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 85b6269..562eb85 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -17,6 +17,7 @@
#include <linux/user_namespace.h>
#include <net/net_namespace.h>
#include <net/netns/generic.h>
+#include <linux/audit.h>

/*
* Our network namespace constructor/destructor lists
@@ -253,6 +254,8 @@ struct net *copy_net_ns(unsigned long flags,
mutex_lock(&net_mutex);
rv = setup_net(net, user_ns);
if (rv == 0) {
+ audit_log_ns_init(AUDIT_NS_INIT_NET, old_net->proc_inum,
+ net->proc_inum);
rtnl_lock();
list_add_tail_rcu(&net->list, &net_namespace_list);
rtnl_unlock();
@@ -389,6 +392,7 @@ static __net_init int net_ns_net_init(struct net *net)

static __net_exit void net_ns_net_exit(struct net *net)
{
+ audit_log_ns_del(AUDIT_NS_DEL_NET, net->proc_inum);
proc_free_inum(net->proc_inum);
}

@@ -435,6 +439,14 @@ static int __init net_ns_init(void)

pure_initcall(net_ns_init);

+/* log the ID of init_net namespace after audit service starts */
+static int __init net_ns_init_log(void)
+{
+ audit_log_ns_init(AUDIT_NS_INIT_NET, 0, init_net.proc_inum);
+ return 0;
+}
+late_initcall(net_ns_init_log);
+
#ifdef CONFIG_NET_NS
static int __register_pernet_operations(struct list_head *list,
struct pernet_operations *ops)
--
1.7.1
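
A reader going through the kernel/audit.c hunk may notice that ns_name[] is indexed in the declarations, before the range check on type runs. The following user-space sketch shows the check-then-index ordering in isolation. It is not part of the patch: ns_init_type_name is a hypothetical helper, and the two constants simply repeat the numbers added to include/uapi/linux/audit.h above.

```c
#include <stddef.h>

/* Stand-ins repeating the message type numbers from this patch. */
#define AUDIT_NS_INIT_MNT 1330
#define AUDIT_NS_INIT_NET 1335

static const char *const ns_name[] = {
	"mnt", "uts", "ipc", "user", "pid", "net",
};

/*
 * Hypothetical helper: map an AUDIT_NS_INIT_* type to its short name.
 * The range is validated first, so the table is never indexed with an
 * unchecked value; NULL is returned for anything out of range.
 */
static const char *ns_init_type_name(int type)
{
	if (type < AUDIT_NS_INIT_MNT || type > AUDIT_NS_INIT_NET)
		return NULL;
	return ns_name[type - AUDIT_NS_INIT_MNT];
}
```

The same shape would apply to the AUDIT_NS_DEL_* range with its own base constant.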
Steve Grubb
2015-05-05 14:22:32 UTC
Hello,

I think there needs to be some more discussion around this. It seems like this
is not exactly recording things that are useful for audit.
Post by Richard Guy Briggs
Log the creation and deletion of namespace instances in all 6 types of
namespaces.
AUDIT_NS_INIT_MNT 1330 /* Record mount namespace instance creation */
AUDIT_NS_INIT_UTS 1331 /* Record UTS namespace instance creation */
AUDIT_NS_INIT_IPC 1332 /* Record IPC namespace instance creation */
AUDIT_NS_INIT_USER 1333 /* Record USER namespace instance creation */
AUDIT_NS_INIT_PID 1334 /* Record PID namespace instance creation */
AUDIT_NS_INIT_NET 1335 /* Record NET namespace instance creation */
AUDIT_NS_DEL_MNT 1336 /* Record mount namespace instance deletion */
AUDIT_NS_DEL_UTS 1337 /* Record UTS namespace instance deletion */
AUDIT_NS_DEL_IPC 1338 /* Record IPC namespace instance deletion */
AUDIT_NS_DEL_USER 1339 /* Record USER namespace instance deletion */
AUDIT_NS_DEL_PID 1340 /* Record PID namespace instance deletion */
AUDIT_NS_DEL_NET 1341 /* Record NET namespace instance deletion */
The requirements for auditing of containers should be derived from VPP. In it,
it asks for selectable auditing, selective audit, and selective audit review.
What this means is that we need the container and all its children to have one
identifier that is inserted into all the events that are associated with the
container.

With this, it's possible to do a search for all events related to a container.
It's possible to exclude events from a container. It's possible to not get any
events.

The requirements also call for the identification of the subject. This
means that the event should be bound to a syscall such as clone, setns, or
unshare.

Also, any user space events originating inside the container need to have the
container ID added to the user space event - just like auid and session id.

Recording each instance of a name space is giving me something that I cannot
use to do queries required by the security target. Given these events, how do
I locate a web server event where it accesses a watched file? That
authentication failed? That an update within the container failed?

The requirements are that we have to log the creation, suspension, migration,
and termination of a container. The requirements are not on the individual
name space.

Maybe I'm missing how these events give me that. But I'd like to hear how I
would be able to meet requirements with these 12 events.

-Steve
Aristeu Rozanski
2015-05-05 14:31:20 UTC
Hi Steve,
Post by Steve Grubb
The requirements for auditing of containers should be derived from VPP. In it,
it asks for selectable auditing, selective audit, and selective audit review.
What this means is that we need the container and all its children to have one
identifier that is inserted into all the events that are associated with the
container.
With this, it's possible to do a search for all events related to a container.
It's possible to exclude events from a container. It's possible to not get any
events.
The requirements also call for the identification of the subject. This
means that the event should be bound to a syscall such as clone, setns, or
unshare.
Also, any user space events originating inside the container need to have the
container ID added to the user space event - just like auid and session id.
Recording each instance of a name space is giving me something that I cannot
use to do queries required by the security target. Given these events, how do
I locate a web server event where it accesses a watched file? That
authentication failed? That an update within the container failed?
The requirements are that we have to log the creation, suspension, migration,
and termination of a container. The requirements are not on the individual
name space.
Maybe I'm missing how these events give me that. But I'd like to hear how I
would be able to meet requirements with these 12 events.
What about cases where you don't use lxc or libvirt to create namespaces? It's
easier if the logging is done per namespace. If the namespaces are created
by a container manager, it can generate a new event announcing that it
created a container named "foo" with these namespaces: x, y, z, w, and
from that you can piece together everything that happened. Userspace
tools can adapt to using namespaces and the idea of a container
to make it easier to look up events, instead of relying on a number
that might not be there (think of someone using unshare, ip netns, ...). This
was discussed in the past: having the concept of a "container" in
kernel space is not going to happen, so userspace should deal with
it.
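
A rough user-space sketch of that lookup, under stated assumptions: struct container_desc stands for whatever a container manager would announce, and the matching rule (every namespace field present in a record must equal the container's ID for that type) is an illustrative choice, not something defined by this patch set. Both helper names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define NS_TYPES 6

/* Field names as they appear in the proposed NS_INFO record. */
static const char *const ns_keys[NS_TYPES] = {
	"mntns", "utsns", "ipcns", "userns", "pidns", "netns",
};

/* Hypothetical description announced by a container manager. */
struct container_desc {
	const char *name;
	int ns[NS_TYPES];	/* same order as ns_keys[] */
};

/* Find a whole "key=<int>" field in a record; 1 on success, 0 otherwise. */
static int audit_field_int(const char *rec, const char *key, int *out)
{
	size_t klen = strlen(key);
	const char *p = rec;

	while ((p = strstr(p, key)) != NULL) {
		/* Reject matches inside longer keys such as old_netns. */
		int at_start = (p == rec) || (p[-1] == ' ');

		if (at_start && p[klen] == '=') {
			char *end;
			long v = strtol(p + klen + 1, &end, 10);

			if (end != p + klen + 1) {
				*out = (int)v;
				return 1;
			}
		}
		p += klen;
	}
	return 0;
}

/* 1 if the record carries namespace fields and all of them match. */
static int record_matches(const char *rec, const struct container_desc *c)
{
	int seen = 0;

	for (size_t i = 0; i < NS_TYPES; i++) {
		int id;

		if (!audit_field_int(rec, ns_keys[i], &id))
			continue;
		if (id != c->ns[i])
			return 0;
		seen = 1;
	}
	return seen;
}
```

A tool in the style of ausearch could apply record_matches to each NS_INFO record and then pull in the other records that share the same event serial number.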
--
Aristeu
Steve Grubb
2015-05-05 14:46:44 UTC
Post by Aristeu Rozanski
Hi Steve,
Post by Steve Grubb
The requirements for auditing of containers should be derived from VPP. In
it, it asks for selectable auditing, selective audit, and selective audit
review. What this means is that we need the container and all its
children to have one identifier that is inserted into all the events that
are associated with the container.
With this, it's possible to do a search for all events related to a
container. It's possible to exclude events from a container. It's possible
to not get any events.
The requirements also call for the identification of the subject. This
means that the event should be bound to a syscall such as clone, setns, or
unshare.
Also, any user space events originating inside the container need to have
the container ID added to the user space event - just like auid and
session id.
Recording each instance of a name space is giving me something that I
cannot use to do queries required by the security target. Given these
events, how do I locate a web server event where it accesses a watched
file? That authentication failed? That an update within the container
failed?
The requirements are that we have to log the creation, suspension,
migration, and termination of a container. The requirements are not on
the individual name space.
Maybe I'm missing how these events give me that. But I'd like to hear how
I would be able to meet requirements with these 12 events.
What about cases where you don't use lxc or libvirt to create namespaces?
There's a pretty good chance that we don't care. We've had file system
namespace for about 8 or 9 years and we never needed to have a namespace
identifier added.
Post by Aristeu Rozanski
It's easier if the logging is done per namespace. If the namespaces are created
by a container manager, it can generate a new event announcing that it
created a container named "foo" with these namespaces: x, y, z, w, and
from that you can piece together everything that happened.
OK, if they are emitted they should be an auxiliary record to the clone, setns,
or unshare system calls. But let's go down this path. We have 6 or so name spaces.
These identifiers will need to be added to every single event in the system so
that I can figure out what event belongs to which container.
Post by Aristeu Rozanski
Userspace tools can adapt to using namespaces and the idea of a
container to make it easier to look up events, instead of relying on a
number that might not be there (think of someone using unshare, ip netns, ...).
That's what I am trying to do...figure out how I can use these identifiers to
see if this actually solves the problem. This is why I wanted to state the actual
requirements. It's easy to lose the overall view.

Also, I am concerned about how much extra disk space this is going to eat up.
Post by Aristeu Rozanski
Having the concept of a "container" in kernel space was discussed in the
past and it's not going to happen, so userspace should deal with
it.
This is what I am asking for help with. How do I locate an authentication
event from a container using the information in these events?

-Steve
Eric W. Biederman
2015-05-05 14:56:03 UTC
Permalink
Post by Steve Grubb
The requirements for auditing of containers should be derived from VPP. In it,
it asks for selectable auditing, selective audit, and selective audit review.
What this means is that we need the container and all its children to have one
identifier that is inserted into all the events that are associated with the
container.
That is technically impossible. Nested containers exist.

That is when container G is nested in container F which is in turn
nested in container E which is in turn nested in container D which is in
turn nested in container C which is in turn nested in container B which
is nested in container A there is no one label you can put on audit
messages from container G which is the ``correct'' one.

Or are you proposing that something in container G have labels
A B C D E F G included on every audit message? That introduces enough
complexity in generating and parsing the messages I wouldn't trust those
messages as the least bug in generation and parsing would be a security
issue.
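
To make the nesting argument concrete, here is a minimal userspace sketch, not
kernel code. The struct container type and both helpers are hypothetical names
invented for this illustration. It only shows that the label for a task in
container G has to carry every ancestor, so its size grows with nesting depth.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical model of a nested container: a name plus a parent pointer. */
struct container {
	const char *name;
	const struct container *parent;	/* NULL for the outermost container */
};

/* Count how many containers enclose (and include) c. */
static int nesting_depth(const struct container *c)
{
	int depth = 0;

	for (; c; c = c->parent)
		depth++;
	return depth;
}

/*
 * Write the full ancestry, outermost first, separated by spaces.
 * Returns the number of characters the complete label needs.
 */
static size_t label_chain(const struct container *c, char *buf, size_t len)
{
	size_t used;

	if (!c) {
		if (len)
			buf[0] = '\0';
		return 0;
	}
	used = label_chain(c->parent, buf, len);
	if (used < len)
		used += (size_t)snprintf(buf + used, len - used, "%s%s",
					 used ? " " : "", c->name);
	return used;
}
```

For a chain A through G, nesting_depth() gives 7 and label_chain() produces
"A B C D E F G"; every record from G would need all seven labels.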

What in the world is VPP? It sounds like some non-public thing.
Certainly it has never been a part of the public container discussion
and as such it appears to be completely ridiculous to bring up in a
public discussion.

Eric
Steve Grubb
2015-05-05 15:16:56 UTC
Permalink
Post by Eric W. Biederman
Post by Steve Grubb
The requirements for auditing of containers should be derived from VPP. In
it, it asks for selectable auditing, selective audit, and selective audit
review. What this means is that we need the container and all its
children to have one identifier that is inserted into all the events that
are associated with the container.
That is technically impossible. Nested containers exist.
OK, then let's talk about that, too. When something is 2 layers deep, the
outside world cannot make sense of it. The inner one can be a loopback mounted
file in the outer one. That means that I need the container itself to be
responsible for events so that things are recorded using paths, uids, and pids
that make sense to it. It can enrich the events and send them to the outer
container.
Post by Eric W. Biederman
That is when container G is nested in container F which is in turn
nested in container E which is in turn nested in container D which is in
turn nested in container C which is in turn nested in container B which
is nested in container A there is no one label you can put on audit
messages from container G which is the ``correct'' one.
Or are you proposing that something in container G have labels
A B C D E F G included on every audit message?
We need audit events either to be globally tagged so that the outside
world understands what is happening no matter how deep, or we need each layer
to be responsible for itself. This means having an audit rule match engine for
each namespace, like netfilter is to networking.
Post by Eric W. Biederman
That introduces enough complexity in generating and parsing the messages I
wouldn't trust those messages as the least bug in generation and parsing
would be a security issue.
That goes with the territory.
Post by Eric W. Biederman
What in the world is VPP?
Virtualization Protection Profile. Before people say it doesn't apply, it kind
of does. It defines the necessary security mechanisms for either full blown
virt like QEMU/Xen based or it gives enough wiggle room for containers and
other types of VMs. Specifically, it defines the audit requirements needed for
this kind of technology.
Post by Eric W. Biederman
It sounds like some non-public thing. Certainly it has never been a
part of the public container discussion and as such it appears to be
completely ridiculous to bring up in a public discussion.
No, it's a public thing. Audit requirements start in section 5.2:

https://www.niap-ccevs.org/pp/PP_SV_V1.0/

-Steve
Richard Guy Briggs
2015-05-12 19:57:59 UTC
Permalink
Post by Steve Grubb
Hello,
I think there needs to be some more discussion around this. It seems like this
is not exactly recording things that are useful for audit.
It seems to me that either audit has to assemble that information, or
the kernel has to do so. The kernel doesn't know about containers
(yet?).
Post by Steve Grubb
Post by Richard Guy Briggs
Log the creation and deletion of namespace instances in all 6 types of
namespaces.
AUDIT_NS_INIT_MNT 1330 /* Record mount namespace instance creation */
AUDIT_NS_INIT_UTS 1331 /* Record UTS namespace instance creation */
AUDIT_NS_INIT_IPC 1332 /* Record IPC namespace instance creation */
AUDIT_NS_INIT_USER 1333 /* Record USER namespace instance creation */
AUDIT_NS_INIT_PID 1334 /* Record PID namespace instance creation */
AUDIT_NS_INIT_NET 1335 /* Record NET namespace instance creation */
AUDIT_NS_DEL_MNT 1336 /* Record mount namespace instance deletion */
AUDIT_NS_DEL_UTS 1337 /* Record UTS namespace instance deletion */
AUDIT_NS_DEL_IPC 1338 /* Record IPC namespace instance deletion */
AUDIT_NS_DEL_USER 1339 /* Record USER namespace instance deletion */
AUDIT_NS_DEL_PID 1340 /* Record PID namespace instance deletion */
AUDIT_NS_DEL_NET 1341 /* Record NET namespace instance deletion */
The requirements for auditing of containers should be derived from VPP. In it,
it asks for selectable auditing, selective audit, and selective audit review.
What this means is that we need the container and all its children to have one
identifier that is inserted into all the events that are associated with the
container.
Is that requirement for the records that are sent from the kernel, or
for the records stored by auditd, or by another facility that delivers
those records to a final consumer?
Post by Steve Grubb
With this, it's possible to do a search for all events related to a container.
It's possible to exclude events from a container. It's possible to not get any
events.
The requirements also call out for the identification of the subject. This
means that the event should be bound to a syscall such as clone, setns, or
unshare.
Is it useful to have a reference of the init namespace set from which
all others are spawned?

If it isn't bound, I assume the subject should be added to the message
format? I'm thinking of messages without an audit_context such as audit
user messages (such as AUDIT_NS_INFO and AUDIT_VIRT_CONTROL).

For now, we should not need to log namespaces with AUDIT_FEATURE_CHANGE
or AUDIT_CONFIG_CHANGE messages since only initial user namespace with
initial pid namespace has permission to do so. This will need to be
addressed by having non-init config changes be limited to that container
or set of namespaces and possibly its children. The other possibility
is to add the subject to the stand-alone message.
Post by Steve Grubb
Also, any user space events originating inside the container need to have the
container ID added to the user space event - just like auid and session id.
This sounds like every task needs to record a container ID since that
information is otherwise unknown by the kernel except by what might be
provided by an audit user message such as AUDIT_VIRT_CONTROL or possibly
the new AUDIT_NS_INFO request. It could be stored in struct task_struct
or in struct audit_context. I don't have a suggestion on how to get
that information securely into the kernel.
Post by Steve Grubb
Recording each instance of a name space is giving me something that I cannot
use to do queries required by the security target. Given these events, how do
I locate a web server event where it accesses a watched file? That
authentication failed? That an update within the container failed?
The requirements are that we have to log the creation, suspension, migration,
and termination of a container. The requirements are not on the individual
name space.
Ok. Do we have a robust definition of a container? Where is that
definition managed? If it is a userspace concept, then I think either
userspace should be assembling this information, or providing that
information to the entity that will be expected to know about and
provide it.
Post by Steve Grubb
Maybe I'm missing how these events give me that. But I'd like to hear how I
would be able to meet requirements with these 12 events.
Adding the infrastructure to give each of those 12 events an audit
context to be able to give meaningful subject fields in audit records
appears to require adding a struct task_struct argument to calls to
copy_mnt_ns(), copy_utsname(), copy_ipcs(), copy_pid_ns(),
copy_net_ns(), create_user_ns() unless I use current. I think we must
use current since the userns is created before the spawned process is
mature or has an audit context in the case of clone.

Either that, or I have misunderstood and I should be stashing this
namespace ID information in an audit_aux_data structure or a more
permanent part of struct audit_context to be printed when required on
syscall exit. I'm trying to think through if it is needed in any
non-syscall audit messages.

Another RFC patch set coming...
Post by Steve Grubb
-Steve
Post by Richard Guy Briggs
As suggested by Eric Paris, there are 12 message types, one for each of
creation and deletion, one for each type of namespace so that text searches
are easier in conjunction with the AUDIT_NS_INFO message type, being able
to search for all records such as "netns=4 " and to avoid fields
disappearing per message type to make ausearch more efficient.
type=AUDIT_NS_INIT_UTS msg=audit(1408577534.868:5): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_utsns=(none) utsns=-2 res=1
type=AUDIT_NS_INIT_USER msg=audit(1408577534.868:6): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_userns=(none) userns=-3 res=1
type=AUDIT_NS_INIT_PID msg=audit(1408577534.868:7): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_pidns=(none) pidns=-4 res=1
type=AUDIT_NS_INIT_MNT msg=audit(1408577534.868:8): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_mntns=(none) mntns=0 res=1
type=AUDIT_NS_INIT_IPC msg=audit(1408577534.868:9): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_ipcns=(none) ipcns=-1 res=1
type=AUDIT_NS_INIT_NET msg=audit(1408577533.500:10): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_netns=(none) netns=2 res=1
type=AUDIT_NS_INIT_NET msg=audit(1408577535.306:81): pid=481 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 dev=00:03 old_netns=2 netns=3 res=1
type=AUDIT_NS_DEL_MNT msg=audit(1408577552.221:85): pid=481 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 dev=00:03 mntns=4 res=1
If not "(none)", old_XXXns lists the namespace from which it was cloned.
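
The negative IDs in the sample records follow from the "proc inode minus base
offset" rule. A minimal userspace sketch of that arithmetic is below. The
constant values are an assumption (taken from memory of fs/proc and
include/linux/proc_ns.h of that era, so check them against your tree), and
ns_id_from_inum() is a hypothetical helper, not part of the patch.

```c
#include <stdio.h>

/* Assumed values; verify against the kernel tree being patched. */
#define PROC_DYNAMIC_FIRST 0xF0000000U	/* first dynamically allocated proc inode */
#define PROC_IPC_INIT_INO  0xEFFFFFFFU
#define PROC_UTS_INIT_INO  0xEFFFFFFEU
#define PROC_USER_INIT_INO 0xEFFFFFFDU
#define PROC_PID_INIT_INO  0xEFFFFFFCU

/*
 * Namespace ID as it appears in the records: the proc inode number minus
 * the base offset, interpreted as a signed value. The fixed init inodes sit
 * just below the base, which is why they come out negative.
 */
static int ns_id_from_inum(unsigned int inum)
{
	return (int)(inum - PROC_DYNAMIC_FIRST);
}
```

With these values the init IPC, UTS, user and PID namespaces map to -1, -2,
-3 and -4, matching ipcns=-1, utsns=-2, userns=-3 and pidns=-4 above, while
dynamically allocated inodes give small non-negative IDs such as netns=2.
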
---
fs/namespace.c | 13 +++++++++
include/linux/audit.h | 8 +++++
include/uapi/linux/audit.h | 12 ++++++++
ipc/namespace.c | 12 ++++++++
kernel/audit.c | 64 ++++++++++++++++++++++++++++++++++++++++++++
kernel/pid_namespace.c | 13 +++++++++
kernel/user_namespace.c | 13 +++++++++
kernel/utsname.c | 12 ++++++++
net/core/net_namespace.c | 12 ++++++++
9 files changed, 159 insertions(+), 0 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 182bc41..7b62543 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -24,6 +24,7 @@
#include <linux/proc_ns.h>
#include <linux/magic.h>
#include <linux/bootmem.h>
+#include <linux/audit.h>
#include "pnode.h"
#include "internal.h"
static void free_mnt_ns(struct mnt_namespace *ns)
{
+ audit_log_ns_del(AUDIT_NS_DEL_MNT, ns->proc_inum);
proc_free_inum(ns->proc_inum);
put_user_ns(ns->user_ns);
kfree(ns);
@@ -2518,6 +2520,7 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags,
struct mnt_namespace *ns, new_ns = alloc_mnt_ns(user_ns);
if (IS_ERR(new_ns))
return new_ns;
+ audit_log_ns_init(AUDIT_NS_INIT_MNT, ns->proc_inum, new_ns->proc_inum);
namespace_lock();
/* First pass: copy the tree topology */
@@ -2830,6 +2833,16 @@ static void __init init_mount_tree(void)
set_fs_root(current->fs, &root);
}
+/* log the ID of init mnt namespace after audit service starts */
+static int __init mnt_ns_init_log(void)
+{
+ struct mnt_namespace *init_mnt_ns = init_task.nsproxy->mnt_ns;
+
+ audit_log_ns_init(AUDIT_NS_INIT_MNT, 0, init_mnt_ns->proc_inum);
+ return 0;
+}
+late_initcall(mnt_ns_init_log);
+
void __init mnt_init(void)
{
unsigned u;
diff --git a/include/linux/audit.h b/include/linux/audit.h
index 71698ec..b28dfb0 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -484,6 +484,9 @@ extern void audit_log_ns_info(struct task_struct *tsk);
static inline void audit_log_ns_info(struct task_struct *tsk)
{ }
#endif
+extern void audit_log_ns_init(int type, unsigned int old_inum,
+ unsigned int inum);
+extern void audit_log_ns_del(int type, unsigned int inum);
extern int audit_update_lsm_rules(void);
@@ -542,6 +545,11 @@ static inline void audit_log_task_info(struct audit_buffer *ab,
{ }
static inline void audit_log_ns_info(struct task_struct *tsk)
{ }
+static inline void audit_log_ns_init(int type, unsigned int old_inum,
+ unsigned int inum)
+{ }
+static inline void audit_log_ns_del(int type, unsigned int inum)
+{ }
#define audit_enabled 0
#endif /* CONFIG_AUDIT */
static inline void audit_log_string(struct audit_buffer *ab, const char *buf)
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 1ffb151..487cad6 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -111,6 +111,18 @@
#define AUDIT_PROCTITLE 1327 /* Proctitle emit event */
#define AUDIT_FEATURE_CHANGE 1328 /* audit log listing feature changes */
#define AUDIT_NS_INFO 1329 /* Record process namespace IDs */
+#define AUDIT_NS_INIT_MNT 1330 /* Record mount namespace instance creation */
+#define AUDIT_NS_INIT_UTS 1331 /* Record UTS namespace instance creation */
+#define AUDIT_NS_INIT_IPC 1332 /* Record IPC namespace instance creation */
+#define AUDIT_NS_INIT_USER 1333 /* Record USER namespace instance creation */
+#define AUDIT_NS_INIT_PID 1334 /* Record PID namespace instance creation */
+#define AUDIT_NS_INIT_NET 1335 /* Record NET namespace instance creation */
+#define AUDIT_NS_DEL_MNT 1336 /* Record mount namespace instance deletion */
+#define AUDIT_NS_DEL_UTS 1337 /* Record UTS namespace instance deletion */
+#define AUDIT_NS_DEL_IPC 1338 /* Record IPC namespace instance deletion */
+#define AUDIT_NS_DEL_USER 1339 /* Record USER namespace instance deletion */
+#define AUDIT_NS_DEL_PID 1340 /* Record PID namespace instance deletion */
+#define AUDIT_NS_DEL_NET 1341 /* Record NET namespace instance deletion */
#define AUDIT_AVC 1400 /* SE Linux avc denial or grant */
#define AUDIT_SELINUX_ERR 1401 /* Internal SE Linux Errors */
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 59451c1..73727ce 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -13,6 +13,7 @@
#include <linux/mount.h>
#include <linux/user_namespace.h>
#include <linux/proc_ns.h>
+#include <linux/audit.h>
#include "util.h"
@@ -41,6 +42,8 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
}
atomic_inc(&nr_ipc_ns);
+ audit_log_ns_init(AUDIT_NS_INIT_IPC, old_ns->proc_inum, ns->proc_inum);
+
sem_init_ns(ns);
msg_init_ns(ns);
shm_init_ns(ns);
@@ -119,6 +122,7 @@ static void free_ipc_ns(struct ipc_namespace *ns)
*/
ipcns_notify(IPCNS_REMOVED);
put_user_ns(ns->user_ns);
+ audit_log_ns_del(AUDIT_NS_DEL_IPC, ns->proc_inum);
proc_free_inum(ns->proc_inum);
kfree(ns);
}
@@ -197,3 +201,11 @@ const struct proc_ns_operations ipcns_operations = {
.install = ipcns_install,
.inum = ipcns_inum,
};
+
+/* log the ID of init IPC namespace after audit service starts */
+static int __init ipc_namespaces_init(void)
+{
+ audit_log_ns_init(AUDIT_NS_INIT_IPC, 0, init_ipc_ns.proc_inum);
+ return 0;
+}
+late_initcall(ipc_namespaces_init);
diff --git a/kernel/audit.c b/kernel/audit.c
index 63f32f4..e6230c4 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
kfree(name);
}
+#ifdef CONFIG_NAMESPACES
+static char *ns_name[] = {
+ "mnt",
+ "uts",
+ "ipc",
+ "user",
+ "pid",
+ "net",
+};
+
+/**
+ * audit_log_ns_init - report a namespace instance creation
+ */
+void audit_log_ns_init(int type, unsigned int old_inum, unsigned int inum)
+{
+ struct audit_buffer *ab;
+ char *audit_ns_name;
+ struct vfsmount *mnt = task_active_pid_ns(current)->proc_mnt;
+ struct super_block *sb = mnt->mnt_sb;
+ char old_ns[16];
+
+ if (type < AUDIT_NS_INIT_MNT || type > AUDIT_NS_INIT_NET) {
+ WARN(1, "audit_log_ns_init: type:%d out of range", type);
+ return;
+ }
+ /* index the name table only after type has been range-checked */
+ audit_ns_name = ns_name[type - AUDIT_NS_INIT_MNT];
+ if (!old_inum)
+ sprintf(old_ns, "(none)");
+ else
+ sprintf(old_ns, "%d", old_inum - PROC_DYNAMIC_FIRST);
+ audit_log_common_recv_msg(&ab, type);
+ audit_log_format(ab, " dev=%02x:%02x old_%sns=%s %sns=%d res=1",
+ MAJOR(sb->s_dev), MINOR(sb->s_dev),
+ audit_ns_name, old_ns,
+ audit_ns_name, inum - PROC_DYNAMIC_FIRST);
+ audit_log_end(ab);
+}
+
+/**
+ * audit_log_ns_del - report a namespace instance deleted
+ */
+void audit_log_ns_del(int type, unsigned int inum)
+{
+ struct audit_buffer *ab;
+ char *audit_ns_name;
+ struct vfsmount *mnt = task_active_pid_ns(current)->proc_mnt;
+ struct super_block *sb = mnt->mnt_sb;
+
+ if (type < AUDIT_NS_DEL_MNT || type > AUDIT_NS_DEL_NET) {
+ WARN(1, "audit_log_ns_del: type:%d out of range", type);
+ return;
+ }
+ /* index the name table only after type has been range-checked */
+ audit_ns_name = ns_name[type - AUDIT_NS_DEL_MNT];
+ audit_log_common_recv_msg(&ab, type);
+ audit_log_format(ab, " dev=%02x:%02x %sns=%d res=1",
+ MAJOR(sb->s_dev), MINOR(sb->s_dev), audit_ns_name,
+ inum - PROC_DYNAMIC_FIRST);
+ audit_log_end(ab);
+}
+#endif /* CONFIG_NAMESPACES */
+
/**
* audit_log_end - end one audit record
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index db95d8e..d28fd14 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -18,6 +18,7 @@
#include <linux/proc_ns.h>
#include <linux/reboot.h>
#include <linux/export.h>
+#include <linux/audit.h>
struct pid_cache {
int nr_ids;
@@ -109,6 +110,9 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
if (err)
goto out_free_map;
+ audit_log_ns_init(AUDIT_NS_INIT_PID, parent_pid_ns->proc_inum,
+ ns->proc_inum);
+
kref_init(&ns->kref);
ns->level = level;
ns->parent = get_pid_ns(parent_pid_ns);
@@ -142,6 +146,7 @@ static void destroy_pid_namespace(struct pid_namespace *ns)
{
int i;
+ audit_log_ns_del(AUDIT_NS_DEL_PID, ns->proc_inum);
proc_free_inum(ns->proc_inum);
for (i = 0; i < PIDMAP_ENTRIES; i++)
kfree(ns->pidmap[i].page);
@@ -388,3 +393,11 @@ static __init int pid_namespaces_init(void)
}
__initcall(pid_namespaces_init);
+
+/* log the ID of init PID namespace after audit service starts */
+static __init int pid_namespaces_late_init(void)
+{
+ audit_log_ns_init(AUDIT_NS_INIT_PID, 0, init_pid_ns.proc_inum);
+ return 0;
+}
+late_initcall(pid_namespaces_late_init);
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index fcc0256..89c2517 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -22,6 +22,7 @@
#include <linux/ctype.h>
#include <linux/projid.h>
#include <linux/fs_struct.h>
+#include <linux/audit.h>
static struct kmem_cache *user_ns_cachep __read_mostly;
@@ -92,6 +93,9 @@ int create_user_ns(struct cred *new)
return ret;
}
+ audit_log_ns_init(AUDIT_NS_INIT_USER, parent_ns->proc_inum,
+ ns->proc_inum);
+
atomic_set(&ns->count, 1);
/* Leave the new->user_ns reference with the new user namespace. */
ns->parent = parent_ns;
@@ -136,6 +140,7 @@ void free_user_ns(struct user_namespace *ns)
#ifdef CONFIG_PERSISTENT_KEYRINGS
key_put(ns->persistent_keyring_register);
#endif
+ audit_log_ns_del(AUDIT_NS_DEL_USER, ns->proc_inum);
proc_free_inum(ns->proc_inum);
kmem_cache_free(user_ns_cachep, ns);
ns = parent;
@@ -909,3 +914,11 @@ static __init int user_namespaces_init(void)
return 0;
}
subsys_initcall(user_namespaces_init);
+
+/* log the ID of init user namespace after audit service starts */
+static __init int user_namespaces_late_init(void)
+{
+ audit_log_ns_init(AUDIT_NS_INIT_USER, 0, init_user_ns.proc_inum);
+ return 0;
+}
+late_initcall(user_namespaces_late_init);
diff --git a/kernel/utsname.c b/kernel/utsname.c
index fd39312..fa21e8d 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -16,6 +16,7 @@
#include <linux/slab.h>
#include <linux/user_namespace.h>
#include <linux/proc_ns.h>
+#include <linux/audit.h>
static struct uts_namespace *create_uts_ns(void)
{
@@ -48,6 +49,8 @@ static struct uts_namespace *clone_uts_ns(struct user_namespace *user_ns,
return ERR_PTR(err);
}
+ audit_log_ns_init(AUDIT_NS_INIT_UTS, old_ns->proc_inum, ns->proc_inum);
+
down_read(&uts_sem);
memcpy(&ns->name, &old_ns->name, sizeof(ns->name));
ns->user_ns = get_user_ns(user_ns);
@@ -84,6 +87,7 @@ void free_uts_ns(struct kref *kref)
ns = container_of(kref, struct uts_namespace, kref);
put_user_ns(ns->user_ns);
+ audit_log_ns_del(AUDIT_NS_DEL_UTS, ns->proc_inum);
proc_free_inum(ns->proc_inum);
kfree(ns);
}
@@ -138,3 +142,11 @@ const struct proc_ns_operations utsns_operations = {
.install = utsns_install,
.inum = utsns_inum,
};
+
+/* log the ID of init UTS namespace after audit service starts */
+static int __init uts_namespaces_init(void)
+{
+ audit_log_ns_init(AUDIT_NS_INIT_UTS, 0, init_uts_ns.proc_inum);
+ return 0;
+}
+late_initcall(uts_namespaces_init);
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 85b6269..562eb85 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -17,6 +17,7 @@
#include <linux/user_namespace.h>
#include <net/net_namespace.h>
#include <net/netns/generic.h>
+#include <linux/audit.h>
/*
* Our network namespace constructor/destructor lists
@@ -253,6 +254,8 @@ struct net *copy_net_ns(unsigned long flags,
mutex_lock(&net_mutex);
rv = setup_net(net, user_ns);
if (rv == 0) {
+ audit_log_ns_init(AUDIT_NS_INIT_NET, old_net->proc_inum,
+ net->proc_inum);
rtnl_lock();
list_add_tail_rcu(&net->list, &net_namespace_list);
rtnl_unlock();
@@ -389,6 +392,7 @@ static __net_init int net_ns_net_init(struct net *net)
static __net_exit void net_ns_net_exit(struct net *net)
{
+ audit_log_ns_del(AUDIT_NS_DEL_NET, net->proc_inum);
proc_free_inum(net->proc_inum);
}
@@ -435,6 +439,14 @@ static int __init net_ns_init(void)
pure_initcall(net_ns_init);
+/* log the ID of init_net namespace after audit service starts */
+static int __init net_ns_init_log(void)
+{
+ audit_log_ns_init(AUDIT_NS_INIT_NET, 0, init_net.proc_inum);
+ return 0;
+}
+late_initcall(net_ns_init_log);
+
#ifdef CONFIG_NET_NS
static int __register_pernet_operations(struct list_head *list,
struct pernet_operations *ops)
- RGB

--
Richard Guy Briggs <***@redhat.com>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545
Steve Grubb
2015-05-14 14:57:14 UTC
Permalink
Post by Richard Guy Briggs
Post by Steve Grubb
I think there needs to be some more discussion around this. It seems like
this is not exactly recording things that are useful for audit.
It seems to me that either audit has to assemble that information, or
the kernel has to do so. The kernel doesn't know about containers
(yet?).
Auditing is something that has a lot of requirements imposed on it by security
standards. There was no requirement to have an auid until audit came along and
said that uid is not good enough to know who is issuing commands because of su
or sudo. There was no requirement for sessionid until we had to track each
action back to a login so we could see if the login came from the expected
place.

What I am saying is we have the same situation. Audit needs to track a
container and we need an ID. The information that is being logged is not
useful for auditing. Maybe someone wants that info in syslog, but I doubt it.
The audit trail's purpose is to allow a security officer to reconstruct the
events to determine what happened during some security incident.

What they would want to know is what resources were assigned; if two
containers shared a resource, what resource and container was it shared with;
if two containers can communicate, we need to see or control information flow
when necessary; and we need to see termination and release of resources.

Also, if the host OS cannot make sense of the information being logged because
the pid maps to another process name, or a uid maps to another user, or a file
access maps to something not in the host's, then we need the container to do
its own auditing and resolve these mappings and optionally pass these to an
aggregation server.

Nothing else makes sense.
Post by Richard Guy Briggs
Post by Steve Grubb
Post by Richard Guy Briggs
Log the creation and deletion of namespace instances in all 6 types of
namespaces.
AUDIT_NS_INIT_MNT 1330 /* Record mount namespace instance creation */
AUDIT_NS_INIT_UTS 1331 /* Record UTS namespace instance creation */
AUDIT_NS_INIT_IPC 1332 /* Record IPC namespace instance creation */
AUDIT_NS_INIT_USER 1333 /* Record USER namespace instance creation */
AUDIT_NS_INIT_PID 1334 /* Record PID namespace instance creation */
AUDIT_NS_INIT_NET 1335 /* Record NET namespace instance creation */
AUDIT_NS_DEL_MNT 1336 /* Record mount namespace instance deletion */
AUDIT_NS_DEL_UTS 1337 /* Record UTS namespace instance deletion */
AUDIT_NS_DEL_IPC 1338 /* Record IPC namespace instance deletion */
AUDIT_NS_DEL_USER 1339 /* Record USER namespace instance deletion */
AUDIT_NS_DEL_PID 1340 /* Record PID namespace instance deletion */
AUDIT_NS_DEL_NET 1341 /* Record NET namespace instance deletion */
The requirements for auditing of containers should be derived from VPP. In
it, it asks for selectable auditing, selective audit, and selective audit
review. What this means is that we need the container and all its
children to have one identifier that is inserted into all the events that
are associated with the container.
Is that requirement for the records that are sent from the kernel, or
for the records stored by auditd, or by another facility that delivers
those records to a final consumer?
A little of both. Selective audit means that you can set rules to include or
exclude an event. This is done in the kernel. Selectable review means that the
user space tools need to be able to skip past records not of interest to a
specific line of inquiry. Logging everything and letting user space work
it out later is not a solution either, because the needle is harder to find in
a larger haystack. Or the logs may rotate and it's gone forever because the
partition is filled.
Post by Richard Guy Briggs
Post by Steve Grubb
With this, it's possible to do a search for all events related to a
container. It's possible to exclude events from a container. It's possible
to not get any events.
The requirements also call out for the identification of the subject. This
means that the event should be bound to a syscall such as clone, setns, or
unshare.
Is it useful to have a reference of the init namespace set from which
all others are spawned?
For things directly observable by the init name space, yes.
Post by Richard Guy Briggs
If it isn't bound, I assume the subject should be added to the message
format? I'm thinking of messages without an audit_context such as audit
user messages (such as AUDIT_NS_INFO and AUDIT_VIRT_CONTROL).
Making these events auxiliary records to a syscall is all that is needed. The
same way that PATH is added to an open event. If someone wants to have
container/namespace events, they add a rule on clone(2).
Post by Richard Guy Briggs
For now, we should not need to log namespaces with AUDIT_FEATURE_CHANGE
or AUDIT_CONFIG_CHANGE messages since only initial user namespace with
initial pid namespace has permission to do so. This will need to be
addressed by having non-init config changes be limited to that container
or set of namespaces and possibly its children. The other possibility
is to add the subject to the stand-alone message.
Post by Steve Grubb
Also, any user space events originating inside the container need to have
the container ID added to the user space event - just like auid and
session id.
This sounds like every task needs to record a container ID since that
information is otherwise unknown by the kernel except by what might be
provided by an audit user message such as AUDIT_VIRT_CONTROL or possibly
the new AUDIT_NS_INFO request.
Right. The same as we record auid and ses on every event. We'll need a
container ID logged with everything. -1 for unset, meaning init namespace.
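
As a rough sketch of what "the same as auid and ses" could mean, the userspace
model below carries a container ID next to the two existing identifiers. All
of it is hypothetical: struct audit_ids, the contid field name, the 64-bit
width and the set-once rule are assumptions made for illustration, not an
existing kernel interface.

```c
#include <stdint.h>

#define CONTID_UNSET ((uint64_t)-1)	/* hypothetical: -1 means unset / init namespace */

/* Hypothetical stand-in for the per-task identifiers audit would carry. */
struct audit_ids {
	uint32_t auid;		/* audit login uid, (uint32_t)-1 when unset */
	uint32_t ses;		/* session id, (uint32_t)-1 when unset */
	uint64_t contid;	/* container id, CONTID_UNSET when unset */
};

/* Identifiers of the first task: nothing has been set yet. */
static struct audit_ids audit_ids_init(void)
{
	struct audit_ids ids = { (uint32_t)-1, (uint32_t)-1, CONTID_UNSET };

	return ids;
}

/* A child starts with a copy of its parent's identifiers. */
static struct audit_ids audit_ids_fork(const struct audit_ids *parent)
{
	return *parent;
}

/*
 * Set the container ID once. Refusing a second write is an assumption
 * modelled on auid being kept immutable after login.
 */
static int audit_ids_set_contid(struct audit_ids *ids, uint64_t contid)
{
	if (ids->contid != CONTID_UNSET)
		return -1;
	ids->contid = contid;
	return 0;
}
```

In this model the container manager sets the ID once on the first task it
spawns, and every descendant inherits it, so every event can carry it.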
Post by Richard Guy Briggs
It could be stored in struct task_struct or in struct audit_context. I
don't have a suggestion on how to get that information securely into the
kernel.
That is where I'd suggest. It's for audit subsystem needs.
Post by Richard Guy Briggs
Post by Steve Grubb
Recording each instance of a name space is giving me something that I
cannot use to do queries required by the security target. Given these
events, how do I locate a web server event where it accesses a watched
file? That authentication failed? That an update within the container
failed?
The requirements are that we have to log the creation, suspension,
migration, and termination of a container. The requirements are not on
the individual name space.
Ok. Do we have a robust definition of a container?
We call the combination of name spaces, cgroups, and seccomp rules a
container.
Post by Richard Guy Briggs
Where is that definition managed?
In the thing that invokes a container.
Post by Richard Guy Briggs
If it is a userspace concept, then I think either userspace should be
assembling this information, or providing that information to the entity
that will be expected to know about and provide it.
Well, uid is a userspace concept, too. But we record an auid and keep it
immutable so that we can check enforcement of system security policy which is
also a user space concept. These things need to be collected to a place that
can be associated with events as needed. That place is the kernel.
Post by Richard Guy Briggs
Post by Steve Grubb
Maybe I'm missing how these events give me that. But I'd like to hear how I
would be able to meet requirements with these 12 events.
Adding the infrastructure to give each of those 12 events an audit
context to be able to give meaningful subject fields in audit records
appears to require adding a struct task_struct argument to calls to
copy_mnt_ns(), copy_utsname(), copy_ipcs(), copy_pid_ns(),
copy_net_ns(), create_user_ns() unless I use current. I think we must
use current since the userns is created before the spawned process is
mature or has an audit context in the case of clone.
I think you are heading down the wrong path. We can tell from syscall flags
what is being done. Try this:

## Optional - log container creation
-a always,exit -F arch=b32 -S clone -F a0&0x7C020000 -F key=container-create
-a always,exit -F arch=b64 -S clone -F a0&0x7C020000 -F key=container-create

## Optional - watch for containers that may change their configuration
-a always,exit -F arch=b32 -S unshare,setns -F key=container-config
-a always,exit -F arch=b64 -S unshare,setns -F key=container-config

Then muck with containers, then use ausearch --start recent -k container -i. I
think you'll see that we know a bit about what's happening. What's needed is
the breadcrumb trail to tie future events back to the container so that we can
check for violations of host security policy.
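
For readers decoding the a0&0x7C020000 filter in the rules above: the mask appears to be the bitwise OR of the namespace-related clone(2) flags, so the rule fires only when a clone call asks for at least one new namespace. A minimal sketch, assuming the flag values from <linux/sched.h> and that a0 carries the clone flags on the architecture in question (the helper names are illustrative, not part of any audit tool):

```python
# Namespace-related clone(2) flag values, as defined in <linux/sched.h>.
NAMESPACE_FLAGS = {
    "CLONE_NEWNS": 0x00020000,    # mount namespace
    "CLONE_NEWUTS": 0x04000000,   # UTS (hostname) namespace
    "CLONE_NEWIPC": 0x08000000,   # IPC namespace
    "CLONE_NEWUSER": 0x10000000,  # user namespace
    "CLONE_NEWPID": 0x20000000,   # PID namespace
    "CLONE_NEWNET": 0x40000000,   # network namespace
}


def namespace_mask():
    """OR together every namespace flag to rebuild the audit rule's mask."""
    mask = 0
    for value in NAMESPACE_FLAGS.values():
        mask |= value
    return mask


def decode_namespace_flags(a0):
    """Hypothetical helper: list the namespace flags set in a clone a0 value."""
    return [name for name, value in NAMESPACE_FLAGS.items() if a0 & value]
```

The low bits of a0 (the exit signal and non-namespace flags) are ignored by the mask, which is why a plain fork-like clone does not match the rule.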
Post by Richard Guy Briggs
Either that, or I have mis-understood and I should be stashing this
namespace ID information in an audit_aux_data structure or a more
permanent part of struct audit_context to be printed when required on
syscall exit. I'm trying to think through if it is needed in any
non-syscall audit messages.
I think this is what is required. But we also have the issue where an event's
meaning can't be determined outside of a container. (For example, login,
account creation, password change, uid change, file access, etc.) So, I think
auditing needs to be local to the container for enrichment and ultimately
forwarded to an aggregating server.

-Steve
LC Bruzenak
2015-05-14 15:12:44 UTC
Post by Steve Grubb
Also, if the host OS cannot make sense of the information being logged because
the pid maps to another process name, or a uid maps to another user, or a file
access maps to something not in the host's, then we need the container to do
its own auditing and resolve these mappings and optionally pass these to an
aggregation server.
Nothing else makes sense.
+1

Except, being that it IS a container, I'd say that for anyone who cares
about the audited data, the passing to an aggregation server would not
be optional.
At least not for any use-case I can envision.

LCB
--
LC (Lenny) Bruzenak
***@magitekltd.com
Eric W. Biederman
2015-05-14 15:42:38 UTC
Post by Steve Grubb
Post by Richard Guy Briggs
Post by Steve Grubb
I think there needs to be some more discussion around this. It seems like
this is not exactly recording things that are useful for audit.
It seems to me that either audit has to assemble that information, or
the kernel has to do so. The kernel doesn't know about containers
(yet?).
Auditing is something that has a lot of requirements imposed on it by security
standards. There was no requirement to have an auid until audit came along and
said that uid is not good enough to know who is issuing commands because of su
or sudo. There was no requirement for sessionid until we had to track each
action back to a login so we could see if the login came from the expected
place.
Stop right there.

You want a global identifier in a realm where only relative identifiers
exist, and make sense.

I am sorry that isn't going to happen. EVER.

Square peg, round hole. It doesn't work, it doesn't make sense, and
most especially it doesn't allow anyone to reconstruct anything, because
it does not make sense and does not match what the kernel is doing.

Container IDs do not, and will not exist. There is probably something
reasonable in your request but until you stop talking that nonsense I
can't see it.

Global IDs take us into the namespace of namespaces problem and that
isn't going to happen. I have already bent as far in this direction as
I can go. Further, namespace creation is not a privileged event, which
makes the request for a container ID make even less sense. With
anyone able to create whatever they want, it will not be an identifier
that makes any sense to someone reading an audit log.

Eric
Steve Grubb
2015-05-14 16:21:53 UTC
Post by Eric W. Biederman
Post by Steve Grubb
Post by Richard Guy Briggs
Post by Steve Grubb
I think there needs to be some more discussion around this. It seems like
this is not exactly recording things that are useful for audit.
It seems to me that either audit has to assemble that information, or
the kernel has to do so. The kernel doesn't know about containers
(yet?).
Auditing is something that has a lot of requirements imposed on it by
security standards. There was no requirement to have an auid until audit
came along and said that uid is not good enough to know who is issuing
commands because of su or sudo. There was no requirement for sessionid
until we had to track each action back to a login so we could see if the
login came from the expected place.
Stop right there.
You want a global identifier in a realm where only relative identifiers
exist, and make sense.
Global to a namespace for me is, I guess, relative for you. The ID is needed to
tie events together to check for violations of the security policy of the
container/namespace invoking child container/namespace.

As a concrete example, suppose a container is to have its own /etc/shadow. If
for some reason the container used the host's copy, then that would point to a
misconfiguration or perhaps indicate an escape from the container.

I would imagine that the next layer down has its own set of global identifiers
so that it can verify enforcement of its own security assumptions. This does
not need to be global to the system from top to 9 layers down. Each layer
needs to have a way of locating events common to a child container instance.
Post by Eric W. Biederman
I am sorry that isn't going to happen. EVER.
Then I'd suggest we either scrap this set of patches and forget auditing of
containers. (This would have the effect of disallowing them in a lot of
environments because violations of security policy can't be detected.)

Or someone please explain how what is proposed to be logged allows the tying
together of events. Or even supports the requirements I stated in my last
email.

-Steve
LC Bruzenak
2015-05-14 16:36:41 UTC
Post by Steve Grubb
Then I'd suggest we either scrap this set of patches and forget auditing of
containers. (This would have the effect of disallowing them in a lot of
environments because violations of security policy can't be detected.)
Again +1.

I personally have envisioned a use-case in which I feel containers would
be architecturally ideal; however, in my situation, and I'm fairly sure
for anyone for whom the security requirements matter (i.e. WHY we use
SELinux in the first place), this is mandatory.

Without context-aware definitive audit records which discretely identify
people/actions/objects, the use of any otherwise attractive technology
is untenable.

LCB
--
LC (Lenny) Bruzenak
***@magitekltd.com
Richard Guy Briggs
2015-05-15 02:03:57 UTC
Post by Eric W. Biederman
Post by Steve Grubb
Post by Richard Guy Briggs
Post by Steve Grubb
I think there needs to be some more discussion around this. It seems like
this is not exactly recording things that are useful for audit.
It seems to me that either audit has to assemble that information, or
the kernel has to do so. The kernel doesn't know about containers
(yet?).
Auditing is something that has a lot of requirements imposed on it by security
standards. There was no requirement to have an auid until audit came along and
said that uid is not good enough to know who is issuing commands because of su
or sudo. There was no requirement for sessionid until we had to track each
action back to a login so we could see if the login came from the expected
place.
Stop right there.
You want a global identifier in a realm where only relative identifiers
exist, and make sense.
I am assuming he wants an identifier unique per container on one kernel
and what happens on other kernels is a matter for a management
application to take care of. This kernel doesn't have to deal with it
other than taking information from a container management application.
Post by Eric W. Biederman
I am sorry that isn't going to happen. EVER.
Square peg, round hole. It doesn't work, it doesn't make sense, and
most especially it doesn't allow anyone to reconstruct anything, because
it does not make sense and does not match what the kernel is doing.
Container IDs do not, and will not exist. There is probably something
reasonable in your request but until you stop talking that nonsense I
can't see it.
I didn't see anything in any of what Steve said that suggested it was to
be unique beyond that one kernel.
Post by Eric W. Biederman
Global IDs take us into the namespace of namespaces problem and that
isn't going to happen. I have already bent as far in this direction as
I can go. Further, namespace creation is not a privileged event, which
makes the request for a container ID make even less sense. With
anyone able to create whatever they want, it will not be an identifier
that makes any sense to someone reading an audit log.
Again, I assume this is up to a container management application that
will manage its pool of container hosts and an audit aggregator.

You keep raising an objection about the unworkability of a "namespace of
namespaces". Just so we are all on the same page here, can you explain
exactly what you mean by "namespace of namespaces"?
Post by Eric W. Biederman
Eric
- RGB

--
Richard Guy Briggs <***@redhat.com>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545
Paul Moore
2015-05-14 19:19:33 UTC
Post by Steve Grubb
Post by Richard Guy Briggs
Post by Steve Grubb
I think there needs to be some more discussion around this. It seems
like this is not exactly recording things that are useful for audit.
It seems to me that either audit has to assemble that information, or
the kernel has to do so. The kernel doesn't know about containers
(yet?).
Auditing is something that has a lot of requirements imposed on it by
security standards. There was no requirement to have an auid until audit
came along and said that uid is not good enough to know who is issuing
commands because of su or sudo. There was no requirement for sessionid
until we had to track each action back to a login so we could see if the
login came from the expected place.
What I am saying is we have the same situation. Audit needs to track a
container and we need an ID. The information that is being logged is not
useful for auditing. Maybe someone wants that info in syslog, but I doubt
it. The audit trail's purpose is to allow a security officer to reconstruct
the events to determine what happened during some security incident.
As Eric, and others, have stated, the container concept is a userspace idea,
not a kernel idea; the kernel only knows, and cares about, namespaces. This
is unlikely to change.

However, as Steve points out, there is precedent for the kernel to record
userspace tokens for the sake of audit. Personally I'm not a big fan of this
in general, but I do recognize that it does satisfy a legitimate need. Think
of things like auid and the sessionid as necessary evils; audit is already
chock full of evilness I doubt one more will doom us all to hell.

Moving forward, I'd like to see the following:

* Record the creation/removal/mgmt of the individual namespaces as Richard's
patchset currently does. However, I'd suggest using an explicit namespace
value for the init namespace instead of the "unset" value in the V6 patchset
(my apologies if you've already changed this Richard, I haven't looked at V7
yet).

* Create a container ID token (unsigned 32-bit integer?), similar to
auid/sessionid, that is set by userspace and carried by the kernel to be used
in audit records. I'd like to see some discussion on how we manage this, e.g.
how do we handle container ID inheritance, how do we handle nested containers
(setting the containerid when it is already set), do we care if multiple
different containers share the same namespace config, etc.?

* When userspace sets the container ID, emit a new audit record with the
associated namespace tokens and the container ID.

* Look at our existing audit records to determine which records should have
namespace and container ID tokens added. We may only want to add the
additional fields in the case where the namespace/container ID tokens are not
the init namespace.

Can we all live with this? If not, please suggest some alternate ideas;
simply shouting "IT'S ALL CRAP!" isn't helpful for anyone ... it may be true,
but it doesn't help us solve the problem ;)
--
paul moore
security @ redhat
Andy Lutomirski
2015-05-15 06:23:09 UTC
Post by Paul Moore
* Look at our existing audit records to determine which records should have
namespace and container ID tokens added. We may only want to add the
additional fields in the case where the namespace/container ID tokens are not
the init namespace.
If we have a record that ties a set of namespace IDs with a container
ID, then I expect we only need to list the containerID along with auid
and sessionID.
The problem here is that the kernel has no concept of a "container", and I
don't think it makes any sense to add one just for audit. "Container" is a
marketing term used by some userspace tools.

I can imagine that both audit could benefit from a concept of a
namespace *path* that understands nesting (e.g. root/2/5/1 or
something along those lines). Mapping these to "containers" belongs
in userspace, I think.
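
To make the namespace *path* idea concrete, here is a minimal sketch of how such a path could be derived from a child-to-parent map. The map, the integer IDs, and the function name are assumptions for illustration only; nothing in the patchset defines this format.

```python
def namespace_path(ns_id, parent_of, root_label="root"):
    """Build a nesting path such as root/2/5/1 for a namespace ID.

    parent_of is a hypothetical map from a namespace ID to its parent's ID.
    Any ID that has no entry in the map is treated as the root.
    """
    parts = []
    seen = set()
    current = ns_id
    while current in parent_of:
        if current in seen:
            raise ValueError("cycle in namespace parent map")
        seen.add(current)
        parts.append(str(current))
        current = parent_of[current]
    parts.append(root_label)
    return "/".join(reversed(parts))
```

With parent_of = {2: 0, 5: 2, 1: 5}, namespace 1 resolves to root/2/5/1, matching the example in the message. Mapping such paths to "containers" would still be left to userspace.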

--Andy
Andy Lutomirski
2015-05-15 13:17:31 UTC
Post by Andy Lutomirski
Post by Paul Moore
* Look at our existing audit records to determine which records should have
namespace and container ID tokens added. We may only want to add the
additional fields in the case where the namespace/container ID tokens are
not the init namespace.
If we have a record that ties a set of namespace IDs with a container
ID, then I expect we only need to list the containerID along with auid
and sessionID.
The problem here is that the kernel has no concept of a "container", and I
don't think it makes any sense to add one just for audit. "Container" is a
marketing term used by some userspace tools.
No, it's a real thing just like a login. Does the kernel have any concept of a
login? Yet it happens. And it causes us to generate events describing who,
where from, role, success, and time of day. :-)
I really hope those records come from userspace, not the kernel. I
also wonder what happens when a user logs in and types "sudo agetty
/dev/ttyS0 115200". If a user does that and then someone logs in on
/dev/ttyS0, which login are they?
Post by Andy Lutomirski
I can imagine that both audit could benefit from a concept of a
namespace *path* that understands nesting (e.g. root/2/5/1 or
something along those lines). Mapping these to "containers" belongs
in userspace, I think.
I don't doubt that just as user space sequences the actions that are a login.
I just need the kernel to do some book keeping and associate the necessary
attributes in the event record to be able to reconstruct what is actually
happening.
A precondition for that is having those records have some
correspondence to what is actually happening. Since the kernel has no
concept of a container, and since the same kernel mechanisms could be
used for things that are probably not whatever the Common Criteria
rules think a container is, this could be quite difficult to define in
a meaningful manner.

Hence my suggestion to add only minimal support in the kernel and to
do this in userspace.

--Andy
Paul Moore
2015-05-15 21:05:24 UTC
Post by Andy Lutomirski
Post by Paul Moore
* Look at our existing audit records to determine which records should have
namespace and container ID tokens added. We may only want to add the
additional fields in the case where the namespace/container ID tokens are
not the init namespace.
If we have a record that ties a set of namespace IDs with a container
ID, then I expect we only need to list the containerID along with auid
and sessionID.
The problem here is that the kernel has no concept of a "container", and I
don't think it makes any sense to add one just for audit. "Container" is a
marketing term used by some userspace tools.
I can imagine that both audit could benefit from a concept of a
namespace *path* that understands nesting (e.g. root/2/5/1 or
something along those lines). Mapping these to "containers" belongs
in userspace, I think.
It might be helpful to climb up a few levels in this thread ...

I think we all agree that containers are not a kernel construct. I further
believe that the kernel has no business generating container IDs, those should
come from userspace and will likely be different depending on how you define
"container". However, what is less clear to me at this point is how the
kernel should handle the setting, reporting, and general management of this
container ID token.
--
paul moore
security @ redhat
Paul Moore
2015-05-16 12:16:55 UTC
Post by Paul Moore
Post by Andy Lutomirski
Post by Paul Moore
* Look at our existing audit records to determine which records should have
namespace and container ID tokens added. We may only want to add the
additional fields in the case where the namespace/container ID tokens are
not the init namespace.
If we have a record that ties a set of namespace IDs with a container
ID, then I expect we only need to list the containerID along with auid
and sessionID.
The problem here is that the kernel has no concept of a "container", and I
don't think it makes any sense to add one just for audit. "Container" is a
marketing term used by some userspace tools.
I can imagine that both audit could benefit from a concept of a
namespace *path* that understands nesting (e.g. root/2/5/1 or
something along those lines). Mapping these to "containers" belongs
in userspace, I think.
It might be helpful to climb up a few levels in this thread ...
I think we all agree that containers are not a kernel construct. I further
believe that the kernel has no business generating container IDs, those should
come from userspace and will likely be different depending on how you define
"container". However, what is less clear to me at this point is how the
kernel should handle the setting, reporting, and general management of this
container ID token.
Wouldn't the easiest thing be to just add a containerid to the
process context, like auid?
I believe so. At least that was the point I was trying to get across
when I first jumped into this thread.
Then make it a privileged operation to set it. Then tools that care about
auditing, like docker, can set the ID
and remove the capability from its subprocesses if it cares. All
processes adopt the parent process's containerid.
Now containers can be audited and as long as userspace is written
correctly nested containers can either override the containerid or not
depending on what the audit rules are.
This part I'm still less certain on. I agree that setting the
container ID should be privileged in some sense, but the kernel
shouldn't *require* privilege to create a new container (however the
user chooses to define it). Simply requiring privilege to set the
container ID and failing silently may be sufficient.
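
The semantics being discussed here (inherit the ID from the parent, require privilege to set it, fail silently otherwise) can be sketched as a toy model. This is only an illustration under those stated assumptions; the class, field, and method names are hypothetical and do not correspond to any kernel interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    """Toy stand-in for a process carrying an audit container ID."""
    containerid: Optional[int] = None   # None models the "unset" value
    may_set_containerid: bool = False   # models holding the needed privilege

    def fork(self) -> "Task":
        # A child adopts its parent's container ID and privilege state.
        return Task(self.containerid, self.may_set_containerid)

    def set_containerid(self, new_id: int) -> bool:
        # Unprivileged attempts fail silently: report False, change nothing.
        if not self.may_set_containerid:
            return False
        self.containerid = new_id
        return True

    def drop_privilege(self) -> None:
        # A container manager can drop the privilege before running payloads.
        self.may_set_containerid = False
```

In this model a manager that keeps the privilege in a child could let a nested container override the ID, while one that drops it pins the ID for the whole subtree; which behaviour is wanted is the open policy question in the thread.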
--
paul moore
www.paul-moore.com
Eric W. Biederman
2015-05-16 14:46:29 UTC
Post by Paul Moore
Post by Paul Moore
Post by Andy Lutomirski
Post by Paul Moore
* Look at our existing audit records to determine which records should have
namespace and container ID tokens added. We may only want to add the
additional fields in the case where the namespace/container ID tokens are
not the init namespace.
If we have a record that ties a set of namespace IDs with a container
ID, then I expect we only need to list the containerID along with auid
and sessionID.
The problem here is that the kernel has no concept of a "container", and I
don't think it makes any sense to add one just for audit. "Container" is a
marketing term used by some userspace tools.
I can imagine that both audit could benefit from a concept of a
namespace *path* that understands nesting (e.g. root/2/5/1 or
something along those lines). Mapping these to "containers" belongs
in userspace, I think.
It might be helpful to climb up a few levels in this thread ...
I think we all agree that containers are not a kernel construct. I further
believe that the kernel has no business generating container IDs, those should
come from userspace and will likely be different depending on how you define
"container". However, what is less clear to me at this point is how the
kernel should handle the setting, reporting, and general management of this
container ID token.
Wouldn't the easiest thing be to just add a containerid to the
process context, like auid?
I believe so. At least that was the point I was trying to get across
when I first jumped into this thread.
It sounds nice but containers are not just a per process construct.
Sometimes you might know a namespace but not which process instigated
action to happen on that namespace.
Post by Paul Moore
Then make it a privileged operation to set it. Then tools that care about
auditing, like docker, can set the ID
and remove the capability from its subprocesses if it cares. All
processes adopt the parent process's containerid.
Now containers can be audited and as long as userspace is written
correctly nested containers can either override the containerid or not
depending on what the audit rules are.
This part I'm still less certain on. I agree that setting the
container ID should be privileged in some sense, but the kernel
shouldn't *require* privilege to create a new container (however the
user chooses to define it). Simply requiring privilege to set the
container ID and failing silently may be sufficient.
My hope is as things mature fewer and fewer container things will need
any special privilege to create.

I think it needs to start with a clear definition of what is wanted and
then working backwards through which messages in which contexts you want
to have your magic bits.

Eric
Paul Moore
2015-05-16 22:49:39 UTC
On Sat, May 16, 2015 at 10:46 AM, Eric W. Biederman
Post by Eric W. Biederman
Post by Paul Moore
Post by Paul Moore
Post by Andy Lutomirski
Post by Paul Moore
* Look at our existing audit records to determine which records should have
namespace and container ID tokens added. We may only want to add the
additional fields in the case where the namespace/container ID tokens are
not the init namespace.
If we have a record that ties a set of namespace IDs with a container
ID, then I expect we only need to list the containerID along with auid
and sessionID.
The problem here is that the kernel has no concept of a "container", and I
don't think it makes any sense to add one just for audit. "Container" is a
marketing term used by some userspace tools.
I can imagine that both audit could benefit from a concept of a
namespace *path* that understands nesting (e.g. root/2/5/1 or
something along those lines). Mapping these to "containers" belongs
in userspace, I think.
It might be helpful to climb up a few levels in this thread ...
I think we all agree that containers are not a kernel construct. I further
believe that the kernel has no business generating container IDs, those should
come from userspace and will likely be different depending on how you define
"container". However, what is less clear to me at this point is how the
kernel should handle the setting, reporting, and general management of this
container ID token.
Wouldn't the easiest thing be to just add a containerid to the
process context, like auid?
I believe so. At least that was the point I was trying to get across
when I first jumped into this thread.
It sounds nice but containers are not just a per process construct.
Sometimes you might know a namespace but not which process instigated
action to happen on that namespace.
From an auditing perspective I'm not sure we will ever hit those
cases; did you have a particular example in mind?
--
paul moore
www.paul-moore.com
Richard Guy Briggs
2015-05-19 13:09:11 UTC
Post by Paul Moore
On Sat, May 16, 2015 at 10:46 AM, Eric W. Biederman
Post by Eric W. Biederman
Post by Paul Moore
Post by Paul Moore
Post by Andy Lutomirski
Post by Paul Moore
* Look at our existing audit records to determine which records should have
namespace and container ID tokens added. We may only want to add the
additional fields in the case where the namespace/container ID tokens are
not the init namespace.
If we have a record that ties a set of namespace IDs with a container
ID, then I expect we only need to list the containerID along with auid
and sessionID.
The problem here is that the kernel has no concept of a "container", and I
don't think it makes any sense to add one just for audit. "Container" is a
marketing term used by some userspace tools.
I can imagine that both audit could benefit from a concept of a
namespace *path* that understands nesting (e.g. root/2/5/1 or
something along those lines). Mapping these to "containers" belongs
in userspace, I think.
It might be helpful to climb up a few levels in this thread ...
I think we all agree that containers are not a kernel construct. I further
believe that the kernel has no business generating container IDs, those should
come from userspace and will likely be different depending on how you define
"container". However, what is less clear to me at this point is how the
kernel should handle the setting, reporting, and general management of this
container ID token.
Wouldn't the easiest thing be to just add a containerid to the
process context, like auid?
I believe so. At least that was the point I was trying to get across
when I first jumped into this thread.
It sounds nice but containers are not just a per process construct.
Sometimes you might know a namespace but not which process instigated
action to happen on that namespace.
From an auditing perspective I'm not sure we will ever hit those
cases; did you have a particular example in mind?
The example that immediately came to mind when I first read Eric's
comment was a packet coming in off a network in a particular network
namespace. That could narrow it down to a subset of containers based on
which network namespace it inhabits, but since it isn't associated with
a particular task yet (other than a kernel thread) it will not be
possible to select the precise nsproxy, let alone the container.
Post by Paul Moore
paul moore
- RGB

--
Richard Guy Briggs <***@redhat.com>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545
Paul Moore
2015-05-19 14:27:30 UTC
Post by Richard Guy Briggs
Post by Eric W. Biederman
Post by Eric W. Biederman
It sounds nice but containers are not just a per process construct.
Sometimes you might know a namespace but not which process instigated
action to happen on that namespace.
From an auditing perspective I'm not sure we will ever hit those
cases; did you have a particular example in mind?
The example that immediately came to mind when I first read Eric's
comment was a packet coming in off a network in a particular network
namespace. That could narrow it down to a subset of containers based on
which network namespace it inhabits, but since it isn't associated with
a particular task yet (other than a kernel thread) it will not be
possible to select the precise nsproxy, let alone the container.
Thanks, I was stuck thinking about syscall based auditing and forgot
about the various LSM based audit records. Of all people you would
think I would remember per-packet audit records ;)

Anyway, in this case I think including the namespace ID is sufficient,
largely because the container userspace doesn't have access to the
packet at this point. In order to actually receive the data the
container's userspace will need to issue a syscall where we can
include the container ID. An overly zealous security officer who
wants to trace all the kernel level audit events, like the one you
describe, can match up the namespace to a container in post-processing
if needed.
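
As a rough illustration of that post-processing step, the sketch below matches a network namespace ID from a packet-level record against records that tie a container ID to its namespace tokens. The netns= field follows the NS_INFO format proposed in the cover letter; the contid= field and the CONTAINER_INFO record type are assumptions made up for this example, since no such record has been defined.

```python
def parse_fields(record):
    """Split an audit-style record into a dict of key=value tokens."""
    fields = {}
    for token in record.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def containers_for_netns(records, netns):
    """Return the sorted container IDs whose records list the given netns.

    'contid' is a hypothetical field name for the container ID token.
    """
    matches = set()
    for record in records:
        fields = parse_fields(record)
        if "contid" in fields and fields.get("netns") == netns:
            matches.add(fields["contid"])
    return sorted(matches)
```

Because several containers may share one network namespace, the lookup can return more than one candidate, which matches the "subset of containers" caveat raised earlier in the thread.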
--
paul moore
www.paul-moore.com
Eric W. Biederman
2015-05-15 01:31:45 UTC
Post by Paul Moore
As Eric, and others, have stated, the container concept is a userspace idea,
not a kernel idea; the kernel only knows, and cares about, namespaces. This
is unlikely to change.
However, as Steve points out, there is precedent for the kernel to record
userspace tokens for the sake of audit. Personally I'm not a big fan of this
in general, but I do recognize that it does satisfy a legitimate need. Think
of things like auid and the sessionid as necessary evils; audit is already
chock full of evilness I doubt one more will doom us all to hell.
* Create a container ID token (unsigned 32-bit integer?), similar to
auid/sessionid, that is set by userspace and carried by the kernel to be used
in audit records. I'd like to see some discussion on how we manage this, e.g.
how do we handle container ID inheritance, how do we handle nested containers
(setting the containerid when it is already set), do we care if multiple
different containers share the same namespace config, etc.?
Can we all live with this? If not, please suggest some alternate ideas;
simply shouting "IT'S ALL CRAP!" isn't helpful for anyone ... it may be true,
but it doesn't help us solve the problem ;)
Without stopping and defining what someone means by container I think it
is pretty much nonsense.

Should every vsftp connection get a container? Every chrome tab?

At some of the connections per second numbers I have seen we might
exhaust a 32bit number in an hour or two. Will any of that make sense
to someone reading the audit logs?
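
A quick back-of-the-envelope check of the exhaustion claim, assuming one fresh ID is consumed per connection and IDs are never reused (both are assumptions of this sketch, not something the proposal states):

```python
def seconds_to_exhaust(bits, ids_per_second):
    """Seconds until a counter of the given width runs out of fresh IDs."""
    return (2 ** bits) / ids_per_second


def rate_to_exhaust_in(bits, seconds):
    """IDs per second needed to use up the whole ID space in the given time."""
    return (2 ** bits) / seconds
```

Exhausting a 32-bit space in one hour takes roughly 1.19 million new IDs per second, and about 600 thousand per second for two hours. At the same order of magnitude a 64-bit space would last for hundreds of thousands of years, so widening the token addresses exhaustion but not the question of whether such IDs mean anything to someone reading the log.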

Without considering that container creation is an unprivileged
operation I think it is pretty much nonsense. Do I get to say I am any
container I want? That would seem to invalidate the concept of
userspace setting a container id.

How does any of this interact with setns? AKA entering a container?

I will go as far as looking at patches. If someone comes up with
a mission statement about what they are actually trying to achieve and a
mechanism that actually achieves that, and that allows for containers to
nest we can talk about doing something like that.

But for right now I just hear proposals for things that make no sense
and can not possibly work. Not least because it will require modifying
every program that creates a container and who knows how many of them
there are. Especially since you don't need to be root. Modifying
/usr/bin/unshare seems a little far out to me.

Eric
Richard Guy Briggs
2015-05-15 02:25:27 UTC
Post by Eric W. Biederman
Post by Paul Moore
As Eric, and others, have stated, the container concept is a userspace idea,
not a kernel idea; the kernel only knows, and cares about, namespaces. This
is unlikely to change.
However, as Steve points out, there is precedent for the kernel to record
userspace tokens for the sake of audit. Personally I'm not a big fan of this
in general, but I do recognize that it does satisfy a legitimate need. Think
of things like auid and the sessionid as necessary evils; audit is already
chock full of evilness I doubt one more will doom us all to hell.
* Create a container ID token (unsigned 32-bit integer?), similar to
auid/sessionid, that is set by userspace and carried by the kernel to be used
in audit records. I'd like to see some discussion on how we manage this, e.g.
how do we handle container ID inheritance, how do we handle nested containers
(setting the containerid when it is already set), do we care if multiple
different containers share the same namespace config, etc.?
Can we all live with this? If not, please suggest some alternate ideas;
simply shouting "IT'S ALL CRAP!" isn't helpful for anyone ... it may be true,
but it doesn't help us solve the problem ;)
Without stopping and defining what someone means by container I think it
is pretty much nonsense.
Not complete, but this is why I'm asking for a standards document...
Post by Eric W. Biederman
Should every vsftp connection get a container? Every chrome tab?
At some of the connections per second numbers I have seen we might
exhaust a 32bit number in an hour or two. Will any of that make sense
to someone reading the audit logs?
So making it 64bits buys us some time, but sure... I think your
definition of a container may be a bit more liberal than what we're
trying to understand...
Post by Eric W. Biederman
Without considering that container creation is an unprivileged
operation I think it is pretty much nonsense. Do I get to say I am any
container I want? That would seem to invalidate the concept of
userspace setting a container id.
Ok, my impression was that we're dealing with a privileged application
as I alluded to with the need to create a new CAP_AUDIT_CONTAINER_ID or
something...
Post by Eric W. Biederman
How does any of this interact with setns? AKA entering a container?
You mean entering another namespace that might all be part of one
container? Or an application attempting to enter the namespace of
another container?
Post by Eric W. Biederman
I will go as far as looking at patches. If someone comes up with
a mission statement about what they are actually trying to achieve and a
mechanism that actually achieves that, and that allows for containers to
nest we can talk about doing something like that.
I don't pretend these patches are anywhere near finished or ready for
upstream.
Post by Eric W. Biederman
But for right now I just hear proposals for things that make no sense
and can not possibly work. Not least because it will require modifying
every program that creates a container and who knows how many of them
there are. Especially since you don't need to be root. Modifying
/usr/bin/unshare seems a little far out to me.
My understanding is that just spawning or changing namespace doesn't
imply spawning or changing containers. I also don't necessarily assume
that creating a container is an atomic operation, though that concept
might make some sense to understand or predict the boundaries of
actions...
Post by Eric W. Biederman
Eric
- RGB

--
Richard Guy Briggs <***@redhat.com>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545
Steve Grubb
2015-05-15 13:17:24 UTC
Permalink
Post by Eric W. Biederman
Post by Paul Moore
As Eric, and others, have stated, the container concept is a userspace
idea, not a kernel idea; the kernel only knows, and cares about,
namespaces. This is unlikely to change.
However, as Steve points out, there is precedent for the kernel to record
userspace tokens for the sake of audit. Personally I'm not a big fan of
this in general, but I do recognize that it does satisfy a legitimate
need. Think of things like auid and the sessionid as necessary evils;
audit is already chock full of evilness, and I doubt one more will doom us
all to hell.
* Create a container ID token (unsigned 32-bit integer?), similar to
auid/sessionid, that is set by userspace and carried by the kernel to be
used in audit records. I'd like to see some discussion on how we manage
this, e.g. how do we handle container ID inheritance, how do we handle
nested containers (setting the containerid when it is already set), do we
care if multiple different containers share the same namespace config,
etc.?
Can we all live with this? If not, please suggest some alternate ideas;
simply shouting "IT'S ALL CRAP!" isn't helpful for anyone ... it may be
true, but it doesn't help us solve the problem ;)
Without stopping and defining what someone means by container I think it
is pretty much nonsense.
Maybe this is what's hanging everyone up? It's easy to get lost when your view
is down at the syscall level and what is happening in the kernel. Starting a
container is akin to the idea of login. Not every call to setresuid is a
login. It could be a setuid program starting or a daemon dropping privileges.
The idea of a container is a higher level concept than starting a name space.
I think comparing a login with a container is a useful analogy because both
are higher level concepts but employ low level ideas. A login is a collection
of chdir, setuid, setgid, allocating a tty, associating the first 3 file
descriptors, setting a process group, and starting a specific executable. None
of these low level operations is special by itself.

A container is what we need auditing events around, not creation of namespaces.
If we want creation of namespaces, we can audit the clone/unshare/setns
syscalls. The container is when a managing program such as docker, lxc, or
sometimes systemd creates a special operating environment for the express
purpose of running programs disassociated in some way from the parent
namespaces, cgroups, and security assumptions. It's this orchestration, just as
sshd orchestrates a login, that makes it different.
Post by Eric W. Biederman
Should every vsftp connection get a container? Every chrome tab?
No. Also, note that not every program that grants a user session constitutes a
login.
Post by Eric W. Biederman
At some of the connections per second numbers I have seen we might
exhaust a 32bit number in an hour or two. Will any of that make sense
to someone reading the audit logs?
I would agree if we were auditing creation of name spaces. But going back to
the concept of login, logins could also occur at a high rate, as in a
brute-force login attack. We put countermeasures in place to prevent that,
but it is still possible for the session id to wrap. In our case, things like
lxc or docker don't start hundreds of containers a minute.
Post by Eric W. Biederman
Without considering that container creation is an unprivileged
operation I think it is pretty much nonsense. Do I get to say I am any
container I want? That would seem to invalidate the concept of
userspace setting a container id.
It would need to be a privileged operation just as setuid is.
Post by Eric W. Biederman
How does any of this interact with setns? AKA entering a container?
We have to audit this. For the moment, auditing the setns syscall may be
enough. I'd have to look at the lifecycle of the application that's doing this
to determine if we need more.
Post by Eric W. Biederman
I will go as far as looking at patches. If someone comes up with
a mission statement about what they are actually trying to achieve and a
mechanism that actually achieves that, and that allows for containers to
nest we can talk about doing something like that.
Auditing wouldn't impose any restrictions on this. We just need a way to
observe actions within and associate them as needed to investigate violations
of security policy.
Post by Eric W. Biederman
But for right now I just hear proposals for things that make no sense
and can not possibly work. Not least because it will require modifying
every program that creates a container and who knows how many of them
there are.
We only care about a couple of programs doing the orchestration. They will need
to have the right support added to them. I'm hoping the analogy of a login
helps demonstrate what we are after.

-Steve
Eric W. Biederman
2015-05-15 14:51:09 UTC
Permalink
Post by Steve Grubb
Post by Eric W. Biederman
Post by Paul Moore
As Eric, and others, have stated, the container concept is a userspace
idea, not a kernel idea; the kernel only knows, and cares about,
namespaces. This is unlikely to change.
However, as Steve points out, there is precedent for the kernel to record
userspace tokens for the sake of audit. Personally I'm not a big fan of
this in general, but I do recognize that it does satisfy a legitimate
need. Think of things like auid and the sessionid as necessary evils;
audit is already chock full of evilness, and I doubt one more will doom us
all to hell.
* Create a container ID token (unsigned 32-bit integer?), similar to
auid/sessionid, that is set by userspace and carried by the kernel to be
used in audit records. I'd like to see some discussion on how we manage
this, e.g. how do we handle container ID inheritance, how do we handle
nested containers (setting the containerid when it is already set), do we
care if multiple different containers share the same namespace config,
etc.?
Can we all live with this? If not, please suggest some alternate ideas;
simply shouting "IT'S ALL CRAP!" isn't helpful for anyone ... it may be
true, but it doesn't help us solve the problem ;)
Without stopping and defining what someone means by container I think it
is pretty much nonsense.
Maybe this is what's hanging everyone up? It's easy to get lost when your view
is down at the syscall level and what is happening in the kernel. Starting a
container is akin to the idea of login. Not every call to setresuid is a
login. It could be a setuid program starting or a daemon dropping privileges.
The idea of a container is a higher level concept than starting a name space.
I think comparing a login with a container is a useful analogy because both
are higher level concepts but employ low level ideas. A login is a collection
of chdir, setuid, setgid, allocating a tty, associating the first 3 file
descriptors, setting a process group, and starting a specific executable. None
of these low level operations is special by itself.
Except login and setresuid are privileged operations.

CREATING A CONTAINER IS NOT A PRIVILEGED OPERATION.
Your analogy fails rather badly with respect to that fact.
Post by Steve Grubb
A container is what we need auditing events around, not creation of namespaces.
If we want creation of namespaces, we can audit the clone/unshare/setns
syscalls. The container is when a managing program such as docker, lxc, or
sometimes systemd creates a special operating environment for the express
purpose of running programs disassociated in some way from the parent
namespaces, cgroups, and security assumptions. It's this orchestration, just as
sshd orchestrates a login, that makes it different.
What do you define as a container? From what I can tell we share
a similar understanding of the term, and running lxc is not a
privileged operation. Running sandstorm.io is not a privileged
operation.
Post by Steve Grubb
Post by Eric W. Biederman
Should every vsftp connection get a container? Every chrome tab?
No. Also, note that not every program that grants a user session constitutes a
login.
Post by Eric W. Biederman
At some of the connections per second numbers I have seen we might
exhaust a 32bit number in an hour or two. Will any of that make sense
to someone reading the audit logs?
I would agree if we were auditing creation of name spaces. But going back to
the concept of login, logins could also occur at a high rate, as in a
brute-force login attack. We put countermeasures in place to prevent that,
but it is still possible for the session id to wrap. In our case, things like
lxc or docker don't start hundreds of containers a minute.
Except there are reasonable situations where container creation does
happen at fast rates. Outside of a container per network connection
(which is likely to happen at some point) I have seen builds fire up
more containers than I can count as part of automated testing.
Post by Steve Grubb
Post by Eric W. Biederman
Without considering that container creation is an unprivileged
operation I think it is pretty much nonsense. Do I get to say I am any
container I want? That would seem to invalidate the concept of
userspace setting a container id.
It would need to be a privileged operation just as setuid is.
CONTAINER CREATION IS NOT A PRIVILEGED OPERATION.

That is today. That is talking about lxc.

CONTAINER CREATION IS NOT A PRIVILEGED OPERATION.

And ultimately we don't want it to be, because if you can safely create a
container without privilege, your system is safer.
Post by Steve Grubb
Post by Eric W. Biederman
How does any of this interact with setns? AKA entering a container?
We have to audit this. For the moment, auditing the setns syscall may be
enough. I'd have to look at the lifecycle of the application that's doing this
to determine if we need more.
Frequently it will be sysadmins for some arbitrary reason calling
nsenter or a similar program that is more aware of their favorite
container flavor.
Post by Steve Grubb
Post by Eric W. Biederman
I will go as far as looking at patches. If someone comes up with
a mission statement about what they are actually trying to achieve and a
mechanism that actually achieves that, and that allows for containers to
nest we can talk about doing something like that.
Auditing wouldn't impose any restrictions on this. We just need a way to
observe actions within and associate them as needed to investigate violations
of security policy.
*Rolls eyes* But the rest of the container tool kit in the kernel will
impose limitations on those identifiers.
Post by Steve Grubb
Post by Eric W. Biederman
But for right now I just hear proposals for things that make no sense
and can not possibly work. Not least because it will require modifying
every program that creates a container and who knows how many of them
there are.
We only care about a couple of programs doing the orchestration. They will need
to have the right support added to them. I'm hoping the analogy of a login
helps demonstrate what we are after.
All I see is that (a) you have not defined what you see a container as and
(b) you have failed to acknowledge I can create a container without
privilege (which breaks your analogy with login).

But I think I am with Andy. If you only care about privileged events
and privileged containers, it is unlikely you need to do anything in the
kernel and you can perform whatever logging you see fit in your
privileged userspace applications.

Of course in the long run I don't see what good that will do you as I
expect increasingly there will not need to be any special permissions to
create containers.

Eric
Paul Moore
2015-05-15 21:01:25 UTC
Permalink
Post by Eric W. Biederman
Post by Paul Moore
As Eric, and others, have stated, the container concept is a userspace
idea, not a kernel idea; the kernel only knows, and cares about,
namespaces. This is unlikely to change.
However, as Steve points out, there is precedent for the kernel to record
userspace tokens for the sake of audit. Personally I'm not a big fan of
this in general, but I do recognize that it does satisfy a legitimate
need. Think of things like auid and the sessionid as necessary evils;
audit is already chock full of evilness, and I doubt one more will doom us
all to hell.
* Create a container ID token (unsigned 32-bit integer?), similar to
auid/sessionid, that is set by userspace and carried by the kernel to be
used in audit records. I'd like to see some discussion on how we manage
this, e.g. how do we handle container ID inheritance, how do we handle
nested containers (setting the containerid when it is already set), do we
care if multiple different containers share the same namespace config,
etc.?
Can we all live with this? If not, please suggest some alternate ideas;
simply shouting "IT'S ALL CRAP!" isn't helpful for anyone ... it may be
true, but it doesn't help us solve the problem ;)
Without stopping and defining what someone means by container I think it
is pretty much nonsense.
For what it is worth, I doubt we will ever arrive at a consistent definition
of a container. This is one of the reasons why I don't think we want the
kernel generating a container ID token, although I understand the real world
desire to have the kernel report such information back in the audit logs.
Post by Eric W. Biederman
Should every vsftp connection get a container? Every chrome tab?
That's up to the individual system. I would argue that's a pretty silly
configuration, but one person's silliness is another's best practice. It's a
mad, mad world.
Post by Eric W. Biederman
At some of the connections per second numbers I have seen we might
exhaust a 32bit number in an hour or two. Will any of that make sense
to someone reading the audit logs?
If someone is going to spawn each process in a container then they will need
to live with the fallout of that decision.

Also, if folks think 32-bits is too small, we can always do 64-bits, but I
don't think that was the point you were trying to make (I could be wrong).
Post by Eric W. Biederman
Without considering that container creation is an unprivileged
operation I think it is pretty much nonsense. Do I get to say I am any
container I want? That would seem to invalidate the concept of
userspace setting a container id.
How does any of this interact with setns? AKA entering a container?
As I said in my email, I think we need some discussion around this; I don't
pretend to think we have this sorted at this point. I just want to make sure
we're working towards some common ground instead of shouting the same stuff
back and forth at each other.
Post by Eric W. Biederman
I will go as far as looking at patches. If someone comes up with
a mission statement about what they are actually trying to achieve and a
mechanism that actually achieves that, and that allows for containers to
nest we can talk about doing something like that.
I think Steve has posted some requirements that Richard is trying to satisfy
with these patches; we've also heard from at least one person who is looking
at how to deploy this in the Real World. Perhaps in the next round of patches
Richard can list the requirements in the 0/X patch and describe how they are
satisfied in the patchset.

Beyond that, and ignoring for a moment the whole "a container is not a
*thing*" argument, can I assume that the auditing of nested "containers" is
your main remaining concern at this point?
Post by Eric W. Biederman
But for right now I just hear proposals for things that make no sense
and can not possibly work. Not least because it will require modifying
every program that creates a container and who knows how many of them
there are. Especially since you don't need to be root. Modifying
/usr/bin/unshare seems a little far out to me.
I think it is very reasonable that there will be some container infrastructure
tools which would handle this; we're already seeing this happen now. Asking
for minor changes to these infrastructure applications to support container
auditing doesn't seem like a significant ask to me. Also, to be perfectly
clear, if the applications aren't updated it isn't as if they will fail to
work, it is just that they won't be able to take advantage of the new
container auditing capabilities. That seems reasonable to me.
--
paul moore
security @ redhat
Oren Laadan
2015-05-15 01:10:56 UTC
Permalink
Post by Steve Grubb
Post by Steve Grubb
Post by Richard Guy Briggs
Post by Steve Grubb
Recording each instance of a name space is giving me something that I
cannot use to do queries required by the security target. Given these
events, how do I locate a web server event where it accesses a watched
file? That authentication failed? That an update within the container
failed?
The requirements are that we have to log the creation, suspension,
migration, and termination of a container. The requirements are not on
the individual name space.
Ok. Do we have a robust definition of a container?
We call the combination of name spaces, cgroups, and seccomp rules a
container.
Can you detail what information is required from each?
Post by Steve Grubb
Post by Richard Guy Briggs
Where is that definition managed?
In the thing that invokes a container.
I was looking for a reference to a standards document rather than an
application...
[focusing on "containers id" - snipped the rest away]

I am unfamiliar with the audit subsystem, but work with namespaces in other
contexts. Perhaps the term "container" is overloaded here. The definition
suggested by Steve in this thread makes sense to me: "a combination of
namespaces". I imagine people may want to audit subsets of namespaces.

For namespaces, one can use a string like "A:B:C:D:E:F" as an identifier for a
particular combination, where A-F are the respective namespace identifiers
(taken, for example, from /proc/PID/ns/{mnt,uts,ipc,user,pid,net}).
That will even be grep-able to locate records related to a particular subset
of namespaces. So a "container" in the classic meaning would have all A-F
unique and different from the init process, but processes separated only by
e.g. mnt-ns and net-ns will differ from the init process in A and F.

(If a string is a no go, then perhaps combine the IDs in a unique way into a
super ID).
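
A minimal sketch of the combined identifier described above, in Python. It assumes a Linux /proc where each /proc/PID/ns/* entry is a symlink whose target looks like net:[4026531956]; the helper names (parse_ns_link, combine, ns_identifier, differing) are hypothetical and not part of any audit tooling.

```python
import os
import re

# Namespace types in the order given above: A=mnt ... F=net.
NS_TYPES = ("mnt", "uts", "ipc", "user", "pid", "net")
_LINK_RE = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target: str) -> int:
    """Extract the inode number from a link target such as 'net:[4026531956]'."""
    match = _LINK_RE.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link target: {target!r}")
    return int(match.group(2))


def combine(ids: dict) -> str:
    """Join per-namespace IDs into one 'A:B:C:D:E:F' string in NS_TYPES order."""
    return ":".join(str(ids[ns]) for ns in NS_TYPES)


def ns_identifier(pid="self") -> str:
    """Read /proc/<pid>/ns/* and build the combined identifier (Linux only)."""
    ids = {
        ns: parse_ns_link(os.readlink(f"/proc/{pid}/ns/{ns}"))
        for ns in NS_TYPES
    }
    return combine(ids)


def differing(a: str, b: str) -> list:
    """Name the namespace types in which two identifiers differ."""
    pairs = zip(NS_TYPES, a.split(":"), b.split(":"))
    return [ns for ns, x, y in pairs if x != y]
```

For example, differing("1:2:3:4:5:6", "9:2:3:4:5:7") names mnt and net, which is the "A and F" case. Inode numbers are only meaningful together with the device they live on, so a real identifier would probably need to carry the device ID as well.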

Oren.
Richard Guy Briggs
2015-05-15 02:11:26 UTC
Permalink
Post by Oren Laadan
Post by Steve Grubb
Post by Steve Grubb
Post by Richard Guy Briggs
Post by Steve Grubb
Recording each instance of a name space is giving me something that I
cannot use to do queries required by the security target. Given these
events, how do I locate a web server event where it accesses a watched
file? That authentication failed? That an update within the container
failed?
The requirements are that we have to log the creation, suspension,
migration, and termination of a container. The requirements are not on
the individual name space.
Ok. Do we have a robust definition of a container?
We call the combination of name spaces, cgroups, and seccomp rules a
container.
Can you detail what information is required from each?
Post by Steve Grubb
Post by Richard Guy Briggs
Where is that definition managed?
In the thing that invokes a container.
I was looking for a reference to a standards document rather than an
application...
[focusing on "containers id" - snipped the rest away]
I am unfamiliar with the audit subsystem, but work with namespaces in other
contexts. Perhaps the term "container" is overloaded here. The definition
suggested by Steve in this thread makes sense to me: "a combination of
namespaces". I imagine people may want to audit subsets of namespaces.
I assume it would be a bit more than that, including cgroup and seccomp info.
Post by Oren Laadan
For namespaces, one can use a string like "A:B:C:D:E:F" as an identifier for a
particular combination, where A-F are the respective namespace identifiers
(taken, for example, from /proc/PID/ns/{mnt,uts,ipc,user,pid,net}).
That will even be grep-able to locate records related to a particular subset
of namespaces. So a "container" in the classic meaning would have all A-F
unique and different from the init process, but processes separated only by
e.g. mnt-ns and net-ns will differ from the init process in A and F.
(If a string is a no go, then perhaps combine the IDs in a unique way into a
super ID).
I'd be fine with either, even including the nsfs deviceID.
Post by Oren Laadan
Oren.
- RGB

--
Richard Guy Briggs <***@redhat.com>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545
Daniel J Walsh
2015-05-15 13:19:19 UTC
Permalink
Post by Richard Guy Briggs
Post by Oren Laadan
Post by Steve Grubb
Post by Steve Grubb
Post by Richard Guy Briggs
Post by Steve Grubb
Recording each instance of a name space is giving me something that I
cannot use to do queries required by the security target. Given these
events, how do I locate a web server event where it accesses a watched
file? That authentication failed? That an update within the container
failed?
The requirements are that we have to log the creation, suspension,
migration, and termination of a container. The requirements are not on
the individual name space.
Ok. Do we have a robust definition of a container?
We call the combination of name spaces, cgroups, and seccomp rules a
container.
Can you detail what information is required from each?
Post by Steve Grubb
Post by Richard Guy Briggs
Where is that definition managed?
In the thing that invokes a container.
I was looking for a reference to a standards document rather than an
application...
[focusing on "containers id" - snipped the rest away]
I am unfamiliar with the audit subsystem, but work with namespaces in other
contexts. Perhaps the term "container" is overloaded here. The definition
suggested by Steve in this thread makes sense to me: "a combination of
namespaces". I imagine people may want to audit subsets of namespaces.
I assume it would be a bit more than that, including cgroup and seccomp info.
I don't see why seccomp, as opposed to other security mechanisms, comes into
this. I'm not really sure about cgroups. That stuff would all be associated
with the process. I would guess you could look at the process that modified
these for logging, but that should happen at the time they get changed, not
be recorded for every process.
Post by Richard Guy Briggs
Post by Oren Laadan
For namespaces, one can use a string like "A:B:C:D:E:F" as an identifier for a
particular combination, where A-F are the respective namespace identifiers
(taken, for example, from /proc/PID/ns/{mnt,uts,ipc,user,pid,net}).
That will even be grep-able to locate records related to a particular subset
of namespaces. So a "container" in the classic meaning would have all A-F
unique and different from the init process, but processes separated only by
e.g. mnt-ns and net-ns will differ from the init process in A and F.
(If a string is a no go, then perhaps combine the IDs in a unique way into a
super ID).
I'd be fine with either, even including the nsfs deviceID.
Post by Oren Laadan
Oren.
- RGB
--
Linux-audit mailing list
https://www.redhat.com/mailman/listinfo/linux-audit
Paul Moore
2015-05-15 20:42:38 UTC
Permalink
Post by Oren Laadan
[focusing on "containers id" - snipped the rest away]
I am unfamiliar with the audit subsystem, but work with namespaces in other
contexts. Perhaps the term "container" is overloaded here. The definition
suggested by Steve in this thread makes sense to me: "a combination of
namespaces". I imagine people may want to audit subsets of namespaces.
For namespaces, one can use a string like "A:B:C:D:E:F" as an identifier for a
particular combination, where A-F are the respective namespace identifiers
(taken, for example, from /proc/PID/ns/{mnt,uts,ipc,user,pid,net}).
That will even be grep-able to locate records related to a particular subset
of namespaces. So a "container" in the classic meaning would have all A-F
unique and different from the init process, but processes separated only by
e.g. mnt-ns and net-ns will differ from the init process in A and F.
(If a string is a no go, then perhaps combine the IDs in a unique way into a
super ID).
As has been mentioned in every other email in this thread, the kernel has no
concept of a container, it is a userspace idea and trying to generate a
meaningful value in the kernel is a mistake in my opinion. My current opinion
is that we allow userspace to set a container ID token as it sees fit and the
kernel will just use the value provided by userspace.
--
paul moore
security @ redhat
Paul Moore
2015-05-15 20:26:41 UTC
Permalink
Post by Steve Grubb
What they would want to know is what resources were assigned; if two
containers shared a resource, what resource and container was it shared
with; if two containers can communicate, we need to see or control
information flow when necessary; and we need to see termination and
release of resources.
So, namespaces are a big part of this. I understand how they are
spawned and potentially shared. I have a more vague idea about how
cgroups contribute to this concept of a container. So far, I have very
little idea how seccomp contributes, but I assume that it will also need
to be part of this tracking.
It doesn't, really. We shouldn't worry about seccomp from a
namespace/container auditing perspective. The normal seccomp auditing should
be sufficient for namespaces/containers.
--
paul moore
security @ redhat
Eric W. Biederman
2015-04-21 04:33:24 UTC
Permalink
Post by Richard Guy Briggs
The purpose is to track namespace instances in use by logged processes from the
perspective of init_*_ns by logging the namespace IDs (device ID and namespace
inode - offset).
In broad strokes the user interface appears correct.

Things that I see that concern me:

- After Al's most recent changes these inodes no longer live in the proc
superblock so the device number reported in these patches is
incorrect.

- I am nervous about audit logs being flooded with users creating lots
of namespaces. But that is more your lookout than mine.

- unshare is not logging when it creates new namespaces.

As small numbers are nice and these inodes all live in their own
superblock now we should be able to remove the games with
PROC_DYNAMIC_FIRST and just use small numbers for these inodes
everywhere.

I have answered your comments below.
Post by Richard Guy Briggs
1/10 exposes proc's ns entries structure which lists a number of useful
operations per namespace type for other subsystems to use.
2/10 proc_ns: define PROC_*_INIT_INO in terms of PROC_DYNAMIC_FIRST
3/10 provides an example of usage for audit_log_task_info() which is used by
syscall audits, among others. audit_log_task() and audit_common_recv_message()
would be other potential use cases.
This differs slightly from Aristeu's patch because of the label conflict with
"pid=" due to including it in existing records rather than it being a separate
record. It has now returned to being a separate record. The proc device
major/minor are listed in hexadecimal and namespace IDs are the proc inode
minus the base offset.
type=NS_INFO msg=audit(1408577535.306:82): dev=00:03 netns=3 utsns=-3 ipcns=-4 pidns=-1 userns=-2 mntns=0
4/10 change audit startup from __initcall to subsys_initcall to get it started
earlier to be able to receive initial namespace log messages.
5/10 tracks the creation and deletion of namespaces, listing the type of
namespace instance, proc device ID, related namespace id if there is one and
the newly minted namespace ID.
type=AUDIT_NS_INIT_UTS msg=audit(1408577534.868:5): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_utsns=(none) utsns=-3 res=1
type=AUDIT_NS_INIT_USER msg=audit(1408577534.868:6): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_userns=(none) userns=-2 res=1
type=AUDIT_NS_INIT_PID msg=audit(1408577534.868:7): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_pidns=(none) pidns=-1 res=1
type=AUDIT_NS_INIT_MNT msg=audit(1408577534.868:8): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_mntns=(none) mntns=0 res=1
type=AUDIT_NS_INIT_IPC msg=audit(1408577534.868:9): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_ipcns=(none) ipcns=-4 res=1
type=AUDIT_NS_INIT_NET msg=audit(1408577533.500:10): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_netns=(none) netns=2 res=1
type=type=AUDIT_NS_INIT_NET msg=audit(1408577535.306:81): pid=481 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 dev=00:03 old_netns=2 netns=3 res=1
type=type=AUDIT_NS_DEL_MNT msg=audit(1408577552.221:85): pid=481 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 dev=00:03 mntns=4 res=1
6/10 accepts a PID from userspace and requests logging an AUDIT_NS_INFO record
type (CAP_AUDIT_CONTROL required).
7/10 is a macro for CLONE_NEW_* flags.
8/10 adds auditing on creation of namespace(s) in fork.
9/10 adds auditing a change of namespace on setns.
10/10 attaches an AUDIT_NS_INFO record to AUDIT_VIRT_CONTROL records
(CAP_AUDIT_WRITE required).
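The record formats proposed in 3/10 and 5/10 are flat key=value pairs, so they are easy to pick apart in userspace. A minimal Python sketch, assuming fields are whitespace-separated and values contain no spaces (true of the sample records above, not guaranteed in general); parse_record and namespace_ids are hypothetical helper names, not part of the audit userspace tools:

```python
def parse_record(line: str) -> dict:
    """Split a proposed-format audit record into a field dictionary."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


def namespace_ids(fields: dict) -> dict:
    """Return the current *ns fields as integers.

    Skips the old_* fields and the '(none)' placeholder used when there
    is no previous namespace.
    """
    return {
        key: int(value)
        for key, value in fields.items()
        if key.endswith("ns")
        and not key.startswith("old_")
        and value != "(none)"
    }


if __name__ == "__main__":
    sample = ("type=NS_INFO msg=audit(1408577535.306:82): dev=00:03 "
              "netns=3 utsns=-3 ipcns=-4 pidns=-1 userns=-2 mntns=0")
    record = parse_record(sample)
    print(record["type"], record["dev"], namespace_ids(record))
```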
Switch to using namespace ID based on namespace proc inode minus base offset
Added proc device ID to qualify proc inode reference
Eliminate exposed /proc interface
Clean up prototypes for dependencies on CONFIG_NAMESPACES.
Add AUDIT_NS_INFO record type to AUDIT_VIRT_CONTROL record.
Log AUDIT_NS_INFO with PID.
Move /proc/<pid>/ns_* patches to end of patchset to deprecate them.
Log on changing ns (setns).
Log on creating new namespaces when forking.
Added a macro for CLONE_NEW*.
Separate out the NS_INFO message from the SYSCALL message.
Moved audit_log_namespace_info() out of audit_log_task_info().
Use a separate message type per namespace type for each of INIT/DEL.
Make ns= easier to search across NS_INFO and NS_INIT/DEL_XXX msg types.
Add /proc/<pid>/ns/ documentation.
Fix dynamic initial ns logging.
Use atomic64_t in ns_serial to simplify it.
Avoid function duplication in proc, keying on dentry.
Squash down audit patch to avoid rcu sleep issues.
Add tracking for creation and deletion of namespace instances.
Avoid rollover by switching from an int to a long long.
Change rollover behaviour from simply avoiding zero to raising a BUG.
Expose serial numbers in /proc/<pid>/ns/*_snum.
Expose ns_entries and use it in audit.
As for CAP_AUDIT_READ, a patchset has been accepted upstream to check
capabilities of userspace processes that try to join netlink broadcast groups.
This set does not try to solve the non-init namespace audit messages and
auditd problem yet. That will come later, likely with additional auditd
instances running in another namespace with a limited ability to influence the
master auditd. I echo Eric B's idea that messages destined for different
namespaces would have to be tailored for that namespace with references that
make sense (such as the right pid number reported to that pid namespace, and
not leaking info about parents or peers).
Is there a way to link serial numbers of namespaces involved in migration of a
container to another kernel? It sounds like what is needed is a part of a
management application that is able to pull the audit records from constituent
hosts to build an audit trail of a container.
I honestly don't know how much we are going to care about namespace ids
during migration. So far this is not a problem that has come up.

I don't think migration becomes a practical concern (other than
interface wise) until we achieve a non-init namespace auditd. The easy way
to handle migration would be to log a setns of every process from their
old namespaces to their new namespaces, as you appear to have a setns
event defined.

How to handle the more general case beyond audit remains unclear. I
think it will be a little while yet before we start dealing with
migrating applications that care. When we do we will either need to
generate some kind of hot-plug event that userspace can respond to and
discover all of the appropriate file-system nodes have changed, or we
will need to build a mechanism in the kernel to preserve these numbers.

I really don't know which solution we will wind up with in the kernel at
this point.
Post by Richard Guy Briggs
What additional events should list this information?
At least unshare.
Post by Richard Guy Briggs
Does this present any problematic information leaks? Only CAP_AUDIT_CONTROL
(and now CAP_AUDIT_READ) in init_user_ns can get to this information in
the init namespace at the moment from audit.
Good question. Today access to this information is generally guarded
with CAP_SYS_PTRACE.

I suspect for some of audit's tracing features like this one we should
also use CAP_SYS_PTRACE so that we have a consistent set of checks for
getting information about applications.

Eric
Post by Richard Guy Briggs
namespaces: expose ns_entries
proc_ns: define PROC_*_INIT_INO in terms of PROC_DYNAMIC_FIRST
audit: log namespace ID numbers
audit: initialize at subsystem time rather than device time
audit: log creation and deletion of namespace instances
audit: dump namespace IDs for pid on receipt of AUDIT_NS_INFO
sched: add a macro to ref all CLONE_NEW* flags
fork: audit on creation of new namespace(s)
audit: log on switching namespace (setns)
audit: emit AUDIT_NS_INFO record with AUDIT_VIRT_CONTROL record
fs/namespace.c | 13 +++
fs/proc/generic.c | 3 +-
fs/proc/namespaces.c | 2 +-
include/linux/audit.h | 20 +++++
include/linux/proc_ns.h | 10 ++-
include/uapi/linux/audit.h | 21 +++++
include/uapi/linux/sched.h | 6 ++
ipc/namespace.c | 12 +++
kernel/audit.c | 169 +++++++++++++++++++++++++++++++++++++-
kernel/auditsc.c | 2 +
kernel/fork.c | 3 +
kernel/nsproxy.c | 4 +
kernel/pid_namespace.c | 13 +++
kernel/user_namespace.c | 13 +++
kernel/utsname.c | 12 +++
net/core/net_namespace.c | 12 +++
security/integrity/ima/ima_api.c | 2 +
17 files changed, 309 insertions(+), 8 deletions(-)
Richard Guy Briggs
2015-04-23 03:07:51 UTC
Post by Eric W. Biederman
Post by Richard Guy Briggs
The purpose is to track namespace instances in use by logged processes from the
perspective of init_*_ns by logging the namespace IDs (device ID and namespace
inode - offset).
In broad strokes the user interface appears correct.
- After Al's most recent changes these inodes no longer live in the proc
superblock so the device number reported in these patches is
incorrect.
Ok, found the patchset you're talking about:
3d3d35b kill proc_ns completely
e149ed2 take the targets of /proc/*/ns/* symlinks to separate fs
f77c801 bury struct proc_ns in fs/proc
33c4294 copy address of proc_ns_ops into ns_common
6344c43 new helpers: ns_alloc_inum/ns_free_inum
6496452 make proc_ns_operations work with struct ns_common * instead of void *
3c04118 switch the rest of proc_ns_operations to working with &...->ns
ff24870 netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
58be2825 make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
435d5f4 common object embedded into various struct ....ns

Ok, I've got some minor jigging to do to get inum too...
Post by Eric W. Biederman
- I am nervous about audit logs being flooded with users creating lots
of namespaces. But that is more your lookout than mine.
There was a thought to create a filter to en/disable this logging...
It is an auxiliary record to syscalls, so they can be ignored by userspace tools.
Post by Eric W. Biederman
- unshare is not logging when it creates new namespaces.
They are all covered:
sys_unshare > unshare_userns > create_user_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_mnt_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_utsname > clone_uts_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_ipcs > get_ipc_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_pid_ns > create_pid_namespace
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_net_ns
Post by Eric W. Biederman
As small numbers are nice and these inodes all live in their own
superblock now we should be able to remove the games with
PROC_DYNAMIC_FIRST and just use small numbers for these inodes
everywhere.
That is compelling if I can untangle the proc inode allocation code from the
ida/idr. Should be as easy as defining a new ns_alloc_inum (and ns_free_inum)
to use instead of proc_alloc_inum with its own ns_inum_ida and ns_inum_lock,
then defining a NS_DYNAMIC_FIRST and defining NS_{IPC,UTS,USER,PID}_INIT_INO in
the place of the existing PROC_*_INIT_INO.
Post by Eric W. Biederman
I have answered your comments below.
More below...
Post by Eric W. Biederman
Post by Richard Guy Briggs
1/10 exposes proc's ns entries structure which lists a number of useful
operations per namespace type for other subsystems to use.
2/10 proc_ns: define PROC_*_INIT_INO in terms of PROC_DYNAMIC_FIRST
3/10 provides an example of usage for audit_log_task_info() which is used by
syscall audits, among others. audit_log_task() and audit_common_recv_message()
would be other potential use cases.
This differs slightly from Aristeu's patch because of the label conflict with
"pid=" due to including it in existing records rather than it being a separate
record. It has now returned to being a separate record. The proc device
major/minor are listed in hexadecimal and namespace IDs are the proc inode
minus the base offset.
type=NS_INFO msg=audit(1408577535.306:82): dev=00:03 netns=3 utsns=-3 ipcns=-4 pidns=-1 userns=-2 mntns=0
4/10 change audit startup from __initcall to subsys_initcall to get it started
earlier to be able to receive initial namespace log messages.
5/10 tracks the creation and deletion of namespaces, listing the type of
namespace instance, proc device ID, related namespace id if there is one and
the newly minted namespace ID.
type=AUDIT_NS_INIT_UTS msg=audit(1408577534.868:5): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_utsns=(none) utsns=-3 res=1
type=AUDIT_NS_INIT_USER msg=audit(1408577534.868:6): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_userns=(none) userns=-2 res=1
type=AUDIT_NS_INIT_PID msg=audit(1408577534.868:7): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_pidns=(none) pidns=-1 res=1
type=AUDIT_NS_INIT_MNT msg=audit(1408577534.868:8): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_mntns=(none) mntns=0 res=1
type=AUDIT_NS_INIT_IPC msg=audit(1408577534.868:9): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_ipcns=(none) ipcns=-4 res=1
type=AUDIT_NS_INIT_NET msg=audit(1408577533.500:10): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_netns=(none) netns=2 res=1
type=AUDIT_NS_INIT_NET msg=audit(1408577535.306:81): pid=481 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 dev=00:03 old_netns=2 netns=3 res=1
type=AUDIT_NS_DEL_MNT msg=audit(1408577552.221:85): pid=481 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 dev=00:03 mntns=4 res=1
6/10 accepts a PID from userspace and requests logging an AUDIT_NS_INFO record
type (CAP_AUDIT_CONTROL required).
7/10 is a macro for CLONE_NEW_* flags.
8/10 adds auditing on creation of namespace(s) in fork.
9/10 adds auditing a change of namespace on setns.
10/10 attaches an AUDIT_NS_INFO record to AUDIT_VIRT_CONTROL records
(CAP_AUDIT_WRITE required).
Switch to using namespace ID based on namespace proc inode minus base offset
Added proc device ID to qualify proc inode reference
Eliminate exposed /proc interface
Clean up prototypes for dependencies on CONFIG_NAMESPACES.
Add AUDIT_NS_INFO record type to AUDIT_VIRT_CONTROL record.
Log AUDIT_NS_INFO with PID.
Move /proc/<pid>/ns_* patches to end of patchset to deprecate them.
Log on changing ns (setns).
Log on creating new namespaces when forking.
Added a macro for CLONE_NEW*.
Separate out the NS_INFO message from the SYSCALL message.
Moved audit_log_namespace_info() out of audit_log_task_info().
Use a separate message type per namespace type for each of INIT/DEL.
Make ns= easier to search across NS_INFO and NS_INIT/DEL_XXX msg types.
Add /proc/<pid>/ns/ documentation.
Fix dynamic initial ns logging.
Use atomic64_t in ns_serial to simplify it.
Avoid function duplication in proc, keying on dentry.
Squash down audit patch to avoid rcu sleep issues.
Add tracking for creation and deletion of namespace instances.
Avoid rollover by switching from an int to a long long.
Change rollover behaviour from simply avoiding zero to raising a BUG.
Expose serial numbers in /proc/<pid>/ns/*_snum.
Expose ns_entries and use it in audit.
As for CAP_AUDIT_READ, a patchset has been accepted upstream to check
capabilities of userspace processes that try to join netlink broadcast groups.
This set does not try to solve the non-init namespace audit messages and
auditd problem yet. That will come later, likely with additional auditd
instances running in another namespace with a limited ability to influence the
master auditd. I echo Eric B's idea that messages destined for different
namespaces would have to be tailored for that namespace with references that
make sense (such as the right pid number reported to that pid namespace, and
not leaking info about parents or peers).
Is there a way to link serial numbers of namespaces involved in migration of a
container to another kernel? It sounds like what is needed is a part of a
management application that is able to pull the audit records from constituent
hosts to build an audit trail of a container.
I honestly don't know how much we are going to care about namespace ids
during migration. So far this is not a problem that has come up.
Not for CRIU, but it will be an issue for a container auditor that aggregates
information from individually audited hosts.
Post by Eric W. Biederman
I don't think migration becomes a practical concern (other than
interface wise) until we achieve a non-init namespace auditd. The easy way
to handle migration would be to log a setns of every process from their
old namespaces to their new namespaces, as you appear to have a setns
event defined.
Again, this would be taken care of by a layer above that is container-aware
across multiple hosts.
Post by Eric W. Biederman
How to handle the more general case beyond audit remains unclear. I
think it will be a little while yet before we start dealing with
migrating applications that care. When we do we will either need to
generate some kind of hot-plug event that userspace can respond to and
discover all of the appropriate file-system nodes have changed, or we
will need to build a mechanism in the kernel to preserve these numbers.
I don't expect to need to preserve these numbers. The higher layer application
will be able to do that translation.
Post by Eric W. Biederman
I really don't know which solution we will wind up with in the kernel at
this point.
Post by Richard Guy Briggs
What additional events should list this information?
At least unshare.
Already covered as noted above. If it is a brand new namespace, it will show
the old one as "(none)" (or maybe zero now that we are looking at renumbering
the NS inodes). If it is an unshared one, it will show the old one from which
it was unshared.
Post by Eric W. Biederman
Post by Richard Guy Briggs
Does this present any problematic information leaks? Only CAP_AUDIT_CONTROL
(and now CAP_AUDIT_READ) in init_user_ns can get to this information in
the init namespace at the moment from audit.
Good question. Today access to this information is generally guarded
with CAP_SYS_PTRACE.
I suspect for some of audit's tracing features like this one we should
also use CAP_SYS_PTRACE so that we have a consistent set of checks for
getting information about applications.
I assume CAP_SYS_PTRACE is orthogonal to CAP_AUDIT_{CONTROL,READ} and that
CAP_SYS_PTRACE on its own must not be sufficient to get that information.


Thanks for your thoughtful feedback, Eric.
Post by Eric W. Biederman
Eric
Post by Richard Guy Briggs
namespaces: expose ns_entries
proc_ns: define PROC_*_INIT_INO in terms of PROC_DYNAMIC_FIRST
audit: log namespace ID numbers
audit: initialize at subsystem time rather than device time
audit: log creation and deletion of namespace instances
audit: dump namespace IDs for pid on receipt of AUDIT_NS_INFO
sched: add a macro to ref all CLONE_NEW* flags
fork: audit on creation of new namespace(s)
audit: log on switching namespace (setns)
audit: emit AUDIT_NS_INFO record with AUDIT_VIRT_CONTROL record
fs/namespace.c | 13 +++
fs/proc/generic.c | 3 +-
fs/proc/namespaces.c | 2 +-
include/linux/audit.h | 20 +++++
include/linux/proc_ns.h | 10 ++-
include/uapi/linux/audit.h | 21 +++++
include/uapi/linux/sched.h | 6 ++
ipc/namespace.c | 12 +++
kernel/audit.c | 169 +++++++++++++++++++++++++++++++++++++-
kernel/auditsc.c | 2 +
kernel/fork.c | 3 +
kernel/nsproxy.c | 4 +
kernel/pid_namespace.c | 13 +++
kernel/user_namespace.c | 13 +++
kernel/utsname.c | 12 +++
net/core/net_namespace.c | 12 +++
security/integrity/ima/ima_api.c | 2 +
17 files changed, 309 insertions(+), 8 deletions(-)
- RGB

--
Richard Guy Briggs <***@redhat.com>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545
Richard Guy Briggs
2015-04-23 20:44:29 UTC
Post by Richard Guy Briggs
Post by Eric W. Biederman
Post by Richard Guy Briggs
The purpose is to track namespace instances in use by logged processes from the
perspective of init_*_ns by logging the namespace IDs (device ID and namespace
inode - offset).
In broad strokes the user interface appears correct.
- After Al's most recent changes these inodes no longer live in the proc
superblock so the device number reported in these patches is
incorrect.
3d3d35b kill proc_ns completely
e149ed2 take the targets of /proc/*/ns/* symlinks to separate fs
f77c801 bury struct proc_ns in fs/proc
33c4294 copy address of proc_ns_ops into ns_common
6344c43 new helpers: ns_alloc_inum/ns_free_inum
6496452 make proc_ns_operations work with struct ns_common * instead of void *
3c04118 switch the rest of proc_ns_operations to working with &...->ns
ff24870 netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
58be2825 make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
435d5f4 common object embedded into various struct ....ns
Ok, I've got some minor jigging to do to get inum too...
Do I even need to report the device number anymore since I am concluding
s_dev is never set (or always zero) in the nsfs filesystem by
mount_pseudo() and isn't even mountable? In fact, I never needed to
report the device since proc ida/idr and inodes are kernel-global and
namespace-oblivious.
- RGB

Eric W. Biederman
2015-04-24 19:36:16 UTC
Post by Richard Guy Briggs
Post by Richard Guy Briggs
Post by Eric W. Biederman
Post by Richard Guy Briggs
The purpose is to track namespace instances in use by logged processes from the
perspective of init_*_ns by logging the namespace IDs (device ID and namespace
inode - offset).
In broad strokes the user interface appears correct.
- After Al's most recent changes these inodes no longer live in the proc
superblock so the device number reported in these patches is
incorrect.
3d3d35b kill proc_ns completely
e149ed2 take the targets of /proc/*/ns/* symlinks to separate fs
f77c801 bury struct proc_ns in fs/proc
33c4294 copy address of proc_ns_ops into ns_common
6344c43 new helpers: ns_alloc_inum/ns_free_inum
6496452 make proc_ns_operations work with struct ns_common * instead of void *
3c04118 switch the rest of proc_ns_operations to working with &...->ns
ff24870 netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
58be2825 make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
435d5f4 common object embedded into various struct ....ns
Ok, I've got some minor jigging to do to get inum too...
Do I even need to report the device number anymore since I am concluding
s_dev is never set (or always zero) in the nsfs filesystem by
mount_pseudo() and isn't even mountable?
We still need the dev. We do have a device number; get_anon_bdev fills it in.
Post by Richard Guy Briggs
In fact, I never needed to
report the device since proc ida/idr and inodes are kernel-global and
namespace-oblivious.
This is the bit I really want to keep to be forward looking. If we
ever need to preserve the inode numbers across a migration we could
have different super blocks with different inode numbers for the same
namespace.
Post by Richard Guy Briggs
Post by Richard Guy Briggs
Post by Eric W. Biederman
- I am nervous about audit logs being flooded with users creating lots
of namespaces. But that is more your lookout than mine.
There was a thought to create a filter to en/disable this logging...
It is an auxiliary record to syscalls, so they can be ignored by userspace tools.
Post by Eric W. Biederman
- unshare is not logging when it creates new namespaces.
sys_unshare > unshare_userns > create_user_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_mnt_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_utsname > clone_uts_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_ipcs > get_ipc_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_pid_ns > create_pid_namespace
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_net_ns
Then why the special change to fork? That was not reflected on
the unshare path as far as I could see.
Post by Richard Guy Briggs
Post by Richard Guy Briggs
Post by Eric W. Biederman
As small numbers are nice and these inodes all live in their own
superblock now we should be able to remove the games with
PROC_DYNAMIC_FIRST and just use small numbers for these inodes
everywhere.
That is compelling if I can untangle the proc inode allocation code from the
ida/idr. Should be as easy as defining a new ns_alloc_inum (and ns_free_inum)
to use instead of proc_alloc_inum with its own ns_inum_ida and ns_inum_lock,
then defining a NS_DYNAMIC_FIRST and defining NS_{IPC,UTS,USER,PID}_INIT_INO in
the place of the existing PROC_*_INIT_INO.
Something like that. Just a new ida/idr allocator specific to that
superblock.

Yeah. It is somewhere on my todo, but I have been prioritizing getting
the bugs that look potentially exploitable fixed in the mount
namespace. Al made things nice for one case but left a mess for a bunch
of others.
Post by Richard Guy Briggs
Post by Richard Guy Briggs
Post by Eric W. Biederman
I honestly don't know how much we are going to care about namespace ids
during migration. So far this is not a problem that has come up.
Not for CRIU, but it will be an issue for a container auditor that aggregates
information from individually audited hosts.
Post by Eric W. Biederman
I don't think migration becomes a practical concern (other than
interface wise) until we achieve a non-init namespace auditd. The easy way
to handle migration would be to log a setns of every process from their
old namespaces to their new namespaces, as you appear to have a setns
event defined.
Again, this would be taken care of by a layer above that is container-aware
across multiple hosts.
Post by Eric W. Biederman
How to handle the more general case beyond audit remains unclear. I
think it will be a little while yet before we start dealing with
migrating applications that care. When we do we will either need to
generate some kind of hot-plug event that userspace can respond to and
discover all of the appropriate file-system nodes have changed, or we
will need to build a mechanism in the kernel to preserve these numbers.
I don't expect to need to preserve these numbers. The higher layer application
will be able to do that translation.
We need to be very aware of what is happening.

The situation I am concerned about looks something like this:

Program A:
fd1 = open("/proc/self/ns/net", O_RDONLY);
fstat(fd1, &stat1);

... later ...

fd2 = open("/var/run/netns/johnny", O_RDONLY);
fstat(fd2, &stat2);

if ((stat1.st_dev == stat2.st_dev) &&
    (stat1.st_ino == stat2.st_ino)) {
        /* Same netns, do something... */
}


What happens when we migrate Program A with its cached stat data of
a network namespace file?

This requires either a hotplug event that Program A listens to or that
the inode number and device number are preserved across migration.

Exactly what we do depends on where we are when it comes up. But this
is not something some layer above the program can abstract away so that
we don't need to worry about it.
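
To make the comparison above concrete, here is a minimal userspace sketch
of the identity check Program A performs. The helper names ns_stat_equal()
and ns_path_same() are hypothetical and do not come from the patch set; the
only assumption is that a namespace file is identified by the (st_dev,
st_ino) pair returned by stat().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not from the thread): two stat results name the
 * same namespace only if both the device and the inode number match.
 */
bool ns_stat_equal(const struct stat *a, const struct stat *b)
{
	assert(a != NULL && b != NULL);
	return a->st_dev == b->st_dev && a->st_ino == b->st_ino;
}

/*
 * Hypothetical helper: stat two namespace files by path and compare them.
 * Returns 1 if they name the same namespace, 0 if not, -1 on error.
 */
int ns_path_same(const char *path_a, const char *path_b)
{
	struct stat sa, sb;

	if (stat(path_a, &sa) != 0 || stat(path_b, &sb) != 0)
		return -1;
	return ns_stat_equal(&sa, &sb) ? 1 : 0;
}
```

A caller could use ns_path_same("/proc/self/ns/net", "/var/run/netns/johnny").
Any result cached from before a migration is only meaningful afterwards if
both numbers are preserved, which is the concern raised here.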

Eric
Richard Guy Briggs
2015-04-28 02:05:55 UTC
Post by Eric W. Biederman
Post by Richard Guy Briggs
Post by Richard Guy Briggs
Post by Eric W. Biederman
Post by Richard Guy Briggs
The purpose is to track namespace instances in use by logged processes from the
perspective of init_*_ns by logging the namespace IDs (device ID and namespace
inode - offset).
In broad strokes the user interface appears correct.
- After Al's most recent changes these inodes no longer live in the proc
superblock so the device number reported in these patches is
incorrect.
3d3d35b kill proc_ns completely
e149ed2 take the targets of /proc/*/ns/* symlinks to separate fs
f77c801 bury struct proc_ns in fs/proc
33c4294 copy address of proc_ns_ops into ns_common
6344c43 new helpers: ns_alloc_inum/ns_free_inum
6496452 make proc_ns_operations work with struct ns_common * instead of void *
3c04118 switch the rest of proc_ns_operations to working with &...->ns
ff24870 netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
58be2825 make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
435d5f4 common object embedded into various struct ....ns
Ok, I've got some minor jigging to do to get inum too...
Do I even need to report the device number anymore since I am concluding
s_dev is never set (or always zero) in the nsfs filesystem by
mount_pseudo() and isn't even mountable?
We still need the dev. We do have a device number get_anon_bdev fills it in.
Fine, it has a device number. There appears to be only one of these
allocated per kernel. I can get it from &nsfs->fs_supers (and take the
first instance given by hlist_for_each_entry and verify there are no
others). Why do I need it, again?
Post by Eric W. Biederman
Post by Richard Guy Briggs
In fact, I never needed to
report the device since proc ida/idr and inodes are kernel-global and
namespace-oblivious.
This is the bit I really want to keep to be forward looking. If we
ever need to preserve the inode numbers across a migration we could
have different super blocks with different inode numbers for the same
namespace.
I don't quite follow your argument here, but can accept that in the
future we might add other namespace devices. I wonder if we might do
that augmentation later and leave out the device number for now...
Post by Eric W. Biederman
Post by Richard Guy Briggs
Post by Richard Guy Briggs
Post by Eric W. Biederman
- I am nervous about audit logs being flooded with users creating lots
of namespaces. But that is more your lookout than mine.
There was a thought to create a filter to en/disable this logging...
It is an auxiliary record to syscalls, so they can be ignored by userspace tools.
Post by Eric W. Biederman
- unshare is not logging when it creates new namespaces.
sys_unshare > unshare_userns > create_user_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_mnt_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_utsname > clone_uts_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_ipcs > get_ipc_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_pid_ns > create_pid_namespace
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_net_ns
Then why the special change to fork? That was not reflected on
the unshare path as far as I could see.
Fork can specify more than one CLONE flag at once, so collecting them
all in one statement seemed helpful. setns can only set one at a time.
Post by Eric W. Biederman
Post by Richard Guy Briggs
Post by Richard Guy Briggs
Post by Eric W. Biederman
As small numbers are nice and these inodes all live in their own
superblock now we should be able to remove the games with
PROC_DYNAMIC_FIRST and just use small numbers for these inodes
everywhere.
That is compelling if I can untangle the proc inode allocation code from the
ida/idr. Should be as easy as defining a new ns_alloc_inum (and ns_free_inum)
to use instead of proc_alloc_inum with its own ns_inum_ida and ns_inum_lock,
then defining a NS_DYNAMIC_FIRST and defining NS_{IPC,UTS,USER,PID}_INIT_INO in
the place of the existing PROC_*_INIT_INO.
Something like that. Just a new ida/idr allocator specific to that
superblock.
Yeah. It is somewhere on my todo, but I have been prioritizing getting
the bugs that look potentially exploitable fixed in the mount
namespace. Al made things nice for one case but left a mess for a bunch
of others.
Post by Richard Guy Briggs
Post by Richard Guy Briggs
Post by Eric W. Biederman
I honestly don't know how much we are going to care about namespace ids
during migration. So far this is not a problem that has come up.
Not for CRIU, but it will be an issue for a container auditor that aggregates
information from individually audited hosts.
Post by Eric W. Biederman
I don't think migration becomes a practical concern (other than
interface wise) until we achieve a non-init namespace auditd. The easy way
to handle migration would be to log a setns of every process from their
old namespaces to their new namespaces, as you appear to have a setns
event defined.
Again, this would be taken care of by a layer above that is container-aware
across multiple hosts.
Post by Eric W. Biederman
How to handle the more general case beyond audit remains unclear. I
think it will be a little while yet before we start dealing with
migrating applications that care. When we do we will either need to
generate some kind of hot-plug event that userspace can respond to and
discover all of the appropriate file-system nodes have changed, or we
will need to build a mechanism in the kernel to preserve these numbers.
I don't expect to need to preserve these numbers. The higher layer application
will be able to do that translation.
We need to be very aware of what is happening.
The situation I am concerned about looks something like this:
Program A:
fd1 = open("/proc/self/ns/net", O_RDONLY);
fstat(fd1, &stat1);
... later ...
fd2 = open("/var/run/netns/johnny", O_RDONLY);
fstat(fd2, &stat2);
if ((stat1.st_dev == stat2.st_dev) &&
    (stat1.st_ino == stat2.st_ino)) {
        /* Same netns, do something... */
}
What happens when we migrate Program A with its cached stat data of
a network namespace file?
This requires either a hotplug event that Program A listens to or that
the inode number and device number are preserved across migration.
Exactly what we do depends on where we are when it comes up. But this
is not something some layer above the program can abstract away so that
we don't need to worry about it.
Ok, understood, we can't just punt this one to a higher layer...

So this comes back to a question above, which is how do we determine
which device it is from? Sounds like we need something added to
ns_common or one of the 6 namespace types structs.
Post by Eric W. Biederman
Eric
- RGB

--
Richard Guy Briggs <***@redhat.com>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545
Eric W. Biederman
2015-04-28 02:16:32 UTC
Post by Richard Guy Briggs
Post by Eric W. Biederman
Post by Richard Guy Briggs
Do I even need to report the device number anymore since I am concluding
s_dev is never set (or always zero) in the nsfs filesystem by
mount_pseudo() and isn't even mountable?
We still need the dev. We do have a device number get_anon_bdev fills it in.
Fine, it has a device number. There appears to be only one of these
allocated per kernel. I can get it from &nsfs->fs_supers (and take the
first instance given by hlist_for_each_entry and verify there are no
others). Why do I need it, again?
Because if we have to preserve the inode number over a migration event I
want to preserve the fact that we are talking about inode numbers from a
superblock with a device number.

Otherwise known as I am allergic to kernel global identifiers, because
they can be major pains. I don't want to have to go back and implement
a namespace for namespaces.
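
As an illustration of keeping the device number attached to the inode
number, here is a small userspace sketch that treats a namespace identifier
as a (device, inode) pair and prints it in a style similar to the proposed
record format from the cover letter (device major:minor in hexadecimal).
struct ns_id, ns_id_from_path() and ns_id_format() are hypothetical names
for illustration only, and the sketch prints the inode number as given
rather than applying any base offset.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Hypothetical identifier: inode number qualified by its superblock's device. */
struct ns_id {
	dev_t dev;
	ino_t ino;
};

/* Fill *id from a namespace file such as /proc/self/ns/net; 0 on success. */
int ns_id_from_path(const char *path, struct ns_id *id)
{
	struct stat st;

	assert(path != NULL && id != NULL);
	if (stat(path, &st) != 0)
		return -1;
	id->dev = st.st_dev;
	id->ino = st.st_ino;
	return 0;
}

/*
 * Format as "dev=MM:mm <type>ns=<ino>", e.g. "dev=00:03 netns=3".
 * Returns what snprintf() returns (the untruncated length).
 */
int ns_id_format(const struct ns_id *id, const char *type,
		 char *buf, size_t len)
{
	return snprintf(buf, len, "dev=%02x:%02x %sns=%llu",
			major(id->dev), minor(id->dev), type,
			(unsigned long long)id->ino);
}
```

Two identifiers with equal inode numbers but different device numbers stay
distinguishable this way, which is the property being asked for.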
Post by Richard Guy Briggs
Post by Eric W. Biederman
Post by Richard Guy Briggs
Post by Richard Guy Briggs
sys_unshare > unshare_userns > create_user_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_mnt_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_utsname > clone_uts_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_ipcs > get_ipc_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_pid_ns > create_pid_namespace
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_net_ns
Then why the special change to fork? That was not reflected on
the unshare path as far as I could see.
Fork can specify more than one CLONE flag at once, so collecting them
all in one statement seemed helpful. setns can only set one at a time.
unshare can also specify more than one CLONE flag at once.

I just pointed that out because that seemed really asymmetrical.
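
For illustration, both clone() and unshare() take a bit mask that may
request several new namespaces at once. The sketch below is a hypothetical
userspace helper (ns_flags_describe() is not part of the patch set) that
lists every namespace type requested in such a mask, which is the kind of
information a single log record could collect on either path. The fallback
flag values are the ones from the Linux UAPI headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <stdio.h>

/* Fallbacks in case the libc headers do not expose a flag. */
#ifndef CLONE_NEWNS
#define CLONE_NEWNS	0x00020000
#endif
#ifndef CLONE_NEWUTS
#define CLONE_NEWUTS	0x04000000
#endif
#ifndef CLONE_NEWIPC
#define CLONE_NEWIPC	0x08000000
#endif
#ifndef CLONE_NEWUSER
#define CLONE_NEWUSER	0x10000000
#endif
#ifndef CLONE_NEWPID
#define CLONE_NEWPID	0x20000000
#endif
#ifndef CLONE_NEWNET
#define CLONE_NEWNET	0x40000000
#endif

struct ns_flag {
	unsigned long flag;
	const char *name;
};

static const struct ns_flag ns_flags[] = {
	{ CLONE_NEWNS,   "mnt"  },
	{ CLONE_NEWUTS,  "uts"  },
	{ CLONE_NEWIPC,  "ipc"  },
	{ CLONE_NEWUSER, "user" },
	{ CLONE_NEWPID,  "pid"  },
	{ CLONE_NEWNET,  "net"  },
};

/*
 * Write a comma-separated list of the namespace types requested in
 * 'flags' into buf. Returns the number of types written.
 */
int ns_flags_describe(unsigned long flags, char *buf, size_t len)
{
	size_t used = 0;
	int count = 0;

	assert(buf != NULL || len == 0);
	if (len > 0)
		buf[0] = '\0';
	for (size_t i = 0; i < sizeof(ns_flags) / sizeof(ns_flags[0]); i++) {
		if (!(flags & ns_flags[i].flag))
			continue;
		int n = snprintf(buf + used, len - used, "%s%s",
				 count ? "," : "", ns_flags[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			/* Out of space: drop the partial entry and stop. */
			if (used < len)
				buf[used] = '\0';
			break;
		}
		used += (size_t)n;
		count++;
	}
	return count;
}
```

Because the mask is decoded the same way regardless of which syscall
supplied it, the same helper would serve the fork and unshare paths alike.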
Post by Richard Guy Briggs
Ok, understood, we can't just punt this one to a higher layer...
So this comes back to a question above, which is how do we determine
which device it is from? Sounds like we need something added to
ns_common or one of the 6 namespace types structs.
Or we can just hard code reading it off of the appropriate magic
filesystem. Probably what we want is a well named helper function that
does the job.

I just care that when we talk about these things we are talking about
inode numbers from a superblock that is associated with a given device
number. That way I don't have nightmares about dealing with a namespace
for namespaces.

Eric
Richard Guy Briggs
2015-05-08 14:42:50 UTC
Post by Eric W. Biederman
Post by Richard Guy Briggs
Post by Eric W. Biederman
Post by Richard Guy Briggs
Do I even need to report the device number anymore since I am concluding
s_dev is never set (or always zero) in the nsfs filesystem by
mount_pseudo() and isn't even mountable?
We still need the dev. We do have a device number get_anon_bdev fills it in.
Fine, it has a device number. There appears to be only one of these
allocated per kernel. I can get it from &nsfs->fs_supers (and take the
first instance given by hlist_for_each_entry and verify there are no
others). Why do I need it, again?
Because if we have to preserve the inode number over a migration event I
want to preserve the fact that we are talking about inode numbers from a
superblock with a device number.
Otherwise known as I am allergic to kernel global identifiers, because
they can be major pains. I don't want to have to go back and implement
a namespace for namespaces.
Alright, I'll change the device over to that... We can figure out how
to select the correct device number among nsfs instances if their count
grows beyond one.
Post by Eric W. Biederman
Post by Richard Guy Briggs
Post by Eric W. Biederman
Post by Richard Guy Briggs
Post by Richard Guy Briggs
sys_unshare > unshare_userns > create_user_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_mnt_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_utsname > clone_uts_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_ipcs > get_ipc_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_pid_ns > create_pid_namespace
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_net_ns
Then why the special change to fork? That was not reflected on
the unshare path as far as I could see.
Fork can specify more than one CLONE flag at once, so collecting them
all in one statement seemed helpful. setns can only set one at a time.
unshare can also specify more than one CLONE flag at once.
I just pointed that out because that seemed really asymmetrical.
Ah sorry, my mistake, I was thinking setns... I've added a call in
sys_unshare().
Post by Eric W. Biederman
Post by Richard Guy Briggs
Ok, understood, we can't just punt this one to a higher layer...
So this comes back to a question above, which is how do we determine
which device it is from? Sounds like we need something added to
ns_common or one of the 6 namespace types structs.
Or we can just hard code reading it off of the appropriate magic
filesystem. Probably what we want is a well named helper function that
does the job.
There is a bit of overhead to read that, so I've added a dev_t member to
ns_common. Simplest way I found was to call iterate_supers() since
struct file_system_type *nsfs isn't exposed.
Post by Eric W. Biederman
I just care that when we talk about these things we are talking about
inode numbers from a superblock that is associated with a given device
number. That way I don't have nightmares about dealing with a namespace
for namespaces.
Eric
- RGB
