RFC(V3): Audit Kernel Container IDs

Discussion:

Richard Guy Briggs

2018-01-09 12:16:20 UTC

Containers are a userspace concept. The kernel knows nothing of them.

The Linux audit system needs a way to be able to track the container
provenance of events and actions. Audit needs the kernel's help to do
this.

Since the concept of a container is entirely a userspace concept, a
registration from the userspace container orchestration system initiates
this. This will define a point in time and a set of resources
associated with a particular container with an audit container
identifier.

The registration is a u64 representing the audit container identifier
written to a special file in a pseudo filesystem (proc, since PID tree
already exists) representing a process that will become a parent process
in that container. This write might place restrictions on mount
namespaces required to define a container, or at least careful checking
of namespaces in the kernel to verify permissions of the orchestrator so
it can't change its own container ID. A bind mount of nsfs may be
necessary in the container orchestrator's mount namespace. This write
can only happen once per process.

Note: The justification for using a u64 is that it minimizes the
information printed in every audit record, reducing bandwidth and limits
comparisons to a single u64 which will be faster and less error-prone.

Require CAP_AUDIT_CONTROL to be able to carry out the registration. At
that time, record the target container's user-supplied audit container
identifier along with a target container's parent process (which may
become the target container's "init" process) process ID (referenced
from the initial PID namespace) in a new record AUDIT_CONTAINER with a
qualifying op=$action field.

Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid
container ID present on an auditable action or event.

Forked and cloned processes inherit their parent's audit container
identifier, referenced in the process' task_struct. Since the audit
container identifier is inherited rather than written, it can still be
written once. This will prevent tampering while allowing nesting.
(This can be implemented with an internal settable flag upon
registration that does not get copied across a fork/clone.)

Mimic setns(2) and return an error if the process has already initiated
threading or forked since this registration should happen before the
process execution is started by the orchestrator and hence should not
yet have any threads or children. If this is deemed overly restrictive,
switch all of the target's threads and children to the new containerID.

Trust the orchestrator to judiciously use and restrict CAP_AUDIT_CONTROL.

When a container ceases to exist because the last process in that
container has exited log the fact to balance the registration action.
(This is likely needed for certification accountability.)

At this point it appears unnecessary to add a container session
identifier since this is all tracked from loginuid and sessionid to
communicate with the container orchestrator to spawn an additional
session into an existing container which would be logged. It can be
added at a later date without breaking API should it be deemed
necessary.

The following namespace logging actions are not needed for certification
purposes at this point, but are helpful for tracking namespace activity.
These are auxilliary records that are associated with namespace
manipulation syscalls unshare(2), clone(2) and setns(2), so the records
will only show up if explicit syscall rules have been added to document
this activity.

Log the creation of every namespace, inheriting/adding its spawning
process' audit container identifier(s), if applicable. Include the
spawning and spawned namespace IDs (device and inode number tuples).
[AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
Note: At this point it appears only network namespaces may need to track
container IDs apart from processes since incoming packets may cause an
auditable event before being associated with a process. Since a
namespace can be shared by processes in different containers, the
namespace will need to track all containers to which it has been
assigned.

Upon registration, the target process' namespace IDs (in the form of a
nsfs device number and inode number tuple) will be recorded in an
AUDIT_NS_INFO auxilliary record.

Log the destruction of every namespace that is no longer used by any
process, including the namespace IDs (device and inode number tuples).
[AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]

Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action)
the parent and child namespace IDs for any changes to a process'
namespaces. [setns(2)]
Note: It may be possible to combine AUDIT_NS_* record formats and
distinguish them with an op=$action field depending on the fields
required for each message type.

The audit container identifier will need to be reaped from all
implicated namespaces upon the destruction of a container.

This namespace information adds supporting information for tracking
events not attributable to specific processes.

Changelog:

(Upstream V3)
- switch back to u64 (from pmoore, can be expanded to u128 in future if
need arises without breaking API. u32 was originally proposed, up to
c36 discussed)
- write-once, but children inherit audit container identifier and can
then still be written once
- switch to CAP_AUDIT_CONTROL
- group namespace actions together, auxilliary records to namespace
operations.

(Upstream V2)
- switch from u64 to u128 UUID
- switch from "signal" and "trigger" to "register"
- restrict registration to single process or force all threads and
children into same container

- RGB

--
Richard Guy Briggs <***@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

Simo Sorce

2018-01-09 16:18:56 UTC

Permalink

Post by Richard Guy Briggs
Containers are a userspace concept. The kernel knows nothing of them.
The Linux audit system needs a way to be able to track the container
provenance of events and actions. Audit needs the kernel's help to do
this.
Since the concept of a container is entirely a userspace concept, a
registration from the userspace container orchestration system initiates
this. This will define a point in time and a set of resources
associated with a particular container with an audit container
identifier.
The registration is a u64 representing the audit container identifier
written to a special file in a pseudo filesystem (proc, since PID tree
already exists) representing a process that will become a parent process
in that container. This write might place restrictions on mount
namespaces required to define a container, or at least careful checking
of namespaces in the kernel to verify permissions of the orchestrator so
it can't change its own container ID. A bind mount of nsfs may be
necessary in the container orchestrator's mount namespace. This write
can only happen once per process.
Note: The justification for using a u64 is that it minimizes the
information printed in every audit record, reducing bandwidth and limits
comparisons to a single u64 which will be faster and less error-prone.
Require CAP_AUDIT_CONTROL to be able to carry out the registration. At
that time, record the target container's user-supplied audit container
identifier along with a target container's parent process (which may
become the target container's "init" process) process ID (referenced
from the initial PID namespace) in a new record AUDIT_CONTAINER with a
qualifying op=$action field.
Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid
container ID present on an auditable action or event.
Forked and cloned processes inherit their parent's audit container
identifier, referenced in the process' task_struct. Since the audit
container identifier is inherited rather than written, it can still be
written once. This will prevent tampering while allowing nesting.
(This can be implemented with an internal settable flag upon
registration that does not get copied across a fork/clone.)
Mimic setns(2) and return an error if the process has already initiated
threading or forked since this registration should happen before the
process execution is started by the orchestrator and hence should not
yet have any threads or children. If this is deemed overly restrictive,
switch all of the target's threads and children to the new containerID.
Trust the orchestrator to judiciously use and restrict CAP_AUDIT_CONTROL.
When a container ceases to exist because the last process in that
container has exited log the fact to balance the registration action.
(This is likely needed for certification accountability.)
At this point it appears unnecessary to add a container session
identifier since this is all tracked from loginuid and sessionid to
communicate with the container orchestrator to spawn an additional
session into an existing container which would be logged. It can be
added at a later date without breaking API should it be deemed
necessary.
The following namespace logging actions are not needed for certification
purposes at this point, but are helpful for tracking namespace activity.
These are auxilliary records that are associated with namespace
manipulation syscalls unshare(2), clone(2) and setns(2), so the records
will only show up if explicit syscall rules have been added to document
this activity.
Log the creation of every namespace, inheriting/adding its spawning
process' audit container identifier(s), if applicable. Include the
spawning and spawned namespace IDs (device and inode number tuples).
[AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
Note: At this point it appears only network namespaces may need to track
container IDs apart from processes since incoming packets may cause an
auditable event before being associated with a process. Since a
namespace can be shared by processes in different containers, the
namespace will need to track all containers to which it has been
assigned.
Upon registration, the target process' namespace IDs (in the form of a
nsfs device number and inode number tuple) will be recorded in an
AUDIT_NS_INFO auxilliary record.
Log the destruction of every namespace that is no longer used by any
process, including the namespace IDs (device and inode number tuples).
[AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]
Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action)
the parent and child namespace IDs for any changes to a process'
namespaces. [setns(2)]
Note: It may be possible to combine AUDIT_NS_* record formats and
distinguish them with an op=$action field depending on the fields
required for each message type.
The audit container identifier will need to be reaped from all
implicated namespaces upon the destruction of a container.
This namespace information adds supporting information for tracking
events not attributable to specific processes.
(Upstream V3)
- switch back to u64 (from pmoore, can be expanded to u128 in future if
need arises without breaking API. u32 was originally proposed, up to
c36 discussed)
- write-once, but children inherit audit container identifier and can
then still be written once
- switch to CAP_AUDIT_CONTROL
- group namespace actions together, auxilliary records to namespace
operations.
(Upstream V2)
- switch from u64 to u128 UUID
- switch from "signal" and "trigger" to "register"
- restrict registration to single process or force all threads and
children into same container

I am trying to understand the back and forth on the ID size.

Post by Richard Guy Briggs
From an orchestrator POV anything that requires tracking a node

specific ID is not ideal.

Orchestrators tend to span many nodes, and containers tend to have IDs
that are either UUID or have a Hash (like SHA256) as identifier.

The problem here is two-fold:

a) Your auditing requires some mapping to be useful outside of the
system.
If you aggreggate audit logs outside of the system or you want to
correlate the system audit logs with other components dealing with
containers, now you need a place where you provide a mapping from your
audit u64 to the ID a container has in the rest of the system.

b) Now you need a mapping of some sort. The simplest way a container
orchestrator can go about this is to just use the UUID or Hash
representing their view of the container, truncate it to a u64 and use
that for Audit. This means there are some chances there will be a
collision and a duplicate u64 ID will be used by the orchestrator as
the container ID. What happen in that case ?

Simo.

--
Simo Sorce
Sr. Principal Software Engineer
Red Hat, Inc

Richard Guy Briggs

2018-01-10 07:00:11 UTC