# Securing Linux Containers

## 1. Table of contents

<!--toc:start-->

- [Securing Linux Containers](#securing-linux-containers)
  - [1. Table of contents](#1-table-of-contents)
  - [2. Introduction](#2-introduction)
  - [3. Secrets](#3-secrets)
    - [3.1 Alternatives](#31-alternatives)
      - [3.1.1 Files](#311-files)
      - [3.1.2 Secrets Management Services (kubernetes)](#312-secrets-management-services-kubernetes)
  - [4. Users and groups](#4-users-and-groups)
    - [Setting user and group](#setting-user-and-group)
      - [Containerfile/Dockerfile](#containerfiledockerfile)
      - [Changing user/group arbitrarily on container startup](#changing-usergroup-arbitrarily-on-container-startup)
    - [Additional security](#additional-security)
  - [5. Filesystem](#5-filesystem)
    - [Read-only](#read-only)
    - [Additional Protection with nosuid, noexec, and nodev](#additional-protection-with-nosuid-noexec-and-nodev)
  - [6. Resources limits](#6-resources-limits)
    - [CPU](#cpu)
    - [RAM](#ram)
  - [7. Network](#7-network)
    - [Desktop tools](#desktop-tools)
    - [Kubernetes](#kubernetes)
  - [8. OCI Images](#8-oci-images)
    - [8.1 Building](#81-building)
    - [8.2 Scanning](#82-scanning)
  - [9. SELinux](#9-selinux)

<!--toc:end-->

## 2. Introduction

This document is a collection of simple, generic tips and best practices for
securing Linux containers. Containerization is often considered safer by
default, yet some vulnerabilities turn out to be especially harmful for
applications running in containers
(example: [CVE-2023-49103](https://nvd.nist.gov/vuln/detail/CVE-2023-49103)).
The tips collected here should help raise awareness of how to keep containers
secure in practice. The content is kept container-engine agnostic, but the
examples are based on actual implementations (Podman, Kubernetes).

## 3. Secrets

Secrets are the most sensitive data, because they usually open access to other
private data. They may also allow modification of the environment, which opens
the way to further access or other forms of attack.

> [!WARNING]
> Don't use environment variables for secrets

Container isolation made providing and managing secrets somewhat harder, as
they need to cross an additional barrier. This caused the rather dangerous
trend of passing secrets, along with other configuration data, as environment
variables. At first sight it might look like a good idea, but compared to
other ways of storing secrets, environment variables can be much easier for an
attacker to access than, for example, arbitrary files.
[CVE-2023-49103](https://nvd.nist.gov/vuln/detail/CVE-2023-49103)
is just one example of a vulnerability that was considered more dangerous for
containerized apps, because it is based on gaining access to environment
variables.
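
The inheritance problem can be shown with a minimal sketch. The variable name
`DEMO_SECRET` and its value are hypothetical and only serve the illustration:

```shell
#!/bin/sh
# DEMO_SECRET and its value are hypothetical, used only for illustration.
DEMO_SECRET='not-a-real-secret'
export DEMO_SECRET

# Every child process inherits exported variables, so any component that can
# dump its environment (debug page, crash report) exposes the secret as well.
visible_in_child=$(sh -c 'env | grep -c "^DEMO_SECRET="')
echo "matches in child environment while exported: $visible_in_child"

# Once the variable is removed from the environment, children no longer see it.
unset DEMO_SECRET
visible_after_unset=$(sh -c 'env | grep -c "^DEMO_SECRET=" || true')
echo "matches in child environment after unset: $visible_after_unset"
```

A secret kept in a file is not passed on this way; a process has to open the
file explicitly and needs permission to read it.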

### 3.1 Alternatives

#### 3.1.1 Files

Files with secrets are common and broadly supported. With proper setup they can
also be very secure.

- Keep configuration and secret files on an entirely different path than other
  data.
- If the application runs its main process under a different user than its
  worker processes (workers usually handle user interaction directly), the
  configuration should not be readable by the worker process user.
- Depending on the technology used, storage of the secret files inside of a
  container can be temporary/volatile. In Kubernetes, Secret objects are mounted
  as tmpfs. Example of mounting a Secret as tmpfs in a pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: registry.fedoraproject.org/fedora-minimal:latest
    command: [ "sleep", "infinity" ]
    volumeMounts:
      - mountPath: /config
        name: config
  volumes:
  - name: config
    secret:
      secretName: config
```

This produces a read-only tmpfs mount inside the container:

```bash
bash-5.2# df -h /config/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           4.8G  4.0K  4.8G   1% /config

bash-5.2# ls -la /config/
total 0
drwxrwxrwt. 3 root root 100 Nov  9 14:00 .
drwxr-xr-x. 1 root root  24 Nov  9 14:00 ..
drwxr-xr-x. 2 root root  60 Nov  9 14:00 ..2024_11_09_14_00_47.4065932771
lrwxrwxrwx. 1 root root  32 Nov  9 14:00 ..data -> ..2024_11_09_14_00_47.4065932771
lrwxrwxrwx. 1 root root  18 Nov  9 14:00 secret.conf -> ..data/secret.conf
```
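
Outside of Kubernetes, the same ideas (a separate path, owner-only access) can
be sketched in plain shell. The paths and the secret value below are
hypothetical; a temporary directory stands in for a dedicated secrets location:

```shell
#!/bin/sh
# Paths and the secret value are hypothetical; a temporary directory stands in
# for a dedicated secrets location such as the /config mount above.
secret_dir=$(mktemp -d)
secret_file="$secret_dir/secret.conf"

# Create the file with owner-only permissions from the start (umask 077),
# in a subshell so the umask does not affect the rest of the script.
( umask 077 && printf '%s\n' 'not-a-real-secret' > "$secret_file" )

# Show the resulting mode: only the owner can read or write the file.
file_mode=$(ls -l "$secret_file" | cut -c1-10)
echo "mode: $file_mode"

# Read the secret into a shell variable without exporting it, so it is not
# handed to child processes through the environment.
secret_value=$(cat "$secret_file")

rm -rf "$secret_dir"
```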

#### 3.1.2 Secrets Management Services (kubernetes)

There are sophisticated tools for managing and deploying secrets available for
Kubernetes, for example HashiCorp Vault. It offers dynamic secrets, secret
rotation, and access policies. Such tools are most helpful in large
environments and infrastructures, where secret management is split among many
people.

## 4. Users and groups

Users and groups are the standard mechanisms for security and for limiting
permissions in Unix-like systems. Container engines usually make it possible
to assign them arbitrarily to the containerized process.

> [!NOTE]
> Both user and group can always be specified by numeric ID, even if no actual
> user or group is assigned to that ID. When specified by name, the user
> or group must exist **inside** of the container (`/etc/passwd`, `/etc/group`).

> [!NOTE]
> Processes of rootless containers or containers with UID/GID mapping have
> different IDs inside and outside of the container. This complicates things
> further, but it usually also greatly increases security.
> In some scenarios such mapping can cause trouble with files in the
> container image, if their IDs are outside of the mapping range.

### Setting user and group

Containers have a default user and group specified in the Containerfile, but
both can be changed when starting the container.

#### Containerfile/Dockerfile

In a Containerfile the user/group assignment can take place many times in a
single build. The typical reason is to have high privileges (root) during the
build, and then set the default to an unprivileged user at the end of the
build, so that containers use it by default.

Setting just user to "user1"

```Dockerfile
USER user1
```

Setting both user and group

```Dockerfile
USER user1:group1
```

Setting just group

```Dockerfile
USER :group1
```

#### Changing user/group arbitrarily on container startup

Podman and Docker use the `--user` flag (or the shorter `-u`) to specify both
user and group. The syntax is the same as shown for the Containerfile. Example
of setting both user and group to bin, with the user specified by numeric ID:

```bash
❯ podman run --rm -it --user 1:bin registry.fedoraproject.org/fedora-minimal
bash-5.2$ whoami
bin
bash-5.2$ groups
bin
bash-5.2$ grep ^bin /etc/passwd
bin:x:1:1:bin:/bin:/usr/sbin/nologin
bash-5.2$ grep ^bin /etc/group
bin:x:1:
```

For Kubernetes, the user and group are specified in the pod definition:

```yaml
apiVersion: v1
kind: Pod
spec:
  securityContext:
    runAsUser: 1
    runAsGroup: 1
```

> [!NOTE]
> In Kubernetes you can't specify the user or the group by name.
> Only numeric values are allowed.
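
Since the user can be overridden at startup, an entrypoint script may want to
verify that it was not started as root. A minimal sketch follows; the helper
name `refuse_root` is hypothetical and not part of any container engine:

```shell
#!/bin/sh
# refuse_root is a hypothetical helper: it fails when the given UID is 0.
# Without an argument it checks the UID of the current process.
refuse_root() {
    uid=${1:-$(id -u)}
    if [ "$uid" -eq 0 ]; then
        echo "refusing to start as root (UID 0)" >&2
        return 1
    fi
    echo "running as unprivileged UID $uid"
}

# Typical use at the top of an entrypoint script:
#   refuse_root || exit 1
refuse_root 1000
```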

### Additional security

The Linux kernel provides a useful feature, the [No New Privileges Flag](https://docs.kernel.org/userspace-api/no_new_privs.html).
Once set for a process, it guarantees that neither the process nor its children
can gain privileges they did not already have when executing a program. This
effectively disables file capabilities and the setuid/setgid bits on files,
which are well-known and powerful tools for exploitation.

In Podman and Docker, the flag can be enabled using the parameter
`--security-opt no-new-privileges`.

In Kubernetes, the setting is part of the per-container security context:

```yaml
# ...
  containers:
  - name: mycontainer
    securityContext:
      allowPrivilegeEscalation: false
# ...
```
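
Whether the flag is active can be checked from inside the container through
the `NoNewPrivs` field of `/proc/self/status` (exposed by Linux 4.10 and newer).
The helper name below is hypothetical, and a sample file stands in for the real
status file so the sketch runs anywhere:

```shell
#!/bin/sh
# no_new_privs_state is a hypothetical helper that prints the NoNewPrivs
# field (0 or 1) from a /proc/<pid>/status style file.
no_new_privs_state() {
    status_file=${1:-/proc/self/status}
    awk '$1 == "NoNewPrivs:" { print $2 }' "$status_file"
}

# Inside a container, call it without arguments to read /proc/self/status.
# Here a sample file stands in for the real one.
sample=$(mktemp)
printf 'Name:\tsleep\nNoNewPrivs:\t1\nSeccomp:\t2\n' > "$sample"
echo "NoNewPrivs: $(no_new_privs_state "$sample")"
rm -f "$sample"
```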

## 5. Filesystem

By default the filesystem security of containers is quite good, especially
when combined with other mechanisms like SELinux or mapped UIDs/GIDs, but
there is still room for improvement.

### Read-only

Both the base filesystem and mounted volumes can be set to read-only.
With a read-only filesystem, certain directories may still need to be
writable, such as `/tmp` or `/var/tmp`. This is where tmpfs (temporary
filesystem) can be used. A tmpfs mount keeps its contents in memory, allowing
these directories to be writable without compromising the overall read-only
nature of the filesystem. The directory starts empty and vanishes on container
shutdown, which also increases security if the temporary data is sensitive.

Running a Podman container with a read-only base filesystem using `--read-only`:

```bash
podman run --rm -it --read-only registry.fedoraproject.org/fedora-minimal
```

> [!NOTE]
> Podman simplifies the use of `--read-only` by automatically creating
> read-write tmpfs mounts in places where they are usually needed, like
> `/dev/shm`, `/tmp`, `/run`, etc.

Mounting a tmpfs directory with a specific size limit in a Podman container using `--tmpfs`:

```bash
podman run --rm -it --read-only --tmpfs /tmp:rw,size=64m registry.fedoraproject.org/fedora-minimal
```

A Podman volume is mounted read-only by specifying the `ro` mount option
after the `:` separator, for example `--tmpfs /test:ro` or
`-v /host/path:/container/path:ro`.

On Kubernetes, the base filesystem of a container is set to read-only with the
`readOnlyRootFilesystem: true` attribute in the container security context. To
mount any volume as read-only, use the `readOnly: true` attribute in the mount
section.

Full Kubernetes example of a read-only base filesystem and an example volume:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: readonly-pod
spec:
  containers:
  - name: mycontainer
    image: registry.fedoraproject.org/fedora-minimal:latest
    command: ["sleep", "infinity"]
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
    - mountPath: /test
      readOnly: true
      name: tmpfs
  volumes:
  - name: tmpfs
    emptyDir:
      medium: Memory
      sizeLimit: 64Mi
```

### Additional Protection with nosuid, noexec, and nodev

To further enhance security, you can use the `nosuid`, `noexec`, and `nodev`
mount options for volumes. They can also be used for tmpfs mounts.

- nosuid: Ignores set-user-ID and set-group-ID bits (and file capabilities) when programs from the mounted filesystem are executed.
- noexec: Prevents direct execution of any binaries on the mounted filesystem.
- nodev: Prevents device files on the mounted filesystem from being interpreted as devices.

Example using Podman:

```bash
❯ podman run --rm -it --read-only --tmpfs /test:nodev,nosuid,noexec registry.fedoraproject.org/fedora-minimal
bash-5.2# mount | grep /test
tmpfs on /test type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:container_file_t:s0:c240,c646",uid=1000,gid=1000,inode64)
```
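
The check shown above can be automated, for example in a startup or test
script. The helper `has_mount_options` below is a hypothetical sketch that
reads a mount table in `/proc/mounts` format; a sample file is used here so
that it runs outside of a container as well:

```shell
#!/bin/sh
# has_mount_options is a hypothetical helper: it succeeds when the mount
# point ($1) carries every option listed in $2 (comma separated).
# $3 is the mount table to read, /proc/mounts by default.
has_mount_options() {
    mount_point=$1
    wanted=$2
    table=${3:-/proc/mounts}
    # Field 2 of the mount table is the mount point, field 4 the option list.
    actual=$(awk -v mp="$mount_point" '$2 == mp { opts = $4 } END { print opts }' "$table")
    [ -n "$actual" ] || return 1
    for opt in $(echo "$wanted" | tr ',' ' '); do
        case ",$actual," in
            *",$opt,"*) ;;
            *) echo "missing option: $opt" >&2; return 1 ;;
        esac
    done
    return 0
}

# Sample mount table standing in for /proc/mounts.
sample=$(mktemp)
printf 'tmpfs /test tmpfs rw,nosuid,nodev,noexec,relatime 0 0\n' > "$sample"
has_mount_options /test nosuid,nodev,noexec "$sample" && echo "/test is hardened"
rm -f "$sample"
```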

## 6. Resources limits

Setting resource limits for containers is required to ensure that no single
container can consume excessive resources, which could impact the performance
and stability of the entire system or of neighbouring workloads.

### CPU

Because containers are not virtualized, all CPU cores and threads of the host
are visible inside of a container. CPU limiting is therefore done by the
scheduler, which restricts the CPU time the container may use. The limit is
usually expressed in CPUs (vCPUs). In Podman you can set it using the `--cpus`
flag. For example, `--cpus=2` limits the container to 2/X of the total CPU
time of a host with X hardware threads. On a CPU with 16 threads this means
the container can use up to 12.5% of the total CPU capacity. The container is
not pinned to specific hardware threads: under high load its work is still
balanced across all threads, but the total CPU time it consumes stays within
the limit.

Kubernetes works the same way; limits are specified per container:

```yaml
# ...
spec:
  containers:
  - name: app
    resources:
      limits:
        cpu: "2"
# ...
```
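
The arithmetic behind a CPU limit can be sketched as follows. The helper name
is hypothetical, and the sketch assumes the default scheduler period of
100000 microseconds, for which `--cpus` is a shorthand combining a period and
a quota:

```shell
#!/bin/sh
# cpu_limit_summary is a hypothetical helper that prints what a CPU limit
# means on a host with a given number of hardware threads.
# It assumes the default CFS period of 100000 microseconds.
cpu_limit_summary() {
    cpus=$1
    host_threads=$2
    awk -v c="$cpus" -v t="$host_threads" 'BEGIN {
        period = 100000
        printf "quota=%d period=%d share=%.1f%%\n", c * period, period, c / t * 100
    }'
}

# A limit of 2 CPUs on a host with 16 threads:
cpu_limit_summary 2 16
# prints: quota=200000 period=100000 share=12.5%
```

In other words, the container may use 200 ms of CPU time in every 100 ms
window, spread over any of the host threads.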

### RAM

Limiting RAM for a container looks similar to CPU limiting, except that
crossing the limit is handled more brutally: when software inside of a
container tries to use more memory than allowed, the memory-hungry process is
killed. This can be unintuitive for the application, because here again it
sees all the memory available on the host system and does not know about the
limit (unless configured).

Podman has a simple `--memory` flag which configures the limit. `--memory=512m`
limits the container to 512 MiB.

Kubernetes works similarly:

```yaml
# ...
spec:
  containers:
  - name: app
    resources:
      limits:
        memory: "512Mi"
# ...
```
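
An application or entrypoint can discover the limit itself instead of relying
on the memory size reported for the host. The sketch below assumes cgroup v2,
where the limit of the current cgroup is exposed in `memory.max`; the helper
name is hypothetical and a sample file stands in for the real one:

```shell
#!/bin/sh
# memory_limit_mib is a hypothetical helper that prints the cgroup v2 memory
# limit in MiB, or "unlimited" when no limit is set.
memory_limit_mib() {
    limit_file=${1:-/sys/fs/cgroup/memory.max}
    limit=$(cat "$limit_file")
    if [ "$limit" = "max" ]; then
        echo "unlimited"
    else
        echo $((limit / 1024 / 1024))
    fi
}

# Sample file standing in for /sys/fs/cgroup/memory.max with a 512 MiB limit.
sample=$(mktemp)
echo 536870912 > "$sample"
echo "memory limit: $(memory_limit_mib "$sample") MiB"
rm -f "$sample"
```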

## 7. Network

For network isolation, Linux containers leverage network namespaces.

A network namespace is a feature provided by the Linux kernel that allows for
the creation of isolated, independent network stacks. Each network namespace
has its own separate set of network interfaces, routing tables, firewall
rules, and other network-related resources. This allows complex network
configurations, but it also leads to differences between container engine
implementations.
Additionally, rootless containers, which are considered safer, need
to fall back to different network components with reduced
capabilities, because managing the network requires root privileges.

### Desktop tools

Container engines suitable for desktop use, like Podman, usually have limited
options for network configuration. They allow isolating pods from the host and
from each other with different network address pools, and even disabling the
network entirely, which is very safe but rarely practical.

For such tools there are a few rules that should increase security:

- Don't disable isolation. Isolation makes access harder for a remote attacker,
  even if they can reach any port on the container host machine.
- When opening ports to access the app from outside, bind to the least
  accessible but sufficient interface/address. For example, if you only expect
  to access the app locally, you can bind to localhost in Podman using the
  flag `-p 127.0.0.1:8080:8080`, which opens port 8080 only for localhost.

### Kubernetes

Kubernetes offers far more control over both ingress and egress traffic.
The primary tool for that is the Network Policy, which is implemented by
network plugins (therefore it might not be available on some k8s clusters).

Network Policies allow very precise limitation of network traffic,
thanks to the following capabilities:

- Using labels to select the pods to which the network policy
  applies. This allows you to target specific groups of pods based on their labels.
- Applying network policies across namespaces by selecting
  namespaces based on their labels.
- Defining rules based on specific protocols (TCP, UDP) and ports on which traffic is allowed.
- Support for arbitrary address ranges in CIDR notation.

Example network policy definition:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: app1
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: app
      ports:
      - protocol: TCP
        port: 123
      - protocol: TCP
        port: 456
    - from:
      - ipBlock:
          cidr: 10.43.0.0/16
      - ipBlock:
          cidr: fe80::8cb6:aff8:8dc9:f511/64
      ports:
      - protocol: TCP
        port: 443
  egress:
    - to:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: kube-system
      ports:
      - protocol: UDP
        port: 53
    - to:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: db
      ports:
      - protocol: TCP
        port: 5432
```

## 8. OCI Images

Containers technically don't require images, as the base filesystem can be
provided in a different way, but OCI images have become the industry standard.
Images are another important element of (in)security in containerization.
It is crucial to understand the basics of the format, because it can, for
example, leak secrets to the public if used incorrectly.

### 8.1 Building

It is obvious that one should not hardcode secrets into an image. Unfortunately,
fewer users are aware of how to avoid it. When building an image, any
instruction that can modify the filesystem of the image is saved separately as
a layer. By default each layer is kept in the image, even if in the end all
contents of some of those layers were removed.

Example of **insecure** Containerfile:

```Dockerfile
FROM registry.fedoraproject.org/fedora-minimal

# Copy secret into the image (bad practice)
COPY secret.txt ./secret.txt

# Use and delete secret (but it's still in a previous layer)
RUN cat secret.txt && rm secret.txt
```

There is a way to modify the filesystem of the image being built in a much more
secure manner, which also brings other benefits. It is called a multi-stage
build and, as the name suggests, contains multiple stages, where only the
layers of the last stage are saved in the resulting image.

The Containerfile can look like this:

```dockerfile
# Stage 1: Use secret during the build
FROM registry.fedoraproject.org/fedora-minimal AS builder

WORKDIR /app

# Copy application files
COPY app/ /app/

# Copy the secret into the build stage
COPY secret.txt /app/secret.txt

# Use the secret securely (e.g., configure app)
RUN cat /app/secret.txt && echo "Configuring app with secret" > config.txt

# Removing the secret in this example is needed, because in the next stage
# the /app dir will be copied as a whole
RUN rm /app/secret.txt

# Stage 2: Final image without secrets
FROM registry.fedoraproject.org/fedora-minimal

# Nothing is saved from previous stage
WORKDIR /app

# Copy only the necessary files from the builder stage
COPY --from=builder /app/ /app/
```

This approach also helps keep the images minimal, without any other
leftovers, which can improve security as well.
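
A different technique, not used in the example above, is a build-time secret
mount, which avoids copying the secret into any stage at all. The sketch below
only writes such a Containerfile and prints the build command instead of
running it; the secret id `app_secret`, the image tag and the file names are
hypothetical:

```shell
#!/bin/sh
# Sketch of a build that uses a secret mount instead of COPY.
# The secret id "app_secret", the tag and the file names are hypothetical.
build_dir=$(mktemp -d)

cat > "$build_dir/Containerfile" <<'EOF'
FROM registry.fedoraproject.org/fedora-minimal

WORKDIR /app

# The secret is available under /run/secrets/app_secret only while this RUN
# instruction executes; it is not written to any image layer.
RUN --mount=type=secret,id=app_secret \
    cat /run/secrets/app_secret > /dev/null && \
    echo "Configuring app with secret" > config.txt
EOF

# Command to run on a host with Podman (printed here, not executed):
build_cmd="podman build --secret id=app_secret,src=secret.txt -t app $build_dir"
echo "$build_cmd"
```

Anything the `RUN` instruction itself writes to the filesystem still ends up in
the layer, so the secret must not be copied into the generated files.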

### 8.2 Scanning

Images can be scanned for vulnerabilities. This is useful for images of any
type and source, since vulnerabilities appear even in the most basic
components like language interpreters, libc libraries, etc. There are tools
for manual scanning like [trivy](https://github.com/aquasecurity/trivy), and
some registries like [Harbor](https://goharbor.io/) have optional built-in
automatic vulnerability scanning for any stored image.

These tools can provide a descriptive analysis of image contents, taking into
account the versions of most software stored inside (if supported).

Example fragment of output of trivy scanning a python image:

![trivy](./trivy.jpg)

## 9. SELinux

SELinux (Security-Enhanced Linux) is a security module for Linux that enforces
mandatory access control (MAC) policies to restrict the actions of users and
applications based on predefined rules, enhancing system security. SELinux
works by labeling all files, processes, and resources on a system with security
contexts. Policies define rules about how these labels can interact. When an
action is attempted, SELinux checks the labels against the policies and either
allows or denies the action based on the rules, enforcing least-privilege access.

This document is too short to explain in detail how SELinux works, but
for container management the most important concepts are MCS
(Multi-Category Security) and MLS (Multi-Level Security), described in the
[Red Hat documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html-single/using_selinux/index#multi-level-security-mls_using-multi-level-security-mls).

SELinux additionally secures the containerized program by not allowing it to
access resources outside of the container. Container engines like Podman
randomize categories by default, so that, for example, two different containers
cannot access the same volume by default.

Category randomization can be observed by running subsequent containers and
checking their SELinux context:

```bash
❯ podman run --rm -it fedora-minimal cat /proc/self/attr/current
system_u:system_r:container_t:s0:c340,c364
~
❯ podman run --rm -it fedora-minimal cat /proc/self/attr/current
system_u:system_r:container_t:s0:c202,c993
~
❯ podman run --rm -it fedora-minimal cat /proc/self/attr/current
system_u:system_r:container_t:s0:c259,c971
```
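
The part of the context that differs between containers is the category pair
at the end. A minimal sketch of extracting and comparing it, using two of the
contexts shown above (the helper name is hypothetical):

```shell
#!/bin/sh
# mcs_categories is a hypothetical helper that prints the category part of an
# SELinux context in user:role:type:sensitivity:categories form.
mcs_categories() {
    echo "$1" | cut -d: -f5
}

first='system_u:system_r:container_t:s0:c340,c364'
second='system_u:system_r:container_t:s0:c202,c993'

# Same user, role, type and sensitivity, but different categories.
if [ "$(mcs_categories "$first")" != "$(mcs_categories "$second")" ]; then
    echo "different categories: $(mcs_categories "$first") vs $(mcs_categories "$second")"
fi
```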
-												initial commit, secrets WiP

											
										
										
											2024-11-09 15:04:29 +01:00
+								# Securing Linux Containers
-												updates to formatting, filesystem section

											
										
										
											2024-12-31 16:55:32 +01:00
+								## 1. Table of contents
-												initial commit, secrets WiP

											
										
										
											2024-11-09 15:04:29 +01:00
 								<!--toc:start-->
-												updates to formatting, filesystem section

											
										
										
											2024-12-31 16:55:32 +01:00
-												initial commit, secrets WiP

											
										
										
											2024-11-09 15:04:29 +01:00
+								- [Securing Linux Containers](#securing-linux-containers)
-												updates to formatting, filesystem section

											
										
										
											2024-12-31 16:55:32 +01:00
+								  - [1. Table of contents](#1-table-of-contents)
 								  - [2. Introduction](#2-introduction)
 								  - [3. Secrets](#3-secrets)
 								    - [3.1 Alternatives](#31-alternatives)
 								      - [3.1.1 Files](#311-files)
 								      - [3.1.2 Secrets Management Services (kubernetes)](#312-secrets-management-services-kubernetes)
 								  - [4. Users and groups](#4-users-and-groups)
 								    - [Setting user and group](#setting-user-and-group)
 								      - [Containerfile/Dockerfile](#containerfiledockerfile)
 								      - [Changing user/group arbitrarily on container startup](#changing-usergroup-arbitrarily-on-container-startup)
 								    - [Additional security](#additional-security)
 								  - [5. Filesystem](#5-filesystem)
 								    - [Read-only](#read-only)
 								    - [Additional Protection with nosuid, noexec, and nodev](#additional-protection-with-nosuid-noexec-and-nodev)
 								  - [6. Resources limits](#6-resources-limits)
-												resources, network WiP

											
										
										
											2025-01-01 14:28:52 +01:00
+								    - [CPU](#cpu)
 								    - [RAM](#ram)
-												updates to formatting, filesystem section

											
										
										
											2024-12-31 16:55:32 +01:00
+								  - [7. Network](#7-network)
-												resources, network WiP

											
										
										
											2025-01-01 14:28:52 +01:00
+								    - [Desktop tools](#desktop-tools)
 								    - [Kubernetes](#kubernetes)
-												updates to formatting, filesystem section

											
										
										
											2024-12-31 16:55:32 +01:00
+								  - [8. Images](#8-images)
-												add numeration

											
										
										
											2024-11-09 15:10:24 +01:00
+								  - [8.1 Building](#81-building)
 								  - [8.2 Scanning](#82-scanning)
-												updates to formatting, filesystem section

											
										
										
											2024-12-31 16:55:32 +01:00
+								  - [9. Selinux](#9-selinux)
-												initial commit, secrets WiP

											
										
										
											2024-11-09 15:04:29 +01:00
+								<!--toc:end-->
-												updates to formatting, filesystem section

											
										
										
											2024-12-31 16:55:32 +01:00
+								## 2. Introduction
-												initial commit, secrets WiP

											
										
										
											2024-11-09 15:04:29 +01:00
-												resources, network WiP

											
										
										
											2025-01-01 14:28:52 +01:00
+								This document is a collection of simple, very generic tips and best
 								practices related to security of Linux containers. Contenerization is
-												initial commit, secrets WiP

											
										
										
											2024-11-09 15:04:29 +01:00
+								considered safer by default, but then one can hear about discovered
 								vulnerabilities that are primarly bad for applications in containers
 								(Example: [CVE-2023-49103](https://nvd.nist.gov/vuln/detail/CVE-2023-49103)).
 								Tips and best practices collected here should help raise awarness about
 								how to keep containers really secure. Contents are kept container-engine
 								agnostic, but examples will be based on actual implementations (Podman, k8s).
-												updates to formatting, filesystem section

											
										
										
											2024-12-31 16:55:32 +01:00
+								## 3. Secrets
-												initial commit, secrets WiP

											
										
										
											2024-11-09 15:04:29 +01:00
 								Secret is the most vulnerable data, as it usually can open access to other
 								private data. They might also allow modification of the environment, which
 								means possibilities for further access or many other forms of attack.
 								> [!WARNING]
 								> Don't use environment variables for secrets
 								Container isolation made providing and managing secrets somewhat harder, as
 								they need to cross the additional barier. This casued the rather dangerous
 								trend of providing secrets among many other configuration data in form of
 								environment variables. At first sight it might look like good idea, but when
 								actually compared to other means of storing secrets it turns out that
 								environment variables might be much easier to access by attacker, than
 								for example arbitrary files. [CVE-2023-49103](https://nvd.nist.gov/vuln/detail/CVE-2023-49103)
 								is only an example of vulnerability which was considered to be more
 								dangerous for contenerized apps, because of the vulnerability
 								being based on gaining access to env variables.
-												updates to formatting, filesystem section

											
										
										
											2024-12-31 16:55:32 +01:00
+								### 3.1 Alternatives
-												initial commit, secrets WiP

											
										
										
											2024-11-09 15:04:29 +01:00
-												updates to formatting, filesystem section

											
										
										
											2024-12-31 16:55:32 +01:00
+								#### 3.1.1 Files
-												initial commit, secrets WiP

											
										
										
											2024-11-09 15:04:29 +01:00
 								Files with secrets are common and broadly supported. With proper setup they can
 								be also very secure.
 								- Keep configuration and secret files on entirely different path than other data
 								- If application runs main process under different user than worker processes
-												updates to formatting, filesystem section

											
										
										
											2024-12-31 16:55:32 +01:00
+								  (worker usually have direct contact with user interaction), the configuration
 								  should not be readable by the worker process user.
-												initial commit, secrets WiP

											
										
										
											2024-11-09 15:04:29 +01:00
+								- Depending on the technology used, storage of the secret files inside of a
-												updates to formatting, filesystem section

											
										
										
											2024-12-31 16:55:32 +01:00
+								  container could be temporary/volatile. In kubernetes Secret objects are mounted
 								  as tmpfs. Example for mounting secret as tmpfs in pod:
-												initial commit, secrets WiP

											
										
										
											2024-11-09 15:04:29 +01:00
 								```yaml
 								apiVersion: v1
 								kind: Pod
 								metadata:
 								  name: app
 								spec:
 								  containers:
 								  - name: app
 								    image: registry.fedoraproject.org/fedora-minimal:latest
 								    command: [ "sleep", "infinity" ]
 								    volumeMounts:
 								      - mountPath: /config
 								        name: config
 								  volumes:
 								  - name: config
 								    secret:
 								      secretName: config
 								```
 								This produces readonly tmpfs mount inside:
 								```bash
 								bash-5.2# df -h /config/
 								Filesystem      Size  Used Avail Use% Mounted on
 								tmpfs           4.8G  4.0K  4.8G   1% /config
 								bash-5.2# ls -la /config/
 								total 0
 								drwxrwxrwt. 3 root root 100 Nov  9 14:00 .
 								drwxr-xr-x. 1 root root  24 Nov  9 14:00 ..
 								drwxr-xr-x. 2 root root  60 Nov  9 14:00 ..2024_11_09_14_00_47.4065932771
 								lrwxrwxrwx. 1 root root  32 Nov  9 14:00 ..data -> ..2024_11_09_14_00_47.4065932771
 								lrwxrwxrwx. 1 root root  18 Nov  9 14:00 secret.conf -> ..data/secret.conf
 								```
-												updates to formatting, filesystem section

											
										
										
											2024-12-31 16:55:32 +01:00
+								#### 3.1.2 Secrets Management Services (kubernetes)
-												users and groups

											
										
										
											2024-12-30 19:01:39 +01:00
 								There are sophisticated tools for secret management and their deployment,
 								available for kubernetes. For example HashiCorp Vault. It offers dynamic
 								secrets, secret rotation, and access policies. Such tools are most helpfull in
 								large environments and infrastructures, where secret management is split
 								among many people.
-												updates to formatting, filesystem section

											
										
										
											2024-12-31 16:55:32 +01:00
+								## 4. Users and groups
-												initial commit, secrets WiP

											
										
										
											2024-11-09 15:04:29 +01:00
-												users and groups

											
										
										
											2024-12-30 19:01:39 +01:00
+								Users and groups are standard mechanisms for security and permissions limiting
 								in unix-like systems. Contenerization engines usually have possibility to
 								arbitrarily assign them to the contenerized program process.
 								> [!NOTE]
 								> Both user and group can always be specified by numeric id even if no actual
 								> user or group is assigned to them. When specifying with string name, the user
 								> or group must exist **inside** of the container (`/etc/passwd`, `/etc/group`)
 								> [!NOTE]
 								> Processes of rootless containers or containers with uid/gid mapping have
 								> different id's inside of container and outside. This can complicate things
 								> even more, but that also usually greatly increases security.
 								> In some scenarios such mapping can also cause trouble with files in
 								> container image, if their id's are out of mapping range.
-												updates to formatting, filesystem section

											
										
										
											2024-12-31 16:55:32 +01:00
+								### Setting user and group
-												users and groups

											
										
										
											2024-12-30 19:01:39 +01:00
 								Containers have default user and group specified by Containerfile, but
 								it can be changed when starting the container.
-												updates to formatting, filesystem section

											
										
										
											2024-12-31 16:55:32 +01:00
+								#### Containerfile/Dockerfile
-												users and groups

											
										
										
											2024-12-30 19:01:39 +01:00
 								In Containerfile the user/group assignment might take place many times in
 								single build. Typical reason for that is to have high privilige (root) during
 								build, and then set default to unpriviliged user at the end of build, so that
 								containers will use it by default.
 								Setting just user to "user1"
 								```Dockerfile
 								USER user1
 								```
 								Setting both user and group
-												updates to formatting, filesystem section

											
										
										
											2024-12-31 16:55:32 +01:00
-												users and groups

											
										
										
											2024-12-30 19:01:39 +01:00
+								```Dockerfile
 								USER user1:group1
 								```
 								Setting just group
```Dockerfile
 								USER :group1
 								```
#### Changing user/group arbitrarily on container startup
Podman and Docker use the `--user` flag (or the shorter `-u`) to specify both
user and group. The syntax is the same as shown for the Containerfile. Example
of setting both user and group to bin, with the user specified by numeric ID:
 								```bash
 								❯ podman run --rm -it --user 1:bin registry.fedoraproject.org/fedora-minimal
 								bash-5.2$ whoami
 								bin
 								bash-5.2$ groups
 								bin
 								bash-5.2$ grep ^bin /etc/passwd
 								bin:x:1:1:bin:/bin:/usr/sbin/nologin
 								bash-5.2$ grep ^bin /etc/group
 								bin:x:1:
 								```
For Kubernetes, the user and group are specified in the pod definition:
 								```yaml
 								apiVersion: v1
 								kind: Pod
 								spec:
 								  securityContext:
 								    runAsUser: 1
 								    runAsGroup: 1
 								```
> [!NOTE]
> In Kubernetes you can't specify the user or the group by name.
> Only numeric values are allowed.
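
Whichever mechanism sets the user, an image can also defend itself against being started as root. Below is a minimal sketch of such a guard for an entrypoint script, assuming a POSIX shell is available in the image. The function name `require_non_root` is hypothetical and not part of any tool.

```bash
# Hypothetical helper for an entrypoint script: refuse to continue as root.
# The UID defaults to the effective UID reported by `id -u`; an explicit
# value can be passed as the first argument.
require_non_root() {
    uid="${1:-$(id -u)}"
    if [ "$uid" -eq 0 ]; then
        echo "refusing to run as root (uid 0)" >&2
        return 1
    fi
    echo "running as uid $uid"
}
```

An entrypoint could call `require_non_root || exit 1` before starting the application. The check only sees the ID inside the container, so in a rootless container UID 0 is still reported as root even though it is mapped to an unprivileged user on the host.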
### Additional security
The Linux kernel provides a useful feature, the [No New Privileges Flag](https://docs.kernel.org/userspace-api/no_new_privs.html).
If set for a process, it prevents the process and its children from gaining
more privileges than the parent process had. This effectively neutralizes file
capabilities and the setuid and setgid bits on executables, which are
well-known and powerful tools for exploitation.

In Podman and Docker, the flag can be enabled with the parameter `--security-opt no-new-privileges`.

In Kubernetes, there is a section related to the security context of each container:
 								```yaml
 								(....)
 								  containers:
 								  - name: mycontainer
 								    securityContext:
 								      allowPrivilegeEscalation: false
 								(....)
```
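
Whether the flag is actually in effect can be checked from inside a running container, because the kernel (4.10 and later) reports it in the `NoNewPrivs` field of `/proc/<pid>/status`. Here is a minimal sketch; the helper name `no_new_privs_enabled` and its optional file argument exist only for illustration.

```bash
# Succeeds when the given status file reports "NoNewPrivs: 1".
# Defaults to /proc/self/status; the flag is inherited by child processes,
# so reading it through awk reflects the calling shell as well.
no_new_privs_enabled() {
    status_file="${1:-/proc/self/status}"
    value=$(awk '$1 == "NoNewPrivs:" { print $2 }' "$status_file" 2>/dev/null)
    [ "$value" = "1" ]
}

if no_new_privs_enabled; then
    echo "no_new_privs is set"
else
    echo "no_new_privs is NOT set"
fi
```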
## 5. Filesystem

By default the filesystem security of containers is quite good, especially
when combined with other mechanisms like SELinux or mapped UIDs/GIDs, but
there is still room for improvement.

### Read-only

Both the base filesystem and mounted volumes can be set to read-only.
When using a read-only filesystem, certain directories may still need to be
writable, such as `/tmp` or `/var/tmp`. This is where tmpfs (temporary
filesystem) can be used. A tmpfs mount is a temporary filesystem kept in
memory, which allows these directories to be writable without compromising the
overall read-only nature of the filesystem. The directory starts empty and
vanishes on container shutdown, which also increases security if the temporary
data is sensitive.

Running a Podman container with a read-only base filesystem using `--read-only`:
 								```bash
 								podman run --rm -it --read-only registry.fedoraproject.org/fedora-minimal
 								```
> [!NOTE]
> Podman simplifies the use of `--read-only` by automatically creating
> read-write tmpfs mounts in the places where they are usually needed, like
> `/dev/shm`, `/tmp`, `/run`, etc.

Mounting a tmpfs directory with a specific size limit into a Podman container using `--tmpfs`:
 								```bash
 								podman run --rm -it --read-only --tmpfs /tmp:rw,size=64m registry.fedoraproject.org/fedora-minimal
 								```
Mounting a Podman volume as read-only is done by specifying the `ro` mount
option after the `:` separator, for example `--tmpfs /test:ro` or
`-v /host/path:/container/path:ro`.

On Kubernetes, the base filesystem of a container is set to read-only with the
`readOnlyRootFilesystem: true` attribute in the container security context. To
mount any volume as read-only, set the attribute `readOnly: true` in the volume
mount section.

Full Kubernetes example of a read-only base filesystem and an example volume:
 								```yaml
 								apiVersion: v1
 								kind: Pod
 								metadata:
 								  name: readonly-pod
 								spec:
 								  containers:
 								  - name: mycontainer
 								    image: registry.fedoraproject.org/fedora-minimal:latest
 								    command: ["sleep", "infinity"]
 								    securityContext:
 								      readOnlyRootFilesystem: true
 								    volumeMounts:
 								    - mountPath: /test
 								      readOnly: true
 								      name: tmpfs
 								  volumes:
 								  - name: tmpfs
 								    emptyDir:
 								      medium: Memory
 								      sizeLimit: 64Mi
 								```
 								### Additional Protection with nosuid, noexec, and nodev
To further enhance security, you can use the `nosuid`, `noexec`, and `nodev`
mount options for volumes. They can also be used for tmpfs mounts.
- nosuid: Ignores the set-user-ID and set-group-ID bits when executing programs from the mounted filesystem.
- noexec: Prevents the execution of any binaries on the mounted filesystem.
- nodev: Prevents the use of device files on the mounted filesystem.
 								Example using Podman:
 								```bash
 								❯ podman run --rm -it --read-only --tmpfs /test:nodev,nosuid,noexec registry.fedoraproject.org/fedora-minimal
 								bash-5.2# mount | grep /test
 								tmpfs on /test type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:container_file_t:s0:c240,c646",uid=1000,gid=1000,inode64)
 								```
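
The same check can be scripted, for example as a startup self-test. The sketch below reads a mounts table in the `/proc/self/mounts` format and verifies that a mount point carries the expected options. Both function names are hypothetical.

```bash
# Prints the mount options of a mount point, read from a mounts table in
# /proc/self/mounts format (fields: device, mount point, type, options, ...).
# If the mount point appears more than once, the last entry wins.
mount_options() {
    awk -v mp="$1" '$2 == mp { opts = $4 } END { print opts }' "${2:-/proc/self/mounts}"
}

# Succeeds only if every option in the comma-separated list $2 is present
# on mount point $1. An optional third argument names the mounts table.
mount_has_options() {
    have=",$(mount_options "$1" "${3:-/proc/self/mounts}"),"
    for want in $(echo "$2" | tr ',' ' '); do
        case "$have" in
            *",$want,"*) ;;
            *) echo "missing option on $1: $want" >&2; return 1 ;;
        esac
    done
}
```

Inside the container from the example above, `mount_has_options /test nosuid,nodev,noexec` should succeed, and `mount_has_options / ro` can confirm that the base filesystem is read-only.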
## 6. Resources limits

Setting resource limits for containers is required to ensure that no single
container can consume excessive resources, which could impact the performance
and stability of the entire system or of neighbouring systems.
 								### CPU
Since there is no virtualization, the CPU is visible inside a container with
all its cores and threads. CPU limiting is therefore done by limiting CPU time
in the scheduler. The unit of limitation is usually the vCPU. In Podman you
can set the limit with the `--cpus` flag. For example, `--cpus=2` limits the
container to 2/X of the total CPU time of the host, where X is the number of
hardware threads. On a CPU with 16 threads this means that the container can
use up to 12.5% of the total CPU power. The limit does not pin the container
to specific physical threads, so a high load in that container is still
balanced across all of them, just without exceeding the allowed share of CPU
time.
 								In case of Kubernetes this works the same, limits are specified per container:
 								```yaml
 								(....)
 								spec:
 								  containers:
 								  - name: app
 								    resources:
 								      limits:
 								        cpu: "2"
 								(....)
 								```
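
The arithmetic behind a CPU limit can be sketched in a few lines of shell. Container runtimes typically translate the value into a quota of CPU time per scheduling period. The sketch assumes the commonly used default period of 100000 µs, and the function names are hypothetical.

```bash
# Quota in microseconds of CPU time per period for a given number of CPUs.
# The period defaults to 100000 us (an assumption, see above).
cpu_quota_us() {
    awk -v cpus="$1" -v period="${2:-100000}" 'BEGIN { printf "%d\n", cpus * period }'
}

# Share of the whole host, in percent, that the limit allows.
cpu_share_percent() {
    awk -v cpus="$1" -v threads="$2" 'BEGIN { printf "%.1f\n", 100 * cpus / threads }'
}

cpu_quota_us 2          # 2 CPUs worth of time in every period
cpu_share_percent 2 16  # the 16-thread host from the text above
```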
 								### RAM
Limiting RAM for a container looks similar to CPU limiting, except that when
software inside the container tries to exceed the limit, it is handled more
brutally: the memory-hungry process is killed by the kernel's OOM killer. This
may not be intuitive for the application, because here again it sees all the
memory available on the host system and does not know about the limit (unless
configured accordingly).

Podman has a simple `--memory` flag which configures the limit. For example,
`--memory=512MiB` limits the container to 512 MiB.

Kubernetes works similarly:
 								```yaml
 								(....)
 								spec:
 								  containers:
 								  - name: app
 								    resources:
 								      limits:
 								        memory: "512Mi"
 								(....)
 								```
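
An application, or its startup script, can learn its own limit instead of trusting the host-wide numbers. The minimal sketch below assumes cgroup v2, where the limit of the current cgroup is exposed as `/sys/fs/cgroup/memory.max`; cgroup v1 uses a different path. The function name is hypothetical.

```bash
# Prints the memory limit in MiB, "unlimited" when no limit is set, or
# "unknown" when the file cannot be read (for example on cgroup v1).
# cgroup v2 stores either a byte count or the literal string "max".
memory_limit_mib() {
    limit_file="${1:-/sys/fs/cgroup/memory.max}"
    [ -r "$limit_file" ] || { echo "unknown"; return 1; }
    limit=$(cat "$limit_file")
    if [ "$limit" = "max" ]; then
        echo "unlimited"
    else
        echo "$((limit / 1024 / 1024))"
    fi
}

memory_limit_mib || true
```

The result can then be used to size caches or heap settings below the limit, so that the process is not killed unexpectedly.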
## 7. Network

For network isolation, Linux containers leverage network namespaces.
A network namespace is a feature provided by the Linux kernel that allows for
the creation of isolated, independent network stacks. Each network namespace
has its own separate set of network interfaces, routing tables, firewall
rules, and other network-related resources. This opens up complex
possibilities for network configuration, but it also leads to differences
between container engine implementations.

Additionally, rootless containers, which are considered safer, need to fall
back to different network components with reduced capabilities, because
managing the network requires root privileges.
 								### Desktop tools
Container engines suitable for desktop usage, like Podman, usually have
limited options for network configuration. They allow isolating pods from the
host and from each other with different network address pools, and even
disabling the network entirely, which is very safe but rarely practical.

For such tools, a few rules can increase security:
- Don't disable isolation. Isolation makes access harder for a remote attacker,
  even if they can reach any port on the container host machine.
- When opening ports to access the app from outside, bind to the least
  accessible interface/address that is still sufficient. For example, if you
  only expect to access the app locally, you can bind to localhost in Podman
  with the flag `-p 127.0.0.1:8080:8080`, which opens port 8080 only for
  localhost.
 								### Kubernetes
Kubernetes gives much greater possibilities for controlling both ingress and
egress. The primary tools for that are Network Policies, which are implemented
by network plugins (therefore they might not be available on some k8s
clusters). Network Policies allow very precise limitation of network traffic,
thanks to the following features:
- Using labels to select the pods to which the network policy
  applies. This allows you to target specific groups of pods based on their labels.
- Applying network policies across namespaces by selecting
  namespaces based on their labels.
- Defining rules based on specific protocols (TCP, UDP) and ports. Rules only
  allow traffic; once a policy selects a pod, traffic in that direction that is
  not explicitly allowed is denied.
- Support for arbitrary CIDR-formatted network address ranges.
 								Example network policy definition:
 								```yaml
 								apiVersion: networking.k8s.io/v1
 								kind: NetworkPolicy
 								metadata:
 								  name: example
 								spec:
 								  podSelector:
 								    matchLabels:
 								      app.kubernetes.io/name: app1
 								  policyTypes:
 								    - Ingress
 								    - Egress
 								  ingress:
 								    - from:
 								      - namespaceSelector:
 								          matchLabels:
 								            kubernetes.io/metadata.name: app
 								      ports:
 								      - protocol: TCP
 								        port: 123
 								      - protocol: TCP
 								        port: 456
 								    - from:
 								      - ipBlock:
 								          cidr: 10.43.0.0/16
 								      - ipBlock:
 								          cidr: fe80::8cb6:aff8:8dc9:f511/64
 								      ports:
 								      - protocol: TCP
 								        port: 443
 								  egress:
 								    - to:
 								      - namespaceSelector:
 								          matchLabels:
 								            kubernetes.io/metadata.name: kube-system
 								      ports:
 								      - protocol: UDP
 								        port: 53
 								    - to:
 								      - namespaceSelector:
 								          matchLabels:
 								            kubernetes.io/metadata.name: db
 								      ports:
 								      - protocol: TCP
 								        port: 5432
 								```
 								## 8. OCI Images
Containers technically don't require images, as the base filesystem can be
provided in a different way, but OCI images have become the industry standard.
Images are another important element of (in)security in containerization.
It is crucial to understand the basics of the format, because it can, for
example, leak secrets to the public if used incorrectly.
## 8.1 Building

It is obvious that one should not hardcode secrets into an image.
Unfortunately, fewer users are aware of how to avoid doing that. When building
an image, every instruction that modifies the filesystem of the image being
built is saved separately as a layer. By default each layer is kept in the
image, even when all the contents of some of those layers were removed in the
end.
 								Example of **insecure** Containerfile:
 								```Dockerfile
 								FROM registry.fedoraproject.org/fedora-minimal
 								# Copy secret into the image (bad practice)
 								COPY secret.txt ./secret.txt
 								# Use and delete secret (but it's still in a previous layer)
 								RUN cat secret.txt && rm secret.txt
 								```
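
Why the deleted file survives can be simulated without any container tooling. Image layers are tar archives, and a deletion in a later layer is recorded only as a whiteout marker (`.wh.<name>` in OCI image layers), so the earlier archive still holds the file. Below is a minimal sketch with two hand-made "layers"; the directory layout and the secret value are made up for illustration.

```bash
# Simulate two image layers as plain tar archives (hypothetical demo layout).
demo=$(mktemp -d)
mkdir -p "$demo/layer1" "$demo/layer2"

# Layer 1: result of `COPY secret.txt ./secret.txt`
printf 's3cr3t\n' > "$demo/layer1/secret.txt"
tar -C "$demo/layer1" -cf "$demo/layer1.tar" .

# Layer 2: result of `RUN rm secret.txt`, stored only as a whiteout marker
: > "$demo/layer2/.wh.secret.txt"
tar -C "$demo/layer2" -cf "$demo/layer2.tar" .

# The merged view hides the file, but the first archive still contains it.
leaked=$(tar -xOf "$demo/layer1.tar" ./secret.txt)
echo "recovered from layer 1: $leaked"
```

On a real image, the archive written by `podman save` contains the layer tarballs, which can be inspected in the same way.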
There is a way to modify the filesystem of the image being built in a much
more secure manner, which also brings other benefits. It is called a
multi-stage build and, as the name suggests, it contains multiple stages, of
which only the layers of the last one are saved in the resulting image.
The Containerfile can look like this:
 								```dockerfile
 								# Stage 1: Use secret during the build
 								FROM registry.fedoraproject.org/fedora-minimal AS builder
 								WORKDIR /app
 								# Copy application files
 								COPY app/ /app/
 								# Copy the secret into the build stage
 								COPY secret.txt /app/secret.txt
 								# Use the secret securely (e.g., configure app)
 								RUN cat /app/secret.txt && echo "Configuring app with secret" > config.txt
 								# Removing the secret in this example is needed, because in the next stage
 								# the /app dir will be copied as a whole
 								RUN rm /app/secret.txt
 								# Stage 2: Final image without secrets
 								FROM registry.fedoraproject.org/fedora-minimal
 								# Nothing is saved from previous stage
 								WORKDIR /app
 								# Copy only the necessary files from the builder stage
 								COPY --from=builder /app/ /app/
 								```
This approach also helps keep images minimal, without any other leftovers,
which can also improve security.
## 8.2 Scanning

Images can be scanned for vulnerabilities. This is useful for any type and
source of images, since vulnerabilities appear even in the most basic
components like language interpreters, libc libraries, etc. There are tools
for manual scanning like [trivy](https://github.com/aquasecurity/trivy), and
some registries like [Harbor](https://goharbor.io/) have built-in optional
automatic vulnerability scanning for any stored image.

These tools can provide a descriptive analysis of image contents, taking into
account the versions of most software stored inside (if supported).

Example fragment of the output of trivy scanning a Python image:
 								![trivy](./trivy.jpg)
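
A scan can also gate a build pipeline. The sketch below wraps `trivy image` so that findings of HIGH or CRITICAL severity produce a non-zero exit status. The wrapper name `scan_image` is hypothetical, and the image name in the comment is only an example.

```bash
# Hypothetical CI helper: fail when an image has HIGH or CRITICAL findings.
# Returns 127 if trivy is not installed.
scan_image() {
    image="$1"
    if ! command -v trivy >/dev/null 2>&1; then
        echo "trivy is not installed" >&2
        return 127
    fi
    trivy image --severity HIGH,CRITICAL --exit-code 1 "$image"
}

# Example call (not executed here):
#   scan_image docker.io/library/python:3.12 || exit 1
```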
## 9. Selinux
 								SELinux (Security-Enhanced Linux) is a security module for Linux that enforces
 								mandatory access control (MAC) policies to restrict the actions of users and
 								applications based on predefined rules, enhancing system security. SELinux
 								works by labeling all files, processes, and resources on a system with security
 								contexts. Policies define rules about how these labels can interact. When an
 								action is attempted, SELinux checks the labels against the policies and either
 								allows or denies the action based on the rules, enforcing least-privilege access.
This document is too short to explain in detail how SELinux works, but
for container management the most important concepts are MCS
(Multi-Category Security) and MLS (Multi-Level Security), described in the
Red Hat docs: [link](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html-single/using_selinux/index#multi-level-security-mls_using-multi-level-security-mls)

SELinux additionally secures the containerized program by not allowing it to
access resources outside of the container. Container engines like Podman
randomize categories by default, so for example two different containers
cannot access the same volume.

Proof of category randomization, obtained by running subsequent containers and
checking their SELinux context:
 								```bash
 								❯ podman run --rm -it fedora-minimal cat /proc/self/attr/current
 								system_u:system_r:container_t:s0:c340,c364
 								~
 								❯ podman run --rm -it fedora-minimal cat /proc/self/attr/current
 								system_u:system_r:container_t:s0:c202,c993
 								~
 								❯ podman run --rm -it fedora-minimal cat /proc/self/attr/current
 								system_u:system_r:container_t:s0:c259,c971
 								```
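
The category pair is the last field of the context, so comparing two containers comes down to simple string handling. Here is a minimal sketch using two of the contexts printed above; the helper name `mcs_categories` is hypothetical.

```bash
# Extract the MCS categories (e.g. "c340,c364") from an SELinux context of
# the form user:role:type:sensitivity:categories.
mcs_categories() {
    echo "$1" | cut -d: -f5
}

first=$(mcs_categories "system_u:system_r:container_t:s0:c340,c364")
second=$(mcs_categories "system_u:system_r:container_t:s0:c202,c993")

if [ "$first" != "$second" ]; then
    echo "different categories: $first vs $second"
fi
```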