(This article was written for the MIT 6.858 Computer Systems Security class to supplement lecture content, but is not intended to be a replacement for attending lectures. The 2020 lecture video can be found here.)
What comes to mind when you hear the buzzword “containerization”? Perhaps you have heard of software packages such as Virtuozzo, OpenVZ and Docker (in fact, Docker used lxc in its early days before dropping it in favor of its own libcontainer).
The word “container” is defined pretty loosely – is it a process? Is it a virtual machine? Is it a Docker container? What is an image?
This article aims to demystify Linux containers – specifically lxc – and give a practical introduction to them.
Introduction
Containerization is best defined as a process isolation mechanism enabled through features of the operating system. Hence,
A container is a collection of one or more processes that are isolated from the rest of the system.
The concept of Linux containers is not novel – lxc has existed for more than a decade in Linux, and other UNIX-like operating systems have long had their own implementations: FreeBSD Jails, Solaris Containers, AIX Workload Partitions, etc. Containers were conceived with software portability in mind. When packaging software for staging and production, problems often arise from differences between operating environments: a different version of a required shared library, a different network topology, or even different underlying storage. By packaging an entire runtime environment (the application and all its dependencies, configuration, etc.) into a single container, we abstract away the differences across environments.
Containers != VMs
A common misconception is that Linux containers are virtual machines. This is not entirely true. lxc achieves containerization through the use of the following Linux features to abstract the operating system away (and to limit container-to-host privileges):
- Control Groups (cgroups)
- Capabilities
- seccomp
- Mandatory Access Control (via AppArmor, SELinux)
- Namespaces
- chroot jails
In contrast, VMs run on hypervisors which abstract the hardware. The following table shows the main differences between Linux containers and VMs:
| | Virtual Machine | Linux LXC Container |
|---|---|---|
| Operating System | Runs an isolated guest OS inside the VM | Runs on the same OS as the host (shared kernel) |
| Networking | via virtual network devices | via an isolated view of a virtual network adapter (through namespaces) |
| Isolation | Complete isolation from other guest OSes and the host OS | Namespace- and cgroup-based isolation from the host OS and other guest lxc containers |
| Size | Usually on the order of gigabytes | Usually on the order of megabytes |
| Startup time | On the order of seconds to minutes, depending on storage media | On the order of seconds |
Use-Cases
Before we dive into the inner workings of lxc, let us consider some scenarios in which containerization would be a viable solution:
- Stronger privilege segregation in a microservice architecture on a single host (e.g. zookd in lab 2)
- Improved blast radius containment in the event of a security compromise
- More effective resource utilization in isolation (compared to hardware-assisted virtualization)
- Ease of software deployment (the purpose for which containers were first developed)
- Increasing the velocity of application delivery and operational efficiency (e.g. through the use of a DevSecOps framework)
Four main factors compel the use of containers in modern environments:
- Need for stronger privilege segregation between processes on a host
- Need for blast radius containment in the event of a security compromise
- Need for speed (performance) on limited hardware, or need for greater resource utilization efficiency (over VMs)
- Software portability – the ease of packaging and deployment (which increases software development agility and operational consistency)
(In all of the fictitious use-case scenarios discussed in lecture, the attack surface was large and contiguous: exploiting some vulnerability in a single component gave access to multiple other components in the system. In all of these cases, the use of containerization would help apply the principles of least privilege and defense-in-depth to the system. For example, Bob the journalist could exclusively use sandboxed applications for his work. In the event that one application is compromised, other applications cannot be accessed by the threat actor because they exist in different PID namespaces, to name just one of the protections. In modern browsers like Firefox, tabs are isolated into sandboxed processes so that it is more difficult for threat actors to break out of a single tab into the parent process or the host OS.)
chroot
By default, the OS root directory is /, and processes see it as the root at which all absolute file paths start. This “view” can be changed by invoking the chroot() system call, giving us a separate, isolated environment to run in. chroot changes the apparent root directory for the current running process and its children.
However, chroot alone does not provide strong isolation. It may seem that preventing access to parent directories is sufficient, but chroot simply modifies pathname lookups for a process and its children, prepending the chroot-ed directory path to any absolute path (paths starting with /). Among other escape routes, a process can still reach the parent of the chroot-ed directory if it holds an open file descriptor or other handle that points outside of the chroot jail – so this alone is not strong isolation.
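As a quick illustration (a sketch: the ~/jail directory and the handful of binaries and libraries copied into it are hypothetical and must exist for this to work), chroot-ing into a directory changes what the spawned shell sees as /:

rayden@uwuntu:~$ sudo chroot ~/jail /bin/sh
# ls /
bin  lib  lib64
# cd .. && ls          # the parent of the new root is the new root itself
bin  lib  lib64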
Capabilities
The root superuser used to be all-powerful, capable of performing any action in the OS. Traditional UNIX discretionary access control divides the world into two classes: root (superuser/privileged) and everyone else (unprivileged). Suppose a system user needs to spawn a server process that requires some root privileges, and suppose the server code has a remote code execution vulnerability. Should the vulnerable server process be compromised, the entire system is compromised (since the process runs with UID 0). Is there a way to give a process only the privileges it needs (least privilege)?
In a bid to shard the privileges usually afforded wholly to root, Linux capabilities were introduced into the Linux kernel starting with version 2.2. Each capability represents a distinct unit of privilege and is prefixed by CAP_. Some capabilities include:
- CAP_CHOWN – the capability to change user and group ownership of files
- CAP_NET_ADMIN – the capability to perform network-related administration on the system
- CAP_NET_RAW – the capability to create RAW and PACKET sockets, and to bind to arbitrary addresses
- CAP_SYS_ADMIN – the capability to do so many things that many regard it as the new root; it definitely needs further privilege sharding in the future
The commands getcap and setcap exist to get/set capabilities on a file. Let us take a look at the ping utility, which needs to create a RAW socket to send out ICMP packets:
rayden@uwuntu:~$ ls -al /bin/ping
-rwxr-xr-x 1 root root 72776 Jan 30 15:11 /bin/ping
It is owned by root:root
, but it is readable and executable by any user. If we try and ping google.com
we can verify that the UID is that of the current user (since there is no setuid bit set):
USER       PID %CPU %MEM   VSZ TTY    COMMAND
rayden    3220  0.0  0.0 18464 pts/0  /bin/ping google.com
The unprivileged user here is able to ping because of a capability set on the /bin/ping
binary:
rayden@uwuntu:~$ getcap /bin/ping
/bin/ping = cap_net_raw+ep
Here, two flags are set: Effective (E) and Permitted (P). There are 3 capability flags one may set:
- Effective: whether the capability is active
- Inheritable: whether the capability is inherited by child processes
- Permitted: whether the capability is permitted, regardless of parent’s capability set
What happens if we clear that capability from ping?
rayden@uwuntu:~$ cp /bin/ping .
rayden@uwuntu:~$ getcap ./ping
rayden@uwuntu:~$ ./ping google.com
ping: socket: Operation not permitted
rayden@uwuntu:~$ sudo setcap cap_net_raw=ep ./ping
rayden@uwuntu:~$ ./ping google.com
PING google.com (172.217.11.14) 56(84) bytes of data.
...
By copying the ping binary to a new destination, extended attributes such as capabilities (and mode bits like setuid) are not preserved. Without the cap_net_raw capability, the spawned ping process is unable to open a RAW socket. Once we give that capability back, ping functions normally again.
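A process can also drop capabilities programmatically. Here is a minimal sketch using libcap (assuming the libcap development package is installed; compile with -lcap) that clears CAP_NET_RAW from the process's effective set, after which opening a RAW socket fails just like the stripped ping copy above:

#include <stdio.h>
#include <stdlib.h>
#include <sys/capability.h>   /* libcap; link with -lcap */

int main(void) {
    cap_t caps = cap_get_proc();              /* current capability sets */
    cap_value_t drop[] = { CAP_NET_RAW };

    /* Clear CAP_NET_RAW from the effective set only */
    if (cap_set_flag(caps, CAP_EFFECTIVE, 1, drop, CAP_CLEAR) == -1 ||
        cap_set_proc(caps) == -1) {
        perror("dropping CAP_NET_RAW");
        exit(1);
    }
    cap_free(caps);

    /* From here on, attempts to open a RAW socket fail with EPERM */
    return 0;
}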
Capabilities seem like a good idea, but CAP_SYS_ADMIN still carries too many privileges; capabilities are just one of several mechanisms lxc uses to enforce stronger isolation.
Control Groups (cgroups)
Control groups (cgroups) enable limiting system resource utilization for user-defined groups of processes. Suppose you are running a very intensive data analysis routine that uses so much compute and memory that your system becomes unresponsive. cgroups is a kernel feature that lets you define a group of processes running the analysis job and limit, account for and isolate the resources allocated to them – so that you can multitask while the analysis job runs with limited resources. In particular, the cgroup feature enables:
- Limits: maximum limits can be specified on processor usage, memory usage, device usage, etc.
- Accounting: resource usage is monitored.
- Prioritization: resource usage can be prioritized over other cgroups.
- Control: the state of processes can be controlled (e.g. stop, restart, suspend).
A cgroup is a set of one or more processes bound to the same set of limits defined for that cgroup. A cgroup can also inherit the properties of another cgroup in a hierarchical manner.
cgroups is generally available in most modern releases of Linux distros, and most define about 10 subsystems (also known as controllers). From the Red Hat Enterprise Linux documentation:
- blkio — this subsystem sets limits on input/output access to and from block devices such as physical drives (disk, solid state, or USB).
- cpu — this subsystem uses the scheduler to provide cgroup tasks access to the CPU.
- cpuacct — this subsystem generates automatic reports on CPU resources used by tasks in a cgroup.
- cpuset — this subsystem assigns individual CPUs (on a multicore system) and memory nodes to tasks in a cgroup.
- devices — this subsystem allows or denies access to devices by tasks in a cgroup.
- freezer — this subsystem suspends or resumes tasks in a cgroup.
- memory — this subsystem sets limits on memory use by tasks in a cgroup and generates automatic reports on memory resources used by those tasks.
- net_cls — this subsystem tags network packets with a class identifier (classid) that allows the Linux traffic controller (tc) to identify packets originating from a particular cgroup task.
- net_prio — this subsystem provides a way to dynamically set the priority of network traffic per network interface.
- ns — the namespace subsystem.
- perf_event — this subsystem identifies cgroup membership of tasks and can be used for performance analysis.
The cgroup-tools and libcgroup1 packages are needed to administer cgroups from the command line; on Ubuntu they can be installed via:
$ sudo apt install cgroup-tools libcgroup1
To demonstrate how cgroups limit resources, let us look at the memory subsystem. Suppose we had a memory-intensive process called memes that we wish to run on a workstation. We can use cgroups to limit its memory usage by creating a cgroup called memegroup in the memory subsystem (using cgcreate), setting its limit (using cgset) and executing the process under that cgroup (using cgexec):
rayden@uwuntu:~$ sudo cgcreate -g memory:memegroup
rayden@uwuntu:~$ sudo cgset -r memory.limit_in_bytes=1500K memegroup
rayden@uwuntu:~$ cgget -r memory.limit_in_bytes memegroup
memegroup:
memory.limit_in_bytes: 1536000
rayden@uwuntu:~$ cat /sys/fs/cgroup/memory/memegroup/memory.limit_in_bytes
1536000
rayden@uwuntu:~$ sudo cgexec -g memory:memegroup ./memes
...
The cgcreate command creates the group’s directory under the cgroup filesystem (mounted under /sys/fs/cgroup, where it can also be manipulated directly from the command line), and cgset sets the values appropriately. Notice that the 1500K we asked for is interpreted as 1500 KiB, i.e. 1,536,000 bytes; the kernel also keeps memory limits aligned to the 4096 B page size. Finally, we execute memes in the memegroup cgroup under the memory subsystem.
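(For reference, cgcreate, cgset and cgexec are thin wrappers around this filesystem interface. A rough manual equivalent, assuming cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory, would be:)

rayden@uwuntu:~$ sudo mkdir /sys/fs/cgroup/memory/memegroup
rayden@uwuntu:~$ echo 1536000 | sudo tee /sys/fs/cgroup/memory/memegroup/memory.limit_in_bytes
rayden@uwuntu:~$ echo $$ | sudo tee /sys/fs/cgroup/memory/memegroup/cgroup.procs    # move this shell into the cgroup
rayden@uwuntu:~$ ./memes                                                            # now runs under the ~1.5 MB limit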
After memes has been running for a while, you’ll see a message on the terminal saying that the process has been killed (literally just ‘Killed’).
rayden@uwuntu:~$ cat /sys/fs/cgroup/memory/memegroup/memory.oom_control
oom_kill_disable 0
under_oom 0
oom_kill 1
We see that oom_kill has been set to 1, which means that the Kernel Out-Of-Memory Killer (OOM Killer) has terminated the processes in the memegroup cgroup.
A simple memory-intensive C program that would be killed in the example above is:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main() {
    char *ptr;
    while (1) {
        /* Allocate one page per second and touch it with memset so the
           memory becomes resident and counts against the cgroup limit. */
        ptr = (char *)malloc(4096);
        memset(ptr, 0, 4096);
        sleep(1);
    }
    return 0;
}
That’s an example of how limits are enforced and process control is done via a cgroup. Most subsystems have accounting features such as memory.usage_in_bytes, cpuacct.usage_sys, etc. An example of prioritization would be cpu.shares (the relative share of CPU time allotted to the tasks in a cgroup).
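As a small, hypothetical sketch of prioritization, two cgroups could be given different cpu.shares weights; under CPU contention, tasks in high would receive roughly four times the processor time of tasks in low (./batch_job stands in for any CPU-bound program):

rayden@uwuntu:~$ sudo cgcreate -g cpu:low -g cpu:high
rayden@uwuntu:~$ sudo cgset -r cpu.shares=256 low
rayden@uwuntu:~$ sudo cgset -r cpu.shares=1024 high
rayden@uwuntu:~$ sudo cgexec -g cpu:low ./batch_job &
rayden@uwuntu:~$ sudo cgexec -g cpu:high ./batch_job &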
Namespaces
A namespace is an abstract object that encapsulates resources so that processes inside the namespace can only see the resources belonging to that namespace. For example, Linux processes form a single process tree rooted at init (PID 1). Typically, privileged processes in this tree can trace or kill other processes. With the introduction of the PID namespace, we can have multiple disjoint process trees (which do not know of processes in other namespaces). If we create a new PID namespace and run a process in it, that first process becomes PID 1 in that namespace. The process that creates the namespace remains in the parent namespace, but its child becomes the root of the new process tree.
The Linux kernel defines 7 namespaces:
- PID – isolates processes
- Network – isolates networking
- User – isolates User/Group IDs
- UTS – isolates hostname and fully-qualified domain name (FQDN)
- Mount – isolates mountpoints
- cgroup – isolates the cgroup sysfs root directory
- IPC – isolates IPC/message queues
You can see the namespaces defined on your system via the procfs:
rayden@uwuntu:~$ sudo ls /proc/1/ns
cgroup  ipc  mnt  net  pid  pid_for_children  user  uts
This level of isolation is useful in containerization. Without namespaces, a process running in a container may be able to change the hostname of another container, unmount a file system, remove a network interface, change limits, etc. By using namespaces to encapsulate these resources, the processes in a container X are unaware of the resources in another container Y.
With the introduction of namespaces, the Linux kernel exposes three system calls for working with them:
- clone() – creates a new process in the specified namespaces. If a CLONE_NEW* flag (e.g. CLONE_NEWPID) is passed, a new namespace of that type is created for the child (see the sketch after this list).
- setns() – allows a process to join an existing namespace. The namespace is specified by a file descriptor referencing an entry in the procfs, like so:

rayden@uwuntu:~$ sudo ls -al /proc/1/ns
total 0
dr-x--x--x 2 root root 0 Jul 22 00:46 .
dr-xr-xr-x 9 root root 0 Jul 21 22:08 ..
lrwxrwxrwx 1 root root 0 Jul 22 00:46 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Jul 22 00:46 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 Jul 22 00:46 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 root root 0 Jul 22 00:46 net -> 'net:[4026531992]'
lrwxrwxrwx 1 root root 0 Jul 22 00:46 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Jul 22 00:47 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Jul 22 00:46 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Jul 22 00:46 uts -> 'uts:[4026531838]'

- unshare() – moves the calling process to a new namespace.
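To make this concrete, here is a minimal sketch of clone() with namespace flags (error handling kept short; it must be run as root, and the hostname "container" is just an illustrative value). The child starts in new UTS and PID namespaces, so it sees itself as PID 1 and its hostname change never reaches the parent namespace:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1024 * 1024];       /* stack for the cloned child */

static int child_fn(void *arg) {
    /* We are now PID 1 in a fresh PID namespace, with our own UTS namespace. */
    sethostname("container", 9);            /* invisible to the parent namespace */
    printf("child sees itself as PID %d\n", getpid());   /* prints 1 */
    return 0;
}

int main(void) {
    pid_t pid = clone(child_fn, child_stack + sizeof(child_stack),
                      CLONE_NEWUTS | CLONE_NEWPID | SIGCHLD, NULL);
    if (pid == -1) { perror("clone"); exit(1); }
    waitpid(pid, NULL, 0);
    return 0;
}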
For example, we can create a new bash shell in a new UTS namespace through the unshare command:
rayden@uwuntu:~$ hostname
uwuntu
rayden@uwuntu:~$ sudo unshare -u /bin/bash
root@uwuntu:/home/rayden# hostname lmao
root@uwuntu:/home/rayden# hostname
lmao
root@uwuntu:/home/rayden# exit
exit
rayden@uwuntu:~$ hostname
uwuntu
Notice that the hostname remains unchanged in the parent shell. The same thing can be done for process IDs:
rayden@uwuntu:~$ sudo unshare --fork --pid --mount-proc /bin/bash
root@uwuntu:/home/rayden# pidof /bin/bash
1
The PID of the forked process is 1, but if you look at the output of ps aux on the parent shell, we see PID 6499:
...
root      6499  0.0  0.0  16712   580 pts/3   S    01:23   0:00 unshare --fork --pid --mount-proc /bin/bash
...
Bash sees itself as PID 1 only because its scope is confined to its own PID namespace. Once you have forked a process into its own PID namespace, it and its descendants are numbered within that namespace, starting from PID 1 for the first process.
Namespaces are the foundation of containerization. Understanding the abstract concept of namespaces and how they encapsulate resources in an environment can help you understand how and why containerized applications behave the way they do. For instance, a container running a web server is unaware that it is running in a container – it knows that it has access to system calls and resources it needs, but it has its own view of things like the hostname, the process tree, the user, etc. (There are ways to detect if a process is in a container, but that is out of the scope of this discussion.)
Furthermore, a malicious process spawned from the web server cannot affect any other process on your system, because as far as any process in that PID namespace knows, the process tree is rooted at PID 1, which is the init process of the container it’s running in (or, in some cases, the application process itself).
There is a namespace subsystem defined by cgroups (in that you can control resources by their namespace), but be careful not to confuse the two: cgroups limit resource utilization, while namespaces limit the resource view (what a process may see on the system).
seccomp
There are cases where isolation via chroot, capabilities, cgroups and namespaces is not enough. Suppose some web server running in a container was compromised, and a remote shell was spawned by the attacker. The same set of system calls is invocable by the host process and the container, and there could exist some sequence of calls that makes a container escape possible. (In fact, there are a number of container-escape exploits: False Boundaries and Arbitrary Code Execution, Container escape through open_by_handle_at, Abusing Privileged and Unprivileged Linux Containers, to name a few.)
seccomp protects against the threat of damage by a malicious process via syscalls, by limiting the set of syscalls a process is allowed to invoke. Modern browsers such as Chrome and Firefox use seccomp to clamp down tighter on their applications. Many container-escape exploits can be easily blocked by limiting the syscall interface to only the syscalls required for the containerized application to carry out its function.
In lxc, seccomp filters can be specified through the container configuration file (~/.local/share/lxc/<container_name>/config):

lxc.seccomp = /usr/share/lxc/config/common.seccomp

where /usr/share/lxc/config/common.seccomp is the default policy shipped with lxc, listing the system calls to disallow.
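Outside of lxc, an application can also install its own filter through libseccomp. Below is a minimal sketch (assuming the libseccomp development package is installed; compile with -lseccomp) that allows every syscall except open_by_handle_at, the call abused in one of the container-escape exploits above:

#include <errno.h>
#include <stdio.h>
#include <seccomp.h>            /* libseccomp; link with -lseccomp */

int main(void) {
    /* Default action: allow every system call... */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
    if (ctx == NULL) { fprintf(stderr, "seccomp_init failed\n"); return 1; }

    /* ...except open_by_handle_at, which will fail with EPERM. */
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(open_by_handle_at), 0);

    /* Load the filter for this process and everything it spawns. */
    if (seccomp_load(ctx) != 0) { fprintf(stderr, "seccomp_load failed\n"); return 1; }
    seccomp_release(ctx);

    /* From here on, any open_by_handle_at() call returns EPERM. */
    return 0;
}

(A tighter policy would start from a deny-by-default action and allow only the syscalls the application actually needs.)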
Mandatory Access Control (MAC)
Suppose that, in the web server toy example from the previous section, the attacker managed to escape the container despite isolation via chroot, capabilities, cgroups, namespaces and seccomp. What happens now? If the container is privileged (run using UID 0), it’s pretty much GG. If it is unprivileged, the attacker could still try to escalate privileges, or do plenty of damage if the user the container runs as is privileged enough.
In this situation, only discretionary access control (DAC) (via UNIX permissions) stands between the attacker and a fully compromised system. In good old Defense-in-Depth fashion, we layer another control to mitigate this risk: Mandatory Access Control (MAC).
MAC is a centralized authorization mechanism that operates on the philosophy that information belongs to an organization (and not the individual members). A security policy is defined and kept in the kernel, which authorizes accesses based on the defined policy. Modern MAC implementations such as SELinux are a combination of Role-based Access Control (RBAC) and two concepts:
- Type Enforcement (TE)
- Multilevel Security (MLS)
Type Enforcement (TE)
TE introduces type labels on every file system object and process, and is a prerequisite for MAC. Objects are labeled with a type, and a policy defined in the kernel specifies which types are allowed to access (or transition to) which other types. The kernel then checks the policy every time a labeled object is accessed; if no rule permits the specific access, it is denied by default. For example, in Security-Enhanced Linux (SELinux) a process confined to the standard web-content label httpd_sys_content_t (applied to content served by Apache) is not allowed to access files labeled with bin_t, the label applied to binaries in /usr/bin.
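For a feel of what such a policy looks like, SELinux TE rules take roughly the following form (a simplified sketch for illustration, not the actual shipped policy):

# let processes in the Apache domain read files labeled as web content
allow httpd_t httpd_sys_content_t:file { read getattr open };

# no rule grants access from the web-content side to bin_t binaries,
# so any such access is denied by default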
Suppose that, in the same web server toy example, the attacker managed to get root access. Under DAC, any user has discretionary control over anything owned by that user, so the attacker has full control. If TE were implemented via SELinux, the attacker is severely impeded: the web server exploit yields a process with UID 0, but the exploit inherits the type label of the exploited web page (httpd_sys_content_t), which only allows access to other file system objects of the same label (the rationale being that content served by a web server should only need to access other web content, and nothing else).
Similarly, if we enable a MAC mechanism like SELinux on our container host, all containers will be labeled with the default lxc_t type label (defined by the default SELinux TE policy for lxc). Any malicious process that bypasses the other isolation mechanisms will still be confined by the TE policy. More information on what type transitions and accesses are allowed by default can be seen directly from the .te file here.
Multilevel Security (MLS)
(MLS is out of the scope of this article, and is only treated briefly).
Few systems are configured with MLS, outside of government or military systems. In a military environment, files are labeled with a sensitivity level (e.g. Unclassified, Confidential, Secret, Top Secret). However, sensitivity levels alone are insufficient to classify files, because they do not respect the principle of least privilege (need-to-know). Hence, the US military compartmentalizes the most secretive information (known as Top Secret/Sensitive Compartmented Information, or TS/SCI). Every information asset belongs to a set of compartments, which could be categories such as cyber, nuclear, biological, blackops, etc. An information asset in the compartments [cyber, biological] may only be accessed by principals (persons) that have clearance to see information in BOTH of those compartments. MLS is formalized through the Bell-LaPadula (BLP) model of 1973:
Given the ordered set of all sensitivity levels [latex]S[/latex] and the set of all compartments [latex]C[/latex], a label is a pair [latex]l = (s, c)[/latex] with [latex]s \in S[/latex] and [latex]c \subseteq C[/latex]. Two labels [latex]l_1 = (s_1, c_1)[/latex] and [latex]l_2 = (s_2, c_2)[/latex] satisfy [latex]l_1 \leq l_2[/latex] (in that [latex]l_1[/latex] is no more restrictive than [latex]l_2[/latex]) when [latex]s_1 \leq s_2[/latex] and [latex]c_1 \subseteq c_2[/latex].
Let [latex]P[/latex] denote a principal, and let [latex]L(X)[/latex] denote the label of an entity [latex]X[/latex] (a principal or an asset [latex]A[/latex]). The BLP model then specifies two security conditions:
- Principals are not allowed to “read up”, i.e. [latex]P[/latex] may only read some asset [latex]A[/latex] if [latex]L(A)\leq L(P)[/latex], and
- Principals are not allowed to “write down”, i.e. [latex]P[/latex] may only write to some asset [latex]A[/latex] if [latex]L(P)\leq L(A)[/latex].
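As a concrete (hypothetical) example of the dominance relation: a principal with label [latex]l_P = (\text{Secret}, \{\text{cyber}, \text{nuclear}\})[/latex] may read an asset labeled [latex]l_A = (\text{Confidential}, \{\text{cyber}\})[/latex] since [latex]l_A \leq l_P[/latex], but may neither read an asset labeled [latex](\text{Top Secret}, \{\text{cyber}\})[/latex] (that would be reading up) nor write to [latex]l_A[/latex] (that would be writing down).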
The first security condition of the BLP model guarantees that a principal can never directly read an information asset for which it is not cleared. Together, the two conditions also guarantee that information from a higher-labeled asset [latex]A[/latex] can never leak into a lower-labeled asset [latex]A^\prime[/latex]: if some principal [latex]P[/latex] reads [latex]A[/latex] and then writes to [latex]A^\prime[/latex], the conditions force [latex]L(A)\leq L(P) \leq L(A^\prime)[/latex], so information only ever flows to equally or more restrictively labeled assets.
The BLP model is not perfect, and that is why real-world systems combine different access control mechanisms. Some problems with the BLP model are:
- Only confidentiality is considered, and not integrity (in the event that principals write up to an asset of a higher label)
- The security level of a principal is assumed to be static, when in reality it could change mid-operation.
- By the second security condition, any principal [latex]P[/latex] cannot write down, and privileges have to be stripped to a minimal set (which may not be a problem since Least Privilege is observed here)
By default, SELinux only carries out TE using the default targeted policy. MLS can be enabled by switching to the mls policy in /etc/selinux/config. An operation is allowed only if the MAC policy, the DAC permissions and, where applicable, RBAC all permit it.
SELinux is installed by default on Red Hat-based distributions such as Fedora and CentOS. On Debian-based systems, the MAC implementation used is AppArmor.
Demo: working with lxc
With that in mind, let us go through a short demo on how to work with lxc. Throughout this section we will be using an Ubuntu 19.10 Eoan amd64 virtual machine in VMware Workstation on Windows 10. You may use your own choice of hypervisor (kvm, VirtualBox, etc.) and host operating system – it should not affect your ability to follow the steps listed below.
lxc can simply be installed through your favorite package manager. On Ubuntu:
rayden@uwuntu:~$ sudo apt install lxc
[sudo] password for rayden:
Reading package lists… Done
Building dependency tree
Reading state information… Done
The following additional packages will be installed:
  bridge-utils liblxc-common liblxc1 libpam-cgfs lxc-utils lxcfs uidmap
Suggested packages:
  ifupdown btrfs-tools lvm2 lxc-templates lxctl
The following NEW packages will be installed:
  bridge-utils liblxc-common liblxc1 libpam-cgfs lxc lxc-utils lxcfs uidmap
[output truncated]
Subordinate UID/GID ranges
We ensure that the current user is allowed to have subordinate uids and gids by making sure that the following files are defined:
rayden@uwuntu:~$ cat /etc/subuid
rayden:100000:65536
rayden@uwuntu:~$ cat /etc/subgid
rayden:100000:65536
which allows the user rayden to have 65536 subordinate uids/gids starting at 100000. We also need to create the user config directory for lxc if it does not exist, and create the default configuration file:
$ mkdir -p ~/.config/lxc
$ touch ~/.config/lxc/default.conf
The ~/.config/lxc/default.conf file should be modified so that it looks like this (with the correct id_map values):
lxc.include = /etc/lxc/default.conf
lxc.id_map = u 0 100000 65536
lxc.id_map = g 0 100000 65536
Virtual network interfaces
When installing lxc, a default bridge should have been created for you: lxcbr0. You can verify that the bridge exists via the command:
rayden@uwuntu:~$ brctl show
bridge name     bridge id               STP enabled     interfaces
lxcbr0          8000.00163e000000       no
Ensure that the /etc/lxc/lxc-usernet file is defined with:
# user   type  bridge  max_interfaces_by_user
rayden   veth  lxcbr0  10
This tells lxc how many virtual network interfaces the user rayden (or a group; one entry per line) may attach to the specified bridge.
The quickest way to effect the changes would be to restart the node or to log out and back in. This restarts dbus, sets up the cgroups properly and turns user namespaces on (kernel.unprivileged_userns_clone=1).
Verify that the veth (virtual Ethernet) networking module is loaded via:
rayden@uwuntu:~$ lsmod | grep veth
veth                   28672  0
If the veth module is not loaded, load it with sudo modprobe veth and make it persist after a reboot:
rayden@uwuntu:~$ echo veth | sudo tee -a /etc/modules
veth
Creating a container
To create a container, simply run lxc-create:
rayden@uwuntu:~$ lxc-create -t download -n example
Setting up the GPG keyring
Downloading the image index
DIST    RELEASE  ARCH   VARIANT  BUILD
alpine  3.10     amd64  default  20200714_13:00
alpine  3.10     arm64  default  20200714_13:00
alpine  3.10     armhf  default  20200714_13:00
[output truncated]
This is an interactive command that creates a container with the name example, using the download template. There are 4 default templates specified by the lxc install, which are basically scripts in /usr/share/lxc/templates/:
- download – downloads pre-built images and unpacks them
- local – consumes local images that were built with the distrobuilder build-lxc command
- busybox – common UNIX utilities contained in a single executable
- oci – creates an application container from images in the Open Containers Image (OCI) format
The download template prompts for your choice of distribution/release from a given list as the base image for your container, which is what we will be using to create our example container. We can also specify the desired image directly on the command line, e.g. for an Ubuntu 19.10 (Eoan) amd64 image (note the double dash -- after the name):
rayden@uwuntu:~$ lxc-create -t download -n example -- --dist ubuntu --release eoan --arch amd64
Setting up the GPG keyring
Downloading the image index
Downloading the rootfs
Downloading the metadata
The image cache is now ready
Unpacking the rootfs

You just created an Ubuntu eoan amd64 (20200714_07:42) container.

To enable SSH, run: apt install openssh-server
No default root or user password are set by LXC.
A container directory will be created at ~/.local/share/lxc/example/, with a container-specific configuration file named config where you can specify further filters and controls such as MAC, seccomp deny lists, networks, etc. You will see that the root filesystem of the newly created container is unpacked in rootfs/, which looks like a standard Linux root filesystem:
rayden@uwuntu:~/.local/share/lxc/example$ ls rootfs/
bin   dev  home  lib32  libx32  mnt  proc  run   srv  tmp  var
boot  etc  lib   lib64  media   opt  root  sbin  sys  usr
You may make changes offline (without starting and attaching to the container) by using chroot on the rootfs directory.
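For instance, a quick offline inspection might look like this (a sketch; note that for an unprivileged container, files created through a plain chroot end up owned by host UIDs rather than the container’s shifted UIDs, so prefer lxc-attach for anything beyond small edits):

rayden@uwuntu:~$ sudo chroot ~/.local/share/lxc/example/rootfs /bin/bash
root@uwuntu:/# head -2 /etc/os-release
NAME="Ubuntu"
VERSION="19.10 (Eoan Ermine)"
root@uwuntu:/# exit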
Running a container
To start the example container, simply run
rayden@uwuntu:~$ lxc-start example
which daemonizes the container. If you encounter errors starting the container, using the -F option to start the container in the foreground will give more verbose output.
We can verify that our container is running via
rayden@uwuntu:~$ lxc-info example
Name:           example
State:          RUNNING
PID:            7547
IP:             10.0.3.79
Memory use:     50.38 MiB
KMem use:       30.09 MiB
Link:           veth1000_JSJQ
 TX bytes:      2.08 KiB
 RX bytes:      8.93 KiB
 Total bytes:   11.01 KiB
We can also get a summarized view of all containers:
rayden@uwuntu:~$ lxc-ls --fancy
NAME    STATE   AUTOSTART GROUPS IPV4      IPV6 UNPRIVILEGED
example RUNNING 0         -      10.0.3.79 -    true
Attaching to our running container instance is as simple as:
rayden@uwuntu:~$ lxc-attach example
root@example:/# id
uid=0(root) gid=0(root) groups=0(root)
root@example:/# passwd
New password:
Retype new password:
passwd: password updated successfully
Once we have added our users (as needed) and changed their passwords, we can connect to the container with an interactive login via the command lxc-console. The difference is that lxc-attach behaves more like a key-based ssh setup (you get a root session directly inside, without any prompts), while lxc-console gives you a virtual console which simulates an interactive console on a real server (serial, DRAC, iLO, etc.).
Notice that we are root inside the container, even though we created an unprivileged container. This behavior is the result of user namespaces and the UID mapping we configured earlier. We can see that any process in the container is mapped to an unprivileged UID on the host by running a process in the container:
root@example:/# while [ 1 ]; do sleep 5; done &
[1] 132
On the host we can see that the process is running with UID 100000:
rayden@uwuntu:~$ ps aux | grep sleep
100000    7983  0.0  0.0   8068   844 pts/3   S    20:24   0:00 sleep 5
If you look at other processes from the ps aux output, you will notice that the container init process is UID-mapped as well:
rayden@uwuntu:~$ ps aux | grep init
...
100000    7547  0.0  0.1 166192 10220 ?       Ss   19:48   0:00 /sbin/init
We may run most system administration tasks inside, such as installing packages. Let us install the nginx web server and the net-tools binary package:
root@example:/# apt update
[output truncated]
root@example:/# apt install nginx net-tools
[output truncated]
Verify that nginx is running on port 80:
root@example:/# netstat -atunp | grep LISTEN
tcp    0    0 127.0.0.53:53    0.0.0.0:*    LISTEN    88/systemd-resolved
tcp    0    0 0.0.0.0:80       0.0.0.0:*    LISTEN    881/nginx: master p
tcp6   0    0 :::80            :::*         LISTEN    881/nginx: master p
If for some reason it isn’t running, start and persist it with
root@example:/# systemctl start nginx
root@example:/# systemctl enable nginx
Networking
(This section assumes knowledge of iptables.)
lxc creates an independent bridge by default, which uses masquerading for all traffic to the main interface. A bridge is created out of thin air (lxcbr0) and the containers are linked to this bridge. This allows the containers to reach the Internet if the main interface has access to the Internet as well (through the use of forwarding and masquerading). A quick look at the interfaces on our host shows the main interface with an Internet connection ens33, the default bridge lxcbr0 and the virtual interface veth1000_XXXX for the container example.
rayden@uwuntu:~$ ifconfig
ens33: flags=4163  mtu 1500
        inet 192.168.3.131  netmask 255.255.255.0  broadcast 192.168.3.255
        inet6 fe80::1dca:deec:91b8:2e31  prefixlen 64  scopeid 0x20
        ether 00:0c:29:d0:e5:24  txqueuelen 1000  (Ethernet)
        ...
...
lxcbr0: flags=4163  mtu 1500
        inet 10.0.3.1  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::216:3eff:fe00:0  prefixlen 64  scopeid 0x20
        ether 00:16:3e:00:00:00  txqueuelen 1000  (Ethernet)
        ...
veth1000_JSJQ: flags=4163  mtu 1500
        inet6 fe80::fcc8:d2ff:fee1:3646  prefixlen 64  scopeid 0x20
        ether fe:c8:d2:e1:36:46  txqueuelen 1000  (Ethernet)
        ...
The local network setup looks like this:
- Hypervisor (Windows 10)
  - Directly connected to 192.168.3.0/24 (address 192.168.3.130)
- Container Host (Ubuntu VM)
  - Directly connected to 192.168.3.0/24 via ens33 (address 192.168.3.131)
  - Directly connected to 10.0.3.0/24 via lxcbr0 (address 10.0.3.1)
- example container running nginx (Ubuntu)
  - Directly connected to 10.0.3.0/24 via eth0 (address 10.0.3.79), which is attached to the lxcbr0 host bridge via the virtual adapter veth1000_JSJQ
In this setup, the 10.0.3.0/24 network uses the system default gateway in the 192.168.3.0/24 network, which we can see from the system routing table on the container host:
rayden@uwuntu:~$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.3.2     0.0.0.0         UG    100    0        0 ens33
10.0.3.0        0.0.0.0         255.255.255.0   U     0      0        0 lxcbr0
169.254.0.0     0.0.0.0         255.255.0.0     U     1000   0        0 ens33
192.168.3.0     0.0.0.0         255.255.255.0   U     100    0        0 ens33
We can see the masquerading rule in the NAT table through iptables:
rayden@uwuntu:~$ sudo iptables -t nat -L
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
MASQUERADE  all  --  10.0.3.0/24         !10.0.3.0/24
Hence, the nginx default site is reachable from the container host, since the host is directly connected to the 10.0.3.0/24 network. If you browse to http://10.0.3.79/ from the container host, you should see the default welcome page served by the container.
However, any other external network should not be able to reach the nginx service in the example container. In this setup the container host is an Ubuntu VMware Workstation virtual machine running on Windows 10. Since addresses are translated from 10.0.3.0/24 to 192.168.3.0/24, the only address we can reach from the Windows 10 host is the ens33 interface of the Ubuntu VM (192.168.3.131).
In order to expose the nginx service in the container to the Windows 10 host, we need to forward port 80 on 192.168.3.131 to 10.0.3.79. We can do this via a NAT table PREROUTING chain rule:
rayden@uwuntu:~$ sudo iptables -t nat -A PREROUTING -p tcp -i ens33 --dport 80 -j DNAT --to-destination 10.0.3.79:80
If you cannot access the service, check that IP forwarding is enabled in the Ubuntu kernel:
rayden@uwuntu:~$ cat /proc/sys/net/ipv4/ip_forward
1
Otherwise, append the following line to /etc/sysctl.conf:
net.ipv4.ip_forward=1
and load the value using the command sudo sysctl -p.
Now let us tighten the firewall rules a little bit on the Ubuntu VM. The default chain policy on the filter table (ACCEPT) is too permissive, so let’s set the default policy on the INPUT and FORWARD chains to DROP:
rayden@uwuntu:~$ sudo iptables -P INPUT DROP
rayden@uwuntu:~$ sudo iptables -P FORWARD DROP
Make sure to delete any rules that accept all traffic on both INPUT and FORWARD. You should not be able to access the nginx service from the Windows 10 host right now, since the Ubuntu VM is not forwarding any traffic to the container. We need to enable some forwarding rules to allow HTTP traffic to 10.0.3.79:
rayden@uwuntu:~$ sudo iptables -A FORWARD -p tcp -d 10.0.3.79 --dport 80 -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT
and vice versa:
rayden@uwuntu:~$ sudo iptables -A FORWARD -s 10.0.3.79 -p tcp --sport 80 -j ACCEPT
You should now be able to access the nginx service from your VM host via the container host, which forwards the traffic to the container.
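To verify end to end from the Windows 10 host, browse to http://192.168.3.131/ or (assuming a recent Windows 10 build, which ships curl.exe) run:

C:\> curl http://192.168.3.131/
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
...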
Stopping a container
To stop the example container we have created, simply issue the command:
rayden@uwuntu:~$ lxc-stop example
rayden@uwuntu:~$ lxc-info example
Name:           example
State:          STOPPED
If you would like to purge (delete) the container from the file system:
rayden@uwuntu:~$ lxc-destroy example
What’s the difference between lxc and Docker?
Both solutions are suited for different use-cases. In short:
lxc: has been around much longer (Docker used to use lxc). Feels more like a full OS in a VM and has to be handled in a similar manner: software has to be installed and updated manually, either by hand or through configuration management tools such as Ansible.
Docker: intended for running a single application. Does not have a full stack of system processes like lxc. A container with the application and its dependencies is built and deployed using a Dockerfile.
In terms of container orchestration, both have rather new tools: lxc has lxd, and Docker has Docker Swarm and Kubernetes. There is a new project called lxe which aims to integrate lxc/lxd with Kubernetes.
A common misconception is that Docker uses lxc. Docker DOES NOT use lxc; Docker used to make use of lxc to run containers, but that ceased a few years ago. Both Docker and lxc use the same kernel features for containerization, but they are independent solutions.
Summary
To summarize:
- A container is a collection of one or more processes that are isolated from the rest of the system.
- lxc achieves containerization through the use of Linux kernel features that abstract the operating system away and isolate the container, such as:
  - Control Groups (cgroups)
  - Capabilities
  - seccomp
  - Mandatory Access Control (via AppArmor, SELinux)
  - Namespaces
  - chroot jails
- Container-specific configuration for lxc is located at ~/.local/share/lxc/<container name>/config
- User-specific configuration for lxc is located at ~/.config/lxc/default.conf
- Global configuration for lxc is located at /etc/lxc/default.conf
- Create a container using lxc-create
- Start a container using lxc-start
- Attach to a running container using lxc-attach
- Stop a container using lxc-stop
- List containers using lxc-ls --fancy; inspect a single container using lxc-info
- Destroy containers using lxc-destroy
- The container rootfs is at ~/.local/share/lxc/<container name>/rootfs