Friday, December 16, 2016

Docker Networking Internals: How Docker uses Linux iptables and interfaces

I started playing with docker a while ago, and like most people I was instantly impressed with its power and ease of use. Simplicity is one of docker's core tenets, and much of its power is abstracted behind simple CLI commands. As I was learning to use docker, I wanted to know what it was doing in the background to make things happen, especially around networking (one of my primary areas of interest).

I found plenty of documentation on how to create and manipulate container networks, but much less on how docker actually makes container networking work. Docker makes extensive use of linux iptables and bridge interfaces, and this post is my summary of how they are used to create container networks. Most of this information comes from github discussion threads, presentations, and my own testing, and I link to a number of helpful resources at the end of this post.

I used docker 1.12.3 for the examples in this post. This is not meant to be a comprehensive description of docker networking, nor an introduction to it. I hope it adds some insight for users, and I would appreciate any feedback or comments on errors or anything I have missed.

 


Docker Networks Overview

Docker's networking is built on top of the Container Network Model (CNM), which allows anyone to write their own network driver. This means different network types can be made available to containers running on the docker engine, and containers can connect to more than one network type at the same time. In addition to the various third-party network drivers available, docker comes with four built-in network drivers:

  • Bridge: This is the default network that containers are launched in. Connectivity is provided through a bridge interface on the docker host. Containers on the same bridge network get addresses from that network's subnet and can communicate with each other (by default).

  • Host: This driver gives a container access to the docker host's own network namespace (the container sees and uses the same interfaces as the docker host).

  • Macvlan: This driver gives containers direct access to an interface or subinterface (VLAN) of the host. It also allows trunking.

  • Overlay: This driver allows networks to be built across multiple hosts running docker (usually a docker swarm cluster). Containers on an overlay network have their own subnet and network addresses, and can communicate with each other directly even if they are running on different physical hosts.

Bridge and overlay networks are probably the most commonly used network drivers, and I will be mostly concerned with these two drivers in this article and the next.

 

Docker Bridge Networks

The default network for containers running on a docker host is a bridge network. Docker creates a default bridge network named ‘bridge’ when it is first installed. We can see this network by listing all networks with docker network ls:

$ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
3e8110efa04a        bridge              bridge              local
bb3cd79b9236        docker_gwbridge     bridge              local
22849c4d1c3a        host                host                local
3kuba8yq3c27        ingress             overlay             swarm
ecbd1c6c193a        none                null                local

To inspect its properties run docker network inspect bridge:

$ docker network inspect bridge
[
    {
        "Name": "bridge",
        "Id": "3e8110efa04a1eb0923d863af719abf5eac871dbac4ae74f133894b8df4b9f5f",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.18.0.0/16",
                    "Gateway": "172.18.0.1"
                }
            ]
        },
        "Internal": false,
        "Containers": {},
        "Options": {
            "com.docker.network.bridge.default_bridge": "true",
            "com.docker.network.bridge.enable_icc": "true",
            "com.docker.network.bridge.enable_ip_masquerade": "true",
            "com.docker.network.bridge.host_binding_ipv4": "0.0.0.0",
            "com.docker.network.bridge.name": "docker0",
            "com.docker.network.driver.mtu": "1500"
        },
        "Labels": {}
    }
]

You can also create your own bridge networks by using the docker network create command and specifying --driver bridge. For example, docker network create --driver bridge --subnet 192.168.100.0/24 --ip-range 192.168.100.0/24 my-bridge-network creates another bridge network named ‘my-bridge-network’ with the subnet 192.168.100.0/24.
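As a quick sketch of what that looks like end to end (the network ID shown below is illustrative, and the name filter assumes a docker version whose docker network ls supports filters):

$ docker network create --driver bridge --subnet 192.168.100.0/24 --ip-range 192.168.100.0/24 my-bridge-network
$ docker network ls --filter name=my-bridge-network
NETWORK ID          NAME                DRIVER              SCOPE
e6bc7d6b75f3        my-bridge-network   bridge              local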

 

Linux bridge interfaces

Each bridge network that docker creates is represented by a bridge interface on the docker host. The default bridge network ‘bridge’ usually has the interface docker0 associated with it, and each subsequent bridge network that is created with the docker network create command will have a new interface associated with it.

$ ifconfig docker0
docker0   Link encap:Ethernet  HWaddr 02:42:44:88:bd:75
          inet addr:172.18.0.1  Bcast:0.0.0.0  Mask:255.255.0.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

To find the linux interface associated with a docker network you created, you can use ifconfig to list all interfaces and then find the one with the subnet you specified. For example, to look up the bridge interface for the my-bridge-network we created above, we can run the following:

$ ifconfig | grep 192.168.100. -B 1
br-e6bc7d6b75f3 Link encap:Ethernet  HWaddr 02:42:bc:f1:91:09
          inet addr:192.168.100.1  Bcast:0.0.0.0  Mask:255.255.255.0

The linux bridge interfaces are similar in function to network switches: they connect different interfaces into the same layer 2 segment (subnet) and forward traffic based on MAC addresses. As we shall see below, each container connected to a bridge network has its own virtual interface created on the docker host, and the docker engine connects all containers in the same network to the same bridge interface, allowing them to communicate with each other. You can get more details about the status of the bridge by using the brctl utility:

$ brctl show docker0
bridge name     bridge id               STP enabled     interfaces
docker0         8000.02424488bd75       no

Once we have containers running and connected to this network, we will see each container's interface listed under the interfaces column, and running a traffic capture on the bridge interface will let us see all communication between containers on the same subnet.
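Because the bridge forwards on MAC addresses, brctl can also show the MAC table the bridge has learned once containers are attached and generating traffic. This is just a sketch with illustrative values:

$ sudo brctl showmacs docker0
port no mac addr                is local?       ageing timer
  1     02:42:ac:12:00:02       no                 1.23
  2     02:42:ac:12:00:03       no                 0.45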

 

Linux virtual interfaces (veth)

The Container Network Model (CNM) allows each container to have its own network namespace. Running ifconfig from inside the container will show its interfaces as the container sees them:

$ docker run -ti ubuntu:14.04 /bin/bash
root@6622112b507c:/#
root@6622112b507c:/# ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:ac:12:00:02
          inet addr:172.18.0.2  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:acff:fe12:2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:9 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:766 (766.0 B)  TX bytes:508 (508.0 B)


lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

However, the eth0 seen above exists only within that container. On the docker host, docker creates a corresponding twin virtual interface (the other end of a veth pair) that acts as the container's link to the outside world. These virtual interfaces are then connected to the bridge interfaces discussed above to provide connectivity between the different containers on the same subnet.

We can review this process by starting two containers connected to the default bridge network, and then view the interface configuration on the docker host.

Before starting any containers, the docker0 bridge interface has no interfaces attached:

$ sudo brctl show docker0
bridge name     bridge id               STP enabled     interfaces
docker0         8000.02424488bd75       no

I then started two containers from the ubuntu:14.04 image:

$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
a754719db594        ubuntu:14.04        "/bin/bash"         5 seconds ago       Up 4 seconds                            zen_kalam
976041ec420f        ubuntu:14.04        "/bin/bash"         7 seconds ago       Up 5 seconds                            stupefied_easley

You can immediately see that there are now two interfaces attached to the docker0 bridge interface (one for each container):

$ sudo brctl show docker0
bridge name     bridge id               STP enabled     interfaces
docker0         8000.02424488bd75       no              veth2177159
                                                        vethd8e05dd
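The same attachment can be seen with iproute2, assuming an ip version that supports filtering by master device (the interface indexes below are illustrative):

$ ip link show master docker0
15: vethd8e05dd@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default
17: veth2177159@if16: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default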

Starting a ping to google.com from one of the containers, and then running a traffic capture on that container's virtual interface from the docker host, shows us the container's traffic:

$ docker exec a754719db594 ping google.com
PING google.com (216.58.217.110) 56(84) bytes of data.
64 bytes from iad23s42-in-f110.1e100.net (216.58.217.110): icmp_seq=1 ttl=48 time=0.849 ms
64 bytes from iad23s42-in-f110.1e100.net (216.58.217.110): icmp_seq=2 ttl=48 time=0.965 ms

ubuntu@swarm02:~$ sudo tcpdump -i veth2177159 icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on veth2177159, link-type EN10MB (Ethernet), capture size 262144 bytes
20:47:12.170815 IP 172.18.0.3 > iad23s42-in-f14.1e100.net: ICMP echo request, id 14, seq 55, length 64
20:47:12.171654 IP iad23s42-in-f14.1e100.net > 172.18.0.3: ICMP echo reply, id 14, seq 55, length 64
20:47:13.170821 IP 172.18.0.3 > iad23s42-in-f14.1e100.net: ICMP echo request, id 14, seq 56, length 64
20:47:13.171694 IP iad23s42-in-f14.1e100.net > 172.18.0.3: ICMP echo reply, id 14, seq 56, length 64

Similarly we can do a ping from one container to another.

First, we need to get the IP address of the target container, which we can do either by running ifconfig inside it or by inspecting it with the docker inspect command:

$ docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' a754719db594 
172.18.0.3 

Then start a ping from one container to another:

$ docker exec 976041ec420f ping 172.18.0.3
PING 172.18.0.3 (172.18.0.3) 56(84) bytes of data.
64 bytes from 172.18.0.3: icmp_seq=1 ttl=64 time=0.070 ms
64 bytes from 172.18.0.3: icmp_seq=2 ttl=64 time=0.053 ms

To see this traffic from the docker host, we can do a capture on either of the virtual interfaces corresponding to the containers, or we can do a capture on the bridge interface (docker0 in this instance) which shows all inter-container communication on that subnet:

$ sudo tcpdump -ni docker0 host 172.18.0.2 and host 172.18.0.3
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on docker0, link-type EN10MB (Ethernet), capture size 262144 bytes
20:55:37.990831 IP 172.18.0.2 > 172.18.0.3: ICMP echo request, id 14, seq 200, length 64
20:55:37.990865 IP 172.18.0.3 > 172.18.0.2: ICMP echo reply, id 14, seq 200, length 64
20:55:38.990828 IP 172.18.0.2 > 172.18.0.3: ICMP echo request, id 14, seq 201, length 64
20:55:38.990866 IP 172.18.0.3 > 172.18.0.2: ICMP echo reply, id 14, seq 201, length 64

 

Locate a container’s veth interface

There is no straightforward way to find which veth interface on the docker host is linked to the interface inside a given container, but there are several methods discussed in various docker forum and github threads. The easiest, in my opinion, is the following (based on the solution in this thread, with a slight modification), which depends on having ethtool accessible in the container:

For example, I have three containers running on my system:

$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
ccbf97c72bf5        ubuntu:14.04        "/bin/bash"         3 seconds ago       Up 3 seconds                            admiring_torvalds
77d9f02d61f2        ubuntu:14.04        "/bin/bash"         4 seconds ago       Up 4 seconds                            goofy_borg
19743c0ddf24        ubuntu:14.04        "/bin/sh"           8 minutes ago       Up 8 minutes                            high_engelbart

First I execute the following in the container, and get the peer_ifindex number:

$ docker exec 77d9f02d61f2 sudo ethtool -S eth0
NIC statistics:
     peer_ifindex: 16

Then on the docker host, I use the peer_ifindex to find the interface name:

$ sudo ip link | grep 16
16: veth7bd3604@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default

So the interface name in this case is veth7bd3604.
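If you want to script the two steps above, something like the following sketch works; it assumes ethtool is installed in the container and uses awk to pull the peer index out of the output and match it against the host's interface list:

$ CID=77d9f02d61f2
$ IDX=$(docker exec $CID ethtool -S eth0 | awk '/peer_ifindex/ {print $2}')
$ ip -o link | awk -F': ' -v idx="$IDX" '$1 == idx {print $2}'
veth7bd3604@if15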

iptables

Docker uses linux iptables to control communication to and from the interfaces and networks it creates. Linux iptables consists of different tables, but we are primarily concerned with two of them: filter and nat. The filter table holds the security rules used to allow or deny traffic to IP addresses, networks or interfaces, whereas the nat table contains the rules responsible for masking IP addresses or ports. Docker uses nat to allow containers on bridge networks to communicate with destinations outside the docker host (otherwise routes pointing to the container subnets would have to be added elsewhere in the docker host's network).

iptables:filter

Tables in iptables consist of chains that correspond to different conditions or stages in processing a packet on the docker host. The filter table has three chains by default: the INPUT chain, for packets arriving at the host and destined for the host itself; the OUTPUT chain, for packets originating on the host and headed to an outside destination; and the FORWARD chain, for packets entering the host with a destination outside the host. Each chain consists of rules that specify an action to take on a packet (for example, accept or reject it) along with the conditions for matching the rule. Rules are processed in sequence until a match is found; otherwise the default policy of the chain is applied. It is also possible to define custom chains in a table.
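As an aside, creating (and removing) a custom chain is straightforward. The chain name MY-CHAIN below is just an illustrative example, not something docker creates:

$ sudo iptables -N MY-CHAIN              # create a new, empty custom chain in the filter table
$ sudo iptables -I FORWARD 1 -j MY-CHAIN # jump to it from the top of the FORWARD chain
$ sudo iptables -D FORWARD -j MY-CHAIN   # remove the jump again
$ sudo iptables -X MY-CHAIN              # delete the now-unreferenced chain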

To view the currently configured rules and default policies for chains in the filter table, run iptables -t filter -L (or iptables -L since the filter table is used by default if no table is specified):

$ sudo iptables -t filter -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:domain
ACCEPT     udp  --  anywhere             anywhere             udp dpt:domain
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:bootps
ACCEPT     udp  --  anywhere             anywhere             udp dpt:bootps
Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
DOCKER-ISOLATION  all  --  anywhere             anywhere
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere
Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
Chain DOCKER (3 references)
target     prot opt source               destination
Chain DOCKER-ISOLATION (1 references)
target     prot opt source               destination
DROP       all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere
RETURN     all  --  anywhere             anywhere

In the output above you can see the different chains and the default policy for each chain (custom chains have no default policy). We can also see that docker has added two custom chains, DOCKER and DOCKER-ISOLATION, and has inserted rules into the FORWARD chain that have these two new chains as targets.
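A quick way to see just those jump rules is to list the FORWARD chain in rule-specification form and filter on the chain names (the output here reflects the networks on my host and will differ on yours):

$ sudo iptables -S FORWARD | grep DOCKER
-A FORWARD -j DOCKER-ISOLATION
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -o br-e6bc7d6b75f3 -j DOCKER
-A FORWARD -o docker_gwbridge -j DOCKER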

Docker-isolation chain

The DOCKER-ISOLATION chain contains rules that restrict access between the different container networks. To see more detail, use the -v option when running iptables:

$ sudo iptables -t filter -L -v
….
Chain DOCKER-ISOLATION (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 DROP       all  --  br-e6bc7d6b75f3 docker0  anywhere             anywhere
    0     0 DROP       all  --  docker0 br-e6bc7d6b75f3  anywhere             anywhere
    0     0 DROP       all  --  docker_gwbridge docker0  anywhere             anywhere
    0     0 DROP       all  --  docker0 docker_gwbridge  anywhere             anywhere
    0     0 DROP       all  --  docker_gwbridge br-e6bc7d6b75f3  anywhere             anywhere
    0     0 DROP       all  --  br-e6bc7d6b75f3 docker_gwbridge  anywhere             anywhere
36991 3107K RETURN     all  --  any    any     anywhere             anywhere

You can see above a number of DROP rules that block traffic between the bridge interfaces created by docker, ensuring that the different container networks cannot communicate with each other.
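A quick way to confirm this isolation is to ping from a container on the default bridge to an address on another docker bridge network. In the sketch below, 192.168.100.2 is assumed to be a container on my-bridge-network, and the output is illustrative:

$ docker exec a754719db594 ping -c 2 -W 1 192.168.100.2
PING 192.168.100.2 (192.168.100.2) 56(84) bytes of data.

--- 192.168.100.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1008ms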

icc=false

One of the options that can be passed to the docker network create command is com.docker.network.bridge.enable_icc, where icc stands for inter-container communication. Setting this option to false blocks containers on the same network from communicating with each other. This is implemented by adding a DROP rule to the FORWARD chain that matches packets coming in from the bridge interface associated with the network and going back out the same interface.

For example, if we create a new network with the command docker network create --driver bridge --subnet 192.168.200.0/24 --ip-range 192.168.200.0/24 -o "com.docker.network.bridge.enable_icc"="false" no-icc-network, we can find its bridge interface and then inspect the FORWARD chain:

 

$ ifconfig | grep 192.168.200 -B 1
br-8e3f0d353353 Link encap:Ethernet  HWaddr 02:42:c4:6b:f1:40
          inet addr:192.168.200.1  Bcast:0.0.0.0  Mask:255.255.255.0

$ sudo iptables -t filter -S FORWARD
-P FORWARD ACCEPT
-A FORWARD -j DOCKER-ISOLATION
-A FORWARD -o br-8e3f0d353353 -j DOCKER
-A FORWARD -o br-8e3f0d353353 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i br-8e3f0d353353 ! -o br-8e3f0d353353 -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A FORWARD -o br-e6bc7d6b75f3 -j DOCKER
-A FORWARD -o br-e6bc7d6b75f3 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i br-e6bc7d6b75f3 ! -o br-e6bc7d6b75f3 -j ACCEPT
-A FORWARD -i br-e6bc7d6b75f3 -o br-e6bc7d6b75f3 -j ACCEPT
-A FORWARD -o docker_gwbridge -j DOCKER
-A FORWARD -o docker_gwbridge -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i docker_gwbridge ! -o docker_gwbridge -j ACCEPT
-A FORWARD -o lxcbr0 -j ACCEPT
-A FORWARD -i lxcbr0 -j ACCEPT
-A FORWARD -i docker_gwbridge -o docker_gwbridge -j DROP
-A FORWARD -i br-8e3f0d353353 -o br-8e3f0d353353 -j DROP

Note the last rule: traffic entering and leaving br-8e3f0d353353 (the network created with icc disabled) on the same interface is dropped, whereas the equivalent rule for docker0, which has icc enabled, is an ACCEPT.

iptables:nat

NAT allows the host to change the IP address or port of a packet. Here it is used to mask the source IP address of packets coming from docker bridge networks (for example, hosts in the 172.18.0.0/16 subnet) and destined for the outside world behind the IP address of the docker host. This feature is controlled by the com.docker.network.bridge.enable_ip_masquerade option, which can be passed to docker network create (if not specified, it defaults to true).

You can see the effect of this option in the nat table of iptables:

$ sudo iptables -t nat -L
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
DOCKER     all  --  anywhere             anywhere             ADDRTYPE match dst-type LOCAL


Chain INPUT (policy ACCEPT)
target     prot opt source               destination


Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
DOCKER     all  --  anywhere            !127.0.0.0/8          ADDRTYPE match dst-type LOCAL


Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
MASQUERADE  all  --  172.18.0.0/16        anywhere
MASQUERADE  all  --  192.168.100.0/24     anywhere
MASQUERADE  all  --  172.19.0.0/16        anywhere
MASQUERADE  all  --  10.0.3.0/24         !10.0.3.0/24


Chain DOCKER (2 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere
RETURN     all  --  anywhere             anywhere
RETURN     all  --  anywhere             anywhere

In the POSTROUTING chain, you can see that each docker network gets a MASQUERADE rule that is applied when its containers communicate with any host outside their own network.
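If you do not want this behavior for a particular network, masquerading can be disabled when the network is created. In the sketch below the network name and subnet are illustrative; no MASQUERADE rule will be added to the POSTROUTING chain for its subnet:

$ docker network create --driver bridge --subnet 192.168.210.0/24 -o "com.docker.network.bridge.enable_ip_masquerade"="false" no-masq-network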

Summary

  • A bridge network has a corresponding linux bridge interface on the docker host that acts as a layer 2 switch, connecting the different containers on the same subnet.
  • Each network interface in a container has a corresponding virtual interface on the docker host that is created while the container is running.
  • A traffic capture on the bridge interface from the docker host is equivalent to configuring a SPAN port on a switch, in that you can see all inter-container communication on that network.
  • A traffic capture on a container's virtual interface (veth*) from the docker host will show all traffic the container is sending on that subnet.
  • Linux iptables rules in the filter table are used to block different networks (and sometimes individual hosts within a network) from communicating with each other. These rules are usually added to the DOCKER-ISOLATION chain.
  • Containers communicating with the outside world through a bridge interface have their IP addresses hidden behind the docker host's IP address. This is done by adding rules to the nat table in iptables.

 

Links/Resources

8 comments:

  1. This is fantastic !!! Helped me a lot ! Thank you !

    1. Thank you for the feedback! I'm glad you found it helpful.

  2. Do you know how to block all outgoing traffic? The internal network type blocks incoming traffic as well.

    1. There are multiple ways to do this. One way would be to remove the iptables rules that allow the bridge networks' outgoing traffic to any destination, which would be the following rules in the FORWARD chain (from the example in the post there were two bridge networks and each has its own rule):

      -A FORWARD -i br-8e3f0d353353 ! -o br-8e3f0d353353 -j ACCEPT
      -A FORWARD -i docker0 ! -o docker0 -j ACCEPT

      Another option is to add a drop rule in the DOCKER-ISOLATION chain. If we wanted to do this for the docker0 bridge network, the rule can be inserted with the following command (using -I so that it is inserted above the final RETURN rule in that chain):

      sudo iptables -I DOCKER-ISOLATION -i docker0 ! -o docker0 -j DROP

      You can of course be more granular to control what destinations are allowed/denied.

      I hope this helps.

    2. I think it's useless to add a drop rule in the DOCKER-ISOLATION chain. It's really strange that I can't block my telnet hostip:30004 request no matter whether I add the drop rule to the FORWARD or DOCKER-ISOLATION chain.

      like this:

      -A FORWARD -i docker0 -o docker0 -p tcp -m tcp --dport 30004 -j DROP

      -A DOCKER-ISOLATION -i docker0 -p tcp -m tcp --dport 30004 -j DROP

      but it works when I add the drop rule to FORWARD on docker 1.9.1.

  3. Very Nice demos man... great work.

  4. Any idea on how to pass mcast requests on standard ports, say for example i am running ws-discovery as docker and in this case discovery requests will be on port 3702. For the requests to reach ws-discovery docker from host - any changes to be made or what changes to be made?
