Friday, December 16, 2016

Docker Networking Internals: How Docker uses Linux iptables and interfaces

I started playing with docker a while ago, and like most people I was instantly impressed with its power and ease of use. Simplicity is one of docker’s core tenets, and much of docker’s power is abstracted behind simple CLI commands. As I was learning to use docker, I wanted to know what it was doing in the background to make things happen, especially around networking (one of my primary areas of interest).

I found plenty of documentation on how to create and manipulate container networks, but not as much on how docker actually makes container networking work. Docker makes extensive use of linux iptables and bridge interfaces, and this post is my summary of how those are used to create container networks. Most of this information came from github discussion threads, presentations, and my own testing, and I link to a number of helpful resources at the end of this post.

I used docker 1.12.3 for the examples in this post. This is not meant as a comprehensive description of docker networking nor as an introduction to docker networking. I hope it might add some insights for users, and I would appreciate any feedback or comments on errors or anything missing. 

 


 

Docker Networks Overview

Docker’s networking is built on top of the Container Network Model (CNM) which allows any party to write their own network driver. This allows for different network types to be available to containers running on the docker engine, and containers can connect to more than one network type at the same time. In addition to the various third party network drivers available, docker comes with four built-in network drivers:

  • Bridge: This is the default network that containers are launched in. Connectivity is provided through a bridge interface on the docker host. Containers on the same bridge network are on the same subnet and can communicate with each other (by default). 

  • Host: This driver allows a container to have access to the docker host’s own network space (The container will see and use the same interfaces as the docker host).

  • Macvlan: This driver allows containers to have direct access to an interface or subinterface (vlan) of the host. It also allows trunking.

  • Overlay: This driver allows for networks to be built across multiple hosts running docker (usually a docker swarm cluster). Containers also have their own subnet and network addresses, and can communicate with each other directly even if they are running on different physical hosts.

Bridge and overlay networks are probably the most commonly used network drivers, and I will be mostly concerned with these two drivers in this article and the next.
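
As a quick illustration of a container being attached to more than one of these networks at the same time, the sketch below connects a running container to a second bridge network using docker network connect (the container name web and network name backend are placeholders made up for this example):

$ docker network create --driver bridge backend
$ docker run -dit --name web ubuntu:14.04 /bin/bash
$ docker network connect backend web
$ docker exec web ifconfig          # now shows two interfaces (eth0 and eth1), one per network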

 

Docker Bridge Networks

The default network for containers running on a docker host is a bridge network. Docker creates a default bridge network named ‘bridge’ when it is first installed. We can see this network by listing all networks with docker network ls:

$ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
3e8110efa04a        bridge              bridge              local
bb3cd79b9236        docker_gwbridge     bridge              local
22849c4d1c3a        host                host                local
3kuba8yq3c27        ingress             overlay             swarm
ecbd1c6c193a        none                null                local

To inspect its properties run docker network inspect bridge:

$ docker network inspect bridge
[
    {
        "Name": "bridge",
        "Id": "3e8110efa04a1eb0923d863af719abf5eac871dbac4ae74f133894b8df4b9f5f",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.18.0.0/16",
                    "Gateway": "172.18.0.1"
                }
            ]
        },
        "Internal": false,
        "Containers": {},
        "Options": {
            "com.docker.network.bridge.default_bridge": "true",
            "com.docker.network.bridge.enable_icc": "true",
            "com.docker.network.bridge.enable_ip_masquerade": "true",
            "com.docker.network.bridge.host_binding_ipv4": "0.0.0.0",
            "com.docker.network.bridge.name": "docker0",
            "com.docker.network.driver.mtu": "1500"
        },
        "Labels": {}
    }
]

You can also create your own bridge networks using the docker network create command with the option --driver bridge. For example, the following creates another bridge network named ‘my-bridge-network’ with the subnet 192.168.100.0/24:

$ docker network create --driver bridge --subnet 192.168.100.0/24 --ip-range 192.168.100.0/24 my-bridge-network

 

Linux bridge interfaces

Each bridge network that docker creates is represented by a bridge interface on the docker host. The default bridge network ‘bridge’ usually has the interface docker0 associated with it, and each subsequent bridge network that is created with the docker network create command will have a new interface associated with it.

$ ifconfig docker0
docker0   Link encap:Ethernet  HWaddr 02:42:44:88:bd:75
          inet addr:172.18.0.1  Bcast:0.0.0.0  Mask:255.255.0.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

To find the linux interface associated with a docker network you created, you can use ifconfig to list all interfaces and then look for the interface with the subnet you specified. For example, if we want to look up the bridge interface for my-bridge-network, which we created above, we can run the following:

$ ifconfig | grep 192.168.100. -B 1
br-e6bc7d6b75f3 Link encap:Ethernet  HWaddr 02:42:bc:f1:91:09
          inet addr:192.168.100.1  Bcast:0.0.0.0  Mask:255.255.255.0
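
Alternatively, since docker names the bridge interface of a user-defined network br- followed by the first 12 characters of the network ID (an implementation detail that could change between versions), you can derive the interface name from the output of docker network ls:

$ docker network ls -q -f name=my-bridge-network     # prints the truncated network ID, e.g. e6bc7d6b75f3
$ ifconfig br-$(docker network ls -q -f name=my-bridge-network)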

Linux bridge interfaces are similar in function to switches: they connect different interfaces to the same subnet and forward traffic based on MAC addresses. As we shall see below, each container connected to a bridge network has its own virtual interface created on the docker host, and the docker engine connects all containers in the same network to the same bridge interface, which allows them to communicate with each other. You can get more details about the status of the bridge by using the brctl utility:

$ brctl show docker0
bridge name     bridge id               STP enabled     interfaces
docker0         8000.02424488bd75       no

Once we have containers running and connected to this network, we will see each container’s interface listed under the interfaces column, and running a traffic capture on the bridge interface will let us see the communication between containers on the same subnet.

 

Linux virtual interfaces (veth)

The Container Network Model (CNM) allows each container to have its own network space. Running ifconfig from inside the container shows its interfaces as the container sees them:

$ docker run -ti ubuntu:14.04 /bin/bash
root@6622112b507c:/#
root@6622112b507c:/# ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:ac:12:00:02
          inet addr:172.18.0.2  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:acff:fe12:2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:9 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:766 (766.0 B)  TX bytes:508 (508.0 B)


lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

However, the eth0 interface seen above only exists inside that container. On the docker host, docker creates a twin virtual interface (the other end of a veth pair) that corresponds to it and acts as the container’s link to the outside world. These virtual interfaces are then connected to the bridge interfaces discussed above to provide connectivity between different containers on the same subnet.
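
To get a feel for what these twin interfaces are, you can create a veth pair by hand outside of docker. This is purely illustrative (the veth-demo names are made up); docker and libnetwork handle all of this automatically when a container starts:

$ sudo ip link add veth-demo0 type veth peer name veth-demo1   # create a veth pair
$ ip link show veth-demo0                                      # the new interface, with veth-demo1 as its peer
$ sudo ip link set veth-demo0 master docker0                   # attach one end to a bridge
$ sudo ip link del veth-demo0                                  # deleting one end removes the pair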

We can observe this by starting two containers connected to the default bridge network, and then viewing the interface configuration on the docker host.

Before starting any containers, the docker0 bridge interface has no interfaces attached:

$ sudo brctl show docker0
bridge name     bridge id               STP enabled     interfaces
docker0         8000.02424488bd75       no

I then started two containers from the ubuntu:14.04 image:

$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
a754719db594        ubuntu:14.04        "/bin/bash"         5 seconds ago       Up 4 seconds                            zen_kalam
976041ec420f        ubuntu:14.04        "/bin/bash"         7 seconds ago       Up 5 seconds                            stupefied_easley

You can immediately see that there are now two interfaces attached to the docker0 bridge interface (one for each container):

$ sudo brctl show docker0
bridge name     bridge id               STP enabled     interfaces
docker0         8000.02424488bd75       no              veth2177159
                                                        vethd8e05dd

Starting a ping to google.com from one of the containers, and then running a traffic capture on the container’s virtual interface from the docker host, will show us the container’s traffic:

$ docker exec a754719db594 ping google.com
PING google.com (216.58.217.110) 56(84) bytes of data.
64 bytes from iad23s42-in-f110.1e100.net (216.58.217.110): icmp_seq=1 ttl=48 time=0.849 ms
64 bytes from iad23s42-in-f110.1e100.net (216.58.217.110): icmp_seq=2 ttl=48 time=0.965 ms

ubuntu@swarm02:~$ sudo tcpdump -i veth2177159 icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on veth2177159, link-type EN10MB (Ethernet), capture size 262144 bytes
20:47:12.170815 IP 172.18.0.3 > iad23s42-in-f14.1e100.net: ICMP echo request, id 14, seq 55, length 64
20:47:12.171654 IP iad23s42-in-f14.1e100.net > 172.18.0.3: ICMP echo reply, id 14, seq 55, length 64
20:47:13.170821 IP 172.18.0.3 > iad23s42-in-f14.1e100.net: ICMP echo request, id 14, seq 56, length 64
20:47:13.171694 IP iad23s42-in-f14.1e100.net > 172.18.0.3: ICMP echo reply, id 14, seq 56, length 64

Similarly we can do a ping from one container to another.

First, we need the IP address of the container, which we can get either by running ifconfig in the container or by inspecting the container with the docker inspect command:

$ docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' a754719db594 
172.18.0.3 

Then start a ping from one container to another:

$ docker exec 976041ec420f ping 172.18.0.3
PING 172.18.0.3 (172.18.0.3) 56(84) bytes of data.
64 bytes from 172.18.0.3: icmp_seq=1 ttl=64 time=0.070 ms
64 bytes from 172.18.0.3: icmp_seq=2 ttl=64 time=0.053 ms

To see this traffic from the docker host, we can do a capture on either of the virtual interfaces corresponding to the containers, or we can do a capture on the bridge interface (docker0 in this instance) which shows all inter-container communication on that subnet:

$ sudo tcpdump -ni docker0 host 172.18.0.2 and host 172.18.0.3
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on docker0, link-type EN10MB (Ethernet), capture size 262144 bytes
20:55:37.990831 IP 172.18.0.2 > 172.18.0.3: ICMP echo request, id 14, seq 200, length 64
20:55:37.990865 IP 172.18.0.3 > 172.18.0.2: ICMP echo reply, id 14, seq 200, length 64
20:55:38.990828 IP 172.18.0.2 > 172.18.0.3: ICMP echo request, id 14, seq 201, length 64
20:55:38.990866 IP 172.18.0.3 > 172.18.0.2: ICMP echo reply, id 14, seq 201, length 64

 

Locate a container’s veth interface

There is no straightforward way to find which veth interface on the docker host is linked to the interface inside a container, but there are several methods discussed in various docker forum and github threads. The easiest, in my opinion, is the following (based on the solution in this thread, with a slight modification), which depends on having ethtool accessible in the container:

For example, I have three containers running on my system:

$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
ccbf97c72bf5        ubuntu:14.04        "/bin/bash"         3 seconds ago       Up 3 seconds                            admiring_torvalds
77d9f02d61f2        ubuntu:14.04        "/bin/bash"         4 seconds ago       Up 4 seconds                            goofy_borg
19743c0ddf24        ubuntu:14.04        "/bin/sh"           8 minutes ago       Up 8 minutes                            high_engelbart

First I execute the following in the container, and get the peer_ifindex number:

$ docker exec 77d9f02d61f2 sudo ethtool -S eth0
NIC statistics:
     peer_ifindex: 16

Then on the docker host, I use the peer_ifindex to find the interface name:

$ sudo ip link | grep 16
16: veth7bd3604@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default

So the interface name in this case is veth7bd3604.
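
An alternative that does not require ethtool inside the container relies on the iflink attribute of a veth interface, which holds the ifindex of its peer (this assumes /sys is mounted inside the container, which it is by default):

$ docker exec 77d9f02d61f2 cat /sys/class/net/eth0/iflink     # prints the peer ifindex (16 in this example)
$ ip -o link | grep '^16:'                                    # the matching veth interface on the docker host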

iptables

Docker uses linux iptables to control communication to and from the interfaces and networks it creates. Linux iptables consists of different tables, but we are primarily concerned with two of them: filter and nat. The filter table holds the security rules used to allow or deny traffic to IP addresses, networks or interfaces, whereas the nat table contains the rules responsible for masking IP addresses or ports. Docker uses nat to allow containers on bridge networks to communicate with destinations outside the docker host (otherwise routes pointing to the container networks would have to be added to the docker host’s network).

iptables:filter

Tables in iptables consist of different chains that correspond to different conditions or stages in processing a packet on the docker host. The filter table has three default chains: the INPUT chain for packets arriving at the host and destined for the host itself, the OUTPUT chain for packets originating on the host and headed to an outside destination, and the FORWARD chain for packets that enter the host but are destined for somewhere else (which is the case for container traffic routed through the host). Each chain consists of rules that specify an action to take on a packet (for example, accept or reject it) along with the conditions for matching the rule. Rules are processed in sequence until a match is found; otherwise the default policy of the chain is applied. It is also possible to define custom chains in a table.

To view the currently configured rules and default policies for chains in the filter table, run iptables -t filter -L (or iptables -L since the filter table is used by default if no table is specified):

$ sudo iptables -t filter -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:domain
ACCEPT     udp  --  anywhere             anywhere             udp dpt:domain
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:bootps
ACCEPT     udp  --  anywhere             anywhere             udp dpt:bootps
Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
DOCKER-ISOLATION  all  --  anywhere             anywhere
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere
Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
Chain DOCKER (3 references)
target     prot opt source               destination
Chain DOCKER-ISOLATION (1 references)
target     prot opt source               destination
DROP       all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere
RETURN     all  --  anywhere             anywhere

Above you can see the different chains and the default policy for each chain (there are no default policies for custom chains). We can also see that docker has added two custom chains, DOCKER and DOCKER-ISOLATION, and has inserted rules in the FORWARD chain that have these two new chains as targets.

DOCKER-ISOLATION chain

The DOCKER-ISOLATION chain contains rules that restrict access between the different container networks. To see more details, use the -v option when running iptables:

$ sudo iptables -t filter -L -v
….
Chain DOCKER-ISOLATION (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 DROP       all  --  br-e6bc7d6b75f3 docker0  anywhere             anywhere
    0     0 DROP       all  --  docker0 br-e6bc7d6b75f3  anywhere             anywhere
    0     0 DROP       all  --  docker_gwbridge docker0  anywhere             anywhere
    0     0 DROP       all  --  docker0 docker_gwbridge  anywhere             anywhere
    0     0 DROP       all  --  docker_gwbridge br-e6bc7d6b75f3  anywhere             anywhere
    0     0 DROP       all  --  br-e6bc7d6b75f3 docker_gwbridge  anywhere             anywhere
36991 3107K RETURN     all  --  any    any     anywhere             anywhere

You can see above a number of DROP rules that block traffic between any of the bridge interfaces created by docker, thus making sure that the container networks cannot communicate with each other.

icc=false

One of the options that can be passed to the docker network create command is com.docker.network.bridge.enable_icc, which stands for inter-container communication. Setting this option to false blocks containers on the same network from communicating with each other. This is done by adding a DROP rule in the FORWARD chain that matches packets coming in from the bridge interface associated with the network and destined to go out of the same interface.

For example, we can create a new network with ICC disabled:

$ docker network create --driver bridge --subnet 192.168.200.0/24 --ip-range 192.168.200.0/24 -o "com.docker.network.bridge.enable_icc"="false" no-icc-network

We can then find its bridge interface and check the FORWARD chain of the filter table:

$ ifconfig | grep 192.168.200 -B 1
br-8e3f0d353353 Link encap:Ethernet  HWaddr 02:42:c4:6b:f1:40
          inet addr:192.168.200.1  Bcast:0.0.0.0  Mask:255.255.255.0

$ sudo iptables -t filter -S FORWARD
-P FORWARD ACCEPT
-A FORWARD -j DOCKER-ISOLATION
-A FORWARD -o br-8e3f0d353353 -j DOCKER
-A FORWARD -o br-8e3f0d353353 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i br-8e3f0d353353 ! -o br-8e3f0d353353 -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A FORWARD -o br-e6bc7d6b75f3 -j DOCKER
-A FORWARD -o br-e6bc7d6b75f3 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i br-e6bc7d6b75f3 ! -o br-e6bc7d6b75f3 -j ACCEPT
-A FORWARD -i br-e6bc7d6b75f3 -o br-e6bc7d6b75f3 -j ACCEPT
-A FORWARD -o docker_gwbridge -j DOCKER
-A FORWARD -o docker_gwbridge -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i docker_gwbridge ! -o docker_gwbridge -j ACCEPT
-A FORWARD -o lxcbr0 -j ACCEPT
-A FORWARD -i lxcbr0 -j ACCEPT
-A FORWARD -i docker_gwbridge -o docker_gwbridge -j DROP
-A FORWARD -i br-8e3f0d353353 -o br-8e3f0d353353 -j DROP

Note the DROP rule at the end, which matches traffic that both enters and leaves br-8e3f0d353353, the bridge interface of the new no-icc-network.

iptables:nat

NAT allows the host to change the IP address or port of a packet. In this case, it is used to mask the source IP address of packets coming from docker bridge networks (for example, hosts in the 172.18.0.0/16 subnet) and destined for the outside world, behind the IP address of the docker host. This behavior is controlled by the com.docker.network.bridge.enable_ip_masquerade option that can be passed to docker network create (if not specified, it defaults to true).
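
The masquerade rule that docker installs for the default bridge network is roughly equivalent to adding something like the following by hand (shown only to illustrate the mechanism; docker manages these rules itself), and the behavior can be turned off for a new network by setting the option to false (the network name no-masq-network is just an example):

$ sudo iptables -t nat -A POSTROUTING -s 172.18.0.0/16 ! -o docker0 -j MASQUERADE
$ docker network create -o "com.docker.network.bridge.enable_ip_masquerade"="false" no-masq-network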

You can see the effect of this setting in the nat table of iptables:

$ sudo iptables -t nat -L
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
DOCKER     all  --  anywhere             anywhere             ADDRTYPE match dst-type LOCAL


Chain INPUT (policy ACCEPT)
target     prot opt source               destination


Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
DOCKER     all  --  anywhere            !127.0.0.0/8          ADDRTYPE match dst-type LOCAL


Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
MASQUERADE  all  --  172.18.0.0/16        anywhere
MASQUERADE  all  --  192.168.100.0/24     anywhere
MASQUERADE  all  --  172.19.0.0/16        anywhere
MASQUERADE  all  --  10.0.3.0/24         !10.0.3.0/24


Chain DOCKER (2 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere
RETURN     all  --  anywhere             anywhere
RETURN     all  --  anywhere             anywhere

In the POSTROUTING chain, you can see all of the docker bridge networks, with the MASQUERADE action applied to their traffic whenever it is destined to a host outside their own network.

Summary

  • A bridge network has a corresponding linux bridge interface on the docker host that acts as a layer 2 switch, connecting the different containers on the same subnet.
  • Each network interface in a container has a corresponding virtual interface on the docker host that is created while the container is running.
  • A traffic capture from the docker host on the bridge interface is equivalent to configuring a SPAN port on a switch in that you can see all inter-container communication on that network.
  • A traffic capture from the docker host on the virtual interface (veth-*) will show all traffic the container is sending on a particular subnet.
  • Linux iptables rules in the filter table are used to block different networks (and sometimes specific hosts within a network) from communicating with each other. These rules are usually added in the DOCKER-ISOLATION chain.
  • Containers communicating with the outside world through a bridge interface have their IP hidden behind the docker host’s IP address. This is done by adding rules to the nat table in iptables.

 

Links/Resources

Wednesday, December 14, 2016

Docker Networking Internals: Container Connectivity in Docker Swarm and Overlay Networks

In the previous post, I covered how Docker uses linux virtual interfaces and bridge interfaces to facilitate communication between containers over bridge networks. In this post, I will discuss how Docker uses vxlan technology to create overlay networks that are used in swarm clusters, and how it is possible to view and inspect this configuration. I will also discuss the various network types used to facilitate different connectivity needs for containers launched in swarm clusters.

It is assumed that readers are already familiar with setting up swarm clusters and launching services in Docker Swarm. I will also link to a number of helpful resources at the end of the post that provide more details and context around the topics discussed here. Again, I appreciate any feedback.


 

Docker Swarm and Overlay Networks

Docker overlay networks are used in the context of docker clusters (Docker Swarm), where a virtual network used by containers needs to span multiple physical hosts running the docker engine. When a container is launched in a swarm cluster (as part of a service), it is attached to multiple networks by default, each facilitating a different connectivity requirement.

For example, I have a three node docker swarm cluster:

$ docker node ls
ID                           HOSTNAME       STATUS  AVAILABILITY  MANAGER STATUS
50ogqwz3vkweor5i35eneygmi *  swarm-manager  Ready   Active        Leader
7zg6c719vaj8az2tmiga4twju    swarm02        Ready   Active
b3p76fz1zga5njh5cr91wszfi    swarm01        Ready   Active

I will first create an overlay network named my-overlay-network:

$ docker network create --driver overlay --subnet 10.10.10.0/24 my-overlay-network
bwii51jaglps6a3xkohm57xe6

Then I launch a service running a simple web server container that exposes port 8080 to the outside world. The service has 3 replicas, and I specify that it is connected to only one network (my-overlay-network):

$ docker service create --name webapp --replicas=3 --network my-overlay-network -p 8080:80 akittana/dockerwebapp:1.1
8zpocbn9mv8gb2hqwjpa1stuq
$ docker service ls
ID            NAME    REPLICAS  IMAGE                      COMMAND
8zpocbn9mv8g  webapp  3/3       akittana/dockerwebapp:1.1

If I then list the interfaces available to one of the running containers, I see three interfaces (plus loopback) as opposed to the single interface we would have expected when running a container on a standalone docker host:

$ docker ps
CONTAINER ID        IMAGE                       COMMAND                  CREATED             STATUS              PORTS               NAMES
9ec607870ed9        akittana/dockerwebapp:1.1   "/usr/sbin/apache2ctl"   58 seconds ago      Up 57 seconds       80/tcp              webapp.2.ebrd6mf98r4baogleca60ckjf

$ docker exec 9ec607870ed9 ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:0a:ff:00:06
          inet addr:10.255.0.6  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:aff:feff:6/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:8 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:648 (648.0 B)  TX bytes:648 (648.0 B)

eth1      Link encap:Ethernet  HWaddr 02:42:ac:13:00:03
          inet addr:172.19.0.3  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:acff:fe13:3/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:8 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:648 (648.0 B)  TX bytes:648 (648.0 B)

eth2      Link encap:Ethernet  HWaddr 02:42:0a:0a:0a:03
          inet addr:10.10.10.3  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::42:aff:fe0a:a03/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:15 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1206 (1.2 KB)  TX bytes:648 (648.0 B)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

The container is connected to my-overlay-network through eth2 (as you can tell from the IP address), while eth0 and eth1 are connected to other networks. If we run docker network ls, we can see that two extra networks have been added: docker_gwbridge and ingress. From the subnet addresses we can tell that ingress (10.255.0.0/16) is attached to eth0 and docker_gwbridge (172.19.0.0/16) is attached to eth1:

$ docker network ls
NETWORK ID          NAME                 DRIVER              SCOPE
e96b3056294c        bridge               bridge              local
b85b2b9fdadf        docker_gwbridge      bridge              local
9f0dd1556cd0        host                 host                local
3kuba8yq3c27        ingress              overlay             swarm
bwii51jaglps        my-overlay-network   overlay             swarm
2b9d08c6067a        none                 null                local
$ docker network inspect ingress | grep Subnet
"Subnet": "10.255.0.0/16",
$ docker network inspect docker_gwbridge | grep Subnet
"Subnet": "172.19.0.0/16",

 

Overlay

An overlay network creates a subnet that can be used by containers across multiple hosts in the swarm cluster. Containers running on different physical hosts can talk to each other over the overlay network (as long as they are all attached to the same network).

For example, for the webapp service we started, we can see that there is one container running on each host in the swarm cluster:

$ docker service ps webapp
ID                         NAME      IMAGE                      NODE           DESIRED STATE  CURRENT STATE          ERROR
agng82g4qm19ascc1udlnvy1k  webapp.1  akittana/dockerwebapp:1.1  swarm-manager  Running        Running 3 minutes ago
0vxnym0djag47o94dcmi8yptk  webapp.2  akittana/dockerwebapp:1.1  swarm01        Running        Running 3 minutes ago
d38uyent358pm02jb7inqq8up  webapp.3  akittana/dockerwebapp:1.1  swarm02        Running        Running 3 minutes ago

I can get the overlay IP address for each container by running ifconfig eth2 inside it (eth2 is the interface connected to the overlay network).

On swarm01:

$ docker ps
CONTAINER ID        IMAGE                       COMMAND                  CREATED             STATUS              PORTS               NAMES
2b0abe2956c6        akittana/dockerwebapp:1.1   "/usr/sbin/apache2ctl"   5 minutes ago       Up 5 minutes        80/tcp              webapp.2.0vxnym0djag47o94dcmi8yptk

$ docker exec 2b0abe2956c6 ifconfig eth2 | grep addr
eth2      Link encap:Ethernet  HWaddr 02:42:0a:0a:0a:05
          inet addr:10.10.10.5  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::42:aff:fe0a:a05/64 Scope:Link

Then, from the container running on swarm02, I should be able to ping 10.10.10.5 (the IP of the container on swarm01):

$ docker ps
CONTAINER ID        IMAGE                       COMMAND                  CREATED             STATUS              PORTS               NAMES
a1ca9a0d2364        akittana/dockerwebapp:1.1   "/usr/sbin/apache2ctl"   55 minutes ago      Up 55 minutes       80/tcp              webapp.3.d38uyent358pm02jb7inqq8up

$ docker exec a1ca9a0d2364 ping 10.10.10.5
PING 10.10.10.5 (10.10.10.5) 56(84) bytes of data.
64 bytes from 10.10.10.5: icmp_seq=1 ttl=64 time=0.778 ms
64 bytes from 10.10.10.5: icmp_seq=2 ttl=64 time=0.823 ms

vxlan

Docker’s overlay networks use vxlan technology, which encapsulates layer 2 frames into layer 4 (UDP/IP) packets. This allows docker to create virtual networks on top of existing connections between hosts, whether or not those hosts are in the same subnet. Any network endpoints participating in such a virtual network see each other as if they were connected to the same switch, without having to care about the underlying physical network.
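
A quick way to check which vxlan ID (VNI) docker assigned to an overlay network is to look for the vxlanid_list option in its inspect output (the same key appears in the inspect output of the ingress network later in this post); for my-overlay-network it should match the vni shown in the capture below:

$ docker network inspect my-overlay-network | grep vxlanid
# expect something like: "com.docker.network.driver.overlay.vxlanid_list": "257"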

To see this in action, we can do a traffic capture on the docker hosts participating in the overlay network. In the example above, a traffic capture on swarm01 or swarm02 will show us the icmp traffic between the containers running on them (vxlan uses udp port 4789):

ubuntu@swarm01:~$ sudo tcpdump -i eth0 udp and port 4789
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
01:20:37.030201 IP 172.17.1.50.40368 > 172.17.1.142.4789: VXLAN, flags [I] (0x08), vni 257
IP 10.10.10.5 > 10.10.10.4: ICMP echo request, id 16, seq 19157, length 64
01:20:37.030289 IP 172.17.1.142.49108 > 172.17.1.50.4789: VXLAN, flags [I] (0x08), vni 257
IP 10.10.10.4 > 10.10.10.5: ICMP echo reply, id 16, seq 19157, length 64

You can see two layers in the packets above: the outer layer is the udp vxlan tunnel traffic between the docker hosts on port 4789, and inside it you can see the icmp traffic with the container IP addresses.

Encryption

The traffic capture above shows that anyone who can see the traffic between the docker hosts can also see inter-container traffic going over an overlay network. This is why docker includes an encryption option, which enables automatic IPSec encryption of the vxlan tunnels simply by adding --opt encrypted when creating the network.

Doing the same test as above, but with an encrypted overlay network, we will only see encrypted (ESP) packets between the docker hosts:

$ docker network create --driver overlay --opt encrypted --subnet 10.20.20.0/24 enc-overlay-network
0aha4giv5yylp6l5nzgev4tel

$ docker service create --name webapp --replicas=3 --network enc-overlay-network -p 8080:80 akittana/dockerwebapp:1.1
5hjbv2mmqemto5krgylmen076

$ docker ps
CONTAINER ID        IMAGE                       COMMAND                  CREATED             STATUS              PORTS               NAMES
6ba03d127212        akittana/dockerwebapp:1.1   "/usr/sbin/apache2ctl"   20 seconds ago      Up 19 seconds       80/tcp              webapp.1.3axjnerwc6

$ docker exec 6ba03d127212 ifconfig | grep 10.20.20
          inet addr:10.20.20.3  Bcast:0.0.0.0  Mask:255.255.255.0

$ docker exec 6ba03d127212 ping 10.20.20.5
PING 10.20.20.5 (10.20.20.5) 56(84) bytes of data.
64 bytes from 10.20.20.5: icmp_seq=1 ttl=64 time=0.781 ms
64 bytes from 10.20.20.5: icmp_seq=2 ttl=64 time=0.785 ms

ubuntu@swarm02:~$ sudo tcpdump -i eth0 esp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
01:37:38.342817 IP 172.17.1.50 > 172.17.1.65: ESP(spi=0x378b5e6f,seq=0xe9), length 140
01:37:38.342936 IP 172.17.1.65 > 172.17.1.50: ESP(spi=0x29ade773,seq=0x49), length 140

Inspecting vxlan tunnel interfaces

Similar to bridge networks, docker creates a bridge interface for each overlay network, which connects the virtual tunnel interfaces that make up the vxlan tunnel connections between the hosts. However, these bridge and vxlan tunnel interfaces are not created directly on the docker host; instead, they live in separate network spaces that docker creates for each overlay network.

To actually inspect these interfaces, you have to use nsenter to run commands within the network space that manages the tunnels and virtual interfaces. This has to be done on a docker host that has containers participating in the overlay network.

Also, you have to edit /etc/systemd/system/multi-user.target.wants/docker.service on the docker host and comment out MountFlags=slave as discussed here.

ubuntu@swarm02:~$ sudo ls -l /run/docker/netns/
total 0
-r--r--r-- 1 root root 0 Dec 15 19:52 1-3kuba8yq3c
-r--r--r-- 1 root root 0 Dec 16 02:00 1-bwii51jagl
-r--r--r-- 1 root root 0 Dec 16 02:00 4d950aa1386e
-r--r--r-- 1 root root 0 Dec 15 19:52 ingress_sbox

ubuntu@swarm02:~$ sudo nsenter --net=/run/docker/netns/1-bwii51jagl ifconfig
br0       Link encap:Ethernet  HWaddr 22:14:63:f9:8b:f5
          inet addr:10.10.10.1  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::b0f7:dfff:fe61:6098/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:8 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:536 (536.0 B)  TX bytes:648 (648.0 B)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

veth2     Link encap:Ethernet  HWaddr 7a:48:1b:9a:ef:ec
          inet6 addr: fe80::7848:1bff:fe9a:efec/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:8 errors:0 dropped:0 overruns:0 frame:0
          TX packets:13 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:648 (648.0 B)  TX bytes:1038 (1.0 KB)

vxlan1    Link encap:Ethernet  HWaddr 22:14:63:f9:8b:f5
          inet6 addr: fe80::2014:63ff:fef9:8bf5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:20 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

ubuntu@swarm02:~$ sudo nsenter --net=/run/docker/netns/1-bwii51jagl bridge fdb show
33:33:00:00:00:01 dev br0 self permanent
01:00:5e:00:00:01 dev br0 self permanent
33:33:ff:61:60:98 dev br0 self permanent
22:14:63:f9:8b:f5 dev vxlan1 vlan 1 master br0 permanent
22:14:63:f9:8b:f5 dev vxlan1 master br0 permanent
02:42:0a:0a:0a:05 dev vxlan1 dst 172.17.1.142 link-netnsid 0 self permanent
02:42:0a:0a:0a:04 dev vxlan1 dst 172.17.1.50 link-netnsid 0 self permanent
7a:48:1b:9a:ef:ec dev veth2 master br0 permanent
7a:48:1b:9a:ef:ec dev veth2 vlan 1 master br0 permanent
02:42:0a:0a:0a:03 dev veth2 master br0
33:33:00:00:00:01 dev veth2 self permanent
01:00:5e:00:00:01 dev veth2 self permanent
33:33:ff:9a:ef:ec dev veth2 self permanent

Finally, running a traffic capture on the veth interface will show us the traffic as it leaves the container, before it is routed into the vxlan tunnel (the ping from earlier was still running):

ubuntu@swarm02:~$ sudo nsenter --net=/run/docker/netns/1-bwii51jagl tcpdump -i veth2 icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on veth2, link-type EN10MB (Ethernet), capture size 262144 bytes
02:04:06.653684 IP 10.10.10.3 > 10.10.10.5: ICMP echo request, id 16, seq 21, length 64
02:04:06.654426 IP 10.10.10.5 > 10.10.10.3: ICMP echo reply, id 16, seq 21, length 64
02:04:06.958298 IP 10.10.10.3 > 10.10.10.4: ICMP echo request, id 20, seq 3, length 64
02:04:06.959198 IP 10.10.10.4 > 10.10.10.3: ICMP echo reply, id 20, seq 3, length 64

 

ingress

The second network that the containers were connected to was the ingress network. Ingress is an overlay network, but it is installed by default when a swarm cluster is initialized. This network is used to provide connectivity when connections are made to containers from the outside world. It is also where the load balancing feature provided by the swarm cluster happens.
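
To see this in action, you can hit the published port of a service on any node in the cluster, even one that is not running a replica; the ingress network forwards the request to one of the replicas (this assumes curl is available on the node and the webapp service from earlier is still running):

ubuntu@swarm-manager:~$ curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080
# should print 200, served by one of the webapp replicas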

The load balancing is handled by IPVS, which runs in a dedicated network space (ingress_sbox) that docker swarm creates by default. We can see its endpoint attached to the ingress network (I used the same web service as before, which publishes port 8080 on the hosts, mapped to port 80 in the containers):

$ docker service create --name webapp --replicas=3 --network my-overlay-network -p 8080:80 akittana/dockerwebapp:1.1
3ncich3lzh9nj3upcir39mxzm

$ docker network inspect ingress
[
    {
        "Name": "ingress",
        "Id": "3kuba8yq3c27p2vwo7hcm987i",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.255.0.0/16",
                    "Gateway": "10.255.0.1"
                }
            ]
        },
        "Internal": false,
        "Containers": {
            "02f1067a00b3a83d5f03cec17265bf7d7925fc6a326cc23cd46c5ab73cf57f20": {
                "Name": "webapp.1.a4s04msunrycihp4pk7hc9kmy",
                "EndpointID": "8f4bcf1fafa1e058d920ccadbec3b0172cf0432759377b442ac2331a388ad144",
                "MacAddress": "02:42:0a:ff:00:06",
                "IPv4Address": "10.255.0.6/16",
                "IPv6Address": ""
            },
            "ingress-sbox": {
                "Name": "ingress-endpoint",
                "EndpointID": "f3030184ba93cd214c77b5811db7a20209bfd6a8daad4b56e7dea00bd022d536",
                "MacAddress": "02:42:0a:ff:00:04",
                "IPv4Address": "10.255.0.4/16",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "256"
        },
        "Labels": {}
    }
]

First, let’s take a look at the docker host (any of the hosts participating in the swarm cluster):

ubuntu@swarm01:~$ sudo iptables -t nat -n -L
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
DOCKER-INGRESS  all  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL
DOCKER     all  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
DOCKER-INGRESS  all  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL
DOCKER     all  --  0.0.0.0/0           !127.0.0.0/8          ADDRTYPE match dst-type LOCAL

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match src-type LOCAL
MASQUERADE  all  --  172.18.0.0/16        0.0.0.0/0
MASQUERADE  all  --  172.26.0.0/16        0.0.0.0/0
MASQUERADE  all  --  172.19.0.0/16        0.0.0.0/0
MASQUERADE  all  --  172.25.0.0/16        0.0.0.0/0
MASQUERADE  all  --  10.0.3.0/24         !10.0.3.0/24

Chain DOCKER (2 references)
target     prot opt source               destination
RETURN     all  --  0.0.0.0/0            0.0.0.0/0
RETURN     all  --  0.0.0.0/0            0.0.0.0/0
RETURN     all  --  0.0.0.0/0            0.0.0.0/0
RETURN     all  --  0.0.0.0/0            0.0.0.0/0

Chain DOCKER-INGRESS (2 references)
target     prot opt source               destination
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:8080 to:172.19.0.2:8080
RETURN     all  --  0.0.0.0/0            0.0.0.0/0

You can see the rule that matches traffic destined to port 8080 and forwards it to 172.19.0.2. We can see that 172.19.0.2 belongs to the ingress-sbox network space if we inspect its interfaces:

ubuntu@swarm01:~$ sudo ls -l /run/docker/netns
total 0
-rw-r--r-- 1 root root 0 Dec 16 16:03 1-3kuba8yq3c
-rw-r--r-- 1 root root 0 Dec 16 16:05 1-bwii51jagl
-rw-r--r-- 1 root root 0 Dec 16 16:05 fb3e041d72b7
-r--r--r-- 1 root root 0 Dec 16 15:54 ingress_sbox

ubuntu@swarm01:~$ sudo nsenter --net=/run/docker/netns/ingress_sbox ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:0a:ff:00:04
          inet addr:10.255.0.4  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:aff:feff:4/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:23 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1854 (1.8 KB)  TX bytes:648 (648.0 B)

eth1      Link encap:Ethernet  HWaddr 02:42:ac:13:00:02
          inet addr:172.19.0.2  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:acff:fe13:2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:24 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1944 (1.9 KB)  TX bytes:648 (648.0 B)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

Docker then uses iptables mangle rules to mark packets destined to port 8080 with a specific value (a firewall mark), which IPVS then uses to load balance across the appropriate containers:

ubuntu@swarm01:~$ sudo nsenter --net=/run/docker/netns/ingress_sbox iptables -t mangle -L -n
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
MARK       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:8080 MARK set 0x102

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
MARK       all  --  0.0.0.0/0            10.255.0.2           MARK set 0x102
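
If ipvsadm is installed on the docker host, you can also look at the resulting IPVS configuration from inside the same network space; you should see a virtual service keyed on that firewall mark (the exact mark value and backend IPs will differ per setup):

ubuntu@swarm01:~$ sudo apt-get install -y ipvsadm
ubuntu@swarm01:~$ sudo nsenter --net=/run/docker/netns/ingress_sbox ipvsadm -ln
# look for an FWM (firewall mark) entry whose backends are the 10.255.0.x addresses of the replicas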

More details on how docker swarm uses iptables and IPVS to load balance to containers are presented in this talk.

docker_gwbridge

Finally, there is the docker_gwbridge network. This is a bridge network with a corresponding interface named docker_gwbridge created on each host participating in the swarm cluster. The docker_gwbridge network provides connectivity to the outside world for traffic originating from the containers in the swarm cluster (for example, if we ping google.com, that traffic goes through the docker_gwbridge network).
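
As a quick check (reusing the capture technique from the bridge network post), you can ping an outside host from one of the swarm containers and run a capture on docker_gwbridge on that node; the ICMP traffic shows up there rather than on the overlay interfaces (the container ID is taken from the earlier example on swarm02):

ubuntu@swarm02:~$ docker exec a1ca9a0d2364 ping google.com
ubuntu@swarm02:~$ sudo tcpdump -ni docker_gwbridge icmp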

I won’t go into details of the internals of this network as this is the same as the bridge networks covered in the previous post.

Summary

When a container is launched in a swarm cluster, it can be attached to three (or more) networks by default. First there is the docker_gwbridge network, which allows containers to communicate with the outside world; then the ingress network, which is used only if the containers need to accept inbound connections from the outside world; and finally the user-created overlay networks that can be attached to containers. An overlay network serves as a shared subnet between containers launched into the same network, over which they can communicate directly (even if they are running on different physical hosts).

We also saw that docker creates separate network spaces by default in a swarm cluster, which help manage the vxlan tunnels used for the overlay networks as well as the load balancing rules for inbound connections to containers.

Links/Resources