First we create three Ubuntu 14.04 hosts with Docker 1.11.1 installed. On the master host, start Consul as the key-value store:
docker run -d -p 8500:8500 -h consul --name consul progrium/consul -server -bootstrap
On every node, point the Docker daemon at the cluster store (on Ubuntu 14.04 this goes in /etc/default/docker; $MASTER_IP is the address of the host running Consul):
DOCKER_OPTS="-H tcp://0.0.0.0:2375 \
-H unix:///var/run/docker.sock \
--cluster-store=consul://$MASTER_IP:8500/network \
--cluster-advertise=eth0:2375"
sudo service docker restart
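After the restart it is worth confirming that the engine actually picked up the cluster store settings; `docker info` reports them. The excerpt below is a sample of what Docker 1.11 prints (the placeholder address is illustrative), so the same grep works against the live command:

```shell
# On a real host:  docker info | grep -i cluster
# Sample excerpt of what Docker 1.11 reports once the options are active
# (<MASTER_IP> stands in for your Consul host's address):
sample_info='Cluster store: consul://<MASTER_IP>:8500/network
Cluster advertise: eth0:2375'

echo "$sample_info" | grep -i 'cluster'
```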
docker network create -d overlay --subnet 10.10.10.0/24 multinet
docker network ls
NETWORK ID          NAME                DRIVER
4a91d51c8352        bridge              bridge
9396e02c30e4        docker_gwbridge     bridge
a188f529878c        host                host
ec752ef8859b        multinet            overlay
3a30f03b6183        none                null
docker network inspect multinet
[
{
"Name": "multinet",
"Id": "ec752ef8859b9a7db88305a6065cc1d85ce04679f5492bebba97171928afcfb4",
"Scope": "global",
"Driver": "overlay",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": {},
"Config": [
{
"Subnet": "10.10.10.0/24"
}
]
},
"Internal": false,
"Containers": {},
"Options": {},
"Labels": {}
}
]
On node-1:
docker run --net multinet --name node1test -d busybox
On node-2:
docker run --net multinet --name node2test -d busybox
Notice the two new interfaces that were added on each host:
ip link
8: docker_gwbridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:41:57:e3:79 brd ff:ff:ff:ff:ff:ff
31: vethdd1ad9b@if30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP mode DEFAULT group default
link/ether f6:a8:e9:12:49:37 brd ff:ff:ff:ff:ff:ff
It takes two to tango on a veth pair, so let's find the peer:
ethtool -S vethdd1ad9b
NIC statistics:
peer_ifindex: 30
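The `peer_ifindex` line is easy to extract programmatically. A small sketch, using the sample output above (on a live host you would substitute the real `ethtool -S` call):

```shell
# Live version:  ethtool -S vethdd1ad9b | awk '/peer_ifindex/ {print $2}'
ethtool_out='NIC statistics:
     peer_ifindex: 30'

peer=$(echo "$ethtool_out" | awk '/peer_ifindex/ {print $2}')
echo "peer ifindex: $peer"
# An interface with index 30 must then exist in some namespace -- as we
# see next, it is eth1 inside the container.
```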
docker exec -it node1test ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
28: eth0@if29: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue
link/ether 02:42:0a:0a:0a:02 brd ff:ff:ff:ff:ff:ff
inet 10.10.10.2/24 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::42:aff:fe0a:a02/64 scope link
valid_lft forever preferred_lft forever
30: eth1@if31: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue
link/ether 02:42:ac:12:00:02 brd ff:ff:ff:ff:ff:ff
inet 172.18.0.2/16 scope global eth1
valid_lft forever preferred_lft forever
inet6 fe80::42:acff:fe12:2/64 scope link
valid_lft forever preferred_lft forever
So there it is: the veth pair connects eth1 in the container (ifindex 30) to vethdd1ad9b on the host (ifindex 31). What is vethdd1ad9b attached to? It turns out it is plugged into the docker_gwbridge:
brctl show
bridge name        bridge id           STP enabled   interfaces
docker0            8000.024264855414   no
docker_gwbridge    8000.02424157e379   no            vethdd1ad9b
eth1 in the container connects to docker_gwbridge so it can reach the outside world. That means eth0 must be connected to our multinet overlay network, carrying traffic to containers on the other Docker hosts.
Also notice that the MTU on eth0 is 1450, which is 50 bytes less than the default of 1500. This is because VXLAN encapsulation adds exactly 50 bytes of overhead on Ethernet: an outer Ethernet header (14 bytes), an outer IPv4 header (20 bytes), a UDP header (8 bytes) and the VXLAN header itself (8 bytes). Setting the inner MTU to 1450 lets the encapsulated packet fit precisely into a standard 1500-byte MTU.
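The arithmetic behind those 50 bytes, as a quick sanity check:

```shell
# VXLAN-over-IPv4 encapsulation overhead on Ethernet:
#   outer Ethernet 14 + outer IPv4 20 + UDP 8 + VXLAN header 8 = 50 bytes
overhead=$((14 + 20 + 8 + 8))
inner_mtu=$((1500 - overhead))
echo "overhead=${overhead} inner_mtu=${inner_mtu}"
# prints: overhead=50 inner_mtu=1450
```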
Let's track eth0 and see where it leads. We know its ifindex is 28.
Docker uses network namespaces to create self-contained sets of interfaces and routing tables, providing dedicated bridges for its various networks (container networking, overlay networks, etc.). Each overlay network therefore gets its own network namespace.
sudo ls -al /var/run/docker/netns
total 0
drwxr-xr-x 2 root root 80 May 11 18:36 .
drwx------ 4 root root 80 May 11 11:08 ..
-r--r--r-- 1 root root 0 May 11 18:36 2-ec752ef885
-r--r--r-- 1 root root 0 May 11 18:36 c146eed489de
We need to create some symlinks so that ip netns works nicely with Docker's namespaces:
sudo mkdir -p /var/run/netns
sudo ln -s /var/run/docker/netns/2-ec752ef885 /var/run/netns/2-ec752ef885
sudo ln -s /var/run/docker/netns/c146eed489de /var/run/netns/c146eed489de
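With more namespaces around, linking them one by one gets tedious; the same can be scripted. A sketch, run here against scratch directories so the loop itself can be tried without root (on a real host you would use /var/run/docker/netns and /var/run/netns):

```shell
# Stand-ins for /var/run/docker/netns and /var/run/netns:
DOCKER_NETNS=$(mktemp -d)
TARGET=$(mktemp -d)
touch "$DOCKER_NETNS/2-ec752ef885" "$DOCKER_NETNS/c146eed489de"

# Link every Docker-managed namespace so `ip netns` can see it:
for ns in "$DOCKER_NETNS"/*; do
  ln -s "$ns" "$TARGET/$(basename "$ns")"
done
ls "$TARGET"
```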
Now we can list all network namespaces:
ip netns list
c146eed489de
2-ec752ef885
In our lab we have only one container running on this host, so one namespace must belong to the container while the other belongs to the overlay bridge. Let's check which namespace the container uses:
docker inspect --format '{{.NetworkSettings.SandboxKey}}' node1test
/var/run/docker/netns/c146eed489de
This means the other namespace is for the overlay network. Let's see what interfaces are in the overlay network namespace:
sudo ip netns exec 2-ec752ef885 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether 92:6c:6f:50:c1:81 brd ff:ff:ff:ff:ff:ff
inet 10.10.10.1/24 scope global br0
valid_lft forever preferred_lft forever
inet6 fe80::a087:15ff:fe08:f5e3/64 scope link
valid_lft forever preferred_lft forever
27: vxlan1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN group default
link/ether ba:40:1f:f9:6c:96 brd ff:ff:ff:ff:ff:ff
inet6 fe80::b840:1fff:fef9:6c96/64 scope link
valid_lft forever preferred_lft forever
29: veth2@if28: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP group default
link/ether 92:6c:6f:50:c1:81 brd ff:ff:ff:ff:ff:ff
inet6 fe80::906c:6fff:fe50:c181/64 scope link
valid_lft forever preferred_lft forever
So here we see br0, which connects containers on the multinet overlay network that live on the same host; vxlan1, which encapsulates and forwards traffic to containers on the other Docker hosts; and veth2. So what is on the other end of veth2?
sudo ip netns exec 2-ec752ef885 ethtool -S veth2
NIC statistics:
peer_ifindex: 28
Aha! There is our ifindex 28! This means veth2 is connected to eth0 inside the node1test container.
So eth0 from our container is connected to the br0 interface inside the overlay bridge network namespace:
sudo ip netns exec 2-ec752ef885 brctl show
bridge name   bridge id           STP enabled   interfaces
br0           8000.926c6f50c181   no            veth2
                                                vxlan1
We can see the VXLAN is plugged into br0 as well. So this is how our container talks to containers on the other hosts.
Now the last question remains: how do the vxlan interfaces know where to send traffic destined for containers on other hosts?
As it turns out, the vxlan1 interface maintains a forwarding database (FDB) of its own, which maps container MAC addresses to the IP addresses of the remote hosts (the VXLAN tunnel endpoints). We can inspect this FDB with:
sudo ip netns exec 2-ec752ef885 bridge fdb show dev vxlan1
ba:40:1f:f9:6c:96 permanent
ba:40:1f:f9:6c:96 vlan 1 permanent
02:42:0a:0a:0a:03 dst 10.0.11.6 self permanent
The IP 10.0.11.6 is the IP of node-2. This is where our node2test container runs, which is also connected to the multinet overlay network. So whose MAC is listed in that rule?
On node-2:
ip a
...
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP group default qlen 1000
link/ether 02:03:bd:c6:f2:7d brd ff:ff:ff:ff:ff:ff
inet 10.0.11.6/20 brd 10.0.15.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::3:bdff:fec6:f27d/64 scope link
valid_lft forever preferred_lft forever
...
The MAC is not the host's. As it turns out, it is the MAC of eth0 in the node2test container:
docker exec -it node2test ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
17: eth0@if18: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue
link/ether 02:42:0a:0a:0a:03 brd ff:ff:ff:ff:ff:ff
inet 10.10.10.3/24 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::42:aff:fe0a:a03/64 scope link
valid_lft forever preferred_lft forever
19: eth1@if20: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue
link/ether 02:42:ac:12:00:02 brd ff:ff:ff:ff:ff:ff
inet 172.18.0.2/16 scope global eth1
valid_lft forever preferred_lft forever
inet6 fe80::42:acff:fe12:2/64 scope link
valid_lft forever preferred_lft forever
This makes sense, since we saw that eth0 in the container is always connected to the overlay network.
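To recap the FDB part: every entry with a dst field maps a remote container's MAC to the VTEP (host) IP that VXLAN should tunnel to. Extracting those mappings from the bridge fdb show output is straightforward; the sketch below uses the sample output from above, and on a live host you would pipe the real command instead:

```shell
# Live version:
#   sudo ip netns exec 2-ec752ef885 bridge fdb show dev vxlan1
fdb='ba:40:1f:f9:6c:96 permanent
ba:40:1f:f9:6c:96 vlan 1 permanent
02:42:0a:0a:0a:03 dst 10.0.11.6 self permanent'

# Keep only the entries that point at a remote VTEP:
echo "$fdb" | awk '$2 == "dst" {print $1 " -> " $3}'
# prints: 02:42:0a:0a:0a:03 -> 10.0.11.6
```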
So where do these rules come from? They are managed by the Docker engine, which exchanges network state with the other Docker engines via Serf, a gossip protocol. The KV store (Consul in our example) is what lets the engines discover each other and form the Serf cluster over which these events are exchanged.