Journey to the Center of the Linux Kernel: Traffic Control, Shaping and QoS

- Introduction

This document describes the Traffic Control subsystem of the Linux Kernel in depth, algorithm by algorithm, and shows how it can be used to manage the outgoing traffic of a Linux system. Throughout the chapters, we will discuss both the theory behind and the usage of Traffic Control, and demonstrate how one can gain a complete control over the packets passing through his system.

a QoS graph

The initial target of this paper was to gain a better control over a small DSL uplink. And it grew over time to cover a lot more than that. 99% of the information provided here can be applied to any type of server, as well as routers, firewalls, etc…

The Traffic Control topic is large and in constant evolution, as is the Linux Kernel. The real credit goes to the developers behind the /net directory of the kernel, and all of the researchers who created and improved all of these algorithms. This is merely an attempt to document some of this work for the masses. Any participation and comments are welcome, in particular if you spotted an inconsistency somewhere. Please email julien[at]linuxwall.info, your messages are always most appreciated.

For the technical discussion, since the LARTC mailing list doesn't exists anymore, try those two:

* Netfilter users mailing list for general discussions * NetDev mailing list] is where magic happens (developers ML)

- Motivation

This article was initially published in the french issue of Gnu/Linux Magazine France #127, in May 2010. GLMF is kind enough to provide a contract that release the content of the article under Creative Common after some time. I extended the initial article quite a bit since, but you can still find the original french version here

My interest for the traffic shaping subsystem of Linux started around 2005, when I decided to host most of the services I use myself. I read the documentation available on the subject (essentially LARTC) but found it incomplete and ended up reading the original research publications and the source code of Linux.

I host most of my Internet services myself, at home or on small end servers (dedibox and so on). This includes web hosting (this wiki, some blogs and a few websites), SMTP and IMAP servers, some FTP servers, XMPP, DNS and so on. French ISPs allow this, but only give you 1Mbps/128KBps of uplink, which corresponds to the TCP Acks rate necessary for a 20Mbps downlink.

1Mbps is enough for most usage, but without traffic control, any weekly backup to an external location fills up the DSL link and slows down everyone on the network. During that time, both the visitors of this wiki and my wife chatting on skype will experience high latency. This is not acceptable, because the priority of the weekly backup is a lot lower than the two others. Linux give you the flexibility to shape the network traffic and use all of your bandwidth efficiently, without penalizing real-time applications. But this comes with a price, and the learning curve to implement an efficient traffic control policy is quite steep. This document provides an accurate and comprehensive introduction to the most used QoS algorithms, and gives you the tools to implement and understand your own policy.

- The basics of Traffic Control

In the Internet world, everything is packets. Managing an network means managing packets: how they are generated, router, transmitted, reorder, fragmented, etc… Traffic Control works on packets leaving the system. It doesn't, initially, have as an objective to manipulate packets entering the system (although you could do that, if you really want to slow down the rate at which you receive packets). The Traffic Control code operates between the IP layer and the hardware driver that transmits data on the network. We are discussing a portion of code that works on the lower layers of the network stack of the kernel. In fact, the Traffic Control code is the very one in charge of constantly furnishing packets to send to the device driver.

It means that the TC module, the packet scheduler, is permanently activate in the kernel. Even when you do not explicitly want to use it, it's there scheduling packets for transmission. By default, this scheduler maintains a basic queue (similar to a FIFO type queue) in which the first packet arrived is the first to be transmitted.

At the core, TC is composed of queuing disciplines, or qdisc, that represent the scheduling policies applied to a queue. Several types of qdisc exist. If you are familiar with the way CPU schedulers work, you will find that some of the TC qdisc are similar. We have FIFO, FIFO with multiple queues, FIFO with hash and round robin (SFQ). We also have a Token Bucket Filter (TBF) that assigns tokens to a qdisc to limit it flow rate (no token = no transmission = wait for a token). This last algorithm was then extended to a hierarchical mode called HTB (Hierarchical Token Buket). And also Quick Fair Queuing (QFQ), Hierarchical Fair Service Curve (HFSC), Random Early Detection (RED), etc….

For a complete list of algorithm, check out the source code at kernel.org.

- First contact

Let's skip the theory for now and start with an easy example. We have a web server on which we would like to limit the flow rate of packets leaving the server. We want to fix that limit at 200 kilo-bits per seconds (25 KB/s).

This set up is fairly simple, and we need three things:

a Netfilter rule to mark the packets that we want to limit
a Traffic Control policy
a Filter to bind the packets to the policy

- Netfilter MARK

Netfilter can be used to interact directly with the structure representing a packet in the kernel. This structure, the sk_buff, contains a field called “__u32 nfmark” that we are going to modify. TC will then read that value to select the destination class of a packet.

The following iptables rule will apply the mark '80' to outgoing packets (OUTPUT chain) sent by the web server (TCP source port is 80).

# iptables -t mangle -A OUTPUT -o eth0 -p tcp --sport 80 -j MARK --set-mark 80

We can control the application of this rule via the netfilter statistics:

# iptables -L OUTPUT -t mangle -v
Chain OUTPUT (policy ACCEPT 74107 packets, 109M bytes)
 pkts bytes target prot opt in  out  source   destination
73896  109M MARK   tcp  --  any eth0 anywhere anywhere    tcp spt:www MARK xset 0x50/0xffffffff

<note tip>You probably noticed that the rule is located in the mangle table. We will go back to that a little bit later.</note>

- Two classes in a tree

To manipulate TC policies, we need the /sbin/tc binary from the **iproute** package (aptitude install iproute).

<note important>The iproute package must match your kernel version. Your distribution's package manager will normally take care of that.</note>

We are going to create a tree that represents our scheduling policy, and that uses the HTB scheduler. This tree will contain two classes: one for the marked traffic (TCP sport 80), and one for everything else.

# tc qdisc add dev eth0 root handle 1: htb default 20
# tc class add dev eth0 parent 1:0 classid 1:10 htb rate 200kbit ceil 200kbit prio 1 mtu 1500
# tc class add dev eth0 parent 1:0 classid 1:20 htb rate 824kbit ceil 1024kbit prio 2 mtu 1500

The two classes are attached to the root. Each class has a guaranteed bandwidth (rate value) and an opportunistic bandwidth (ceil value). If the totality of the bandwidth is not used, a class will be allowed to increased its flow rate up to the ceil value. Otherwise, the rate value is applied. It means that the sum of the rate values must correspond to the total bandwidth available.

In the previous example, we consider the total upload bandwidth to be 1024kbits/s, so class 10 (web server) gets 200kbits/s and class 20 (everything else) gets 824 kbits/s.

<note tip>TC can use both kbit and kbps notations, but they don't have the same meaning. kbit is the rate in kilo-bits per seconds, and kbps is in kilo-bytes per seconds. In this article, I will the kbit notation only.</note>

- Connecting the marks to the tree

We now have on one side a traffic shaping policy, and on the other side packets marking. To connect the two, we need a filter.

A filter is a rule that identify packets (handle parameter) and direct them to a class (fw flowid parameter). Since several filters can work in parallel, they can also have a priority. A filter must be attached to the root of the QoS policy, otherwise, it won't be applied.

# tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 80 fw flowid 1:10

We can test the policy using a simply client/server setup. Netcat is very useful for such testing. Start a listening process on the server that applies the policy using:

# nc -l -p 80 < /dev/zero

And connect to it from another machine using :

# nc 192.168.1.1 80 > /dev/null

The server process will send zeros (taken from /dev/zero) as fast as it can, and the client will receive them and throw them away, as fast as it can.

Using iptraf to monitor the connection, we can supervise the bandwidth usage (bottom right corner).

The value is 199.20kbits/s, which is close enough to the 200kbits/s target. The precision of the scheduler depends on a few parameters that we will discuss later on.

Any other connection from the server that uses a source port different from TCP/80 will have a flow rate between 824kbits/s and 1024kbits/s (depending on the presence of other connections in parallel).

- Twenty Thousand Leagues Under the Code

Now that we enjoyed this first contact, it is time to go back to the fundamentals of the Quality of Service of Linux. The goal of this chapter is to dive into the algorithms that compose the traffic control subsystem. Later on, we will use that knowledge to build our own policy.

The code of TC is located in the net/sched directory of the sources of the kernel. The kernel separates the flows entering the system (ingress) from the flows leaving it (egress). And, as we said earlier, it is the responsibility of the TC module to manage the egress path.

The illustration below show the path of a packet inside the kernel, where it enters (ingress) and where it leaves (egress). If we focus on the egress path, a packet arrives from the layer 4 (TCP, UDP, …) and then enter the IP layer (not represented here). The Netfilter chains OUTPUT and POSTROUTING are integrated in the IP layer and are located between the IP manipulation functions (header creation, fragmentation, …). At the exit of the NAT table of the POSTROUTING chain, the packet is transmitted to the egress queue, and this is where TC starts its work.

Almost all the devises use a queue to schedule the egress traffic. The kernel possesses algorithms that can manipulate these queues, they are the queuing disciplines (FIFO, SFQ, HTB, …). The job of TC is to apply these queuing disciplines to the egress queue in order to select a packet for transmission.

TC works with the sk_buff] structure that represents a packet in the kernel. It doesn't manipulate the packet data directly. sk_buff is a shared structure accessible anywhere in the kernel, thus avoiding unnecessary duplication of data. This method is a lot more flexible and a lot faster because sk_buff contains all of the necessary information to manipulate the packet, the kernel therefore avoids copies of headers and payloads from different memory areas that would ruin the performances.

On a regular basis, the packet scheduler will wake up and launch any pre-configured scheduling algorithms to select a packet to transmit.

Most of the work in launch by the function dev_queue_xmit from net/core/dev.c, that only take a sk_buff structure as input (this is enough, sk_buff contains everything needed, such as skb→dev, a pointer to the output NIC).

dev_queue_xmit makes sure the packet is ready to be transmitted on the network, that the fragmentation is compatible with the capacity of the NIX, that the checksums are calculated (if this is handled by the NIC, this step is skipped). Once those controls done, and if the equipment has a queue in skb→dev→qdisc, then the sk_buff structure of the packet is added to this queue (via the enqueue function) and the qdisc_run function is called to select a packet to send.

This means that the packet that has just been added to the NIC's queue might not be the one immediately transmitted, but we know that it is ready for subsequent transmission the moment it is added to the queue.

To each device is attached a root queuing discipline. This is what we defined earlier when creating the root qdisc to limit the flow rate of the web server:

# tc qdisc add dev eth0 root handle 1: htb default 20

This command means “attach a root qdisc identified by id 1 to the device eth0, use htb as a scheduler and send everything to class 20 by default”.

We will find this qdisc at the pointer skb→dev→qdisc. The enqueue and dequeue functions are also located there, respectively in skb→dev→qdisc→enqueue() and skb→dev→qdisc→dequeue(). This last dequeue function will be in charge of forwarding the sk_buff structure of the packet selected for transmission to the NIC's driver.

The root qdisc can have leaves, known as classes, that can also possess their own qdisc, thus constructing a tree.

- Classless Disciplines

This is a pretty long way down. For those who wishes to deepen the subject, I recommend reading « Understanding Linux Network Internals », from Christian Benvenuti at O'reilly.

We now have a dequeue function which role is to select the packet to send to the network interface. To do so, this function calls scheduling algorithms that we are now going to study. There are two types of algorithms: Classless and Classful. The classful algorithms are composed of qdisc that can contain classes, like we did in the previous example with HTB. In opposition, classless algorithms cannot contain classes, and are (supposedly) more simple.

- PFIFO_FAST

Let's start with a small one. pfifo_fast is the default scheduling algorithm used when no other is explicitly defined. In other words, it's the one used on 99.9% of the Linux systems. It is classless and a tiny bit more evolved than a basic FIFO queue, since it implements 3 bands working in parallel. These bands are numbered 0, 1 and 2 and emptied sequentially: while 0 is not empty, 1 will not be processed, and 2 will be the last. So we have 3 priorities: the packets in queue 0 will be sent before the packets in queue 2.

The kernel uses the Type of Service field (the 8 bits fields from bit 8 to bit 15 of the IP header, see below) to select the destination band of a packet.

    0                   1                   2                   3   
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|      Fragment Offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live |    Protocol   |         Header Checksum       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Source Address                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Destination Address                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

              Example Internet Datagram Header from RFC 791

This algorithm is defined in **net/sched/sch_generic.c** and represented in the diagram below.

dia source

The length of a band, representing the number of packet it can contain, is set to 100 by default and defined outside of TC. It's a parameter that can be set using ifconfig, and visualized in /sys:

# cat /sys/class/net/eth0/tx_queue_len
1000

Once the default value of 1000 is passed, TC will start dropping packets. This should very rarely happen because TCP makes sure to adapt its sending speed to the capacity of both systems participating in the communication (that's the role of the TCP slow start). But experiments showed that increased that limit to 10,000, or even 100,000, in some very specific cases of gigabits networks can improve the performances. I wouldn't recommend touching this value unless you really now what you are doing. Increasing a buffer size to a too large value can have very negative side effect on the quality of a connection. Jim Gettys called that bufferbloat, and we will talk about it in the last chapter.

Because this algorithm is classless, it is not possible to plug another scheduler after it.

- SFQ - Stochastic Fairness Queuing

Stochastic Fairness Queuing is an algorithm that shares the bandwidth without giving any privilege of any sort. The sharing is fair by being stochastic, or random if you prefer. The idea is to take a fingerprint, or hash of the packet based on its header, and to use this hash to send the packet to one of the 1024 buckets available. The buckets are send emptied in a round-robin fashion.

The main advantage of this method is that no packet will have the priority over another one. No connexion can take over the other, and everybody has to share. The repartition of the bandwidth across the packets will almost always be fair, but there are some minor limitation. The main limitation is that the hashing algorithm might produce the same result for several packets, and thus send them to the same bucket. One bucket will then fill up faster than the other, breaking the fairness of the algorithm. To mitigate this, SFQ will modify the parameters of its hashing algorithm on a regular basis, by default every 10 seconds.

The diagram below shows how the packets are processed by SFQ, from entering the scheduler at the top, to being dequeued and transmitted at the bottom. The source code is available in net/sched/sch_sfq.c. Some variables are hard coded in the source code: * SFQ_DEFAULT_HASH_DIVISOR gives the number of buckets and default to 1024 * SFQ_DEPTH defines the depth of each bucket, and defaults to 128 packets

#define SFQ_DEPTH		128 /* max number of packets per flow */
#define SFQ_DEFAULT_HASH_DIVISOR 1024

These two value determine the maximum number of packets that can be queued at the same time. 1024 * 128 = 131072 packets can be waiting in the SFQ scheduler before it starts dropping packets. But this theoritical number is very unlikely to ever happen, because it would require of the algorithm to fill up every single bucket evenly before starting to dequeue or drop packets.

The tc command line lists the options that can be fed to SFQ:

# tc qdisc add sfq help
Usage: ... sfq [ limit NUMBER ] [ perturb SECS ] [ quantum BYTES ]

limit: is the size of the buckets. It can be reduced below SFQ_DEPTH but not increased past it. If you try to put a value above 128, TC will simply ignore it.
perturb: is the frequency, in seconds, at which the hashing algorithm is updated.
quantum: represents the maximum amount of bytes that the round-robin dequeue process will be allow to dequeue from a bucket. At a bare minimum, this should be equal to the MTU of the connexion (1500 bytes on ethernet). Imagine that we set this value to 500 bytes, all packets bigger than 500 bytes would not be dequeued by SFQ, and would stay in their bucket. We would, very quickly, arrive at a point where all buckets are blocked and no packets are transmitted anymore.

DIA source

- SFQ Hashing algorithm

The hash used by SFQ is computed by the sfq_hash function (see the source code), and take into account the source and destination IP addresses, the layer 4 protocol number (still an IP header), and the TCP ports (if the IP packet is not fragmented). Those parameters are mixed with a random number regenerated every 10 seconds (the perturb value defines that).

Let's walk through a simplified version of the algorithm steps. The connexion has the following parameters:

source IP: 126.255.154.140
source port: 8080
destination IP: 175.112.129.215
destination port: 2146

/*IP source address in hexadecimal*/
h1 = 7efffef0 

/*IP Destination address in hexadecimal*/
h2 = af7081d7 

/* 06 is the protocol number for TCP (bits 72 to 80 of the IP header)
 We perform a XOR between the variable h2 obtained in the previous step
and the TCP protocol number
*/
h2 = h2 XOR 06 

/* if the IP packet is not fragmented, we include the TCP ports in the hash*/
/* 1f900862  is the hexadecimal representation of the source destination ports
We perform another XOR with this value and the h2 variable
*/
h2 = h2 XOR 1f900862 

/* And finally, we use the Jenkins algorithm with some additional "golden numbers" 
This jhash function is defined somewhere else in the kernel source code*/
h = jhash(h1, h2, perturbation)

The result obtained is a hash value of 32 bits that will be used by SFQ to select the destination bucket of the packet. Because the perturb value is regenerated every 10 seconds, the packets from a reasonably long connexion will be directed to different buckets over time.

But this also means that SFQ might break the sequencing of the sending process. Because if two packets from the same connexion are placed in two differents buckets, it is possible that the second bucket will be processed before the first one, and therefore sending the second packet before the first one. This is not a problem for TCP, which uses sequence number to reorder the packets when they reach their destination, but for UDP it might be.

For example, imagine that you have a syslog daemon sending logs to a central syslog server. With SFQ, it might happen that a log process arrives before the log that precedes it. If you don't like it, use TCP.

TC gives us some more flexibility on the hashing algorithm. We can, for example, modify the fields considered by the hashing process. This can be used using TC filters, as follow:

# tc qdisc add dev eth0 root handle 1: sfq perturb 10 quantum 3000 limit 64
# tc filter add dev eth0 parent 1:0 protocol ip handle 1 flow hash keys src,dst divisor 1024

The filter above simply the hash to keep only the source and destination IP addresses as input parameters. The divisor value is the number of buckets, as seen before. We could, then, create a SFQ scheduler that works with 10 buckets only and considers the IP addresses of the packets in the hash.

This discipline is classless as well, which means we cannot direct packet to another scheduler when they leave SFQ. Packets are transmitted to the network interface only.

- Classful Disciplines

- TBF - Token Bucket Filter

Until now, we looked at algorithm that do not allow to control the amount of bandwidth. SFQ and PFIFO_FAST give the ability to smoothen the traffic, and even to prioritize it a bit, but not to control its throughput.

In fact, the main problem when controlling the bandwidth is to find an efficient accounting method. Because counting in memory is extremely difficulty and costly to do in real-time, computer scientists took a different approach here.

Instead of counting the packets (or the bits transmitted by the packets, it's the same thing), the Token Bucket Filter algorithm sends, at a regular interval, a token into a bucket. Now this is disconnected from the actual packet transmission, but when a packet enters the scheduler, it will consume a certain number of tokens. If there is not enough tokens for it to be transmitted, the packet waits.

Until now, with SFQ and PFIFO_FAST, we were talking about packets, but with TBF we now have to look into the bits contained in the packets. Let's take an example: a packet carrying 8000 bits (1KB) wishes to be transmitted. It enters the TBF scheduler and TBF control the content of its bucket: if there are 8000 tokens in the bucket, TBF destroys them and the packet can pass. Otherwise, the packet waits until the bucket has enough tokens.

The frequency at which tokens are added into the bucket determine the transmission speed, or rate. This is the parameter at the core of the TBF algorithm, shown in the diagram below.

DIA source

Another particularity of TBF is to allow bursts. This is a natural side effect of the algorithm: the bucket fills up at a continuous rate, but if no packets are being transmitted for some time, the bucket will get completely full. Then, the next packets to enter TBF will be transmitted right away, without having to wait and without having any limit applied to them, until the bucket is empty. This is called a burst, and in TBF the burst parameter defines the size of the bucket.

So with a very large burst value, say 1,000,000 tokens, we would let a maximum of 83 fully loaded packets (roughly 124KBytes if they all carry their maximum MTU) traverse the scheduler without applying any sort of limit to them.

To overcome this problem, and provides better control over the bursts, TBF implements a second bucket, smaller and generally the same size as the MTU. This second bucket cannot store large amount of tokens, but its replenishing rate will be a lot faster that the one of the big bucket. This second rate is called peakrate and it will determine the maximum speed of a burst.

Let's take a step back and look at those parameters again. We have:

peakrate > rate : the second bucket fills up faster than the main one, to allow and control bursts. If the peakrate value is infinite, then TBF behaves as if the second bucket didn't exist. Packets would be dequeued according to the main bucket, at the speed of rate.
burst > MTU : the size of the first bucket is a lot larger than the size of the second bucket. If the burst is equal to MTU, then peakrate is equal to rate and there is no burst.

So, to summarize, when everything works smoothly packets are enqueued and dequeued at the speed of rate. If tokens are available when packets enter TBF, those packets are transmitted at the speed of peakrate until the first bucket is empty. This is represented in the diagram above, and in the source code at net/sched/sch_tbf.c, the interesting function being tbf_dequeue.c.

The configurable options of the TBF scheduler are listed in TC:

# tc qdisc add tbf help
Usage: ... tbf limit BYTES burst BYTES[/BYTES] rate KBPS [ mtu BYTES[/BYTES] ]
        [ peakrate KBPS ] [ latency TIME ] [ overhead BYTES ] [ linklayer TYPE ]

We recognize burst, rate, mtu and peakrate that we discussed above. The limit parameter is the size of the packet queue (see diagram above). latency represents another way of calculating the limit parameter, by setting the maximum time a packet can wait in the queue (the size of the queue is then derivated from it, the calculation includes all of the values of burst, rate, peakrate and mtu). overhead and linklayer are two other parameters whose story is quite interesting. Let's take a look at those now.

- DSL and ATM, the Ox that believed to be Frog

If you have ever read Jean De La Fontaine, you probably know the story of the “The frog that wanted to be as big as an ox”. Well, in our case, it's the opposite, and your packets might not be as small as they think they are.

If most local networks use Ethernet, up until recently a lot of communications inside (at least in Europe) were done over the ATM protocol. Nowadays, ISP are moving towarding all-over-ip, but ATM is still around. The particularity of ATM is to split large ethernet packets into much smaller ones, called “cells”. A 1500 bytes ethernet packet would be split into ~30 smaller ATM cells of just 53 bytes each. And from those 53 bytes, only 48 are from the original packet, the rest is occupied by the ATM headers.

So where is the problem ? Considering the following network topology.

The QoS box is in charge of performing the packet scheduling before transmitting it to the modem. The packets are then split by the modem into ATM cells. So our initial 1.5KB ethernet packets is split into 32 ATM cells, for a total size of 32 * 5 bytes of headers per cell + 1500 bytes of data = (32*5)+1500 = 1660 bytes. 1660 bytes is 10.6% bigger than 1500. When ATM is used, we lose 10% of bandwidth compared to an ethernet network (this is an estimate that depend on the average packet size, etc…).

If TBF doesn't know about that, and calculates its rate based on the sole knowledge of the ethernet MTU, then it will transmit 10% more packets than the modem can transmit. The modem will start queuing, and eventually dropping, packets. The TCP stacks will have to adjust their speed, traffic gets erratic and we lose the benefit of TC as a traffic shaper.

Jesper Dangaard Brouer did his Master Thesis on this topic, and wrote a few patchs for the kernel and TC. These patchs implement the overhead and linklayer parameter, and can be used to inform the TBF scheduler of the type of link to account for.

* overhead represents represent the ADSL encapsulation overhead. Figuring out what the ADSL encapsulation overhead for the specific line is difficult. One should assume a minimum of 10 bytes overhead. adsl encapsulation

* linklayer defines the type of link, either ethernet or {atm,adsl}. atm and adsl are the same thing and represent a 5 bytes header overhead

We can use these parameter to fine tune the creation of a TBF scheduling discipline:

# tc qdisc add dev eth0 root tbf rate 1mbit burst 10k latency 30ms linklayer atm

# tc -s qdisc show dev eth0
qdisc tbf 8005: root rate 1000Kbit burst 10Kb lat 30.0ms
 Sent 738 bytes 5 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0

And by setting the linklayer to “atm”, we take into account an overhead of 5 bytes per cell later in the transmission, therefore preventing the modem from buffering the packets.

- The limits of TBF

TBF gives a pretty accurate control over the bandwidth assigned to a qdisc. But it also imposes that all packets pass through a single queue. If a big packet is blocked because there is not enough tokens to send it, smaller packets - that could potentially be sent instead - are blocked behind it as well. This is the case represented in the diagram above, where packet #2 is stuck behind packet #1. We could optimize the bandwidth usage by allowing the smaller packet to be sent instead of the bigger one. We would, however, fall into the same problem of reordering packets that we discussed with the SFQ algorithm.

The other solution would be to give more flexibility to the traffic shaper, declare several TBF queues in parallel, and route the packets to one or the other using filters. We could also allow those parallel queues to borrow tokens from each other, in case one is idle and the other one is not.

We just prepared the ground for classful qdisc, and the Hierarchical Token Bucket.

- HTB - Hierarchical Token Bucket

The Hierarchical Token Bucket (HTB) is an improved version of TBF that introduces the notion of classes. Each class is, in fact, a TBF-like qdisc, and classes are linked together as a tree, with a root and leaves. HTB introduces a number of features to improved the management of bandwidth, such as a the priority between classes, a way to borrow bandwidth from another class, or the possibility to plug another qdisc as an exit point (a SFQ for example).

Let's take a simple example, represented in the diagram below.

htb_en.dia.zip

The tree is created with the commands below:

# tc qdisc add dev eth0 root handle 1: htb default 20
# tc class add dev eth0 parent 1:0 classid 1:10 htb rate 200kbit ceil 400kbit prio 1 mtu 1500
# tc class add dev eth0 parent 1:0 classid 1:20 htb rate 200kbit ceil 200kbit prio 2 mtu 1500

HTB uses a similar terminology than TBF and SFQ:

burst is identical to the burst of TBF: it's the size of the token bucket
rate is the speed at which token at generated and put in the bucket, the speed of the leaf, like in TBF
quantum is similar to the quantum discussed in SFQ, it's the amount of bytes to serve from the leaf at once.

The new parameters are ceil and cburst. Let us walk through the tree to see how they work.

In the example above, we have a root “qdisc handle 1” and two leaves “qdisc handle 10” and “qdisc handle 20”. The root will apply filters to decide where a packet should be directed, we will discuss those later, and by default packets are sent to leaf #20 (default to 20).

The leaf #10 has a rate value of 200kbits/s, a ceil value of of 400kbibts/s (which means it can borrow 200kbits/s more that its rate) and a priority (prio) of 1.

The leaf #20 has a rate value of 200kbits/s, a ceil value of 200kbbits/s (which means it cannot borrow anything, rate == ceil) and a priority of 2.

Each HTB leaf will, at any point, have one of the three following status:

HTB_CANT_SEND : the class can neither send nor borrow, no packets are allowed to leave the class
HTB_MAY_BORROW : the class cannot send using its own tokens, but can try to borrow from another class
HTB_CAN_SEND : the class can send using its own tokens

Imagine a group of packets that enter TC and are marked with the flag #10, and therefore are directed to leaf #10. The bucket for leaf #10 does not contain enough tokens to let the first packets pass, so it will try to borrow some from its neighbor leaf #20. The quantum value of leaf #10 is set to the MTU (1500 bytes), which means the maximal amount of data that leaf #10 will try to send is 1500 bytes. If packet#1 is 1400 bytes large, and the bucket in leaf #10 has enough tokens for 1000 bytes, then the leaf will try to borrow the remaining 400 bytes from its neightbor leaf #20.

The quantum is the maximal amount of bytes that a leaf will try to send at once. The closer the value is from the MTU, the more accurate the scheduling will be, because we reschedule after every 1500 bytes. And the larger the value of quantum will be, the more a leaf will be privileged: it will be allowed to borrow more tokens from its neighbor. But of course, since the total amount of tokens in the tree is not unlimited, if a token is borrowed from a leaf, another leaf cannot use it anymore. Therefore, the bigger the value of quantum is, the more a leaf is able to steal from its neighbor. This is tricky because those neighbors might very well have packets to send as well.

When configuring TC, we do not manipulate the value of quantum directly. There is an intermediary parameter called r2q that calculates the quantum automatically based on the rate. quantum = rate / r2q. By default, r2q is set to 10, so for a rate of 200kbits, quantum will have a value of 20kbits.

For very small or very large bandwidth, it is important to tune r2q properly. If r2q is too large, too many packets will leave a queue at once. If r2q is too small, no enough packets are sent.

<note>One important detail is that r2q is set on the root qdisc once and for all. It cannot be configured for each leaf separately.</note>

TC offer the following configuration options for HTB:

Usage: ... qdisc add ... htb [default N] [r2q N]
 default  minor id of class to which unclassified packets are sent {0}
 r2q      DRR quantums are computed as rate in Bps/r2q {10}
 debug    string of 16 numbers each 0-3 {0}

... class add ... htb rate R1 [burst B1] [mpu B] [overhead O]
                      [prio P] [slot S] [pslot PS]
                      [ceil R2] [cburst B2] [mtu MTU] [quantum Q]
 rate     rate allocated to this class (class can still borrow)
 burst    max bytes burst which can be accumulated during idle period {computed}
 mpu      minimum packet size used in rate computations
 overhead per-packet size overhead used in rate computations
 linklay  adapting to a linklayer e.g. atm
 ceil     definite upper class rate (no borrows) {rate}
 cburst   burst but for ceil {computed}
 mtu      max packet size we create rate map for {1600}
 prio     priority of leaf; lower are served first {0}
 quantum  how much bytes to serve from leaf at once {use r2q}

As you can see, we are now familiar to all of those parameters. If you just jumped to this section without reading about SFQ and TBF, please read those chapters for a detailed explanation of what those parameters do.

<note important>Remember that, when configuring leaves in HTB, the sum of the rate of the leaves cannot be higher than the rate of the root. It makes sense, right ?</note>

- Hysteresis and HTB

Hysteresis. If this barbarian word is not familiar to you, as it wasn't to me, here is how wikipedia defines it: “Hysteresis is the dependence of a system not just on its current environment but also on its past.”

Hysteresis is a side effect introduced by an optimization of HTB. In order to reduce the load on the CPU, HTB initially did not recalculate the content of the bucket often enough, therefore allowing some classes to consume more tokens that they actually held, without borrowing.

The problem was corrected and a parameter introduced to allow or block the usage of estimate in HTB calculation. The kernel developers kept the optimization feature simply because it can prove useful in high traffic networks, where recalculating the content of the bucket each time is simply not doable.

But in most cases, this optimization is simply deactivated, as shown below:

# cat /sys/module/sch_htb/parameters/htb_hysteresis
0

- CoDel

http://queue.acm.org/detail.cfm?id=2209336

- FQ_CoDel

- HFSC - Hierarchical Fair Service Curve

http://nbd.name/gitweb.cgi?p=openwrt.git;a=tree;f=package/qos-scripts/files;h=71d89f8ad63b0dda0585172ef01f77c81970c8cc;hb=HEAD

- QFQ - Quick Fair Queueing

http://www.youtube.com/watch?v=r8vBmybeKlE

- RED - Random Early Detection

- CHOKe - CHOose and {Keep,Kill}

- Shaping the traffic on the Home Network

Home networks are tricky to shape, because everybody wants the priority and it's difficult to predetermine a usage pattern. In this chapter, we will build a TC policy that answer general needs. Those are:

Low latency. The uplink is only 1.5Mbps and the latency shouldn't be more than 30ms under high load. We can tune the buffers in the qdisc to ensure that our packets will not stay in a large queue for 500ms waiting to be processed
High UDP responsiveness, for applications like Skype and DNS queries
Guarantied HTTP/s bandwidth, half of the uplink is dedicated to the HTTP traffic (although, other classes can borrow from it) to ensure that web browsing, probably 80% of a home network usage, is smooth and responsive
TCP ACKs and SSH traffic get higher priority. In the age of Netflix and HD VoD, it's necessary to ensure fast download speed. And for that, you need to be able to send TCP ACKs as fast as possible. This is why those packets get a higher priority than the HTTP traffic.
A general class for everything else.

This policy is represented in the diagram below. We will use PFIFO_FAST and SFQ termination qdisc once we exit HTB to perform some additional scheduling (and prevent a single HTTP connection from eating all of the bandwidth, for example).

DIA source

The script that generates this policy is available on github via the icon below, with comments to help you follow through.

get the bash script from github

Below is one of the section, in charge of the creation of the class for SSH. I have replaced the variables with their value for readability.

# SSH class: for outgoing connections to 
# avoid lag when somebody else is downloading
# however, an SSH connection cannot fill up 
# the connection to more than 70%

echo "#---ssh - id 300 - rate 160 kbit ceil 1120 kbit"
/sbin/tc class add dev eth0 parent 1:1 classid 1:300 htb \
    rate 160kbit ceil 1120kbit burst 15k prio 3

# SFQ will mix the packets if there are several 
# SSH connections in parallel
# and ensure that none has the priority

echo "#--- ~ sub ssh: sfq"
/sbin/tc qdisc add dev eth0 parent 1:300 handle 1300: \
    sfq perturb 10 limit 32

echo "#--- ~ ssh filter"
/sbin/tc filter add dev eth0 parent 1:0 protocol ip \
    prio 3 handle 300 fw flowid 1:300

echo "#--- ~ netfilter rule - SSH at 300"
/sbin/iptables -t mangle -A POSTROUTING -o eth0 -p tcp 
    --tcp-flags SYN SYN -dport 22 -j CONNMARK \
    --set-mark 300

The first rule is the definition of the HTB class, the leaf. I connects back to its parent 1:1, defines a rate of 160kbit/s and can use up to 1120kbit/s by borrowing the difference from other leaves. The burst value is set to 15k, with is 10 full packets with a MTU of 1500 bytes.

The second rule defines a SFQ qdisc connected to the HTB one above. That means that once packets have passed the HTB leaf, they will pass through a SFQ leaf before being transmitted. The SFQ will ensure that multiple parallel connections are mixed before being transmitted, and that one connection cannot eat the whole bandwidth. We limit the size of the SFQ queue to 32 packets, instead of the default of 128.

The come the TC filter in the third rule. This filter will check the handle of each packet, or, to be more accurate, the value of nf_mark in the sk_buff representation of the packet in the kernel. Using this mark, the filter will direct SSH packet to the HTB leaf above. <note> Even though this rule is located in the SSH class block for clarity, you might have noticed that the filter has the root qdisc for parent (parent 1:0). Filters are always attached to the root qdisc, and not to the leaves. That makes sense, because the filtering must be done at the entrance of the traffic control layer.</note>

And finally, the fourth rule is the iptables rule that applies a mark to SYN packets leaving the gateway (connection establishments). Why SYN packets only ? To avoid performing complex matching on all the packets of all the connection. We will rely on netfilter's capability to maintain stateful information to propagate a mark placed on the first packet of the connection to all of the other packets. This is done by the following rule at the end of the script:

echo "#--- ~ propagating marks on connections"
iptables -t mangle -A POSTROUTING -j CONNMARK --restore-mark

Let us now load the script on our gateway, and visualise the qdisc created.

# /etc/network/if-up.d/lnw_gateway_tc.sh start
~~~~ LOADING eth0 TRAFFIC CONTROL RULES FOR ramiel ~~~~

#-cleanup
RTNETLINK answers: No such file or directory
#-define a HTB root qdisc
#--uplink - rate 1600 kbit ceil 1600 kbit
#---interactive - id 100 - rate 160 kbit ceil 1600 kbit
#--- ~ sub interactive: pfifo
#--- ~ interactive filter
#--- ~ netfilter rule - all UDP traffic at 100
#---tcp acks - id 200 - rate 320 kbit ceil 1600 kbit
#--- ~ sub tcp acks: pfifo
#--- ~ filtre tcp acks
#--- ~ netfilter rule for TCP ACKs will be loaded at the end
#---ssh - id 300 - rate 160 kbit ceil 1120 kbit
#--- ~ sub ssh: sfq
#--- ~ ssh filter
#--- ~ netfilter rule - SSH at 300
#---http branch - id 400 - rate 800 kbit ceil 1600 kbit
#--- ~ sub http branch: sfq
#--- ~ http branch filter
#--- ~ netfilter rule - http/s
#---default - id 999 - rate 160kbit ceil 1600kbit
#--- ~ sub default: sfq
#--- ~ filtre default
#--- ~ propagating marks on connections
#--- ~ Mark TCP ACKs flags at 200

Traffic Control is up and running

# /etc/network/if-up.d/lnw_gateway_tc.sh show

 ---- qdiscs details -----
qdisc htb 1: root refcnt 2 r2q 40 default 999 direct_packets_stat 0 ver 3.17
qdisc pfifo 1100: parent 1:100 limit 10p
qdisc pfifo 1200: parent 1:200 limit 10p
qdisc sfq 1300: parent 1:300 limit 32p quantum 1514b flows 32/1024 perturb 10sec 
qdisc sfq 1400: parent 1:400 limit 32p quantum 1514b flows 32/1024 perturb 10sec 
qdisc sfq 1999: parent 1:999 limit 32p quantum 1514b flows 32/1024 perturb 10sec 

 ---- qdiscs statistics --
qdisc htb 1: root refcnt 2 r2q 40 default 999 direct_packets_stat 0
 Sent 16776950 bytes 125321 pkt (dropped 4813, overlimits 28190 requeues 0) 
 rate 0bit 0pps backlog 0b 0p requeues 0 
qdisc pfifo 1100: parent 1:100 limit 10p
 Sent 180664 bytes 1985 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 0b 0p requeues 0 
qdisc pfifo 1200: parent 1:200 limit 10p
 Sent 5607402 bytes 100899 pkt (dropped 4813, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 0b 0p requeues 0 
qdisc sfq 1300: parent 1:300 limit 32p quantum 1514b perturb 10sec 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 0b 0p requeues 0 
qdisc sfq 1400: parent 1:400 limit 32p quantum 1514b perturb 10sec 
 Sent 9790497 bytes 15682 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 0b 0p requeues 0 
qdisc sfq 1999: parent 1:999 limit 32p quantum 1514b perturb 10sec 
 Sent 1198387 bytes 6755 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 0b 0p requeues 0

The output below is just two types of output tc can generate. You might find the class statistics to be helpful to diagnose leaves consumption:

# tc -s class show dev eth0

[...truncated...]

class htb 1:400 parent 1:1 leaf 1400: prio 4 rate 800000bit ceil 1600Kbit burst 30Kb cburst 1600b 
 Sent 10290035 bytes 16426 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 23624bit 5pps backlog 0b 0p requeues 0 
 lended: 16424 borrowed: 2 giants: 0
 tokens: 4791250 ctokens: 120625

Above is shown the detailled statistics for the HTTP leaf, and you can see the accumulated rate, statistics of packets per seconds, but also the tokens accumulated, lended, borrowed, etc… this is the most helpful output to diagnose your policy in depth.

- A word about "BufferBloat"

We mentionned that too large buffers can have a negative impact on the performances of a connection. But how bad is it exactly ?

The answer to that question was investigated by Jim Gettys when he found his home network to be inexplicably slow.

He found that, while we were increasing the bandwidth of network connections, we didn't worry about the latency at all. Those two factors are quite different and both critical to the good quality of a network. Allow me to quote Gettys's FAQ here:

A 100 Gigabit network is always faster than a 
1 megabit network, isn’t it? 
More bandwidth is always better! I want a faster network!

No, such a network can easily be much slower.   
Bandwidth is a measure of capacity, not a measure of how 
fast the network can respond. You pick up the phone to 
send a message to Shanghai immediately, but dispatching a 
cargo ship full of blu-ray disks will be amazingly slower 
than the telephone call, even though the bandwidth of the 
ship is billions and billions of times larger than the 
telephone line.   So more bandwidth is better only if it’s 
latency (speed) meets your needs. More of what you don’t 
need is useless. 
Bufferbloat destroys the speed we really need.

More information on Gettys's page, and in this paper from 1996: It's the Latency, Stupid.

Long story short: if you have bad latency, but large bandwidth, you will be able to transfer very large files efficiently, but a simple DNS query will take a lot longer than it should. And since those DNS queries, and other small message such as VoIP, are very often time sensitive, bad latency impacts them a lot (a single packet takes several hundreds of milliseconds to traverse the network).

So how does that relate to buffers ? Gettys proposes a simple experiment to illustrate the problem.

We said earlier that Linux ships with a default txqueuelen of 1000. This value was increased in the kernel when gigabits ethernet cards became the standard. But not all links are gigabits, far from that. Consider the following two computers:

We will call them desktop and laptop. They are both gigabit, and the switch is gigabit.

If we verify the configuration of their network interfaces, we will confirm that:

the interfaces are configured in gigabits via ethtool
the txqueuelen is set to 1000

# ifconfig eth0|grep txqueuelen
          collisions:0 txqueuelen:1000 
          
# ethtool eth0|grep Speed
	Speed: 1000Mb/s

On one machine, launch nttcp with the '-i' switch to make it wait for connections:

# nttcp -i

On the laptop, launch “nttcp -t -D -n2048000 <serverIP>”, where

-t means this machine is “transmitting”, or sending data
-D disables the TCP_NODELAY see redhat doc
-n is the number of buffers of 4096 bytes given to the socket.

# nttcp -t -D -n2048000 192.168.1.220

And at the same time, on the laptop, launch a ping of the desktop.

64 bytes from 192.168.1.220: icmp_req=1 ttl=64 time=0.300 ms
64 bytes from 192.168.1.220: icmp_req=2 ttl=64 time=0.386 ms
64 bytes from 192.168.1.220: icmp_req=3 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=4 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=5 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=6 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=7 ttl=64 time=19.3 ms
64 bytes from 192.168.1.220: icmp_req=8 ttl=64 time=19.0 ms
64 bytes from 192.168.1.220: icmp_req=9 ttl=64 time=0.281 ms
64 bytes from 192.168.1.220: icmp_req=10 ttl=64 time=0.362 ms

The first two pings are launch before nttcp is launched. When nttcp starts, the latency augments but this is still acceptable.

Now, reduce the speed of each network card on the desktop and the laptop to 100Mbips. The command is:

#ethtool -s eth0 speed 100 duplex full


# ethtool eth0|grep Speed
	Speed: 100Mb/s

And run the same test again. After 60 seconds, here are the latency I get :

64 bytes from 192.168.1.220: icmp_req=75 ttl=64 time=183 ms
64 bytes from 192.168.1.220: icmp_req=76 ttl=64 time=179 ms
64 bytes from 192.168.1.220: icmp_req=77 ttl=64 time=181 ms

And one last time, with an Ethernet speed of 10Mbps:

64 bytes from 192.168.1.220: icmp_req=187 ttl=64 time=940 ms
64 bytes from 192.168.1.220: icmp_req=188 ttl=64 time=937 ms
64 bytes from 192.168.1.220: icmp_req=189 ttl=64 time=934 ms

Almost one second of latency between two machines next to each other. Every time we divide the speed of the interface by an order of magnitude, we augment the latency by an order of magnitude as well. Under that load, opening an SSH connection from the laptop to the desktop is taking more than 10 seconds, because we have a latency of almost 1 second per packet.

Now, while this last test is running, and while you are enjoying the ridiculous latency of your SSH session to the desktop, we will get rid of the transmit and ethernet buffers.

# ifconfig eth0 |grep txqueuelen
          collisions:0 txqueuelen:1000 

#  ethtool -g eth0
Ring parameters for eth0:
[...]
Current hardware settings:
[...]
TX:		511

We start by changing the txqueuelen value on the laptop machine from 1000 to zero. The latency will not change.

# ifconfig eth0 txqueuelen 0

64 bytes from 192.168.1.220: icmp_req=1460 ttl=64 time=970 ms
64 bytes from 192.168.1.220: icmp_req=1461 ttl=64 time=967 ms

Then we reduce the size of the tx ring of the ethernet card. Now that we don't have any buffer anymore, let's see what happens:

# ethtool -G eth0 tx 32

64 bytes from 192.168.1.220: icmp_req=1495 ttl=64 time=937 ms
64 bytes from 192.168.1.220: icmp_req=1499 ttl=64 time=0.865 ms
64 bytes from 192.168.1.220: icmp_req=1500 ttl=64 time=60.3 ms
64 bytes from 192.168.1.220: icmp_req=1501 ttl=64 time=53.1 ms
64 bytes from 192.168.1.220: icmp_req=1502 ttl=64 time=49.2 ms
64 bytes from 192.168.1.220: icmp_req=1503 ttl=64 time=45.7 ms

The latency just got divided by 20 ! We dropped from almost one second to barely 50ms. This is the effect of excessive buffering in a network, and this is what happens, today, in most Internet routers.

- What happens in the buffer ?

If we take a look at the Linux networking stack, we see that the TCP stack is a lot above the transmit queue and ethernet buffer. During a normal TCP connection, the TCP stack starts sending and receiving packets at a normal rate, and accelerates its sending speed at an exponential rate: send 2 packets, receive ACKs, send 4 packets, receive ACKs, send 8 packets, receives ACKs, send 16 packets, receives ACKS, etc….

This is known as the TCP Slow Start. This mechanisms works fine in practice, but the presence of large buffers will break it.

A buffer of 1MB on a 1Gbits/s link will empty in ~8 milli-seconds. But the same buffer on a 1MBits/s link will take 8 seconds to empty. During those 8 seconds, the TCP stack thinks that all of the packets it sent have been transmitted, and will probably continue to increase its sending speed. The subsequent packets will get dropped, the TCP stack will panick, drop its sending rate, and restart the slow start procedure from 0: 2 packets, get ack, 4 packets, get ack, etc…

But while the TCP stack was filling up the TX buffers, all the other packets that our system wanted to send got either stuck somewhere in the queue, with several hundreds of milliseconds of delay before being transmitted, or purely dropped.

The problem happens on the TX queue of the sending machine, but also on all the the buffers of the intermediary network devices. And this is why Gettys went to war against the home router vendors.

Linuxwall

Table of Contents

Journey to the Center of the Linux Kernel: Traffic Control, Shaping and QoS

- Introduction

- Motivation

- The basics of Traffic Control

- First contact

- Netfilter MARK

- Two classes in a tree

- Connecting the marks to the tree

- Twenty Thousand Leagues Under the Code

- Classless Disciplines

- PFIFO_FAST

- SFQ - Stochastic Fairness Queuing

- SFQ Hashing algorithm

- Classful Disciplines

- TBF - Token Bucket Filter

- DSL and ATM, the Ox that believed to be Frog

- The limits of TBF

- HTB - Hierarchical Token Bucket

- Hysteresis and HTB

- CoDel

- FQ_CoDel

- HFSC - Hierarchical Fair Service Curve

- QFQ - Quick Fair Queueing

- RED - Random Early Detection

- CHOKe - CHOose and {Keep,Kill}

- Shaping the traffic on the Home Network

- A word about "BufferBloat"

- What happens in the buffer ?