From Erik's IT-Security notes

Troubleshooting TCP/IP performance problems in Windows 2008 and Vista

Microsoft rewrote the whole TCP/IP stack in Vista/2008. They finally added receive window auto-tuning and merged a lot of the previously released feature packs together. The new stack works well, but it sometimes misbehaves with older network equipment. Here are some notes on how to troubleshoot performance issues.

Prerequisite
First, record the traffic you need to troubleshoot with Wireshark or Network Monitor, and filter the capture down to the relevant data.

Then a little sanity check:

- Is the hardware all right?
Try switching network cards and switch ports.

- Is the firmware current?
Make sure you have the latest firmware applied to all relevant hardware devices. If the switch works fine for all servers except one, the problem is probably not the firmware on the switch.

- Are the drivers updated?
Install the latest device drivers on the server. Outdated network drivers are a common cause of problems with the new stack.

- Do you have NIC teaming on the server?
Avoid using any network card teaming capabilities, since they can cause compatibility issues and really strange network problems. Load-balanced teaming is the most problematic teaming configuration of them all. Even if you know teaming should work, removing it as part of the troubleshooting is highly recommended. Microsoft Network Load Balancing, Microsoft Cluster Services and switches with anti-spoofing capabilities should generally never be used together with NIC teaming.

Checklist: are we getting lots of fragmentation?

Fragmentation occurs when a packet is larger than the maximum size a router along the path can handle. By setting the don't fragment (DF) flag and trying different payload sizes, you'll get an error once the size is larger than what can be sent in one piece. This is the magical limit where fragmentation occurs. Ping can be used to test this:

ping <remote address> -l <size> -f

A size of 1460 works without a problem with the FreeBSD-router:
ping 192.168.0.1 -l 1460 -f
Reply from 192.168.0.1: bytes=1460 time=1ms TTL=64
Reply from 192.168.0.1: bytes=1460 time=1ms TTL=64

Increasing to 1490 doesn't work at all:
Pinging 192.168.0.1 with 1490 bytes of data:
Request timed out.
Request timed out.

After a bit of testing with increasing buffer sizes, we find that the largest payload that gets through is 1472 bytes. Adding the 20-byte IP header and the 8-byte ICMP header gives an MTU of 1500 in this case. Go back to the saved network traffic you recorded and compare the packet sizes against the MTU. If packets frequently exceed the MTU, you should see lots of fragmentation happening. This will dramatically decrease performance while putting a heavier load on the CPU, which has to reassemble the packets. The most common MTU sizes are 1500 and 1492. Gigabit networks with "jumbo frames" can go much higher; 9k is a common value for jumbo frames.
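The manual trial and error above can be automated. The following is a minimal sketch, not a finished tool: the function names (ping_df, largest_payload, path_mtu) and the upper bound of 9000 are assumptions made for illustration. It drives the same Windows ping flags (-f and -l) in a binary search and then adds the 28 bytes of IPv4 and ICMP headers to get the MTU.

```python
import subprocess

IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def ping_df(host, size):
    """Return True if one DF-flagged ping with `size` payload bytes is answered.

    Uses the Windows ping flags -f (don't fragment) and -l (payload size),
    plus -n 1 to send a single echo request.
    """
    result = subprocess.run(
        ["ping", host, "-f", "-l", str(size), "-n", "1"],
        capture_output=True, text=True,
    )
    # A real echo reply contains "TTL="; error lines and timeouts do not.
    return "TTL=" in result.stdout


def largest_payload(probe, low=0, high=9000):
    """Binary-search the largest size for which probe(size) is True.

    Assumes probe(low) succeeds and that every size above the limit fails.
    """
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid        # mid fits, search upwards
        else:
            high = mid - 1   # mid is too big, search downwards
    return low


def path_mtu(payload):
    """Convert the largest unfragmented ping payload into an IPv4 MTU."""
    return payload + IP_HEADER + ICMP_HEADER
```

On a Windows box this could be called as path_mtu(largest_payload(lambda size: ping_df("192.168.0.1", size))). A single lost ping will skew the search, so repeating the run is a good idea.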

The TCP/IP stack should be able to fix this automatically, and this is not a feature new to Windows Vista/2008: Path MTU discovery (PMTUD) has been around for a while in Windows. By default the discovery is on, except for Windows boxes running ISA Server 2004/2006. But changing that on the ISA Server is easy.

How to activate PMTUD:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters
Key: EnablePMTUDiscovery
Type: DWORD
Acceptable values: 0 or 1
Default value: 1 (true)
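If you prefer to script the registry change, here is a small sketch that builds a reg.exe command line for the value described above. The helper names (pmtud_command, set_pmtud) are made up for this example, and the script assumes it runs on Windows from an elevated prompt. Changes under Tcpip\Parameters typically need a restart before they take effect.

```python
import subprocess

# HKLM is the short form of HKEY_LOCAL_MACHINE accepted by reg.exe.
KEY = r"HKLM\System\CurrentControlSet\Services\Tcpip\Parameters"


def pmtud_command(enabled=True):
    """Build the reg.exe argument list that sets EnablePMTUDiscovery to 1 or 0."""
    return [
        "reg", "add", KEY,
        "/v", "EnablePMTUDiscovery",
        "/t", "REG_DWORD",
        "/d", "1" if enabled else "0",
        "/f",  # overwrite an existing value without prompting
    ]


def set_pmtud(enabled=True):
    """Run the command; raises CalledProcessError if reg.exe reports a failure."""
    subprocess.run(pmtud_command(enabled), check=True)
```

Printing " ".join(pmtud_command()) first is a cheap way to review the command before running set_pmtud().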

Troubleshooting by exclusion: which feature must go?

The first step is to figure out which TCP global settings are active by issuing this command:

netsh interface tcp show global

The result should look something like this:

Querying active state...
TCP Global Parameters
----------------------------------------------
Receive-Side Scaling State          : enabled
Chimney Offload State               : enabled
Receive Window Auto-Tuning Level    : disabled
Add-On Congestion Control Provider  : ctcp
ECN Capability                      : enabled
RFC 1323 Timestamps                 : enabled
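When you compare several servers, it helps to have this output in a machine-readable form. The sketch below is one possible way to do it; parse_tcp_global and show_global are names invented for this example. It turns the "Name : value" lines into a dictionary and ignores the banner and separator lines.

```python
import subprocess

SAMPLE = """\
Querying active state...
TCP Global Parameters
----------------------------------------------
Receive-Side Scaling State          : enabled
Chimney Offload State               : enabled
Receive Window Auto-Tuning Level    : disabled
Add-On Congestion Control Provider  : ctcp
ECN Capability                      : enabled
RFC 1323 Timestamps                 : enabled
"""


def parse_tcp_global(text):
    """Parse 'Name : value' lines from the netsh output into a dict."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" : ")
        # Lines without the separator (banner, dashes, blanks) are skipped.
        if sep and name.strip() and value.strip():
            settings[name.strip()] = value.strip()
    return settings


def show_global():
    """Run the netsh command on the local Windows machine and parse the result."""
    result = subprocess.run(
        ["netsh", "interface", "tcp", "show", "global"],
        capture_output=True, text=True, check=True,
    )
    return parse_tcp_global(result.stdout)
```

Calling parse_tcp_global(SAMPLE) on the output shown above gives six entries, for example "Chimney Offload State" mapped to "enabled".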

Now we must find out whether any of those features is the cause of the performance issues. Try disabling them one by one and see if the performance changes. If the performance problem disappears, re-enable the other features one at a time until you've cornered the one feature that is the culprit. Sometimes more than one thing causes the problem, so note everything you do.
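The exclusion procedure can be sketched as a loop. This is only an illustration under several assumptions: measure is a hypothetical callable that you supply (for example a timed file copy returning MB/s), the 1.2 threshold is arbitrary, and each feature is put back to its "on" value afterwards rather than to whatever it was before. Auto-tuning is left out because the article states further down that it needs a restart to take effect.

```python
import subprocess

# option name -> (value that turns it off, value that turns it on),
# using the netsh options covered in this article.
FEATURES = {
    "chimney": ("disabled", "enabled"),
    "rss": ("disabled", "enabled"),
    "congestionprovider": ("none", "ctcp"),
}


def set_global(option, value):
    """Apply one TCP global setting through netsh (needs an elevated prompt)."""
    subprocess.run(
        ["netsh", "interface", "tcp", "set", "global", f"{option}={value}"],
        check=True,
    )


def find_culprits(measure, baseline, setter=set_global, threshold=1.2):
    """Disable one feature at a time and record whether throughput improves.

    measure() must return a throughput figure where higher is better.
    Returns a log of (option, result, improved) tuples, so that every
    step is written down.
    """
    log = []
    for option, (off, on) in FEATURES.items():
        setter(option, off)
        try:
            result = measure()
        finally:
            setter(option, on)  # switch it back on before the next test
        log.append((option, result, result >= baseline * threshold))
    return log
```

Because only one feature is off at a time, this will not catch a problem that needs two features disabled together; in that case extend the loop by hand and keep the log.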

TOE/Chimney offloads the processing of TCP connections to the network card. This can cause many problems if the drivers and firmware are outdated or the hardware is faulty. In my experience this is often the most problematic feature in the new Windows TCP/IP stack.

To disable it:

netsh int tcp set global chimney=disabled 

To enable it again:

netsh int tcp set global chimney=enabled

If you suspect that the auto-tuning feature is causing you trouble, you can disable it by issuing this command:

netsh interface tcp set global autotuninglevel=disabled

Activating it is equally simple:

netsh interface tcp set global autotuninglevel=normal

You have to restart Windows to make the new setting go into effect.

Receive Side Scaling balances the network traffic between CPUs and/or CPU cores.

To disable it:

netsh int tcp set global rss=disabled

And to enable it again:

netsh int tcp set global rss=enabled

More on Microsoft Windows TCP Offloading (TOE/Chimney)
http://support.microsoft.com/kb/951037

More on auto-tuning:
http://technet.microsoft.com/en-us/magazine/2007.01.cableguy.aspx
http://technet.microsoft.com/en-us/library/bb878127.aspx
http://support.microsoft.com/kb/951037

More on Window-scaling:
http://en.wikipedia.org/wiki/TCP_window_scale_option

Compound TCP (CTCP) is a congestion control algorithm that tries to compensate for links with very high latency by growing the send window more aggressively.

Enable:
netsh interface tcp set global congestionprovider=ctcp 
Disable:
netsh interface tcp set global congestionprovider=none
For Windows 2003, this is how to control Compound TCP:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\
Key: TCPCongestionControl
Type: DWORD
Acceptable values: 0 or 1

Note that Windows 2003 and 64-bit Windows need hotfix 949316 applied to support this function.

More on compound TCP:
http://en.wikipedia.org/wiki/Compound_TCP