Over the past six months, we’ve been working on moving a pretty significant number of applications (hundreds of apps, over a thousand individual virtual servers) from Cisco CSM + SSL SM load balancers over to F5 Viprions for a large enterprise customer. We actually started this a while before Cisco announced they were ending their load balancer product line (called ACE). I expect that a lot of other people will probably be going through Cisco to F5 load balancer migrations in the next few months / years. I’d definitely be interested in hearing other people’s experiences in migrating between Cisco and F5 load balancers, and please get in touch if you’d like some additional guidance with these kinds of migrations.
Cisco’s load balancer product line had been stagnant for quite a while. We’ve definitely been quite happy with our F5’s, not only in the traditional load balancing role (F5 LTM), but also for global load balancing (F5 GTM). We’re looking forward to starting to use their F5 APM product as well to offer different authentication methods for applications.
I’m definitely glad to no longer have to deal with the Cisco CSM’s. Their functionality was pretty limited, creating redirects was sort of ugly, and dealing with SSL termination / re-encryption was an absolute nightmare. The CSM’s we had unfortunately had no SSL functionality built in, so we had to use the external SSL service modules. For apps that needed SSL termination on both the client side and server side, this resulted in a wonderful arrangement like this:
The CSM’s had relatively little “intelligence” at an HTTP / TCP level. The F5’s have a lot more “intelligence”, which is great because of the functionality that it lets us use, but it has also caused a few problems with strangely behaving applications. We’ve come across a few applications where we had to tweak some of the default F5 TCP / HTTP profiles. These are some of the tweaks we’ve had to make. I’m not necessarily suggesting changing these settings across the board for all applications, but I think they are something to look out for during these kinds of migrations. I’ve also included a few tweaks we’ve made outside of these migrations for other applications. Honestly, this has been a fun part of my job. It’s always satisfying to dig into the details of a packet capture, understand what is going on at a low level, and resolve an issue.
HTTP Header Size
Shortly after migrating a SharePoint-related application, we began seeing log entries on our F5’s about the maximum HTTP header size limit being exceeded. As it turns out, some SharePoint sites had a somewhat poorly designed custom web part that resulted in very large HTTP headers being sent.
The maximum header size is configured on the HTTP profile that is assigned to a virtual server on the F5. By default, it’s 32768 bytes. This should be more than enough in almost all circumstances. Other than this particular SharePoint site, we had no problem with this default on the other 1000+ virtual servers we moved.
Here’s an example of what the log message looks like:
Mar 27 00:13:36 local/tmm err tmm: 011f0005:3: HTTP header (32800) exceeded maximum allowed size of 32768 (Client side: vip=TEST-VIP profile=http addr=192.168.22.100 port=80 rtdom_id=0)
Conveniently, the logs on the F5 tell you how long the header in the request actually was. (It was 32800 in this case.) This makes it easier to determine how high you should bump the maximum header size. You can modify the maximum header size on the HTTP profile like this:
modify ltm profile http testprofile max-header-size 33000
F5 has a KB article on this here SOL8482.
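If you have a lot of these log entries to go through, picking the new limit can be scripted. This is only a rough Python sketch, not anything F5 provides: the function name, the 10% headroom, and the rounding to a multiple of 1024 are my own assumptions, and the regex is based only on the single log line shown above.

```python
import re

# Matches the "header exceeded" message format shown in the log example above.
HEADER_LOG_RE = re.compile(
    r"HTTP header \((?P<seen>\d+)\) exceeded maximum allowed size of (?P<limit>\d+)"
    r".*?vip=(?P<vip>\S+) profile=(?P<profile>\S+)"
)


def suggest_max_header_size(line, headroom_pct=10, round_to=1024):
    """Return details from the log line plus a suggested new limit, or None.

    headroom_pct and round_to are arbitrary illustration values, not F5 guidance.
    """
    match = HEADER_LOG_RE.search(line)
    if match is None:
        return None
    seen = int(match.group("seen"))
    # Add some headroom, then round up to the next multiple of round_to.
    padded = seen * (100 + headroom_pct) // 100
    suggested = -(-padded // round_to) * round_to
    return {
        "vip": match.group("vip"),
        "profile": match.group("profile"),
        "seen": seen,
        "limit": int(match.group("limit")),
        "suggested": suggested,
    }


log_line = (
    "Mar 27 00:13:36 local/tmm err tmm: 011f0005:3: HTTP header (32800) "
    "exceeded maximum allowed size of 32768 (Client side: vip=TEST-VIP "
    "profile=http addr=192.168.22.100 port=80 rtdom_id=0)"
)
result = suggest_max_header_size(log_line)
```

For the log line above, this suggests 36864 for the profile used by TEST-VIP. Whether that much headroom is appropriate is a judgment call for the application in question.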
TCP Zero Byte Window Timeout
We’ve only seen one instance of this as well, but it was kind of interesting and not at all obvious what was happening until we did a packet capture.
One particular application connects to a non-HTTP VIP and downloads a decent amount of data for processing. During this TCP connection, though, the client periodically sets the TCP receive window to 0 bytes. This tells the load balancer / server to pause transmitting data. This in and of itself is not necessarily an issue, but in some cases the client application tries to keep the connection “paused” with a zero byte TCP window for an extended period of time. The default TCP profile on F5 LTM’s has a zero window timeout value of 20 seconds. So, if the client application tries to keep the connection “paused” via a zero byte TCP window for greater than 20 seconds, the F5 will close the connection.
We’re still trying to work out exactly why the application is doing this. I think the root cause of this issue probably lies in the client application, not the load balancer, but this was kind of interesting to look into. The setting can be modified on the TCP profile on the load balancer. For example, to raise this timeout value to 30 seconds (30,000 milliseconds):
modify ltm profile tcp testtcpprofile zero-window-timeout 30000
F5 has a KB article on TCP profile settings here SOL7559.
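To decide how high the timeout needs to go, it helps to know how long the client actually holds the window at zero. Here is a small Python sketch of that calculation. It assumes you have already pulled (timestamp, advertised window) pairs for the client’s packets out of a packet capture yourself; the function name and the sample data are made up for illustration.

```python
def longest_zero_window(samples):
    """Return the longest stretch, in seconds, that the window stayed at zero.

    samples: (timestamp_seconds, advertised_window_bytes) pairs for packets
    sent by the client, in time order. How they are extracted from a capture
    is left out of this sketch.
    """
    longest = 0.0
    stall_start = None
    for timestamp, window in samples:
        if window == 0:
            if stall_start is None:
                stall_start = timestamp  # window just closed
        elif stall_start is not None:
            longest = max(longest, timestamp - stall_start)  # window reopened
            stall_start = None
    # A stall still open at the end of the capture is measured to the last sample.
    if stall_start is not None:
        longest = max(longest, samples[-1][0] - stall_start)
    return longest


# Hypothetical samples: the window closes at t=1.0 and reopens at t=26.5.
samples = [(0.0, 65535), (1.0, 0), (10.0, 0), (26.5, 8192), (30.0, 0), (31.0, 4096)]
stall = longest_zero_window(samples)
exceeds_default = stall > 20  # 20 seconds is the default zero window timeout
```

In this made-up example the longest stall is 25.5 seconds, so the connection would be closed under the 20 second default but would survive a 30 second timeout.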
TCP RST’s on a Forwarding Virtual Server
In some of our environments, our F5’s act as the default gateway for some servers on a subnet. This is accommodated through forwarding virtual servers on the F5. The forwarding virtual server will forward traffic without modifying the source or destination address, just like a typical router would. In this use case, we don’t really need any of the F5’s stateful functions, since we’re essentially trying to emulate a stateless router.
In this specific environment, non-load balanced traffic would not necessarily be routed back through the F5’s, so we enabled “loose init” and “loose close”. These settings allow the F5’s to pass data for connections where it does not see all of the flow. (For instance, if it just sees the outgoing part of the flow, but not the return flow).
One setting we didn’t initially change was “Reset on Timeout”. Even with loose init / loose close, the F5 still keeps track of connections, and will time the connections out if they have been idle for greater than the idle timeout value (300 seconds by default). By default it will kill the connection by sending TCP RST’s to the client and server after it has timed the connection out. Given that we wanted our F5’s to behave like a stateless router, this was not desired and caused an issue for one specific application that did not reconnect gracefully.
So, we disabled the “reset on timeout” option. The load balancer will still “time out” the connection and remove it from its connection tracking tables. This doesn’t really matter though, because we are also using loose initiation / loose close. So the F5 will forward traffic in the middle of a TCP conversation, even if it doesn’t already have that connection in its tables.
If you’re in a similar situation, you can make these changes on a profile from tmsh like this:
modify ltm profile fastl4 asdf loose-initialization enabled loose-close enabled reset-on-timeout disabled
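To make the interaction between these settings a bit more concrete, here is a toy Python model of the two behaviors described above. It is a deliberate simplification based only on what this post describes, not a representation of how the F5 is actually implemented, and the function names are made up.

```python
def on_idle_timeout(reset_on_timeout):
    """Actions taken when an idle connection ages out of the connection table."""
    actions = ["remove connection from table"]
    if reset_on_timeout:
        # The default behavior: both ends of the connection get a TCP RST.
        actions += ["send RST to client", "send RST to server"]
    return actions


def forwards_midflow_packet(in_table, loose_initialization):
    """Whether a packet from the middle of a TCP conversation gets forwarded."""
    # With loose initiation, the connection does not need to be in the table.
    return in_table or loose_initialization


default_timeout_actions = on_idle_timeout(reset_on_timeout=True)
our_timeout_actions = on_idle_timeout(reset_on_timeout=False)
forwarded_after_timeout = forwards_midflow_packet(in_table=False, loose_initialization=True)
```

With reset on timeout disabled, the timeout only removes the table entry, and with loose initiation enabled the next packet on that connection is still forwarded, which is the router-like behavior we were after.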
TCP Timeout / Keepalives
In one particular web application we’re load balancing, the client makes a request which kicks off a report on the server that takes a while to run. While this report is running, the TCP connection stays open, but no data will be sent for a period of time while the server is building the report.
On a TCP profile applied to a VIP on the load balancer, the default “idle timeout” is 300 seconds. After 300 seconds, the LB will close the connection. The load balancer can also send TCP keepalives to keep the connection open, assuming the client responds. However, the default interval for keepalives is 1800 seconds, which is longer than the idle timeout. So, effectively, we found that the F5 LTM would not send TCP keepalives by default. I honestly don’t understand why the default keepalive interval is longer than the idle timeout value.
Also, because the F5 LTM is proxying the connection, if the server sends a TCP keepalive, the F5 will respond, but that keepalive won’t be passed through to the client. So, the way I understand this, that means that the client side of the connection can still time out, even if the app server is sending keepalives to the F5.
Here’s how to adjust the keepalive interval on a TCP profile (to 60 seconds in this example):
modify ltm profile tcp test keep-alive-interval 60
To raise the idle timeout value on a TCP profile to one hour, do this with tmsh:
modify ltm profile tcp test idle-timeout 3600
F5 has a KB article explaining TCP keepalive behavior here SOL8049.
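The relationship between the two timers is easy to get wrong, so here is a quick Python sketch of the reasoning. It assumes the simplified model described above: a keepalive only helps if it fires before the idle timeout expires and the other end answers it. The function names are made up, and this is not how the F5 exposes any of this.

```python
def keepalives_fire_before_idle_timeout(idle_timeout, keep_alive_interval):
    """True if a keepalive would be sent before the idle timeout expires."""
    return keep_alive_interval < idle_timeout


def survives_quiet_period(quiet_seconds, idle_timeout, keep_alive_interval):
    """Whether a connection with no data for quiet_seconds stays open.

    Assumes the peer answers every keepalive, and that an answered keepalive
    is enough to keep the connection from idling out.
    """
    if quiet_seconds < idle_timeout:
        return True
    return keepalives_fire_before_idle_timeout(idle_timeout, keep_alive_interval)


# A report that takes 10 minutes (600 seconds) to build:
with_defaults = survives_quiet_period(600, idle_timeout=300, keep_alive_interval=1800)
with_short_keepalive = survives_quiet_period(600, idle_timeout=300, keep_alive_interval=60)
with_long_idle_timeout = survives_quiet_period(600, idle_timeout=3600, keep_alive_interval=1800)
```

Under these assumptions, the default profile drops the connection during a 10 minute report, while either of the two changes above keeps it open.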
We’ve come across several instances where we’re re-encrypting traffic to SSL services that do not support secure renegotiation. Because this is a security issue, newer versions of F5 BIG-IP block this by default through the server-side SSL profile. This can be changed, though, to support legacy servers where the team supporting the application cannot enable secure renegotiation on their end. There is a “secure renegotiation” setting on the client and server SSL profiles in BIG-IP 10.2.3 and higher. It is set to “require” or “require strict” out of the box, but can be changed to “request” to permit connections with peers that have not been patched. To do this via tmsh, run:
modify ltm profile server-ssl sslprofile secure-renegotiation request
F5 has a KB article about this as well here SOL13512.
We also had to modify the SSL ciphers setting for one particular application with legacy servers. F5 LTM’s have two separate SSL stacks – “native” which is hardware accelerated, and “compat” which supports a wider range of ciphers but is not hardware accelerated. “compat” is based on OpenSSL. By default in BIG-IP version 11, only the “native” stack is enabled. For this one particular app, we found that we needed to enable the “compat” stack in addition to the “native” one. Here’s how to do this on a server side SSL profile:
modify ltm profile server-ssl sslprofile ciphers NATIVE:COMPAT
F5 also has a KB article explaining the difference between the two SSL stacks – SOL13187.