The XenDesktop Availability Problem
Over the last several years I have been actively involved in helping many of our enterprise customers deploy large VDI environments. A recurring theme is that most, if not all, have experienced outages in their XenDesktop environments. This problem has not been limited to the large customers that I have personally been involved with; I have also been getting emails and reports about outages from large customers all over the world. The troubling thing about this is that many of these customers have extremely mature and highly skilled IT teams. Additionally, they have often done everything right according to all of our best practices in making every component highly available, yet outages keep occurring.
There are many things that can cause an outage in a XenDesktop environment. I am not going to list all of the various things, but I will highlight some of the key items that I have seen cause issues. Some specific examples that I have seen include:
- In XenDesktop 4 and earlier, a desktop group could only have one hypervisor connected to it. If vCenter, SCVMM or the Xen Pool Master went down, the desktop group would become inaccessible. I can’t count how many times I have seen hypervisor connectivity issues cause outages! Even simple things like the vCenter certificate expiring can cause an outage!
- In XenDesktop 5.x, if SQL connectivity is lost, it is game over. Without SQL, no new connections can be made. This is probably one of the biggest challenges of all.
- With PVS, if SQL connectivity is lost, the server can still function in a limited fashion as long as the Stream service is not restarted and DB offline is enabled. However, if PVS is rebooted or the service is restarted, PVS will fail if SQL is not up.
- With PVS I have seen the Stream Service hang across all servers in a farm when a failover occurs while the servers are under high load. This type of Stream service failure will prevent new devices from successfully streaming.
With XenDesktop 5.x, we did make some improvements in minimizing outages caused by hypervisor issues, now that we can feed catalogs and desktop groups with more than one hypervisor infrastructure. However, we introduced a new single point of failure by making XenDesktop 5.x 100% dependent on SQL Server, whereas in XenDesktop 4, each server had an offline copy (LocalHostCache) of the SQL database and could still broker desktop connections in the event of a SQL failure.
You can have the smartest IT architects and the greatest hardware with all high availability best practices followed by the book (multiple hypervisors, SQL mirroring/clustering, NetScaler load balancing, multiple PVS servers with DB offline, etc…); however, if you have a large environment with a single XenDesktop site, I can promise you only one thing…. You will have an outage!!! It is not a matter of if the outage will occur, but a matter of when and for how long.
POD Architecture to the rescue???
This issue of XenDesktop availability and scalability is really nothing new. Several years ago Dan Feller proposed a modular Pod architecture where instead of deploying a single large XenDesktop site, you deploy multiple independent XenDesktop sites (a.k.a. Pods) and if one XenDesktop site goes down, the others are still up and available. In fact, one of our consulting architects, Rich Meesters, recently wrote an excellent blog and white paper that discusses this architecture. Going forward with the rest of my blog, I will assume that you have read the articles by Rich and understand what is meant by Pod architecture.
While this Pod architecture sounds great, there are major deficiencies in implementing it effectively and efficiently. Since we are using a common gold image and non-persistent/pooled desktops, it should not matter to which Pod a user connects. Imagine that you have 5 Pods each with 5000 desktops. If you connect to Pod 1 today and Pod 3 tomorrow, it should not matter as they are identical. Ideally, you should be able to treat all 5 Pods as one logical unit. However, XenDesktop, NetScaler and Web Interface in their current form do not give you the ability to logically view or load balance these 5 Pods as one unit.
I know what you are thinking… Can’t NetScaler and/or some of the Web Interface features such as Farm Aggregation, Recovery Farm or User Roaming fix this issue??? The short answer is No, none of these Web Interface features or the NetScaler can load balance multiple Pods cleanly as one logical unit. Let’s step through the issues with trying to load balance multiple Pods today…
Web Interface Farm Aggregation, Recovery Farms and User Roaming….
- You could simply list all 5 Pods as farms in Web Interface and leverage farm aggregation. The users will see 5 Desktop icons and will have to decide for themselves which one to pick. Do you really want a user to see five identical icons? Do you want to rely on users randomly picking the icons to distribute load? This is not elegant and not a very intelligent distribution of load.
- You could use the recovery farm feature of Web Interface and stagger the primary farm and recovery farms on each of the 5 Web Interface servers. This could be done by having each individual Web Interface only point to one primary farm and a list of recovery farms. Web Interface 1 points to Site 1 as primary and Sites 2-5 as recovery. Web Interface 2 points to Site 2 as primary and Sites 1, 3, 4, 5 as recovery, and so on… A NetScaler then load balances these Web Interface servers as one logical unit. There are several issues with this approach. The major Achilles heel with this approach is that the NetScaler would not know who you are before sending you to Web Interface and whether you have a disconnected session in one of the farms. If you had a disconnected session in Pod 1, but logged on from a new client, NetScaler might route you to a Web Interface server that uses Pod 4 as the primary farm and you would start a new session instead of reconnecting to your disconnected session. This solution also suffers from the issue where a farm might be up, but it is out of desktops due to hypervisor or PVS issues or simply because it ran out of desktops. Web Interface recovery farms are not smart enough to fail over to a backup farm simply because desktops are depleted.
- You could try to implement the user roaming feature and send specific users to a certain farm/Pod as their home Pod. This means that you must split your users up into separate AD groups and assign them to a primary Pod, which adds administrative overhead. Additionally, this still does not truly load balance or provide failover. If you are assigned to Pod 1 and it runs out of desktops because the hypervisors are down or because PVS is down, Web Interface and the DDCs will still be alive and will not fail you over to a recovery farm. Additionally, if your primary farm is temporarily unreachable, you could launch a session in the recovery farm. If you disconnect from the recovery farm and your primary farm then becomes available again, you will not be reconnected to your disconnected session in the recovery farm.
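To make the staggered recovery farm problem concrete, here is a toy model in Python. It is only a sketch of the behavior described above, not Web Interface code: the names PODS, farm_order and broker are made up for illustration, and the rule "use the first reachable farm, even if it has no desktops left" is a simplifying assumption about how recovery farms behave.

```python
# Toy model of five Web Interface servers with staggered primary/recovery farms.
# All names here are hypothetical; this is not Web Interface code.

PODS = ["Pod1", "Pod2", "Pod3", "Pod4", "Pod5"]


def farm_order(wi_index):
    """Web Interface server N uses Pod N as primary and the rest as recovery."""
    primary = PODS[wi_index]
    return [primary] + [p for p in PODS if p != primary]


def broker(wi_index, reachable, free_desktops):
    """Return the Pod that serves the launch, or None if the launch fails.

    Assumption: failover happens only when a farm is unreachable. A farm
    that is reachable but out of desktops still "wins", so the launch fails.
    """
    for pod in farm_order(wi_index):
        if pod in reachable:
            return pod if free_desktops[pod] > 0 else None
    return None


if __name__ == "__main__":
    everything_up = set(PODS)
    plenty = {p: 100 for p in PODS}

    # The user has a disconnected session in Pod1, but the load balancer
    # happens to pick the fourth Web Interface server (primary = Pod4).
    print(broker(3, everything_up, plenty))

    # Pod1 is reachable but depleted: no failover, the launch simply fails.
    print(broker(0, everything_up, {**plenty, "Pod1": 0}))
```

In the first case the user lands in Pod4 and starts a new session, even though a disconnected session is waiting in Pod1. In the second case the launch fails outright, because nothing in the model (or in recovery farms as described above) treats "out of desktops" as a reason to fail over.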
What about NetScaler??? No matter how you try to use NetScaler to load balance a connection to Web Interface or a connection to a specific Pod, NetScaler is not smart enough to know if there are actually desktops available in the Pod and NetScaler does not know if you have a disconnected session in a particular Pod, so you will end up being unable to reconnect to your disconnected sessions.
As you can see, we have no way to logically load balance all 5 Pods as one unit and truly distribute load based upon actual usage. No matter how you try to approach this with our current Web Interface features or with a NetScaler, there is no clean way to load balance the Pods.
A New Web Interface Enhancement to the Rescue!!!
This problem of load balancing Pods as a single logical unit was becoming a major issue for my customers, and unfortunately, there was nothing on our product roadmap that would address it soon enough for them. So, I decided to lay out a solution by designing an add-on to Web Interface. Since I am not really a programmer, I partnered with an outside developer to help write the code. I have to give a shout out here to Wayne Rouse, email@example.com, as he was the coding mastermind behind this enhancement! Wayne and I spent many a day and night on GoToMeeting sessions in my lab as we developed this solution! Thank you Wayne!
So what did we do to enhance Web Interface to fix this Pod load balancing issue? We approached it by building upon the native farm aggregation capability of Web Interface. You will list each Pod in Web Interface as a separate farm and then configure our code to work its magic! The highlights are listed below:
- We identify identically and specially named desktop groups from multiple farms and collapse them into a single icon. If you have 5 Pods offering the same “Windows 7” desktop group, you will only see one icon.
- We run PowerShell queries from Web Interface to each XenDesktop Pod and build a table of current usage statistics. We know which of the 5 Pods is least loaded and which Pods actually have available desktops.
- When you click on the single Desktop Group icon that has been aggregated from multiple XenDesktop Sites/Pods, we will first query each farm to locate any disconnected sessions. We will always reconnect you to your disconnected session first, regardless of which Pod is least loaded.
- If you do not have a disconnected session, when you launch an aggregated and load balanced Desktop Group, we will check the table of PowerShell load data and connect you to the Pod that is least loaded.
- If for any reason, we cannot successfully generate a launch.ica file for the least loaded Pod, we will automatically keep trying other Pods until we can get a launch.ica file and connect you to a desktop. No more errors about desktops being in maintenance mode or unavailable! One click and if a desktop is available in any of the Pods, we will connect you to it!
- We also offer enhanced maintenance mode capabilities. Today, if you put a Desktop Group in maintenance mode, no one will be able to connect to it. This prevents disconnected users from reconnecting to their desktops. This is a major pain point in being able to successfully drain off users from a desktop group. With our new code, we allow you to flag a desktop group for drain off without actually having to put it in maintenance mode in the Desktop Studio console. This allows you to prevent new sessions from connecting to a desktop group while simultaneously allowing disconnected users to reconnect to their sessions.
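The launch logic in the list above can be sketched roughly as follows. This is an illustration in Python, not the add-on's actual code (which runs inside Web Interface and uses PowerShell queries). The Pod class, its fields, launch_order, launch and the request_ica callback are all hypothetical names, and "load" is assumed here to mean the fraction of desktops in use.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Pod:
    """One row of the load table (hypothetical shape, not the add-on's schema)."""
    name: str
    total_desktops: int
    in_use: int
    draining: bool = False  # flagged for drain-off: no new sessions allowed
    disconnected_users: set = field(default_factory=set)

    @property
    def available(self) -> int:
        return self.total_desktops - self.in_use

    @property
    def load(self) -> float:
        # Assumption: load is simply the fraction of desktops in use.
        return self.in_use / self.total_desktops if self.total_desktops else 1.0


def launch_order(user: str, pods: list) -> list:
    """Return the Pods to try, in order, for this user's launch request."""
    # 1. A disconnected session always wins, even on a draining or busy Pod.
    reconnect = [p for p in pods if user in p.disconnected_users]
    if reconnect:
        return reconnect[:1]
    # 2. Otherwise only Pods that accept new sessions and have free desktops
    #    are candidates, least loaded first.
    candidates = [p for p in pods if not p.draining and p.available > 0]
    return sorted(candidates, key=lambda p: p.load)


def launch(user: str, pods: list,
           request_ica: Callable[[str, Pod], Optional[str]]) -> Optional[str]:
    """Keep trying candidate Pods until one returns a launch file."""
    for pod in launch_order(user, pods):
        ica = request_ica(user, pod)  # hypothetical call; None means failure
        if ica is not None:
            return ica
    return None


if __name__ == "__main__":
    pods = [
        Pod("Pod1", 5000, 4000),
        Pod("Pod2", 5000, 1000, disconnected_users={"alice"}),
        Pod("Pod3", 5000, 500, draining=True),
        Pod("Pod4", 5000, 5000),
        Pod("Pod5", 5000, 2500),
    ]
    print([p.name for p in launch_order("bob", pods)])    # ['Pod2', 'Pod5', 'Pod1']
    print([p.name for p in launch_order("alice", pods)])  # ['Pod2']
```

Note how the drain flag only filters the candidates for new sessions in step 2. A user with a disconnected session on a draining Pod is still sent back to it in step 1, which is the behavior that regular maintenance mode does not give you.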
The diagram below illustrates how our code works…
Gotchas, Disclaimers and other info…
- This code is provided as a free tool/utility add-on to your Web Interface deployment. It is not an officially supported Citrix product. Just like all of the other great tweaks to Web Interface that you can find out there (especially on Thomas Koetzing’s awesome site! http://www.thomaskoetzing.de), if you have issues with this add-on, you cannot open a ticket with Citrix Tech support. Citrix is not responsible or liable for the use of this utility. Please use it in a test environment first!
- This code only works on Web Interface 5.4 and 5.4.2
- This code was only tested on Windows 2008 R2 as the web server.
- You should only run one instance of this add-on per IIS server. We have not tested running multiple instances of this add-on on a single IIS server. It does not mean that it will not work; just that it has not been tested.
- Make sure your Web Servers have at least 2 vCPUs and 4GB RAM (you should be doing this already!).
- We only tested this with XenDesktop 5.5 and 5.6.
- This should only be used to load balance XenDesktop Pods that are in the same data center.
- This code does not work with StoreFront or Cloud Gateway. It is only a Web Interface 5.4 enhancement.
- This code is not intended to aggregate XenApp sessions, so please do not try to do it.
- This code is designed for load balancing non-persistent pooled desktops delivered via Machine Creation Services or Provisioning Services. You should not aggregate persistent assigned desktops or assigned desktops with Personal vDisk.
I know that some people are going to ask about the future of this enhancement, so let me address that now. As stated previously, this is not an officially supported Citrix product. We created it as a free tool to address an immediate need for current XenDesktop and Web Interface deployments. Ideally, I think this functionality really belongs on NetScaler, StoreFront or some combination of the two. Hopefully, native load balancing of XenDesktop Pods will become a core part of the Citrix product at some point in the future and the need for this Web Interface add-on will go away! I am not a Citrix Product Manager, so please do not ask me if or when this capability will become a supported feature.
You can download this tool from the following link:
I hope you find this tool useful! Feel free to let me know what you think!