I truly love my job. I work closely with Citrix partners (usually over a period of many months) to build out virtual desktop and application reference architectures, conduct performance tests, and evaluate how each solution scales. When we reach the end of a validation, it’s incredibly rewarding to understand the nuances of a configuration and see the final published results of our work. And it’s especially satisfying when our testing reveals that new software and hardware releases can support substantial increases in user density.

Perhaps that’s why I’m thrilled about the release of a new Cisco® Validated Design (CVD) that documents a reference architecture and the scalability tests we conducted over the last several months. The architecture combines Citrix® XenDesktop®, Cisco UCS blade servers, NetApp storage, and VMware vSphere to build an enterprise-ready desktop virtualization solution. Under test, it showed excellent scalability with a hefty 2000-seat mixed workload of hosted virtual desktops (VDI) and hosted shared desktops (RDS). We capped our testing at 2000 users, but the architecture can scale beyond that to meet future growth needs. If you want to read the full CVD, Cisco has published it here. This blog extracts the highlights and summarizes the results of our design and testing.

Building a Scalable Architecture

The reference architecture is based on a turnkey FlexPod infrastructure (in this case, the FlexPod Data Center with VMware vSphere 5.1). This enterprise solution combines a new generation of hardware and software components: Citrix XenDesktop 7.1 and Provisioning Services 7.1, VMware ESXi 5.1, Cisco UCS B200 M3 blades with Intel® Xeon® E5-2680 v2 (“Ivy Bridge”) processors, and NetApp FAS3240 shared storage running the clustered Data ONTAP® 8.2 storage operating system. For high availability, the architecture uses an N+1 server design: two Cisco UCS chassis house two blades for infrastructure servers, four blades for the VDI workload, and eight blades for the RDS hosted shared desktop workload, as in the diagram below.

The turnkey nature of the infrastructure creates a compact, affordable, and flexible solution for Citrix XenDesktop 7.1 software. The XenDesktop 7.1 release unifies the functionality of earlier XenApp and XenDesktop releases, providing a single management framework and common policies to provision both hosted virtual desktops and hosted shared desktops. The CVD documents a design that we stress-tested and validated with 550 hosted virtual desktops (running Microsoft Windows 7) and 1450 hosted shared desktops (running Microsoft Windows Server 2012). NetApp FAS3240 arrays provided well-managed, scale-out shared storage.

Testing Methodology

To understand the test methodology, you can view this short three-minute video.

We set up the test configuration in the NetApp labs in Research Triangle Park, North Carolina. During each test run, we captured metrics across the end-to-end virtual desktop lifecycle: during virtual desktop boot and user desktop login (ramp-up), user workload simulation (steady state), and user log-offs. To generate load within the environment, we used Login VSI 3.7 software from Login VSI Inc. (http://www.loginvsi.com). This load simulation software generates desktop connections, simulates application workloads, and tracks application responsiveness.

To begin the testing, we started performance monitoring scripts to record resource consumption for the hypervisor, virtual desktops, storage, and load generation software. At the beginning of each test run, we took the desktops out of maintenance mode, started the virtual machines, and waited for them to register. The Login VSI launchers then initiated the desktop sessions and began user logins, which constitutes the ramp-up phase. Once all users were logged in, the steady-state portion of the test began, in which Login VSI executed the application workload (the default Login VSI Medium workload). The Medium workload represents office productivity tasks for a “normal” knowledge worker and includes operations with Microsoft Office, Internet Explorer with Flash, printing, and PDF viewing.
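
To make that sequence concrete, here is a minimal Python sketch of one test run. Every function in it is a hypothetical stub standing in for the real tooling (the monitoring scripts, the hypervisor and XenDesktop APIs, and the Login VSI launchers); only the ordering of the phases reflects the actual test flow.

```python
import time

# Hypothetical stubs: each stands in for real tooling (monitoring scripts,
# hypervisor/XenDesktop APIs, Login VSI launchers), not an actual API.
def start_monitors(targets):    print("monitoring:", ", ".join(targets))
def exit_maintenance(count):    print(f"{count} desktops out of maintenance mode")
def power_on(count):            print(f"booting {count} desktops")
def wait_registered(count):     time.sleep(0)   # would poll the Delivery Controllers
def launch_sessions(users):     print(f"ramp-up: launching {users} sessions")
def run_workload(users, name):  print(f"steady state: {users} users on '{name}' workload")
def log_off(users):             print(f"logging off {users} sessions")

def run_test(desktops, users):
    start_monitors(["hypervisor", "desktops", "storage", "load generators"])
    exit_maintenance(desktops)       # take desktops out of maintenance mode
    power_on(desktops)               # start the virtual machines
    wait_registered(desktops)        # wait for registration before launching
    launch_sessions(users)           # ramp-up phase: user logins
    run_workload(users, "Medium")    # steady state: Login VSI Medium workload
    log_off(users)                   # logoff phase ends the run

run_test(desktops=180, users=180)
```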

Login VSI loops through specific operations and measures response times at regular intervals. The response times determine Login VSImax, the maximum number of users that the test environment can support before performance degrades consistently. Because baseline response times vary with the virtualization technology used, a dynamically calculated threshold based on weighted measurements provides greater accuracy for cross-vendor comparisons. For this reason, we also configured the Login VSI software to calculate and report a VSImax Dynamic response time.
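
If you are curious how a dynamic threshold works in practice, here is a small sketch. It assumes the Login VSI 3.x formula (threshold = weighted baseline response time × 125% + 3000 ms, as I recall from the Login VSI documentation), and the response-time samples are invented purely for illustration.

```python
# Sketch of the VSImax Dynamic idea: derive the response-time cutoff from
# the measured baseline rather than using a fixed value. Assumes the Login
# VSI 3.x formula (threshold = baseline * 1.25 + 3000 ms); data is invented.

def vsimax_dynamic(response_ms, baseline_samples=15, consecutive_needed=3):
    """Session count at which response times consistently exceed the
    dynamically calculated threshold (None if performance never degrades)."""
    baseline = sum(response_ms[:baseline_samples]) / baseline_samples
    threshold = baseline * 1.25 + 3000     # dynamic threshold in milliseconds

    streak = 0
    for sessions, rt in enumerate(response_ms, start=1):
        streak = streak + 1 if rt > threshold else 0
        if streak == consecutive_needed:   # "consistently" above threshold
            return sessions - streak + 1
    return None

# Invented curve: response times creep upward as sessions are added.
samples = [1400 + 8 * n + n * n // 50 + (2000 if n > 195 else 0) for n in range(220)]
print(vsimax_dynamic(samples))   # prints the session count where degradation began
```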

We conducted both single server and multiple server scalability tests, performing three test runs during each test cycle to verify the consistency of our results. The test phases included:

  1. Determining single server scalability limits. This phase determined Login VSImax for each scenario (VDI or RDS) on a single blade. In each case, we scaled user density until VSImax was reached, which in our tests coincided with CPU utilization hitting 100%.
  2. Validating single server scalability under a maximum recommended density with VDI and RDS loads. The maximum recommended load for a single blade occurs when CPU utilization peaks at 90-95%.
  3. Validating multiple server scalability on each workload cluster. Using multiple blade servers, we created a separate baseline for the VDI and RDS workloads, testing each workload cluster independently.
  4. Validating multiple server scalability on a combined workload. After determining the baseline for each workload cluster, we combined the workloads to achieve a full-scale, mixed workload result.

Main Findings

Phase 1: Single Server Scalability Tests

In the first set of tests, the single server scalability tests, we determined Login VSImax for hosted virtual desktops (VDI) and hosted shared desktop sessions (RDS) on a single blade. The table below summarizes the test configurations.

Parameter | Hosted Virtual Desktops | Hosted Shared Desktops
--- | --- | ---
Virtual CPUs | 1 vCPU | 5 vCPUs
Memory | 1.5 GB | 24 GB
vDisk size | 40 GB | 60 GB
Write cache size | 6 GB (thin) | 50 GB (thin)
Virtual NICs | 1 VMXNET3 NIC | 1 VMXNET3 NIC
vDisk OS | Microsoft Windows 7 Enterprise (x86) | Microsoft Windows Server 2012
Additional software | Microsoft Office 2010, Login VSI 3.7 | Microsoft Office 2010, Login VSI 3.7
Test workload | Login VSI “Medium” workload | Login VSI “Medium” workload

To find Login VSImax for hosted virtual desktops on a single blade, we launched 220 Windows 7 SP1 desktop sessions running the Medium workload (including Adobe Flash content). As Figure 1 shows, VSImax was reached at 202 users.

Note: The hosted virtual desktops were configured with the Windows Basic desktop theme. Windows 7 enables the Aero Glass theme by default, which consumes more host CPU resources and reduces user density by roughly 30%. This is because XenDesktop 7.x includes a software-based virtual GPU (vGPU) as part of the Virtual Desktop Agent, which both Windows and applications can use for compatibility purposes. During GPU detection, Windows and applications enable hardware acceleration and use XenDesktop’s vGPU to software-render Aero (and DirectX), which consumes additional cycles on the host server’s processors. Unless desktop composition (e.g., Aero, DWM) is needed, we recommend disabling Aero to reach the full scalability potential of the solution. The Windows Basic theme can be configured by applying the following policy:

  • Policy: User Configuration\Policies\Administrative Templates\Control Panel\Personalization\Load a specific theme
  • Path to theme file: C:\Windows\Resources\Ease of Access Themes\basic.theme
  • Note: user profiles must be recreated when this policy setting is enabled or disabled


Figure 1: Hosted Virtual Desktops, Single Server Results

We then tested the scalability of hosted shared desktop sessions on Windows Server 2012 on a single blade. We launched 240 user sessions on a single Cisco UCS B200 M3 blade and reached VSImax at 211 users (Figure 2).

Figure 2: Hosted Shared Desktops, Single Server Results

In an optimally configured scalability test, CPU is typically the limiting resource, and that is what we saw in both the VDI and RDS single server tests. Compared with past tests on previous-generation Intel Xeon “Sandy Bridge” processors, the same Cisco UCS B200 M3 hardware with dual 10-core 2.7 GHz Intel Xeon E5-2680 v2 “Ivy Bridge” processors supported about 25% greater user density. The new processors thus enable an ultra-compact solution that can support up to 2000 users.


Phase 2: Single Server Scalability, Maximum Recommended Density

After testing single server scalability, the next step was to determine the maximum recommended capacity for a single blade. The maximum recommended density occurs when CPU utilization peaks at 90-95%.
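
Put differently, the recommended density is the highest session count whose host CPU stays at or below that ceiling. A minimal sketch of that selection follows; the CPU curve is fabricated for illustration (and chosen to land on the 180-desktop VDI result reported below), whereas the real values came from hypervisor performance counters during the test runs.

```python
# Pick the maximum recommended density: the highest measured session count
# whose host CPU utilization stays at or below the 95% ceiling.
# The CPU curve below is fabricated for illustration; real values came from
# hypervisor performance counters (e.g., esxtop) during the test runs.

def recommended_density(cpu_by_sessions, ceiling=95.0):
    return max(s for s, cpu in cpu_by_sessions.items() if cpu <= ceiling)

curve = {20: 14, 40: 26, 60: 38, 80: 49, 100: 60,
         120: 70, 140: 79, 160: 87, 180: 94, 200: 99}   # sessions -> CPU %
print(recommended_density(curve))   # -> 180
```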

Under a VDI workload, the maximum recommended density was 180 hosted virtual desktops on a single Cisco UCS B200 M3 blade server with dual Intel Xeon E5-2680 v2 processors and 384 GB of RAM. Each virtual machine ran Microsoft Windows 7 (32-bit) and was configured with 1 vCPU and 1.5 GB of RAM. Figure 3 shows VSImax as the load scales, along with CPU, memory, and network metrics. (Storage metrics are discussed in a separate section toward the end of this blog.)

Figure 3: Hosted Virtual Desktops, Single Server Results under Recommended Load

For hosted shared desktops (RDS), the maximum recommended workload was 220 users for each Cisco UCS B200 M3 blade (with dual E5-2680 v2 processors and 256 GB of RAM). On each blade we configured eight Windows Server 2012 virtual machines, each with 5 vCPUs and 24 GB of RAM. Figure 4 shows VSImax as the load scales, along with CPU, memory, and network metrics.

Figure 4: Hosted Shared Desktops, Single Server Results under Recommended Load


Phase 3: Workload Cluster Scalability Testing

In this phase, we created separate workload clusters, one for VDI and one for RDS, and tested them independently. In an N+1 configuration, if a single blade is unavailable due to scheduled maintenance or unplanned downtime, then the remaining servers can absorb the extra load. Our goal was to create a highly available architecture so that the design could sustain acceptable performance even in the event of a single blade failure.

To support a total of 550 VDI users, we used four Cisco UCS B200 M3 blades, configuring them with virtual machines as in the single server/maximum recommended load testing (1 vCPU and 1.5 GB of RAM per VM). For the VDI workload cluster testing, Figure 5 shows VSImax and the CPU, memory, and network metrics for a representative blade server.

Figure 5: Hosted Virtual Desktops, Multiple Server/Workload Cluster Baseline

To deliver 1450 hosted shared desktops, we used eight Cisco UCS B200 M3 blades and configured them as in the single server/maximum recommended load testing (each blade hosted eight Windows Server 2012 virtual machines, and each VM was allocated 5 vCPUs and 24 GB of RAM). Again, our goal was to design a configuration that could sustain a single blade failure and continue to support 1450 RDS users. Figure 6 shows VSImax along with collected metrics for a single representative blade.

Figure 6: Hosted Shared Desktops, Multiple Server/Workload Cluster Baseline
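
To make the N+1 math concrete, here is a quick back-of-the-envelope check, a sketch using the cluster sizes above and the Phase 1 VSImax results rather than anything from the actual test harness: with one blade down, the per-blade session load in each cluster should stay below the single-server VSImax.

```python
# N+1 sanity check using the numbers from this validation: the per-blade
# load with one blade down must stay below the single-server VSImax
# measured in Phase 1 (202 VDI sessions, 211 RDS sessions).

clusters = {
    #        (users, blades, single-server VSImax)
    "VDI": (550, 4, 202),
    "RDS": (1450, 8, 211),
}

for name, (users, blades, vsimax) in clusters.items():
    normal = users / blades            # per-blade load, all blades healthy
    degraded = users / (blades - 1)    # per-blade load after one blade fails
    verdict = "OK" if degraded < vsimax else "over limit"
    print(f"{name}: {normal:.0f}/blade normal, {degraded:.0f}/blade on failure "
          f"(VSImax {vsimax}) -> {verdict}")
```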

Phase 4: Full-Scale, Mixed Workload Scalability Testing

In this final phase of testing, we validated the solution at scale by launching Login VSI sessions against both the VDI and RDS clusters concurrently. Cisco’s testing protocol requires that all sessions be launched within 30 minutes and that all launched sessions become active within 32 minutes.

Our validation testing imposed aggressive boot and login scenarios on the 2000-seat mixed desktop workload. All VDI and RDS VMs booted and registered with the XenDesktop 7.1 Delivery Controllers in under 15 minutes, demonstrating how quickly this desktop virtualization solution could be available after a cold start. Our testing simulated a login storm of all 2000 users, yet every user logged in and began running workloads (reaching steady state) within 30 minutes without exhausting CPU, memory, or storage resources.

Figure 7 shows the results of the full-scale, mixed workload tests, including graphs of VSImax and the CPU, memory, and network metrics collected on representative VDI and RDS servers.

Figure 7: Full-Scale, Mixed Workload Scalability Results

Storage Performance

The storage configuration included a NetApp FAS3240 two-node cluster running clustered Data ONTAP 8.2 and four shelves of 450 GB, 15K RPM SAS drives. Each node was configured with a 512 GB Flash Cache card. User home directories were configured on CIFS shares, while PVS write caches for the VDI and RDS workloads were hosted on NFS volumes.

Because the testing took place at NetApp’s facility, we captured extensive storage performance data during each phase: boot, login, steady state, and logoff. During the full-scale testing of the combined VDI and RDS workloads, we recorded storage metrics on the NetApp array; they are summarized below.

Overall, the storage configuration easily handled the 2000-seat workload, with an average read latency under 3 ms and an average write latency under 1 ms. The PVS write cache showed a peak average of 10 IOPS per desktop during the login storm, with steady state generating 15-20% fewer I/Os in all configurations. The average write cache I/O was 8 KB in size, and 90% of write cache I/Os were writes. NetApp Flash Cache reduced the number of disk IOPS during the I/O-intensive boot and login phases once the cache was warm.
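
Translating those per-desktop figures into aggregate load on the array is simple arithmetic; the sketch below uses the measured numbers above, taking the conservative 15% end of the steady-state reduction.

```python
# Aggregate write-cache load implied by the measured figures:
# 2000 desktops x 10 IOPS peak (login storm), 15-20% fewer in steady state,
# ~8 KB average I/O size, ~90% of write-cache I/Os being writes.

desktops = 2000
peak_iops_per_vm = 10
io_size_kb = 8
write_fraction = 0.90

peak_iops = desktops * peak_iops_per_vm             # 20,000 IOPS at login storm
steady_iops = peak_iops * (1 - 0.15)                # ~17,000 IOPS in steady state
peak_throughput_mb = peak_iops * io_size_kb / 1024  # ~156 MB/s at 8 KB per I/O
peak_write_iops = peak_iops * write_fraction        # ~18,000 of those are writes

print(f"login storm: {peak_iops:,} IOPS (~{peak_throughput_mb:.0f} MB/s, "
      f"{peak_write_iops:,.0f} writes); steady state: ~{steady_iops:,.0f} IOPS")
```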

Conclusion

The testing validated the linear scalability of this Cisco, NetApp, VMware, and Citrix reference architecture. The tested configuration easily supported a 2000-seat XenDesktop workload mix of VDI and RDS users without overwhelming CPU, memory, network, or storage resources. In addition, the reference architecture defines an N+1 blade configuration, creating a highly reliable and fault-tolerant design for hosted virtual desktops, hosted shared desktops, and infrastructure services. And the new generation of Intel “Ivy Bridge” processors supports about 25% greater capacity per server than previous-generation processors, allowing the system under test to support 2000 users in just 32 rack units of a single rack, conserving power and data center floor space.

Maybe it’s because I love what I do, but I think these results are pretty exciting. Check out the full CVD posted on the Cisco web site.


— Frank Anderson, Principal Solutions Architect with Citrix Worldwide Alliances