I’m often faced with reports of problems with our software. That’s not because XenServer is particularly buggy, but mainly a function of how many people now use it! Our support teams at Citrix, and at partner companies, do an admirable job of helping customers, but sometimes they really have come across something that needs code changes. That’s when a good bug report really speeds up diagnosis and fixing.
Whilst there are lots of articles about how you should report bugs in software in general, I’ll try to make this post XenServer-specific. Here goes…
What’s a Bugtool Anyway?
The good news is that XenServer can package the vast majority of the necessary log files into what we call a “bugtool”. You can create one of these by running

xen-bugtool

from the Dom0 command line. From XenCenter, this is known as collecting a “Server Status Report”, available on the “Tools” menu. A bugtool is an absolute requirement for any problem report. Try to collect it just after you experience the problem (don’t wait a few days). Also, when providing a bugtool, give a rough idea of what time the problem took place, as it makes it much faster for us to find the relevant entries in the log files.
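As a sketch, a typical non-interactive collection from the Dom0 shell looks like this (the --yestoall flag, which accepts the default answer to every prompt, is the commonly documented option; the exact output path can vary between versions):

```shell
# Collect a full status report without interactive prompts.
xen-bugtool --yestoall
# The tool prints the path of the archive it wrote (typically a
# compressed tarball under /var/opt/xen/bug-report/); attach that
# file to your problem report.
```

Note the system time when you run it, so you can tell support roughly when the problem occurred relative to the log entries.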
What Happens If I Can’t Install?
Sometimes XenServer can’t be installed (perhaps because we’re very cautious about not deleting partition layouts unless we can confirm you don’t want to keep them!). If that happens, you’ll want to collect the installer logs.
To do this, use the installer as normal, and when you hit the error, switch to another virtual terminal (CTRL+ALT+F2) to get a shell. From there you can copy the installer logs onto a USB stick or a network share, and then provide them to us.
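As a rough sketch of the USB route (the log location /tmp/install-log and the device name /dev/sdb1 are assumptions; check your installer version and hardware before relying on either):

```shell
# From the shell on the second virtual terminal.
# Mount the USB stick (device name is an assumption; check with fdisk -l).
mkdir -p /mnt/usb
mount /dev/sdb1 /mnt/usb
# Copy the installer log (path assumed; verify where your installer writes it).
cp /tmp/install-log /mnt/usb/
umount /mnt/usb
```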
Beware of Comparisons
Sometimes users purchase new hardware, install XenServer on it, and report that their virtual machines are performing less well than on their old hardware. Of course, there have been times where we’ve messed up! However, most of these reports turn out to be because of the complexities of doing good benchmarking.
For example, if you purchase a processor from one vendor that has 4 cores, and then a processor from a different vendor that has 8 cores, intuitively tasks should run twice as fast. But if the second processor runs at half the clock speed of the first, matters will clearly be different. More subtly, if the workload you’re running isn’t designed to use multiple cores, it may run far better on one high clock-speed core than on several half-speed cores.
So, why “beware of comparisons”? When benchmarking, only change one thing at a time. In the example above, we changed all three of the number of cores, clock speed, and processor vendor. We need to understand which of those factors might cause a difference in performance, and hence ideally we change each in isolation. Of course, sometimes that’s not practical, but be aware of the pitfalls.
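To make the clock-speed point concrete, here is a toy calculation with hypothetical figures (4 cores at 3.0 GHz versus 8 cores at 1.5 GHz; the numbers are illustrative, not real parts):

```shell
# Hypothetical CPUs: aggregate capacity is identical, but a
# single-threaded workload only ever sees one core's clock speed.
OLD_CORES=4; OLD_MHZ=3000   # 4 cores @ 3.0 GHz
NEW_CORES=8; NEW_MHZ=1500   # 8 cores @ 1.5 GHz

echo "old aggregate: $((OLD_CORES * OLD_MHZ)) MHz"   # 12000 MHz
echo "new aggregate: $((NEW_CORES * NEW_MHZ)) MHz"   # 12000 MHz
echo "old single-thread: $OLD_MHZ MHz"               # 3000 MHz
echo "new single-thread: $NEW_MHZ MHz"               # 1500 MHz
```

Despite having twice the cores, the second machine runs a single-threaded benchmark at half speed, which is exactly why changing one variable at a time matters.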
Another point is how you benchmark. Your aim should be to see whether the system performs well for the workload you intend to run on it. Thus, benchmarking a 48-core host by running one VM on it probably isn’t that interesting (or, indeed, what the machine was designed to do!). Instead, run (e.g.) 24 VMs at once, with two virtual CPUs each. Moreover, be very precise about what you used: exactly which OS versions are in the VMs, what test(s) you performed within them (e.g. compiling a particular piece of code), and what the results were. We can then set up an identical test in our labs. Remember: the simpler the test that illustrates the problem, the faster we can work out its cause.
Storage Versus Networking
Many bug reports concern storage problems, probably because there are so many “moving parts” in the chain between a host and an external array. What’s interesting is that a lot of storage traffic actually flows over Ethernet networks, and hence a network problem can actually manifest itself in poor disk performance.
Concretely, for NFS and iSCSI (unless using an iSCSI hardware HBA) storage repositories, the first item on your checklist should be whether the throughput you can obtain over your network is reasonable. I tend to use iperf for this, between the XenServer host and a machine on the same network as the target storage array. Depending on what protocol you’re using to talk to the storage, you may want to use iperf in UDP or TCP mode, but either will give you a rough idea. If network performance is poor, concentrate on fixing that before worrying about why storage performance is low.
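As a sketch of that first checklist item (iperf version 2 flags; the hostname is a placeholder), run the server side on a machine on the storage network and the client side from the Dom0 shell:

```shell
# On a machine on the same network as the storage array:
iperf -s                # start a TCP server (add -u for a UDP server)

# On the XenServer host (Dom0); replace storage-net-host with the
# server machine's address:
iperf -c storage-net-host -t 30           # 30-second TCP throughput test
# For a UDP test, specify a target bandwidth to attempt, e.g. 1 Gbit/s:
iperf -u -c storage-net-host -b 1000M -t 30
```

If the reported throughput is far below what your network should deliver, chase the network problem first; the storage numbers will never be better than the pipe they travel over.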
As with any piece of software, providing as much detail as possible about exactly how to reproduce the problem, or the steps that led up to it, really helps. Assume that nothing is too obvious: taking two minutes to provide the extra detail may well save you days later. If your problem has to be escalated beyond a support team in your time zone, it’s likely to go to an engineering team elsewhere in the world, so every question-and-answer cycle might take another 12 hours. Hence, it’s always worth being specific about exactly what messages were shown on the screen if the server “crashed”, or exactly which benchmark was used to determine that something was running “slowly”.
Of course, I hope you never have to report a problem on XenServer: customers run thousands and thousands of machines with it very successfully. But if things do break we’ll help you get running again. Just remember to give us the bugtool!