As a performance testing consultant, I get to see a lot of the work that other performance testers do…and a lot of the time it horrifies me. If performance testing were a licensed profession, like law or medicine, then it would be necessary to revoke the licenses of 90% of testers. Their work is not just “low quality”; it is actually wrong or misleading, or based on such a shaky foundation that any predictions made from the test results are tenuous at best.
Bad performance testers frequently treat their testing as a ritual rather than as a science, and they either ignore errors or don’t bother to check for them in the first place.
Here is an example that I saw last week…
I was sent a report on a 12-hour Soak Test that had been run for a business-critical, public-facing website with a large userbase – the kind of website that journalists love to write snarky little articles about when it doesn’t work. The report was one of the ones that can be automatically exported from LoadRunner Analysis (this is usually a bad sign). Unfortunately I can’t show you the whole report, but here is the graph of Running Users:
The Soak Test attempted to maintain a steady-state workload of 21,000 vusers for 12 hours (minus some time for ramp-up/ramp-down) but, over the course of the test, more than 10,000 vusers experienced a fatal error. The test report does not note that the test showed a catastrophic failure, and no severity 1 defect was ever raised.
If they believed that the errors they were seeing during their test were due to problems with their LoadRunner scripts, rather than the system they were testing, then they should have fixed their scripts and re-run the test before releasing results to stakeholders.
Other WTF moments in the report:
- It was a Soak Test, which typically finds resource-related problems such as memory leaks, but no system monitoring had been set up.
- The Throughput graph showed periodic spikes of higher throughput, as if their vusers were synchronised and “marching in step”. This is not very realistic.
- They have included a graph of errors over time, but have only referred to errors by their code. I hope their audience knows what a -26612 or a -27791 error is.
- They have included a table of transactions with response times (min, average, max, standard deviation, 90th percentile) and transaction counts (pass, fail, stop); but they have only highlighted a single line – the only transaction with an average response time over 2 seconds. They have not highlighted any of the transactions with thousands of errors.
Sometimes I think that calling what we do “performance testing” encourages people to think that their job is just about measuring response times, and to ignore application stability under load.
- They have focussed on average response times (measured over the entire test) and only failed one transaction, but there was a 1-hour period of quite bad response times. If the average response time were measured over just that hour, then average response times for most of the transactions would be above the 2-second SLA.
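To make that last point concrete, here is a minimal Python sketch of how a whole-test average can hide a bad hour. The sample data is entirely made up for illustration (it is not from the report), and the function names and the one-sample-per-minute layout are my own assumptions; only the 2-second SLA and the 12-hour duration come from the test described above.

```python
from collections import defaultdict
from statistics import mean

SLA_SECONDS = 2.0  # the 2-second SLA mentioned in the report


def windowed_averages(samples, window_seconds=3600):
    """Average response time per fixed-size window.

    samples: iterable of (elapsed_seconds, response_time_seconds) pairs.
    Returns {window_index: average}, with window 0 being the first hour.
    """
    buckets = defaultdict(list)
    for elapsed, response_time in samples:
        buckets[int(elapsed // window_seconds)].append(response_time)
    return {window: mean(times) for window, times in sorted(buckets.items())}


def breaching_windows(samples, sla=SLA_SECONDS, window_seconds=3600):
    """Indices of the windows whose average response time exceeds the SLA."""
    averages = windowed_averages(samples, window_seconds)
    return [window for window, avg in averages.items() if avg > sla]


# Hypothetical data: a 12-hour test with one sample per minute.
# Response times are 1 second, except for one bad hour at 6 seconds.
samples = [
    (minute * 60, 6.0 if 300 <= minute < 360 else 1.0)
    for minute in range(720)
]

overall = mean(response_time for _, response_time in samples)
print(f"average over the entire test: {overall:.2f}s")
print("hours with an average over the SLA:", breaching_windows(samples))
```

With this invented data, the average over the whole test comes out at about 1.42 seconds and passes the SLA, while the sixth hour (window index 5) averages 6 seconds and clearly fails it. Reporting per-window figures, or at least looking at them before signing off, stops a long quiet period from diluting a short bad one.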
I see bad testing everywhere, but I see more of it from low-cost outsourcing companies. I can only assume that CIOs who engage these companies are pleased that they can get an incorrect performance test result for a cheaper price than a correct one.