@vcdxnz001 That sort of thing has happened to me too. BTW, one day you should do a how to perf test article.
— Michael White (@mwVme) January 5, 2017
This all started from a tweet in response to some test results that I had written about in my article Disable Windows Disk Write Cache for Data Integrity and Better Performance. Michael White suggested I write something up about how to perf test. Thanks for the inspiration, Michael. So this is it, a simple 101 introduction to performance testing. I’m pretty sure it’s not exactly what you will expect, just like the outcomes of the tests in my previous article. I use these same techniques in my job at Nutanix when performance testing new application workloads.
Performance testing can be a lot of fun. It can also be very tedious, and sometimes frustrating. There are highs and lows. You can break things, and spend hours figuring out what went wrong. Like professional test driving for a car company, you often do hundreds or thousands of test iterations of very similar configurations to prove or disprove a theory, or to measure differences in configurations to find the optimal combination. This could be seen as mundane. For example, in my article VMware vSphere 5.5 Virtual Network Adapter Performance I ran over 2,000 individual combinations of tests to get the results, with each combination run multiple times. But overall it is very similar to performing scientific experiments in other fields. If you are not familiar with the scientific method, check out here and here.
Before we get into the how of performance testing, let’s first look at what it is and what makes it different from other types of testing. Broadly, in IT there are two different types of testing. Functional testing checks that a system performs the functions it was designed to perform, in the way it was designed to perform them (a red screen is red). Non-functional testing is where performance comes into it. A non-functional test might be performed to prove that a system can perform a function a certain number of times over a given period with the same result, or to see the maximum number of times something can be performed in a given time period, rather than to check the colour of a screen or the text in an error message.
Often performance tests will try and find the limit (before response time degrades or the system breaks), or headroom available in a particular system, or the response time under a defined load, or be designed to find the optimal configuration. In all cases, they are measured against requirements, and a baseline or control. There are quite a few types of performance testing, such as benchmark testing, load testing, soak testing, stress testing, unit performance testing, integration performance testing, spike testing, headroom testing, failure performance testing.
What follows is quite a simplified and basic description of some of the high level areas that are important. There has been quite a bit written on the topic over time and for different types of systems and applications, such as this article about hyperconverged performance.
Step 1: The Question, Objective, or Success Criteria
Before you start anything you need to decide what question you are looking to answer, or what the objectives of your test are. What are you trying to learn from the testing? For example, in Disable Windows Disk Write Cache for Data Integrity and Better Performance, I was trying to find out what the impact on performance was when turning off the disk write cache in Windows. If you are testing an application, your objective might be to prove whether the application can meet your business performance requirements, such as x load over y time with z response time. You might ask: what is the maximum value of x I can expect for this particular configuration before response time is impacted? I particularly like answering questions like where does this system break, to the point that it can no longer process transactions, or what happens when you subject a system to 10x expected load. I also like questions such as what happens under heavy load if a particular component malfunctions or fails.
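The "x load over y time, with z response time" style of objective can be written down before any test is run. The following is only a minimal sketch in Python: the class names, field names and numbers are all invented for illustration and are not taken from any of the tests mentioned in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessCriteria:
    """Hypothetical pass/fail thresholds agreed before testing starts."""
    min_throughput_tps: float   # x: load the system must sustain
    min_duration_s: float       # y: how long that load must be held
    max_response_ms: float      # z: worst acceptable response time


@dataclass(frozen=True)
class RunResult:
    """Measured outcome of one test run (field names are illustrative)."""
    throughput_tps: float
    duration_s: float
    response_ms: float


def meets_criteria(result: RunResult, criteria: SuccessCriteria) -> bool:
    """Return True only if every threshold is satisfied."""
    return (
        result.throughput_tps >= criteria.min_throughput_tps
        and result.duration_s >= criteria.min_duration_s
        and result.response_ms <= criteria.max_response_ms
    )


# Example values are made up, not results from a real system.
criteria = SuccessCriteria(min_throughput_tps=500, min_duration_s=3600, max_response_ms=50)
run = RunResult(throughput_tps=520, duration_s=3600, response_ms=42)
print(meets_criteria(run, criteria))  # True
```

Writing the criteria down as data like this makes it harder to move the goalposts after the results come in.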
If you are performing benchmark testing with defined industry benchmarks then others will most likely have defined the questions you need to answer, such as how many transactions per second and what system response time you received from a certain configuration under a load factor of a certain number of virtual users for a sustained period of time. There are many different types of benchmarks. Some of the most useful are from the Transaction Processing Performance Council (TPC) for database type workloads, and the Standard Performance Evaluation Corporation (SPEC) for many other types of workload. Many vendors also specify their own performance tests, such as the SAP Sales and Distribution (SD) 2 Tier Benchmark.
Step 2: The Hypothesis or Expected Result
Once you know what the question is, you should define what you expect: what do you think will happen, or what do you require as the result? With the example I used earlier regarding disk write cache, I could have expected write performance to reduce when cache is disabled. After all, the idea of caching is to increase performance, such as IOs per second (IOPS), and to reduce write latency / response time. Another example might be that you expect a certain number of transactions per second at a certain response time from an application that is being used concurrently by a certain number of real users, with a given transaction mix. Just as with the questions, the expected results or hypothesis can get fairly complicated. Often you can use prior research done by others to form the hypothesis or expected results, or as a guideline as to what you should expect, such as reviewing the published results on the TPC.org or SAP web site if you are performing a benchmark.
Step 3 A: Testing Part A – Baseline or Control
Regardless of what you are testing you need to have a point to measure from. This is as true in system performance testing as in other scientific experiments. There is a baseline or a control that can be compared to the thing that is being changed. In a medical trial, for example, there is usually a placebo that contains nothing of the drug being tested, and then there is the real drug. In system performance testing we would usually have a simple standard configuration installed that meets the functional requirements, and test that to determine what the baseline results are. Alternatively we may have one system configured as per the current standard, and then another system that gets modified over time that we run continuous tests against and then compare to the original. Your baseline or control system should not be modified during the testing period, in case you need to re-run a baseline test and collect more information.
Let’s say you have an existing system based on a certain version of an application, and you are planning to upgrade to a new version. You could perform a standard set of tests based on your business requirements against a non-production system that is configured similarly to your current production system, and measure the results. This becomes the baseline or control. Then you upgrade that non-production system to the next version and perform the same tests again. After this you can compare the results to find out what the differences are, and see if they meet your requirements.
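Comparing an upgraded system to its baseline usually comes down to a percentage change per metric. Here is a small Python sketch of that arithmetic. The metric names and all of the numbers are hypothetical and are only there to show the calculation.

```python
def percent_change(baseline: float, candidate: float) -> float:
    """Change of candidate relative to baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero to compute a percentage")
    return (candidate - baseline) * 100.0 / baseline


def compare_to_baseline(baseline: dict, candidate: dict) -> dict:
    """Percent change for every metric recorded in the baseline run."""
    return {name: percent_change(baseline[name], candidate[name]) for name in baseline}


# Invented figures: same test, before and after a version upgrade.
baseline = {"tps": 1000.0, "p95_response_ms": 20.0}
upgraded = {"tps": 1150.0, "p95_response_ms": 18.0}

for metric, change in compare_to_baseline(baseline, upgraded).items():
    print(f"{metric}: {change:+.1f}%")
```

Note that the sign means different things per metric: a positive change is good for throughput and bad for response time, so the comparison still has to be read against the requirements.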
Chapter 11 in the book I co-authored titled Virtualizing SQL Server with VMware: Doing IT Right (VMware Press 2014), focused on baselining, which is an important part of virtualizing any applications to ensure you achieve acceptable business outcomes.
Step 3 B: Testing Part B – System Under Test and Performance Test Iterations
Now that you have your baseline you can perform multiple iterative tests on the system under test, which just means the system you are testing. In order to have a valid result the tests must be repeatable. This means that each configuration or system modification needs multiple tests. You may do 3 or 5 identical tests of the same configuration before making a change and repeating the tests. You should keep the modifications between test runs to a minimum so you can easily tell which setting has resulted in which change. It is not always possible due to time constraints to make just a single change between test runs, but that is the ideal. Otherwise how do you determine which configuration change made the difference? If you are testing many combinations the number of test iterations can easily reach into the thousands. In the testing of the disk write cache setting I had a very defined test and only a single parameter to change between each test, however I had multiple types of IO (read and write) to test, and different patterns and sizes. The combinations can easily increase exponentially, so you need to decide which tests are the most important. The important thing to remember here is that you need to have a repeatable test and multiple consistent results (3 to 5) before your test can be considered valid and before you move on to change configurations. I usually do 3 per iteration.
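To get a feel for how quickly the combinations add up, and for what "consistent results" could mean in practice, here is a rough Python sketch. The parameter values are invented, and the 5% coefficient of variation threshold is purely an assumption for illustration; pick a tolerance that suits your own system.

```python
import itertools
import statistics


def build_test_matrix(parameters: dict) -> list:
    """Every combination of the given parameter values, one dict per test."""
    names = list(parameters)
    value_lists = [parameters[name] for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


def is_repeatable(samples: list, max_cv: float = 0.05) -> bool:
    """True if at least 3 runs agree within max_cv (stdev / mean).

    The 5% default is an assumed tolerance, not an industry rule.
    """
    if len(samples) < 3:
        return False
    mean = statistics.mean(samples)
    if mean == 0:
        return False
    return statistics.stdev(samples) / mean <= max_cv


# Hypothetical parameters loosely modelled on an IO test.
parameters = {
    "write_cache": ["enabled", "disabled"],
    "io_type": ["read", "write"],
    "pattern": ["random", "sequential"],
    "block_size_kb": [4, 8, 64],
}
matrix = build_test_matrix(parameters)
runs_per_combination = 3
print(len(matrix), len(matrix) * runs_per_combination)  # 24 72

print(is_repeatable([1000, 1010, 990]))   # True: runs agree closely
print(is_repeatable([1000, 1400, 700]))   # False: too much variation
```

Even this small example needs 72 runs, which is why deciding up front which combinations matter is worth the effort.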
Step 4: Monitoring and Analyzing Results
There are a lot of metrics that you could measure for every system under test that can help with analyzing the results. Depending on the progress of the testing you might want to dial the detail and frequency of metrics collection up or down, and increase or decrease the number of metrics that are monitored. Performance testing can generate a lot of data for every test iteration, so being selective about which metrics to monitor, and only selecting the most important or most relevant ones, is critical for success. You don’t want so few that you miss important information, or so many that you can’t see the relevant data points. The granularity of data points, i.e. the number of metrics and the frequency of collection, will determine how much capacity is required for monitoring data collection.
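The capacity needed for monitoring data is simple multiplication, and it is worth doing before you pick a collection interval. A back-of-the-envelope Python sketch follows. The metric count, intervals and the 16 bytes per sample are all assumed figures, and real collectors add their own overhead on top.

```python
def monitoring_samples(metric_count: int, interval_s: int, duration_s: int) -> int:
    """Number of data points collected in one test run."""
    return metric_count * (duration_s // interval_s)


def monitoring_bytes(samples: int, bytes_per_sample: int = 16) -> int:
    """Approximate raw storage; 16 bytes per sample is an assumed figure."""
    return samples * bytes_per_sample


# Invented example: 50 metrics over a one hour run.
coarse = monitoring_samples(metric_count=50, interval_s=5, duration_s=3600)
fine = monitoring_samples(metric_count=50, interval_s=1, duration_s=3600)
print(coarse, fine)                  # 36000 180000
print(monitoring_bytes(fine) / 1e6)  # 2.88 (MB per run, before overhead)
```

Multiply the per-run figure by the number of runs in your test plan to see whether the finer interval is affordable.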
One of the problems with monitoring is that it can impact results and it can impact performance. You need to try and keep it as lightweight as possible. You also need to keep the monitoring consistent between the baseline and the system under test, otherwise the results will become impossible to compare accurately.
Once you have the data you can start to analyze the results. Analyzing the results between test runs can take almost as long as the testing itself. It can also take your testing plan in different directions depending on the results. If you find a result that is very unexpected you may have to repeat a test with more data collection, or perform some troubleshooting. If everything has gone according to plan, however, you can compare the results to the baseline and determine the differences, and then compare to the requirements and determine if it is a success or not. If your test was just designed to see how fast something goes with a certain configuration you might have an easy result.
In the IO testing I was doing with disk write cache, I was primarily measuring IOPS and latency of different IO sizes and patterns with disk write cache enabled and disabled. There was no defined pass or fail result; the test was just designed to find out what the difference was. However, I got quite an unexpected result. The performance was better with disk write cache disabled than it was with it enabled. Based on this data I performed more iterations to validate that the results weren’t a fluke or a coincidence.
With more complex system testing you may need to run tests at different user loads, or up to the point of system saturation, to see the maximum number of users the system can support, at what transactions per second, and what the response times are. If you are measuring against a set of business non-functional requirements, then you will know the minimum the system needs to achieve in order to be considered a success. Usually when considering migrations, the baseline of the existing system is the starting point and you are trying to achieve x percent improvement over the existing system. It is important to have an accurate baseline so you can say for sure you have achieved the desired outcome, as performance to end users can be quite subjective. But you also need to know you’re measuring the things that are the most important to ensure a good end user experience.
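Finding the maximum supported user load from a stepped load test can be sketched like this in Python. The tuple layout, the 50 ms limit and the figures are made up for illustration. The function deliberately stops at the first load step that breaches the limit, so that a lucky pass at a higher load does not count.

```python
def max_supported_users(results: list, max_response_ms: float):
    """Highest user load reached before response time first exceeds the limit.

    results: (users, transactions_per_second, response_ms) tuples, any order.
    Returns None if even the lightest load misses the requirement.
    """
    supported = None
    for users, _tps, response_ms in sorted(results):
        if response_ms > max_response_ms:
            break  # saturation point reached; ignore heavier loads
        supported = users
    return supported


# Invented load test: throughput flattens and latency climbs at 800 users.
load_test = [
    (100, 500, 20.0),
    (200, 990, 22.0),
    (400, 1900, 35.0),
    (800, 2100, 180.0),
]
print(max_supported_users(load_test, max_response_ms=50.0))  # 400
```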
Step 5: Drawing Conclusions, Communicating Results and Further Research
You’ve done the testing, you’ve crunched the results, now it’s time to draw conclusions, communicate the results and identify areas of further research. In terms of drawing conclusions, I’ll take the graph below as an example. The conclusion you might draw in this case is that the performance of the solution increases almost linearly as you add workload and add resources, in a predictable and consistent manner. Depending on your goals, this could be good or bad, and the amount by which it increases could also be good or bad. It’s just an example and needs to be interpreted in combination with the question / hypothesis or success criteria, the baseline / control, and the requirements. Your conclusion might be that the system configuration tested did not have the resources required to meet the business requirements, or that it was over engineered, performed well beyond expected results, and could be configured with fewer resources.
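One way to put a number on "almost linearly" is a scaling efficiency figure: measured throughput divided by what perfectly linear scaling from the smallest configuration would give. The Python sketch below uses invented node counts and IOPS values, not the figures from any published test.

```python
def scaling_efficiency(results: list) -> list:
    """Efficiency of each configuration relative to the first entry.

    results: (node_count, throughput) pairs, smallest configuration first.
    1.0 means perfectly linear scaling; lower means diminishing returns.
    """
    base_nodes, base_throughput = results[0]
    per_node_base = base_throughput / base_nodes
    return [
        (nodes, throughput / (per_node_base * nodes))
        for nodes, throughput in results
    ]


# Made-up results for illustration only.
measured = [(1, 100_000), (2, 196_000), (4, 380_000), (8, 720_000)]
for nodes, efficiency in scaling_efficiency(measured):
    print(f"{nodes} nodes: {efficiency:.2f}")
```

Whether 0.90 at the largest size is acceptable depends on the success criteria you set at the start, not on the number itself.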
To communicate these results you can use many methods, but this is the time when graphs and pictures paint a thousand words (maybe tables for the very detail oriented). In answer to the question, what is the performance and scalability of IO for a defined number of database VMs on a defined number of servers, you might draw a graph like the following (from Nutanix Performance with Oracle SLOB on All Flash Nodes):
The type of communication and graphs you use will depend on the type of testing you did. You might want to show differences in system metrics as user numbers increase, or under different types of hardware for the same tests. You might present them in a slide deck, in a video, in a blog, or in a system performance report that documents all of the tests you performed and their results, relevant to the stakeholders who are to read the report. Whatever method is used it should be concise, and relate back to the objectives.
Then you might want to propose some further research, such as changing hardware configuration, using a different platform, putting some different patches on the system or using a different version of the system software. One of the goals of the future testing might be to determine how few systems of a newer generation you need. Knowing full well that compute power is increasing over time, you may need half as many assets to run the same workload in the future.
It would be very easy to write a book on performance testing, and there are thousands of resources already available on the topic. The aim of this article is to give a broad brush overview. By using a scientific approach to performance testing you can gain reliable and accurate results that can achieve business outcomes and provide assurance when things change that they can still meet the requirements. Updating baseline tests between versions and keeping good records are all part of the process of performance testing and constant system improvement. Even though the thousands of iterations of tests and small modifications between tests can be mundane, you can get some very exciting and sometimes unexpected results. I wish you successful testing and welcome all your comments.
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com. By Michael Webster. Copyright © 2012 – 2016 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.