A quick follow-up to my earlier post on evidence. As some readers may know, avionics software that controls flight on airplanes (e.g., cockpit software) is subject to a test coverage standard, DO-178B, which the FAA recognizes. That standard applies less stringent coverage requirements to software that is not safety critical.
So far, so good.
Here's an example of why such standards are useful. During my flight from the US to China today, I managed to crash the entertainment software running at my seat not once but three times. I did this by pausing, rewinding, and resuming play while the flight attendants were taking my dinner order (i.e., not by doing anything unusual). I was ultimately able to get it working again, thanks to a series of hard reboots by a flight attendant. One of my fellow passengers wasn't so lucky: his system never recovered.
Okay, that's just entertainment, and anyone who travels regularly knows to bring a book or plan to whittle down their sleep deficit on long flights.
However, what if the flight control software were as easy to crash? Who would want to hear a cockpit announcement along the lines of the following: "Our entire flight control system just crashed. This enormous airliner is now essentially an unpowered and uncontrolled glider. We'll reboot the system until we get it working again, or until we have an uncontrolled encounter with terrain"?
Personally, I want people testing the more safety-critical aspects of avionics software to adhere to higher standards of coverage, and to be able to provide evidence of the same.
I received another interesting e-mail from a colleague a few weeks ago. Sorry about the delay in response, Simon, but here are my thoughts. First, Simon's e-mail:
I have been reading the Advanced Test Manager book and have been discussing the possibility of adopting an informal risk-based approach in my test team, but I am encountering some resistance, which has also got me thinking. You have covered (in several places) the topic of gaps in risk analysis from a breadth point of view, but what about the issue of disparity in 'depth' for identified risk items? For example, in your 'Basic-Sumatra' spreadsheet there is a huge variation in depth between the risk item 'Can't cancel incomplete actions using cancel or back' (a functional item that has a risk score) and 'Regression of existing Speedy Writer features' (also a functional item, but one that may constitute several hundred test cases).
In my case an experienced tester is against the idea of informal risk analysis due to the effort involved. The scenario is one where a regression 'plan' (set of test cases) is already in place for an enterprise scale solution with 10 main components deployable in both a
Web & Windows client manner. So the usual regression test execution 'plan' requires executing a complex test procedure 10x2 times. In total there are several hundred test cases to execute (some components have approx. 100 test cases).
When I suggested applying an informal (PRAM-style) risk identification to each new project, the response was:
The effort of establishing such a 'test plan' seems to be enormous considering that the whole thing has to be performed per application component for each Win and Web client (i.e. 10 x 2 times). I estimate that the number of items requiring risk scoring will be approx 100 for each of the bigger components let alone the whole of the application.
In response to this I pointed out that we could have a 'coarse-grained' risk item identification and score--perhaps 20 lines on the risk assessment spreadsheet, one for each component/deployment combination.
The response to that was:
If each of these 20 lines has got an RPN and all the test cases assigned to it just inherited this RPN, this would mean that we would perform an 8-hour test on 'Securities Win client' before even beginning the test of another component with a lower RPN. Further, this could mean that low-priority components might not be tested at all on a tight time schedule. This cannot be the desired test procedure. It must be ensured that each component is at least tested basically on Win and Web ... which would again lead us to scoring risk items at the test case level within each component for Windows and Web, and that has the problem of the effort involved.
Do you have any suggestions for handling this depth of risk identification issue?
That's a great question, Simon, and it brings up three important points.
First, consider the amount of effort invested. We usually find that the risk analysis can be completed within a week, though the time involved depends on the approach used. With the group brainstorm approach, each participant invests an entire day, and the leader of the risk analysis typically invests a couple of additional days on preparation, creating the analysis, follow-up, and so forth. With the sequential interview approach, each participant invests about three hours (90 minutes in the initial interview and 90 minutes reviewing and approving the document), while the leader again invests about three days of effort.
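To make that trade-off concrete, here is a back-of-the-envelope comparison in Python using the figures above. The participant count of ten is invented for illustration, and I'm assuming 8-hour working days.

```python
# Back-of-the-envelope effort comparison for the two risk analysis
# approaches described above. The participant count is hypothetical.

participants = 10  # invented for illustration
hours_per_day = 8

# Group brainstorm: a full day per participant, plus roughly two
# additional leader days for preparation, analysis, and follow-up.
brainstorm_hours = participants * hours_per_day + 2 * hours_per_day

# Sequential interviews: about three hours per participant, plus
# roughly three leader days.
interview_hours = participants * 3 + 3 * hours_per_day

print(brainstorm_hours, interview_hours)  # 96 vs. 54 person-hours
```

With these (invented) numbers, the interview approach costs roughly half the total person-hours, though it spreads the calendar time out across more days.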
Second, the question of granularity of the risk analysis is also important. The granularity must be fine-grained enough to allow unambiguous assignment of likelihood and impact scores. However, if you get too fine-grained then the effort goes up to an unacceptable level. A proper balance must be struck.
Third, the question of whether we might not test certain important areas at all because they are seen as low risk is indeed a problem. What we typically suggest is what's called a "breadth-first" approach, which means that to some extent the risk-order execution of tests is modified to ensure that all major areas of the software are tested. These areas are tested in a risk-based fashion, but every area gets at least some amount of testing.
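As an illustrative sketch of how that breadth-first modification might work--this is my own toy Python, not a prescribed algorithm, and the component names, risk scores, and test case IDs are all invented:

```python
# Hypothetical sketch of "breadth-first" risk-based test ordering:
# every area gets a minimum number of tests first, then the remaining
# budget is spent strictly in risk order. All data here is invented.

def breadth_first_order(areas, min_per_area=1, budget=None):
    """areas: list of (name, risk_score, test_ids) tuples."""
    ordered = sorted(areas, key=lambda a: a[1], reverse=True)
    schedule = []
    # Pass 1: minimum coverage for every area, highest risk first.
    for name, _, tests in ordered:
        schedule.extend((name, t) for t in tests[:min_per_area])
    # Pass 2: the remaining tests, still in risk order.
    for name, _, tests in ordered:
        schedule.extend((name, t) for t in tests[min_per_area:])
    return schedule if budget is None else schedule[:budget]

areas = [
    ("Securities Win client", 9, ["tc1", "tc2", "tc3"]),
    ("Reporting Web client", 3, ["tc4", "tc5"]),
]
plan = breadth_first_order(areas, min_per_area=1, budget=3)
# Even with a budget of only three tests, the low-risk Reporting
# area still gets its minimum test (tc4) before Securities tc2.
```

The point of the two-pass structure is exactly the guarantee Simon's colleague wanted: no component is squeezed out entirely when the schedule gets tight.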
Many of these topics are addressed in the sequence of videos on risk based testing that you can find on our digital library. I'd encourage interested readers to take a look at those brief videos for more ideas on these topics.
I recently received an interesting e-mail from a colleague:
To Whom It May Concern-
Do you have any articles on the value of collecting/capturing detailed test evidence (e.g., screenshots attached to test cases)?
In my opinion, for mature systems with experienced, veteran testers, the need for an abundance of test evidence in the form of screenshots attached to test runs in QC is overkill and unnecessary, and adds more time to release cycles. The justification for this is always "For Audit" as opposed to "Improves Quality". I looked in several articles on this fantastic site, and couldn't find anything pertaining to test evidence. Do you have any articles that provide evidence that an abundance of test evidence improves quality (even if it's just a correlation and not necessarily causation)?
We have clients that do need to retain such detailed software testing evidence; e.g., clients working in safety critical systems (such as medical systems) who must satisfy outside regulators that all necessary tests have been run and have passed. For them, retaining such evidence is a best practice, as not doing so can result in otherwise-worthy systems being barred from the market due to the lack of adequate paperwork.
As someone who relies on such systems to work--indeed, as we all do--I appreciate these regulations and would not want to see software held to a lesser standard. However, Erik makes a very valid point about the trade-off: time spent on these audit-trail activities is time not spent on other tasks that might result in a higher level of quality. Of course, the audit-trail activities are designed to ensure that all critical quality risks are addressed. So the key question is: how should organizations balance the risk of failing to verify certain critical quality attributes against the reduction in breadth of quality risk coverage?
I'd be interested in hearing from other readers of this blog on their thoughts. Erik, if you have further comments on this matter, I'm sure the readers of this blog would benefit from those ideas, as this is clearly an important area to consider. I certainly agree it's an interesting topic for an article, and this blog discussion may well inspire me to collaborate with you and other respondents to write one.
I had an interesting set of questions from a reader arrive in my inbox today. I've interleaved my answers with his questions, with "RB:" in front.
Dear Mr. Black,
Would you please comment on the following three questions, or perhaps direct me to where I might gain some meaningful information that addresses them?
What is today's trend in pricing for the software testing industry, i.e., is it increasing, decreasing, stable, etc.?
RB: There certainly are what marketers refer to as "value customers" who make service purchase decisions solely on price, and these customers continue to drive down pricing on average. However, at the top end, especially for clients that need and value senior consultants, we have managed to resist that.
Is the service looked at as value added or a commodity, with pricing accordingly?
RB: For the "value customer" mentioned above, it's a commodity. For other customers, it's really a matter of doing a good job of connecting what is happening in testing with strategic business objectives. I talked about this in my chapter in the book, Beautiful Testing. To the extent that testing is very tactical and inward focused--especially when the focus is almost entirely on finding a large number of potentially unimportant bugs--it will be seen as a commodity.
Given that much of the labor is offshore in India and China, and subject to increase as these countries develop, will market be receptive to required price increases to allow a reasonable margin?
RB: The "value customer" will not be receptive to such price increases, because price is all that matters. The value customer will try to have their cake and eat it, too, by raising the minimum bar of qualifications while not allowing price to rise. Because there are billions of under-utilized human brains in the world, and because technology has almost eliminated barriers to entry for using those brains as commodity software testers, the value customer will get to have their cake and eat it, too.
SGS Consumer Testing Services
Randy, thanks for the questions. I talked about some matters relevant to these questions in my webinar on the Future of Test Management, which you can view here.
I'd be interested in other people's comments. What do you think about these questions?
One of the topics I find very interesting and useful for our clients is the proper use of metrics. We do a lot of metrics-related engagements, and in fact just this morning I'll be talking with a client about some US$ 100 million in defect-related waste that we've found in their software development process. I've written a lot on the topic, including in my books and in various articles.
Regular blog reader Gianni Pucciani asks an interesting metrics question in an e-mail:
The question is: how can you give a bonus to your test team, to motivate it, based on 90% of bugs found before the release to production date? How can you know that you found 90% of the bugs at the time you release the software?
Gianni is referring to a metric called defect detection effectiveness or defect detection percentage. This is a metric I've discussed quite a bit in my books, especially Managing the Testing Process.
Defect detection effectiveness measures how well a test process does at finding defects before release. Most test processes have defect detection as a primary objective, and we certainly should have effectiveness and efficiency metrics for our objectives.
That said, it is a retrospective metric that can only be calculated some time after a release, if you intend to calculate it on a release-by-release basis. (Some of our clients calculate it on an annual basis, aggregating all their projects together, which also works.) It's typical to wait 90 days after a release to calculate defect detection effectiveness, though you should verify what time period is actually required to capture, say, 80% or more of the field-reported defects.
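The calculation itself is simple: defects the team found before release, divided by all defects eventually known (team-found plus field-reported within the chosen window). Here's a minimal Python sketch with illustrative, invented defect counts:

```python
# Defect detection effectiveness (DDE): the fraction of all known
# defects that the test team found before release.

def dde(found_before_release, found_in_field):
    """found_in_field counts defects reported within the chosen
    post-release window (e.g., 90 days)."""
    total = found_before_release + found_in_field
    return found_before_release / total if total else 1.0

# Invented numbers: 180 bugs found in testing, 20 reported from
# the field in the 90 days after release.
print(f"DDE = {dde(180, 20):.0%}")  # prints "DDE = 90%"
```

Note that the denominator keeps growing as field reports trickle in, which is exactly why the metric can only be computed retrospectively.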
I could go on for days about this metric, but, since it's a blog and since Gianni asked a specific question, I'll address the other point he brought up, which is the use of this metric for bonuses. Defect detection effectiveness is a process metric, which is not the same as a metric of individual or collective performance. Many things are required to enable good defect detection effectiveness, including good testers, and many things can reduce defect detection effectiveness, some of which are beyond the control of testers. I'd encourage a web search on the string "Deming red bead black bead experiment" for a discussion on the risks of rewarding or punishing based on metrics that might not be entirely in the individuals' control.
In addition, while defect detection is typically a primary objective of testing, it's not the only objective, and defect detection effectiveness is only an effectiveness metric. It doesn't measure the efficiency or the elegance with which the test team detects defects. A test process should have a fully articulated set of objectives, with effectiveness, efficiency, and (ideally) elegance metrics for each objective, rather than a single unidimensional metric by which it is measured.
For further information on defect detection effectiveness, I'd refer people to my book Managing the Testing Process, 3e. My colleague Capers Jones also contributed an article to our web site on a couple related defect metrics that readers might find interesting.
Reader Gianni Pucciani has another good question about the Advanced Software Testing: Volume 2 book. Specifically, he's concerned with question 2 from Chapter 8:
Which of the following is a best practice for retrospective meetings that will lead to process improvement?
A. Ensuring management commitment to implement improvements
B. Allowing retrospective participants to rely exclusively on subjective assessment
C. Requiring that every project include a retrospective meeting in its closure activities
D. Prohibiting any management staff from attending the retrospective meeting
Gianni writes, "I had marked A, but also C. Where is the mistake? I have a feeling on it, but I would like you to confirm. Is C not correct because it is an organizational best practice, and not a best practice for retrospective meetings. A logic trick basically :), is that correct?"
Actually, Gianni, the reason C is not correct is that merely holding retrospectives does not guarantee process improvements. In fact, I've encountered a few situations where organizations were good about holding retrospectives but not so good about management commitment, and thus no improvements occurred.
The Financial Times today featured an article on how a software bug--abysmally handled--in a financial application cost the company US$ 242,000,000:
Because I don't know how long that link will live, here's the summary.
Axa Rosenberg Group had some quantitative analysis software that it used to service its clients accounts. Axa Rosenberg Group manages money for other people, and the software is an internal application, albeit one they touted as a key differentiator, apparently--and indeed it did turn out to be, though not in a happy way.
The software, as released to production in 2007, had a bug that disabled a key risk-management component. Apparently management found out about the bug in November 2009. However, rather than fix the problem, they tried to cover up the reasons for the poor performance of their funds.
Over one third of their customers were affected by the bug.
A wee bit of analysis from yours truly: I have clients in the financial world, and I know how hard it can be to test these kinds of applications. When a calculation is wrong, it can be wrong in a way that is beyond the ability of a human tester to detect. However, Axa Rosenberg Group's handling of the bug after they found out about it is truly a textbook illustration of how not to handle a software quality problem.
While I typically restrict myself to discussions and posts related purely to how to do and manage software testing better, I feel I must make a brief side expedition to the land of commentary. This should not be a controversial commentary, but I'm afraid it will be with some. I'd like to make a brief call for more civility in the way software testing professionals address each other, both in print and in person.
The following are real quotes from published articles this year (not an old year). They are phrases used to describe software testing professionals. They are used by people who style themselves as experts and coaches in the software testing profession. See how professional and encouraging these words sound to you: "profiteer and bully," "risk-based testing cargo cult," "moral and intellectual bankrupt," "shadowy pseudo-experts," "power mad," and "embarrassingly stupid."
I could go on, but you get the picture.
I have a simple rule for public discourse, both on-line and in-person: if people want to participate in a debate or discussion with me, they can expect me to be civil and respectful towards them and towards other software testing professionals, and I expect the same from them. It'll be a better software testing world, and we'll make a lot more progress together, when this simple rule--one we all learned as children, if we paid attention in school--wins out over the sort of self-promotion-through-name-calling that dominates so much of our debate.
Back to your regularly scheduled fact-focused software testing blogging...
Like most people, I don't always read those pesky agreements that come with software these days, but I made an exception for the TuneUp package I'm installing to try to revive my tired old Windows XP system. I came across this curious contradiction in the warranty section of the agreement:
The Software and your documentation are free of defects if they can be used in accordance with the description of the Software and its functionalities that was provided by TuneUp at the point in time that you received the Software and documentation. Further qualities of the Software are not agreed.
Since no Software is free of defects, we urgently recommend you to back up your data regularly.
Okay, guys, what is it? Is the software free of defects or not? If it is free of defects, perhaps you could enlighten us all on how you did that?
Here's another good observation on a question in the Advanced Test Manager book. Gianni Pucciani commented about question 18 in chapter 3:
Assume you are a test manager in charge of integration testing, system testing, and acceptance testing for a bank. You are working on a project to upgrade an existing automated teller machine system to allow customers to obtain cash advances from supported credit cards. The system should allow cash advances from $20 to $500, inclusively, for all supported credit cards. The supported credit cards are American Express, Visa, Japan Credit Bank, Eurocard, and MasterCard.
During test execution, you find five defects, each reported by a different tester, that involve the same problem with cash advances, with the only difference between these reports being the credit card tested. Which of the following is an improvement to the test process that you might suggest?
A. Revise all cash advance test cases to test with only one credit card.
B. Review all reports filed subsequently and close any such duplicate defect reports before assignment to development.
C. Change the requirements to delete support for American Express cards.
D. Have testers check for similar problems with other cards and report their findings in defect reports.
The answer is D. Gianni commented, "I see that B is a reactive solution that does not really improve the process. But I probably misinterpreted D: I thought of D as a duplication of work, because I thought it was suggesting that each tester execute the same test case with all four credit cards. Instead, I suppose the real sense was that each tester should just check, before filing a new bug, whether bug reports are already open on the same issue, and add information there... The improvement I would suggest is that each tester executes his/her own test cases with all four cards, which I think is better than D."
Yes, Gianni, this is the sense in which I meant option D. When testers find a bug, they should isolate it by checking against the other cards. One of the problems with multiple choice questions is that you can't use an entire paragraph in each option!