WINNOWING TESTS: GETTING QUALITY COVERAGE FROM A MODEL CHECKER WITHOUT QUANTITY

Paul E. Black, National Institute of Standards and Technology, Gaithersburg, Maryland
paul.black@nist.gov

Scott Ranville, New Eagle Software, Ann Arbor, Michigan
sranville@neweagle.net

Abstract
Test generation is easy: we can generate piles of tests randomly, by mutation analysis of formal specifications with a model checker, through path coverage, and so on. But simply increasing the quantity of tests may not be a cost-effective way of increasing the quality of the test set. The number of tests, and hence the cost of maintaining and running large test sets, can increase exponentially with the increase in real coverage. We present several automated methods to pick out a smaller set of tests that still has good coverage.

The first method is finding a subset of tests with equal coverage for a test criterion. To do this, we must measure the coverage of a test set, given a test criterion. Some possible test criteria are path coverage, n-switch, specification mutation, and branch coverage. We use a greedy algorithm to choose tests satisfying requirements of the criterion. Second, tests that are prefixes of other, longer tests, or that are duplicates, may be dropped. Finally, the cross section of a requirement is the fraction of tests that satisfy it. We can choose tests with the highest resolution, that is, tests satisfying requirements with the smallest cross section, since these are likely to be the most sensitive tests.

Introduction
Let us define a few terms. After [1], a test criterion is the overall strategic goal, or a judgment about what aspects one wishes to test. Some criteria are source code branch coverage, du-paths, specification boundary testing, MC/DC, random testing, use cases, and mutation adequacy [2]. A test objective is a specific, tactical goal. It comes from considering the unit under test and the test criteria. Here are some typical test objectives, each with its test criterion in parentheses.
• Execute the program so branch #27 is not taken (branch coverage).
• Choose inputs so x is just barely less than y (boundary testing).
• Create an account, make a deposit to it, and then try to delete it (use case).

A test case is a specific test. It should satisfy one or more test objectives. (Otherwise, why consider it?) A test case consists of one or more steps. A test set is one or more test cases, usually having some common theme, objective, criterion, etc. Test coverage is the extent to which a test set satisfies a test criterion, that is, how well the test set tests the aspects of concern. It is often reported as the percentage of test objectives satisfied, or some other simple, objective function.

Generating huge test sets is easy for some test criteria. However, larger and larger test sets do not automatically lead to proportionately better test coverage, especially when there is a random element in test generation. Unless there is a very particular guiding force, random selection tends to produce broadly similar test cases over and over. Two alternatives are to use a test generation method that produces few, if any, overlapping test cases, or to add a post-processing step that excludes overlapping test cases. Using a different, deterministic test generation method that produces few overlaps is desirable, but not always possible. Also, some level of random testing helps cover behavior that might otherwise be overlooked.
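To make the coverage measure concrete, here is a minimal sketch in Python (our illustrative choice; the names coverage and satisfies and the example objectives are ours, not from any particular tool) that computes coverage as the fraction of test objectives satisfied by at least one case in a test set.

def coverage(test_set, objectives, satisfies):
    """Fraction of test objectives satisfied by at least one test case.
    satisfies(case, objective) is a hypothetical predicate supplied by the tester."""
    if not objectives:
        return 1.0  # nothing to cover
    satisfied = sum(1 for obj in objectives
                    if any(satisfies(case, obj) for case in test_set))
    return satisfied / len(objectives)

# Illustrative data only: two of three objectives are satisfied, so coverage is 2/3.
objectives = ["branch 27 not taken", "x just below y", "create, deposit, delete"]
cases = [{"branch 27 not taken"}, {"x just below y"}]
print(coverage(cases, objectives, lambda case, obj: obj in case))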
Winnowing
Following [3], we call the post-processing step that minimizes redundant test cases winnowing. But why reduce the size of the test set at all, that is, why not run every possible test? The goal of test engineers is to "identify errors in a minimal fashion. Due to time constraints, if the test engineers get multiple tests that fail for the same reason, they will probably get frustrated…" [4]. There is always limited time to run the tests and analyze the results. Of course, in some situations the cost of winnowing test cases outweighs the savings in test execution, analysis, and maintenance time. In other circumstances, test engineers may prefer many test cases that each precisely indicate a specific error to fewer cases whose failure could result from many errors. For those situations and test generation methods that do call for winnowing, there are many techniques, some applicable only to certain test generation methods. We discuss two: dropping prefixes or extensions, and finding a complete coverage subset.

Drop Prefixes or Extensions
Test cases may be divided into two categories: failing test cases and passing test cases [5]. Proper software should not exhibit the behavior checked in a failing test case. Safety and security constraints are often naturally modeled as failing test cases. For instance, the requirement "wheel retraction is not allowed with weight on wheels" leads to a test case that puts weight on the wheels and then issues a "landing gear up" command; the software should fail to issue a retraction sequence. On the other hand, proper software should do what is indicated in a passing test case. As passing tests are more common, we consider them first.

Some test cases may be prefixes of longer cases. That is, both test cases start in the same state and proceed with identical inputs and expected outputs, but one has additional steps. For passing test cases, correct software should pass both the longer test case and the prefix. Since the full test case exercises more (or, strictly speaking, no less) behavior than the prefix, prefixes of passing test cases may be discarded.

For failing test cases, we take the opposite tack: we drop failing test cases that are extensions of shorter failing test cases. Since correct software should fail the short test case, trying a longer test case is no more sensitive; the software should already have departed from the test case during the first steps.

Some test generation techniques, such as mutation analysis, generate many duplicate test cases. All but one of the duplicates may be dropped. Thus, a single step of sorting and syntactic comparison can reduce the number of test cases, and hence the size of the resulting test set. For some methods it may reduce the size by an order of magnitude while retaining identical coverage.
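As a rough sketch of this winnowing step, the following Python fragment (the representation and names are our own assumptions, not from a particular tool) drops exact duplicates, then discards passing cases that are prefixes of longer passing cases and failing cases that extend shorter failing cases. A test case is assumed to be a tuple of steps paired with a flag saying whether it is a passing case.

def winnow_prefixes(cases):
    """cases: list of (steps, is_passing), where steps is a tuple of test steps.
    Returns the cases with duplicates, passing prefixes, and failing extensions removed."""
    unique = list(dict.fromkeys(cases))  # drop exact duplicates, preserving order

    def is_prefix(short, long):
        return len(short) < len(long) and long[:len(short)] == short

    kept = []
    for steps, passing in unique:
        redundant = False
        for other_steps, other_passing in unique:
            if passing != other_passing or steps == other_steps:
                continue
            if passing and is_prefix(steps, other_steps):
                redundant = True   # passing prefix of a longer passing case
                break
            if not passing and is_prefix(other_steps, steps):
                redundant = True   # failing extension of a shorter failing case
                break
        if not redundant:
            kept.append((steps, passing))
    return kept

# Hypothetical step names; the lone passing prefix ("weight_on_wheels",) is dropped.
tests = [(("weight_on_wheels", "gear_up"), False),
         (("weight_on_wheels",), True),
         (("weight_on_wheels", "cruise_on"), True)]
print(winnow_prefixes(tests))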
Complete Coverage Subset
Since a single test case often satisfies more than one test objective, we may further reduce the size of the final test set by choosing a subset of the test cases. Although finding a minimum subset that covers all the objectives is NP-complete, simple greedy heuristics often suffice. One approach is to build a matrix of all test cases indicating which test requirements each one satisfies. Test cases that are dominated by other cases may then be dropped; that is, if one test case satisfies only a proper subset of the requirements satisfied by another, it may be dropped. Another method is to begin with an empty final test set and the set of candidate test cases. Repeatedly find a test requirement that is not satisfied by any case in the final set, and choose a candidate test case that satisfies it. Continuing until all requirements are satisfied, we hopefully end with a smaller test set.

Highest Resolution Tests
Although the heuristics described above can reduce the size of test sets, there is still room for improvement. Recall that generating test cases to satisfy test requirements is rarely our ultimate goal. We want to choose a test set, relatively minuscule compared with all possible executions, that has a good chance of revealing latent errors. Can we formalize the notion of picking a set of sensitive test cases?

Some errors are harder to find than others. Analogously, some test requirements are harder to satisfy than others. As a gross example, an error that requires many rare conditions to manifest is harder to find than an error that immediately aborts the program. Similarly, a requirement to reach a state that takes a minimum of a dozen events to reach is harder to satisfy than a requirement to visit the start state. Whittaker and Voas [6] refer to the ease of finding a fault as the size of the fault. We use an analogy to the ease of detecting an airplane with radar or of hitting an atomic particle: we call the ease of detecting an error or satisfying a requirement the cross section of the error or requirement. A test requirement with a large cross section is generally easier to satisfy than one with a small cross section; similarly for errors. Formally, the cross section of a requirement or error is the ratio of the number of test cases that satisfy or detect it to the number of all test cases:

CS(r) = (number of satisfying tests) / (total number of tests)

In general, we want test cases that satisfy requirements with small cross sections. Presumably these test cases are quite sensitive and can detect errors with small cross sections, too. We call the ability of a test to satisfy requirements with small cross sections its resolution. The higher the resolution, the more small cross section requirements the case satisfies. Formally, the resolution of a test case is the sum, over the requirements it satisfies, of the squares of the reciprocals of their cross sections:

RES(t) = Σ 1/CS(r)², summed over the requirements r that t satisfies

An improved subset heuristic begins with an empty final test set and a set of candidate test cases, as before. We pick the unsatisfied requirement with the smallest cross section, that is, the one satisfied by the fewest test cases. From all the test cases that satisfy that requirement, we choose the one with the greatest resolution and add it to the final set. We continue picking unsatisfied, small cross section requirements and high-resolution test cases until all requirements are satisfied.
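The improved heuristic can be sketched briefly in Python. Here each candidate test case is mapped to the set of requirements it satisfies; the dictionary representation and names are our illustrative assumptions, and the cross sections and resolutions follow the definitions above.

def select_high_resolution(candidates):
    """candidates: dict mapping a test case id to the set of requirements it satisfies.
    Greedily pick the unsatisfied requirement with the smallest cross section,
    then the highest-resolution candidate that satisfies it."""
    all_reqs = set().union(*candidates.values())
    n_tests = len(candidates)

    # Cross section of a requirement: fraction of candidate cases satisfying it.
    cs = {r: sum(r in reqs for reqs in candidates.values()) / n_tests for r in all_reqs}
    # Resolution of a case: sum of 1/CS(r)^2 over the requirements it satisfies.
    res = {t: sum(1.0 / cs[r] ** 2 for r in reqs) for t, reqs in candidates.items()}

    chosen, unsatisfied = [], set(all_reqs)
    while unsatisfied:
        target = min(unsatisfied, key=lambda r: cs[r])          # smallest cross section
        best = max((t for t, reqs in candidates.items() if target in reqs),
                   key=lambda t: res[t])                        # highest resolution
        chosen.append(best)
        unsatisfied -= candidates[best]
    return chosen

# Illustrative data: r1 has the smallest cross section, so t1 is picked first, then t2.
print(select_high_resolution({"t1": {"r1", "r2"}, "t2": {"r2", "r3"}, "t3": {"r3"}}))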
Example
As a very small example, we use the automobile cruise control presented by Atlee and Buckley [7]. The specification takes one table of 12 rows and nine columns in SCR, or about 90 lines in SMV [8]. Using a mutation analysis criterion with many mutation operators, we generated 807 mutant specifications, which yielded 379 counterexamples. A counterexample is an execution trace of the given state machine showing how the specification is false; thus each counterexample is a potential test case. We could have chosen a set of mutation operators with nearly the same coverage but an order of magnitude fewer mutants [9]; we did not bother for this tiny example.

After winnowing duplicates, 35 unique test cases remained. Since these are all passing tests, we found and dropped 11 test cases that were prefixes of other, longer cases, leaving 24 cases. A complete subset that satisfied all the requirements (killed all the mutants) consisted of only 19 tests. For this example, the test cases that were required, that is, those that were the only ones to satisfy some requirement (kill some mutant), already formed a set that satisfied all the requirements, so no more elaborate grading by resolution was needed.

Obviously, cross sections may be combined in different ways to compute resolutions. Depending on the test generation method and requirements, it may be better to use the geometric or arithmetic mean of the reciprocals. We could use the sum of the reciprocals, instead of the sum of their squares, to let requirements with large cross sections contribute more.

References
[1] Black, Paul E., 2000, Modeling and Marshaling: Making Tests from Model Checker Counterexamples, Proc. 19th Digital Avionics Systems Conf., IEEE, section 1.B.3.
[2] Beizer, Boris, 1990, Software Testing Techniques, Van Nostrand Reinhold.
[3] Ammann, Paul E. and Black, Paul E., 2001, A Specification-Based Coverage Metric to Evaluate Test Sets, Intern'l Jour. of Reliability, Quality and Safety Eng., World Scientific Publishing.
[4] Ranville, Scott and Black, Paul E., 2001, Automated Testing Requirements – Automotive Perspective, 2nd Intern'l Workshop on Automated Program Analysis, Testing and Verification, Toronto, Canada.
[5] Ammann, Paul E., Black, Paul E., and Majurski, William, 1998, Using Model Checking to Generate Tests from Specifications, Proc. 2nd IEEE Intern'l Conf. on Formal Eng. Methods, IEEE, pp. 46-54. Also NIST-IR 6166.
[6] Whittaker, James A. and Voas, Jeffrey, 2000, Toward a More Reliable Theory of Software Reliability, IEEE Computer, 33(12), pp. 36-42.
[7] Atlee, Joanne M. and Buckley, M. A., 1996, A Logic-Model Semantics for SCR Software Requirements, Proc. 1996 Intern'l Symp. on Software Testing and Analysis, pp. 280-292.
[8] McMillan, Kenneth L., 1993, Symbolic Model Checking, Kluwer Academic Publishers.
[9] Black, Paul E., Okun, Vadim, and Yesha, Yaacov, 2000, Mutation of Model Checker Specifications for Test Generation and Evaluation, Mutation 2000, San Jose, California.