Measuring Effectiveness: The Software Industry’s Conundrum

On “Measuring Effectiveness”

I’ve been saying this for what seems like decades: measuring effectiveness is one of our nascent software industry’s biggest conundrums (maybe even an enigma?).

Just how do you prove a team is “doing better”? Or that a new set of processes is “more effective”?

My usual reaction when asked for such proof:

“Ok, show me what you are tracking today for the team’s performance over the past few years, so we have a baseline.”

Crickets…

But seriously, wouldn’t it be awesome if we could measure something? Anything?

For my money:

Delivering useful features on top of a reasonable-quality architecture, with good acceptance and unit test coverage, for a good price, is success.

Every real engineering discipline has metrics (I think, anyway, it sounds good — my kids would say I am “Insta-facting™”).

If we were painting office building interiors, or paving a highway, we could certainly develop a new process or a new tool and quantitatively estimate the expected ROI, and then prove the actual ROI after the fact. All day long.

In engineering a new piece of hardware, we could use cost analysis and MTBF (mean time between failures) to get an idea of the relative merits of one design over another.

We would even get a weird side benefit — being relatively capable at providing estimates.

In software, I posit this dilemma (it’s a story of two teams/processes):

Garden A:

  • Produces 15 bushels (on average) per month over the growing season
  • Is full of weeds
  • Does not have good soil management
  • Will experience exponential (or maybe not quite that dramatic) production drop off in ensuing years, requiring greater effort to keep the production going. Predictability will wane.
  • Costs $X per month to tend

Garden B:

  • Produces 15 bushels (on average) per month over the growing season
  • Is weed free and looks like a postcard
  • Uses raised bed techniques, compost, and has good soil management
  • Will experience consistent, predictable, production in ensuing years
  • Costs $Y per month to tend

I could make some assertions that $Y is a bit more costly than $X… Or not. Let’s assume more costly is the case for now.

To make it easier to grok, I am holding the output of the gardens constant. This is reflected by the exorbitant rise in cost in the weedy Garden A to keep producing the same bushels per month… (I could have held the team or expense constant, and allowed production to vary. Or, I could have tried to make it even more convoluted and let everything vary. Meh. Deal with this simple analogy!)

 

Cost to tend ($X for Garden A, $Y for Garden B), by year:

Year         1     2     3     4     5     6     7     8     9    10
Garden A   100   102   105   110   130   170   250   410   730  1600
Garden B   120   120   120   120   120   120   120   120   120   120

If we look at $X and $Y in years 1 through 10, we might see some numbers that would make us choose B over A.

But if we looked at just the current burn rate (or even through year 4), we might think Garden A is the one we want. (And we can hold our tongue about the weeds.)
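
To put numbers on that, here is a minimal Python sketch that totals the figures from the table above. It assumes the rows are cost to tend per year in the same (unspecified) units; the helper name `first_year_exceeding` is mine, purely for illustration.

```python
from itertools import accumulate

# Yearly figures copied from the table (units are not given in the post).
garden_a = [100, 102, 105, 110, 130, 170, 250, 410, 730, 1600]
garden_b = [120] * 10


def first_year_exceeding(costs_a, costs_b):
    """Return the first 1-based year in which A's running total passes B's, or None."""
    running = zip(accumulate(costs_a), accumulate(costs_b))
    for year, (total_a, total_b) in enumerate(running, start=1):
        if total_a > total_b:
            return year
    return None


print(sum(garden_a[:4]), sum(garden_b[:4]))      # running totals through year 4
print(sum(garden_a), sum(garden_b))              # running totals through year 10
print(first_year_exceeding(garden_a, garden_b))  # year A's total overtakes B's
```

Through year 4, Garden A totals 417 against 480 for Garden B. By year 10 it is 3707 against 1200, and A's running total first passes B's in year 7.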

But most of the people asking these questions are at year 5-10 of Garden A, looking over at their neighbor’s Garden B and wanting a magic wand. The developers are in the same boat… Wishing they could be working on the cooler, younger, plot of Garden B.

What’s a business person/gold owner to do? After all, they can’t really even see the quality of the garden; they just see output. And cost. Over time. Unless they get their bonus and move on to the next project before anyone finds out about the mess in Garden A. Of course, the new person coming into Garden A knows no different (unless they were fools and used to work in Garden B, and got snookered into changing jobs).

Scenario #2

Maybe we abandon Garden A, and start anew in a different plot of land every few years? Then it is cheaper over the long haul.

Cost to tend, by year, when Garden A restarts on a new plot every three years:

Year         1     2     3     4     5     6     7     8     9    10
Garden A   100   102   105   100   102   105   100   102   105   100
Garden B   120   120   120   120   120   120   120   120   120   120
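
A quick Python sketch of the Scenario #2 arithmetic, using only the figures in the table. The `break_even_restart_cost` value is a hypothetical of mine: the table shows no cost for starting over on a new plot, so this just asks how large such a cost could be before the saving disappears.

```python
# Figures copied from the Scenario #2 table.
garden_a_restarting = [100, 102, 105, 100, 102, 105, 100, 102, 105, 100]
garden_b = [120] * 10
restarts = 3  # Garden A moves to a fresh plot in years 4, 7 and 10

savings = sum(garden_b) - sum(garden_a_restarting)

# Hypothetical: the largest one-off cost per restart before A stops being cheaper.
break_even_restart_cost = savings / restarts

print(sum(garden_a_restarting), sum(garden_b), savings)
print(round(break_even_restart_cost, 1))
```

On the table's numbers, Garden A totals 1021 against 1200, a saving of 179 over ten years. Spread over three restarts, that saving is used up once each restart costs roughly 60.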

I think the reason it is so challenging to get all scientific about TQM is that what we do is closer to knowledge work and craftwork than to assembly-line work.

The missing piece is to quantify what we produce in software. Just how many features are in a bushel?

I submit: ask the customer what to measure. And maybe the best you can do is periodic surveys that measure satisfaction (sometimes known as revenue).

Matt Snyder (@msnyder) tweeted me a nice video: Metrics, Metrics, Everywhere – Coda Hale


Code City Metrics

Here is a cool “metrics” visualization tool that Thomas blogged about: Visualizing Code Aesthetics

For me, using city layout, while intriguing, might not generate immediate grokking by most people.

It is cool how it shows multiple dimensions at once:

  • blocks represent packages, and
  • buildings represent classes, with
    • footprint based on the number of attributes and
    • height based on LOC.

I am not sure that people have an automatic reaction to a cityscape that can equate “good code” to a good-looking city. For example, I think most people would say that a good-looking city has a nice skyline with tall buildings grouped in an aesthetically-pleasing manner. That might represent bad code, who knows?

Also, some bad code smells at the class level are

  • All attributes — a data blob — would be big footprint, low height
  • All methods — an overachiever — would be tall and skinny

I guess the code cityscape would lead you to see some obvious outliers. But does it tell you much more than that? Does it tell you anything about the “goodness” of the design? Does it tell you anything that a list of computed metrics doesn’t point out with less fanfare?
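
For comparison, here is roughly what a plain list of computed metrics could look like. This is a Python sketch under my own assumptions: the class names and numbers are invented for illustration, and the `many`/`few` thresholds are arbitrary, not recommended values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClassMetrics:
    name: str
    attributes: int  # drives the footprint in the city view
    methods: int
    loc: int         # drives the height in the city view


def smell(m: ClassMetrics, many: int = 10, few: int = 2) -> Optional[str]:
    """Label the two class-level smells listed above; thresholds are arbitrary."""
    if m.attributes >= many and m.methods <= few:
        return "data blob"
    if m.methods >= many and m.attributes <= few:
        return "overachiever"
    return None


# Invented sample classes, purely for illustration.
classes = [
    ClassMetrics("CustomerRecord", attributes=14, methods=1, loc=40),
    ClassMetrics("OrderProcessor", attributes=1, methods=22, loc=900),
    ClassMetrics("Invoice", attributes=5, methods=6, loc=180),
]

for m in classes:
    print(f"{m.name:16} attrs={m.attributes:3} methods={m.methods:3} "
          f"loc={m.loc:4} {smell(m) or ''}")
```

The two outliers get flagged and the unremarkable class gets no label, which is about all the skyline would tell you too.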

What is missing in the Code City, and arguably of greater import IMHO, are indications of high coupling and low cohesion, plus cyclomatic complexity values (i.e., how convoluted your LOC are).
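
To show what a cyclomatic complexity value captures that LOC does not, here is a rough Python sketch using the standard `ast` module. It is an approximation of McCabe's measure (one plus the number of decision points), not a full implementation: it ignores comprehension filters and `match` statements, among other things, and the `sample` function is made up for the demonstration.

```python
import ast

# Node types that each add one decision point in this approximation.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points in the source."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two short-circuit branches.
            complexity += len(node.values) - 1
    return complexity


# Made-up sample: six lines, but one loop, one `if`, and one `and`.
sample = '''
def include(choices, required):
    for choice in choices:
        if required and choice == "INCLUDE":
            return True
    return False
'''

print(cyclomatic_complexity(sample))
```

The sample scores 4: the base path plus the loop, the `if`, and the `and`. Two classes of the same height in the city could differ wildly on this number.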

Nonetheless, the Code City does get your attention as it is pretty cool looking at first glance 🙂

Thanks for sharing!

NOTE: Thomas pointed out that there are some different ways to view the metrics that address some of my metric faves:

Just so you know, there's a bunch of other metrics out of the box:

Color buildings by:
* Brain class
* God class
* Data class

You can also break the classes down into methods (look like floors in
the buildings) to study this:
* Brain method
* Feature envy
* Intensive coupling
* Dispersed coupling
* Shotgun surgery

On Metrics

I was watching Corey Haines’ video

And I had these thoughts:

A challenge that I see is not so much about gathering metrics. I have been using metrics since the days of PC-Metric and PC-Lint (IIRC). I would try to get my code and designs as good as possible, without being too crazy about it all.

Later, I added 60+ metrics and 60+ audits into Together; you could even trend your results to see progress over time with a given code base. Big whoop. I even had dreams of anonymously uploading audits & metrics to a website so people could collaborate on arriving at good, meaningful values for various metrics. (I routinely got asked “What is a good level for metric x?”)

Yeah, so we all know, “You can’t improve what you don’t measure.”

But what are we trying to improve? Quality? Reliability? Agility to make changes? Profit?

Just how do we correlate a measurement to a desired outcome? Can we tie a set of metrics to their impact on the business goals for the software? Less complexity equals more profit and more (happy) customers?

Or do we stop short of that and tie metrics to achieving “quality” and presume that if we target a given application to meet the “right” amount of quality, the business value will naturally follow?

This is a difficult conundrum for our industry. But we do have to start somewhere.

In the world of engineering, there exist measurements that can be tied to desired performance and cost.

We need something similar should we want to mature beyond just seat-of-the-pants and gut-feel techniques.

I am sure some folks have it down to a science… and for them, it must be a nice competitive advantage that is probably hard to share publicly.

a little bug – part 1

The Bug: Certain questions that were supposed to be included due to server-side logic were not being properly included.

There was an odd chain of events… The deeply embedded process (shown below) did not seem to behave properly. A couple of developers set about to test a bit more thoroughly. They discovered that the order of the choices for the questions that were supposed to be included affected whether they were successfully included or not. Mind you… there are only two choices: INCLUDE and EXCLUDE.

Context: The application is very loosely a Questionnaire app… It just so happens that some of the questions get processed by a third-party black box, which determines whether those questions should be required/included or not. “Required” in the server process world means “included” in the UI world.

I’ll start you off with some code to peruse, full comments and all:

for each (var qstnTO : QuestionTO in processedQuestionArray) {
  for each (var qstn : QuestionVO in __localQuestionsToProcess) {
    if (qstn.questionID == qstnTO.questionID) {
      if (qstnTO.required) {
        for each (choice in qstn.choices) {
          choice.selected = (choice.displayText == "INCLUDE") ? true : false;
          changedQuestions.addItem(qstn);
          // ?? Not sure if the following is necessary or not
          choice.enabled  = true;
          choice.visible  = true;
          qstn.section.visible = true;
          qstn.section.enabled = true;
          // ??
          break;
        }
      }
    }
  }
}

Can you spot anything that might point to why the testers saw different behavior based on the order of choices?

I’ll be back later to elaborate…