Category Archives: agile

So You Wanna Try Agile?

Well, in a nutshell, agile is a state of mind.
Agile is relative.
Agile is not dogmatic.
Agile is pragmatic.

Agile is designed to reduce the gap in time between doing something, and seeing a result that you can “measure.”

Blend in Lean concepts.
Read The Goal.
Don’t do giant monolithic things.
Learn how to decompose your “features/expected outcomes” into bite-size chunks.
Don’t plan across a far horizon to the same degree.
What folks are tasked with doing today had better be clear, with an unambiguous definition of “DONE.”
What folks are tasked with doing in 2 months had better be rather nebulous and broad.

Don’t be complacent and relax.
Question the value of deliverables that are needed to fulfill some process step for some other team.
Ask any downstream recipient what they truly need — and why.
Try new processes.
Reflect.
Agile takes constant effort and constant partial attention if you are doing it right.
Be holistic.

User Stories and FDD

FDD?

I bet you’ve never heard of Feature-Driven Development, eh?

Well, Mike Cohn wrote this recent post:

Not Everything Needs to Be a User Story: Using FDD Features

Having worked with Peter Coad since the early 90s, and Jeff De Luca in the late 90s, I’ve been a fan of FDD and naturally turn to that style when “user stories” are not so user-centric. And yes, those are typically the minority items on our backlogs.

Software development is working through a prioritized to-do list. Most of the to-dos should be about addressing user needs. Call them user stories, call them features, maybe even call them requirements. Whatever works best to help you organize and communicate what needs to be built.

Another element of FDD is breaking down (or building up) the system into

  • Major Feature Sets (Quote Management), and their
    • Feature Sets (Clone Quotes, Create Quotation Documents), and their
      • Features (create a quote, edit a quote, archive a quote).

Major Feature Sets might loosely equate to epics 🙂
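To make the shape of that breakdown concrete, here is a minimal sketch in plain Ruby. The names come from the example above, but which features live under which feature set (and the “clone a quote” feature) are my own guesses for illustration — this is just the structure, not any official FDD tooling:

# An FDD-style breakdown captured as plain data:
# Major Feature Set -> Feature Sets -> Features.
feature_breakdown = {
  "Quote Management" => {                            # Major Feature Set (~ an epic)
    "Clone Quotes"               => ["clone a quote"],
    "Create Quotation Documents" => ["create a quote", "edit a quote", "archive a quote"],
  },
}

# Walk the hierarchy, e.g., to emit a backlog-style listing.
feature_breakdown.each do |major_set, feature_sets|
  puts major_set
  feature_sets.each do |feature_set, features|
    puts "  #{feature_set}: #{features.join(', ')}"
  end
end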

One of the keys to successful software development is to combine the list of features with a domain model (and some UI mockups don’t hurt). The domain model need not be to the nth degree of UML detail, but it should clearly describe — in just enough detail — what your problem domain is all about. This eliminates the need to write all sorts of detail in the development issues, leaving that to the model. The feature list then becomes more about the order in which we build up the various aspects of the product feature sets.

Thanks for paying a bit of homage to FDD. A blast from the past!

Measuring Effectiveness: The Software Industry’s Conundrum

On “Measuring Effectiveness”

I’ve been saying for what seems like decades that this is (one of?) our nascent software industry’s biggest conundrums (maybe even an enigma?).

Just how do you prove a team is “doing better?” Or that a new set of processes are “more effective?”

My usual reaction when asked for such proof:

“Ok, show me what you are tracking today for the team’s performance over the past few years, so we have a baseline.”

Crickets…

But seriously, wouldn’t it be awesome if we could measure something? Anything?

For my money:

Delivering useful features on top of a reasonable-quality architecture, with good acceptance and unit test coverage, for a good price, is success.

Every real engineering discipline has metrics (I think, anyway, it sounds good — my kids would say I am “Insta-facting™”).

If we were painting office building interiors, or paving a highway, we could certainly develop a new process or a new tool and quantitatively estimate the expected ROI, and then prove the actual ROI after the fact. All day long.

In engineering a new piece of hardware, we could use costing analysis and MTBF (mean time between failures) to get an idea of the relative merits of one design over another.

We would even get a weird side benefit — being relatively capable of providing estimates.

In software, I posit this dilemma (it’s a story of two teams/processes):

Garden A:

  • Produces 15 bushels (on average) per month over the growing season
  • Is full of weeds
  • Does not have good soil management
  • Will experience exponential (or maybe not quite that dramatic) production drop off in ensuing years, requiring greater effort to keep the production going. Predictability will wane.
  • Costs $X per month to tend

Garden B:

  • Produces 15 bushels (on average) per month over the growing season
  • Is weed free and looks like a postcard
  • Uses raised bed techniques, compost, and has good soil management
  • Will experience consistent, predictable, production in ensuing years
  • Costs $Y per month to tend

I could make some assertions that $Y is a bit more costly than $X… Or not. Let’s assume more costly is the case for now.

To make it easier to grok, I am holding the output of the gardens constant. This is reflected by the exorbitant rise in cost in the weedy Garden A to keep producing the same bushels per month… (I could have held the team or expense constant, and allowed production to vary. Or, I could have tried to make it even more convoluted and let everything vary. Meh. Deal with this simple analogy!)

 

Year        1    2    3    4    5    6    7    8    9    10
Garden A  100  102  105  110  130  170  250  410  730  1600
Garden B  120  120  120  120  120  120  120  120  120   120

If we look at $X and $Y in years 1 through 10, we might see some numbers that would make us choose B over A.

But if we looked at just the current burn rate (or even through year 4), we might think Garden A is the one we want. (And we can hold our tongue about the weeds.)
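To make that concrete, here is a quick back-of-the-envelope comparison in plain Ruby, using the yearly costs from the table above:

# Yearly tending costs from the table above.
garden_a = [100, 102, 105, 110, 130, 170, 250, 410, 730, 1600]
garden_b = [120] * 10

# Cumulative cost through a given year.
def cumulative(costs, year)
  costs.first(year).inject(0, :+)
end

cumulative(garden_a, 4)   # => 417  -- Garden A looks cheaper early on
cumulative(garden_b, 4)   # => 480
cumulative(garden_a, 10)  # => 3707 -- until the weeds catch up
cumulative(garden_b, 10)  # => 1200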

But most of the people asking these questions are at year 5-10 of Garden A, looking over at their neighbor’s Garden B and wanting a magic wand. The developers are in the same boat… Wishing they could be working on the cooler, younger, plot of Garden B.

What’s a business person/gold owner to do? After all, they can’t really even see the quality of the garden, they just see output. And cost. Over time. Unless they get their bonus and move on to the next project before anyone finds out the mess in Garden A. Of course, the new person coming into Garden A knows no different (unless they were fools and used to work in Garden B, and got snookered into changing jobs).

Scenario #2

Maybe we abandon Garden A, and start anew in a different plot of land every few years? Then it is cheaper over the long haul.

Year        1    2    3    4    5    6    7    8    9    10
Garden A  100  102  105  100  102  105  100  102  105   100
Garden B  120  120  120  120  120  120  120  120  120   120

I think the reason it is so challenging to get all scientific about TQM is that what we do is more along the lines of knowledge work and craftwork than assembly-line work.

The missing piece is to quantify what we produce in software. Just how many features are in a bushel?

I submit: ask the customer what to measure. And maybe the best you can do is periodic surveys that measure satisfaction (sometimes known as revenue).

Matt Snyder (@msnyder) tweeted me a nice video: Metrics, Metrics, Everywhere – Coda Hale


Easing New Developer Ramp-up Time

A recent healthcare start-up team grew from my buddy and me (he moved on after six months or so) to a handful of developers/sub-contractors.

Here is how we tried to make it fairly efficient. We just started tracking what was needed, getting feedback as each new developer went through the process, and improving the instructions and process along the way. If something was missing, we added it. If something was not clear, the new developer could amend the wiki.

By the 3rd new developer or so, we had it down to where they could go from getting set up to committing and deploying a new “feature” in less than half a day.

There was a section at the top that shared a good team chat room session with a new remote developer:

Getting Started Chat Room Conversation

That was followed by the FAQ-like list of links:

One of the first reasons I wanted to make it easy for a new team member to get rolling was so that our friend Max — who would be doing our QA from Russia — could get started. As we added the first couple of devs, we probably decreased the start-up time as follows:

As part of “Getting Started,” I would include a simple Jira issue that helped them ensure that everything was working and that they followed our dev process:

  • Git, dev, and database (MongoDB) environments obviously had to be set up
  • Access to Jira, to assign themselves to the issue and move it — Kanban style — to the In Progress state
  • Commit the work and the passing tests
  • Drag the Jira issue to “Done”

 

Since Atlassian’s Confluence Wiki does a stellar job at versioning pages, I actually looked back to see how the page grew and morphed over time. It started out rather modestly (and empty):

After a while, it grew a bit bloated:

It was successively refactored into its current state; here is a snippet of the 70 versions that this page underwent from March 2011 through July 2012.

Wikis, like code, need to be tended to, nurtured, and refactored to provide the best value.

The Cost of Using Ruby’s Rescue as Logic

[notice]
If you use this sort of technique, you may want to read on.

node = nodes.first rescue return

[/notice]

 

[important]

Nov 2012 Update:

Though this post was about the performance cost of using a ‘rescue’ statement, there is a more insidious problem with the overall impact of such syntax. The pros and cons of using a rescue are well laid out in Avdi’s free RubyTapas: Inline Rescue

[/important]

Code like this:

unless nodes.nil?
  nodes.first
else
  return
end

Can be written using the seemingly more elegant approach with this Ruby trick:

node = nodes.first rescue return

But then, that got me to thinking… In many languages I have used in the past (e.g., Java and C++), exception handling is an expensive endeavor.

Though the rescue solution works, I wanted to explore the pros/cons of allowing a “rescue” to act as logic. So I did just that…

Here are the two methods I benchmarked, one with “if” logic, and one with “rescue” logic:

def without_rescue(nodes)
  return nil if nodes.nil?
  node = nodes.first
end
def with_rescue(nodes)
  node = nodes.first rescue return
end

Using method_1, below, I got the following results looping 1 million times:

                  user     system      total        real
W/out rescue  0.520000   0.010000   0.530000 (  0.551359)
With rescue  22.490000   0.940000  23.430000 ( 26.487543)

Yikes. Obviously, rescue is an expensive choice by comparison!

But, if we look at just one or maybe 10 times, the difference is imperceptible.

Conclusion #1 (Normal Usage)

  • It doesn’t matter which method you choose if the logic is invoked infrequently.

Looking a Bit Deeper

But I’m a curious engineer at heart, so there’s more… The above results are worst-case, assuming nodes is always nil. If nodes is never nil, the rescue block is never invoked, yielding this (rather obvious) timing where the rescue technique (with less code) is faster:

                  user     system      total        real
W/out rescue  0.590000   0.000000   0.590000 (  0.601803)
With rescue   0.460000   0.000000   0.460000 (  0.461810)

However, what if nodes were only nil some percentage of the time? What does the shape of the performance curve look like? Linear? Exponential? Geometric progression? Well, it turns out that the response (see method_2, below) is linear (R² = 0.99668):

Rescue Logic is Expensive
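That linearity is what you’d expect: the average cost per call is simply a weighted blend of the two code paths, roughly

t(p) ≈ p × t_rescue + (1 − p) × t_plain

where p is the fraction of calls that hit the nil (rescue) case.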

Conclusion #2 (Large Data Set):

In this example of over a million calls, the decision on whether you should use “rescue” as logic boils down to this:

  • If the condition is truly rare (like a real exception), then you can use rescue.
  • If the condition occurs 5% of the time or more, then do not use the rescue technique!

In general, it would seem that there is considerable cost to using rescue as pseudo logic over large data sets. Caveat emptor!
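As an aside — if the only “exception” you ever expect here is nodes being nil, a simple guard gives you the speed of the if-logic with nearly the brevity of the rescue one-liner. A couple of equivalent sketches (note the &. safe-navigation operator assumes Ruby 2.3+, well after this post’s vintage):

# Guard clauses that avoid using rescue as control flow:
node = nodes.first if nodes      # node stays nil when nodes is nil
node = nodes && nodes.first      # short-circuit form
node = nodes&.first              # safe navigation (Ruby 2.3+)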

Sample Code:

My benchmarking code looked like this:

require 'benchmark'

include Benchmark

def without_rescue(nodes)
  return nil if nodes.nil?
  node = nodes.first
end

def with_rescue(nodes)
  node = nodes.first rescue return
end

TEST_COUNT = 1000000

def method_1
  [nil, [1,2,3]].each do |nodes|
    puts "nodes = #{nodes.inspect}"
    GC.start
    bm(12) do |test|
      test.report("W/out rescue") do
        TEST_COUNT.times do |n|
          without_rescue(nodes)
        end
      end
      test.report("With rescue") do
        TEST_COUNT.times do |n|
          with_rescue(nodes)
        end
      end
    end
  end
end

def method_2
  GC.start
  bm(18) do |test|
    nil_nodes = nil
    real_nodes = nodes = [1,2,3]
    likely_pct = 0
    10.times do |p|
      likely_pct += 10
      test.report("#{likely_pct}% W/out rescue") do
        TEST_COUNT.times do |n|
          nodes = rand(100) > likely_pct ? real_nodes : nil_nodes
          without_rescue(nodes)
        end
      end
      test.report("#{likely_pct}% With rescue") do
        TEST_COUNT.times do |n|
          nodes = rand(100) > likely_pct ? real_nodes : nil_nodes
          with_rescue(nodes)
        end
      end
    end
  end
end

method_1
method_2

Sample Output

nodes = nil
                  user     system      total        real
W/out rescue  0.520000   0.010000   0.530000 (  0.551359)
With rescue  22.490000   0.940000  23.430000 ( 26.487543)
nodes = [1, 2, 3]
                  user     system      total        real
W/out rescue  0.590000   0.000000   0.590000 (  0.601803)
With rescue   0.460000   0.000000   0.460000 (  0.461810)
                        user     system      total        real
10% W/out rescue    1.020000   0.000000   1.020000 (  1.087103)
10% With rescue     3.320000   0.120000   3.440000 (  3.825074)
20% W/out rescue    1.020000   0.000000   1.020000 (  1.036359)
20% With rescue     5.550000   0.200000   5.750000 (  6.158173)
30% W/out rescue    1.020000   0.010000   1.030000 (  1.105184)
30% With rescue     7.800000   0.300000   8.100000 (  8.827783)
40% W/out rescue    1.030000   0.010000   1.040000 (  1.090960)
40% With rescue    10.020000   0.400000  10.420000 ( 11.028588)
50% W/out rescue    1.020000   0.000000   1.020000 (  1.138765)
50% With rescue    12.210000   0.510000  12.720000 ( 14.080979)
60% W/out rescue    1.020000   0.000000   1.020000 (  1.051054)
60% With rescue    14.260000   0.590000  14.850000 ( 15.838733)
70% W/out rescue    1.020000   0.000000   1.020000 (  1.066648)
70% With rescue    16.510000   0.690000  17.200000 ( 18.229777)
80% W/out rescue    0.990000   0.010000   1.000000 (  1.099977)
80% With rescue    18.830000   0.800000  19.630000 ( 21.634664)
90% W/out rescue    0.980000   0.000000   0.980000 (  1.325569)
90% With rescue    21.150000   0.910000  22.060000 ( 25.112102)
100% W/out rescue   0.950000   0.000000   0.950000 (  0.963324)
100% With rescue   22.830000   0.940000  23.770000 ( 25.327054)

Manual Cucumber Tests?

There was some discussion over on the Cucumber list about manual testing.

Cucumber is great at BDD, but that doesn’t mean it is the only test technique we should use (preaching to the choir).

I have learned it is critical to understand where automated tests shine and where human testing is essential — and not to confuse the two.

As far as cuking manual tests, keeping all the tests in one place seems like a real advantage (as described in Tim Walker’s cucum-bumbler wiki <g>).

The Cucumber “ask” method looks interesting. Maybe your testers could use the output to the console as-is, or you could (re-)write the method to store the results somewhere else/output them differently.

From the cucumber code (cucumber-1.1.4/lib/cucumber/runtime/user_interface.rb):

# Suspends execution and prompts +question+ to the console (STDOUT).
# An operator (manual tester) can then enter a line of text and hit
# <ENTER>. The entered text is returned, and both +question+ and
# the result is added to the output using #puts.
# ...
def ask(question, timeout_seconds)
...

Sample Feature:

    ...
Scenario: View Users Listing
  Given I login as "Admin"
  When I view the list of users
  Then I should check the aesthetics

Step definition:

Then /^I should check the aesthetics$/ do
  # 7.chr is the ASCII BEL character -- it beeps the console to get the tester's attention
  ask("#{7.chr}Does the UI have that awesome look? [Yes/No]", 10).chomp.should =~ /yes/i
end

The output to the console looks something like this (the question, then the tester’s typed reply — compare the Spork session below):
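    Does the UI have that awesome look? [Yes/No]
    Yes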

Thanks for the pointer, Matt!

[notice]NOTE: it doesn’t play well with running guard/spork.[/notice]
The question pops up over in the guard terminal 🙁

Of course, if you are running a suite of manual tests, you probably don’t need to worry about the Rails stack being sluggish :-p

    Spork server for RSpec, Cucumber successfully started
    Running tests with args ["features/user.feature", "--tags", "@wip:3", "--wip", "--no-profile"]...
    Does the UI have that awesome look? [Yes/No]
    Yes
    ERROR: Unknown command Yes
    Done.

Supporting SSL in Rails 2.3.4

Somehow, moving a perfectly happy production app to Rackspace and nginx caused URLs to no longer sport the SSL ‘s’ in “https” — bummer.

The link_to’s were fine… But a custom “tab_to” (responsible for highlighting the current tab) was not so happy (even though it eventually used a link_to).

Turns out that it is the url_for method, as I learned from here.

I also blended it with some ideas I found here.

# config/environment/production.rb
# Override the default http protocol for URLs
ROUTES_PROTOCOL = "https"
...
# config/environment.rb
# Specifies gem version of Rails to use when vendor/rails is not present
RAILS_GEM_VERSION = '2.3.5' unless defined? RAILS_GEM_VERSION
# Use git tags for app version
APP_VERSION = `git describe --always --abbrev=0`.chomp! unless defined? APP_VERSION
# The default http protocol for URLs
ROUTES_PROTOCOL = 'http'
...
# application_controller.rb
  # http://lucastej.blogspot.com/2008/01/ruby-on-rails-how-to-set-urlfor.html
  def default_url_options(options = nil)
    if ROUTES_PROTOCOL == 'https'
      { :only_path => false, :protocol => 'https' }
    else
      { :only_path => false, :protocol => 'http' }
    end
  end
  helper_method :url_for
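With that in place, url_for — and every helper built on it, like link_to — should emit absolute URLs with the right protocol. A hypothetical example (the host and route are invented for illustration):

# In any controller or view on production (where ROUTES_PROTOCOL == 'https'):
# url_for(:controller => 'quotes', :action => 'show', :id => 42)
#   # => "https://www.example.com/quotes/show/42"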

Now the real kicker… Since I do not have SSL set up locally, I had to do some dev on our staging server to tweak the code and test that “https” showed up. So I turned off class caching: config.cache_classes = false.

However, when I cap deployed with it set back to “true”, https did not show up. @#$$##@!!!!%%$% AARGH.

I suspect it might have something to do with not being able to open up a cached class and redefine it? I don’t know… I am going to have to go explore this oddity next…

Anatomy of a MongoDB Profiling Session

This particular application has been collecting data for months now, but hasn’t really had any users by design. At 33GB of data, pulling up a list of messages received was taking f-o-r-e-v-e-r!

So I decided to document how to go about and fix a running production system… Hope it helps.

Log into the mongo console and turn on profiling (the ‘1’) to monitor slow queries. I set the threshold to 10,000ms, so only queries slower than 10 seconds get logged — which really stinks (!). You should adjust it to suit your app’s definition of “slow” — maybe 500ms:

> db.setProfilingLevel(1,10000)
{ "was" : 0, "slowms" : 100, "ok" : 1 }

Next I went back to the webapp and executed the page request that exhibited the slow response…

Once the page returns, go in and look for any slow responses that the profiler logged:

> db.system.profile.find()
{
  "ts" : ISODate("2012-02-18T15:34:02.967Z"), 
  "op" : "command", 
  "command" : { "count" : "messages", 
    "query" : { "_type" : "HL7Message", "recv_app" : "CAREGIVER" }, 
      "fields" : null }, 
  "ntoreturn" : 1, 
  "responseLength" : 48, 
  "millis" : 119051, 
  "client" : "192.168.100.67", 
  "user" : "" 
}
{
  "ts" : ISODate("2012-02-18T15:35:51.704Z"),
  "op" : "query", 
  "query" : { "_type" : "HL7Message", "recv_app" : "CAREGIVER" },
  "ntoreturn" : 25,
  "ntoskip" : 791025,
  "nscanned" : 791051,
  "nreturned" : 25,
  "responseLength" : 49956,
  "millis" : 108720,
  "client" : "192.168.100.67",
  "user" : "" 
}

You can see there was a count query and a query for the data itself (we are using pagination). Sure enough, look here:

  • “ntoreturn” : 25,
  • “nscanned” : 791051,

 

Wow, that’s nasty… To return 25 records, we scanned 791,051! Gulp. That looks like a full collection scan — never a good thing (unless you have a very small amount of data).

Let’s see what sorts of indexes exist for the messages collection:

db.system.indexes.find( { ns: "production-alerts.messages" } );
{ "name" : "_id_", "key" : { "_id" : 1 }, "v" : 0 }
{ "v" : 1, "key" : { "created_at" : -1 }, "name" : "created_at_-1" }
{ "v" : 1, "key" : { "_type" : 1 }, "name" : "_type_1" }
{ "v" : 1, "key" : { "recv_app" : 1 }, "name" : "recv_app_1" }
{ "v" : 1, "key" : { "created_at" : -1, "recv_app" : 1 }, "name" : "created_at_-1_recv_app_1" }
{ "v" : 1, "key" : { "message_type" : 1 }, "name" : "message_type_1" }
{ "v" : 1, "key" : { "trigger_event" : 1 }, "name" : "trigger_event_1" }

Well, as expected, there is no index covering the multiple keys we are searching on. So let’s add a compound index to match the query used by the controller!

db.messages.ensureIndex({_type:1, recv_app:1});
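One hedge worth noting: on MongoDB versions of this vintage, a foreground ensureIndex blocks the database while the index builds — painful on a 33GB production collection. The background option (worth verifying against your MongoDB version) trades build speed for availability:

db.messages.ensureIndex({_type:1, recv_app:1}, {background: true});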

Now the app FLIES!! We dropped from 100+ seconds to 1.5 seconds (look at the “millis”) w00t!

> db.messages.find({ _type : "HL7Message", recv_app : "CAREGIVER"}).explain();
{
	"cursor" : "BtreeCursor _type_1_recv_app_1",
	"nscanned" : 791153,
	"nscannedObjects" : 791153,
	"n" : 791153,
	"millis" : 1546,
	"nYields" : 0,
	"nChunkSkips" : 0,
	"isMultiKey" : false,
	"indexOnly" : false,
	"indexBounds" : {
		"_type" : [
			[
				"HL7Message",
				"HL7Message"
			]
		],
		"recv_app" : [
			[
				"CAREGIVER",
				"CAREGIVER"
			]
		]
	}
}

To prevent this sort of thing, consider adding indexes when you create new queries. But the best approach is to be empirical: test whether the index actually helps before you add it. I’ll leave that for another day!