MongoDB Group Map-Reduce Performance

I wanted to get some aggregated count data by message type for a collection with over 1,000,000 documents.

The model is more or less this (a lot removed for simplicity):

class MessageLog
  include MongoMapper::Document

  # Attributes ::::::::::::::::::::::::::::::::::::::::::::::::::::::
  # Message's internal timestamp / when it was sent
  key :time, Time
  # Message type
  key :event_type, String, :default => "Not Set"
  # The message ID
  key :control_id, String, :index => true

  # Add created_at, updated_at
  timestamps!

  # Indexes :::::::::::::::::::::::::::::::::::::::::::::::::::::::::
  def self.create_indexes
    MessageLog.ensure_index([[:created_at,1]])
    MessageLog.ensure_index([[:event_type,1]])
  end
end

Distinct

Though I originally hard-coded the message types (as they do not change very often, and are meaningless without other code changes anyway), I figured I would test dynamically gathering the distinct types. MongoDB supports the distinct function. From the MongoDB console:

> db.message_logs.distinct("event_type")
[
	"Bed Order",
	"Cactus Update",
	"ED Release",
	"ED Summary",
	"Inpatient Admit",
	"Inpatient Discharge Summary",
	"Not Set",
	"Registration",
	"Registration Update",
	"Unknown Message Type"
]

Though I saw distinct in MongoMapper, I had trouble getting it to work (this is an older app on a pre-2.0 MongoMapper, and I kept hitting a method-missing error).

However, a very powerful MongoMapper technique worked just fine: every MongoMapper model exposes its underlying MongoDB collection (the same db.collection object you address in the shell), which helps when you need to execute MongoDB-level commands directly:

class MessageLog
  # @return [Array] a list of unique types (strings)
  def self.event_types
    MessageLog.collection.distinct("event_type")
  end
end
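
Calling it from the console returns the same list the distinct query produced above (truncated here):

> MessageLog.event_types
 => ["Bed Order", "Cactus Update", "ED Release", "ED Summary", ...]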

Simple Count

I used a simple technique to iterate over each type and get the associated count:

class MessageLog
  # Perform a group aggregation by event type.
  #
  # @return [Hash] the number of message logs per event type.
  def self.count_by_type
    results = {}
    MessageLog.event_types.each {|type| results[type] = MessageLog.count(:event_type => type)}
    results
  end
end

Map-Reduce Too Slow

In this instance, it turned out that Map-Reduce was significantly slower, and I am not exactly sure why, other than that iterating over every document on the server is presumably more expensive than calling count with a filter on the event_type key (which is covered by an index).

class MessageLog
  # Perform a group aggregation by event type.
  # Map-Reduce was slow by comparison (20 seconds vs 2.3 seconds)
  #
  # @return [Hash] the number of message logs per event type.
  def self.count_by_type_mr
    results = {}
    counts = MessageLog.collection.group( {:key => :event_type, :cond => {}, :reduce => 'function(doc,prev) { prev.count += 1; }', :initial => {:count => 0} })
    counts.each {|r| results[r["event_type"]] = r["count"]}
    results
  end
end
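
Note that the code above actually uses MongoDB's group command via the driver rather than a literal mapReduce. For completeness, here is roughly what a true mapReduce formulation might look like. This is an untested sketch: it assumes the Ruby driver's Collection#map_reduce with inline output (MongoDB 1.8+ and a reasonably recent 1.x driver); on older versions, map_reduce returns a result collection that you would iterate instead.

class MessageLog
  # Hypothetical mapReduce equivalent of count_by_type_mr (sketch only)
  def self.count_by_type_map_reduce
    map    = "function() { emit(this.event_type, 1); }"
    reduce = "function(key, values) { var sum = 0; values.forEach(function(v) { sum += v; }); return sum; }"
    raw = MessageLog.collection.map_reduce(map, reduce, :out => {:inline => 1}, :raw => true)
    raw["results"].inject({}) { |acc, doc| acc[doc["_id"]] = doc["value"]; acc }
  end
end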

Performance Results

As you can see below, Map-Reduce took about 10 times longer: ~21 seconds versus ~2.3 seconds.

And this is over 1,129,519 documents, so it is a non-trivial test, IMO.

> measure_mr = Benchmark.measure("count") { results = MessageLog.count_by_type_mr}
> measure = Benchmark.measure("count") { results = MessageLog.count_by_type }
ruby-1.8.7-p334 :010 > puts measure_mr
  0.000000   0.000000   0.000000 ( 20.794720)
> puts measure
  0.020000   0.000000   0.020000 (  2.340708)
> results.map {|k,v| puts "#{k} #{v}"}
Not Set                          1
Inpatient Admit              4,493
Unknown Message Type         1,292
Bed Order                    6,948
Registration Update        852,189
Registration               123,064
ED Summary                  94,933
Cactus Update               10,145
Inpatient Discharge Summary 18,150
ED Release                  18,304

Summary

You may get better performance using simpler techniques for simple aggregate commands. And maybe Map-Reduce shines better on more complex computations/queries.

But your best bet is to test it out with meaningful data samples.

Hiring a Team Member

I read this very good post “The Number One Trait of a Great Developer” by Tammer Saleh at Engine Yard, and it made me think…

I used to rant about “It’s the Business, Stupid” in conferences, implying that we should be solving problems for our clients, not simply playing with the next shiny toy. Sure, I love shiny toys, and I love to play with them — especially when they make sense within the context of solving a problem. But your scenario is an oft-spotted pattern where folks lack the engineering skills required to build right-sized solutions to meet the here-and-now needs.

And of course, anyone can come up with a complex solution that sprawls across multiple cubicle walls on e-size plotter paper (that’s easy). Only a rare few can come up with the minimalist solution that meets the needs of the business, and can easily grow over time.

As far as hiring, I like to look for the “engineering” mind. After all, engineers put man on the moon, not scientists.

MongoDB Index Performance

As part of this (unintended) mini-series on MongoDB and indexing, I had written a little test to see if I could document performance gains through indexing. I used real-world data, albeit only 50,000 records, to query out a handful of documents (24 being the most).

Here is the code:

require 'test_helper'

class EncounterListingTest < Test::Unit::TestCase

  context "Indexing" do
    ProfileStats2 = Struct.new(:doctor_num, :count, :timing1, :timing2, :timing3)

    should "profile assorted doctor patient retrievals" do
      stats = []
      doctor_nums = ["602490", "603324", "212043", "602938"]
      doctor_nums.each_with_index do |doctor_num, i|
        MongoMapper.database.collection('encounters').drop_indexes
        show_indexes if i == 0
        timing1 = (measure_performance(doctor_num) + measure_performance(doctor_num) + measure_performance(doctor_num))/3

        MongoMapper.database.collection('encounters').drop_indexes
        add_index([[:private_physician, 1]])
        show_indexes if i == 0
        timing2 = (measure_performance(doctor_num) + measure_performance(doctor_num) + measure_performance(doctor_num))/3

        MongoMapper.database.collection('encounters').drop_indexes
        add_index([[:private_physician,1], [:notify_physician,1], [:visible_count,1]])
        show_indexes if i == 0
        timing3 = (measure_performance(doctor_num) + measure_performance(doctor_num) + measure_performance(doctor_num))/3

        n_count = Encounter.count(:private_physician => doctor_num, :notify_physician => 'Y', :visible_count.gt => 0)
        stats << ProfileStats2.new(doctor_num, n_count, timing1, timing2, timing3)

      end

      File.open("test/performance/index_stats_results-#{Time.now.strftime("%d-%m-%Y")}.csv", 'w') do |f|
        puts "%10s  %6s  %5s  %5s  %5s" % ["doctor", "count", "None", "Phys", "Phys/Ntfy/Vis"]
        f.puts "doctor, count, None, Phys, PhysNtfyVis"
        stats.each do |s|
          results = "%10d, %6d, %5.3f, %5.3f, %5.3f" % [s.doctor_num, s.count, s.timing1, s.timing2, s.timing3]
          puts results
          f.puts "%d, %d, %5.3f, %5.3f, %5.3f" % [s.doctor_num, s.count, s.timing1, s.timing2, s.timing3]
        end
      end

    end

  end

  private
  def show_stats(stats)
    stats.each do |s|
      puts "%6d, %5.3f, %s" % [s.count, s.timing, s.index_type]
    end
  end

  def measure_performance(doctor_num = "99602326")
    start = Time.now
    n_public = Encounter.where(:private_physician => doctor_num, :notify_physician => 'Y', :visible_count.gt => 0).all
    delta = Time.now - start
    delta
  end

  def show_indexes
    puts "%s INDEXES %s" % ["*"*12, "*"*12]
    Encounter.collection.index_information.collect { |index| puts "    #{index[0]}" }
  end

  def add_index(new_index)
    coll = MongoMapper.database.collection('encounters')
    coll.drop_index(new_index) if !coll.index_information.detect { |index| index[0] == new_index }.nil?
    Encounter.ensure_index(new_index)
  end

end

Results:

The effect of adding indexes on query performance

The results are shown in the accompanying graph. Except for the query that returned 24 documents, the general trend was that 3 indexes were better than one. And one was w-a-a-a-y better than none (of course, you already knew that). The odd outlier was the count = 6 case, where a single index did not perform as well as it did in all the other tests.

A Walk Through the Valley of Indexing in MongoDB

As you walk through the valley of MongoDB performance, you will undoubtedly find yourself wanting to optimize your indexes at some point or other.

How to Watch Your Queries

Run your database with profiling on. I have an alias for starting up mongo in profile mode (‘p’ stands for profile):

alias mongop="<mongodb-install>/bin/mongod
      --smallfiles --noprealloc --profile=1 --dbpath <mongodb-install>/data/db"

This defaults to treating queries that take longer than 100 ms as "slow."
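
You can also toggle profiling without restarting mongod; the 1.x Ruby driver exposes the profiling level on the database object. A sketch (verify the accessor against your driver version):

# Hypothetical: switch the profiler on from a Rails or IRB console
# levels are :off, :slow_only (honors the slow-ms threshold), or :all
MongoMapper.database.profiling_level = :slow_only
puts MongoMapper.database.profiling_level   # => :slow_only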

Add a logger (if you are using MongoMapper) and tail the log file to see the queries.

##### MONGODB SETTINGS #####
# You can use :logger => Rails.logger, to get output of the mongo queries, or create a separate logger.
logger = Logger.new('mongo-development.log')
MongoMapper.connection = Mongo::Connection.new('localhost', 27017, {:pool_size => 5, :auto_reconnect => true, :logger => logger})
MongoMapper.database = "mdalert-development"

Run your query/exercise the app.

Examine the mongo log (trimmed for legibility), and look for the primary collection you were querying (here, accounts).

['$cmd'].find(#"settings", "query"=>{:identifier=>"ItemsPerPage"}, "fields"=>nil}>).limit(-1)
['$cmd'].find(#"accounts", "query"=>{:state=>"active"}, "fields"=>nil}>).limit(-1)
['accounts'].find({:state=>"active"}).limit(15).sort(email1)
['settings'].find({:identifier=>"AutoEmail"}).limit(-1)

Use MongoDB’s Explain to dig deeper.

Open up the mongo shell (<mongodb-install>/bin/mongo) and enter the query that you want explained. Hint: you can take much of it from the query in the log.

Without any indexes, you can see the query is basically scanning the entire collection. A bad thing! Another tell is that the cursor type is "BasicCursor."

> db.accounts.find({state: "active"}).limit(15).sort({email: 1}).explain();
{
  "cursor" : "BasicCursor",
  "nscanned" : 11002,
  "nscannedObjects" : 11002,
  "n" : 15,
  "scanAndOrder" : true,
  "millis" : 44,
  "nYields" : 0,
  "nChunkSkips" : 0,
  "isMultiKey" : false,
  "indexOnly" : false,
  "indexBounds" : {
  }
}
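
Incidentally, you do not have to leave Ruby to run explain; the cursor in the 1.x Ruby driver has an explain method that returns the same information as the shell. A sketch (not verified against this app's exact driver version):

# Hypothetical: explain a query from the Rails console instead of the shell
cursor = Account.collection.find({:state => "active"}).limit(15).sort([["email", 1]])
puts cursor.explain.inspect   # => {"cursor"=>"BasicCursor", "nscanned"=>11002, ...}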

Since I was doing a find on state, and a sort on email (or last_name), I added a compound index using MongoMapper (you could have just as easily done it at the mongo console).

Account.ensure_index([[:state,1],[:email,1]])
Account.ensure_index([[:state,1],[:last_name,1]])

Re-running the explain, you can see:

  • the cursor type is now BtreeCursor (i.e., an index is being used)
  • the entire collection is no longer scanned
  • the retrieval went from 44 millis down to 2 millis
  • Success!!
> db.accounts.find({state: "active"}).limit(15).sort({email: 1}).explain();
{
  "cursor" : "BtreeCursor state_1_email_1",
  "nscanned" : 15,
  "nscannedObjects" : 15,
  "n" : 15,
  "millis" : 2,
  "nYields" : 0,
  "nChunkSkips" : 0,
  "isMultiKey" : false,
  "indexOnly" : false,
  "indexBounds" : {
    "state" : [ [ "active", "active" ] ],
    "email" : [ [ {"$minElement" : 1}, {"$maxElement" : 1} ] ]
  }
}

Fiddling a Bit More – Using the Profiler

You can drop into the mongo console and see more specifics using the mongo profiler.

> db.setProfilingLevel(1,15)
{ "was" : 1, "slowms" : 100, "ok" : 1 }

For this example, I cleared the indexes on accounts, ran the following query, and examined its profile data.
Note: the timing can vary over successive runs, but it is generally fairly consistent, and it is close to the "millis" value you see in the explain output.

> db.accounts.find({state: "active"}).limit(15).sort({email: 1})
> db.system.profile.find();
{ "ts" : ISODate("2011-11-27T21:09:04.237Z"),
  "info" : "query mdalert-development.accounts
  ntoreturn:15 scanAndOrder
  reslen:8690
  nscanned:11002
  query: { query: { state: "active" },
    orderby: { email: 1.0 } }
    nreturned:15 163ms", "millis" : 43 }

Now let’s add back the indexes… one at a time. First up, let’s add “state.”

> db.accounts.ensureIndex({state:1})
> db.accounts.find({state: "active"}).limit(15).sort({email: 1})
> db.system.profile.find({info: /.accounts/})
{ "ts" : ISODate("2011-11-27T21:26:29.801Z"),
  "info" : "query mdalert-development.accounts
  ntoreturn:15 scanAndOrder
  reslen:546
  nscanned:9747
  query: { query: { state: "active" },
    orderby: { email: 1.0 }, $explain: true }
    nreturned:1 81ms", "millis" : 81 }

Hmmm. Not so good! Let’s add in the compound index that we know we need, and run an explain:

> db.accounts.ensureIndex({state:1,email:1})
> db.accounts.find({state: "active"}).limit(15).sort({email: 1}).explain()
{
	"cursor" : "BtreeCursor state_1_email_1",
	"nscanned" : 15,
	"nscannedObjects" : 15,
	"n" : 15,
	"millis" : 0,

And sure enough, we get good performance. The millis is so small that this query will not even show up in the profiler (given the 15 ms threshold set with setProfilingLevel above).

If you want to clear the profile stats, you'll soon find out you can't simply remove the documents. The only way I saw to do it was as follows:

  • restart mongod in non-profiling mode
  • reopen the mongo console and type:
    db.system.profile.drop()

You should now see the profile being empty:

> show profile
db.system.profile is empty

Now you can restart mongod in profiling mode and see your latest profiling data without all the ancient history.

Some Gotchas

Regex searches cannot be indexed

If your query is a regex like this one (case-insensitive and unanchored), the index can't narrow the search, and the whole index still gets walked. With the regex, retrieval is 501 ms (not bad, given 317K records):

> db.message_logs.find({patient_name:/ben franklin/i}).sort({created_at:-1}).explain()
{
	"cursor" : "BtreeCursor patient_name_1 multi",
	"nscanned" : 317265,
	"nscannedObjects" : 27,
	"n" : 27,
	"millis" : 501,
	"nYields" : 0,
	"nChunkSkips" : 0,
	"isMultiKey" : false,
	"indexOnly" : false,
	"indexBounds" : {
		"patient_name" : [ [ "", { } ], [ /ben franklin/, /ben franklin/ ] ]
	}
}

Without regex, it is essentially instantaneous:

> db.message_logs.find({patient_name:'Ben Franklin'}).sort({created_at:-1}).explain()
{
	"cursor" : "BtreeCursor patient_name_1",
	"nscanned" : 27,
	"nscannedObjects" : 27,
	"n" : 27,
	"scanAndOrder" : true,
	"millis" : 0,
	"nYields" : 0,
	"nChunkSkips" : 0,
	"isMultiKey" : false,
	"indexOnly" : false,
	"indexBounds" : {
		"patient_name" : [ [ "Ben Franklin", "Ben Franklin" ] ]
	}
}
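
The nuance is that the pattern above is case-insensitive and unanchored, so MongoDB still has to walk the entire index (note nscanned of 317,265 versus only 27 objects). A case-sensitive regex anchored to the start of the string can use the index bounds to narrow the scan. A sketch, using the same MongoMapper query style as the test code earlier:

# Hypothetical: an anchored, case-sensitive prefix regex lets MongoDB
# bound the index scan, unlike /ben franklin/i above
MessageLog.where(:patient_name => /^Ben Franklin/).all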

Tips for examining profiler output

  • Look at the most recent offenders:
    show profile
  • Look at a single collection:
    db.system.profile.find({info: /message_logs/})
  • Look at slow queries:
    db.system.profile.find({millis : {$gt : 500}})
  • Look at queries containing a specific query param with a response >100ms (note that a JavaScript object can't repeat the info key, so use the most specific pattern):
    db.system.profile.find({info: /patient_name/, millis : {$gt : 100}})

Summary

So here you have an example of how to see indexes in action, how to create them, and how to measure their effects.


Configuring MongoMapper Indexes in Rails App

Not quite sure where the best place is to define MongoDB indexes via MongoMapper in a Rails app… My progression has been:

  1. as part of the key definition in the model class
  2. in a rails initializer
  3. hybrid between initializer and model
  4. rake task invoking model methods

Define Indexes on the Keys

This works fine during development.

class Account
  include MongoMapper::Document
  ...
  # Attributes ::::::::::::::::::::::::::::::::::::::::::::::::::::::
  key :login, String, :unique => true, :index => true
  key :msid, String, :index => true
  key :doctor_num, String, :index => true
  ...
end

Define Indexes in an Initializer

When I wanted to trigger creation of a new index, I would add it here. The only problem is that restarting a production server with tons of data gets held up by the index creation task.

# Rails.root/config/initializers/mongo_config.rb
Account.ensure_index(:last_name)
Group.ensure_index(:name)
Group.ensure_index(:group_num)
...
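
One mitigation, if you do need to build an index against a large production collection, is to ask MongoDB to build it in the background. This is only a sketch; it assumes your MongoDB version supports background builds (1.4+) and that MongoMapper's ensure_index passes the option through to the driver:

# Hypothetical: a background build does not block other operations
# the way a default (foreground) index build does
Account.ensure_index([[:last_name, 1]], :background => true)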

Define Indexes in a Class Method, Invoke in Initializer

A small tweak to putting indexes into an initializer was to place the knowledge of the indexes back into the model classes themselves. Then, all you needed to do was invoke the model class method to create its own indexes.

The Initializer Code

 
# Rails.root/config/initializers/mongo_config.rb
Event.create_indexes
Encounter.create_indexes
Setting.create_indexes

The Model(s) Code

 
class Setting
  include MongoMapper::Document
  # Attributes ::::::::::::::::::::::::::::::::::::::::::::::::::::::
  # What the user sees as a label
  key :label, String
  # How we reference it in code
  key :identifier, String, :required => true
  ...
  # Indexes :::::::::::::::::::::::::::::::::::::::::::::::::::::::::
  def self.create_indexes
    self.ensure_index(:identifier, :unique => true)
    self.ensure_index(:label, :unique => true)
  end
  ...
end

Enter the Rake!

Of course, you could also invoke the index creation code in a rake task, as pointed out here.

The beauty of a rake task, as best I can tell, is this:

  • You can run it at any time to update the indexes
  • You do not bring a deploy to a screeching halt because you are waiting for index creation

I was already standardizing on how I was creating indexes inside each model class — where better to keep on top of what the indexes for a class should be than in the class itself!

# app/models/setting.rb
class Setting
  ...
  def self.create_indexes
    self.ensure_index(:identifier, :unique => true)
    self.ensure_index(:label, :unique => true)
  end
  ...
end

I created a new class in the model directory (so that it is close to where the models are defined) that simply loops through each model class to generate the proper indexes:

# app/models/create_indexes.rb
class CreateIndexes
  def self.all
    puts "*"*15 + " GENERATING INDEXES" + "*"*15
    MongoMapper.database.collection_names.each do |coll|
      # Avoid "system.indexes"
      next if coll.index(".")

      model = coll.singularize.camelize.constantize
      model.create_indexes if model.respond_to?(:create_indexes)
      model.show_indexes if model.respond_to?(:show_indexes)
    end
  end
end

You can invoke it easily from the Rails console: CreateIndexes.all

Next I created a rake task (in lib/tasks/indexes.rake) that invoked the ruby code to do the indexing mojo.

namespace :db do
  namespace :mongo do
    desc "Create mongo_mapper indexes"
    task :index => :environment do
      CreateIndexes.all
    end
  end
end
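
With that in place, index creation becomes a deliberate, on-demand step (run it whenever you add a new index definition) rather than a side effect of booting the app:

rake db:mongo:index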

Any tips/comments/insights appreciated…

PS: self.show_indexes Mix-in

I created a mix-in for the “show_indexes()” class method for each model. I could not add it directly to the MongoMapper::Document class unfortunately — I ran into errors and finally gave up. Here’s the mix-in that I defined in lib/mongo_utils.rb:

module MongoMapper
  module IndexUtils
    puts "Customizing #{self.inspect}"
    module ClassMethods
      def show_indexes
        puts "%s #{self.name} INDEXES %s" % ["*"*12, "*"*12]
        self.collection.index_information.collect do |index|
          puts "    #{index[0]}#{index[1].has_key?("unique") ? " (unique)":"x"}"
        end
      end
    end
    def self.included(base)
      #puts "#{base} is being extended'"
      base.extend(ClassMethods)
    end
  end
end

And you use it as follows:

require 'mongo_mapper'
require 'mongo_utils'
class Setting
  include MongoMapper::Document
  include MongoMapper::IndexUtils
  ...

Design Debates

many times there are two (or more) seemingly viable approaches that people can be arguing for…

when in doubt, sketch it out… that is,

  • quick model diagrams, or
  • quick sequence diagrams,
  • compare and contrast

now let’s (hopefully continue to) presume this is about an exceedingly critical aspect, a make-or-break design decision. because surely, you aren’t spending precious project resources deciding whether or not to use 2 or 4 spaces per indent level!

so, if sketching out the ideas and comparing still did not reveal a clear winner, then code up the competing ideas and put them to the test. give the designs a day or two of effort, or more — in proportion to the critical nature of getting the decision right. then, have some a priori metrics by which to pick the winner via “testing” the designs.

  • better performance
  • less code
  • easier to grok
  • suitability to task
  • etc

and if you still can’t choose a clear winner, well, man-up, be a leader, and make a freaking decision (even if it is flipping a coin), and don’t look back.

Considering Sprint Length

A friend of mine had an interesting situation:

  • Novel product, many unknowns
  • Multiple teams grouped into 3 product areas
  • Experience doing 3-week sprints

Only a fool would do anything other than the 30-day official sprint cycle that I saw on some website and in a few books.

(Just kidding. Unfortunately, like most of agile development, context has a tremendous impact on what you choose to do, process-wise.)

A lot could go into what the Optimal Sprint Length should be… You could ponder the independent variable (sprint length) and try to guess a value that optimizes the dependent variables, which would be, what, maybe cost, rate of feature delivery, and quality? You could do the “democratic process” and allow the team to vote, or even do “rock-paper-scissors” to figure out 2 or 3 weeks.

However, what if we built a continuum of sprint lengths for the sake of discussion. On the one end, we start at the idealization of doing one useful feature at a time and deploying it immediately — think simple web app. Anything longer than this is a compromise based on some (hopefully valid) reason. On the other extreme, we could wait until the entire system is done before deploying or integrating, maybe after 6 months or a year.

The cost of “batching up” the “work in process” at the upper end of long sprint lengths is pretty obvious to everyone. I submit that if you agree with (or have experienced first-hand) the premise that batching work has a non-linear impact on overall cost (including the hidden and subtle costs of everything that we know is bad with waterfall), then it stands to reason one might favor shorter cycles and less batching.

Not to digress, but the parallels exist in industry. To allow WIP to be large, and to allow certain parts of the process to run at high levels of batching, is a risk: the risk that the items in the batch, once released into the wild, are discovered not to be as valuable as first thought. Well, it’s water over the dam; time and effort you will never get back. (Think: extra features built because someone thought they would be useful, and it turned out that the marketplace thought otherwise.) Nonetheless, sometimes weighing the risks will lead you to some level of batching that makes the most sense.

There is often much more to the decision on sprint length than purely the development team. For example, what is the cost of QA? If the cost of QA is no different for one feature at a time versus a week’s worth of features, then QA cycle time/cost is not an issue. However, if it requires a week of QA time to regression test the system for even a single small feature or bug fix, then you have a serious input into what the optimal sprint length should be.

Naturally, one could do development sprints at one frequency, and QA sprints at another… and even customer ship sprints at a completely other cycle time.

Regarding multiple teams… this is a solution that can be recursively applied, much like you would at a software architectural level. If the teams are horribly coupled, your costs will balloon and no amount of pondering sprint lengths will have a significant impact. If the work dependencies are carefully controlled between the teams, sprint length could vary between teams due to their own local reasons.

Much like the QA process can be a “tax” on each Sprint, what other taxes does your process incur? Running down to a one week sprint will likely reveal expensive parts of the process that could be ripe for improving.

So having said all of that… Here’s a thought. Why not simply agree to try out a few different lengths for enough sprints to get a feel for the differences? Try one-week sprints for the next six weeks. Try three-week sprints three times. See if you can monitor metrics that will tell you what worked better. Consider that different teams might also work at different frequencies to test the “costs” of thinking the teams should be synchronized.

Much like with our USA republic, surely don’t let democratic, mob rule win the day.

Do I Still Do Domain Modeling?

Got a very nice “blast from the past” contact (4 levels deep) on LinkedIn. Scott was a member of a team where we went through object modeling for their business application.

The reason I wanted to contact you was two fold:

First for some reason that training has stuck with me more that many trainings and I still go back to the cobweb section of my brain and bring it out every time I have a object modeling task, so thanks you did a good job helping me.  This is true for many of the people that were there.

Second as I have been given another task of modeling a large system from scratch I was wondering  about how you feel about the Domain-Neutral model that was presented in the training you did for us.  I have found it useful over the years but do you still use this model in anything that you design this many years in the future?

Thanks for the help.
Regards, Scott

And yes, I still do domain modeling the basic way that we did 11 years ago (gulp). I only have my old copy of Together anymore, so I am stuck back in time there. In person, I always use post-it notes, flip-charts, and markers. Once I want to make a computer version, I’ll use different modeling tools depending on the need. For example, UMLet is a great, simple little tool for banging out a quick diagram in no time flat, and it can be used by anybody.

Another good approach from what we went through 10 years ago is to follow the color modeling order to help in the discovery process… that is, focusing on the following as you walk through the assorted primary user scenarios:

  1. First, trying to discover which time-sensitive aspects the business cares about most (the pinks).
  2. Then looking towards the roles that may be at play there (yellows).
  3. Are there any specific things (like contracts, purchase orders — greens)?
  4. And finally, any descriptive elements (blues) to go alongside the greens?

For the past 18 months, I have been in major love with Ruby, Rails, and MongoDB, plus all of the surrounding tools and community (see my recent blog posts). Ruby is the object-oriented language I wanted when I was doing C++, and MongoDB/MongoMapper is the OO-like DBMS that I always wanted. Putting it all together makes for really productive, high-quality, consistent development: I can quickly model out domain things and try them in a working application.

I’ve never done an adequate job of explaining how I like to approach doing just-enough design up front, but you can find a few snippets here:

Some tips that I find useful when doing initial up-front Domain Modeling (what you are about to do):

  • Spend time with subject matter experts and the business/product owners
  • Build out as you gather high-level features
  • Do breadth first, then depth
  • Only go into details when it will help reduce an unacceptable level of risk

That last bullet is key. I like to do just enough up-front work to satisfy the stakeholders and myself that we can answer the questions “how much?” and “when?” to the desired level of specificity. When you attempt a high-level estimate of a given feature, and it is too big for your comfort, then you can get more detail and break it down further until it becomes acceptable.

If you do not do enough up front, and your estimate is off by an order of magnitude or three, you might upset the client!

On the flipside, if you do too much up front (detailed modeling), then you risk not getting frequent, tangible, working results into the hands of the client soon enough.

It’s a big balancing act.

Development with MongoMapper

A question popped up:

Mongo is schema-less, which means we can create new fields when needed. Reading the MongoMapper documentation, it seems we still need to write model code like the code below, so we would have to change this code whenever we need a new field. Isn’t this a limitation of using the schema-less database MongoDB?

class User
  include MongoMapper::Document

  key :first_name, String
  key :last_name, String
  key :age, Integer
end

Strict answer: NO, you do not have to change the model code to add a new “column” to your “table.”

I like to think of MongoMapper as making my domain classes behave more like an object-oriented database than a data management layer.

During rapid development, having a database that just follows along with your model makes for speedy feature creation and prototyping. Though “migrations” being first-class citizens, of sorts, in Rails is a great step forward for managing schema changes with traditional databases, MongoDB doesn’t even require that level of concern.

That said, it also implies you, as the developer/designer/architect, are treated as an individual willing to take on the responsibility of wielding this amount of “power” (just like Ruby does).

Therefore, even though you do not technically have to add keys to your MongoMapper document class, if you are talking about core aspects of your domain, I would add the keys so as to make it clear what you are modeling.

Now, there may be certain cases where a class is just storing a bunch of random key-value pairs that are not key elements of your domain, with the sole purpose of merely showing them later. Maybe you are parsing data or taking a feed of lab results, for example, and just need to format the information without searching or sorting or much of anything. In this situation, you do not need to care about adding the (unknown) keys to the model class.
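
For that kind of grab-bag data you can simply write whatever fields show up, for example by dropping below MongoMapper to the raw collection. Here is a hypothetical sketch (the LabResult model and its fields are made up) using the 1.x Ruby driver’s save and find_one:

# Hypothetical: store arbitrary, undeclared fields via the raw collection
LabResult.collection.save({
  :control_id => "ABC123",
  :raw_feed   => { "wbc" => 7.2, "rbc" => 4.9, "source" => "HL7 feed" }
})

# Reading it back returns plain hashes, which is fine for display-only data
doc = LabResult.collection.find_one(:control_id => "ABC123")
doc["raw_feed"]["wbc"]   # => 7.2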

As the saying goes, “just because you *can* do something does not mean you *should* do something.”