I wanted to get some aggregated count data by message type for a collection with over 1,000,000 documents.
The model is more or less this (a lot removed for simplicity):
class MessageLog
  include MongoMapper::Document

  # Attributes ::::::::::::::::::::::::::::::::::::::::::::::::::::::
  # Message's internal timestamp / when it was sent
  key :time, Time
  # Message type
  key :event_type, String, :default => "Not Set"
  # The message ID
  key :control_id, String, :index => true

  # Add created_at, updated_at timestamps!
  timestamps!

  # Indexes :::::::::::::::::::::::::::::::::::::::::::::::::::::::::
  def self.create_indexes
    MessageLog.ensure_index([[:created_at, 1]])
    MessageLog.ensure_index([[:event_type, 1]])
  end
end
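The indexes matter for everything that follows, so if you are recreating this setup you need to make sure they actually get built. A minimal sketch of one way to do it (the rake task name is my own, not from the original app):

# lib/tasks/indexes.rake (hypothetical task, shown for illustration)
namespace :db do
  desc "Ensure the MessageLog MongoDB indexes exist"
  task :ensure_indexes => :environment do
    MessageLog.create_indexes
  end
end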
Distinct
Though I originally hard-coded the message types (they do not change very often, and are meaningless without other code changes anyway), I figured I would test gathering the distinct types dynamically. MongoDB supports a distinct command. From the MongoDB console:
> db.message_logs.distinct("event_type")
[
  "Bed Order",
  "Cactus Update",
  "ED Release",
  "ED Summary",
  "Inpatient Admit",
  "Inpatient Discharge Summary",
  "Not Set",
  "Registration",
  "Registration Update",
  "Unknown Message Type"
]
Though I saw a distinct method in MongoMapper, I had trouble getting it to work (this is an older app on a pre-2.0 MongoMapper, and the call raised a method-missing error).
However, a very handy MongoMapper technique worked perfectly: every MongoMapper model can hand you its underlying MongoDB collection (the same object you address as db.collection.blah in the console), which helps when you need to execute MongoDB-style commands directly:
class MessageLog
  # @return [Array] a list of unique types (strings)
  def self.event_types
    MessageLog.collection.distinct("event_type")
  end
end
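Calling it from a Rails console returns the same list the MongoDB shell produced above (assuming the same data, of course):

>> MessageLog.event_types
=> ["Bed Order", "Cactus Update", "ED Release", "ED Summary", "Inpatient Admit", "Inpatient Discharge Summary", "Not Set", "Registration", "Registration Update", "Unknown Message Type"]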
Simple Count
I used a simple technique to iterate over each type and get the associated count:
class MessageLog
  # Count the message logs per event type, one indexed count per type.
  #
  # @return [Hash] the number of message logs per event type.
  def self.count_by_type
    results = {}
    MessageLog.event_types.each do |type|
      results[type] = MessageLog.count(:event_type => type)
    end
    results
  end
end
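The result is a plain Ruby hash keyed by event type. A sketch of the shape, using counts from the benchmark run shown later:

>> MessageLog.count_by_type
=> {"Registration Update" => 852189, "Registration" => 123064, "ED Summary" => 94933, ...}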
Map-Reduce Too Slow
In this instance, Map-Reduce turned out to be significantly slower, and I am not entirely sure why. My best guess: the group command runs its JavaScript reduce function over every document server-side, which is more expensive than calling count with a filter on the event_type key (a query that is covered by an index).
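One way to check that guess is to look at the query plan for the per-type count; the old Ruby driver exposes explain on cursors (a sketch, and the explain output format varies by MongoDB version):

# Does the count's query use the event_type index? (sketch; field names
# in the plan differ across MongoDB versions)
plan = MessageLog.collection.find(:event_type => "Registration").explain
puts plan["cursor"]   # => "BtreeCursor event_type_1" when the index is used

With that said, here is the group-based implementation: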
class MessageLog
  # Perform a group aggregation by event type.
  # Map-Reduce was slow by comparison (20 seconds vs 2.3 seconds).
  #
  # @return [Hash] the number of message logs per event type.
  def self.count_by_type_mr
    results = {}
    counts = MessageLog.collection.group({
      :key     => :event_type,
      :cond    => {},
      :reduce  => 'function(doc,prev) { prev.count += 1; }',
      :initial => {:count => 0}
    })
    counts.each { |r| results[r["event_type"]] = r["count"] }
    results
  end
end
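For reference, group returns an array of hashes, one per distinct key, and the count comes back as a JavaScript number (a Float in Ruby), which is why the method remaps it into a plain hash. A sketch of the shape, using counts from the results below:

# => [{"event_type" => "Bed Order", "count" => 6948.0},
#     {"event_type" => "Registration", "count" => 123064.0},
#     ...]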
Performance Results
As you can see, Map-Reduce took about 10 times longer: ~21 seconds versus ~2.3 seconds.
And this is over 1,129,519 documents, so it is a non-trivial test, IMO.
> measure_mr = Benchmark.measure("count") { results = MessageLog.count_by_type_mr }
> measure = Benchmark.measure("count") { results = MessageLog.count_by_type }
ruby-1.8.7-p334 :010 > puts measure_mr
  0.000000   0.000000   0.000000 ( 20.794720)
> puts measure
  0.020000   0.000000   0.020000 (  2.340708)
> results.map {|k,v| puts "#{k} #{v}"}
Not Set 1
Inpatient Admit 4,493
Unknown Message Type 1,292
Bed Order 6,948
Registration Update 852,189
Registration 123,064
ED Summary 94,933
Cactus Update 10,145
Inpatient Discharge Summary 18,150
ED Release 18,304
Summary
You may get better performance using simpler techniques for simple aggregate commands; Map-Reduce may well shine on more complex computations and queries.
But your best bet is to test it out with meaningful data samples.