I wanted aggregate counts by message type for a collection with over 1,000,000 documents.
The model is more or less this (a lot removed for simplicity):
class MessageLog
  include MongoMapper::Document

  # Attributes ::::::::::::::::::::::::::::::::::::::::::::::::::::::
  # Message's internal timestamp / when it was sent
  key :time, Time
  # Message type
  key :event_type, String, :default => "Not Set"
  # The message ID
  key :control_id, String, :index => true

  # Add created_at, updated_at
  timestamps!

  # Indexes :::::::::::::::::::::::::::::::::::::::::::::::::::::::::
  def self.create_indexes
    MessageLog.ensure_index([[:created_at, 1]])
    MessageLog.ensure_index([[:event_type, 1]])
  end
end
Distinct
Though I originally hard-coded the message types (they do not change very often, and are meaningless without other code changes anyway), I figured I would test gathering the distinct types dynamically. MongoDB supports the distinct command. From the MongoDB console:
> db.message_logs.distinct("event_type")
[
"Bed Order",
"Cactus Update",
"ED Release",
"ED Summary",
"Inpatient Admit",
"Inpatient Discharge Summary",
"Not Set",
"Registration",
"Registration Update",
"Unknown Message Type"
]
Though I saw distinct in MongoMapper, I had trouble getting it to work (this is an older app, pre-2.0, and the call raised a method-missing error).
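For reference, this is roughly what the direct call should look like on a newer MongoMapper (a sketch; on this app's older version, this is the call that raised the error):

  # Hypothetical on newer MongoMapper versions; raised NoMethodError here
  MessageLog.distinct(:event_type)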
However, a very powerful technique within MongoMapper worked perfectly: every MongoMapper model exposes its underlying MongoDB collection, so you can run MongoDB-style (db.collection.command) operations directly:
class MessageLog
  # @return [Array] a list of unique types (strings)
  def self.event_types
    MessageLog.collection.distinct("event_type")
  end
end
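Calling it from the console then returns the same list as the shell query above (output abbreviated):

  >> MessageLog.event_types
  => ["Bed Order", "Cactus Update", ..., "Unknown Message Type"]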
Simple Count
I used a simple technique to iterate over each type and get the associated count:
class MessageLog
  # Perform a group aggregation by event type.
  #
  # @return [Hash] the number of message logs per event type.
  def self.count_by_type
    results = {}
    MessageLog.event_types.each { |type| results[type] = MessageLog.count(:event_type => type) }
    results
  end
end
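Each of those counts filters on the event_type key, which the ensure_index call in the model covers. You can confirm the index is being used from the MongoDB shell (a sketch; on the old server versions in play here, explain should report a BtreeCursor over event_type_1 rather than a BasicCursor scan):

  > db.message_logs.find({"event_type" : "Registration"}).explain()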
Map-Reduce Too Slow
In this instance, Map-Reduce turned out to be significantly slower, and I am not exactly sure why, other than supposing that iterating over each document is more expensive than calling count with a filter on the event_type key (which is covered by an index).
class MessageLog
  # Perform a group aggregation by event type.
  # Map-Reduce was slow by comparison (20 seconds vs 2.3 seconds).
  #
  # @return [Hash] the number of message logs per event type.
  def self.count_by_type_mr
    results = {}
    counts = MessageLog.collection.group({
      :key     => :event_type,
      :cond    => {},
      :reduce  => 'function(doc, prev) { prev.count += 1; }',
      :initial => {:count => 0}
    })
    counts.each { |r| results[r["event_type"]] = r["count"] }
    results
  end
end
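Strictly speaking, the snippet above uses MongoDB's group command rather than a full map-reduce. For comparison, an explicit map-reduce through the driver would look roughly like this (a sketch, assuming the 1.x Ruby driver's Collection#map_reduce with inline output; untested against this app):

class MessageLog
  # Sketch: the same count as a true map-reduce, assuming the 1.x Ruby
  # driver's Collection#map_reduce and inline output (untested here).
  def self.count_by_type_explicit_mr
    map    = 'function() { emit(this.event_type, 1); }'
    reduce = 'function(key, values) { var total = 0; values.forEach(function(v) { total += v; }); return total; }'
    results = {}
    out = MessageLog.collection.map_reduce(map, reduce, :out => {:inline => 1}, :raw => true)
    out["results"].each { |r| results[r["_id"]] = r["value"] }
    results
  end
end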
Performance Results
As you can see, Map-Reduce took about 10 times longer, ~21 seconds versus ~2.3 seconds.
And this is over 1,129,519 documents, so it is a non-trivial test, IMO.
> measure_mr = Benchmark.measure("count") { results = MessageLog.count_by_type_mr}
> measure = Benchmark.measure("count") { results = MessageLog.count_by_type }
ruby-1.8.7-p334 :010 > puts measure_mr
0.000000 0.000000 0.000000 ( 20.794720)
> puts measure
0.020000 0.000000 0.020000 ( 2.340708)
> results.map {|k,v| puts "#{k} #{v}"}
Not Set 1
Inpatient Admit 4,493
Unknown Message Type 1,292
Bed Order 6,948
Registration Update 852,189
Registration 123,064
ED Summary 94,933
Cactus Update 10,145
Inpatient Discharge Summary 18,150
ED Release 18,304
Summary
You may get better performance using simpler techniques for simple aggregate commands, and Map-Reduce may shine on more complex computations and queries.
But your best bet is to test it out with meaningful data samples.
