I wanted to get some aggregated count data by message type for a collection with over 1,000,000 documents.
The model is more or less this (a lot removed for simplicity):
class MessageLog
  include MongoMapper::Document

  # Attributes ::::::::::::::::::::::::::::::::::::::::::::::::::::::
  # Message's internal timestamp / when it was sent
  key :time, Time
  # Message type
  key :event_type, String, :default => "Not Set"
  # The message ID
  key :control_id, String, :index => true

  # Add created_at, updated_at timestamps!
  timestamps!

  # Indexes :::::::::::::::::::::::::::::::::::::::::::::::::::::::::
  def self.create_indexes
    MessageLog.ensure_index([[:created_at, 1]])
    MessageLog.ensure_index([[:event_type, 1]])
  end
end
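The indexes matter for everything that follows, so if you are recreating this setup you need to make sure they actually get built. A minimal sketch of one way to do it (the rake task name is my own, not from the original app):

# lib/tasks/indexes.rake (hypothetical task, shown for illustration)
namespace :db do
  desc "Ensure the MessageLog MongoDB indexes exist"
  task :ensure_indexes => :environment do
    MessageLog.create_indexes
  end
end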
Distinct
Though I originally hard-coded the message types (they do not change very often, and are meaningless without other code changes anyway), I figured I would test gathering the distinct types dynamically. MongoDB supports a distinct command. From the MongoDB console:
> db.message_logs.distinct("event_type")
[
  "Bed Order",
  "Cactus Update",
  "ED Release",
  "ED Summary",
  "Inpatient Admit",
  "Inpatient Discharge Summary",
  "Not Set",
  "Registration",
  "Registration Update",
  "Unknown Message Type"
]
Though I saw a distinct method in MongoMapper, I had trouble getting it to work (this is an older app on a pre-2.0 MongoMapper, and the call raised a method-missing error).
However, a very handy MongoMapper technique worked perfectly: every MongoMapper model can hand you its underlying MongoDB collection (the same object you address as db.collection.blah in the console), which helps when you need to execute MongoDB-style commands directly:
class MessageLog
  # @return [Array] a list of unique types (strings)
  def self.event_types
    MessageLog.collection.distinct("event_type")
  end
end
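Calling it from a Rails console returns the same list the MongoDB shell produced above (assuming the same data, of course):

>> MessageLog.event_types
=> ["Bed Order", "Cactus Update", "ED Release", "ED Summary", "Inpatient Admit", "Inpatient Discharge Summary", "Not Set", "Registration", "Registration Update", "Unknown Message Type"]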
Simple Count
I used a simple technique to iterate over each type and get the associated count:
class MessageLog
  # Count the message logs per event type, one indexed count per type.
  #
  # @return [Hash] the number of message logs per event type.
  def self.count_by_type
    results = {}
    MessageLog.event_types.each do |type|
      results[type] = MessageLog.count(:event_type => type)
    end
    results
  end
end
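The result is a plain Ruby hash keyed by event type. A sketch of the shape, using counts from the benchmark run shown later:

>> MessageLog.count_by_type
=> {"Registration Update" => 852189, "Registration" => 123064, "ED Summary" => 94933, ...}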
Map-Reduce Too Slow
In this instance, Map-Reduce turned out to be significantly slower, and I am not entirely sure why. My best guess: the group command runs its JavaScript reduce function over every document server-side, which is more expensive than calling count with a filter on the event_type key (a query that is covered by an index).
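One way to check that guess is to look at the query plan for the per-type count; the old Ruby driver exposes explain on cursors (a sketch, and the explain output format varies by MongoDB version):

# Does the count's query use the event_type index? (sketch; field names
# in the plan differ across MongoDB versions)
plan = MessageLog.collection.find(:event_type => "Registration").explain
puts plan["cursor"]   # => "BtreeCursor event_type_1" when the index is used

With that said, here is the group-based implementation: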
class MessageLog
  # Perform a group aggregation by event type.
  # Map-Reduce was slow by comparison (20 seconds vs 2.3 seconds).
  #
  # @return [Hash] the number of message logs per event type.
  def self.count_by_type_mr
    results = {}
    counts = MessageLog.collection.group({
      :key     => :event_type,
      :cond    => {},
      :reduce  => 'function(doc,prev) { prev.count += 1; }',
      :initial => {:count => 0}
    })
    counts.each { |r| results[r["event_type"]] = r["count"] }
    results
  end
end
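For reference, group returns an array of hashes, one per distinct key, and the count comes back as a JavaScript number (a Float in Ruby), which is why the method remaps it into a plain hash. A sketch of the shape, using counts from the results below:

# => [{"event_type" => "Bed Order", "count" => 6948.0},
#     {"event_type" => "Registration", "count" => 123064.0},
#     ...]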
Performance Results
As you can see, Map-Reduce took about 10 times longer: ~21 seconds versus ~2.3 seconds.
And this is over 1,129,519 documents, so it is a non-trivial test, IMO.
> measure_mr = Benchmark.measure("count") { results = MessageLog.count_by_type_mr }
> measure = Benchmark.measure("count") { results = MessageLog.count_by_type }
ruby-1.8.7-p334 :010 > puts measure_mr
  0.000000   0.000000   0.000000 ( 20.794720)
> puts measure
  0.020000   0.000000   0.020000 (  2.340708)
> results.map {|k,v| puts "#{k} #{v}"}
Not Set 1
Inpatient Admit 4,493
Unknown Message Type 1,292
Bed Order 6,948
Registration Update 852,189
Registration 123,064
ED Summary 94,933
Cactus Update 10,145
Inpatient Discharge Summary 18,150
ED Release 18,304
Summary
You may get better performance using simpler techniques for simple aggregate commands; Map-Reduce may well shine on more complex computations and queries.
But your best bet is to test it out with meaningful data samples.