Friday, January 16, 2015

An Introduction to the Aggregation Pipeline



Finding Names

This query will find all actors with a name like "stevens"
db.test.find({"actor.displayName":/stevens/i}).pretty()  
and I get back 40 results, with the full tweet for each result:
 /* 0 */  
 {  
   "_id" : ObjectId("54b860e65756612564fbc584"),  
   "body" : "@TheRevAl Rev Cornell is the \"misguided YOUNG kid\" caught n jihadist social media. T. Rice 12, is \"20yr old\" killed b'cuz of toy pellet gun",  
   "retweetCount" : 0,  
   "generator" : {  
     "link" : "http://twitter.com/download/iphone",  
     "displayName" : "Twitter for iPhone"  
   },  
   "twitter_filter_level" : "low",  
   "gnip" : {  
     "klout_score" : 13,  
     "matching_rules" : [   
       {  
         "tag" : "SIDM_10205",  
         "value" : "\"mobile app\" OR \"mobile application\" OR \"Social Commerce\" OR \"Social Media\" OR \"Social Campaign\" OR \"Digital Optimization\" OR \"Digital Analytics\" OR \"Digital Software\" OR \"site optimization\" OR \"conversation optimization\" OR \"B2B Integration\" OR \"B2B enterprises\" OR \"B2b customer\" OR \"B2b purchase\" OR \"product information management\" OR \"web content management\" OR \"order management\" OR \"Campaign Management\" OR \"multichannel attribute\" OR \"multichannel optimization\" OR \"multi-channel attribute\" OR \"Customer Experience\""  
       }  
     ],  
     "language" : {  
       "value" : "en"  
     }  
   },  
   "favoritesCount" : 0,  
   "object" : {  
     "postedTime" : "2015-01-16T00:33:02.000Z",  
     "summary" : "@TheRevAl Rev Cornell is the \"misguided YOUNG kid\" caught n jihadist social media. T. Rice 12, is \"20yr old\" killed b'cuz of toy pellet gun",  
     "link" : "http://twitter.com/EDSDrum/statuses/555885403102662657",  
     "id" : "object:search.twitter.com,2005:555885403102662657",  
     "objectType" : "note"  
   },  
   "actor" : {  
     "preferredUsername" : "EDSDrum",  
     "displayName" : "Ernest Stevens",  
     "links" : [   
       {  
         "href" : null,  
         "rel" : "me"  
       }  
     ],  
     "twitterTimeZone" : null,  
     "image" : "https://pbs.twimg.com/profile_images/1474488395/ES_Photo__2005___2__normal.jpg",  
     "verified" : false,  
     "location" : {  
       "displayName" : "MD",  
       "objectType" : "place"  
     },  
     "statusesCount" : 350,  
     "summary" : "Financier, Educator, Artist, Business Consultant and political junkie.",  
     "languages" : [   
       "en"  
     ],  
     "utcOffset" : null,  
     "link" : "http://www.twitter.com/EDSDrum",  
     "followersCount" : 1,  
     "favoritesCount" : 1,  
     "friendsCount" : 57,  
     "listedCount" : 1,  
     "postedTime" : "2009-07-01T16:26:25.000Z",  
     "id" : "id:twitter.com:52771595",  
     "objectType" : "person"  
   },  
   "twitter_lang" : "en",  
   "twitter_entities" : {  
     "user_mentions" : [   
       {  
         "id" : 42389136,  
         "indices" : [   
           0,   
           9  
         ],  
         "id_str" : "42389136",  
         "screen_name" : "TheRevAl",  
         "name" : "Reverend Al Sharpton"  
       }  
     ],  
     "symbols" : [],  
     "trends" : [],  
     "hashtags" : [],  
     "urls" : []  
   },  
   "verb" : "post",  
   "link" : "http://twitter.com/EDSDrum/statuses/555885403102662657",  
   "provider" : {  
     "link" : "http://www.twitter.com",  
     "displayName" : "Twitter",  
     "objectType" : "service"  
   },  
   "postedTime" : "2015-01-16T00:33:02.000Z",  
   "id" : "tag:search.twitter.com,2005:555885403102662657",  
   "objectType" : "activity"  
 }  
 ... 40 results total ...  



Projection

That last query returned more data than I needed. I can use projection to solve this.
 db.test.find({"actor.displayName":/Stevens/i}, {"actor.displayName":1, _id:0})  
and now I only get back data of interest to me, rather than the entire tweet:
 /* 0 */  
 {  
   "actor" : {  
     "displayName" : "Ernest Stevens"  
   }  
 }  
   
 /* 1 */  
 {  
   "actor" : {  
     "displayName" : "Shimmel Stevenson"  
   }  
 }  
   
 /* 2 */  
 {  
   "actor" : {  
     "displayName" : "John Stevens"  
   }  
 }  
   
 /* 3 */  
 {  
   "actor" : {  
     "displayName" : "patti stevens"  
   }  
 }  
   
 /* 4 */  
 {  
   "actor" : {  
     "displayName" : "patti stevens"  
   }  
 }  
   
 /* 5 */  
 {  
   "actor" : {  
     "displayName" : "Edward Stevens"  
   }  
 }  
   
 /* 6 */  
 {  
   "actor" : {  
     "displayName" : "patti stevens"  
   }  
 }  
   
 /* 7 */  
 {  
   "actor" : {  
     "displayName" : "Gina Stevenson"  
   }  
 }  
   
 /* 8 */  
 {  
   "actor" : {  
     "displayName" : "patti stevens"  
   }  
 }  
   
 /* 9 */  
 {  
   "actor" : {  
     "displayName" : "patti stevens"  
   }  
 }  
   
 /* 10 */  
 {  
   "actor" : {  
     "displayName" : "patti stevens"  
   }  
 }  
   
 /* 11 */  
 {  
   "actor" : {  
     "displayName" : "Edward Stevens"  
   }  
 }  
   
 /* 12 */  
 {  
   "actor" : {  
     "displayName" : "Stewart Stevenson"  
   }  
 }  
   
 /* 13 */  
 {  
   "actor" : {  
     "displayName" : "Edward Stevens"  
   }  
 }  
   
 /* 14 */  
 {  
   "actor" : {  
     "displayName" : "patti stevens"  
   }  
 }  
   
 /* 15 */  
 {  
   "actor" : {  
     "displayName" : "patti stevens"  
   }  
 }  
   
 /* 16 */  
 {  
   "actor" : {  
     "displayName" : "Ezra Stevens"  
   }  
 }  
   
 /* 17 */  
 {  
   "actor" : {  
     "displayName" : "patti stevens"  
   }  
 }  
   
 /* 18 */  
 {  
   "actor" : {  
     "displayName" : "patti stevens"  
   }  
 }  
   
 /* 19 */  
 {  
   "actor" : {  
     "displayName" : "Edward Stevens"  
   }  
 }  
   
 /* 20 */  
 {  
   "actor" : {  
     "displayName" : "Manny Stevenson"  
   }  
 }  
   
 /* 21 */  
 {  
   "actor" : {  
     "displayName" : "Manny Stevenson"  
   }  
 }  
   
 /* 22 */  
 {  
   "actor" : {  
     "displayName" : "Dennis Stevens"  
   }  
 }  
   
 /* 23 */  
 {  
   "actor" : {  
     "displayName" : "Stewart Stevenson"  
   }  
 }  
   
 /* 24 */  
 {  
   "actor" : {  
     "displayName" : "Ruben Stevens"  
   }  
 }  
   
 /* 25 */  
 {  
   "actor" : {  
     "displayName" : "Simon Stevens"  
   }  
 }  
   
 /* 26 */  
 {  
   "actor" : {  
     "displayName" : "Neil Stevens"  
   }  
 }  
   
 /* 27 */  
 {  
   "actor" : {  
     "displayName" : "DavidBernard-Stevens"  
   }  
 }  
   
 /* 28 */  
 {  
   "actor" : {  
     "displayName" : "Ryan Stevens"  
   }  
 }  
   
 /* 29 */  
 {  
   "actor" : {  
     "displayName" : "julia stevens"  
   }  
 }  
   
 /* 30 */  
 {  
   "actor" : {  
     "displayName" : "julia stevens"  
   }  
 }  
   
 /* 31 */  
 {  
   "actor" : {  
     "displayName" : "Mike Stevens"  
   }  
 }  
   
 /* 32 */  
 {  
   "actor" : {  
     "displayName" : "Neil Stevens"  
   }  
 }  
   
 /* 33 */  
 {  
   "actor" : {  
     "displayName" : "Michael Stevens"  
   }  
 }  
   
 /* 34 */  
 {  
   "actor" : {  
     "displayName" : "Guido Stevens"  
   }  
 }  
   
 /* 35 */  
 {  
   "actor" : {  
     "displayName" : "Stewart Stevenson"  
   }  
 }  
   
 /* 36 */  
 {  
   "actor" : {  
     "displayName" : "Eric Stevens"  
   }  
 }  
   
 /* 37 */  
 {  
   "actor" : {  
     "displayName" : "Mike Stevens"  
   }  
 }  
   
 /* 38 */  
 {  
   "actor" : {  
     "displayName" : "Skeeve Stevens"  
   }  
 }  
   
 /* 39 */  
 {  
   "actor" : {  
     "displayName" : "Amber Stevens"  
   }  
 }  

But now I have a lot of duplicates, and it would be nice to have an aggregation with a count for the number of times each name appears.


All Distinct Names

I can find all the distinct names
 db.test.distinct("actor.displayName")  
and that gives me this output:
 /* 0 */  
 {  
   "0" : "abdul",  
   "1" : "Ayunda Dewi",  
   "2" : "CheckAContract",  
   "3" : "Lisa Peyton",  
   "4" : "NJT Advisory",  
   "5" : "E3M Filters",  
   "6" : "Ernest Stevens",  
   "7" : "Social Media Time",  
   "8" : "GoodmanHailey",  
   "9" : "ERP-VIEW.PL",  
   "10" : "12Stocks.com Tech",  
   "11" : "Technology News",  
   "12" : "resign .",  
   "13" : "Kenji Hiranabe",  
   "14" : "LittleBabyTinySteve",  
   "15" : "Bryan K. Robinson",  
   "16" : "christopherlortie",  
   "17" : "ErinFlannagan",  
   "18" : "Liz Barnett",  
   "19" : "Agile Retweets 2.2k",  
   "20" : "Ferry Irawan",  
   <snip ... 71,499 records in this dataset ...>  
Distinct returns an array (above). Arrays are not compatible with the aggregation pipeline[1].


By Name

This demonstrates a basic use of the aggregation pipeline.

This will count the number of tweets that contain the name "stevens" in some variation:
 db.test.aggregate(  
   [  
     { $match: { "actor.displayName": /stevens/i } },  
     { $group: { _id: null, count: { $sum: 1 } } }  
   ]  
 )  
and gives me this result:
 /* 0 */  
 {  
   "result" : [   
     {  
       "_id" : null,  
       "count" : 40  
     }  
   ],  
   "ok" : 1  
 }  
But what I'm really looking for is a count for each distinct variation of a name that contains ": stevens".  I'm missing some key elements in my query above.


The Aggregation Pipeline

This query appears to do what I want:
 db.test.aggregate([  
   {$match: { "actor.displayName": /stevens/i } },  
   {$group: { _id: { actor: "$actor.displayName" }, numberOfTimes: { $sum: 1 }}},   
   {$sort:{numberOfTimes:-1}}  
 ])  
and the output is 21 results:
 /* 0 */  
 {  
   "result" : [   
     {  
       "_id" : {  
         "actor" : "patti stevens"  
       },  
       "numberOfTimes" : 10  
     },   
     {  
       "_id" : {  
         "actor" : "Edward Stevens"  
       },  
       "numberOfTimes" : 4  
     },   
     {  
       "_id" : {  
         "actor" : "Stewart Stevenson"  
       },  
       "numberOfTimes" : 3  
     },   
     {  
       "_id" : {  
         "actor" : "julia stevens"  
       },  
       "numberOfTimes" : 2  
     },   
     {  
       "_id" : {  
         "actor" : "Neil Stevens"  
       },  
       "numberOfTimes" : 2  
     },   
     {  
       "_id" : {  
         "actor" : "Manny Stevenson"  
       },  
       "numberOfTimes" : 2  
     },   
     {  
       "_id" : {  
         "actor" : "Mike Stevens"  
       },  
       "numberOfTimes" : 2  
     },   
     {  
       "_id" : {  
         "actor" : "Skeeve Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Eric Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Michael Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Ryan Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "DavidBernard-Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Amber Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Simon Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Ezra Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Dennis Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Ruben Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Ernest Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Gina Stevenson"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Guido Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "John Stevens"  
       },  
       "numberOfTimes" : 1  
     },   
     {  
       "_id" : {  
         "actor" : "Shimmel Stevenson"  
       },  
       "numberOfTimes" : 1  
     }  
   ],  
   "ok" : 1  
 }  
sorted in ascending order, with a count of the number of times it appears.


Matching all Names

And if I withdraw the match condition, and just use this query:
 db.test.aggregate([  
   {$group: { _id: { actor: "$actor.displayName" }, numberOfTimes: { $sum: 1 }}},   
   {$sort:{numberOfTimes:-1}}  
 ])  
The result is a sorted list (like the above):
 /* 0 */  
 {  
   "result" : [   
     {  
       "_id" : {  
         "actor" : "BYOD News"  
       },  
       "numberOfTimes" : 529  
     },   
     {  
       "_id" : {  
         "actor" : "free movies"  
       },  
       "numberOfTimes" : 521  
     },   
     {  
       "_id" : {  
         "actor" : "mclovemckee"  
       },  
       "numberOfTimes" : 480  
     },   
     {  
       "_id" : {  
         "actor" : "Cell Phone Deals"  
       },  
       "numberOfTimes" : 417  
     },   
     {  
       "_id" : {  
         "actor" : "Jose Fornelino Muñiz"  
       },  
       "numberOfTimes" : 376  
     },   
     {  
       "_id" : {  
         "actor" : "mckeeliurvg"  
       },  
       "numberOfTimes" : 328  
     },   
     {  
       "_id" : {  
         "actor" : "NoSQL"  
       },  
       "numberOfTimes" : 325  
     },   
     {  
       "_id" : {  
         "actor" : "Capitalsecure"  
       },  
       "numberOfTimes" : 302  
     },   
     {  
       "_id" : {  
         "actor" : "Antonio Trento?"  
       },  
       "numberOfTimes" : 257  
     },   
     {  
       "_id" : {  
         "actor" : "Top Sinhala Blog"  
       },  
       "numberOfTimes" : 252  
     },   
     {  
       "_id" : {  
         "actor" : "?FollowerSale.com"  
       },  
       "numberOfTimes" : 221  
     },   
     {  
       "_id" : {  
         "actor" : "Jeff Ho"  
       },  
       "numberOfTimes" : 209  
     },   
     {  
       "_id" : {  
         "actor" : "Cloud Server Hosts"  
       },  
       "numberOfTimes" : 206  
     },   
     ... 71,449 results total ...
for every name; not just "stevens".


No comments:

Post a Comment