Discussion:
how to count the number of unique values
Anand Chitipothu
2010-10-12 13:09:28 UTC
Permalink
Is it possible to count the number of unique values by writing a couchdb view?

A typical example is to find the number of tags in a blog application.

Consider the following 2 docs.

{
"_id": "posts/1",
"title": "post 1",
"tags": ["foo", "bar"]
}

{
"_id": "posts/2",
"title": "post 2",
"tags": ["foo", "foobar"]
}

Is it possible to find that there are 3 tags?

Anand
Michael Zedeler
2010-10-12 13:18:34 UTC
Permalink
Post by Anand Chitipothu
Is it possible to count the number of unique values by writing a couchdb view?
Consider the following 2 docs.
{
"_id": "posts/1",
"title": "post 1",
"tags": ["foo", "bar"]
}
{
"_id": "posts/2",
"title": "post 2",
"tags": ["foo", "foobar"]
}
Is it possible to find that there are 3 tags?
Yes. Just write a map function that emits all tags it finds (not checked
and probably wrong):

function(doc) {
for(tag in doc.tags) {
emit(tag, null);
}
}

In the reduce-function, just use _count
(http://wiki.apache.org/couchdb/Built-In_Reduce_Functions).

Regards,

Michael.
Anand Chitipothu
2010-10-12 15:00:53 UTC
Permalink
Post by Anand Chitipothu
Is it possible to count the number of unique values by writing a couchdb view?
Consider the following 2 docs.
{
    "_id": "posts/1",
    "title": "post 1",
    "tags": ["foo", "bar"]
}
{
    "_id": "posts/2",
    "title": "post 2",
    "tags": ["foo", "foobar"]
}
Is it possible to find that there are 3 tags?
Yes. Just write a map function that emits all tags it finds (not checked and
function(doc) {
   for(tag in doc.tags) {
       emit(tag, null);
   }
}
In the reduce-function, just use _count
(http://wiki.apache.org/couchdb/Built-In_Reduce_Functions).
That gives counts of each tag, not the total number of unique tags.

The above reduce function with group_level=1 will give:

{"rows":[
{"key":"bar","value":1},
{"key":"foo","value":2},
{"key":"foobar","value":1}
]}

And without any group_level it will return 4, which is the total
number of tags occurrences.

{"rows":[
{"key":"null","value":4}
]}

Is there any way to find the number of *unique* tags?

Anand
Randall Leeds
2010-10-12 19:42:57 UTC
Permalink
Something like the following, maybe?

function(doc) {
//as before, emit every tag
}

function(keys, values, rereduce) {
if(!rereduce) {
//return the following, in an object or list to unpack:
// 1) the first and last tag in keys (or the same tag twice, if
all are equal or length == 1)
// 2) a count of unique tags (not including and not equal to the
first and/or last)
// This assumes keys is sorted (which I think is safe), but if not
you can always sort them yourself
} else {
var total = //sum of all inner unique counts from !rereduce step
var merged = //merge all the [first, last] lists
if(merged.length != 2) {
//Run the !rereduce clause (maybe put this in a function) on merged
//Add the resulting inner count to total, set merged = [first,
last] result.
}
//Return the new inner total and new [first, last]
}
}

I'm pretty sure I've messed things up horribly somewhere in there, but
the basic idea feels right. Explore it.
A word of caution though: this algorithm assumes that there is no
overlap in key range for each independent initial reduction. I believe
this is a safe assumption when running on a single CouchDB node.
However, this will not be the case if something like lounge or
bigcouch is rereducing results from many shards with overlapping tag
ranges.

If anyone can think of a better, more general algorithm that avoids
building a huge dictionary in the reduce phase, I think it'd be
interesting to see, and probably useful to more people than just
Anand. :)

-Randall
Post by Anand Chitipothu
Post by Anand Chitipothu
Is it possible to count the number of unique values by writing a couchdb view?
Consider the following 2 docs.
{
    "_id": "posts/1",
    "title": "post 1",
    "tags": ["foo", "bar"]
}
{
    "_id": "posts/2",
    "title": "post 2",
    "tags": ["foo", "foobar"]
}
Is it possible to find that there are 3 tags?
Yes. Just write a map function that emits all tags it finds (not checked and
function(doc) {
   for(tag in doc.tags) {
       emit(tag, null);
   }
}
In the reduce-function, just use _count
(http://wiki.apache.org/couchdb/Built-In_Reduce_Functions).
That gives counts of each tag, not the total number of unique tags.
{"rows":[
{"key":"bar","value":1},
{"key":"foo","value":2},
{"key":"foobar","value":1}
]}
And without any group_level it will return 4, which is the total
number of tags occurrences.
{"rows":[
{"key":"null","value":4}
]}
Is there any way to find the number of *unique* tags?
Anand
Paul Davis
2010-10-12 19:47:09 UTC
Permalink
   // This assumes keys is sorted (which I think is safe), but if not
you can always sort them yourself
This algorithm is what I was thinking of. Its basically the top N tags
example but just counting uniques. There is a gotcha in that you need
to sort the keys yourself because of a weird reduce thing I never
tracked down completely. IIRC, it was something about how the final
reduce runs that requires this.

Otherwise, what Randall said.
Anand Chitipothu
2010-10-16 15:00:22 UTC
Permalink
Post by Paul Davis
   // This assumes keys is sorted (which I think is safe), but if not
you can always sort them yourself
This algorithm is what I was thinking of. Its basically the top N tags
example but just counting uniques. There is a gotcha in that you need
to sort the keys yourself because of a weird reduce thing I never
tracked down completely. IIRC, it was something about how the final
reduce runs that requires this.
Otherwise, what Randall said.
It looks too complicated for my needs. I think I'm going to change my
data model to avoid this problem.

Thanks for your suggestions.

Anand
Wout Mertens
2010-10-15 12:33:53 UTC
Permalink
Just wanted to add that if you have a map function that emits (tag, 1) for each tag and then a reduce function that's just _count, you will have everything you need for painting a tag cloud.

The view with group=true will list all tags exactly once, with their count. CouchDB doesn't tell you how many rows are in the result so you'll have to count them yourself.

So you load that entire view in memory and you can draw the tags with their relative sizes.

Wout.
Post by Anand Chitipothu
Is it possible to count the number of unique values by writing a couchdb view?
A typical example is to find the number of tags in a blog application.
Consider the following 2 docs.
{
"_id": "posts/1",
"title": "post 1",
"tags": ["foo", "bar"]
}
{
"_id": "posts/2",
"title": "post 2",
"tags": ["foo", "foobar"]
}
Is it possible to find that there are 3 tags?
Anand
Anand Chitipothu
2010-10-16 15:04:42 UTC
Permalink
Post by Wout Mertens
Just wanted to add that if you have a map function that emits (tag, 1) for each tag and then a reduce function that's just _count, you will have everything you need for painting a tag cloud.
The view with group=true will list all tags exactly once, with their count. CouchDB doesn't tell you how many rows are in the result so you'll have to count them yourself.
So you load that entire view in memory and you can draw the tags with their relative sizes.
Wout.
The example I gave is a rather simplified example. I'm working a data
containing 25M+ docs with books, works and subjects. I need to find
the list/count of works for each subject. I don't think it is
practical to load the view into memory to compute the required result.

Anand
Eli Stevens (Gmail)
2010-10-16 17:57:17 UTC
Permalink
Assuming you have the 'works' docs contain a type and a list of
subject IDs (code untested, sorry):

map = function(doc) {
if (doc.type == 'work') {
for (i in doc.subject_ids) {
emit(doc.subject_ids[i], [doc._id]); // returning a list of a
single doc._id makes it so that the reduce function is simpler; it's
not required though.
}
}
}

reduce = function (key, values, rereduce) {
var combinedList = [];
for (i in values) {
combinedList[combinedList.length] = values[i];
}
return combinedList;
}

This produces a view with rows like:

{key: 'subj_id1', value: ['work_id1', 'work_id2', ...]},
{key: 'subj_id2', value: ['work_id2', 'work_id3', ...]},
{key: 'subj_id3', value: ['work_id1', 'work_id4', ...]},

Does that help?

Eli
Post by Anand Chitipothu
Post by Wout Mertens
Just wanted to add that if you have a map function that emits (tag, 1) for each tag and then a reduce function that's just _count, you will have everything you need for painting a tag cloud.
The view with group=true will list all tags exactly once, with their count. CouchDB doesn't tell you how many rows are in the result so you'll have to count them yourself.
So you load that entire view in memory and you can draw the tags with their relative sizes.
Wout.
The example I gave is a rather simplified example. I'm working a data
containing 25M+ docs with books, works and subjects. I need to find
the list/count of works for each subject. I don't think it is
practical to load the view into memory to compute the required result.
Anand
Anand Chitipothu
2010-10-17 02:41:22 UTC
Permalink
Post by Eli Stevens (Gmail)
Assuming you have the 'works' docs contain a type and a list of
map = function(doc) {
 if (doc.type == 'work') {
   for (i in doc.subject_ids) {
       emit(doc.subject_ids[i], [doc._id]); // returning a list of a
single doc._id makes it so that the reduce function is simpler; it's
not required though.
   }
 }
}
reduce = function (key, values, rereduce) {
 var combinedList = [];
 for (i in values) {
   combinedList[combinedList.length] = values[i];
 }
 return combinedList;
}
{key: 'subj_id1', value: ['work_id1', 'work_id2', ...]},
{key: 'subj_id2', value: ['work_id2', 'work_id3', ...]},
{key: 'subj_id3', value: ['work_id1', 'work_id4', ...]},
Does that help?
I thought about this, but in my case the list of can be very big. I've
seen subjects with 500K works. I don't think it is a good idea to let
reduce build such a big list.

Anand
couchdb user
2010-10-17 04:00:56 UTC
Permalink
Post by Anand Chitipothu
I thought about this, but in my case the list of can be very big. I've
seen subjects with 500K works. I don't think it is a good idea to let
reduce build such a big list.
Anand
I'm not sure how often your data will change but how about storing the
list of unique works, etc on a separate database, and then have a map
/ reduce what will give you the number of unique entries based on that
"intermediate" database? Something like a chain map reduce is what I'm
thinking about.

Of course you will have to maintain that other database in sync, and
using the _changes feed could help here.
Let me know if you would need some example of what I'm trying to say.

Regards,
Wout Mertens
2010-10-17 15:40:03 UTC
Permalink
Post by Anand Chitipothu
Post by Wout Mertens
Just wanted to add that if you have a map function that emits (tag, 1) for each tag and then a reduce function that's just _count, you will have everything you need for painting a tag cloud.
The view with group=true will list all tags exactly once, with their count. CouchDB doesn't tell you how many rows are in the result so you'll have to count them yourself.
So you load that entire view in memory and you can draw the tags with their relative sizes.
Wout.
The example I gave is a rather simplified example. I'm working a data
containing 25M+ docs with books, works and subjects. I need to find
the list/count of works for each subject. I don't think it is
practical to load the view into memory to compute the required result.
Anand
Right, in that case you should use either the method that was posted earlier by Randall Leeds, in which the reduce function counts the unique keys given to it (and keeps track of the first and last keys to avoid double counting), or use the multi-view code from http://github.com/normanb/couchdb

Wout.

Loading...