Discussion:
couchdb disk storage format - why so large overhead?
Alexey Loshkarev
2011-12-28 16:11:38 UTC
Permalink
Hello.

I've been using CouchDB for two-plus years for our company's internal projects.
It's a very good and reliable database, and I'm mostly satisfied with it.

But CouchDB's disk usage makes me cry. Let me describe the problem.

My new project must store and manipulate simple documents (15-20
integer/float/string fields, no attachments).
The target document count may vary between 50M and 500M. We are now using an
SSD for the database and need to count every gigabyte.
Currently, the project data is stored in MySQL.
I know why the MySQL data is so compact: the data file contains only the data
itself, not types and column names.
But the CouchDB database carries a lot of disk-size overhead.

Some examples:

I have a snippet of the data (900K rows). The average row length is 200 bytes,
and the total data size on disk is about 190MB.

I imported all of this data into CouchDB and found that it occupies 800MB
(4x more than MySQL). It was a bulk insert with incrementing keys, and the
database was compacted after the import.
I tried reducing field names from 8-10 characters to 1-2, with almost no result.
My data consists of Unicode strings. I realized that the Erlang external
term format takes 5 bytes for every Unicode character (instead of
1 or more for UTF-8), so I converted my Unicode characters to ASCII (just
transliterating Cyrillic symbols, one Unicode symbol to its ASCII
equivalent).
The result: almost no change.

Then I tried to calculate the sum of the document sizes.
I wrote an Erlang view:

fun({Doc}) ->
    %% size of the document serialized with the default external term format
    Emit(<<"raw">>, size(term_to_binary(Doc))),
    %% size with zlib compression applied inside term_to_binary
    Emit(<<"compressed">>, size(term_to_binary(Doc, [{compressed, 9}])))
end.
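
The same two measurements can be reproduced on a single document in an Erlang
shell; a minimal sketch (the sample document below is made up, written in the
proplist form the native Erlang view server passes to map functions):

%% hypothetical sample document in CouchDB's internal proplist form
Doc = [{<<"_id">>, <<"doc-000001">>}, {<<"name">>, <<"example">>}, {<<"price">>, 12.5}].
%% uncompressed external term format
byte_size(term_to_binary(Doc)).
%% zlib-compressed, as measured by the "compressed" key in the view above
byte_size(term_to_binary(Doc, [{compressed, 9}])).

For text-heavy documents the second number should come out noticeably smaller.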

According to this view,
the raw document sum is about 725MB, so the 800MB file is roughly 10% overhead
for the id/rev indexes. That is almost OK, but still a lot!
The compressed data takes 435MB. That is much better than 725MB, but still
2x more than MySQL. I can live with 2x overhead, but 4x makes me cry.

Which serialization format does the CouchDB storage engine use?
If it uses term_to_binary, is it possible to enable data compression,
via the config file or HTTP headers?

Also, term_to_binary seems to carry a lot of overhead by itself: any Unicode
character is encoded with 4 bytes, while UTF-8 needs only 2 bytes for
Cyrillic characters.

So, the questions are:

1. What can I do now to use less space for my data?
2. Can I add the compression option to term_to_binary (assuming CouchDB uses it)?
3. Is there a way to provide charset information for the data, to make the
Unicode-to-binary conversion more efficient?
4. Is there any progress in CouchDB development towards a less wasteful
storage format?


Also, I just noticed this in the external term format documentation
(http://www.erlang.org/doc/apps/erts/erl_ext_dist.html), quote:
===============
A float is stored in string format. The format used in sprintf to
format the float is "%.20e" (there are more bytes allocated than
necessary)
===============
So every float requires 33 bytes of disk space. Not very efficient.
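
This is easy to confirm in an Erlang shell. With minor_version 0 (the default
on the OTP releases of that time) a single float serializes to 33 bytes,
counting the 1-byte version tag that term_to_binary prepends:

1> byte_size(term_to_binary(1.0, [{minor_version, 0}])).
33
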
--
----------------
Best regards
Alexey Loshkarev
mailto:elf2001-***@public.gmane.org
Randall Leeds
2011-12-28 18:08:08 UTC
Permalink
Post by Alexey Loshkarev
Which serialization format does the CouchDB storage engine use?
If it uses term_to_binary, is it possible to enable data compression,
via the config file or HTTP headers?
Future releases of CouchDB, starting with the 1.2 release, will allow
for compression using google's snappy library which should greatly
reduce the overhead you experience. Also be sure to compact if the
ratio of disk usage to dataset size starts to grow too far. An
automatic compaction daemon is also coming.
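
(For reference, 1.2 exposes this as a local.ini setting; the sketch below uses
the option name as it appears in the 1.2 configuration, so double-check it
against the release notes:)

[couchdb]
; snappy is the 1.2 default; deflate_1 .. deflate_9 and none are also accepted
file_compression = snappy
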

-R
Alexey Loshkarev
2011-12-28 18:17:51 UTC
Permalink
Post by Randall Leeds
Future releases of CouchDB, starting with the 1.2 release, will allow
for compression using google's snappy library which should greatly
reduce the overhead you experience.
Cool!
Post by Randall Leeds
Also be sure to compact if the
ratio of disk usage to dataset size starts to grow too far. An
automatic compaction daemon is also coming.
Will wait for it!
--
----------------
Best regards
Alexey Loshkarev
mailto:elf2001-***@public.gmane.org
Rogutės Sparnuotos
2011-12-28 18:59:42 UTC
Permalink
Post by Alexey Loshkarev
Post by Randall Leeds
Future releases of CouchDB, starting with the 1.2 release, will allow
for compression using google's snappy library which should greatly
reduce the overhead you experience.
Cool!
Post by Randall Leeds
Also be sure to compact if the
ratio of disk usage to dataset size starts to grow too far. An
automatic compaction daemon is also coming.
Will wait for it!
But you were already compacting after every step in your testing, weren't
you?

And, actually, these 2 features are already available in git, so it's
"just" a matter of compiling and testing:
http://wiki.apache.org/couchdb/Running%20CouchDB%20in%20Dev%20Mode
If you try it out, please share your findings.
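
For anyone who wants to try it, a rough sketch of the dev-mode steps from a git
checkout; the exact targets are an assumption from memory, so follow the wiki
page above if they differ:

# assumes the build deps (Erlang/OTP, SpiderMonkey, ICU, libcurl) are installed
./bootstrap        # git checkouts only: generates ./configure
./configure
make dev           # prepares a development layout inside the source tree
./utils/run        # starts CouchDB without installing it system-wide
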
--
-- Rogutės Sparnuotos
Alexey Loshkarev
2011-12-28 19:40:26 UTC
Permalink
Ok, I'll try it tomorrow.
Post by Rogutės Sparnuotos
But you were already compacting after every step in your testing, weren't
you?
And, actually, these 2 features are already available in git, so it's
"just" a matter of compiling and testing:
http://wiki.apache.org/couchdb/Running%20CouchDB%20in%20Dev%20Mode
If you try it out, please share your findings.
--
--  Rogutės Sparnuotos
--
----------------
Best regards
Alexey Loshkarev
mailto:elf2001-***@public.gmane.org
Alexey Loshkarev
2011-12-28 18:16:08 UTC
Permalink
Post by Alexey Loshkarev
Also, I just noticed this in the external term format documentation
(http://www.erlang.org/doc/apps/erts/erl_ext_dist.html), quote:
===============
A float is stored in string format. The format used in sprintf to
format the float is "%.20e" (there are more bytes allocated than
necessary)
===============
So every float requires 33 bytes of disk space. Not very efficient.
Reading the spec, I realized that passing minor_version = 1 in the
term_to_binary options makes floats 9 bytes long instead of 33.
It's just a search-and-replace in a few files.
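
A quick shell check of the new per-float cost (the 10th byte is the version tag
that term_to_binary prepends to the 9-byte NEW_FLOAT_EXT encoding):

1> byte_size(term_to_binary(1.0, [{minor_version, 1}])).
10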

I tested my data blob with this map function:

fun({Doc}) ->
    %% default external term format (floats as 31-byte strings)
    Emit(<<"raw">>, size(term_to_binary(Doc))),
    %% minor_version 1: floats as 8-byte IEEE doubles (NEW_FLOAT_EXT)
    Emit(<<"raw_1">>, size(term_to_binary(Doc, [{minor_version, 1}])))
end.

And I got about a 10% decrease in space usage (~600MB instead of ~700MB).

Do I need to file a bug in Jira for it?
--
----------------
Best regards
Alexey Loshkarev
mailto:elf2001-***@public.gmane.org