Alexey Loshkarev
2011-12-28 16:11:38 UTC
Hello.
I've been using CouchDB for over two years for our company's internal projects.
It's a very good and reliable database, and I'm almost satisfied with it.
But CouchDB's disk usage makes me cry. Let me describe the problem.
My new project must store and manipulate simple documents (15-20
integer/float/string fields, no attachments).
The target document count may vary between 50M and 500M. We now run the
database on SSDs and need to count every gigabyte.
Currently, the project data is stored in MySQL.
I know why the MySQL data is so compact: the data file contains only the
data itself, not types and column names.
But the CouchDB database carries a lot of disk-size overhead.
Some examples:
I have a sample of the data (900K rows). The average row length is 200 bytes,
and the total size on disk is about 190MB.
I imported all of this data into CouchDB and found that it occupies 800MB
(4x more than MySQL). The import was a bulk insert with incrementing keys,
and the database was compacted afterwards.
I tried shortening field names from 8-10 characters to 1-2, with almost no effect.
My data consists of Unicode strings. I noticed that the Erlang external
term format takes 5 bytes for every Unicode character (instead of
1 or more bytes in UTF-8). So I converted my Unicode characters to ASCII
(simply transliterating Cyrillic symbols to ASCII, one Unicode symbol to
one ASCII equivalent).
The result: almost no difference.
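
For reference, here is a quick way to see the per-character cost in the
Erlang shell. This assumes the string ends up as a plain list of codepoints;
the sample word and the Word variable are made up, and the exact byte counts
may vary slightly between OTP releases:

1> Word = [1055, 1088, 1080, 1074, 1077, 1090].  % "Привет" as a list of codepoints
2> byte_size(term_to_binary(Word)).
%% each codepoint above 255 becomes a 5-byte INTEGER_EXT, plus list overhead
3> byte_size(term_to_binary(unicode:characters_to_binary(Word, unicode, utf8))).
%% the same text as a UTF-8 binary: 2 bytes per Cyrillic character plus a small binary header

I don't actually know whether CouchDB keeps JSON strings as codepoint lists
or as UTF-8 binaries internally; if it already uses binaries, that would
explain why the transliteration changed so little.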
Then I tried to calculate the total size of the documents.
I wrote an Erlang view:
fun({Doc}) ->
    Emit(<<"raw">>, size(term_to_binary(Doc))),
    Emit(<<"compressed">>, size(term_to_binary(Doc, [{compressed, 9}])))
end.
According to this, the raw document total is about 725MB, so the 800MB on
disk means roughly 10% overhead for the id/rev index. That's almost
acceptable, but still, it's a lot!
The compressed data takes 435MB. That's much better than 725MB, but still
more than 2x MySQL. I can live with 2x overhead, but 4x makes me cry.
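
The same measurement can be reproduced outside a view, straight in the
Erlang shell, with a made-up document shaped like mine (the field names and
values below are just placeholders):

1> Doc = [{<<"_id">>, <<"000001">>},
          {<<"name">>, <<"item-1">>},
          {<<"price">>, 12.5},
          {<<"qty">>, 3}].
2> byte_size(term_to_binary(Doc)).                      % raw external term format
3> byte_size(term_to_binary(Doc, [{compressed, 9}])).   % zlib-compressed, level 9
%% on a document this small, compression may not help much and can even add a few bytes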
Which serialization format does the CouchDB storage engine use?
If it uses term_to_binary, is it possible to enable data compression, via
the config file or via HTTP headers?
Also, term_to_binary seems to carry a lot of overhead by itself: every
Unicode character is encoded with 4 bytes, while UTF-8 needs only 2 bytes
for Cyrillic characters.
So, the questions are:
1. What can I do now to use less space for my data?
2. Can I add the compression option to term_to_binary (assuming CouchDB uses it)?
3. Is there a way to provide charset information for the data, to make the
Unicode-to-binary conversion more efficient?
4. Is there any progress in CouchDB development toward a less wasteful
storage format?
Also, I just noticed this in the Erlang docs
(http://www.erlang.org/doc/apps/erts/erl_ext_dist.html), quote:
===============
A float is stored in string format. the format used in sprintf to
format the float is "%.20e" (there are more bytes allocated than
necessary)
===============
So every float requires 33 bytes of disk space. Not very efficient.
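
This is easy to check in the shell. The byte counts below are what I'd
expect with the default minor_version 0; newer OTP releases also support
minor_version 1, which encodes floats as 8-byte IEEE values (I haven't
verified this on every release):

1> byte_size(term_to_binary(1.0)).                        % "%.20e" string format, about 33 bytes
2> byte_size(term_to_binary(1.0, [{minor_version, 1}])).  % 8-byte IEEE double, about 10 bytes total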
--
----------------
Best regards
Alexey Loshkarev
mailto:elf2001-***@public.gmane.org