Compression and multithreading?


merlingt1
05-13-2007, 03:18 AM
Hi all,

I have just recently made the move to 24/7 Linux (Ubuntu 7.04), so I am a real newbie; please bear with me. I was wondering if there are any compression/archive utilities out there that are capable of multithreading to make use of dual-core CPUs? So far I have noticed that neither rar/unrar nor the GUI package manager can do it. Is it the software itself or am I just missing some variable? I would like to find something that can do this because I need to compress about 1.5 GB of data on a daily basis.

I have installed i686-smp and I have other applications which DO utilize my two cores, so I know it is set up to work at least with some software.

Thanks!

bwkaz
05-13-2007, 04:18 PM
I doubt the compression algorithms can be parallelized -- the algorithms themselves are probably single-CPU-only. Lots of problems are this way -- you can't just split them up. CAD programs, for instance: their core functions must be single-threaded, because there isn't any way to split them up.

Now I don't know whether this is true for sure of compression algorithms, but it wouldn't surprise me at all. :)

merlingt1
05-13-2007, 10:43 PM
Well, I was asking because I know WinRAR (from RARLAB) is actually multithreaded on Windows, and it is a beautiful thing to watch, so I figured someone must have done something similar for Linux.

webwolf
05-14-2007, 02:01 AM
The WinRAR GUI may be multithreaded on Windows, but I agree with bwkaz. I really don't see how the compression algorithm could be split. It may be possible to write a small multithreaded program that creates a SET of archives, say one per thread, but I doubt that a single compression can be split at all.

bwkaz
05-14-2007, 06:33 PM
Actually, if the rar file format is like the zip file format, then it may be possible. But this won't help .tar.gz or .tar.bz2.

The zip format treats each file in the archive individually (and this is the source of many of winzip's recent security-related issues: it's possible to encrypt only certain files in the archive, and to replace encrypted files with unencrypted ones without knowing the password). So if you can reserve enough space for each file (meaning you'd have to know beforehand how well it was going to compress), then in theory you could compress each file in parallel. But the part about knowing how much space it was going to take before you start becomes a problem.

This also won't help .tar.*-format archives, because those compress the entire tar-file (because you get slightly better compression that way). So there's only one compression operation.

webwolf
05-15-2007, 01:30 AM
Thanks for that info, bwkaz; one learns something new every day. ;-)

dkrnic
05-19-2007, 06:17 PM
There is some confusion here I would like to clarify,
especially the one that parallelization of gzip and bzip2
by compressing separate parts of an input stream
in parallel can't be advantageous for *.tar etc.

Though parallelizing the compression algorithms might be
a tall order, there are ways to scatter the compression
effort across available CPUs without significantly
degrading the quality, i.e. the compression ratio.

First of all, although gzip will compress the entire file
or stream that comes its way in a single go, it will
nevertheless inflate a concatenated series of separately
compressed chunks without complaint. So you can just cut
the input stream into chunks and let them be deflated by
as many CPUs as you can recruit. You only need to worry
about how to gather them in proper order in the output
file.
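The concatenation property described above can be checked with a minimal Python sketch. The helper name `compress_in_chunks` and the 1 MiB default are illustrative assumptions, not something from the post, and the standard `gzip` module stands in for the command-line tool.

```python
import gzip


def compress_in_chunks(data: bytes, chunk_size: int = 1 << 20) -> bytes:
    """Compress each chunk as its own gzip member and concatenate the members."""
    members = [
        gzip.compress(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]
    return b"".join(members)


if __name__ == "__main__":
    original = b"some repetitive payload\n" * 200_000
    packed = compress_in_chunks(original)
    # gzip.decompress reads every member in turn, much as gunzip does
    # with a file made of concatenated .gz pieces.
    assert gzip.decompress(packed) == original
    print(len(original), "->", len(packed), "bytes")
```

Each call to `gzip.compress` here is independent of the others, which is what makes it possible to hand the chunks to different CPUs.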

Second, the bzip2, as far as I understand the algorithm,
always deflates the input stream in discrete chunks whose
size is the compression level option times about 100,000.
If you execute "bzip2 -9" the utility will cut the input
stream into chunks of 899,981 bytes. That's why if
something goes wrong and a part of the compressed file
gets corrupted, you can still get at all of the
uncorrupted parts of the file, which cannot be said about
single-block gzip.

I've been using these properties of gzip and bzip2 for a
long time in order to be better able to estimate how much
of my daily incremental backups I can fit on a tape
without overflow. The problem is that most streamers
nowadays have Lempel-Ziv in silicon making it difficult
to estimate how much of uncompressed data will fit on a
tape. However, if you compress the data first and shut
the hardware compression off while writing to the tape,
you may know pretty much exactly how much compressed data
will fit on a tape. In my case an LTO-3 tape will accept
just a tad over 407,000,000,000 bytes if you don't let
the writing speed fall below 80 MB/s.

A bit off topic but let me get it off my chest anyway:
After some initial tests I have come to the conclusion
that in most cases the improved compression ratio of
bzip2 is marginal but carries a severe speed
penalty. At level 9 it takes on average 3-5 times as much
time to compress a bunch of data about 10-15 percent
better than gzip. There is a special kind of data where
bzip2 excels and that's long lists of uniformly formatted
columns, e.g. log files. For example, a full_audit log of
my Samba server for a period of 6 hours is 795,607,535
bytes. Compressed with gzip it's only 10,000,820 bytes,
but with bzip2 it's even much less - 3,207,483 bytes.
Still I fail to see the advantage at that level of
compression. Sure it's 3 times as efficient in THIS case,
but it also happens to take enormously more time to
accomplish the feat - 1,245 seconds vs. 13 seconds,
in other words a hundred times longer for a 3-fold
improvement whereas the net effect is I've shaved another
6 MB in 20 more minutes after shaving off 786 MB in the first
13 seconds. If you desperately need to save space and
CPU power is not an issue, then bzip2 is a no-brainer,
especially if you have a nice little scheduler to
orchestrate the effort across your farm of
liquid-helium-cooled Opterons.

In every other case you may be quite satisfied by slicing
the input stream into 1 MB chunks and calling deflate from
the libz.so as I do for two reasons: a) I don't like the
possibility that a bit in my data gets hit by a photino
and flips its truthiness anywhere in the stream thus
invalidating the rest of the data, and b) I can poke the
compressed size of the block into an unused header field,
which comes in handy if for some reason you want to
quickly access a particular part of the data and you know
its whereabouts in the raw data.
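A rough Python sketch of point b), under stated assumptions: each 1 MiB chunk becomes its own gzip member, and the member's total compressed length is written over bytes 4-7 of its header. RFC 1952 defines those bytes as the little-endian MTIME field, so this overwrites the timestamp. The function names are hypothetical and this is an illustration of the idea, not the poster's actual code.

```python
import gzip
import struct


def member_with_size(chunk: bytes) -> bytes:
    """Compress one chunk and store the member's length in header bytes 4-7."""
    member = bytearray(gzip.compress(chunk))
    # Bytes 4-7 are the MTIME field in RFC 1952; decompressors do not need it.
    struct.pack_into("<I", member, 4, len(member))
    return bytes(member)


def member_offset(archive: bytes, index: int) -> int:
    """Hop from header to header without decompressing anything."""
    offset = 0
    for _ in range(index):
        (size,) = struct.unpack_from("<I", archive, offset + 4)
        offset += size
    return offset


def read_member(archive: bytes, index: int) -> bytes:
    """Decompress only the index-th member of the archive."""
    offset = member_offset(archive, index)
    (size,) = struct.unpack_from("<I", archive, offset + 4)
    return gzip.decompress(archive[offset:offset + size])


if __name__ == "__main__":
    chunks = [bytes([65 + i]) * (1 << 20) for i in range(4)]
    archive = b"".join(member_with_size(c) for c in chunks)
    assert read_member(archive, 2) == chunks[2]
    # The whole file still inflates as an ordinary multi-member gzip stream.
    assert gzip.decompress(archive) == b"".join(chunks)
```

The trade-off is that any tool which does interpret MTIME will report a meaningless modification time for these members.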

So what do I do if I need to compress big time and don't
have much time to lose? Lacking a proper scheduler I
just divide the bunch into as many equal parts as there
are processors and let them gnaw at it from different
offsets until they're done, then splice it back in proper
order. It sounds ludicrous but that's what I do each
month when I compress the full backup. It's 3.5 TB raw
and I only have 16 Opterons free over the weekend. Each
gets about 220 GB and the whole task is done in about
a full day or so. Compressing is not the problem; it's
the splicing, which just can't be parallelized. You
must cat part 2 on top of part 1, and then part 3 etc.,
and I only have Gigabit connections between the
computers.

A much better way would be to write a little utility
which reads the chunks from input, assigns them to
available free CPUs, captures compressed chunks and
rethreads them in their natural order into the output
file.
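A minimal sketch of such a utility in Python, assuming a thread pool on one machine rather than separate hosts. The names `read_chunks` and `parallel_gzip` are made up for illustration. As far as I know, CPython's zlib bindings release the interpreter lock while compressing, which is what would let the threads use several cores.

```python
import gzip
import io
from concurrent.futures import ThreadPoolExecutor


def read_chunks(stream, chunk_size):
    """Yield successive chunk_size-byte pieces of a binary stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk


def parallel_gzip(src, dst, chunk_size=1 << 20, workers=4):
    """Compress src into dst as a series of gzip members, using a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() farms chunks out to the workers but yields results in input
        # order, so the members land in the output in their natural order.
        for member in pool.map(gzip.compress, read_chunks(src, chunk_size)):
            dst.write(member)


if __name__ == "__main__":
    data = b"log line with uniform columns\n" * 300_000
    out = io.BytesIO()
    parallel_gzip(io.BytesIO(data), out)
    assert gzip.decompress(out.getvalue()) == data
```

One caveat of this sketch: `Executor.map` pulls in all chunks up front, so a real tool for multi-terabyte input would need to limit how many chunks are in flight at once.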

I was hoping I'd find something of the sort when I
googled "parallelized compression" and stumbled upon
this forum. Is anyone listening? Oh, well.

dkrnic
05-20-2007, 10:30 AM
My rant above kept me from finding the answer to my
google query and the answer to the question posed by
the thread creator.

Both gzip and bzip2 are also available in a parallelized
version that can take advantage of extra CPUs or
CPU cores.

Mark Adler, the maintainer of the zlib library, has GPLed a
parallel implementation of gzip which he calls pigz.
You won't find a mention of it on the zlib.net web page
but you can download http://zlib.net/pigz17.c.gz.

Jeff Gilchrist has BSDed his parallel implementation of
bzip2 called pbzip2, available for download on this site:
http://compression.ca/pbzip2/.

They both seem to be recent developments, this year.
They both slice the input stream and let different
processors/cores compress the chunks which they
then splice into a respective compressed file.

Both claim nearly linear increase in efficiency with the
rising number of processors/cores when properly configured.

bwkaz
05-20-2007, 02:14 PM
(Wow, this is hard to read when it's so narrow... let me fix it in the quote here.)

Quote: There is some confusion here I would like to clarify, especially the one that parallelization of gzip and bzip2 by compressing separate parts of an input stream in parallel can't be advantageous for *.tar etc.

That's not what I said. :)

I said that you get worse compression when you compress files before concatenating them. You get better compression when you concatenate them first. This is because a single stream of compression can only remove redundant information (which is what compression does) when it can recognize that information as redundant. And there are more possibilities for recognition of redundancy when you run the compression algorithm on a larger chunk of input.

Quote: there are ways to scatter the compression effort across available CPUs without significantly degrading the quality, i.e. the compression ratio.

"Without significantly degrading", perhaps. But there is some degradation; there must be.

You're talking about compression performance being "not that much worse"; I'm trying to make it "no worse at all".

(If the user wants to split the stream on their own, let them run multiple gzip processes. But don't do it for them, because some of them are expecting better compression than that. OTOH, it sounds like you're a gzip user, based on what you said below, so I'm not that surprised that you're making that kind of tradeoff -- but be aware that you only have the option to decide because the gzip algorithm doesn't force you to do the splitting. :))

Quote: First of all, although gzip will compress the entire file or stream that comes its way in a single go, it will nevertheless inflate a concatenated series of separately compressed chunks without complaint. So you can just cut the input stream into chunks and let them be deflated by as many CPUs as you can recruit. You only need to worry about how to gather them in proper order in the output file.

True, but see above. You get better compression when you don't split them. So don't force the user to split them when the user may not want to.

Quote: Second, the bzip2, as far as I understand the algorithm, always deflates the input stream in discrete chunks

I didn't know that. So bzip2 might be almost-trivially-parallelizable then.

Quote: In every other case you may be quite satisfied by slicing the input stream into 1 MB chunks and calling deflate from the libz.so as I do for two reasons: a) I don't like the possibility that a bit in my data gets hit by a photino and flips its truthiness anywhere in the stream thus invalidating the rest of the data,

You mean a photon, right? Although even photons can't flip a bit in a register (since registers use latches); in order to do that, you'd have to raise the voltage at either the set or reset input to the latch (depending on the latch's current state).

Now in DRAM, "all" you'd have to do is discharge (or charge) a capacitor, since DRAM is just a huge array of capacitors (one per bit) and a huge array of transistors (to select which capacitors get read from or written to). To flip a bit, you have to either charge an uncharged capacitor, or discharge a charged one. But photons can't do that either. And "cosmic rays", while they have been blamed for flipping bits, are blocked by the ceramic packages that any chip comes in, so they can't do it either. (Or at least, they were blocked by those packages the last time I'd heard of any research on them being done.)

On disk, it may be possible for some rare subatomic particle or ray of some sort to flip one of the magnetic domains on the platter, but to date I've never heard of that actually happening. (And at that point your data's gone anyway.) On the tape, the same thing may apply, though again I've never heard of it actually happening.

(Now this doesn't mean that a misbehaved program or kernel won't scribble all over your data either. But to fix that, you just have to run reliable programs and kernels.)

Quote: and b) I can poke the compressed size of the block into an unused header field, which comes in handy if for some reason you want to quickly access a particular part of the data and you know its whereabouts in the raw data.

Yes, this would help, as long as you don't mind stepping on the gzip stream format. Yikes. These fields are reserved for a reason; just because they're unused today doesn't mean they will be forever... if you want to store extra data, store it in a second (short, and if the format will let you, uncompressed) gzip stream.

dkrnic
05-21-2007, 12:56 PM
The compression ratio is an elusive quantity. It all depends.
You can't make firm rules of thumb like you do, bwkaz, in
your reply to my post. Compressing parts of an input stream,
such as individual files, may even yield better compression
ratios than compressing the whole stream of concatenated
files as in a tar or other kind of archive/backup. That's
because each file starts by building its own dictionary,
instead of having to use whatever accumulated in the
previous compress of a very different kind of file. Just for
a trivial example take one file in French and one in Swedish.
Chances are that the sum of individual compressions will
be smaller than a compress of those two files concatenated.
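Whether separate or concatenated compression wins can simply be measured. The following is a small Python sketch for doing so; the function name is hypothetical and `zlib` (deflate) stands in for gzip. Which way the result goes depends entirely on the data, so no particular outcome is claimed here for French and Swedish text.

```python
import zlib


def separate_vs_joined(a: bytes, b: bytes, level: int = 9) -> tuple[int, int]:
    """Return (bytes when compressed separately, bytes when compressed joined)."""
    separate = len(zlib.compress(a, level)) + len(zlib.compress(b, level))
    joined = len(zlib.compress(a + b, level))
    return separate, joined


if __name__ == "__main__":
    import sys

    # Usage: python compare.py french.txt swedish.txt
    with open(sys.argv[1], "rb") as f1, open(sys.argv[2], "rb") as f2:
        separate, joined = separate_vs_joined(f1.read(), f2.read())
    print("separate:", separate, "joined:", joined)
```

Keep in mind that deflate only looks back 32 KiB, so for files much larger than that the first file can influence only the beginning of the second.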

This kind of difference was even used by some philologists
in Italy to authenticate text ownership in some uncertain
cases. They added texts of unknown origin to texts of
several possible suspects and decided that it is very likely that
the text was written by that author for whose compressed
size the addition made the least difference.

As to the photino. I didn't misspell it. It was a pun of course.
I meant to say "if by any chance a part of my compression
gets corrupted" but used a little more florid language.
Nothing to start a brawl.

The additional information I use for navigation in a compressed
file is in a safe place, don't worry. The gunzip will never
bother about bytes 4-7. If it did, it would be incompatible with
its previous versions. On the other hand, if I poked those 4
bytes somewhere else, it wouldn't be gzip any more, would it?
I appreciate your concern, but rest assured there is no prob.

By the way, can you please stick to about 60 characters per
line? It looks much better and is easier to read on my small screen.

Cool signature.

je_fro
05-21-2007, 03:21 PM
Ruh Roh!
Somebody's callin' out the Kaz!

/me subscribes and gets the popcorn...

bwkaz
05-21-2007, 07:25 PM
Quote: Just for a trivial example take one file in French and one in Swedish. Chances are that the sum of individual compressions will be smaller than a compress of those two files concatenated.

Um, OK, so there are pathological cases that I've never run into... (:p)

(OK, yes, you're right: there are cases where separate files can work better. But the files' languages don't matter for that, either; the only thing that matters is the bytes used to represent the characters. In your example, if both those files used the language's default ISO-8859-x character set, then I'd guess that their bytes will be similar enough that it'd be better to keep them together. OTOH if they used UTF-8, then you may be right. It all depends on how much they use the same bytes. Regardless, there are apparently a few cases where it's better to split files up.)

Quote: As to the photino. I didn't misspell it. It was a pun of course.

Ah, I missed that then. (Obviously... ;))

Quote: The gunzip will never bother about bytes 4-7. If it did, it would be incompatible with its previous versions.

OK, but why? Aren't bytes 4-7 used to store the mtime in the gzip file format RFC (that is, 1952)? Why would storing an mtime make gzip be backward-incompatible?

Quote: On the other hand, if I poked those 4 bytes somewhere else, it wouldn't be gzip any more, would it?

It could still be gzip, if you put those four bytes in a separate non-compressed "block". (A "member", according to the RFC. Just like you're splitting the blocks up today: create a separate block without any compression (since I believe gzip allows that, right?) whose contents are the 4 bytes you want to store.)

Actually, should it be 8 bytes instead? Are your files' sizes limited to 4GB as you have them laid out now, or is there an implied multiplier being used?

(Of course you'd have to add one extra block per block of input. Hmm.)

Quote: By the way, can you please stick to about 60 characters per line? It looks much better and is easier to read on my small screen.

That's why the HTML generated by the forum software wraps the lines that we all write at the appropriate places: to fit as much text horizontally as possible, depending on the user's screen. :p

Of course if your screen is too narrow to fit the ad across the top, then the table holding the replies is going to go off the right side of your window -- this is something that the forum admins should probably find out about. (I assume "your window" because I assume you're not using a text browser. Because text browsers are pretty much guaranteed to wrap the text in the right place -- at least in my experience with them.)

dkrnic
05-23-2007, 03:13 PM
The parenthesized brooding on the compressibility language- vs.
charset-wise worries me because it is plainly wrong. The
charsets do matter a little but we are not squeezing the bits
smaller as your discussion might imply. What happens is that
a series of bytes, irrespective of the charset, are analyzed
for repetitious patterns of some non-trivial length, which
are then substituted by shorter tokens.
In the end each dictionary algorithm calls the old Huffman
to squeeze the bejesus out of the redundancy at the bit level.
So because they are dictionary based there will be significant
difference in the tokens used in such a dictionary for a text
in French as compared to a text in Swedish,
totally irrespective of the charset in which it is laid out.
It tangentially touches on matters of style, the so-called
voice from a page, whereby given a mystery text you can
decide who wrote it by using a scientific metric, the
compression ratio when compressing the unknown text together
with texts of indisputable attribution and checking which
combination compresses better (at least in theory), instead
of more vague stylistic analysis.

The mtime field was somebody's idea of law and order just as
most of the rest of RFC1952 is a mass of irrelevant details,
precisely what you get when you let a committee decide. Its
one redeeming feature is that it is published and everybody
is free to disregard it. The mtime field is sadly redundant and
superfluous and nobody bothers to enforce its original
meaning. Even if the field does get interpreted as a 4-byte
timestamp, so what? Will you render me to the International
Court of Injustice in Den Haag?

On the other hand, you are on to something with the suggestion
that I could use the extensibility feature of the respective
RFC and slap on another field for that purpose. As you yourself
astutely point out, that wouldn't exactly improve the
compression ratio. Some naive implementations of RFC1952 might
even gag on it and generally it's too much hassle if there
already is a field of adequate size which nobody's using
anyway.

4 bytes are more than enough for what I need it for. I have
done some statistical research and found out that the bigger
the chunk to be compressed the better it can be compressed.
But after some real-life size, e.g. 1 MiB, the additional
compressibility is so small that you may just fahged abut it.
It is a nice round number, 1 MiB. Mr. Adler (of pigz fame)
does it with a 128 KiB default but lets you specify any other
size. Obviously, since I'm compressing 1 MiB chunks,
4 bytes are more than sufficient to express the size after
compression, which is mostly less than 1 MiB except in
some pathological cases when the deflate algorithm actually
inflates the stream by just a tiny bit.

Why 1 MiB? Because then you can reserve as many longs in
a pseudo file at the beginning as there are MiBs in the
file and put each MiB's compressed length there. In this
way you can easily jump to a particular place in the file
if you know where it is in uncompressed data. You don't
need to decompress everything. Just the MiB in which your
data is.
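The length-table idea can be sketched in Python as follows. This is a simplified stand-in, not the poster's format: it uses a custom container (a count, then one 4-byte length per block, then the zlib-compressed blocks) rather than a gzip pseudo file, and all names are hypothetical.

```python
import struct
import zlib

CHUNK = 1 << 20  # 1 MiB of raw data per block


def pack(data: bytes, chunk: int = CHUNK) -> bytes:
    """Prefix the compressed blocks with a table of their compressed lengths."""
    blocks = [zlib.compress(data[i:i + chunk]) for i in range(0, len(data), chunk)]
    table = struct.pack(f"<I{len(blocks)}I", len(blocks), *(len(b) for b in blocks))
    return table + b"".join(blocks)


def read_block_containing(packed: bytes, position: int, chunk: int = CHUNK):
    """Decompress only the block that holds raw byte `position`.

    Returns (block, offset_of_position_within_block).
    """
    (count,) = struct.unpack_from("<I", packed, 0)
    lengths = struct.unpack_from(f"<{count}I", packed, 4)
    block_no = position // chunk
    # Skip the table, then the compressed bytes of all earlier blocks.
    start = 4 + 4 * count + sum(lengths[:block_no])
    block = zlib.decompress(packed[start:start + lengths[block_no]])
    return block, position % chunk


if __name__ == "__main__":
    raw = bytes(range(256)) * 20_000
    packed = pack(raw)
    block, offset = read_block_containing(packed, 3_000_000)
    assert block[offset] == raw[3_000_000]
```

Because every block covers exactly 1 MiB of raw data, an integer division is enough to find the right table entry.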

I wish I could use a text-based viewer for this thread,
in which case all the text would wrap properly for viewing,
but I'm using a graphical program called IE5 which just
leaves the 67th byte and onward off my screen so I have to
use the horizontal bar at the bottom of the window to
scroll to the rest. And besides, shorter lines improve
comprehension, as was recently discovered by some guys
who have patented a way to turn all text into haiku-like
strophes for improved legibility and focus.

je_fro
05-23-2007, 06:25 PM
/me unsubscribes :(

bwkaz
05-23-2007, 07:00 PM
Quote: The parenthesized brooding on the compressibility language- vs. charset-wise worries me because it is plainly wrong.

Doesn't gzip operate on bytes?

Quote: The charsets do matter a little but we are not squeezing the bits smaller as your discussion might imply.

If I implied that, I didn't mean to. But I don't think I did...

Quote: What happens is that a series of bytes, irrespective of the charset, are analyzed for repetitious patterns of some non-trivial length,

Exactly right. But here's the thing: since the software is looking for repetitions of bytes, it doesn't care what the character set of the file is. It doesn't care if you have one file that's in French and one that's in Swedish: it just takes the bytes and works on them.

If the Swedish character o-with-a-line-through-it uses the same byte value as the German u-with-an-umlaut (note: I have no idea whether they do), then it will help you to concatenate the files first, because the byte values that gzip sees will be the same. Yes, the multi-byte patterns may be different, but you do still get some repetition.

Quote: So because they are dictionary based

Is this where one of us is getting confused? When I read that, "dictionary" doesn't imply "natural language dictionary" to me. Instead, "dictionary" implies "lookup from bit strings to other (longer) bit strings".

Quote: there will be significant difference in the tokens used in such a dictionary for a text in French as compared to a text in Swedish, totally irrespective of the charset in which it is laid out.

No, there won't, not necessarily anyway. (In practice, perhaps, but it's not required.) If the non-Latin characters in Swedish map to the same byte values as the non-Latin characters in French, then the byte patterns in the files will have a higher correspondence than random noise. Which means they'll compress better after concatenation.

Quote: The mtime field was somebody's idea of law and order just as most of the rest of RFC1952 is a mass of irrelevant details,

So how am I supposed to write a gzip-decompressing program, if not by following the RFC?

As for the committee remark: this RFC wasn't written by a committee. The copyright is owned by one person, and the "author's address" shows only one person as well. Technical questions can be sent to two (other) people. That's hardly a "committee".

Quote: Even if the field does get interpreted as a 4-byte timestamp, so what? Will you render me to the International Court of Injustice in Den Haag?

Where did that come from?

I'm saying you're stepping on the standard. Your files can't reliably be moved to other machines or sent to other people, because their version of gzip (or whatever program) may not handle them correctly. That's all.

dkrnic
05-24-2007, 10:56 AM
Your definition of dictionary is correct. No confusion there.

You're right about stepping on the standard. The question is
if it matters. Am I introducing a problem that can byte back
under some conditions? I think not. Not a serious one, anyway.

I take back the vitriol regarding RFC1952. I don't see what
ftext flag or xfl flags or OS field have to do with decompression.
I haven't seen that they are used. A byte or two could be
saved there. Even mtime is not really needed. That's why I
thought it was committee work. OK. I was wrong.

By the way, have you tried pigz and/or pbzip2?

dkrnic
05-24-2007, 11:14 AM
Funny, one of the two files your web site hosts is a GUI skin
for distributed folding written by none other than Jeff Gilchrist,
the author of the parallel implementation of bzip2.

How come you were so skeptical about parallelizing the
compression work right from the beginning of this thread?

bwkaz
05-24-2007, 07:14 PM
Quote: Funny, one of the two files your web site hosts is a GUI skin for distributed folding

It wasn't a skin; it was a complete rewrite. (Not that it matters anymore, since the DF project has been dead for so long...) But yes.

Quote: How come you were so skeptical about parallelizing the compression work right from the beginning of this thread?

Because I don't care who people are, just what they're saying (at least in most cases)? And because I didn't keep up with whatever Jeff was doing? ;)

Here's how that port happened, for reference: I ran DF for a while on my machine, and heard a lot of good things about dfGUI (the Windows version). So I decided to make a Linux port, since nobody else was doing it, and I could use it as a way to figure out Gtk (which is something else I wanted to do). After DF stopped running, I didn't touch the port again (the dates on the files are still April 2004), and I haven't talked with Jeff in just about as long.

Now, several years later, somebody asks if compression can be threaded. I say probably not, since the algorithms are likely sequential. You say "well actually you can parallelize them" (which is true, mind you -- after a fashion anyway), and start talking about some sample implementations by various people. And we get into this discussion, etc.

And of course I recognized Jeff's name when you said it. But I figured the existence of both of these programs was immaterial to what I was trying to argue: that while you can split the input up into multiple streams and cat them all together, you still can't parallelize the compression itself. (At least for gzip: I didn't know that bzip2 was block-based already.)

dkrnic
05-25-2007, 10:45 AM
There is a time consuming transform in bzip2, some sort of sort,
which is parallelizable to an extent. Not sure about impact, but
the parallelization based on "divide and conquer" is the rule.
Yes, it may be somewhat less efficient in saving space
but it is a small price for the efficiency in saving time.

I've tested pigz on a 128 MiB file with options
"-9 -b 1024 -p $i", where i, the number of pthreads, runs from 1 to 16:

 p   real        user        sys
 1   0m27.705s   0m27.342s   0m0.352s
 2   0m27.769s   0m27.422s   0m0.372s
 3   0m15.074s   0m27.362s   0m0.344s
 4   0m11.336s   0m27.142s   0m0.356s
 5   0m09.150s   0m26.738s   0m0.352s
 6   0m09.041s   0m26.798s   0m0.332s
 7   0m09.011s   0m27.126s   0m0.396s
 8   0m08.725s   0m27.182s   0m0.360s
 9   0m08.955s   0m27.242s   0m0.336s
10   0m08.879s   0m27.374s   0m0.300s
11   0m09.328s   0m27.686s   0m0.316s
12   0m08.954s   0m26.978s   0m0.344s
13   0m08.967s   0m27.386s   0m0.308s
14   0m08.588s   0m27.078s   0m0.320s
15   0m08.486s   0m27.354s   0m0.324s
16   0m09.074s   0m27.442s   0m0.308s

The system was a Tyan S2895 with 2 dual-core Opteron 285 CPUs.
I'd say not bad, about 3.3 times faster at its best. About
15.8 MB/s instead of 4.8 MB/s. It's not something you would
sell your grandma for but if you have lots of stuff to compress
the difference between three-plus days and one day is a lot.
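For anyone who wants to recompute the speedup and throughput from the timing table, here is a short Python sketch. The wall-clock values are copied from the "real" column; the function names are just illustrative, and MB is taken as 10^6 bytes.

```python
FILE_BYTES = 128 * 2**20  # the 128 MiB test file

# Wall-clock ("real") seconds per thread count, copied from the table above.
REAL_SECONDS = {
    1: 27.705, 2: 27.769, 3: 15.074, 4: 11.336,
    5: 9.150, 6: 9.041, 7: 9.011, 8: 8.725,
    9: 8.955, 10: 8.879, 11: 9.328, 12: 8.954,
    13: 8.967, 14: 8.588, 15: 8.486, 16: 9.074,
}


def speedup(threads: int) -> float:
    """Wall-clock speedup relative to the single-thread run."""
    return REAL_SECONDS[1] / REAL_SECONDS[threads]


def throughput_mb_per_s(threads: int) -> float:
    """Input bytes consumed per second of wall-clock time, in MB/s."""
    return FILE_BYTES / REAL_SECONDS[threads] / 1e6


if __name__ == "__main__":
    for p in sorted(REAL_SECONDS):
        print(f"{p:2d}  {speedup(p):5.2f}x  {throughput_mb_per_s(p):6.2f} MB/s")
```

The user column stays near 27 s throughout, which fits a machine with four cores: the total CPU work is constant and the wall-clock time stops improving once all cores are busy.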

Modern streamers can do a lot better. I'm not sure but I think
an LTO-3 can compress at the rate of 160 MB/s assuming the
compression ratio of 2, which is what it can physically commit
to the medium (80 MB/s) while streaming. Unfortunately, you
can't read the raw tape back and it's a slightly different format
anyway. It would be a good thing if someone would hack such
a tape unit to just forget about writing to tape and just write back
to the reader.

Pending that I wonder if it wouldn't be better to have a trivial
TCP service, something like echo or perhaps SETI@home, which
could run on any Linux or Windows computer, accepting any
reasonable block of data and returning it gzip-compressed. You
wouldn't even need explicit threads for the controlling program,
nor any forks for that matter. Just name the servers that should
be used for the compression and whip the host that pulls the most.
Theoretically, it could raise the compression rate up to about
100 MB/s given at least a GBIC uplink and a generous buffer.

And now for something completely different - how much do we
lose when the "-b" option varies exponentially between 64 KiB
and 4 MiB while always building up a new dictionary with each
chunk (option "-i")? The original file size was 1 GiB from the
beginning of a standard incremental tar backup.

 KiB   real      compressed    % worse/better than 1 MiB
  64   67.157s   284,781,983   -2.81
 128   73.889s   280,645,962   -1.33
 256   70.002s   278,580,469   -0.58
 512   65.406s   277,522,871   -0.2
1024   65.686s   276,975,488   (baseline)
2048   64.668s   276,733,455   +0.08
4096   67.193s   276,614,872   +0.13

Not that it is in any way really representative except to
get a feeling. Beyond some chunk size the advantage in smaller
aggregate compressed size is too small to worry about.
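A comparable sweep can be run on any file with a few lines of Python. This sketch compresses each block independently with `zlib` rather than calling pigz, so the absolute numbers will differ from the table above; the function names and the list of block sizes are assumptions made for illustration.

```python
import zlib


def total_compressed_size(data: bytes, block_size: int, level: int = 9) -> int:
    """Sum of compressed sizes when every block is compressed on its own."""
    return sum(
        len(zlib.compress(data[i:i + block_size], level))
        for i in range(0, len(data), block_size)
    )


def sweep(data: bytes, sizes_kib=(64, 128, 256, 512, 1024, 2048, 4096)) -> dict:
    """Map block size in KiB to the aggregate compressed size in bytes."""
    return {kib: total_compressed_size(data, kib * 1024) for kib in sizes_kib}


if __name__ == "__main__":
    import sys

    # Usage: python sweep.py backup.tar
    with open(sys.argv[1], "rb") as f:
        results = sweep(f.read())
    baseline = results[1024]
    for kib, size in results.items():
        change = (baseline - size) / baseline * 100
        print(f"{kib:5d} KiB  {size:>13,d}  {change:+.2f}%")
```

The percentage is computed the same way as in the table: negative means larger than the 1 MiB result, positive means smaller.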