About Gzip And Tar
Everybody on Linux and BSD seems to use a program called gzip, frequently in conjunction with another program called tar. Tar, named from Tape ARchive, is a program which copies files and folders (“directories”) to a format originally designed for archiving on magnetic tape. But tar archives also can be saved to many other storage media besides tape, including normal hard drives, solid state drives, NVMe drives, and more.
When making an archive, people frequently want to minimize the archive’s size. That’s where gzip comes into play. Gzip reduces the size of the archives so they take up less storage space. Later, the gzipped tar archives can be “unzipped.” Unzipping restores the tar archives to their original size. While unzipping, the tar program can be used again to “extract” or “untar” the archive. Extraction hopefully restores the archived original files exactly as they had been when the archive was created.
Besides archiving for long term storage, many people frequently use tar and gzip for short term backup. For example, on my server, Darkstar, I compile and install many programs. Before compiling, I use tar to make a short term backup of how things were before the compile and install.
Three Good Reasons To Compile
First, compiling gets us the most current source code for the programs. Second, once we have done compiling a few times, compiling a program from its latest sources can be easier than figuring out how to install an often older version with our distribution’s package manager. Third, compiling ourselves results in having the program sources readily available.
The programs that I compile on Darkstar usually live in /usr/local. Before I put a new program into /usr/local I like (in addition to my regular backups of Darkstar) to make an archive of /usr/local as it exists just before the new software addition. With a handy /usr/local archive, if something goes crazy wrong during my new install, it’s easy to revert.
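Here is a minimal sketch of that backup-and-revert cycle. It is not the exact procedure used on Darkstar: a throwaway directory made with mktemp stands in for /usr so nothing real is touched, and the file local/bin/tool is a made-up example.

```shell
#!/bin/sh
# Sketch: archive a directory with tar, damage it, then revert.
# "$work" is a temporary stand-in for /usr (an assumption for safety).
set -eu
work=$(mktemp -d)
mkdir -p "$work/local/bin"
echo "original" > "$work/local/bin/tool"   # hypothetical installed file

# Archive "local" relative to its parent directory,
# like running: cd /usr && tar cf local-revert.tar local
tar -C "$work" -cf "$work/local-revert.tar" local

# Simulate an install that goes wrong.
echo "broken" > "$work/local/bin/tool"

# Revert: remove the damaged tree, then extract the archive in its place.
rm -rf "$work/local"
tar -C "$work" -xf "$work/local-revert.tar"

restored=$(cat "$work/local/bin/tool")
echo "$restored"
rm -rf "$work"
```

Removing the damaged tree before extracting matters, because extraction alone would leave behind any files the failed install had added.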
Creating Pre-Compile Backups Can Take Too Long
Lately, as more software has been added to /usr/local, it’s been taking too long to make the pre-compile archive, about half an hour.
Recently, using the top(1) command, I watched an archive being formed. I noticed that gzip was reported as using 100% of one CPU throughout the archive formation.
How Much Faster And Bigger Are Plain Tar Archives Made Without Gzip?
I wondered how the overall time required to make my pre-compile archive would change if I did not use gzip. I also wondered how much bigger the archive would be. The data and my analysis of the surprisingly large difference in creation time appear below. The archive size difference is also large, but nowhere near as large as the creation time difference.
Creation Time Data
I ran the pre-compilation archive twice, once with gzip and once without gzip. I made a line numbered transcript of both tests.
000023 root@darkstar:/usr# time tar cvzf local-revert.tgz local
000024 local/
[ . . . ]
401625 local/include/gforth/0.7.3/config.h
401626
401627 real 28m11.063s
401628 user 27m1.436s
401629 sys 1m21.425s
401630 root@darkstar:/usr# time tar cvf local-revert.tar local
401631 local/
[ . . . ]
803232 local/include/gforth/0.7.3/config.h
803233
803234 real 1m14.494s
803235 user 0m4.409s
803236 sys 0m46.376s
803237 root@darkstar:/usr#
This Stack Overflow post explains the differences between the real, user, and sys times reported by the time(1) command. The “real” time is wall clock time, so the “real” time shows how long our command took to finish.
Gzip Took 22 Times Longer!
Here, we can see that making the archive with gzip took approximately 28 minutes. Making the archive without gzip took only about 1.25 minutes. The gzipped archive took roughly 22 times longer to make than the uncompressed archive!
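For anyone who wants to check the arithmetic, the two “real” times from the transcript can be converted to seconds and divided. This sketch uses awk only because the shell has no floating point math of its own.

```shell
#!/bin/sh
# "real" times copied from the transcript: 28m11.063s and 1m14.494s.
gzip_secs=$(awk 'BEGIN { printf "%.3f", 28 * 60 + 11.063 }')
plain_secs=$(awk 'BEGIN { printf "%.3f", 1 * 60 + 14.494 }')

# Ratio of wall clock times, to one decimal place.
ratio=$(awk -v a="$gzip_secs" -v b="$plain_secs" 'BEGIN { printf "%.1f", a / b }')
echo "gzip run took $ratio times as long"   # prints 22.7
```

The exact ratio is a little under 23, so “22 times” is a rounded-down figure.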
Archive Size Data
Now let’s check the archive sizes.
root@darkstar:/usr# ls -lh local-revert.t*
-rw-r--r-- 1 root root 22G Oct 4 05:22 local-revert.tar
-rw-r--r-- 1 root root 10G Oct 4 05:20 local-revert.tgz
root@darkstar:/usr#
The gzipped archive is 10 gigabytes and the plain, uncompressed tar archive is 22 gigabytes.
Gzip’s Compression Was 55%.
The gzipped archive is about 55% smaller than the plain tar archive. That’s a lot of compression!
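The percentage comes from the two sizes that ls reported. Those sizes are rounded to whole gigabytes, so this sketch is only as precise as the listing.

```shell
#!/bin/sh
# Sizes in gigabytes as shown by ls -lh (rounded by ls).
plain_gb=22
gzip_gb=10

# Space saved = 1 - compressed / uncompressed, as a percentage.
saved=$(awk -v p="$plain_gb" -v g="$gzip_gb" 'BEGIN { printf "%.1f", (1 - g / p) * 100 }')
echo "space saved: $saved%"   # prints 54.5%
```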
Conclusion
On Darkstar, there is abundant extra disk space. So having an archive that is twice as big but created 22 times faster might be the best choice. Going forward, before compiling, I will skip doing any compression at all when backing up /usr/local to enable revert. Now I won’t have to wait that half an hour any more!
Additional Reflections
Creation time and archive size results would be expected to differ according to the types of files involved. For example, unlike the files in Darkstar’s /usr/local, many image file formats already are compressed, so additional compression doesn’t reduce their size very much.
As I was preparing this article, I found out about pigz. Pigz (pronounced “pig-zee”) is an implementation of gzip which allows taking advantage of multicore processors. Maybe pigz soon will be a new neighbor in Darkstar’s /usr/local.
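Another approach to speeding up compression is to use a different compression program than gzip. There are quite a few which are popular, such as bzip2 and xz, although those two usually spend more CPU time in exchange for smaller archives, so they are not necessarily faster. These other compression programs can be called with tar’s -I option.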
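As a rough sketch of the -I option, assuming GNU tar (in bsdtar, -I means something else): the option names the program that tar pipes the archive through. The example sticks with gzip at its fastest level so that it runs without installing anything extra; another compressor would be substituted in the same spot. The sample data is made up.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
mkdir "$work/local"
seq 1 20000 > "$work/local/numbers.txt"   # made-up sample data

# -I is the short form of --use-compress-program in GNU tar.
# The quoted string may carry options for the compressor.
tar -C "$work" -I 'gzip -1' -cf "$work/local-fast.tar.gz" local

# When reading, tar runs the named program with -d to decompress.
listing=$(tar -I gzip -tf "$work/local-fast.tar.gz")
echo "$listing"
rm -rf "$work"
```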
Of course it is one thing to change the compression program with tar’s -I option and another thing to make tar itself work in parallel. Here is a Stack Exchange post about tarring in parallel. I will have to try that.
Finally, unlike when we get our sources and our compiled programs separately, it seems fully clear that the sources we compile ourselves are the sources to the programs we’re actually running. However, way back in 1984, Ken Thompson recognized that the programs we compile ourselves sometimes can be very different than what we expected.
- How Much Faster Is Making A Tar Archive Without Gzip? - October 7, 2022
PIGZ? https://zlib.net/pigz/
Hi cutech! Thanks for your comment. Yes, the same link to pigz appears in the article. But I haven’t tried pigz yet. I’m looking forward to trying it! Have you tried pigz? Best wishes!
pigz basically just is multi-threaded, thus resulting in a lot more compression speed (after all, these days our computers don't tend to have a GHz frenzy any more but rather multiple cores/threads on a single die with a lower frequency). Thus yes, I do use pigz as a drop-in replacement for gzip and yes, at least on my desktop with 16 (effective) cores it obviously performs 16 times faster.
Pigz is great, but it’s not the answer here.
Things that might be: btrfs/zfs snapshots, but that’s a rather big change – what I’d actually suggest is epkg. Its project page is unfortunately long gone, but you’ll still be able to find epkg-2.3.9.tar.gz – and its use is described nicely here: https://blog.notreally.org/2008/01/21/installing-from-source-the-easy-way/
Hi Chris! Thanks for your comment. That epkg looks nice! I also loved the part in the blog where the author used apt to install the prerequisites. :)
1. Use something like restic to backup. It will only backup things that have changed. It uses multi-threaded zstd compression by default which is many times faster than gzip or pigz.
2. This is not really something you need to ‘backup’, it is not unique data, and can be regenerated. It’s more let me not f* this up. For that, I would use btrfs snapshots.
/bin/time sudo btrfs subvolume snapshot -r /usr/local/ /usr/local/@20221007
Create a readonly snapshot of ‘/usr/local/’ in ‘/usr/local/@20221007’
0.00user 0.01system 0:00.01elapsed 81%CPU (0avgtext+0avgdata 9148maxresident)k
32inputs+3360outputs (0major+1032minor)pagefaults 0swaps
This creates a read-only snapshot of /usr/local in /usr/local/@20221007 that you can ‘cd’ into and everything will be as it was when created. Snapshots take almost no time to create, and use almost no extra space until the original files change.
Hi nopro404! I’ve never tried btrfs! One of these days I should try it! Thanks for the suggestion!
If speed is what matters most, I think zstd would be a better choice. Even on a single core it is much faster than gzip, and offers a similar compression ratio…
Hi Rhinox! Here’s a link to the Wikipedia article about zstd: https://en.wikipedia.org/wiki/Zstd . Thanks for suggesting zstd!
What have you done between the two tar runs?
Second run may have taken advantage of file system caching.
While I can expect some time differences between the two, it looks like it is too much to me.
Hints ahead.
1. Make a first dummy run, then run the two again.
2. Also make those runs in reverse order and average the times.
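The two hints above could be sketched like this. It is only an outline of the method on made-up sample data in a temporary directory, not a re-run of the article's measurement, and it times in whole seconds with date, which is coarse.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
mkdir "$work/local"
seq 1 50000 > "$work/local/numbers.txt"   # made-up sample data

# Run one archive pass and print the elapsed whole seconds.
# Output is piped through cat because GNU tar skips reading file
# contents when it detects that the archive itself is /dev/null.
run_once() {
    start=$(date +%s)
    case "$1" in
        plain) tar -C "$work" -cf - local | cat > /dev/null ;;
        gzip)  tar -C "$work" -czf - local | cat > /dev/null ;;
    esac
    end=$(date +%s)
    echo "$1: $((end - start))s"
}

# Hint 1: a dummy run first, so both timed runs start with a warm cache.
run_once plain > /dev/null

# Hint 2: time both variants in each order, then average per variant.
runs=0
for order in "gzip plain" "plain gzip"; do
    for mode in $order; do
        run_once "$mode"
        runs=$((runs + 1))
    done
done
rm -rf "$work"
```

On a tree as large as the one in the article, the per-variant times from the two orders would then be averaged by hand or with awk.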
You may want to try different compression settings that affect speed: gzip:compression-level
and/or bzip2 or other compression algorithms offered by tar or – as already mentioned – pigz.
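To illustrate the compression level idea, here is a small sketch comparing gzip's fastest level (-1) with its most thorough level (-9) on made-up sample data. Sizes and speeds on a real /usr/local would differ.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
seq 1 200000 > "$work/sample.txt"   # made-up sample data

# -c writes to stdout, so the original file is left alone.
fast=$(gzip -1 -c "$work/sample.txt" | wc -c)
best=$(gzip -9 -c "$work/sample.txt" | wc -c)

echo "level 1: $fast bytes"
echo "level 9: $best bytes"
rm -rf "$work"
```

With tar, the same levels can be applied by piping, for example tar cf - local | gzip -1 > local-revert.tgz.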
Since you have plenty of disk space, why not just tar, then gzip the tarfile after the tar is done? That way you’re already able to start downloading and compiling after one and a half minutes.
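A sketch of this tar-first, gzip-later idea follows, again on made-up data in a temporary directory. The archive name local-revert.tar matches the article; everything else is an assumption.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
mkdir "$work/local"
seq 1 20000 > "$work/local/numbers.txt"   # made-up sample data

# Step 1: the quick part, a plain tar archive.
tar -C "$work" -cf "$work/local-revert.tar" local

# Step 2: compress in the background. gzip replaces the .tar file
# with a .tar.gz file when it finishes.
gzip "$work/local-revert.tar" &
gzip_pid=$!

# ... downloading and compiling could start here ...

wait "$gzip_pid"
archives=$(cd "$work" && ls local-revert.tar*)
echo "$archives"
rm -rf "$work"
```

One thing to keep in mind: if a revert is needed while gzip is still running, the plain .tar file is still present and can be extracted, since gzip removes it only after the compressed copy is complete.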
There’s also lz4. Low CPU, almost transparently fast with limited compression ratio. It’s what ZFS uses by default at the file system level.
https://backuppc.github.io/backuppc/
For localhost, it’s install and go to web interface to exclude what you do not want to backup.
This made it so simple to back up everything on the server and PCs that it backs up every Linux PC I have now. Just needs SSH key auth on target.
A better title for this article would be “How Much Faster Is Making A Tar Archive Without Gzip Using Specific Options On A Specific Platform At A Specific Time?”
First, as others have noted, using different software can help, including pigz (which is a name which is an anagram of gzip) which is basically using gzip and splitting off tasks to different CPU cores. The gzip command itself can use parameters which may affect how it compresses, which may affect speed.
Second: After compressing (with gzip), the compressed data is written. The compression process takes time. However, compressed data takes less time to write (and read back) than uncompressed data. So, how much is the time cost for compression, compared to the time savings of being able to work with smaller data? The correct answer is: some of that is going to be dependent on the platform. Different speeds in a CPU, and different speeds in a disk, may affect that. On very old computers with very old CPUs, you may get one answer. On newer computers with much faster CPUs but the slower data storage with hard drives, you may get an opposite answer. On computers with data storage which is using a chip, the cost of storage may be reduced. On a server where slices of available CPU time are a shared resource, and access to data might involve communicating over a network which is another type of shared resource, your results could vary even when you use the same compression ratio on the same hardware.
I’ve written about this, a bit longer, here:
http://cyberpillar.com/dirsver/1/mainsite/techns/hndldata/filesys/fsformat/fsformat.htm#chdmyths
As is, the title makes it sound like gzip slows things down in some consistent, measurable way which other people should learn (perhaps regardless of what hardware they are using). The reality is that such details may vary today, and, over time, such details may be even more prone to vary as different technologies tend to evolve (in different speeds and maybe other different ways).