To Deduplicate or Not to Deduplicate?

Updated 19/11/2021

How Deduplication Works

Deduplication, a feature available in Open-E JovianDSS, sounds very useful, and many storage marketing enthusiasts believe they cannot live without it. Admittedly, it does seem very attractive, largely because of three key benefits. First, it reduces the space you use. Second, it saves upload bandwidth. Lastly, it improves the performance of backups and cloned virtual machines.

Everything is good so far, but is there any risk in using deduplication? We can learn a lot about it from the Wikipedia article (https://en.wikipedia.org/wiki/Data_deduplication). First of all, data deduplication does not guarantee data integrity, so keep that in mind. Second, hash-based deduplication can suffer a collision, where two different pieces of data have the same hash value and one is wrongly treated as a duplicate of the other. The probability of a collision occurring is very low, but it is not zero.

The last significant point is that the results depend on the algorithm used and on the circumstances. In general, deduplication provides real benefits if the data contains a lot of duplicates and if deduplication works at the application level rather than the file system level. Typical applications that benefit greatly include backups and e-mail storage where many e-mails carry large, identical attachments. In other circumstances, most of the benefit of deduplication is lost.
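To make the hash and collision points concrete, here is a minimal sketch of block-level deduplication in Python. It is an illustration only, not how Open-E JovianDSS or any particular file system implements the feature. The class name DedupStore, the 4096-byte block size and the choice of SHA-256 are all assumptions made for the example. Each block is stored once under its hash, and a byte-for-byte comparison guards against the (very unlikely) case of two different blocks sharing a hash.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for illustration


class DedupStore:
    """Hypothetical in-memory block store that keeps each unique block once."""

    def __init__(self):
        self.blocks = {}        # SHA-256 hex digest -> block contents
        self.logical_bytes = 0  # bytes written by callers, duplicates included

    def write(self, data):
        """Store data and return the list of block digests needed to read it back."""
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            stored = self.blocks.get(digest)
            if stored is None:
                # First time this content is seen: keep the block.
                self.blocks[digest] = block
            elif stored != block:
                # Same hash, different content: a collision. Without this
                # check the new data would silently be replaced by the old.
                raise RuntimeError("hash collision detected")
            recipe.append(digest)
            self.logical_bytes += len(block)
        return recipe

    def read(self, recipe):
        """Reassemble data from a list of block digests."""
        return b"".join(self.blocks[digest] for digest in recipe)

    def physical_bytes(self):
        """Bytes actually held after deduplication."""
        return sum(len(block) for block in self.blocks.values())


store = DedupStore()
recipe = store.write(b"A" * 8192 + b"B" * 4096)
print(store.logical_bytes, store.physical_bytes())  # 12288 8192
```

The two identical "A" blocks are stored once, so 12288 logical bytes occupy 8192 physical bytes. The extra comparison on every duplicate hit is the price of not trusting the hash alone, and a system that skips it is relying entirely on the low collision probability.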

Community Opinions on Deduplication

A while ago, I read a fascinating newsgroup thread about deduplication. Below is a short quote from Josef Bacik, a participant in that discussion, about his experience using deduplication with regular data:

“””
From: https://thread.gmane.org/gmane.comp.file-systems.btrfs/8448

> On ke, 2011-01-05 at 14:46 -0500, Josef Bacik wrote:
> > Blah blah blah, I’m not having an argument about which is better because I
> > simply do not care. I think dedup is silly to begin with, and online dedup even
> > sillier. The only reason I did offline dedup was because I was just toying
> > around with a simple userspace app to see exactly how much I would save if I did
> > dedup on my normal system, and with 107 gigabytes in use, I’d save 300
> > megabytes. I’ll say that again, with 107 gigabytes in use, I’d save 300
> > megabytes. So in the normal user case dedup would have been whole useless to
> > me.
“””

It’s always good to know how deduplication performs at the file system level on regular data, NOT on specially prepared dedup benchmarking data.
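If you want to run a similar experiment on your own data, a rough estimate takes only a few lines. The sketch below is not the tool from the quote. It is a hypothetical Python script that assumes fixed 4096-byte blocks and SHA-256 hashes, walks a directory tree, and reports how many bytes are in blocks that have already been seen. The function name estimate_dedup_savings is made up for this example, and real file systems may use different block sizes and alignment, so treat the result as an approximation.

```python
import hashlib
import os


def estimate_dedup_savings(root, block_size=4096):
    """Return (total_bytes, duplicate_bytes) for regular files under root.

    Hypothetical estimator: a block counts as a duplicate when a block
    with the same SHA-256 digest has already been seen.
    """
    seen = set()
    total = 0
    duplicate = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as handle:
                    while True:
                        block = handle.read(block_size)
                        if not block:
                            break
                        total += len(block)
                        digest = hashlib.sha256(block).digest()
                        if digest in seen:
                            duplicate += len(block)
                        else:
                            seen.add(digest)
            except OSError:
                # Unreadable file: leave it out of the estimate.
                continue
    return total, duplicate


if __name__ == "__main__":
    total, duplicate = estimate_dedup_savings(".")
    share = 100.0 * duplicate / total if total else 0.0
    print(f"{total} bytes scanned, {duplicate} bytes duplicated ({share:.2f}%)")
```

The set of digests grows with the amount of unique data, so on a large volume this script needs a noticeable amount of memory.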

Closing Thoughts

You’ll most likely get a great demonstration of how well deduplication works at the file system level if it’s done by intelligent, motivated marketing folks. They’ll even prove that a 90% data reduction is possible, but please be aware that this is not always the case. It could very well be much lower, like the roughly 0.28% (300 megabytes out of 107 gigabytes) in Josef’s case.
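The savings ratio for the figures in the quote (300 megabytes saved with 107 gigabytes in use) is easy to check. The small Python helper below is only a sketch. The name savings_percent is made up, and the default of 1000 megabytes per gigabyte is an assumption, since the quote does not say whether decimal or binary units were meant.

```python
def savings_percent(saved_mb, used_gb, mb_per_gb=1000):
    """Percentage of used space that deduplication would free."""
    return 100.0 * saved_mb / (used_gb * mb_per_gb)


print(f"{savings_percent(300, 107):.2f}%")        # 0.28%
print(f"{savings_percent(300, 107, 1024):.2f}%")  # 0.27%
```

Either way the saving is well under one percent, which is a long way from a 90% reduction.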

One more thing: inline deduplication will demonstrate outstanding performance with specially prepared duplicated data on an almost empty volume. In real use cases involving regular data and volumes full of data, you’ll almost always experience a huge drop in performance.


3 Comments

Patrick S.
May 10, 2012 03:04:03
Janusz, thanks for your article. I am really curious about NAS-level deduplication.
As a software company, we run a build server that does its work mainly on a DSS V6 NAS target. For a bunch of reasons, all libs etc. are copied into each release build folder, which should really make for a good dedup target. Furthermore, the build generates various zips and exe setups, which should not differ all too much from each other between minor build versions. I always had the opinion that this was a good scenario for block-level dedup, which I would really love, since every build currently consumes about 500 megs of space.
What is your opinion on using dedup in that scenario? Will you consider integrating dedup into DSS (the technical basis should be there, http://opendedup.org/), and if not, can you recommend something for looking further into the subject?
  robert.bluszcz
  May 22, 2012 02:23:56
  There are many deduplication software packages, and opendedup is one of the possibilities. The choice depends on budget, purpose and so on.
  Our research team is investigating the opendedup software, but at this stage we are not able to tell you whether it will be implemented or not.
Laurent
May 13, 2012 10:00:44
Hi,
dedup is no voodoo magic.
You may get the vendor prediction of 80% on database tables.
You often get nothing on compressed images.
There is a case where dedup is great.
I’m using a storage bay featuring offline dedup to run VMware servers.
I deploy VMs by cloning prebuilt VM templates.
The dedup rate after the nightly dedup is just what you expect: very good.
A side effect of the dedup here is that it boosts the cache read hit rate.
A host cold reboot really shows the I/O improvement after the dedup process.
My 2 cents.