
    To deduplicate or not to deduplicate?

    Deduplication sounds very attractive, and many storage marketing enthusiasts believe they cannot live without it. It does shine in three areas: reducing storage space, saving upload bandwidth, and speeding up backups and virtual machine cloning. So far so good, but is there any risk in using deduplication? The Wikipedia article on data deduplication (http://en.wikipedia.org/wiki/Data_deduplication) covers the topic well, and one point is worth remembering: data deduplication does not guarantee data integrity.
    A so-called collision, where two different pieces of data produce the same hash value, is possible, roughly as likely as winning the lottery (I am exaggerating a little, as the probability of such a collision is extremely low, but it is not zero).
    The next very important point is the algorithm that is used. In general, deduplication provides real benefits when the data really does contain many duplicates and when deduplication works at the application level rather than at the file system level. Typical applications are backups or e-mail storage, where plenty of e-mails carry identical large attachments.
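
    To make the collision and integrity point concrete, here is a minimal sketch of hash-based block deduplication. It is my own illustration in Python, not any particular product's implementation: blocks are indexed by their SHA-256 digest, and an optional byte-by-byte verify step is what protects you if two different blocks ever produce the same digest.

    import hashlib

    # Minimal in-memory dedup store: blocks are indexed by their SHA-256 digest.
    # With verify=False a hash collision (two different blocks, same digest) would
    # silently be treated as a duplicate; verify=True compares the bytes as well.
    class DedupStore:
        def __init__(self, verify=True):
            self.blocks = {}          # digest -> stored block bytes
            self.verify = verify

        def put(self, block):
            key = hashlib.sha256(block).hexdigest()
            if key in self.blocks:
                if self.verify and self.blocks[key] != block:
                    raise ValueError("hash collision detected, refusing to dedup")
                return key            # duplicate: reference the existing copy
            self.blocks[key] = block  # first occurrence: store the data
            return key

    store = DedupStore(verify=True)
    a = store.put(b"same big attachment" * 1000)
    b = store.put(b"same big attachment" * 1000)
    assert a == b and len(store.blocks) == 1   # only one physical copy is kept
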
    Over a year ago I read a very interesting post on a newsgroup. There is a discussion about deduplication going on there, and if you have no time to read the whole thread, the short quote below is from Josef Bacik, describing his experience with deduplication on regular data:

    “””
    From : http://thread.gmane.org/gmane.comp.file-systems.btrfs/8448
    > On ke, 2011-01-05 at 14:46 -0500, Josef Bacik wrote:
    > > Blah blah blah, I’m not having an argument about which is better because I
    > > simply do not care.  I think dedup is silly to begin with, and online dedup even
    > > sillier.  The only reason I did offline dedup was because I was just toying
    > > around with a simple userspace app to see exactly how much I would save if I did
    > > dedup on my normal system, and with 107 gigabytes in use, I’d save 300
    > > megabytes.  I’ll say that again, with 107 gigabytes in use, I’d save 300
    > > megabytes.  So in the normal user case dedup would have been whole useless to
    > > me.
    “””

    I guess it is good to know what deduplication may bring when it works at the file system level and is fed regular data, NOT specially prepared dedup benchmarking data.
    You will most likely see an impressive demonstration of how well deduplication works at the file system level if it is run by smart, motivated marketing folks. They will even prove 90% data reduction, but please be aware that in your own case it can be as low as 0.3%, as in Josef's case.
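
    If you are curious about your own data, you can repeat Josef's experiment before trusting any vendor number. The sketch below is my own illustration, not the tool Josef used; it assumes fixed 4 KiB blocks and SHA-256 hashing, walks a directory tree, and reports how much space exact duplicate blocks would save:

    #!/usr/bin/env python3
    # Rough estimate of potential block-level dedup savings on existing data.
    import hashlib
    import os
    import sys

    BLOCK = 4096  # fixed block size; real products may use variable-size chunking

    def block_digests(root):
        # Yield the SHA-256 digest of every BLOCK-sized chunk under root.
        for dirpath, _, files in os.walk(root):
            for name in files:
                try:
                    with open(os.path.join(dirpath, name), "rb") as f:
                        while chunk := f.read(BLOCK):
                            yield hashlib.sha256(chunk).digest()
                except OSError:
                    pass  # skip unreadable files

    seen, total, dup = set(), 0, 0
    for digest in block_digests(sys.argv[1] if len(sys.argv) > 1 else "."):
        total += 1
        if digest in seen:
            dup += 1
        else:
            seen.add(digest)
    print(f"{total} blocks scanned, {dup} duplicates, "
          f"~{dup * BLOCK / 1024 / 1024:.1f} MiB could be deduplicated")
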
    One more thing: inline deduplication will show very good performance with specially prepared duplicated data on an almost empty volume. With regular data and a volume that is already full, you will experience a huge drop in performance.


    3 Comments

    • Patrick S.

      May 10, 2012 03:04:03

      Janusz, thanks for your article. I am really curious about NAS-level deduplication.

      As a software company, we are running a build server that does its work mainly on a DSS V6 NAS target. For a bunch of reasons, all libs etc. are copied into each release build folder, which should really make for a good dedup target. Furthermore, the build generates various zips and exe setups which should not differ all too much from each other between minor build versions. I always had the opinion this was a good scenario for block-level dedup, which I would really love, since every build currently consumes about 500 megs of space.

      What is your opinion on using dedup in that scenario? Will you consider integrating dedup into DSS (the technical foundation should be there, http://opendedup.org/), and if not, can you recommend something to look at further on the subject?

      • robert.bluszcz

        May 22, 2012 02:23:56

        There are many deduplication software packages, and opendedup is one of the possibilities. The choice depends on budget, purpose and so on.
        On the other hand, our research team is investigating the opendedup software, but at this stage we cannot say whether it will be implemented or not.

    • Laurent

      May 13, 2012 10:00:44

      Hi,

      Dedup is no voodoo magic.
      You may get the vendor prediction of 80% on database tables.
      You often get nothing on compressed images.

      There is a case where dedup is great.
      I’m using a storage array featuring offline dedup to run VMware servers.
      I deploy VMs by cloning prebuilt VM templates.
      The dedup rate after the nightly dedup is just what you would expect: very good.

      A side effect of the dedup here is that it boosts the cache read hit rate.
      A host cold reboot really shows the I/O improvement after the dedup process.

      My 2 cents.

