Why Do I Want To De-Duplicate My Data When I Migrate My Content?

Regardless of how conscientious an organisation is when it comes to data, there will always be multiple copies of documents stored on a file system or in e-mail. Think about it: two people get the same e-mail and they both file it away; people take a document, make their own copy and store it on the company file system. Before too long you will have a lot of this content proliferating, and at some point downstream it may well need to be migrated.

That’s why, as the next chapter in our look at content migration (see the previous blogs ‘What Do We Really Mean By The Term “Content”?’, ‘The Perils Of The “Garbage In” School Of Content Migration’ and ‘Enhancing Your Data Will Really Help Migrate Content Effectively’), we need to look at the very important role de-duplicating content has to play in a successful migration.

First off, be aware that de-duplication is everywhere. Google de-duplicates its search results, and so do other search engines: Google will often tell you that your results have been constrained by removing similar content. Google does that out of the box on the web, but for your own content you will have to do it yourself. There is help to be had, though. First, think: how do we know what content we have got anyway? Somebody may have changed the name of a file, but it is still the same content, after all.
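One simple way to answer that question, independent of file names, is to compare files by a hash of their contents: two files with the same hash are almost certainly the same document, whatever they happen to be called. The sketch below is a minimal Python illustration of the idea only; the share path is a placeholder, and a real migration tool would do this at far greater scale (and would hash large files in chunks rather than reading them whole).

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root: str) -> dict[str, list[Path]]:
    """Group files under 'root' by a hash of their bytes, ignoring file names."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            # In practice you would hash large files in chunks
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    # Only digests shared by more than one file are duplicates
    return {d: paths for d, paths in groups.items() if len(paths) > 1}

if __name__ == "__main__":
    for digest, paths in find_duplicates("/mnt/fileshare").items():  # placeholder path
        print(f"{digest[:12]}…  {len(paths)} copies: {[str(p) for p in paths]}")
```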

The good news is that there are a number of applications, both commercial and open source, that can help you analyse the data held in a system and create an index of the discovered information, which end-users can then search. Google addresses content by URL, but there are dedicated offerings too: EMC, for example, has an eDiscovery product called Kazeon that can access live e-mail systems as well as Documentum and SharePoint repositories and file systems. Incidentally, Kazeon can also “crack the file open” to analyse the textual content and highlight any sensitive information contained within.
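To give a flavour of what “cracking the file open” involves, the sketch below scans plain-text files and flags a couple of illustrative sensitive-data patterns (a UK National Insurance number and a credit-card-like digit run). This is not Kazeon or any vendor’s API, just a minimal Python illustration of the idea; the patterns and the path are assumptions for the example.

```python
import re
from pathlib import Path

# Illustrative patterns only; a real tool ships with far richer detectors
SENSITIVE_PATTERNS = {
    "uk_national_insurance": re.compile(r"\b[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]\b", re.I),
    "possible_card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scan_for_sensitive_text(root: str):
    """Yield (file, pattern_name, match) for plain-text files under 'root'."""
    for path in Path(root).rglob("*.txt"):
        text = path.read_text(errors="ignore")
        for name, pattern in SENSITIVE_PATTERNS.items():
            for match in pattern.finditer(text):
                yield path, name, match.group()

if __name__ == "__main__":
    for path, name, match in scan_for_sensitive_text("/mnt/fileshare"):  # placeholder path
        print(f"{path}: {name}: {match}")
```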

These tools quickly give you a very powerful way to search your information: you can explore your content through a simple browser interface and then manipulate it as you wish. This approach also facilitates the classification and enhancement of content, which can then be loaded into your new ECRM system if so desired.

In SynApps’ case, our content migration service handles the data sorting we discussed earlier, and for the data clean-up prior to migration we use a tool like Kazeon. There are, of course, other tools around, including open source ones, that do similar things, but Kazeon is an excellent tool that we would recommend.

Regardless of the method chosen, the data de-duplication and clean-up I’ve outlined here are fundamental to any migration process. Massive savings can be realised in existing storage usage through de-duplication, in some cases as much as 60-80%! You can see how de-duplication is critical to data enhancement, and you can then feed the results into whatever your migration approach is.
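To put a rough figure on those savings for your own estate, you can extend the earlier hashing sketch: keep one copy per duplicate group and count the bytes the remaining copies occupy. Again, this is only an illustrative sketch that assumes the output of the hypothetical find_duplicates() helper above; the 60-80% figure will vary widely from one organisation to another.

```python
from pathlib import Path

def estimate_savings(duplicate_groups: dict[str, list[Path]]) -> int:
    """Bytes reclaimable if only one copy per duplicate group were kept.

    Expects the {digest: [paths]} mapping produced by the earlier
    find_duplicates() sketch.
    """
    saved = 0
    for paths in duplicate_groups.values():
        size = paths[0].stat().st_size      # identical files share one size
        saved += size * (len(paths) - 1)    # keep one copy, reclaim the rest
    return saved

# Hypothetical usage, reusing the earlier sketch and placeholder path:
#   print(f"{estimate_savings(find_duplicates('/mnt/fileshare')) / 1e9:.1f} GB reclaimable")
```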

Next time, we will look at the business context, and why working on content migration with a partner like SynApps is helpful.