e-Disclosure: The Partial Deduplication Dilemma

In our industry, walking the line between what the often non-technical solicitors say they want and what is actually possible is a daily activity. There are times, however, when a member of a law firm asks for something that nobody seems able to do, and we are left scratching our heads as to why no one offers such a service. When this happens, we like to turn away from pre-made tools and see if we can develop a program or script in-house that does exactly what we need.

Deduplication is a function offered in the majority of e-Disclosure packages. It works by computing an MD5 hash of each file, a fingerprint of its digital data, and comparing the hashes. If two MD5 hashes match, the files are duplicates and only one is displayed. 99% of the time this works very well, and it is relied upon across the industry. However, as was the case in this example, there is a small 1% of the time when it is not adequate. The more we worked on the example, the more we realised that this 1% could actually be used to save time for many of the law professionals involved in the e-Disclosure process.
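The hash-based deduplication described above can be sketched in a few lines of Python. This is a minimal illustration using the standard library, not the code of any particular e-Disclosure package; the function names md5_of_file and deduplicate_exact are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deduplicate_exact(paths):
    """Keep the first file seen for each MD5 hash (hypothetical helper).

    Returns (kept, duplicates), both as lists of paths.
    """
    first_seen = {}
    duplicates = []
    for path in paths:
        file_hash = md5_of_file(path)
        if file_hash in first_seen:
            duplicates.append(path)
        else:
            first_seen[file_hash] = path
    return list(first_seen.values()), duplicates
```

Because the hash covers every byte of the file, two copies of an email that differ in a single header value produce different hashes and are both kept.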

In this scenario, the solicitor had close to half a million emails to review. Reviewing them all would be quite an achievement, and understandably they wished to reduce this number to save both time and costs. Deduplicating the set using traditional hash analysis left close to 400,000 emails, which is still a very high number. A large portion of these, however, appeared to be duplicates that had been received at differing times. For example, an email sent to two people at 10:00 was received at 10:01 by one recipient and at 12:48 by another, when they checked their emails. The challenge was to remove the duplicates whose body content was 90% similar. To explain this further, the script needed to ignore the sender, recipient, times and dates, and concentrate only on the body of the text. It then needed to treat bodies that were at least 90% similar as a match, rather than requiring the usual 100%. This meant that small differences in the text, such as the times quoted in a replied-to or forwarded email, would not affect the results.
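As a rough sketch of the "body only" requirement, the snippet below parses a raw email with Python's standard email package and returns just the plain-text body, discarding sender, recipient and date headers. It assumes the emails are available as RFC 5322 text; the function name body_text and the whitespace normalisation are our own illustrative choices.

```python
from email import message_from_string, policy


def body_text(raw_email):
    """Return the plain-text body of a raw email, ignoring all headers.

    Hypothetical helper: whitespace is collapsed so that line wrapping
    differences do not count as differences in content (an assumption).
    """
    message = message_from_string(raw_email, policy=policy.default)
    part = message.get_body(preferencelist=("plain",))
    if part is None:
        return ""
    return " ".join(part.get_content().split())
```

Two copies of the same message with different To and Date headers yield the same string from body_text, even though the MD5 hashes of the full files differ.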

Python, being an easy-to-use and very versatile language, was used to conduct this process. The script opened every email within the e-Disclosure system and, one by one, compared the body of each to all other emails, looking for 90% matches using fuzzy matching (in the same spirit as treating ‘hello’ and ‘h3llo’ as a match). When matches were found, the MD5 hash of the whole file was added to a list, which was then fed back into the e-Disclosure platform to allow the email files to be ‘excluded’. This significantly reduced the number of emails that needed to be analysed, from half a million to close to 300,000, a drop of roughly 40%. There was still a lot of work for the team of solicitors to conduct, but the script will undoubtedly have helped. If it takes a whole day to look through 3,000 emails, the roughly 200,000 emails removed represent over two months of viewing emails saved.
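The comparison step might look something like the sketch below. It is not the original script: it uses difflib.SequenceMatcher from the standard library as a stand-in for whichever fuzzy-matching method was used, compares each email against the ones already kept, and the names similarity and find_near_duplicates are hypothetical.

```python
import hashlib
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.90  # the 90% figure from the text


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def find_near_duplicates(emails, threshold=SIMILARITY_THRESHOLD):
    """Return MD5 hashes of whole files whose body matches a kept email.

    `emails` is a list of (raw_bytes, body) pairs (an assumed structure).
    The first email of each near-duplicate group is kept for review.
    """
    kept_bodies = []
    excluded_hashes = []
    for raw_bytes, body in emails:
        if any(similarity(body, kept) >= threshold for kept in kept_bodies):
            # Hash the whole file, so the platform can identify it.
            excluded_hashes.append(hashlib.md5(raw_bytes).hexdigest())
        else:
            kept_bodies.append(body)
    return excluded_hashes
```

With this measure a single substituted letter in a five-letter word scores 0.8, so the ‘hello’ and ‘h3llo’ pair illustrates the idea rather than the threshold itself; across a full email body, a changed timestamp moves the ratio far less. Comparing every email against every other scales quadratically, so a run over hundreds of thousands of emails would likely need batching or a cheaper pre-filter.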

In hindsight, this solution was relatively simple and took under a week to create. We are surprised that e-Disclosure platforms do not include a solution such as this by default, but we are taking steps to make our solution, or a similar one, available in more e-Disclosure platforms.