
Why De-Duping and Near-Dupe Detection Are Important

About a year ago, a big law firm defending a product liability case found itself overwhelmed by a document review that mushroomed into something much larger than anticipated. The firm assigned more reviewers, including an inexperienced younger associate we'll call Max. Sadly, Max failed to flag as privileged a document that clearly was. Even worse, Max was under-supervised, so nobody up the chain caught it.

The document Max failed to flag as privileged, of course, got produced. (The documents in this part of the review appear to have been paper-source, because they are described as having been OCR'd, and some had handwritten notes in the margins.)

As recounted in an earlier blog post about this case on TechnoLawyer, "The document was so clearly privileged... that each of the eight other reviewers assigned to the case had recognized and tagged its duplicates as such. [Max], however, decided that the document should be produced. And so it made its way, unnoticed, into the batch of documents (which numbered in the tens of thousands) produced for opposing counsel...."

Let's stop right here, and ask: How did nine copies of essentially the same document make it into the review stream separately?

Unless the variation in OCR quality was right off the Richter scale, there are ways to avoid having nine versions of the same document - even those with handwritten marginal notes - go into review separately.

Any e-discovery consultant or vendor with even moderate sophistication knows about software that performs near-duplicate detection, either as a stand-alone program or built into other processing systems.

Near-duplicate detection software will catch variations of what is essentially the same e-mail or electronic document, differing only by revision. It will catch the same document in both its Word and PDF formats, an instance where the two hash values would be completely different. It is also very commonly used to catch multiple copies of the same paper document, which inevitably come out slightly different when OCR'd.
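To make the hash point concrete, here is a minimal Python sketch. It is not how any particular vendor's product works: the word-shingle and Jaccard-similarity approach, the function names, and the two sample "scans" are all assumptions chosen for illustration. It shows that a single OCR misread ("liabi1ity" for "liability") changes the hash completely, while a content-based comparison still scores the two texts as largely the same.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two texts' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# Hypothetical OCR output from two scans of the same paper memo.
scan_a = ("This memo is privileged and confidential attorney client "
          "communication regarding the pending product liability claim")
scan_b = ("This memo is privileged and confidential attorney client "
          "communication regarding the pending product liabi1ity claim")

hash_a = hashlib.sha256(scan_a.encode("utf-8")).hexdigest()
hash_b = hashlib.sha256(scan_b.encode("utf-8")).hexdigest()

hashes_match = hash_a == hash_b        # exact-duplicate test fails
similarity = jaccard(scan_a, scan_b)   # content comparison still scores high

print("Hashes match:", hashes_match)
print("Shingle similarity: {:.2f}".format(similarity))
```

Hash-based de-duping treats these two scans as unrelated documents. A similarity score gives the software something it can compare against a cutoff.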

Near-dupe detection software can be calibrated to group documents together based on a percentage degree of similarity. If you had a batch with wide variability in OCR quality, you would set the percentage lower than if you were confident the OCR quality was consistently high.
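The effect of that percentage setting can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any commercial tool: the greedy grouping strategy, the `group_near_dupes` function, the document IDs, and the sample text are all hypothetical, and the similarity measure is simply the character-level ratio from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity ratio between two texts (0.0 to 1.0)."""
    return SequenceMatcher(None, a, b).ratio()

def group_near_dupes(docs, threshold):
    """Greedy grouping: a document joins the first group whose first
    member (the representative) is at least `threshold` similar to it;
    otherwise it starts a new group."""
    groups = []
    for doc_id, text in docs.items():
        for group in groups:
            if similarity(text, docs[group[0]]) >= threshold:
                group.append(doc_id)
                break
        else:
            groups.append([doc_id])
    return groups

# Hypothetical OCR output: one clean scan, one with a single misread,
# one badly degraded scan, and one unrelated document.
docs = {
    "DOC-001": "Privileged and confidential: legal advice regarding the recall of the product.",
    "DOC-002": "Priviieged and confidential: legal advice regarding the recall of the product.",
    "DOC-003": "Pr1vi1eged and conf1dentia1: 1ega1 adv1ce regard1ng the reca11 of the pr0duct.",
    "DOC-004": "Lunch menu for Friday.",
}

strict = group_near_dupes(docs, threshold=0.95)
loose = group_near_dupes(docs, threshold=0.80)

print("95% threshold:", strict)  # the degraded scan lands in its own group
print("80% threshold:", loose)   # the degraded scan joins the other two copies
```

At the stricter setting the badly OCR'd copy is reviewed on its own, which is exactly how a privileged document can end up in front of the wrong reviewer. Lowering the threshold pulls it into the same group as its cleaner copies while still leaving the unrelated document out.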

I don't know if near-duplicate detection was used in this case, or whether it was considered but a good reason existed not to use it. From the way this story is told, it does not sound like it was used.

This document shouldn't have gotten to the inexperienced reviewer in the first place. It could have been bundled together with its other eight near-duplicates, and reviewed by someone with more seniority. The cost of near-dupe detection is much less than the cost of reviewing the same document nine times - and having one of those nine reviewers make the wrong call.

Active in litigation support and e-discovery since the late 1980s, Cliff Shnier is an attorney and electronic discovery consultant who divides his time between his base in Scottsdale, Arizona and Toronto, Ontario. E-mail him at
