Technology: A simple guide to statistical document sorting technology

Forget the math--all you need is a basic understanding of the process

Many lawyers’ eyes glaze over when we hear of Bayesian networks, concept clustering, predictive coding, suggestive coding, machine-assisted review, meaning-based coding, latent semantic analysis, probabilistic latent semantic analysis, Shannon’s theory, the Markov blanket, de Finetti’s theorem, latent Dirichlet allocation, Gibbs sampling and so on. Like Chevy Chase, most of us believed “there would be no math.”

Can lawyers still hold to that view now that document collections are measured in gigabytes and terabytes, and sophisticated mathematical document sorting technology is going mainstream? Yes, we can. We need only a basic understanding of what the technology does so that we know how to use it effectively in a defensible workflow.

Developers initially touted this technology for early case assessment purposes because the users hadn’t devised workflows to use it for primary document culling. But users now have developed defensible workflows that have been in use for most of the past decade. These workflows can be divided into two categories:

1. The first approach is to sort the entire dataset into clusters before humans look at the documents, review the clusters (without reviewing each document) to separate those that do not promise to contain relevant documents from those that do, and review the documents only in the latter.
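For readers who want to see the shape of this first workflow, here is a minimal sketch in Python. It is an illustration under stated assumptions, not any vendor's algorithm: the word-overlap measure, the `threshold` value, the sample documents and the reviewer's choice of promising clusters are all hypothetical. Commercial tools rely on far more sophisticated mathematics, but the steps are the same: group first, judge the groups, then read only the documents in the groups that look promising.

```python
from collections import Counter

def tokens(text):
    """Set of lowercased words in a document (a deliberately crude representation)."""
    return {w.strip(".,;:!?()\"'").lower() for w in text.split()} - {""}

def jaccard(a, b):
    """Share of words two documents have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(docs, threshold=0.3):
    """Greedy one-pass clustering: a document joins the first cluster whose
    seed document is similar enough, otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of document ids; the first id is the seed
    for doc_id, text in docs.items():
        for members in clusters:
            if jaccard(tokens(text), tokens(docs[members[0]])) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append([doc_id])
    return clusters

def describe(docs, members, n=3):
    """Most frequent words in a cluster, so a reviewer can judge the cluster
    without opening each document. Ties are broken alphabetically."""
    counts = Counter(w for m in members for w in tokens(docs[m]))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:n]]

# Hypothetical four-document collection.
docs = {
    "D1": "Quarterly pricing agreement with distributor",
    "D2": "Revised pricing agreement with distributor attached",
    "D3": "Office holiday party lunch menu",
    "D4": "Holiday party lunch menu update",
}

clusters = cluster(docs)

# Hypothetical reviewer decision: only the first cluster looks promising.
promising = {0}
review_queue = [d for i, members in enumerate(clusters) if i in promising for d in members]

for i, members in enumerate(clusters):
    print(i, describe(docs, members), members)
print("Documents sent to human review:", review_queue)
```

In this toy collection the pricing documents land in one cluster and the holiday party documents in another, so the reviewers never open the second pair. The defensibility of the approach rests on how carefully the clusters are examined before any of them is set aside.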

There is a common misconception that the algorithms in some of these systems are trained on a very small subset of the whole, and once the lawyers are satisfied that the algorithm is smart enough, the algorithm codes the rest of the dataset predictively. That is not how it works. Each document that is coded to be produced comes from the iterative review sets that humans reviewed. Many of these technologies do suggest to reviewers what the algorithm thinks the coding should be, but these are just suggestions that the reviewers accept or reject as they complete their assigned review sets.
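The point that the algorithm only suggests and the human decides can be made concrete with a short sketch. Everything here is an assumption made for illustration: the word-count scoring in `suggest`, the batch size, the sample documents and the `truth` table standing in for the reviewers' judgment. No real product is claimed to work this way. What the sketch does show is the structure described above: suggestions improve as humans code more review sets, yet a document reaches the production list only after a human has coded it.

```python
from collections import Counter

def words(text):
    """Lowercased words of a document with surrounding punctuation stripped."""
    return [w.strip(".,;:!?()\"'").lower() for w in text.split()]

def suggest(text, relevant_counts, irrelevant_counts):
    """Suggest a coding from words seen in documents humans have already coded.
    This scoring rule is a hypothetical stand-in for a real algorithm."""
    score = sum(relevant_counts[w] - irrelevant_counts[w] for w in words(text))
    return "relevant" if score > 0 else "not relevant"

def review(docs, human_decision, batch_size=2):
    """Walk the collection in review sets. The suggestion is advisory only;
    the human decision is what gets recorded and what the algorithm learns from."""
    relevant_counts, irrelevant_counts = Counter(), Counter()
    produced, log = [], []
    ids = list(docs)
    for start in range(0, len(ids), batch_size):
        batch = []
        for doc_id in ids[start:start + batch_size]:
            suggestion = suggest(docs[doc_id], relevant_counts, irrelevant_counts)
            decision = human_decision(doc_id, suggestion)  # reviewer accepts or rejects
            batch.append((doc_id, suggestion, decision))
        # Learn from the completed review set before suggesting on the next one.
        for doc_id, _, decision in batch:
            target = relevant_counts if decision == "relevant" else irrelevant_counts
            target.update(words(docs[doc_id]))
            if decision == "relevant":
                produced.append(doc_id)
        log.extend(batch)
    return produced, log

# Hypothetical collection and hypothetical reviewer judgments.
docs = {
    "D1": "pricing agreement with distributor",
    "D2": "holiday party menu",
    "D3": "revised pricing agreement",
    "D4": "party menu update",
}
truth = {"D1": "relevant", "D2": "not relevant", "D3": "relevant", "D4": "not relevant"}

def human_decision(doc_id, suggestion):
    """The reviewer sees the suggestion but codes the document on its merits."""
    return truth[doc_id]

produced, log = review(docs, human_decision)
for entry in log:
    print(entry)
print("Coded for production by humans:", produced)
```

In the first review set the algorithm has nothing to go on, so its suggestion for D1 is wrong and the reviewer overrides it. By the second set the suggestions match the reviewers' coding, but the reviewers still make every call.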

This technology is very powerful and defensible when combined with the right process. Users have developed good workflows that leverage it to separate relevant from irrelevant documents very efficiently. As new and mysterious as it may sound, it has been in real-world use in some of the biggest litigation in the country for at least a decade. And you do not need a Ph.D. to use it in your next case.

Contributing Author

Thomas Lidbury

Thomas A. Lidbury is a partner in Drinker Biddle & Reath's Commercial Litigation practice and leads the electronic discovery and records management group. He advises clients in...

Contributing Author

Michael Boland

Michael J. Boland is managing director of Drinker Discovery Solutions LLC, a subsidiary of Drinker Biddle & Reath, which provides electronic discovery services including processing and advanced...
