# Technology: The role of probability & certainty in developing document review strategies

Demonstrating how to reduce the cost of document review by 85 percent without AI, using mathematics and technology

With corporate legal budgets under continued pressure, inside counsel is often faced with answering three difficult questions:

1. What will the pursuit of this matter cost?
2. How long will it take?
3. If I work to control my costs, what will that do to my risk?

In pondering the implications of these questions, one’s focus quickly centers on document review. Document review now represents 70 percent of e-discovery costs. That figure is not surprising when you consider that the cost to collect, process and host a document for review runs in the neighborhood of five cents (\$0.05), while the cost to review a document is roughly \$1 using a contract attorney (and many times that amount if the document is reviewed by a more senior attorney).

Aside from using less expensive attorneys to do the work, the only way to control review costs is to review fewer documents. Experience tells us that most of the time, 80 percent of the documents collected for review are not responsive to the issues. Reading these documents is a waste of time and money. How can we avoid reviewing non-relevant documents?

As an industry, we have been looking for a solution to this problem for a long time. Over the past 10 years, our attention has turned to various forms of artificial intelligence (AI) algorithms to address the time and costs associated with document review. AI algorithms are intriguing, but they are hardly a panacea and they are certainly not for everyone or every matter. More to the point, AI is not the only way to dramatically reduce document review costs.

In this article, we will demonstrate how to reduce the cost of document review by 85 percent without AI. No, that isn’t a typo — it is indeed possible to find every relevant document in a collection by reading just 15 percent of the collection — and that’s 15 percent after de-duplication and metadata filtering. You can get these results using mathematics and technology that have been around for more than 50 years (or in the case of the math, hundreds of years).

The process is built on three mathematical principles: certainty, probability and sampling. Taken step by step, they work to dramatically cut the cost of document review.

The process starts by determining certainty. How certain must we be that all relevant documents have been found (or produced)? From a production perspective, the goal is “reasonableness”. If we are 85 percent certain to have produced every relevant (non-privileged) document, is that good enough? Maybe the sensitivity of the matter requires that we be 95 percent or even 98 percent certain. Identifying the certainty level is the first step of the process.

Next, probability enters the process. Probability determines how much work we will have to do to uncover the relevant documents.

There are two probabilistic events to consider. First, how long will it take reviewers to come across the first relevant document? And, having seen the first one, how long before the appearance of the next one? Another way of saying this is: How many non-relevant documents will reviewers have to slog through before they see the next relevant document? To save time and money, we want to do everything we can to shorten the distance between the appearance of relevant documents.
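To put rough numbers on this first event, here is a minimal sketch. It assumes documents are served in random order and that a constant fraction of the collection is relevant, so the wait between relevant documents is approximately geometric (a reasonable approximation for large collections). The function names are ours and purely illustrative.

```python
def expected_gap(prevalence: float) -> float:
    """Mean number of non-relevant documents read before the next relevant one.

    Assumes each randomly served document is relevant with probability
    `prevalence`, independently of the others (geometric waiting time).
    """
    if not 0 < prevalence <= 1:
        raise ValueError("prevalence must be in (0, 1]")
    return (1 - prevalence) / prevalence


def prob_gap_at_least(prevalence: float, k: int) -> float:
    """Probability that the next k randomly served documents are all non-relevant."""
    if not 0 <= prevalence <= 1:
        raise ValueError("prevalence must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return (1 - prevalence) ** k
```

When 20 percent of the collection is relevant, `expected_gap(0.20)` works out to four non-relevant documents for every relevant one. As relevant documents are tagged and removed from the queue, the prevalence among what remains falls, and the expected gap grows.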

The second probabilistic event concerns the likelihood that a given document is unique. When we see a relevant document, what is the probability that other documents in the collection are relevant for exactly the same reason as this one? Or put another way: How many other documents in the collection contain the same relevant language as the document under review?

For the third and final step in the process we introduce sampling. A review is complete when we can prove — to the level of certainty agreed upon — that all relevant documents have been identified. This is a simple but hugely consequential statement.
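One common way to size such a test is zero-hit acceptance sampling, sketched below. This is our illustration, not necessarily the exact test used in practice. It assumes a simple random sample from the documents not tagged as relevant. It also assumes a tolerated residual rate (`max_miss_rate`, a hypothetical parameter), because a sample can only bound the rate of missed documents; it cannot prove the rate is exactly zero.

```python
import math


def zero_hit_sample_size(certainty: float, max_miss_rate: float) -> int:
    """Smallest sample size n such that, if the true rate of relevant documents
    in the untagged pile were `max_miss_rate` or higher, a random sample of n
    would contain at least one relevant document with probability >= `certainty`.

    Solves (1 - max_miss_rate) ** n <= 1 - certainty for n.
    """
    if not 0 < certainty < 1:
        raise ValueError("certainty must be strictly between 0 and 1")
    if not 0 < max_miss_rate < 1:
        raise ValueError("max_miss_rate must be strictly between 0 and 1")
    return math.ceil(math.log(1 - certainty) / math.log(1 - max_miss_rate))
```

For example, at 95 percent certainty and a tolerated residual rate of 1 percent, the function returns 299. If 299 randomly chosen untagged documents contain no relevant ones, the pile passes the test at that level. Raising the certainty or lowering the tolerated rate increases the sample size.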

Given these principles, the workflow runs as follows:

1. Determine an acceptable level of certainty.
2. Organize the review so that the “Next Document” selection is driven by a random algorithm. (This is 19th-century mathematics for finding needles in a haystack.)
3. When a reviewer sees a relevant document, have the reviewer highlight the language that makes the document relevant.
4. Using a simple Boolean search engine, find and tag every document in the collection that contains the highlighted language. (Boolean search technology is more than 50 years old.)
5. Stop the review when the distance between the appearances of relevant documents correlates to having reached the certainty level.
6. Formally test the pile of documents not tagged as relevant to prove that all relevant documents have been identified to the desired level of certainty. (Sampling has its roots in the Bible where it was called “drawing lots.”)
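
The steps above can be sketched in code. This is a simplified illustration under stated assumptions, not the production workflow. The reviewer is simulated by two caller-supplied functions (`is_relevant` and `highlight`, both hypothetical). The "Boolean search" is a plain case-insensitive phrase match. The stopping rule is a fixed run of consecutive non-relevant documents (`stop_gap`), which stands in for whatever threshold the chosen certainty level implies.

```python
import random


def run_review(documents, is_relevant, highlight, stop_gap, seed=0):
    """Simulate steps 2-5 of the workflow.

    documents   : dict mapping document id -> text
    is_relevant : callable(text) -> bool, stands in for the reviewer's judgment
    highlight   : callable(text) -> str, the phrase the reviewer marks
    stop_gap    : stop after this many consecutive non-relevant documents (assumed rule)
    Returns (set of ids tagged relevant, number of documents actually read).
    """
    rng = random.Random(seed)
    queue = list(documents)
    rng.shuffle(queue)                 # step 2: random "Next Document" order

    tagged = set()
    read = 0
    gap = 0
    for doc_id in queue:
        if doc_id in tagged:
            continue                   # already tagged by an earlier phrase search
        read += 1
        text = documents[doc_id]
        if is_relevant(text):
            tagged.add(doc_id)
            phrase = highlight(text).lower()   # step 3: reviewer highlights language
            if phrase:
                # step 4: tag every document that contains the highlighted phrase
                for other_id, other_text in documents.items():
                    if phrase in other_text.lower():
                        tagged.add(other_id)
            gap = 0
        else:
            gap += 1
            if gap >= stop_gap:        # step 5: simplified stopping rule
                break
    return tagged, read


def sample_untagged(documents, tagged, is_relevant, sample_size, seed=0):
    """Step 6: draw a random sample from the untagged pile and count the
    relevant documents found in it (zero is the result we hope to see)."""
    rng = random.Random(seed)
    pile = [doc_id for doc_id in documents if doc_id not in tagged]
    sample = rng.sample(pile, min(sample_size, len(pile)))
    return sum(1 for doc_id in sample if is_relevant(documents[doc_id]))
```

The saving comes from step 4: one reviewer decision can tag many documents at once, so those documents never have to be read. In practice, the reviewer functions would be replaced by a review interface, and the phrase match by the search engine already in use.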

The introduction of probability and certainty allows human reviewers to accurately identify responsive data in large populations while examining only a small percentage of the total data population. This workflow has been executed on many dozens of cases and has never failed to deliver.

### Andy Kraftsow

RenewData’s Chief Scientist Andy Kraftsow leads the company’s efforts to develop groundbreaking technologies. Trained as a mathematician (and a CPA), Kraftsow is one of the...