Technology: Getting defensive about predictive document sorting technology

The fight over predictive coding resembles Boolean disputes

The efficacy and defensibility of so-called “predictive coding” have been hot topics lately. Magistrate Judge Andrew Peck stated that he “has approved of the use of computer-assisted review” in Da Silva Moore v. Publicis Groupe, and Magistrate Judge Nan Nolan is conducting a still-ongoing evidentiary hearing in Kleen Products v. Packaging Corp. of America, in which the plaintiffs seek to force the defendants to start over with this technology after they already used a traditional Boolean methodology.

Judge Peck’s ruling in Da Silva Moore certainly is positive about statistical document sorting technology. These “predictive” workflows leverage new technology that learns about the substantive relationships between documents based on coding decisions made by humans. But the ruling comes from a case in which the parties, at least initially, already agreed to use the technology and were arguing only over the particulars of the protocol to be followed. Moreover, the order does not adopt or approve of any particular protocol, tool or technology. Judge Peck resolved certain disputes about how to proceed with the process initially, but he reserved judgment on whether those initial steps would be sufficient until after those steps are completed. Contrary to what some have suggested, the holding is quite limited. But there is an important takeaway: Resolution of a dispute over how to use statistical document sorting technology and “predictive” workflows looks much like the resolution of a dispute over how to use Boolean technology.

In traditional arguments over Boolean searching, the parties come into court with competing proposals for searches, hit rates, unique hit rates, document counts, and sometimes samplings of documents resulting from the disputed search terms. These quantitative metrics are used in conjunction with qualitative arguments about the probable importance of various proposed search terms to advocate for competing search term lists. The judge exercises his wide discretion in discovery matters, splits the baby and tells the parties to come back if they still have disputes after doing what the court has ordered.
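For readers who have not seen these metrics computed, the following is a minimal Python sketch of a hit report. It is an illustration only: the function name `hit_report`, the sample documents and the case-insensitive substring matching are all assumptions made for this example, not the behavior of any particular review platform. A "unique hit" here means a document returned by one term and by no other term on the list.

```python
def hit_report(documents, terms):
    """Summarize hits per search term across a document collection.

    Matching is a case-insensitive substring test, a simplified stand-in
    for a real Boolean search engine (assumption for illustration only).
    """
    # Map each term to the set of document indexes it hits.
    hits = {
        term: {i for i, text in enumerate(documents) if term.lower() in text.lower()}
        for term in terms
    }
    report = {}
    for term in terms:
        # Collect documents hit by any OTHER term on the list.
        other_hits = set()
        for other in terms:
            if other != term:
                other_hits |= hits[other]
        unique = hits[term] - other_hits
        report[term] = {
            "hits": len(hits[term]),
            "hit_rate": len(hits[term]) / len(documents) if documents else 0.0,
            "unique_hits": len(unique),
        }
    return report


if __name__ == "__main__":
    # Hypothetical four-document collection.
    sample_docs = [
        "Price increase memo",
        "Lunch menu",
        "Price and capacity plan",
        "Capacity report",
    ]
    for term, stats in hit_report(sample_docs, ["price", "capacity"]).items():
        print(term, stats)
```

A term with many hits but few unique hits adds little beyond the rest of the list, which is the kind of quantitative point the parties typically pair with their qualitative arguments.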

Likewise, in Da Silva Moore the parties came into court with competing proposals to “stabilize the training of the software” and to create the initial seed set used to train the software. Instead of hit rates and such, the arguments focused on “statistical confidence levels,” how many “iterative rounds” of human coding should be done to adequately teach the algorithm, how many documents humans should review and code in each of those rounds, and at what point the algorithm should be trusted to have found substantially all of the important documents that the humans will then review and code. In addition to the quantitative metrics, there were qualitative arguments, such as whether the defendant should review all or only some of the documents the computer returns in the final round. Judge Peck exercised his wide discretion in discovery matters, split the baby and told the parties to come back if they still had disputes after doing what the court ordered. So the first hotly contested ruling on the use of statistical document sorting technology and “predictive” workflows looks much like the more familiar disputes over Boolean methods. This should give comfort to new adopters of this technology.
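To make “statistical confidence levels” concrete, here is a hedged Python sketch of the textbook sample-size formula for estimating a proportion, with an optional finite population correction. The function name `sample_size` and the figures in the usage lines (95 percent confidence, a 2 percent margin of error, a 3 million document collection) are assumptions chosen for illustration; they are not taken from the order or from either party's proposal.

```python
import math
from statistics import NormalDist


def sample_size(confidence, margin, population=None, prevalence=0.5):
    """Documents to sample to estimate a proportion within +/- margin.

    confidence: e.g. 0.95 for a 95 percent confidence level.
    margin:     e.g. 0.02 for a margin of error of 2 percentage points.
    population: total documents; if given, applies the finite
                population correction.
    prevalence: assumed share of responsive documents; 0.5 is the
                most conservative choice (largest sample).
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n = (z ** 2) * prevalence * (1 - prevalence) / margin ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    print(sample_size(0.95, 0.02))
    print(sample_size(0.95, 0.02, population=3_000_000))
```

The point of the sketch is that the required sample depends mostly on the confidence level and margin of error and only slightly on the size of the collection, which is why disputes tend to center on those two numbers.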

One concern in this case is that the defendant agreed to allow the plaintiff to review the documents that the reviewers code as not responsive in each iterative round. If this becomes the price for judicial permission to use these methods, then it may not be worth it.

Another workflow that leverages statistical document sorting technology is concept clustering, in which a party sorts an entire dataset into concept clusters at the outset. Senior associates review each concept cluster, exclude those that do not promise to be relevant and promote to traditional linear review only the clusters that do. This process may be significantly more defensible because humans perform due diligence on each concept cluster: clicking into the cluster, skimming the metadata fields in the list view and sampling documents as needed. This is analogous to the time-honored process lawyers and paralegals use when reviewing boxes in warehouses. If a box could be skimmed and found to be irrelevant, it would be set aside without reviewing every single page just to definitively rule out the possibility that something relevant might have been misfiled in that box. Moreover, this clustering workflow can cull 90 percent or more of a large dataset, which is better than many touted examples in which a “predictive” workflow was used.
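The arithmetic behind a cluster-level cull can be sketched in a few lines of Python. Everything here is hypothetical: the cluster names and sizes, the function names `triage_summary` and `spot_check`, and the idea of drawing a fixed-size random sample for the reviewer to skim are assumptions for illustration, not features of any specific clustering tool.

```python
import random


def triage_summary(cluster_sizes, promoted):
    """Compute how much of a dataset a cluster-level triage culls.

    cluster_sizes: mapping of cluster name -> number of documents.
    promoted:      set of cluster names sent on to linear review.
    """
    total = sum(cluster_sizes.values())
    kept = sum(size for name, size in cluster_sizes.items() if name in promoted)
    culled = total - kept
    return {
        "total": total,
        "promoted": kept,
        "culled": culled,
        "cull_rate": culled / total if total else 0.0,
    }


def spot_check(doc_ids, k, seed=0):
    """Draw up to k documents from a cluster for a reviewer to skim.

    A fixed seed keeps the draw reproducible, so the same sample can be
    shown again if the decision is later questioned.
    """
    rng = random.Random(seed)
    return rng.sample(doc_ids, min(k, len(doc_ids)))


if __name__ == "__main__":
    # Hypothetical clusters and reviewer decisions.
    clusters = {"pricing": 6000, "fantasy football": 30000, "HR newsletters": 24000}
    print(triage_summary(clusters, promoted={"pricing"}))
    print(spot_check(list(range(30000)), k=5))
```

Recording the sample drawn from each excluded cluster, along with the reviewer's decision, is one way to document the due diligence described above.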

Whichever workflow is used, do not be overly defensive about using statistical document sorting technology. It is not only cheaper, but in all likelihood will produce better results.
