Keyword Searching Has Limitations in E-Discovery

When systems provider Autonomy searched Enron's electronically stored information (ESI) in a demonstration of a new method it calls meaning-based computing, the search for the term "book loss" turned up code names the now-defunct energy company had used to hide its financial misdeeds.

"We found code words Raptor, Roadrunner, Porcupine, Pronghorn and Grizzly used to name some of the off-balance sheet vehicles, partnerships and transactions that brought Enron down," says Jack Halprin, Autonomy's vice president of e-discovery.

The Autonomy system associated the code words with the search term even though none of them were specified in advance.

Halprin contends the demonstration shows the superiority of meaning-based computing over the traditional method, keyword searching, which returns only exact matches and misses documents that contain unspecified synonyms.

Software companies are also touting new ESI searching systems such as predictive coding, leverage review analytics and pattern recognition technology that they say will replace keyword searching. Some companies have trademarked these names and filed patent applications, anticipating demand for more efficient technologies to address exponential ESI growth.


Keyword Limitations

Although keyword searching remains the most common method of information retrieval for e-discovery, regulatory requests and other projects, it has limitations. It cannot use context to distinguish among the different meanings of a word. For example, if you were dealing with a spoliation claim and looking for the term "shredding," as in shredding paper, you might turn up irrelevant documents referring to shredding food or shredding the slopes on a snowboard.
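The limitation can be seen in a few lines of Python. The sketch below is illustrative only: the `keyword_search` function and the four sample documents are invented for this example and do not represent any vendor's product.

```python
import re

def keyword_search(documents, term):
    """Return the indexes of documents that contain `term` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [i for i, text in enumerate(documents) if pattern.search(text)]

# Hypothetical documents made up for this illustration.
documents = [
    "Please start shredding the audit files before Friday.",
    "Recipe: begin by shredding the cabbage and carrots.",
    "We spent the weekend shredding the slopes on our snowboards.",
    "Destroy the paper records from the audit before the subpoena arrives.",
]

hits = keyword_search(documents, "shredding")
print(hits)
```

The search returns the first three documents, two of which concern food and snowboarding. It misses the fourth, which describes destroying records without ever using the search term.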

The new systems search for patterns of statistically significant co-occurrence of terms, says Halprin. For example, they can return "hits" for the word "dog" if a document includes "man's best friend."
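A rough sense of how co-occurrence can surface related language is given by the Python sketch below. It is a toy under stated assumptions: it counts raw co-occurrence within a document and applies no test of statistical significance, and the function names, stop-word list, sample documents and `min_shared` threshold are all hypothetical. Commercial systems are certainly more sophisticated.

```python
import re
from collections import Counter

# Small hypothetical stop-word list so common words do not count as related terms.
STOP_WORDS = {"a", "an", "the", "is", "at", "by", "of", "to", "and"}

def tokenize(text):
    """Lowercase the text and return its set of non-stop-word terms."""
    return {t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS}

def related_terms(documents, query):
    """Count the terms that appear in the same documents as `query`."""
    counts = Counter()
    for text in documents:
        terms = tokenize(text)
        if query in terms:
            counts.update(terms - {query})
    return counts

def concept_search(documents, query, min_shared=2):
    """Return documents containing `query` or at least `min_shared` related terms."""
    related = set(related_terms(documents, query))
    hits = []
    for i, text in enumerate(documents):
        terms = tokenize(text)
        if query in terms or len(terms & related) >= min_shared:
            hits.append(i)
    return hits

# Hypothetical documents made up for this illustration.
documents = [
    "The dog barked at the mail carrier.",
    "A dog is man's best friend.",
    "Man's best friend waited by the door.",
    "The quarterly report is due Friday.",
]

keyword_hits = [i for i, text in enumerate(documents) if "dog" in tokenize(text)]
concept_hits = concept_search(documents, "dog")
print(keyword_hits, concept_hits)
```

In this sketch the third document never mentions "dog," yet it is returned because it shares the terms "man's," "best" and "friend" with a document that does.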

"A keyword search would generally find 20 to 50 percent of relevant documents. An approach that is right 25 to 50 percent of the time simply does not work," says Craig Carpenter, vice president of marketing and general counsel for Recommind, which is marketing predictive coding systems. "True predictive coding is supersonic air travel whereas culling data [removing items like duplicate files, system files or those outside a date range] is riding a tricycle."

Others say the new technologies haven't yet been sufficiently tested. "Predictive coding is still in its infancy, and I would continue to be wary of vendor claims until the baby has walked for a while. But as we struggle to manage vast amounts of electronic data, it looks like predictive coding may prove to be an invaluable tool," Sharon Nelson, president of computer forensics firm Sensei Enterprises, stated on her electronic-evidence blog, "Ride the Lightning."

System Training

The new systems are "trained" to locate relevant documents. A reviewer might read a sampling of 100 documents from an ESI collection and determine that 80 are relevant. The reviewer returns all 100 coded documents to the system, which uses them to collect other documents that address the same or similar issues as the 80. From these examples, the system begins to determine a pattern.

Other documents continue to be reviewed manually until the review team is satisfied with the accuracy the system has achieved in returning relevant documents. The system then searches the remainder of the ESI collection without manual review, resulting in huge reductions in billable time.
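The review-and-train cycle described above can be sketched in Python. This is a simplified illustration, not a description of any vendor's method: the term-weight model, the batch size, the rule of stopping after one perfectly predicted batch, the sample documents and every name in the code are assumptions made for the example.

```python
from collections import Counter

def train(reviewed):
    """Weight each term by relevant appearances minus irrelevant appearances."""
    weights = Counter()
    for text, relevant in reviewed:
        for term in set(text.lower().split()):
            weights[term] += 1 if relevant else -1
    return weights

def predict(weights, text):
    """Call a document relevant when its term weights sum to more than zero."""
    return sum(weights[t] for t in set(text.lower().split())) > 0

def review_loop(collection, human_label, batch_size=2, target_accuracy=1.0):
    """Review batches by hand until the model predicts a batch accurately enough."""
    reviewed, weights, position = [], Counter(), 0
    while position < len(collection):
        batch = collection[position:position + batch_size]
        position += len(batch)
        labels = [human_label(text) for text in batch]  # manual review
        trained = bool(reviewed)  # an untrained model is never trusted
        correct = sum(predict(weights, t) == l for t, l in zip(batch, labels))
        reviewed.extend(zip(batch, labels))
        weights = train(reviewed)
        if trained and correct / len(batch) >= target_accuracy:
            break
    # The remainder of the collection is coded without manual review.
    auto_coded = {text: predict(weights, text) for text in collection[position:]}
    return reviewed, auto_coded

# Hypothetical collection; TRUTH stands in for the human reviewer's judgment.
TRUTH = {
    "raptor partnership hedge losses": True,
    "lunch menu for friday": False,
    "raptor vehicle book loss": True,
    "parking garage closed friday": False,
    "hedge losses in partnership accounts": True,
    "holiday party menu": False,
    "book loss on raptor hedge": True,
    "garage parking permits": False,
    "partnership vehicle losses": True,
    "friday lunch party": False,
}

reviewed, auto_coded = review_loop(list(TRUTH), TRUTH.__getitem__)
print(len(reviewed), "reviewed by hand;", len(auto_coded), "coded automatically")
```

A real review team would validate the system against far larger samples before trusting it with the rest of a collection. The single-batch stopping rule here exists only to keep the example short.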

Scott Kane, partner and co-chair of the e-discovery and data management team at Squire Sanders & Dempsey, says his system was fully trained after 49 machine passes and human review of 1,960 documents, which required just 13 hours of attorney time. "[The system determined] that nearly 75 percent of the document collection did not require any review, much less second pass review for legal privilege, because they were simply not relevant to the dispute," says Kane.

Search solutions can also be combined. Halprin's company integrates keyword and conceptual searches with other methods. A report from Integreon's E-Discovery Research Roundtable states that an electronic search, which can reduce a data set by as much as 80 percent, "does not depend on any one technology and can range from sophisticated Boolean searches to advanced semantics techniques."

Market Acceptance

The eDiscovery Institute Survey on Predictive Coding, released in October 2010, asks, "Given the claimed advantages for predictive coding, why isn't everyone using it?" The most frequently mentioned reason was uncertainty about whether judges would accept predictive coding as a reasonable and defensible effort to identify responsive documents (see "Courtroom Commentary").

Other reasons cited for the slow adoption were lack of awareness and law firms' insensitivity to the costs of inefficiency. The respondents also indicated that the value of predictive coding was higher in large-volume cases with short deadlines.

Some experts predict that the first true market acceptance of the technology may be limited to first pass review.

Foster Gibbons, director of document review services at Integreon, a legal outsourcing firm, says predictive coding won't eliminate human review but will make it more efficient. "I don't see that lawyers will rely entirely on predictive coding ... in a case with hundreds of potential issues," he says.

Michael Kozubek
