The guarded court approval in Da Silva Moore v. Publicis Groupe opened the door to wider use of predictive coding in document productions. In the decision, Magistrate Judge Peck stated that he was more interested in validation of the process and results than in the “black box” of the vendor’s software, that is, in how the software actually categorizes documents. This is a valid observation in context. However, in adopting any new product, there are benefits to at least peripherally understanding the underlying technology.
Predictive coding is a specific application of the computer science field of machine learning—in particular, supervised learning and natural language processing. This is not new technology; it is already at work in everyday applications. Most spam filters use algorithms to determine whether emails are legitimate or unwanted solicitations. When users reclassify emails, the filters learn to categorize better. Another common use is contextual web advertisement placement, in which ad placement is based on the content displayed to the user. For example, viewing a sports-related website results in ads from sports-related companies.
In probabilistic latent semantic analysis (PLSA), documents are categorized by detecting concepts through a statistical analysis of word contexts. Documents are grouped based on the probabilities with which words occur together. Beyond PLSA, a number of other algorithms generate correlations and categorizations. In essence, these algorithms determine how often certain data—typically words, authors, recipients, names and places—occur together in pre-classified documents and use those patterns to categorize other documents.
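Stripped of the probabilistic machinery, the co-occurrence idea can be illustrated simply. The sketch below—hypothetical documents and categories, and a deliberate simplification of what algorithms like PLSA do—tallies word counts from documents humans have already classified, then assigns a new document to the category whose vocabulary it shares most:

```python
from collections import Counter

# Documents a human reviewer has already placed in each category.
pre_classified = {
    "responsive": ["merger term sheet draft",
                   "board approval of merger terms"],
    "non_responsive": ["office holiday party schedule",
                       "parking garage access form"],
}

# Build a word-frequency profile per category from the seed set.
profiles = {
    cat: Counter(w for doc in docs for w in doc.split())
    for cat, docs in pre_classified.items()
}

def categorize(text):
    # Score = how often the category's seed documents used this
    # document's words; highest overlap wins.
    words = text.split()
    scores = {cat: sum(prof[w] for w in words)
              for cat, prof in profiles.items()}
    return max(scores, key=scores.get)

print(categorize("revised merger terms for board review"))
# prints "responsive"
```

Production systems replace the raw counts with probabilistic models and account for authors, recipients, and other metadata, but the mechanism is the same correlation-counting described above.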
The contextual analysis performed by these algorithms, whether implementing spam filters, ad placement or document coding, corresponds to a certain degree to the same factors considered by human coders. Documents authored or received by certain individuals are more important than those authored or received by others. The subject matter of a document is signaled by which words are used in certain combinations. The difference is that a human reviewer can comprehend the meaning of the words, while a computer can only mathematically analyze correlations based on pre-determined data, that is, documents humans have already categorized. Hence, for the computer, the quality of the analysis depends heavily on the quality of the sample set and the human categorizations.
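The dependence on sample quality can be demonstrated concretely. In this hypothetical sketch (illustrative documents and categories, using the same simple word-overlap scoring as above), a single seed document miscategorized by a human reviewer flips the machine's call on a new document:

```python
from collections import Counter

def profiles(pre_classified):
    # Word-frequency profile per category, built from the seed set.
    return {cat: Counter(w for doc in docs for w in doc.split())
            for cat, docs in pre_classified.items()}

def categorize(text, profs):
    # Assign the category whose seed vocabulary overlaps the most.
    return max(profs, key=lambda c: sum(profs[c][w] for w in text.split()))

good_seed = {
    "privileged": ["counsel advice on merger litigation"],
    "not_privileged": ["lunch order for the merger team"],
}
bad_seed = {
    # Same documents, but the reviewer mislabeled the lunch email.
    "privileged": ["counsel advice on merger litigation",
                   "lunch order for the merger team"],
    "not_privileged": ["parking validation request"],
}

doc = "merger team lunch plans"
print(categorize(doc, profiles(good_seed)))  # prints "not_privileged"
print(categorize(doc, profiles(bad_seed)))   # prints "privileged"
```

The algorithm itself is unchanged between the two runs; only the human input differs, which is why validation of the sample set matters more than the particulars of the "black box."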