Analyzing data: Why a “bigger is better” mentality may be at odds with intelligent information governance

The Five Vs keep in-house counsel from turning “big data” into “bad data”

In Texas, there’s a deeply held belief that if it’s bigger, it’s better. Just look at the 159’ by 71’ big-screen TV in the Cowboys’ new football stadium as a prime example of the prevalent “go big, or go home” mentality. But it’s not just Texas that’s enamored with this “bigger is better” type of thinking. Many IT professionals focusing on the new “big data” craze follow the mantra that if a lot of data is good, even more must be better.

Agreement on an exact definition of “big data” is hard to come by, especially since much of the discussion is being driven by the vendors that sell the enabling technology. That said, big data was concisely defined in a recent New York Times article as a “shorthand label that typically means applying the tools of artificial intelligence, like machine learning, to vast new troves of data beyond that captured in standard databases.” Most big data definitions go on to reference the three Vs: volume, velocity and variety. Often overlooked, however, are two additional Vs, value and veracity, which are critical in an information governance and legal context. To harmonize the five Vs of big data, it’s important to examine each one in turn.

The Five Vs of “Big Data”

  1. Volume: Volume, not surprisingly, is the hallmark of the big data concept. Since data creation doubles every 18 months, we’ve rapidly moved from a gigabyte world to a universe where terabytes and exabytes rule the day.  In fact, according to a 2011 report from the McKinsey Global Institute, numerous U.S. companies now have more data stored than the U.S. Library of Congress, which has more than 285 terabytes of data (as of early this year). And to complicate matters, this trend is escalating exponentially with no reasonable expectation of abating. 
  2. Velocity: According to the analyst firm Gartner, velocity can be thought of in terms of “streams of data, structured record creation, and availability for access and delivery.” In practical terms, this means organizations must constantly address a torrential flow of data into and out of their information management systems. Take Twitter, for example, where it’s possible to see more than 400 million tweets per day. As with the first V, data velocity isn’t slowing down anytime soon either.
  3. Variety: Perhaps more vexing than both the volume and velocity issues, the variety element of big data increases complexity exponentially, because organizations must account for many data sources and types, each growing and moving in its own direction. To name just a few variants, most organizations must routinely wrestle with structured data (databases), unstructured data (loose files/documents), email, video, static images, audio files, transactional data, social media, cloud content and more.
  4. Value: A more novel big data concept, value hasn’t traditionally been part of the standard definition. Here, the critical inquiry is whether the retained information is valuable, either individually or in combination with other data elements that together can reveal patterns and insights. Given the rampant existence of spam, nonbusiness data (like fantasy football emails) and duplicative content, it’s easy to see that just because data may have the other three Vs, it isn’t inherently valuable from a big data perspective.
  5. Veracity: Particularly in an information governance era, it’s vital that the big data elements have the requisite level of veracity (or integrity). In other words, specific controls must be put in place to ensure that the integrity of the data is not impugned. Otherwise, any subsequent usage (particularly for a legal or regulatory proceeding, like e-discovery) may be unnecessarily compromised.

When the five Vs are considered in concert and cutting-edge analytical software is applied, the promise of “big data” starts to be revealed. In healthcare, for example, researchers are employing big data analytics to analyze factors in multiple sclerosis in search of personalized treatments. Similarly, healthcare professionals are mining large genomic databases to find the best ways to treat cancer. Many of these previously unheard-of insights are coming from novel data sources (new varieties) like web-browsing data trails, social network communications, sensor data and surveillance content.

And yet, given the relatively narrow range of existing big data use cases (retail trending, advertising insights, healthcare data-mining, etc.), most organizations should still carefully assess the value of information before blindly provisioning another terabyte of storage simply on the premise that big data insights might be possible. While there are clearly nuggets to be mined in this new big data era, these analytical insights don’t come without potential costs and risks.

Sadly, many organizations aren’t cognizant of the lurking tension between the rapid acceleration of big data initiatives and competing corporate concerns such as information governance. Latent information risk is a byproduct of keeping too much data; the resulting exposure comes from e-discovery costs and sanctions, potential security breaches and regulatory investigations. As evidence of this potential information liability, consider that it costs only $0.20 a day to manage 1GB of storage. Yet, according to a recent Rand survey, it costs $18,000 to review that same gigabyte for e-discovery purposes.
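To put those two figures side by side, the short Python sketch below offers a back-of-the-envelope comparison rather than a cost model. It relies only on the two numbers quoted above (20 cents per day to manage a gigabyte and $18,000 to review it); the function names and the 365-day year are assumptions made purely for illustration.

```python
# Figures quoted in the article; treat them as illustrative inputs only.
STORAGE_COST_PER_GB_PER_DAY = 0.20  # dollars per day to manage 1 GB
REVIEW_COST_PER_GB = 18_000.00      # dollars to review 1 GB for e-discovery
DAYS_PER_YEAR = 365                 # assumption: leap years ignored


def breakeven_days(review_cost: float, daily_storage_cost: float) -> float:
    """Days of storage spending that add up to one review of the same gigabyte."""
    if daily_storage_cost <= 0:
        raise ValueError("daily_storage_cost must be positive")
    return review_cost / daily_storage_cost


def breakeven_years(review_cost: float, daily_storage_cost: float) -> float:
    """The same comparison expressed in years of storage."""
    return breakeven_days(review_cost, daily_storage_cost) / DAYS_PER_YEAR


if __name__ == "__main__":
    days = breakeven_days(REVIEW_COST_PER_GB, STORAGE_COST_PER_GB_PER_DAY)
    years = breakeven_years(REVIEW_COST_PER_GB, STORAGE_COST_PER_GB_PER_DAY)
    print(f"One review costs as much as {days:,.0f} days of storage")
    print(f"That is roughly {years:.1f} years")
```

On these figures, a single e-discovery review of one gigabyte costs about as much as 90,000 days (roughly 247 years) of managing it, which is why the storage bill alone understates the liability of data kept without a purpose.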

To combat these risks and costs, many entities have deployed information archives as a way to attack the data deluge, periodically deleting data when legally permissible. It is this necessary and laudable goal of defensible deletion/expiration that can be at odds with concepts like big data. The challenge for many organizations comes down to a rather straightforward exercise: weighing the potential risk of keeping too much information against the prospective value of mining that information for a given big data project. Even in the absence of big data analytics, this type of risk/reward inquiry is at the core of the information governance dilemma that every organization faces. At least with the potential value big data can generate, organizations have a better chance to reap some value out of the terabytes of data that many have been mindlessly keeping in perpetuity.

In the end, it is critical to have a laser focus on the fourth V (value) to ensure that data that won’t be mined or analyzed isn’t kept any longer than can be justified by other business needs or applicable regulations. Retaining meaningless data that has no big data potential threatens to turn big data into “bad data” that merely increases information risk.
