It’s 10 P.M. – Do You Know Where Your Data Is?

Do you remember the “Perfectly Normal Beast” from Douglas Adams’ Hitchhiker’s Guide series? If you haven’t read about it (or don’t remember), the Perfectly Normal Beast is a fictional creature — kind of like a buffalo — that migrates twice a year across the fictional “Anhondo Plain.”

It’s (ironically) called “perfectly normal” because it spontaneously appears at one end of the migration path, thunders across the plain in a tremendous stampede and spontaneously vanishes at the other end — nobody knows where it comes from and nobody knows where it goes to. This is, in my opinion, a great metaphor for the flow of data within the typical enterprise.

Here’s what I mean. From an IT perspective, keeping track of data is like keeping track of the Perfectly Normal Beast: It enters the scope of our awareness when it hits the systems and applications that we’re responsible for, it flows across those systems like a thundering herd, and ultimately it “vanishes” when it leaves those systems to flow to areas outside IT control (vendors, partners or applications outside of IT).

Extremely Challenging

While the data is inside the boundaries, it’s pretty hard to miss — but trying to keep track of it before it enters or after it leaves those boundaries is extremely challenging. Unlike the Perfectly Normal Beast though, it doesn’t just enter at one point and vanish at another. Instead, it’s even more complex because there are multiple places where it might originate from and multiple points along the path where it might leave our scope of control and awareness.

From the IT side of things, the fact that data behaves this way makes our lives pretty difficult — specifically, in most respects the data is still our responsibility even when it’s outside our scope of awareness. All of us in IT have felt the pain of addressing what seems to be a never-ending parade of regulations — PCI (Payment Card Industry Security Standards), HIPAA (Health Insurance Portability and Accountability Act), SOX (Sarbanes-Oxley Act), breach disclosure, e-Discovery — the list goes on and on.

All these regulations have one core thing in common: They all require that an organization protect the regulated data throughout the entire lifecycle. Breach disclosure laws, for example, don’t just require that we notify in the event that data is lost within our systems; they also require that we notify in the event that our outsourcing partner loses it, or our outsourcing partner’s partner loses it. In short, the regulations we’re required to meet presuppose that we know where our firm’s data is, but actually keeping track of the data our firms process day-to-day is an extremely challenging proposition.

So Where Is the Data Anyway?

Believe it or not, most firms don’t know exactly where their data is — at least not throughout the entirety of the lifecycle. Large organizations, for example, may have systems of such tremendous complexity, and so many interaction points between systems and applications, that maintaining an inventory of where the data is throughout the entire lifecycle is complex in the extreme.

Smaller firms, while potentially having a much smaller number of systems and interaction points between systems, also have fewer staff available to deal with and attend to keeping track of data within the organization. Both large and small firms also have to address the issue of locating places where data exists “under the radar”; for example, QA (quality assurance) systems that use a copy of production data for test, developers who might make a copy of data elements to test transaction flows, and staff who might send or receive data via unapproved means to get the job done.

That’s just inside our firms. How many places does the typical firm share data with vendors and partners? Most likely, it’s quite a few. Those third parties we share data with might, in turn, have data-sharing relationships with others; they might, for example, subcontract work or outsource certain processes — they might share access to network resources where our data is resident.

Add to this mix the fact that technology is constantly changing, and keeping track of the data gets even more complex: new applications being deployed, new systems being released, and business processes being refined and adjusted all make it harder to maintain an accurate picture. Realistically speaking, by default, most organizations don’t have the time or resources to keep track of all the places where the data comes from and where it goes.

So What Can We Do?

Many organizations have spent quite a bit of time and money looking for solutions to this problem. They may have invested, for example, in automated approaches and products designed to locate and keep track of data as it travels through systems in their enterprises. They may have updated policy and procedures to ensure that data (particularly regulated or sensitive data) is labeled and classified appropriately throughout the firm or they may have spent time doing process mapping to document the processes in place that may “touch” this data.

However, each of these approaches in isolation leaves some serious gaps. Specifically, automated approaches tend to locate only data within the infrastructure under IT control; they won’t, for example, locate the areas where data might exist in hard copy or track data through processes that are under the control of a vendor or partner.

Procedural approaches such as updating policy for data classification have the disadvantage that they require humans to understand and follow the policy. Individuals might, for example, forget to apply the policy during system development, or they may run into situations (such as deployment of COTS, or commercial off-the-shelf, solutions) where the product doesn’t support appropriate classification and labeling.

Finally, “paper-based” analyses centered around documenting data flow are only as accurate as they are current: technology and process change rapidly, which makes this type of documentation difficult to keep up to date.

A Blended Approach

Given the shortcomings of these methods in isolation, one useful strategy is to use a blended approach. Update policy to require data classification and data labeling; in addition, have legal and/or purchasing write language into new contracts requiring that vendors do the same.

Attempt to strategically use other large-scope process-related efforts — such as business impact analysis done for BCP/DR (business continuity planning/disaster recovery) purposes — to gather information about where data currently exists and to document the flow of data within the firm. Couple these approaches with a technical solution to “tip off” IT in the event that new technology or new processes have an impact on how and where data is stored within the firm, and ensure that the data from automated data cataloging tools are in sync with the view provided by the documentation efforts.
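To make the “in sync” check concrete, here is a minimal sketch in Python of comparing the documented data map against what an automated cataloging tool reports. Everything in it is an assumption made for illustration: the `reconcile` function, the data categories and the location names are hypothetical and don’t come from any particular product.

```python
def reconcile(documented, discovered):
    """Compare two views of where data lives.

    Both arguments map a data category to a set of locations.
    Returns only the categories where the two views disagree.
    """
    gaps = {}
    for category in sorted(set(documented) | set(discovered)):
        doc = documented.get(category, set())
        found = discovered.get(category, set())
        undocumented = found - doc   # the tool saw it, the map doesn't list it
        unverified = doc - found     # the map lists it, the tool didn't see it
        if undocumented or unverified:
            gaps[category] = {
                "undocumented": sorted(undocumented),
                "unverified": sorted(unverified),
            }
    return gaps


# Hypothetical inputs: the documented map (say, from process mapping or
# a business impact analysis) and the output of an automated scan.
documented = {
    "cardholder_data": {"payments-db", "vendor:card-processor"},
    "patient_records": {"ehr-db"},
}
discovered = {
    "cardholder_data": {"payments-db", "qa-db"},
    "patient_records": {"ehr-db"},
}

gaps = reconcile(documented, discovered)
for category, gap in gaps.items():
    print(category, gap)
```

In this made-up example, the QA copy of cardholder data shows up as undocumented, which is the kind of “under the radar” location the documentation effort would miss. The vendor location shows up as unverified, which is expected: an automated tool typically can’t see outside the infrastructure under IT control, so those entries have to be confirmed some other way, such as through contract language.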

Most importantly, ensure that there’s someone with ownership of keeping the data “map” current — nothing goes by the wayside faster than something that nobody owns.

Ed Moyle is currently a manager with CTG’s information security solutions practice, providing strategy, consulting and solutions to clients worldwide, as well as a founding partner of Security Curve. His extensive background in computer security includes experience in forensics, application penetration testing, information security audit and secure solutions development.
