My definition of this topic is “a necessary component of any big data solution with a perfect blend of skepticism and confidence by all of those involved”. Think of it not just as securing a file with permissions and capturing its name and glossary terms, but as designing sustainable architectures. Our recent blog “Make Data your BAE – Best Asset Ever!” by Roger Wahman helps justify the enablement aspect of data governance and is a great starting point for this discussion.

With data being the primary asset driving organizations, setting up controls around it at an early stage is becoming a top priority for all of us practitioners. I believe we all already understand why we want governance and security, so this article focuses on how you could start your journey down this path.

Big Data Governance: Fortunately, we now live in a world where what was once a “clash of Business and IT” has become a collaborative force, and the maturing practice itself is driving innovation in this field. We are now implementing methods to discover and profile data, tag relevant data points, and add meaningful glossary terms, all of which help us build and manage our data asset inventory. I see it driving not just analytical tasks but also the overall application design.

One of the notable open source projects in this space is Apache Atlas. Initially targeted at providing end-to-end data lineage, it is now a full-suite solution with a business catalog, cross-component lineage, and support for existing access policies. Its integration with Apache Ranger and Apache Falcon to act on defined tag-based policies is one of its most prominent features. Waterline Data is another great solution with support for numerous big data suites. It provides automated data profiling and data tagging, and it allows users to perform faceted search, maintain a metadata repository, and export this information to other platforms such as Apache Atlas. The value proposition of such technologies is strongest when they are coupled with compliance models such as FISMA, HIPAA, SOX, or PCI DSS.
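To make the idea of automated profiling and tagging more concrete, here is a minimal sketch in Python. It is not how Apache Atlas or Waterline Data work internally; the tag names, the regular expressions, the 80% threshold, and the functions `profile_column` and `tag_dataset` are all assumptions made up for illustration. The idea is simply to sample a column, test each value against a set of patterns, and attach a tag when most values match.

```python
import re

# Hypothetical tag rules (assumption): tag name -> regex a value must fully match.
TAG_PATTERNS = {
    "EMAIL": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "US_SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "CREDIT_CARD": re.compile(r"(?:\d{4}[- ]?){3}\d{4}"),
}


def profile_column(values, threshold=0.8):
    """Return the sorted tags whose pattern matches at least
    `threshold` of the non-empty values in a column sample."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return []
    tags = []
    for tag, pattern in TAG_PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.fullmatch(v))
        if hits / len(non_empty) >= threshold:
            tags.append(tag)
    return sorted(tags)


def tag_dataset(columns, threshold=0.8):
    """Map each column name to the tags inferred from its sampled values."""
    return {name: profile_column(vals, threshold) for name, vals in columns.items()}


# Made-up sample data, used only to exercise the sketch.
sample = {
    "contact": ["a@example.com", "b@example.org", "", "c@example.net"],
    "ssn": ["123-45-6789", "987-65-4321"],
    "notes": ["call back", "n/a"],
}
print(tag_dataset(sample))
```

In a real deployment, the resulting tags would be pushed to the catalog so that tag-based access policies can act on them. That step is deliberately left out here because it depends on the platform you choose.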

Big Data Security: This is one of my favorite topics of discussion, and I see it as a holistic approach to defining security across the overall architecture. The best approach is often to assume that an attack is inevitable. With DDoS and ransomware attacks on the rise, the ability to recover with minimal damage must be the basis for any solution architecture. Inter-cluster mirroring with a low recovery point objective (RPO) and recovery time objective (RTO) is the way to go. A reliable, no-loss data streaming application hosted in the cloud is another great option that most organizations are now adopting.
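As a rough illustration of what a low RPO means in practice, the sketch below checks how far a mirrored cluster lags behind its source. The dataset names, the timestamps, the 15-minute target, and the functions `rpo_status` and `datasets_breaching_rpo` are hypothetical; a real mirroring tool would report its own last successful sync time.

```python
from datetime import datetime, timedelta


def rpo_status(last_replicated_at, now, rpo):
    """Return (lag, within_rpo) for one mirrored dataset.

    The lag is the window of data that would be lost if the
    source cluster failed at `now`."""
    lag = now - last_replicated_at
    return lag, lag <= rpo


def datasets_breaching_rpo(last_sync_by_dataset, now, rpo):
    """Return the sorted names of datasets whose replication lag exceeds the RPO."""
    return sorted(
        name for name, ts in last_sync_by_dataset.items() if now - ts > rpo
    )


# Made-up values, used only to exercise the sketch.
now = datetime(2024, 1, 1, 12, 0)
rpo = timedelta(minutes=15)
last_sync = {
    "sales": datetime(2024, 1, 1, 11, 55),        # 5 minutes behind
    "clickstream": datetime(2024, 1, 1, 11, 30),  # 30 minutes behind
}

print(rpo_status(last_sync["sales"], now, rpo))
print(datasets_breaching_rpo(last_sync, now, rpo))
```

A check like this could run on a schedule and alert the team before a breach of the recovery objective turns into actual data loss.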

There is now a healthy list of solutions for achieving your security goals. One of them is Apache Knox. It integrates with Kerberos and LDAP, supports SSO to other services, and provides a decent level of audit data. You may also want to look at solutions that support Linux-based block-device and field-level encryption. Another proactive approach is to set up a security analytics framework that allows analysis of audit trails and external IPs and helps build key KPIs for the security team.
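A security analytics framework can start very small. The sketch below, in Python, counts denied requests per external IP from a list of audit events and flags the noisy ones. The event format (a dictionary with `ip`, `user`, and `result` keys), the internal network ranges, and the function names are assumptions for illustration only; they do not reflect the audit log format of Apache Knox or any other product.

```python
import ipaddress
from collections import Counter

# Assumed internal address ranges; anything outside them is treated as external.
INTERNAL_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_external(ip):
    """Return True if the address falls outside every internal network."""
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in INTERNAL_NETWORKS)


def denied_by_external_ip(events):
    """Count DENIED audit events per external source IP."""
    return Counter(
        e["ip"] for e in events if e["result"] == "DENIED" and is_external(e["ip"])
    )


def flag_suspicious(events, min_denials=3):
    """Return the sorted external IPs with at least `min_denials` denied requests."""
    counts = denied_by_external_ip(events)
    return sorted(ip for ip, n in counts.items() if n >= min_denials)


# Made-up audit events, used only to exercise the sketch.
events = (
    [{"ip": "203.0.113.7", "user": "unknown", "result": "DENIED"}] * 3
    + [{"ip": "10.1.2.3", "user": "svc_etl", "result": "DENIED"}] * 3
    + [
        {"ip": "198.51.100.23", "user": "analyst", "result": "ALLOWED"},
        {"ip": "198.51.100.23", "user": "analyst", "result": "DENIED"},
    ]
)

print(flag_suspicious(events))
```

The denial count per external IP is one example of a KPI the security team could track over time; the threshold of three is arbitrary and would need tuning against real traffic.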

Governance requires not just adding a technology to your architecture but also setting up processes and dedicated resources to help define and evangelize its long-term benefits. This would also drive the policies behind a more secure application. Security is multifaceted, and my recommendation is to take a bottom-up approach, starting from the underlying infrastructure and working all the way up to how external users access the application.