The Odd Couple: Hadoop and Data Security

Jul 24

Summary: Cloudera has launched its Sentry security module for Hive and Impala. But is a more comprehensive security solution for Hadoop possible?

By Andrew Brust

Until now, security wasn’t a term that would normally be associated with Hadoop. Hadoop was originally conceived for a trusted environment, where the user base was confined to a small elite set of advanced programmers or statisticians and the data was largely weblogs. But as Hadoop crosses over to the enterprise, the concern is that many enterprises will want to store pretty sensitive data.

So what stood for Hadoop security up till now was using Kerberos to authenticate users for firing up remote clusters if the local ones were already used; accessing specific MapReduce tasks; applying coarse-grained access control to HDFS files or directories; or providing authorization to run Hive metadata tasks.

Security is a headache for even the most established enterprise systems. But getting beyond concern about Trojan Horses and other incursions, there is a laundry list of capabilities that are expected with any enterprise data store. There are the “three As” of Authorization, Access, and Authentication; there is the concern over the sanctity of the data; and there may be requirements for restricting access based on workload type and capacity utilization concerns. Clearly there are many missing pieces for Hadoop waiting to fall into place.

For starters, Hadoop is not a monolithic system, but a collection of modules that correspond to open source projects or vendor proprietary extensions. Until now the only single mechanism has been Kerberos tokens for authenticating users to different Hadoop components, but only standalone mechanisms for admitting named users (e.g. to Hive metadata stores). You could restrict user access at the file system or directory level, but for the most part such measures are only useful for preventing accidental erasures. And when it comes to monitoring the system (e.g., JobTracker, TaskTracker), you can do so from insecure Web clients.

There are more missing pieces concerning data, as nothing was built into the Apache project. There was no standard way for encrypting data, and neither was there any way for regulating who can have what kinds of privileges with which sets of data. Obviously, that matters when you transition from low level Web log data to handling names, account numbers, account balances or other personal data. And by the way, even low level machine data such as Web logs or logs from devices that detect location grow sensitive when associated with people’s identities.

There’s a mix of activity on the open source and vendor proprietary sides for addressing the void. There are some projects at incubation stage within Apache, or awaiting Apache approval, for providing LDAP/Active Directory linked gateways (Knox), data lifecycle policies (Falcon), and APIs for processor-based encryption (Rhino). There’s also an NSA-related project for adding fine-grained data security (Accumulo) based on Google BigTable constructs. And Hive Server 2 will add the LDAP/AD integration that’s current missing.

For the near term, not surprisingly, vendors are working to fill the void. Besides employing file system or directory-level permissions, you could physically isolate specific clusters, then superimpose perimeter security. Or you could use virtualization to similar effect, although that was not what it was engineered for. Zettaset Secure Data Warehouse includes role-based access control to augment Kerberos authentication, and applies a form of virtualization with its proprietary file system swap-out for HDFS; VMware’s Project Serengeti is intended to open source the API to virtualization.

Activity monitoring, what happens to data over its lifecycle – is supposed to be the domain of Apache Falcon, which is still in incubation. But vendors are beating the Apache project to the punch: IBM has extended its InfoSphere Guardium data lineage tool to monitor who is interacting with which data in HDFS, HBase, Hive, and MapReduce. Cloudera has recently introduced the Navigator module to Cloudera Manager to monitor data interaction, while Revelytix has introduced Loom, a tool for tracking the lineage of data feeding into HDFS. These offerings are the first steps — but not the last word — for establishing audit trails. Clearly, such capability will be essential as Hadoop gets utilized for data and applications subject to any regulatory compliance and/or enforcing any types of privacy protections over data.

Similarly, there are also moves active for masking or encrypting Hadoop data. Although there is nothing built into the Hadoop platform that prevents data masking or encryption, the sheer volume of data involved has tended to keep the task selective at best. IBM has extended InfoSphere Optim to provide selective data masking for Hadoop; Dataguise and Protegrity offer similar capabilities for encryption at the HDFS file level. Intel however holds the trump cards here with the recent Project Rhino initiative for open sourcing APIs for implementing hardware-based encryption, stealing a page from the SSL appliance industry (it forms the basis of Intel’s new, hardware-optimized Hadoop distro).

Read Complete Article