Data Lake Security in the Public Cloud

Name: Data Lake Security in the Public Cloud
Uploaded: 2022-12-14
Duration: 40 min 40 s
Description: Shamir Charania explores security design for cloud data lakes, covering identity and network boundaries, RBAC strategies, and data visibility. The talk examines emerging regulations like Canada's Bill C-27 and the EU's Gaia-X framework, and contrasts architectural approaches in Azure and AWS.

BSides Calgary | 40:40 | 10 views | Published 2022-12

Speakers

Shamir Charania

Tags

Category: Technical

Topic: Cloud, IAM, Network Security

Style: Talk

Mentioned in this talk

Tools used

AWS CloudWatch, AWS GuardDuty, Azure Log Analytics, Microsoft Defender for Storage

Platforms

Amazon EMR, AWS Glue, AWS IAM, AWS Lake Formation, Azure Data Factory, Azure Data Lake Storage Gen 2, Azure Synapse, Databricks, Microsoft Entra ID, Microsoft Purview

Service

Amazon Redshift, Amazon S3


Transcript [en]


Hello everyone, my name is Shamir. I know it's really close to beer o'clock, so I'll try to make this short and sweet and we'll get through it, although there is an ending keynote (Steve, I think, is talking), so that'll be super fun. Just so I can tailor the presentation as I talk, I'm curious: who uses the public cloud right now for their data lakes? One, two, a couple of people in the crowd. No, not you? Okay. Are you primarily AWS? Put up your hand if you use AWS S3 with the note. How about Azure? IBM Cloud? Nobody. Okay, all right. So my name is Shamir, I'm a CPO at

the Turk. We are a cloud security company focused on data lake security in the public cloud, mostly on Azure at the moment. This talk is about data lake security: some issues that we have today as we deploy data lakes in the public cloud, and then a little bit about what's coming down the pipe in terms of regulation and other things you might want to consider when you are creating your data lakes in the public cloud. So, on the agenda: data lake security is a pretty big topic. I could probably spend all of BSides talking about it, so I'm going to

talk about three main areas of cloud data lake security. The first is creating appropriate identity and network boundaries for your organization. The second is different RBAC strategies and different things you can do around service and user authentication. The last one is data visibility. In the "beyond" section I'll talk about a couple of regulations that are coming down the pipe. The first one is here in Canada, Bill C-27, which is the new Privacy Act, or the update to the Privacy Act, and the second one is a project out of the EU called Gaia-X, and it's kind of

an interesting take on data regulation, the new data economy they're trying to build out there, and the regulation the government is introducing that, if you want to partake in the data economy, you're going to have to conform with. All right, so let's level set in the room in case people don't use data lakes. The first question is: what is a data lake? Back in the olden years of maybe five years ago, we used to use databases for all of our quote-unquote data lakes, and if you've ever been involved in a project that claimed it was doing business intelligence or

something like that, that's effectively what you were doing. You were taking one large database that combines your compute, your storage, your access controls, your logging and so on in that database. What we realized very quickly was that this model doesn't scale or, more importantly, if you do want to scale, you end up paying SAP a lot of money. So we moved from a big old database to what we now call a data lake. What a data lake does is abstract out a couple of those concepts (compute, your identity provider, and centralized logging) while still enforcing some of the basics around security access controls. So you basically have what we would call blob

storage, and you would put your blobs in there, and your blobs would represent sets of your data. You can structure them as tables if you want, or you can leave them unstructured, so PDFs and video files and so on can go in there. If you ever wanted to use them or analyze them, you would bring in your own compute: there are all these different types of products that sit on top of these data lakes and bring compute on demand for when you want to make a query. The other thing that we abstracted out was the identity provider and the centralized logging. Most of the public cloud providers have

a very opinionated way of how they want you to do identity and logging, i.e. they want you to use their services and no other. But the idea is that the identity provider and the centralized logging would be separated from the storage account itself, allowing you to maybe get a little more creative with your target architecture. So where are we going tomorrow? Tomorrow, I think what we're looking at is something similar to what I've got on the screen here, where the idea is that data lakes (and I use the word in the plural) in our organizations represent these spaces of data that our business wants to share, and that sharing may occur

business to business with my business partners, business to government (B2G), business to consumer (B2C), or just intra-business within my own business. What I'm doing then is, instead of relying on the data space itself to provide identity, trust, a catalog, some type of data exchange, and some type of compliance framework, I'm taking those concepts out and putting them in this abstract layer that I interact with and that wraps my data spaces together. My services would then connect to these data spaces through this kind of federation service that provides all of these features and

functionality. It gives me this semblance of centralized control over how data is used in those data spaces without having to centralize my data as well and bring in all of the problems that come with that. So, last slide on what a data lake is, in case you use Azure or AWS for your data platform. In Azure, the main service that stores all of your blobs is called Azure Data Lake Storage Gen2, or ADLS Gen2, and that one is going to provide your storage and access controls. Typically, for logging you're going to

use something like Log Analytics, for governance you're going to use something like Azure Purview, and for identity you're of course going to use Azure Active Directory. Then you might have one or more services that connect into your ADLS Gen2. These are just a couple of the ones you could use: Azure Data Factory, for example, for orchestrating and doing your data movement; Azure Databricks if you want to do some heavy compute or get some exploratory data work done; and Azure Synapse, which is kind of a competitor to Azure Databricks, or kind of not, depending on who you talk to. But it's the same sort of deal: those

are the compute engines that you can bring to do the queries and so on against your data. On the AWS side, you obviously have Amazon; I'm using Lake Formation here. You could use S3 without, quote unquote, Lake Formation, which basically just doesn't do AWS Glue for you, so you don't get a table structure and that governance structure on your data; you'd just be deploying blobs directly. In the AWS ecosphere, logging is obviously done through AWS CloudWatch, you can do governance through AWS Glue, and then you obviously have identity, which is done through AWS IAM. And of course, from a services perspective, you've got

different services in AWS, and they work a little differently from the ones in Azure. Athena, Redshift, and EMR are all bring-your-own-compute type clients that you can add onto your data lake. So now that we know what a data lake is, Mr. Burgundy here is going to let us know that networking is kind of a big deal, and it surely is. So let's talk a little bit about the network boundary that you can create around your public cloud data lake. Getting really high level and into the theory here, cloud providers basically have two network boundaries that they enforce

whenever you access a service. This doesn't have to be ADLS Gen2 or Amazon S3; any service will work like this. Basically, you have your cloud provider network boundary and then you have your cloud provider service boundary. What configuration is available to you from those public cloud providers depends on that small blue box there at the top, the service endpoint configuration; that will influence what you're able to configure and what you're not. The teal is the provider's responsibility and the blue is the customer's responsibility. It's important to remember this: when we talk about systems in the

public cloud, we talk about internet boundaries and non-internet boundaries. A lot of that is not in your control. Those are not things you should be worrying about, because they're not in your area of the shared responsibility model for the security of one of these services. What is important is the service endpoint config and how you configure it. Of course, right now, if you're a tried-and-true security professional or a vendor, you're going to go, "Blah blah blah, what about zero trust?" and then I'm just going to move on to the next slide. All right, network boundaries. Basically, your service is accessible through what they

call endpoints within the public cloud. There are public endpoints and private endpoints, and this is really just telling the service where to listen. If you think about it, Azure Storage and Amazon S3 are global services; they're always listening out on the internet, and there's nothing you can do about that. What you can do is configure where the service should be listening, or where it should be accepting traffic from, and that's the difference between public and private endpoints. In the public endpoint realm you have basic IP restrictions (think of your typical layer 3 firewall), and then you have enhanced restrictions (your layer 3 firewall with VLANs, basically), and then you have your private endpoints,

which allow you to address your services using private addresses. In Azure, you typically configure the Azure Storage firewall; with the enhanced restrictions, it's the Azure service firewall with service endpoints, where, if traffic is originating from a virtual network inside of Azure, they will put a special tag on that traffic

and then they will accept it because it's coming from that virtual network. Just think of it as VLANing, but in the public cloud. On the AWS side, you basically do the same thing through bucket policies, and then bucket policies with VPC endpoint policy condition tags on them. So the key takeaways from the network boundary are these. Endpoints are really just configuration about where to listen. You can't really think about it in terms of internet and intranet; it's a very blurry line in the public cloud, and thinking about it in that traditional sense is going to lead you to make fairly poor decisions about how you want to structure your storage

and how you want it to be accessed. The second part is that you have to consider the shared responsibility model between the providers and the consumers, and understand what is in your control and what you should be worried about, and then the parts you shouldn't be worried about. For example, someone DDoSing your storage account is not your problem if that's in the provider's part of the shared responsibility model. The last one is that private endpoints may increase security because they are, quote unquote, only privately addressable (again, in a very loose sense of that word), but they definitely increase costs. It's definitely a premium feature that both cloud providers charge you for

if that is what your security team is mandating. Moving on from the network boundary, let's talk a little bit about network detection. Now that we've talked about how you can protect your storage accounts, this part is about how you can understand what is happening in your storage account. Network detection can be done through logging, of course; you can log all server requests to both the Azure and the AWS services. But both clouds also use some artificial intelligence, I guess, to look at your storage logs (because of course your storage logs are occurring at scale) and try to understand whether

some of that traffic is bad or not. In Azure they have Microsoft Defender for Storage, and in AWS they have AWS GuardDuty. I'd love to rename these cloud services, because the names don't make much sense. I've never seen Defender defend against anything; it more snitches, so "Microsoft Snitch for Storage" is probably a better term. It'll tell you if something's wrong, but it definitely won't block anything. And AWS GuardDuty is more like the mall cop. If it's really obvious that something's wrong, it'll tell you about it, but if somebody's being pretty thoughtful about what they're trying to do, it's not going to catch

much. Typically these services apply, obviously, to the blob storage that you have, and they have different alerts that you can configure. The most interesting one, I think, is on the AWS side, where they will warn you if your storage account gets pen tested. I think that's basically signature detection using the user agent, so if you have a good pen tester, that should almost never go off. Okay, so we've talked about the network boundary and a little bit about what you can do at the detection level. Let's talk about identity. This guy's super smart here. There are two reasons he doesn't trust people: one, he doesn't

know them, and two, he knows them. I'm sure that resonates quite often.
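As an illustration of the AWS network-boundary approach mentioned earlier in the talk (bucket policies with a VPC endpoint condition), here is a minimal Python sketch that builds such a policy document. It is not taken from the talk's slides. The bucket name and endpoint ID are hypothetical placeholders, and the function name `vpce_only_bucket_policy` is made up for this example; `aws:SourceVpce` is the AWS condition key that matches the VPC endpoint a request arrived through.

```python
import json


def vpce_only_bucket_policy(bucket: str, vpce_id: str) -> dict:
    """Build an S3 bucket policy that denies requests not arriving
    through the given VPC endpoint (hypothetical helper, for illustration)."""
    arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnlessFromVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [arn, f"{arn}/*"],
                # Deny when the request did NOT come through this endpoint.
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values; substitute your own bucket and endpoint ID.
    policy = vpce_only_bucket_policy("example-data-lake", "vpce-0123456789abcdef0")
    print(json.dumps(policy, indent=2))
```

A blanket deny like this also blocks console and other access that does not traverse the endpoint, so in practice it would likely need carve-outs. Treat it as a sketch of the "where to listen" idea, not a recommended production policy.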

pitch their own services as your identity provider, your authorization and authentication mechanism. In Azure that's Azure Active Directory, and in AWS that's AWS IAM. Of course, there are different ways you can configure your identity boundary within the different cloud providers. On the Azure side, you can use Azure RBAC, which is effectively just placing the RBAC controls at the Azure resource level, if that makes sense. There's POSIX, which is just a Linux file system and you being able to configure the POSIX controls. And then there's storage ABAC, which effectively allows you to use tags, and

don't worry, we're going to get into some of the differences here when we talk about RBAC strategies. On the AWS side, you basically have two main ways of doing things. Both of them revolve around writing really fun JSON policy documents and applying them in different areas to get unique results. The key thing to think about with your identity boundary is that you don't have to rely on just user identity controls. There's Azure Conditional Access, or Microsoft Conditional Access, whatever it's called these days, and effectively it's looking at more of the attributes around the authentication and authorization requests that are being made: user and

location, application, and real-time risk, which I think is basically whether you have an antivirus

or not, and their intention or maybe block access. You can do this in both clouds; this is just an Azure diagram. But when you're setting up your identity boundaries, think about also setting up things like conditional access so that you can strengthen that control. So, the identity boundary: what it looks like in Azure is Azure AD Conditional Access for users and workloads, and, like I said, you can do location, risk, and MFA type policies. The last one there is that you can create what are called managed identities within the Azure platform. If you've ever done Active Directory in

your past, think of it as a computer account. You're basically giving a computer account to a service and then granting that computer account access to a set of resources, and that's managed by the platform for you; you don't have to manage keys or anything like that. On the AWS side, you can use conditions on your policies. There are different conditions you can add, like whether multi-factor was present at authorization time, what the source IP was, the user agent, and so on. AWS also has a concept of managed identities; it's called instance profiles. So I think that the core part here,

when you're thinking about your network and your identity boundary, is trying to combine them to achieve your security goals in the public cloud. Right? Thank you, Keanu. All right, so basically it looks something like this. You can use your network boundary to paint with what I'll term the big brush, so anything that made the, let's say, non-prod network. But within that non-prod network you can then make use of identity controls to protect the different data spaces. For the identity, the actual running of the identity, the actual enforcement of that, is in the cloud provider's part of the shared responsibility model; that's not

your risk be trusted less depending on your threat profile, that identity boundary will do what it is supposed to do. So you can layer these controls together, where you use the network as this broad brush to keep everything semi-private, and then you use identity within that to get to the granularity level you want. Typically, with RBAC strategies in most organizations, especially around the cloud data lake in the public cloud (I think I said that right), everybody gets admin. That's what they do, and they do that because it's hard to manage RBAC at scale. It's hard to

manage hundreds of thousands of folders and petabytes and petabytes of data, so what they do is just give everybody admin, or they turn over admin to the data analytics team and just hope for the best. So, RBAC strategies. This is a really high-level slide of basically how ACLs work. As a user, I want to access data elements. My request gets authorized through an ACL policy that runs somewhere, usually on the storage account itself, before I get access to the data element. Now, there are a couple of different things that you need to consider

when you're talking about the assignment. In a lot of cases we're assigning it to the entire storage

policies. Now, you could probably think through that and go, "Well, that's not going to be good if I combine my accounting data with my HR data with my metrics data with my OTP data," right? You're going to want some sort of segregation, or some type of policy assignment, that takes into account

that you have different types of data being stored in um as I'm on this next it is POSIX controls. POSIX does not have inheritance. If you've ever had to chmod a large directory with lots of files, you sit there and wait until it completes, because it

touched the AC is set up you need to run, how do you need to run it, who needs those permissions, what permission does the runner need to have, and what does that mean for your data security? In a lot of cases, for example, to configure ACLs you need to be owner of the data, which means that all of my admins can see all of my data, regardless of the data sensitivity and classification. And the last one is ACL sizes and
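The point above about POSIX-style permissions not being inherited, so that a change on a large directory has to visit every entry, can be sketched on a local file system. This is an analogy, not the ADLS Gen2 API: the helper name `chmod_tree` and the demo directory layout are invented for illustration, and only Python standard-library calls are used.

```python
import os
import stat
import tempfile


def chmod_tree(root: str, mode: int) -> int:
    """Apply `mode` to `root` and to every directory and file below it.

    Returns the number of entries touched. Because permissions are not
    inherited, the cost grows with the number of entries in the tree.
    """
    os.chmod(root, mode)
    touched = 1
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            os.chmod(os.path.join(dirpath, name), mode)
            touched += 1
    return touched


if __name__ == "__main__":
    # Build a tiny throwaway tree; a real lake has vastly more entries.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "hr", "2022"))
        for rel in ("a.csv", os.path.join("hr", "b.csv")):
            open(os.path.join(root, rel), "w").close()
        count = chmod_tree(root, 0o750)
        print(f"touched {count} entries")
        print(oct(stat.S_IMODE(os.stat(os.path.join(root, "a.csv")).st_mode)))
```

Every entry costs one call here. Scale that to hundreds of thousands of folders and the waiting described in the talk follows, along with the question of which identity is allowed to run such a job.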
