
Thank you. It's an honor to be here, now on the other side and not organizing, so thanks a lot for hanging out at the end of a Sunday afternoon to see my presentation. I'm Ana Oprea, I'm the European lead of the Vulnerability Coordination Center, and today I will talk about security by design in relation to the sins that BSides Munich has talked about, and basically what we can do to counteract them.

Security and, in relationship to security, reliability are central elements of product development and of the product. However, how secure and how reliable do you think a product needs to be? I would like to argue that 100% is never the right reliability or security target. It's more than what your customers want or need, and it would be very expensive to achieve 100% security or reliability, apart from it being impossible. So what I am going to talk about today is how to adapt what your product needs, and what the needs of your customers are, to the business risk. It will be a very high-level talk. I'm just going to scratch the surface, and I will have some references at the end on how you can look deeper into building security and reliability into your products from the beginning.

Let me tell you a story. More than 11 years ago at Google, in the Bay Area where our headquarters are, there was a change in the Wi-Fi password on the buses that connect San Francisco to the Mountain View headquarters. The company that runs the buses had emailed all the Googlers working in the Bay Area, a few thousand people, telling them: the Wi-Fi password has changed, please update all the devices that are connected to the Wi-Fi. Google had at the time, and still has right now, a system for sharing passwords for third-party systems. So, as you would expect, many thousands of people tried to access that password manager to get the new Wi-Fi password.
However, that password manager had been created quite a few years earlier, with a target audience of a few dozen people who were system administrators at the time. So when many thousands, or maybe even tens of thousands, of employees tried to reach the password manager, it became unresponsive. We are Google, so we did think about it: there was a second replica. So when the first replica became unresponsive, all the traffic was redirected to the second replica. Bad idea, right?

However, there was somebody on call. The on-call engineer was on the East Coast of the US and was very confused when they got paged about the issue, because in its previous five years of existence the password manager had never had a problem. So what was happening now? The engineer kept cool, though. There was a set of instructions and procedures that they could follow to restart the server. They looked at that and realized they would need a special smart card, which they would have to take from a safe, to insert into a hardware security module, an HSM, in order to restart the system. There was no such safe on the East Coast of the US. There was one in Australia, and there was one on the West Coast, where the Wi-Fi password was being changed.

The engineer on the East Coast managed to reach the colleagues in Australia, and they wanted to access the safe. But where do you think the combination of the safe was? In the password manager that was now offline. However, the colleagues on the West Coast soon came online, and one of them actually knew the combination of the safe. So they did manage to retrieve the card. They inserted it, and there were some blinking green lights, but nothing happened. They thought: okay, maybe this card is wrong, maybe it's broken, who knows. The people in Australia in the meantime managed to get drills and actually access the safe
and retrieve the other card. They inserted it into the HSM, noticed the same green lights, and got a message that seemed pretty cryptic, something about a password. It didn't seem like an error; however, it was an error that didn't communicate much, and it actually meant that the card had been inserted the wrong way around. So after about an hour they did try putting the card in the other way around. The blinking lights remained green, but the server started and the outage ended.

What I'm trying to say with this story is that there's a subtle interplay between security and reliability. They both need to be centerpieces of system design; however, we need to be careful about how they interact. In this story, the password manager was offline because of a reliability issue: it had bad load balancing implemented. And the outage was basically amplified by additional security measures that stood in the way of its recovery. So what can we do to learn from these mistakes and build security and reliability into products?

Let's start with the security part. We need to manage risk. I was talking about adapting the profile of what your customers need and what your product wants to deliver to the business risk. We need to manage security risk and we need to manage reliability risk. On the security risk side, we should think about the people, who should be at the center of product design. Who are the actors that are interacting with your product? Which targets might your business have? Which assets might your business have, and what might potential attackers want to get from them? So threat modeling is very important when managing security risk.

Some assessments regarding risk that I'd like to call out: first, you might not realize you're a target. For example, Adobe, in around 2012, had an attack where some malicious actors
used their certificates, basically, to sign malware. Adobe is not a security company. They are known for software that enables creators, so they were not expecting such an attack. So think a little bit about what assets you have and what might represent a target for attackers.

You should start with the basics. You might know that phishing is actually one of the most successful attacks used to retrieve information from organizations. This is where being savvy is important, and a way to counter lust, one of the sins that was called out at the conference.

Don't underestimate your adversary. You might have heard the story of how the NSA has gone to great lengths to intercept Cisco routers on the way to customers and add backdoors there. So you might actually be dealing with very powerful and resourceful actors that are trying to target you.

Attribution is hard. This is a story related to Ukraine. When we were thinking about this, it was before the war in Ukraine, but I guess it particularly makes sense now. The malware NotPetya was thought to be related to Petya. However, the attack, which targeted Ukrainian citizens on the eve of a holiday there and a particular finance software that was used only locally, had nothing to do with the original Petya ransomware, which had a global effect. So I think it's not so important to know who the attackers are, particularly because even if you know who they are, they might not be afraid of being caught: the criminal justice system, particularly internationally, might make it hard to actually catch them.

So once you know what your security risk is, once you have done threat modeling and identified what assets you have and what targets particular actors might want to reach, what can you do to protect yourself?
This is where I'd like to call out a few topics related to least privilege. Least privilege is a design strategy that can help you in many situations. Let's say that you are actually upset at your company and want to steal something or add a backdoor. Would you do that? Would you get caught? Who would know? Or what if you are working overtime and have been dealing for many hours, or even days, with trying to fix an outage? How many mistakes away, how many copy-pastes away, are you from actually deleting all the data or wiping something out? This is where designing for least privilege can help in avoiding many such scenarios, related either to targeted attacks or to accidental mistakes.

When designing for least privilege, one strategy is zero trust: making sure that you are not making trust decisions about who is allowed to do what based only on who they are, but also on what device they have and what you know about their role, and making sure that the infrastructure is run with particular roles.

Multi-party authorization: making sure that no one person alone can perform critical administrative operations. This will increase the price that an attacker has to pay when trying to get access to an asset of your business. If there is already a second engineer that they would need to compromise, then it's already double the cost, so maybe they think twice before doing that.

Auditing and detection: looking at logs will give you insight into who has done what for what reason, and will ideally allow you to detect malicious actions.

And we were talking about how the recovery of the password manager was slowed down by security artifacts. This is where not only having plans for how to recover from failure, but also testing them regularly, is very important.

So, continuing to reliability.
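
The multi-party authorization and auditing ideas described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the tooling used at Google: the names `ChangeRequest`, `execute`, and the audit log fields are all hypothetical. The sketch shows a critical action that runs only after someone other than the requester has approved it, with every attempt recorded together with who asked, what for, and why.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ChangeRequest:
    """Hypothetical request to perform a critical administrative operation."""
    requester: str
    action: str
    justification: str
    approvers: Set[str] = field(default_factory=set)

    def approve(self, approver: str) -> None:
        # The requester can never count as their own second party.
        if approver == self.requester:
            raise PermissionError("requester cannot approve their own request")
        self.approvers.add(approver)


def execute(request: ChangeRequest,
            audit_log: List[Dict[str, object]],
            required_approvals: int = 1) -> str:
    """Run the action only if enough independent approvals exist.

    Every attempt, allowed or denied, is appended to the audit log.
    """
    authorized = len(request.approvers) >= required_approvals
    audit_log.append({
        "who": request.requester,
        "what": request.action,
        "why": request.justification,
        "approved_by": sorted(request.approvers),
        "authorized": authorized,
    })
    if not authorized:
        raise PermissionError("not enough independent approvals")
    # A real system would perform the operation here; the sketch only reports it.
    return f"executed: {request.action}"
```

With this shape, an attacker who compromises one account still needs a second person before anything happens, and the denied attempt itself leaves a trace that a reviewer can pick up later.
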
For security we were talking about threat modeling, looking at actors, looking at assets, and at what might represent a target. When talking about reliability, we're looking at error budgets. We said 100% is probably not achievable, neither from a security nor from a reliability perspective. So on the reliability side you're looking at error budgets, where all stakeholders in the organization should have approved a service level objective, or SLO, a reliability target, as being fit for the product. Above a particular threshold users are happy: they will continue to use your product, they will not call support. It's basically a level of happiness. Under that threshold you should notice that there will be more bug reports, and people might start complaining and stop referring others to your product.

It's a fuzzy topic to define, so I'll just go quickly over an example: Bigtable. This is a large storage system that Google is using, and there are actually two large use cases that we have. One is applications that are serving data directly from Bigtable to customers. These kinds of requests need to be very fast and very responsive; the customers should not have to wait to receive an answer. The other type of application uses Bigtable at large scale, for example running MapReduces on large amounts of data, and in this case it's okay to wait; the requests don't need to be answered very fast.

So when we think about SLOs and error budgets, we actually have two very distinct scenarios where success is opposite, is antithetic, between the two. In the case of the user-serving or customer-serving Bigtable, for example, the queues need to be always empty, because we want to have the responses provided immediately. And in the throughput use case, where we run MapReduces over data, we want the queues always full, because we want to
have more data to process all the time. So having one SLO for both use cases would be very expensive. The way Google has decided to answer this situation is to have two types of data centers. In some data centers there's a high SLO, of four nines, 99.99%, for example (I don't know the exact numbers, but something along these lines), to be able to answer requests very fast. The others are throughput data centers for Bigtable, where requests are answered against a much lower baseline. It's much cheaper to run a data center for throughput than it is for direct-serving applications.

So, continuing with reliability risk. We talked about design strategies for security. With regard to design strategies for reliability, in relationship to designing for least privilege, you might be surprised, or maybe not, but they are actually the same.

We talk about zero touch, which goes one step further than zero trust. Zero touch refers to running infrastructure that humans don't have direct access to, because we make mistakes, or maybe somebody has been compromised and actually has malicious intent. So we don't want humans to be able to modify the state of production directly; we want them to do that through tooling. Tooling allows for auditing and detection. It also allows for verifying, for example, how many resources we need in terms of CPU or memory for running an application.

Multi-party authorization is the same concept as with security risk. In terms of reliability, it has the advantage that a second person might be more careful in detecting potentially small mistakes that you, the engineer changing production, might not have thought about.

And in terms of recovery, we need to design our systems so that we always expect
that they will fail, and they need to have a way to go back to a state where they can run even though maybe there's less memory currently, or a particular resource is unavailable. So automation is key in each of these design strategies.

There's only one part of what I previously mentioned that I'd like to briefly go deeper into, and that is auditing and detection. What does it mean? Reviewing all access logs and justifications to make sure they are appropriate. Basically we want to know who has done what, for what reason, when, and what the outcome was. In practice, how we do this at Google is that we try to use structured justifications. For example, if somebody from support needs to access a particular customer case, then they will always have a particular support ID that they will use as a structured justification, and then we can connect, through the logs, what actions have been done in relation to this particular ticket ID.

Another key part of auditing and detection is: who is the auditor? At Google we have two types of audits that we're particularly looking into. One is looking for best practices. For example, an SRE, or site reliability engineering, team might do weekly checks of when we used break-glass. Break-glass means we requested additional access in order to perform a particular task; think about it a little bit like doing sudo on production. Something like this is reviewed by the team so that we know how we can improve and change our tooling, so that we don't have to do break-glass anymore but can run in the regular context that we typically have.

There's a second team that does auditing, and that is the central security team. This is where we implement the larger detection capabilities and we look at what events have happened. Are all the actions
that were performed by these teams, which don't have anything to do with each other, connected? Is there potentially an attack happening? And so on. A pitfall that we've noticed here is, in particular, the auditor missing context or not being objective. That's why, through these two ways of looking at audits, we're trying to make sure we don't have any gaps between the two.

So, to conclude: shifting security and reliability left. There are a lot of trade-offs between security and reliability. Only if we commit to the full life cycle will we manage to make better products. The more trust we put into the systems and products that we're using, the more secure they need to be, and the more we use them and rely on them, the more reliable they need to be. That's very hard to do and implement after the fact, once systems are running in production and we have to follow up with many patches and so on. It's much more expensive to build in security and reliability after the fact.

With that, I'd like to conclude with a few resources that I mentioned, from which you can learn much more and go deeper into these topics.

I'd also like to call out diligence. This is where the BSides team was talking about bringing people into the community, allowing for growth, and each business and each individual finding their path into cybersecurity, or potentially a security career. I think you've also seen many sponsors here looking into hiring people. If you're new to the cybersecurity or information security community, or potentially career, you might know it's not the easiest to get started. But there are many people here who hopefully you can reach out to. Hopefully businesses learn that they should provide mentorship and be open to people joining, maybe changing careers, and so on. So I'd like to encourage all of you, either individuals
who are looking for a change or companies looking to get more awesome colleagues, to think about what you can offer and give back to this community. And I'd also like to close by asking for a huge applause for the organizers. [Applause] Having been on the other side, I know how hard it is to make this happen, and I think you guys did a great job. So thank you. Questions?

Thank you so much, Ana. Any questions for Ana? Anyone? Yes.

Hello Ana, thank you for the presentation, this was awesome. One question I have: what I've usually seen in companies, in my limited experience, is that security is often an afterthought that happens after they think about reliability. So what often happens is that security comes as a need later in the product lifetime, and you're kind of forced to bake it in at some later point, not in the beginning. How do you usually handle that?

This is where I think there's a lot of potential in having the people responsible for reliability, be it SRE, DevOps, or even the developers themselves, and bringing security to the mind of the people responsible for reliability. It is very hard to tackle it at the end. You need to start with that, because you might be in a position where, you know, you've had a breach or some incident somewhere or whatnot, and then you need to do something to stop the fire and stop the bleeding. Unfortunately, there's no way around that. But afterwards, hopefully, it will become more obvious to the business, and this is where you need to get the business on board to invest earlier, because it was very expensive to stop the fire last time. So what can we do to make sure we don't have a fire to put out next time? So: bringing the security team closer to the developers and working together, teaching them, making them more responsible from the security perspective, and empowering them. I would say it needs to be not us versus them, because otherwise
you've lost the ba