K003: The Second Factor – “Systems” For Immutable Infrastructure

transcript

Kamalika:

Before I begin today's episode, let's do a quick recap of what has been covered so far. In the previous episode I presented 10 Factor Infra, a framework for secure, scalable and robust modern infrastructure. In the first factor of 10 Factor Infra I explained how a segregated network, perimeter security, a single secure entry point and dedicated peer-to-peer links help build the foundation of the robust infrastructure needed for today's digital transformation.

In this episode, I'm going to discuss the second factor of the 10 Factor Infra framework, which is Systems. In this factor, I will share the design principles and contract parameters for servers hosting applications in today's digitally transformed world, to deliver always-on services. I will explain some of the common challenges that today's online world faces in application hosting, and explain how to mitigate them using the Systems factor of the 10 Factor Infra framework. Cloudkata is available on all podcast platforms, so tune into the infra journey on your favourite podcast player and learn the art of mastering modern infrastructure. Do visit my website, www.cloudkata.com. That's right, it's www.cloudkata.com. Subscribe to the complete playlist along with the transcripts, so you get notified about upcoming episodes and the various supporting blogs and articles that get published. And if you have any queries or feedback about the sessions, connect with me on www.cloudkata.com. So let's continue our deep dive into the anatomy of modern infra on Cloudkata - Mastering Modern Infrastructure.

Intro Music:

Hey there, welcome everyone. Thanks for listening to Cloudkata - Mastering Modern Infrastructure. Learn how to design cloud-ready modern infrastructure with zero-downtime deployment, security and effective FinOps, with me, Kamalika Majumder. I will explain some of the common challenges that today's online world faces in application hosting, and explain how to mitigate them using the Systems factor of the 10 Factor Infra framework. Systems is the first layer of the anatomy of modern infrastructure where you will start noticing operational expenses, since it is the first chargeable service that you will encounter in any cloud today. Now, you might ask, what about networks? Don't they cost anything? Well, most cloud providers charge zero or very little for their base network services. Unless you go with some very fancy security devices or load balancing systems, the base network, the three-layer segregated network explained in the previous episode, will not cost too much, especially if you're on cloud and opting for a VPC, since VPCs are not charged additionally. Systems are different: here you will be charged based on utilisation, or the number of hours or seconds your system is running. So it is very important to be careful that this layer does not end up being your most expensive infrastructure service. Every business in today's digitally transformed world aims at becoming the best seller in its own domain by delivering quality applications and services, but none of them runs with an unlimited budget. Trust me, no matter how big or small the organisation is, everybody has a set budget for their operational cost.
And the moment that budget is exceeded, the alarm is raised, panic starts, and people start asking you to optimise and trim down your infra, putting additional pressure on the development community to make compromises, with the risk of compromising the performance of the services built on that infra. So it is very important that you take care of optimised infrastructure and optimised FinOps right at the beginning, when designing your infrastructure. That is why I will share with you how to build scalable yet optimised systems, to prevent overshooting your budget. So let's start with some of the challenges, and their impact, in the current-day digitally transformed world. What are the topmost challenges, or demands, of modern infrastructure? I have already spoken about them in my first episode

of networks. The topmost demand is performance, right? Everybody wants to increase their sales based on demand, and the application should perform really, really well. Nowadays, as more and more mobile applications come to the market, the demand for performance has gone even higher. So what prevents organisations from meeting that performance, especially from meeting the required or demanded sales during peak hours, and how are they getting impacted today? There are three major things that impact peak-hour sales, or performance. The first one is mutable infrastructure. What is mutable infrastructure? It is infrastructure which has gone through multiple changes over time, you may say multiple mutations over time, such that every server has become its own work of art. This leads to a bottleneck in testing, and testing takes forever. Any change or upgrade that a server has to take becomes time consuming, because it has gone through so many changes that the server itself becomes a dependency. So mutable infrastructure is not scalable, because you really depend on that single server which has gone through multiple changes; you cannot scale it out. If you want to add another server, let's say to meet a performance demand, you cannot, because you don't know how many changes it has gone through with all the manual steps that have been taken. This leads to a halt, a bottleneck, in the scalability of your infrastructure. The next thing that hits performance, or rather scalability related to performance, is stateful systems. Most often we see that compute services, which are truly meant to host your applications and services, end up being stateful systems: you end up storing stateful data in your compute service.
Whereas compute was meant only for computation, for your services, applications and software, not to host your data. What do stateful systems bring? They bring you into an ecosystem that leads to non-zero-downtime upgrades. If you want to upgrade your compute service, and that compute service is storing some kind of data, and that data is a dependency for the service to run, then you cannot really do on-demand upgrades. You cannot quickly scale out, because you will have to scale the data as well. So when your compute systems become stateful, that puts you in a downtime-prone environment, and that stops you from automatically scaling your infrastructure whenever a peak-hour sales demand hits. And that brings me to the third and most important factor for performance: scaling. Due to mutable infrastructure and stateful systems in the compute services, scaling becomes time consuming and costly. Not only do you have to spend on heavy servers which have been enhanced with a lot of resources and changes over time, you also have to think about what all those changes are, and how to replicate them when you're scaling your server fleet. These three are the topmost challenges to meeting the performance demand that today's online world needs. The next challenge is, as I mentioned earlier, FinOps. And what is FinOps? Financial operations: managing your operational cost on cloud, which is literally what the term means. So where does it become challenging? Most often, organisations just assume they will need a certain budget, and that communication does not get passed on to the engineers who are actually developing the infrastructure. They think they can just invest in infrastructure, choose the highest available resource, and start developing their application accordingly.
And this also gets aggravated by the mutable infrastructure I mentioned earlier. Let's take an example: you start developing your service, and you identify that you need two virtual machines, one for the web server, one for the app server, and the rest for the database service. Let's say you're on cloud, so you choose an EC2 instance or a Compute Engine instance, and you say, okay, my service will need this much CPU and this much memory, and that will be enough. That may be the start of development for your development environment, but you never think of the production need on day one of development. And that is where you hit the first risk to your infrastructure. During development it is all good to do everything in one machine and take it forward, but when you go to production, or even before, during performance testing (and sometimes I have observed this during performance testing itself), you test the application, capture how much it needs, and keep bumping up that server, choosing a really high-end server for your stateless services. This kind of scaling is called vertical scaling, and it costs more, because you're spending more on a single server which you may or may not be able to scale out within five minutes. And since you do not know how much your server will need tomorrow, or when you go live, you just keep it on a pay-as-you-go model, and you cannot leverage the FinOps benefits that the cloud provider gives. I will explain how to mitigate this and build a very cost-effective setup. Finance is a challenge today because most often there is a communication gap between your operational group and your engineering group,

or, if I may say, between your business, budgeting or finance group and your engineering group. Due to this gap, one party is not aware that there is a cost ceiling that, once hit, they cannot go beyond, and the other party is not aware that they need to communicate it, so that measures are taken to deliver an optimised design within the defined threshold. And last but not least, the most important challenge today is to meet the desired security, so that you are compliant with the regulatory needs of your service, no matter which domain you are in. In today's world it is not just the banks or the fintechs that have to meet regulations; service organisations also have to meet security compliance and regulatory policies. So delivering a secure system for your applications, especially the ones which are public-facing, is very important today, and it is a compulsory need to get your application certified. These are some of the common challenges that today's online or digital world faces, and they have to be taken care of right at the beginning, when you're designing your infrastructure, so that you do not hit a bottleneck down the line, right before you go live. Now, let's look into how to mitigate or tackle these challenges. None of them are black boxes; there is a solution model for each one of them, and I'll explain them one by one. The next couple of solutions will focus on how to prevent yourself from getting into a mutable infrastructure model, or a stateful infrastructure model.
Now, first things first: for your services, always choose a compute-as-a-service model, whichever cloud provider you're going with, or if you're hosting on premise, make sure that you identify the services which are purely compute. It may be your microservices, or if you have a monolithic application, it can be an application service. Make sure those app servers, web servers or microservices ecosystems are stateless. Remember, your application will go through fast changes: you will have to adapt, enhance and do feature updates very frequently, on demand, with a model of continuous delivery. The expectation is that you are always production ready. That is why, when you have to do so many enhancements and grow your application, it is very important that your application layer is stateless, so that whenever needed you can upgrade it, scale it or enhance it with a zero-downtime deployment model. You cannot say, oh, I need to deploy a new version of an application, so I will take 20 minutes of downtime. Your customers are not going to wait 20 minutes for you to upgrade; they will immediately jump to your competitor's services. So stateless systems for your applications and services are very, very important to achieve a zero-downtime deployment model. And how can that be achieved? Firstly, it should go without saying that for all of these you need virtual appliances or virtual machines, but I'm still mentioning it, because sometimes on premise, or in a hybrid cloud model, you might end up looking into a bare-metal option. Don't go bare metal; that era is gone. Everything should be virtualized, and if possible should be in an appliance model.
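As an illustration of the zero-downtime deployment model for a stateless service, here is a hedged sketch of a Kubernetes Deployment that rolls out a new version without ever dropping below full capacity. The service name, image name and replica count are hypothetical, not from the episode:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app                  # hypothetical stateless web/app service
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0          # never take a replica down before its replacement is ready
      maxSurge: 1                # bring one new-version pod up, then retire an old one
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: registry.example.com/web-app:1.4.2   # explicit version, not "latest"
```

Because the pods are stateless, Kubernetes can create and destroy them freely during the rollout; this only works if no request-critical data lives inside the pod.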
And when you are building this virtual machine, or system, or instance, whatever you may call it, make sure you have separate partitions for your root operating system and for your application data or application config. I'm calling it data, but it is not really stateful data; it is the application config which has to sit on your system, unless you are using memory-based storage for your configuration. You still need some space for replication. Especially, let's say you are in a containerized framework: you need a certain space where your Docker containers, or whichever containerized model you are in, will occupy space on your system.

That space is not data at rest, because it will get cleared up on demand. But it is important that you separate out the partitions: one for your root operating system, the Linux operating system, and a separate partition for your application data. Also, if your application generates logs and you're storing those logs in log files on the local disk, make sure you have a separate partition for your logs as well. I will explain logging in much more detail in an upcoming session purely dedicated to it, but in this session I'll just mention that you need separate partitions for the separate kinds of configuration you have on your stateless compute service: one for the operating system or root disk, one for your application data or application config, and one for your logs. The next thing is thin-provisioned disks. When you're creating a server on cloud (and this is related to how virtualization works), always choose thin-provisioned disks, because a thin-provisioned disk can be expanded in future. A thick-provisioned disk reserves the disk up to the limit you defined first, and you cannot extend it beyond that; with thin provisioning, if tomorrow for some reason you have to extend the disk by a few gigs, you will be able to do it. So always choose the thin-provisioned disk option. And when you are mounting multiple disks into your servers, say for the root operating system, for application config, for logging data, always make sure that those disks are formatted and that the formatted disks are mounted at boot-up. The cloud provider will not automatically format the additional disks you are using; it only handles formatting for the root disk, which is packaged within the image.
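One common way to bake "format once, mount at every boot" into the image is cloud-init. This is a minimal sketch under assumed device names and mount points (both vary by cloud provider and are not from the episode):

```yaml
#cloud-config
# Format the additional, non-root disks on first boot.
# The root disk is already formatted inside the packaged image.
fs_setup:
  - device: /dev/sdb        # application data / config disk (e.g. the ~60 GB disk)
    filesystem: ext4
  - device: /dev/sdc        # dedicated log disk
    filesystem: ext4

# Mount them at every boot; "nofail" keeps the server booting even if a disk is absent.
mounts:
  - [/dev/sdb, /opt/appdata, ext4, "defaults,nofail"]
  - [/dev/sdc, /var/log/app, ext4, "defaults,nofail"]
```

The same effect can be achieved with `/etc/fstab` entries baked into the base image; the point is that no human ever formats or mounts a disk by hand on a running server.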
We'll talk about how to build a properly packaged image to manage these mount points. So make sure that all the disks you create are mounted at system boot-up; this will help you get immutable infrastructure. Now let's see, with this configuration model, how to build an immutable infrastructure. As I explained, immutable infrastructure means no static configuration on the machine; it's all packaged as a template, so that you can create as many machines as you want, and creating servers becomes a commodity operation. If a server has an issue, just discard it and build another one. To achieve that kind of immutable infrastructure, what you need to begin with is a packaged base image: a customised base image in line with your needs. And what are those needs? The first is saving installation time: the base image will already have the operating system installed, and it will be a standardised operating system, not one server running one distribution and another server running Debian. It will also have a base sizing and config. So create a base image which has a particular sizing, call it T-shirt sizing, for your CPU, memory and disk space. Let's say you're building a base image for your compute services: for any Linux system, 20 GB is enough for the operating system, so to keep some buffer and some space for swap memory and the like, you can keep, say, 30 GB for the root filesystem, and you can add another 60 GB disk for application data.
Now, if you are also generating logs which are stored on the same system before being sent to the collectors, or if you are running a Kubernetes cluster or any kind of dockerized cluster, those containers will also occupy some space for cached images, so you can add another disk. More often you don't need it; two is enough, one for the root system and one for application data, which is composed of these cached images, logs and so on. So separate these out; this is the base sizing for you. Then you can install some default tools into it. Let's say you have a standardised logging stack like ELK, or a standard monitoring or deployment tool which needs some kind of agent to be installed: install these default tools, and package this whole thing into an image.

Now, if you're on cloud these images are called AMIs, and if you are on premise on a virtualization system like VMware, they are ISO files. There are different formats in which you can save these images: VMDK, AMI, ISO, and so on. VMDK is the disk file, AMI is for cloud, and ISO is for virtualized systems like VMware. Always automate the base image creation. There are tools to automate this whole process; you don't have to sit and click through multiple buttons or go through multiple scripts. A tool like Packer is very handy here, as it can output all the standard image types like AMIs and ISOs, and it can also be used to create base images for your containers. Let's say you are in a containerized ecosystem building microservices: you can also create a base image for your microservices. Say you are using a base CentOS image, and that microservice will need certain third-party libraries and certain configuration paths; you can pre-install and package them all. And all of this can be achieved through one tool, which is Packer. So use Packer to automate your system or base images, so that you have one template that can be used multiple times, and your infrastructure becomes truly immutable. The next thing you'll have to take care of: once you have this system in place, you will still need an operating system file, the ISO file. Always use the trusted official source that the community provides. For example, if you're going with CentOS, use the official community version of CentOS from their repositories. And be very careful when you're choosing between an open-source version and an enterprise version. Remember, you might be able to download the enterprise version and use it, but you will be violating the end-user licence agreement if you have not purchased a licence.
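To make the Packer workflow above concrete, here is a minimal sketch of a template that bakes an AMI with the default agents pre-installed. The region, source AMI ID, and agent packages are placeholders I've chosen for illustration, not a tested build:

```hcl
# build.pkr.hcl - hypothetical base-image template
source "amazon-ebs" "base" {
  region        = "ap-south-1"              # placeholder region
  source_ami    = "ami-0123456789abcdef0"   # official distro AMI (placeholder ID)
  instance_type = "t3.micro"
  ssh_username  = "centos"
  ami_name      = "base-centos-{{timestamp}}"
}

build {
  sources = ["source.amazon-ebs.base"]

  # Pre-install the standard logging and monitoring agents,
  # so every server created from this image starts identical.
  provisioner "shell" {
    inline = [
      "sudo yum install -y filebeat",                  # log shipper for an ELK-style stack
      "sudo yum install -y amazon-cloudwatch-agent",   # monitoring agent
    ]
  }
}
```

Running `packer build build.pkr.hcl` then produces the AMI; swapping in other Packer builders gives the same template style for container or ISO outputs.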
So don't just download from any random source offering Red Hat or any other enterprise system; always go to the official site and the official repositories, and be aware of the licensing terms, so that you do not encounter any problems in future. Now, store all the images you build in a centralised repository. You will have a lot of packages to store, so always make sure you have a centralised repository. Nowadays there are artefact repositories available where you can store all kinds of artefacts: images, executables, files, zip files and so on. Some of these tools are Nexus and Artifactory. Make sure you are saving these images there, and that they are not just sitting in a folder on some server. The reason you need an artefact repository is that, with this automation and these standardised images, you will want to pull them multiple times, many times a day, and in parallel you can just go to the repository and download them. And always version your images. If you are using, say, version 8.0 of CentOS and you have built your base CentOS image on it, and next you are going to version 8.2, make sure that when you rebuild the image it is versioned and labelled correctly. Don't just mark it as latest. When you're doing Docker imaging it will always show you latest, but don't keep overriding the image with the latest tag; change the version and then upgrade. Otherwise the whole purpose of imaging and of using a version-controlled, centralised artefact repository is gone, because you will literally have one image that you keep overriding, and you will end up back in the mutable infrastructure loop.
So do not do this. Whenever you make any change to your image after the first version you delivered, always bump the version and repackage it, so you have a previous version and a next version, and you can also add a description saying it has these enhancements, some extra folders or extra users, and why it was done. So always version control the images you are going to use.
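This versioning discipline can be made mechanical with a small guard in the image pipeline. The registry URL and image name below are hypothetical, and the script only prints the commands it would run rather than invoking Docker:

```shell
#!/bin/sh
# Refuse to publish an image tagged "latest"; always require an explicit version.
REGISTRY="nexus.example.com/base-images"   # hypothetical internal artefact repository
NAME="centos-base"
VERSION="8.2"                              # bump this on every rebuild (previous was 8.0)

case "$VERSION" in
  ""|latest)
    echo "ERROR: images must carry an explicit version, not 'latest'" >&2
    exit 1
    ;;
esac

TAG="$REGISTRY/$NAME:$VERSION"
echo "$TAG"
# In a real pipeline you would then run:
#   docker build -t "$TAG" . && docker push "$TAG"
```

Wiring a check like this into CI means an unversioned rebuild fails fast instead of silently overwriting the one image everyone depends on.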

Now, once you have a version-controlled image and you're ready to start provisioning your infrastructure or compute services, you will have two options on a cloud. You can go with the simple compute service, where these are just virtual machines which you launch and on which you install your software; this model is called self-managed. Only the compute server is managed by the cloud, and the rest is managed by you. Then there is the managed model for your compute service, which still works on the same compute layer but gives you a managed interface: you do not have to go into the server configuration, and you can still use the base image you have built, but you do not have to bother about patching, scalability and things like that. This is called a managed service. So there are two models for compute services, managed and self-managed. Some positives of the managed service: the cloud provider manages the system setup, you just tell them, okay, I want my base image, I want this many servers, and I want a Kubernetes service, and they will create it. There are benefits like maintenance and support: the uptime of those servers, their maintenance, and keeping them always up and running are taken care of by the cloud provider, so you don't have to keep monitoring them. They give a certain level of SLA, and if they do not meet that SLA, whether 99.99% or 99.999%, you can get a payback. So a lot of operational headache is gone.
The next thing is performance and high availability. Most of these managed services come with an HA mode and performance features like auto scaling, so you can enable them if you need them, and they will again be taken care of by the cloud provider. High availability means that, if you choose that option, they will spin up multiple servers across regions, which gives you highly available infrastructure. And if you enable a performance mode with auto scaling, they will automatically scale the servers when you reach a certain peak limit. But remember, you will have to tell them what your peak is; the cloud provider will not know whether you want to scale out when you hit 70% of your CPU utilisation or when you reach 80%. You will have to define what you want to achieve, but once you define it, they will take care of the rest. And as I said, you will have to define the config values; the parameters are given by them, you define the values. So configuration management is your only headache; infrastructure provisioning is the cloud provider's responsibility. Now, there are some negatives of this model. The first is a complex pricing model and hidden service charges. Because it is managed by the cloud, sometimes by partners of the cloud provider, you may be charged a large amount based on utilisation: per-second or per-hour utilisation, or how many user accounts you are creating; there are different pricing models. You will literally have to spend some time playing with the pricing model, twisting and tweaking, to see which one meets your budget requirement. And there can be hidden service charges, say an update charge or a network bandwidth charge, which might not be visible right at the beginning.
So finance is a broad issue here: managed services can turn out to be very costly for smaller businesses, and sometimes for larger organisations as well, so you will have to be careful about it. The second negative, I would say, is a lack of ownership and visibility of the management nodes. As we know, for any kind of clustered mode there will be some management servers, a control centre. That control centre remains with the cloud provider, and you will not have any access or visibility inside those systems. You only have control over the configuration values; the parameters are defined by the cloud provider, and you will only have control over the values. You will not know from where the cloud provider is managing it. They might give you some IP addresses and regions, but you will not have 100% visibility like you would with a self-managed system, where you know which is your primary, which is your secondary, where your control centre is, and so on. And this lack of ownership and visibility

might raise privacy concerns, and your auditors or compliance regulators will ask who has visibility into the data, especially for fintechs, say a loan platform or similar services. In some countries there is a clause where they will ask: who has visibility into the data? With whom are you sharing the data? How are you copying the data outside of the country? Those data privacy concerns have to be clarified if you're using managed services. And sometimes these managed services might internally integrate with another public-facing SaaS service to which they send data. Now, if you have some kind of sensitive PII data being sent to third-party servers which are not within your country, that might bring up another challenge in meeting the regulations. So data privacy concerns for managed services should be looked into. Many times, organisations like banks end up going with the self-managed model rather than the managed model, because they have all these concerns about data sharing and data privacy, since the managed service is completely in the provider's hands. The next thing is portability. If you're going with the cloud managed service, you do not have much portability; you only have portability of the data, so you can export and import it. But if you think that, let's say, you have an RDS service on AWS and you can just take everything as-is and implement it on GCP, you may or may not be able to do it; you might need to invest in some data migration tools. So data migration becomes critical, sometimes a real challenge, due to the limited portability, and you enter into vendor lock-in, which might cause you to step back from any kind of migration. You will be literally depending on that particular cloud.
So basically, managed services are very helpful if you are hosting on a single cloud rather than a hybrid cloud, and if you have put in place the right agreement with your cloud provider to cover your regulatory needs.

Now, when it comes to self-managed services, all those challenges and negatives of managed services become the positives of self-managed. With self-managed, the virtual server is created by you and the services are installed by you, so you have complete control, and the IP stays within your control. You can even encrypt the system to add another layer of security. And you do not have the data privacy and confidentiality concerns, because you can regulate who can get inside those servers and who has visibility into the systems and services running on them. There are no hidden costs, because you are only paying for the compute service; there is no hidden service charge for maintenance or support, unlike the cloud managed services model, where there might be hidden charges for add-ons or support. And there is no vendor lock-in factor. Let's say you have a self-managed Redis cluster of three machines on AWS, and you have to create a similar Redis cluster on Google Cloud, in some region where Google is available: you do not have to bother about anything, you just create the machines and install the same setup. You do not have to invest more in it; you have portability across clouds, and that also helps in data migration, or rather application migration, because there is no data on those nodes. Now, the negatives of self-managed services are of course the positives of managed: support and maintenance. You will have to take care of all the operational maintenance on your own. If a particular node is down, if a cluster is down, any kind of security patches, you will have to take care of everything; the cloud provider will not take care of anything.
In a managed service, you only have to take care of config management; in a self-managed service, you will have to take care of both infra and config. Another negative is on-demand scaling. With managed services you have auto scaling: if you define that you want to scale out to four or ten nodes when utilisation hits 70% of the CPU for your Kubernetes cluster, once you define it you are done; the cloud provider will take care of scaling out when it reaches that threshold and automatically scaling back down when the load drops. But in the case of self-managed services, those kinds of operations have to be taken care of by you, through whatever operational mechanism you have. So of course there is an operational and management overhead, but the benefit is that you have 100% control of the system. So choose your compute services based on your requirements. And since I'm talking about immutable, stateless infrastructure and compute services: if you have made your compute service completely stateless and it holds no critical data, then it is okay to go with managed services, because there is no PII data you are storing. Just make sure that you have the right agreement with the cloud provider, so that they are not unnecessarily scanning your systems, in the name of security review or monitoring, and sending data somewhere else. Clarify those things in the agreement you are getting into with your cloud provider. As I've been talking a lot about scaling, let's also discuss scaling. There are two ways to scale. One is vertical, which is scaling up: you add more power to your existing server or machine. The second is horizontal, which is scaling out.
That means you add resources to a system by adding more machines and sharing the processing and memory across them.

So which one should you follow? Always follow the scaling out mechanism. Remember mutable infrastructure: you will end up back in the loop of mutable infrastructure if you scale up, because you are putting a lot of power into a single machine, and that machine becomes a bottleneck for you. So do not end up in that model. Always use a horizontal scaling, or scaling out, model, where you define a certain benchmark in your machine images and packages, and then use a managed service where you define, say: the minimum number of servers I need for my Kubernetes cluster is two, the maximum is four, and I want to scale out whenever utilisation crosses 70% of CPU. That's it, done. With horizontal scaling you will not end up in mutable infrastructure; you get immutable infrastructure and also the auto scaling that you will need during peak hour traffic. Now, auto scaling can be done in two modes. One is truly automatic: you define it and it scales out on its own. In some clouds you may be restricted from doing that, or you may not be very comfortable with fully automatic scaling; in that case make it on demand. That means it still scales automatically, but only once approval is given from your end. You will usually need on demand scaling when you are in the development phase. But once you have reached a comfortable stage with your application, you have done a fair amount of testing, and you know how your application behaves, it's good to go with auto scaling. I would recommend auto scaling for your production-like infrastructure; your development and test infrastructure will not need it. There you might be fine with a no-scaling model, because again, you will have to meet your budget.
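The horizontal scaling rule above — minimum two nodes, maximum four, scale out past 70% CPU — can be sketched as simple decision logic. This is an illustrative sketch only: real cloud auto-scalers implement this for you, and the scale-in threshold here is an assumed value, not from the episode.

```python
# Hypothetical sketch of horizontal auto-scaling logic: min/max node
# counts and a 70% CPU scale-out threshold, as described above.
MIN_NODES = 2
MAX_NODES = 4
SCALE_OUT_CPU = 70.0   # scale out above this average CPU %
SCALE_IN_CPU = 40.0    # scale back in below this (assumed value)

def desired_node_count(current_nodes: int, avg_cpu_percent: float) -> int:
    """Return the node count the cluster should move toward."""
    if avg_cpu_percent > SCALE_OUT_CPU and current_nodes < MAX_NODES:
        return current_nodes + 1      # scale out horizontally
    if avg_cpu_percent < SCALE_IN_CPU and current_nodes > MIN_NODES:
        return current_nodes - 1      # scale back in
    return current_nodes              # stay within bounds

print(desired_node_count(2, 85.0))   # 3
print(desired_node_count(4, 90.0))   # 4, capped at MAX_NODES
```

In a managed service you only declare the min, max and threshold; the provider runs the equivalent of this loop for you. In a self managed setup, this is the logic you would have to operate yourself.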
You cannot just overspend your budget with zero utilisation, thinking, oh, I will need this down the line in six months, so let me just create it and keep it running. Remember, all of the services from the system layer onwards, as I mentioned earlier, will charge you per second, per hour or per day, based on how long your server is up and running, no matter whether you have 100% utilisation or 0% utilisation. As long as your server is up and running, you will be charged for it. So be mindful of that when you're building this model. In short: follow horizontal scaling and auto scaling for production-like environments, and on demand scaling for development and test in case you need it. If you do not need it, do not go with automatic scaling there.
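The point that billing tracks uptime, not utilisation, is worth a quick back-of-envelope check. The hourly rate below is an assumed example figure, not a real price from any provider.

```python
# Illustrative arithmetic: an idle server is billed the same as a busy
# one. The hourly rate is an assumed example, not a real price.
HOURLY_RATE = 0.10          # assumed $/hour for one instance
HOURS_PER_MONTH = 24 * 30   # always-on, 24x7

def monthly_cost(instances: int, hours: float = HOURS_PER_MONTH) -> float:
    """Cost depends only on running hours, not on utilisation."""
    return round(instances * hours * HOURLY_RATE, 2)

print(monthly_cost(1))   # 72.0 -- charged even at 0% utilisation
print(monthly_cost(3))   # 216.0
```

A server created "for six months from now" at 0% utilisation still accrues that full amount every month it runs.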

So that brings me to the next thing. Once you have all these features like managed services and automated scaling, you will really hit the requirement of FinOps. As I mentioned earlier, if you're not careful about what you're paying for and what your limit is, you will end up overspending and exceeding your budget. So it is very important, when you're building machine images, to have T-shirt sized machine images. Every cloud provider gives you any number of machine sizes; don't have one type of instance family for one environment and another for production. Have standardised machine images, and the way to do that is to look into what each instance family provides. If your application is compute intensive, use a compute optimised machine family; if it is CPU and IO intensive, use a CPU and IO optimised instance family. And if you know you will only have a certain amount of utilisation, say you are installing a third party software and you know roughly how much it will consume, you can go with a burstable size, which means it will only use what you have defined it to need. Remember, the cost varies between each of these family sizes, so you will have to be very careful about which family you are choosing. Don't unnecessarily go and take a very high-end IO optimised instance family for, say, your Kubernetes cluster. Most often the microservices built on Kubernetes are not IO intensive; they are mostly memory intensive. So think about which one you need, and based on that choose a standard instance family. Don't keep ten different instance families, because your application is one. You may have multiple microservices, but their nature will be the same, or at most of one or two types: you may have a compute or microservice layer, and you may have a messaging or caching layer, so the behaviour will remain the same.
So choose a set of standard instance families. I would recommend: choose a compute optimised instance family for your applications and services, and choose IO optimised instance families for systems like Kafka and Redis, because they are more CPU and IO intensive, and that should be good enough. And make sure that you have packaged these images, as I mentioned earlier, into different versions, so that when you are installing a Kubernetes cluster you can choose the compute optimised one, and when you are installing a Kafka cluster you can choose the memory and IO optimised one. Likewise for the rest.
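That "one standard family per workload type" guidance can be captured as a small lookup, so ad-hoc families cannot creep in. The profile names and family labels here are generic placeholders, not any cloud provider's real SKU names.

```python
# Hypothetical mapping from workload profile to a standard instance
# family, echoing the guidance above. Labels are generic placeholders.
FAMILY_BY_PROFILE = {
    "compute":   "compute-optimised",   # stateless app/microservice layer
    "io":        "io-optimised",        # Kafka, Redis and similar systems
    "memory":    "memory-optimised",    # caching-heavy services
    "burstable": "burstable",           # known, low, predictable load
}

def choose_family(profile: str) -> str:
    """Pick one standard family per workload type; fail on unknown
    profiles rather than letting ad-hoc families proliferate."""
    try:
        return FAMILY_BY_PROFILE[profile]
    except KeyError:
        raise ValueError(f"no standard family for profile: {profile!r}")

print(choose_family("compute"))  # compute-optimised
print(choose_family("io"))       # io-optimised
```

Keeping this table short is the point: a handful of profiles, identical across development, test and production, is what makes the T-shirt sizing enforceable.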

The next thing is very important: always use a subscription model. In cloud terms, do not use a pay as you go model; use a subscription, or prepaid, model if you are going to run the machine for even a month. If you're doing a labs kind of environment and you will be destroying the machine within a day for testing, then it's fine, you can go with pay as you go; just make sure you have stopped or destroyed the machine afterwards. But if your machine is going to sit and run 24x7 for more than ten or fifteen days, or at least a month, make sure you build machines on a subscription model. Now, some of you might think: oh, if I go with a subscription model and I don't need the machine within that month and want to destroy it, I will be stuck. It's not like that. If during the month you want to change or destroy the machine, you can do it and redeem the remaining amount from the cloud provider. So always go with the subscription model. Believe it or not, by following the subscription model, the differential that we saw just last year was around $10,000 in savings, purely from going with the subscription model for the service. And in some clouds, the longer the subscription period, the better the discounts you will get. So if you are not in a position to commit for more than a month, don't go for it; but if you are able to commit for, let's say, six months or one year, you will get a better discount. And if you are able to commit for two or three years, all the more reason to go with a subscription model. It will literally bring down your cost, and when you're going with a subscription, the cloud provider gives a good amount of discount.
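To see how a subscription commitment adds up, here is a back-of-envelope comparison. The 30% discount and the monthly bill are assumed figures for illustration; actual discounts vary by provider and commitment length.

```python
# Back-of-envelope comparison of pay-as-you-go vs a subscription
# (prepaid) model. The 30% discount is an assumed figure; real
# discounts depend on the provider and the commitment period.
def annual_savings(monthly_on_demand: float, discount: float) -> float:
    """Savings per year from committing at the given discount rate."""
    yearly = monthly_on_demand * 12
    return round(yearly * discount, 2)

# e.g. an assumed $2,800/month on-demand bill at an assumed 30% discount
print(annual_savings(2800, 0.30))  # 10080.0 -- in the ballpark of the
                                   # $10,000/year figure mentioned above
```

The longer the commitment, the larger the discount factor typically gets, which is why a one, two or three year subscription pays off for anything that runs 24x7.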
And at scale, you can even negotiate that price if you have a good relationship with the cloud provider. Now, suppose you are not able to go with the subscription model and you still have instances created pay as you go. Make sure that you stop those instances when they are not in use. If you have instances which are non-critical and just for testing, you will not want to keep them running during the night or during non-working hours. So stop those unused instances, and automate that process. Cloud providers give you APIs to automate it: write a script to stop the instances and run it on a schedule around your office hours. This is typically for testing or labs kind of environments. So basically, overall, for effective and optimised FinOps: always go with a subscription model, make sure the servers are T-shirt sized, and use a standard instance family chosen according to the service behaviour.
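The "stop unused instances outside office hours" idea boils down to one decision per instance. The sketch below shows only that decision logic; a real script would feed it the current hour and then call the cloud provider's stop-instance API via its SDK. The office hours are an assumed 09:00 to 18:00 window.

```python
# Sketch of the scheduled-shutdown decision for non-production
# instances. Office hours are assumed; a real script would call the
# cloud provider's API to stop instances this function rejects.
OFFICE_START, OFFICE_END = 9, 18   # assumed 09:00-18:00 working hours

def should_be_running(hour: int, is_production: bool) -> bool:
    """Non-production instances run only during office hours."""
    if is_production:
        return True                # never auto-stop production
    return OFFICE_START <= hour < OFFICE_END

print(should_be_running(22, is_production=False))  # False -> stop it
print(should_be_running(22, is_production=True))   # True
```

Run on a schedule (a cron job is the typical choice), this keeps test and lab machines from billing through nights and weekends.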

Now, that brings me to the next thing, which I'd say is the challenge of security. Security is very important today, and not just for banks, fintechs and healthcare. Even service organisations have compliance models to meet, like SOC 2, SOC 1, etc., and every compliance benchmark needs your systems to be secured. So how can you secure your system to meet those compliance needs? The first thing is hardening: the system has to be hardened and protected, starting from a standard official image. Even if it is open source, it still has to come from a trusted vendor or approved repository. And it has to be hardened. Now, how can you harden it? There are various ways. If it is a Linux system, you can enable iptables, or SELinux in a permissive mode, so that permissions are not unnecessarily given to all files and folders. Avoid root account based setups; create separate accounts for your users and services. And every server should have a complex password policy if you're using passwords. If you're using a passwordless mechanism, that is the best: an SSH key based, passwordless mechanism is the best way to give access to your systems, even for service to service communication. But if you still have a password based mechanism, then use a complex password policy. And all servers should be scanned regularly for any vulnerabilities. These vulnerabilities can be in your base operating system: let's say the OS vendor has identified security issues in a particular version and released patches for them. Those have to be scanned for regularly; monthly is good.
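For the rare case where passwords cannot be avoided entirely, a "complex password policy" can be enforced mechanically. The length and character-class thresholds below are assumptions for illustration; key-based passwordless SSH remains the recommended default.

```python
import re

# Illustrative "complex password policy" check. Thresholds are assumed
# example values; passwordless SSH-key access is still preferred.
def is_compliant(password: str, min_length: int = 14) -> bool:
    """Require minimum length plus upper, lower, digit and symbol."""
    return (
        len(password) >= min_length
        and re.search(r"[A-Z]", password) is not None
        and re.search(r"[a-z]", password) is not None
        and re.search(r"[0-9]", password) is not None
        and re.search(r"[^A-Za-z0-9]", password) is not None
    )

print(is_compliant("short1!A"))             # False -- too short
print(is_compliant("Str0ng&Lengthy-Pass"))  # True
```

Wiring a check like this into account provisioning keeps the policy enforced rather than advisory, which is what the compliance benchmarks actually look for.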
But if you can scan very regularly, like on a daily basis, you will be updated immediately when a vulnerability is published and you can patch it. And alerts should be sent out whenever any modification is made to your system, network policies or security group policies. The other thing is an approval system for upgrades. Remember, these patches might sometimes end up disturbing your existing setup. Usually when vulnerabilities are published they are divided into three levels: low, medium and critical. Look into the critical ones first, because they are, of course, critical. And first test those patches on your labs or testing systems before applying them to production; don't just blindly go ahead and patch the system. Because if those changes impact other software installed on your system, you may have to highlight it to your regulator, or you may have to go back and see what can be done to mitigate it. For example, recently an upgrade to TLS 1.3 was causing issues in many systems on our side, so we had to push back and stay on TLS 1.2. Sometimes you get that leeway. Usually these security patches are released on the latest version, but the good practice is to always be on n minus one, one version behind the latest. If you are at least able to update your system to n minus one, it is good enough for regulators. But remember that you will have to keep tracking these vulnerabilities, so that you do not just sit on n minus one: someday that n minus one can become n minus two. So be prepared to upgrade and find the solution for it. Hardening, patching and regularly auditing your system through an automated system are very important to achieve the security that you need before you host your application. And if you already have security benchmarks ready, implement them on your base image itself.

Otherwise you will have to build a hardened image again later. So if you can, do it at the beginning, so that from day one your developers start getting onboarded and testing your application on the hardened base, and you do not have to think about hardening and patching when you are about to go live. The earlier you take these measures, the better for your releases. And all of this should be managed using a configuration management system. Take upgrades: suppose you have to apply the next minor version or a security patch, and you have, say, 100 virtual servers. You might say, oh, I am on immutable infrastructure, why not just rebuild all 100 virtual servers from the image? You can do that; it depends on which is faster. Immutable infrastructure is the base model, but you cannot do too many configuration changes through it alone. And sometimes these patches are specific to one particular service, say Kubernetes patches: you do not want to apply Kubernetes patches onto your Kafka servers or your AD servers, right? That's why you should think about an accompanying configuration management system, with automated, version controlled configuration parameters that can be applied on demand whenever you need them. Configuration management is very important, so that you will not hear things like "it works on my machine, but not in prod". A configuration management layer on top of the packaging and immutable infrastructure model will give you an enhanced way to extend your infrastructure and make it more scalable. And one thing I'd like to touch upon, but not deep dive into, is that in today's age of containerized infrastructure, when you're installing software, you can choose a containerization model to install any base software as well, because that gives you a cloud agnostic platform.
Containers are not tied to, say, AWS EC2 instances or Google's Compute Engine; you can run them anywhere, provided there is an underlying Linux infrastructure. So when you're installing a software, let's say a logging tool, a monitoring tool or even a centralised artefact repository, you can use containers to install it. And since you're already using Packer to package your system images, you can also package your base container images. For stateless applications, containerization really works well. But remember, when you are using containerized applications or services, do not make them stateful. They will give you an option to make them stateful, but do not go into it, because containers are just processes running on the system; they do not care about data replication. Even though the platform will provide you the mechanism, it will not give you the data replication, so make them stateless, always. So, this brings me to the summary of the whole system factor: how can you achieve a robust infrastructure by designing well defined systems? What you need for today's modern digital world is, first, a compute service that is easy to scale out on demand during peak hours; you don't want to sit and wait for your operations team to extend it, you want it to perform as soon as you hit the traffic. The next thing needed is security: servers that are secured enough to host public facing applications. And you need to prevent overspending on infrastructure, or overspilling your budget. Now, how do you achieve these? Since this episode is purely about systems, there are additional ways on top of these that I'll be explaining in the next episodes, but let's talk about what you can do at the systems layer, for the system factor, to achieve the needs of performance, security and FinOps.
First, immutable and stateless compute servers built from version controlled machine images, standardised across all your environments, and enhanced with automated horizontal scaling, either on demand or as horizontal auto scaling. This gives you the performance. Second, regular validation of system hardening, patching and upgrades through a configuration management system. This will meet the security policies that you need to match regulatory requirements. Third, to prevent overspending on infrastructure, go with a prepaid model for hosting your services on cloud; that will save a lot of the cost that you would otherwise be spending on a pay as you go model.

This, in summary, is the system factor of the 10 factor infra framework. I hope you liked this session and that it helped you get more visibility into the anatomy of modern infrastructure and how to make it robust enough to meet today's demands. In the last two episodes we have covered two factors from the 10 factor infra framework: networks and systems. Next week I will be discussing the brain of modern infra, where all the data and information is stored. Yes, you guessed it right: the next episode will be about storage, the third factor in the 10 factor infra framework. On that note, I would like to conclude today's episode. You can get the transcripts of both episodes on cloudkata.com, and connect with me on cloudkata.com. Send your queries and questions, or if there is anything specific that you would like me to cover in my podcast, let me know on cloudkata.com. This is your infra coach Kamalika, signing off. Enjoy your weekend, take care, stay healthy, stay safe. Goodbye, and see you next week.

Transcribed by https://otter.ai
