FinOps Gaps: From Theory to Production

[00:00:00] Speaker A: You read the book, you passed the certification, but your cloud is still a mess. Today we are leaving the classroom and entering the FinOps work room. We have Olsa from Azure Cloud Academy to teach us why a FinOps implementation normally takes three times longer than you planned. You will learn how to stop drowning in cost alerts, and the exact moment when you should, and when you should not, automate a fix, if you are tired of building dashboards and alerts that nobody looks at. We are going to start right off the bat with the question. So for you, Alpha, what is the biggest gap between FinOps theory and what really happens in production?

[00:00:38] Speaker B: Yes. Actually, FinOps is a framework. It's a set of practices and a cultural practice. It aims to maximize the business value of using cloud technology, and it has some principles, six principles. One of the principles where I see an important gap between the theory and the real production world is business value: how the business value drives the technology decisions. Here there is an issue with knowledge spread. What I saw in different contexts of implementing the technology is an issue with spreading knowledge of the business value. When implementing something in an agile team, we implement an incremental digital product, and the team does not always have an idea of the business value of the feature they are implementing. Generally this input comes from the product owner or the product team, and there is generally a ratio between the business value and the risk of putting this feature in production. Generally this ratio is not known in advance. So this is one of the things that makes a big gap between the theory and production.

[00:02:11] Speaker A: Yeah, that's a great insight. I also believe that the product connection to the business value, and making it really happen, is a very difficult thing.
And it's a real gap between FinOps theory and what really happens in production. There is a long road between this idea and implementation.

[00:02:32] Speaker B: It's one of the gaps that we can highlight between the theory and the real world. But from my point of view it is one of the biggest gaps, because the project team, let's say, implements a feature, but do they really know the business value of this feature? It's not always the case. Generally, to avoid implementing things with low business value, you really have to question and keep continuously asking for this ratio: what is the value versus the risk if I'm upgrading the digital product with this function?

[00:03:18] Speaker A: Yeah, that's interesting. And another question regarding the product side: for you, what are the aspects of a FinOps implementation that normally take two times, three times longer than expected to implement?

[00:03:34] Speaker B: Yeah. One of the aspects that consistently takes longer than we expect is ownership: how we assign ownership, and how engineers feel about it compared to what they are generally used to doing. Engineers generally do not feel ownership of the cost. They feel they need to deliver something that works, functionally speaking, but they generally do not have this cost ownership, because they say: if it works well and I can deploy it smoothly, then even if it costs twice what it should, it's not my problem. Let's say it's more the finance department's problem. In big organizations, assigning cost ownership to engineers is one of the cultural aspects that the implementation should lead, along with changing habits. We are all human. We have a comfort zone, we are used to doing some tasks, and we are quite comfortable doing engineering tasks: development, testing, etc.
But changing some habits to integrate this FinOps culture takes some time, actually. This is always a continuous enablement; it's not just one setup after which we take no further action. We continuously need to take actions and enable the teams.

[00:05:23] Speaker A: Yeah, definitely. It's a continuous implementation movement; it's not a one-time thing. And moving on to one of the most common things in FinOps, which is the anomaly. We have talked about the engineering side, and I come from the engineering side, from observability. We have a lot of anomalies there, especially for something going down, on the availability side, right? But FinOps also introduces the concept of anomalies for costs. So, talking about alert fatigue: how do you balance sensitivity and accuracy in the alerts against the alert fatigue that anomaly detection causes?

[00:06:07] Speaker B: Yes. One root cause of anomalies is provisioning issues. In different contexts, we don't always have automatic provisioning templates, or some parts are still done manually, and to err is human. For example, when setting a log retention duration, someone makes a mistake: instead of putting 90 days, they put more, or add a zero to some duration, which can happen if you do provisioning manually. We can also see autoscaling issues, for example when working on cluster provisioning, if the strategy is not defined for when it should scale. So this is another issue that can happen. It can, for example, scale in a test environment, and we should not have this case: a test environment is typically only for continuous integration.
Another issue, which is quite common, is not properly deleting orphaned resources: resources that are not used anymore but were not deleted correctly. And most of the anomalies detected, generally more than 80%, are configuration issues. That recaps the three common root causes.

[00:07:44] Speaker A: Yeah, and those are definitely three of the typical enemies of any FinOps and engineering practice. And regarding these alerts, they normally happen frequently, so there is the issue of sensitivity. How do you adjust the sensitivity for this anomaly detection?

[00:08:08] Speaker B: When we're building alerts and dashboards, we ask ourselves how to strike a balance and not spam, let's say, the people being alerted with something that is not pertinent. Generally we should have an organized method: start with the basic things and then refine. Make something basic, for example a monthly alert, and then refine it to daily, hourly, per service, or per tenant, depending on what you want to show. Also avoid false positives. For example, we can have an alert on a promotion day that turns out to be a false positive. To avoid that, we should configure the alert with an absolute threshold plus a percentage. Meaning, if for a period we expect double traffic for this scope, we should configure the alert with this absolute threshold plus 10%. If the cost increases by a percentage that we define, that should really be the alert threshold. And, as I said, avoid alerting during periods where we expect something different. For example, in the deployment phase we expect degraded CPU, for example, because we are trying to scale up. So we generally avoid triggering alerts during the deployment phase.
It's not relevant then. And yeah, those are generally the things to look at. We say that an alert that goes unseen is more dangerous than one that was never defined, because with the automatic mechanisms of the cloud, even in Azure, some alerts actually trigger actions, and if the alerts are not relevant, the actions they trigger could lead to big issues.

[00:10:42] Speaker A: Definitely, definitely. If you are not 100 metrics, it's definitely a problem. And to work on the anomalies and fix them, what are the fastest remediation wins that people or teams can implement?

[00:11:02] Speaker B: Yeah, so definitely the fastest remediation is automatic shutdown for non-production instances. We know these VMs are used only for QA activity or some performance tests. The rule is to do automatic shutdown on the weekend and to define office-hours rules. This helps save 20 to 40% of the cost. It's the fastest remediation.

[00:11:36] Speaker A: Definitely simple, and you can save a lot. If you can afford it, if you don't have 24/7 availability requirements and all of that, then I think it's a good one. And we just talked about what we should do, so what about what we shouldn't do? When should we not automate the remediation?

[00:12:00] Speaker B: Yeah. If, for example, we don't know the context: that's automation without context. We don't know the business case for this client, and we are trying to automate the delivery process, for example, or trying to build a new metric, but we don't actually know the context or the business case. In this case it will lead to inefficiency.

[00:12:29] Speaker A: Yeah, that's totally true, and I think it's an inefficiency you can totally avoid with that simple consideration, right?

[00:12:41] Speaker B: Yes. It doesn't actually make sense to do this automation.
Imagine we have, I don't know, a remediation on config rules or on a user permission set that remediates something automatically without knowing the consequence of that. Sometimes correcting it takes more time than setting it up.

[00:13:09] Speaker A: Yeah, definitely. You don't want to spend more time fixing automations than the time they are going to save you. You can apply that to everything. And talking about the problems with automation and trust: how do you build trust with the engineering teams for automated remediation?

[00:13:35] Speaker B: Yeah, generally the first thing is really to assign the action to the service owner and not to someone in finance or the CFO: each action goes to the specific service owner. This gives ownership to the people. We also need to add automated rollback to the remediation, always keeping in mind that a remediation can fail, because there is a big drift between two deployments, for example, or whatever the issue may be, depending on the context. So we always need to think about automating the rollback in case the remediation doesn't work, and also to always show the evidence of savings between the state before the remediation and after. That is how we can build trust.

[00:14:41] Speaker A: Yeah, that's good advice. I think building trust also comes with having good policies and explaining them to the engineering teams, so they are able to understand them, see the value, and see the process. Sometimes you have an automation and you don't get what it is doing. If you explain the process and the logic behind it, they will probably buy in to that implementation, right?
When you realize that it's saving you work time rather than giving you more, then you automatically buy in. And talking about FinOps practice and building trust: from your perspective, what is one thing that most people do in FinOps that you think is a waste of time, or not a very efficient use of time?

[00:15:38] Speaker B: Yeah, what I commonly see happening is in dashboards and observability. We have too many metrics, and that doesn't help with making consistent decisions. Many metrics, like the error rate for this API or CPU usage, are shown with a lot of data but without what to do next. And there is no ownership mapped, because generally if we deploy resources and there is no tag for the owner, we don't know who did what. So not having this ownership mapping is also a big mistake. Then there is not having the context. Something shows up in the cost: we know that this deployment, for example, raised the cost, but we don't know for which context, for which client in case we are multi-tenant, or why. Which feature introduced this cost gap?

[00:16:47] Speaker A: Yeah, that's also true. I also think that metrics need to come with context. They need to mean something so you can get insights. For example, if you check Google Analytics, you have thousands and thousands of values and metrics, and to be honest, when you're there you don't understand anything from these metrics.

[00:17:10] Speaker B: If there is no what-to-do-next information. So actually you can see a lot, but you don't know what decision to make. We have some spikes and a lot of data, but no quick decision or next plan.
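
The ownership point above (resources deployed with no owner tag, so nobody knows who did what) can be illustrated with a minimal sketch. It assumes a hypothetical inventory export in which each resource carries a `tags` dictionary and a `monthly_cost`; the field names, the sample data, and the `UNOWNED` label are illustrative assumptions, not something named in the episode.

```python
from collections import defaultdict

# Hypothetical inventory records. In practice these would come from a cloud
# provider's resource or billing export; the field names are assumptions.
resources = [
    {"name": "vm-api-01", "monthly_cost": 120.0,
     "tags": {"owner": "team-api", "tenant": "client-a"}},
    {"name": "disk-old-07", "monthly_cost": 35.5, "tags": {}},
    {"name": "db-reports", "monthly_cost": 300.0,
     "tags": {"owner": "team-data"}},
]


def cost_by_owner(items, unowned_label="UNOWNED"):
    """Sum monthly cost per owner tag; untagged resources go under unowned_label."""
    totals = defaultdict(float)
    for item in items:
        owner = item.get("tags", {}).get("owner") or unowned_label
        totals[owner] += item["monthly_cost"]
    return dict(totals)


def missing_owner(items):
    """Return the names of resources that have no owner tag."""
    return [item["name"] for item in items
            if not item.get("tags", {}).get("owner")]


print(cost_by_owner(resources))
print(missing_owner(resources))
```

Reporting the untagged spend as its own line gives the dashboard a "what to do next": someone has to claim or delete whatever lands in that bucket.
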
[00:17:28] Speaker A: And you gave good advice on this, but if someone listening to us is starting their FinOps journey today and you could give them one piece of advice, what would it be?

[00:17:43] Speaker B: Good question. Someone who is trying to understand and practice FinOps always needs to keep in mind that it's a practice. It's not a tool; it's not a set of tools. So the point is to continuously practice this enablement and be aware of the cost consequences of what we are developing, and not just "I'm good at this tool and that tool." It's not a set of tools, it's a practice. This is the main point to keep in mind.

[00:18:19] Speaker A: Yeah, I think that's a good point to keep in mind. You need to be aware of the practice and implement it, but you also need to think about mastering the whole thing. It's a practice, and it's an ongoing thing. If you are liking this episode, don't forget to check out the automation special episode of the podcast with Ernesto and Astin, so you can learn more about how to automate your anomalies. And with that, I think we can finish today's episode. So thanks, thanks for being here today.

[00:18:52] Speaker B: Thank you. Thank you, Victoria.
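
The alerting rule described at 08:08 (an absolute threshold plus a percentage, measured against the traffic you actually expect, with no alerts during the deployment phase) can be sketched roughly as below. This is one possible reading of that advice, not a specific cloud provider's API: the function name, the default values, and the choice to require both conditions at once are assumptions made for illustration.

```python
from datetime import datetime


def should_alert(cost, expected, now, *, min_increase=50.0, pct=0.10,
                 quiet_windows=()):
    """Decide whether a cost reading deserves an alert.

    cost          -- observed cost for the period
    expected      -- expected cost (e.g. doubled for a promotion day)
    min_increase  -- absolute floor: ignore overruns smaller than this
    pct           -- relative floor: ignore overruns below this share of expected
    quiet_windows -- (start, end) datetime pairs, e.g. deployment phases
    """
    # Suppress alerts while a deployment (or other expected disruption) runs.
    for start, end in quiet_windows:
        if start <= now < end:
            return False
    increase = cost - expected
    # Require BOTH floors, so tiny scopes and expected peaks do not page anyone.
    return increase > min_increase and increase > expected * pct


deploy = (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 10, 0))
noon = datetime(2024, 5, 1, 12, 0)

# Promotion day: expected spend is already doubled to 2000.
print(should_alert(2100, 2000, noon, quiet_windows=[deploy]))  # within 10%
print(should_alert(2300, 2000, noon, quiet_windows=[deploy]))  # real overrun
print(should_alert(2300, 2000, datetime(2024, 5, 1, 9, 30),
                   quiet_windows=[deploy]))                    # during deploy
```

Raising `expected` for a known promotion day, rather than loosening the thresholds, keeps the same rule usable on ordinary days.
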
