De-mystifying Cloud Operations


Demystifying Cloud Operations

Successful cloud acceleration requires superior operational diligence to keep business apps hosted in the cloud high-performing and always available.

A well-oiled, transparent approach to cloud operations is critical to deriving true business benefits from your cloud journey.

In a recent whitepaper, we covered the most common security blind spots to be aware of while onboarding to public clouds. In this whitepaper, we cover the foundational pillars and common patterns needed to operate your hybrid cloud environment effectively.

If you fly by the seat of your pants while operating your hybrid cloud, you could put your business at risk¹.

It won’t be clear and sunny skies. You need a well-instrumented cockpit with well-defined systems and procedures to soar through the cloud.

The Cloud Ops dilemma


Balancing control and agility is a tough task while flying at your optimal altitude.

IT executives look to synchronize velocity and safety while operating the cloud, and they ask questions such as:

Do I have full visibility of my hybrid cloud infrastructure and apps?
Is my cloud environment fully compliant with existing regulatory policies?
Am I exceeding my budgets?
Are my deployments faster than on-premises?
How can I enable self-service for my app builders while ensuring zero drift from my governance guardrails?
Should I invest in new tools and Ops frameworks for cloud or extend my existing tools and processes to the cloud?

Cloud Operations Transformation


Cloud providers draw clear boundaries for support in the different cloud service models and require their customers to own support and security for key areas.

Shared responsibility model in the cloud

So, customers have a lot to consider in ensuring security and resiliency for their cloud infrastructure.

Many customers assume that a public cloud providing Infrastructure-as-a-Service (IaaS) is equivalent to classic outsourcing². They expect that public cloud providers are full-service providers and should be managed like a traditional outsourcing provider³. They underestimate the implications of the shared responsibility model. Understanding the support responsibilities and deploying the right tools and controls to manage your piece of the cloud is critical. Many organizations struggle with the “Day 2” operations of their cloud.

In general, born-in-the-cloud startups and small IT organizations prioritize velocity and agility but operate their cloud environment without many written rules. That may work for a handful of cloud-hosted applications. However, for most organizations, there are regulations and policies to be met, and these organizations must carefully consider how they’re going to manage their cloud operations as they move forward. Clear governance is worth the time it takes to ensure a secure, high-quality cloud ops platform.

On the other hand, large corporations that have many decades of IT operations experience now need to evolve their people, process, and tools to handle the DevOps mode of operations. They typically have the governance process in place but lack the agility and velocity to be successful with cloud ops. If they fail to evolve and try to operate the cloud using their traditional processes, they will not realize the benefits of cloud ops and may end up repatriating from the cloud.

Business-driven approach

Cloud migration is only a means to an end: business growth. So, cloud infrastructure operations frameworks and methodologies must be devised to meet business needs. We must take an outside-in approach to designing our cloud operational framework so that it adequately meets the business requirements. Business owners don’t care what technology or tools the IT team uses to deliver business applications for their customers. Every enterprise is now striving to accelerate its digital transformation and increase its customers’ delight in using its business apps.

Net promoter score (NPS) and service level agreement (SLA) performance are some of the key metrics to measure customer delight and affinity. We must arrive at the service level objectives (SLOs) needed to achieve the desired NPS and SLAs. Once we identify the SLOs, we should calculate the Error Budget, a concept popularized by Google as part of site reliability engineering (SRE). It is a measure of risk, or the amount of headroom IT Operations has above its agreed SLA. As the number of 9’s in the SLA goes higher, the window of acceptable downtime (Error Budget) shrinks to minutes or even seconds.
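To make the relationship between the number of 9’s and the Error Budget concrete, here is a minimal sketch. It assumes an availability-style SLO measured over a 30-day window; the function names and the window length are our own illustrative choices, not part of any SRE standard.

```python
def error_budget_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for an availability SLO over a period."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def budget_remaining(slo_percent: float, downtime_minutes: float, period_days: int = 30) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    budget = error_budget_minutes(slo_percent, period_days)
    if budget == 0:
        # A 100% SLO leaves no budget at all.
        return 0.0 if downtime_minutes == 0 else float("-inf")
    return 1 - downtime_minutes / budget


for slo in (99.0, 99.9, 99.99, 99.999):
    print(f"{slo}% over 30 days -> {error_budget_minutes(slo):.2f} minutes of budget")
```

Under these assumptions, a 99.9% objective leaves roughly 43 minutes of budget per 30 days, while 99.999% leaves well under a minute.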

Higher levels of application resiliency need significant investments. At the same time, IT organizations should not over-engineer or over-invest in operational tools and processes. Too many tools will result in tool fatigue and may reduce the overall return on investment.

We need clear, measurable goals while designing a cloud operational framework aligned with business objectives:

Improved MTTD (mean time to detect) through better observability instrumentation.
Improved mean time to de-stress through operational efficiencies from better team collaboration, better skills, and better traceability.
Reduced MTTR (mean time to repair) to drive higher service levels (more 9’s).
Cost avoidance (penalties and credits).
Increased number of defect-free deployments through automation and better deployment strategies.
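Goals such as MTTD and MTTR are only measurable if incident timestamps are recorded consistently. The sketch below shows one way to compute both from incident records. The `Incident` fields and the sample timestamps are made up for illustration, and we assume MTTR is measured from detection to resolution (some teams measure from the start of the fault instead).

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import List


@dataclass
class Incident:
    started: datetime   # when the fault began
    detected: datetime  # when monitoring (or a person) noticed it
    resolved: datetime  # when service was restored


def mttd_minutes(incidents: List[Incident]) -> float:
    """Mean time to detect: average of (detected - started), in minutes."""
    return mean((i.detected - i.started).total_seconds() for i in incidents) / 60


def mttr_minutes(incidents: List[Incident]) -> float:
    """Mean time to repair: average of (resolved - detected), in minutes."""
    return mean((i.resolved - i.detected).total_seconds() for i in incidents) / 60


# Illustrative sample data only.
incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 12), datetime(2024, 1, 5, 11, 0)),
    Incident(datetime(2024, 2, 9, 22, 30), datetime(2024, 2, 9, 22, 34), datetime(2024, 2, 9, 22, 50)),
]
print(f"MTTD: {mttd_minutes(incidents):.1f} min, MTTR: {mttr_minutes(incidents):.1f} min")
```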

The Way Forward


To meet the desired resiliency and security, organizations new to cloud ops must start their transformation with a balanced approach that expands the traditional People, Processes, and Tools Venn diagram.

Technology and tools challenges are relatively simple to solve with money. However, people and process transformations are fraught with predictable and unforeseen problems along the way, even in the most ideal environments.

Build the Core Team

It is common to quickly stand up a DevOps (sometimes called site reliability engineering, or SRE) team with a few full-stack developers and expect them to build and operate a large enterprise IT environment in the cloud. In reality, this is rarely effective. Although the developer-heavy DevOps approach has many advantages, large hybrid cloud operations need specific infrastructure operations skills such as network management, element management for on-premises gear, and enterprise-level centralized log management. Infrastructure automation is another area that developers may find difficult.

One way to address this challenge is to build a highly specialized CloudOps Ground-Crew team by cherry-picking the best talents from different teams: Datacenter IT, Full-stack Dev, and Automation SMEs.

Here are a few recommendations to build an effective Cloud Ops team:

Schedule rotation among the ground-crew team to reduce risk and spread the knowledge throughout the system.
Encourage automation of manual operational tasks.
Erect sufficient process guard rails to prevent major human error.
Ensure blameless postmortems.
Cultivate a culture of recovering fast rather than of preventing every accident or human error.
Plan simulated outage drills to sharpen tools, team, and processes.

Update processes

In traditional IT environments, changes and releases are performed manually. Organizations establish a change approval board (CAB) to review and assess the impacts of each change and approve it for implementation in the next scheduled change window.

Unlike traditional environments, the cloud enables infrastructure as code which significantly affects operational processes. The age-old manual steps can now be automated and integrated into the continuous delivery pipeline.

In a cloud environment, elasticity, automated provisioning, and decommissioning are the norm. In some scenarios, we may need to perform automated remediation of anomalies through intelligent automation (AIOps). Those actions must be pre-approved by the CAB and triggered as business needs dictate. For new software releases, end-to-end automated continuous deployment pipelines are used. With this approach, a change is automatically promoted through the various stages. Appropriate guardrails like blue-green deployments and canary deployments⁵ are implemented to ensure a seamless application upgrade.
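As an illustration of a canary guardrail inside such a pipeline, the sketch below compares the canary's error rate against the current baseline and decides whether to wait, promote, or roll back. The thresholds, parameter names, and three-way outcome are assumptions made for this example; real pipelines typically weigh more signals (latency, saturation, business metrics).

```python
def canary_decision(
    baseline_errors: int,
    baseline_requests: int,
    canary_errors: int,
    canary_requests: int,
    max_ratio: float = 1.5,
    error_floor: float = 0.001,
    min_requests: int = 100,
) -> str:
    """Return 'wait', 'promote', or 'rollback' for a canary release."""
    if canary_requests < max(min_requests, 1) or baseline_requests == 0:
        return "wait"  # not enough traffic to judge yet
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    # Tolerate a canary error rate up to max_ratio times the baseline, with an
    # absolute floor so a zero-error baseline does not block every release.
    threshold = max(baseline_rate * max_ratio, error_floor)
    return "rollback" if canary_rate > threshold else "promote"


print(canary_decision(baseline_errors=10, baseline_requests=10_000,
                      canary_errors=50, canary_requests=1_000))
```

A gate like this is the kind of automated action that can be reviewed and pre-approved once, rather than per release.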

Use the Right Technology and Tools

Cloud environments need a different approach for instrumentation because the observability of cloud apps needs enhanced tooling. Cloud providers offer a host of native tools for monitoring, log management, security, and governance. At the same time, the market is very crowded with ISV tools for cloud management.

For service management and collaboration, most large enterprises tend to extend their existing legacy tools to the cloud. This approach often faces challenges because of feature limitations, licensing complexities, and integration issues.

A 100% cloud-native approach to tooling and management has its own drawback: many CSP-native tools are less mature in features than established third-party IT management tools⁴. Each enterprise is unique, and there are no “best practices” that apply to every organization. We should consider business needs, cost implications, integration requirements, and vendor lock-in aspects before instrumenting the cloud. In the Interop ITX 2018 State of the Cloud report⁴, 37 percent of respondents said that they use cloud-native tools. Unfortunately, relying on these native tools keeps enterprises from having a single-pane-of-glass management dashboard, especially in a hybrid or multi-cloud scenario.

For a hybrid cloud enterprise environment, we believe that the ideal approach is to continue to rely on your investments in on-premises third-party tools and, where possible, extend them to the cloud through integrations with cloud services, because most ISVs will eventually enable their platforms with cloud management features. CSP proprietary tools, on the other hand, will focus on a specific cloud, without much support for on-premises workloads.

We help enterprise customers achieve the ideal interim state by marrying both worlds: integrating on-premises management platforms with cloud-native tools and services (Amazon CloudWatch, CloudWatch Logs, Azure Monitor, Azure Application Insights, and others). Most of the tools allow this integration through API hooks or connectors. This way, cloud service anomalies are monitored, and customers can continue to use their familiar dashboards and consoles to manage their hybrid environment.

However, in the case of new microservice-based applications, organizations may need modern tools to manage these environments. We will cover this in detail in the Tools Approach section below.

Service Operations in the Cloud

Once you have the Cloud Ops ground crew in place, you need to ensure that collaboration between team members is real-time, agile, and transparent.

Traditional collaboration tools may become a bottleneck for the new operations culture of Cloud Ops/DevOps/SRE. Modern ChatOps⁶ tools are here to help.

Tools like Slack, Microsoft Teams, and Mattermost are revolutionizing the way support teams collaborate and resolve issues. Gone are the days when 25 or more people sat on a SEV1 incident bridge for long hours, struggling to collaborate and share their findings to fix a production issue. There are also modern Ops management tools like VictorOps and PagerDuty, which can be integrated with ChatOps tools to drive seamless and efficient incident management.

The Chatbot is the new team member in triaging sessions using ChatOps. Chatbots bring efficiencies in interacting with machines that are under investigation. This is a good approach for better security too. The Chatbots could own privileged access to the systems and could provide instant health checks and command runs as requested by technology SMEs on a SEV1 bridge. This way, the SME doesn’t need to be given privileged access to sensitive systems.
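The sketch below illustrates the idea of a bot that holds the privileged access and exposes only an allow-list of health checks to people on the bridge. It is not tied to any chat platform; `check_disk`, `check_uptime`, and the command syntax are hypothetical stubs standing in for whatever privileged checks the bot would actually run.

```python
import shlex
from typing import Callable, Dict, List, Set


def check_disk(host: str) -> str:
    return f"[stub] disk usage report for {host}"


def check_uptime(host: str) -> str:
    return f"[stub] uptime report for {host}"


# Only commands listed here can be run through the bot.
ALLOWED_COMMANDS: Dict[str, Callable[[str], str]] = {
    "disk": check_disk,
    "uptime": check_uptime,
}

audit_log: List[str] = []


def handle_chat_command(user: str, message: str, authorized_users: Set[str]) -> str:
    """Run an allow-listed health check on behalf of a chat user."""
    audit_log.append(f"{user}: {message}")  # record every request, allowed or not
    if user not in authorized_users:
        return "denied: you are not authorized to use this bot"
    try:
        parts = shlex.split(message)
    except ValueError:
        return "error: could not parse command"
    if len(parts) != 2 or parts[0] not in ALLOWED_COMMANDS:
        return "usage: <" + "|".join(sorted(ALLOWED_COMMANDS)) + "> <host>"
    command, host = parts
    return ALLOWED_COMMANDS[command](host)


print(handle_chat_command("alice", "disk web-01", {"alice"}))
```

Because the SME never receives credentials and every request lands in the audit log, the bot doubles as an access-control and traceability point.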

Tools Approach

Identifying the right tool for each of the operational pillars is critical for Cloud Operations success.

Observability Instrumentation

Observability is an important pillar of cloud ops because organizations are now moving towards containerized workloads and scalable microservices architectures in the cloud. Traditional agent-based tools may not scale or be adequate to monitor and trace the modern, distributed cloud-native systems.

There is a notion in the IT community that observability is nothing but monitoring on steroids; however, it’s much more than just monitoring. By definition⁸, “Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.”

Monitoring is an action we do on systems. Observability is enabling or instrumenting the system to help us to detect and track anomalies in the system. Some observability practitioners call this a “digital exhaust”. When you re-architect your legacy apps to microservice-based distributed systems or build green-field apps in the cloud, it is critical to instrument your application with a high level of logging and then enable the operations with modern tools which can monitor the logs and trace the events.
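As a small illustration of instrumenting an application with logs that tools can parse and correlate, the sketch below emits JSON log lines that carry a trace identifier. It uses only the Python standard library; the field names (`service`, `trace_id`) and the `build_logger` helper are our own assumptions, not a standard schema.

```python
import json
import logging
import sys
import uuid
from typing import TextIO


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name: str, stream: TextIO) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


log = build_logger("orders", sys.stdout)
trace_id = uuid.uuid4().hex  # would normally be propagated from the incoming request
log.info("payment authorized", extra={"service": "orders", "trace_id": trace_id})
log.info("order confirmed", extra={"service": "orders", "trace_id": trace_id})
```

Sharing one trace identifier across the log lines of a request is what lets a log or tracing tool stitch together events from different services.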

Observability should start from the customer experience, and its instrumentation approach should encompass synthetic transaction monitoring, APM, Log monitoring, Tracing, and Dashboards. Popular tools for observability and monitoring for cloud-native, distributed business applications are included in the eBook.

Enterprises may choose any of these tools, or a combination of them, depending on business needs, budget, and long-term sustainability. Every organization is unique, so it is difficult to prescribe a single best practice here. RCG helps customers make the appropriate decision with our extensive expertise in building cloud-native apps and applying various tool features.

Lifecycle management

There are many moving parts in managing hybrid cloud environments.

Enterprises can’t provide cloud console or command-line access to every team member. Cloud management platforms (CMPs) are essential for easier and controlled resource provisioning in the cloud. CMPs also help to impose appropriate governance on the environment. An ideal CMP should offer end-to-end support, from self-service catalogs all the way to decommissioning and cost control in the cloud. Functions of cloud lifecycle management include:

1. Service Catalogs

While using cloud resources, it can be difficult to ensure everyone has the right level of access to the services they need. Things get even more complex when administrators need to make sure everything is properly updated and compliant with security standards. Cloud service catalogs help us organize, govern, and distribute application stacks as templates or blueprints, and we can group sets of products into folders called portfolios. By applying Identity and Access Management (IAM) permissions and constraints, administrators can then give users the ability to discover, launch, and manage those products or services on a self-service basis without needing direct access to the underlying cloud services.

2. Account Management

One of the challenges customers face in managing the cloud is managing accounts. Many enterprise customers implement Account Vending Machines (automated account creation) to support self-service and to improve security and governance.

One popular AWS option for automating account creation is AWS Service Catalog. Using AWS Service Catalog, we can quickly generate accounts for the AWS organization, configure baseline templates to employ during the creation process, and provision the new accounts using those templates. The account baseline CloudFormation template is deployed into the newly created account and provisions an AWS Service Catalog portfolio with a set of predefined products that can be deployed on AWS. A Lambda function then creates a user with least-privilege access to the Service Catalog products.

3. Image management

Highly regulated enterprises prefer using hardened, purpose-built server and Docker images in their cloud environments. AWS EC2 Image Builder is widely used to build automated pipelines that produce customized Linux and Windows Server images. When software updates become available, Image Builder automatically produces a new image, runs tests on it, and then distributes it to AWS regions.

Best practices include:

Enforce image refresh policies to ensure regular updates on the golden images.
Implement tagging, versioning, naming, and cataloging of images.
Use pipelines to build/burn the images.
Add vulnerability tests (for both VMs and Docker containers) while baking/burning images.
Enforce retention policies to address legal or compliance requirements.

Other image builder tools include HashiCorp Packer, Azure Image Builder (preview), Google Cloud builders, and others.
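An image refresh policy can be checked mechanically. The sketch below flags golden images whose last build is older than a chosen age. The image names, dates, and 30-day limit are illustrative assumptions; in practice the build dates would come from your image catalog.

```python
from datetime import date
from typing import Dict, List


def stale_images(build_dates: Dict[str, date], today: date, max_age_days: int = 30) -> List[str]:
    """Return image names whose last build is older than max_age_days, sorted by name."""
    return sorted(
        name for name, built in build_dates.items()
        if (today - built).days > max_age_days
    )


# Hypothetical catalog entries.
images = {
    "linux-base-v12": date(2024, 5, 1),
    "windows-base-v7": date(2024, 3, 10),
}
print(stale_images(images, today=date(2024, 5, 20)))
```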

4. Tagging

Tagging of cloud resources is essential for effective governance of multi-cloud environments. Tagging helps you better categorize cloud resources for budget tracking, reporting, chargeback, cost optimization, compliance, and security. Enterprises should approach tagging as follows:

Automation is key to implementing tags. If you use cloud management or provisioning orchestration tools, make sure that the workflows or pipeline scripts enforce tagging whenever new cloud resources are provisioned.
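A tagging check of this kind can be as simple as the sketch below, which refuses to provision a resource unless every required tag is present and non-empty. The required tag keys and the `provision` function are hypothetical; a real pipeline would call its provisioning tool where the comment indicates.

```python
from typing import Dict, Set

REQUIRED_TAGS: Set[str] = {"owner", "cost-center", "environment"}  # hypothetical policy


def missing_tags(tags: Dict[str, str]) -> Set[str]:
    """Return required tag keys that are absent or empty."""
    return {key for key in REQUIRED_TAGS if not tags.get(key, "").strip()}


def provision(resource_name: str, tags: Dict[str, str]) -> str:
    """Refuse to provision a resource unless all required tags are present."""
    missing = missing_tags(tags)
    if missing:
        raise ValueError(f"{resource_name}: missing required tags {sorted(missing)}")
    # A real pipeline would call the provisioning tool here; this sketch only reports.
    return f"{resource_name} provisioned with tags {sorted(tags)}"


print(provision("web-01", {"owner": "team-a", "cost-center": "cc-42", "environment": "prod"}))
```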

5. Cost management

Cloud users underestimate their wasted cloud spend¹². In a recent survey, respondents estimated 27% waste. Managing cloud spend is the highest priority for enterprise customers.

For effective cost management in the cloud, we must follow these hygiene habits:

Right-size cloud workloads.
Track unused or unattached cloud resources.
Track idle resources and implement load-shedding if possible.
Choose wisely between Reserved, Spot, and On-Demand instances for cloud compute power.
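Two of these habits, tracking unattached resources and tracking idle ones, lend themselves to a simple automated report. The sketch below scans an inventory list and flags candidates. The inventory format, field names, and 5% CPU threshold are assumptions for illustration; real data would come from your provider's inventory and metrics services.

```python
from typing import Dict, List, Tuple


def find_waste(resources: List[Dict], idle_cpu_percent: float = 5.0) -> List[Tuple[str, str]]:
    """Return (resource id, reason) pairs for likely wasted spend."""
    findings = []
    for res in resources:
        if res["type"] == "volume" and res.get("attached_to") is None:
            findings.append((res["id"], "unattached volume"))
        elif res["type"] == "instance" and res.get("avg_cpu_percent", 100.0) < idle_cpu_percent:
            findings.append((res["id"], "idle instance"))
    return findings


# Hypothetical inventory snapshot.
inventory = [
    {"id": "vol-1", "type": "volume", "attached_to": None},
    {"id": "vol-2", "type": "volume", "attached_to": "i-1"},
    {"id": "i-1", "type": "instance", "avg_cpu_percent": 42.0},
    {"id": "i-2", "type": "instance", "avg_cpu_percent": 1.3},
]
for resource_id, reason in find_waste(inventory):
    print(resource_id, reason)
```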

There are many tools available from cloud service providers as well as third-party ISVs.

6. Security and governance guardrails

Cloud governance is an extension of enterprise IT governance. There’s no need to reinvent the wheel. There are many IT governance models, such as COBIT, ITIL, and CMMI, and they each take a different approach. COBIT takes a more risk-focused approach to governance. ITIL is a very process-oriented framework, and CMMI is a maturity-focused framework.

For cloud security controls, you can use CSA, NIST, or CIS controls. There is no need to adopt one framework in its entirety. Depending on your business and regulatory needs, you can mix and match suitable controls from these frameworks.

New approaches for Success In the Cloud

AWS and Azure introduced organizational units / resource manager and control policies to manage large multi-account enterprise cloud environments.

Organizations Management (Account Management)

This concept is somewhat similar to the age-old Active Directory Group Policy. We can define guardrail policies at the organization level and at the organizational unit (OU) level to prevent policy drift.
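The inheritance idea can be sketched as follows: a deny defined at the organization root or at an OU applies to every account beneath it. This is a deliberately simplified, deny-only model written for illustration; it is not the actual policy evaluation logic of any cloud provider, and the node names and action strings are hypothetical.

```python
from typing import Dict, Optional, Set

# Hypothetical hierarchy: each node maps to its parent (None for the root).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "ou-prod": "root",
    "ou-sandbox": "root",
    "acct-payments": "ou-prod",
    "acct-experiments": "ou-sandbox",
}

# Guardrails attached at each level, expressed as denied action names.
DENIES: Dict[str, Set[str]] = {
    "root": {"cloudtrail:StopLogging"},
    "ou-prod": {"ec2:TerminateInstances"},
}


def effective_denies(node: str) -> Set[str]:
    """Collect every deny from the node up to the organization root."""
    denied: Set[str] = set()
    current: Optional[str] = node
    while current is not None:
        denied |= DENIES.get(current, set())
        current = PARENT[current]
    return denied


def is_allowed(account: str, action: str) -> bool:
    """An action passes the guardrails only if no ancestor denies it."""
    return action not in effective_denies(account)


print(sorted(effective_denies("acct-payments")))
print(is_allowed("acct-experiments", "ec2:TerminateInstances"))
```

Because the guardrails live above the accounts, an account owner cannot drift from them by editing account-level settings.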

Identity and Access Management

Cloud IAM authenticates and authorizes all cloud platform users according to policies and roles. Traditional identity providers like Active Directory can be integrated with cloud IAM to enable seamless federated access control with a single gate to manage. We can use multi-factor authentication (MFA) to further control privileged account access to cloud platforms. Enterprises must implement MFA before opening up the cloud environment to everyone.

Here are some of the common patterns popular in enterprises:

Rotate root access keys for your cloud
Restrict root access for cloud
Use groups and roles to provide access rather than granting it to individual users.
Adopt a zero-trust policy, and provide only the minimum required privileges.
Enable logging for all IAM service calls in the cloud. Integrate the log monitoring with your SIEM tools.
Use services like AWS Control Tower, Service Control Policies, and others to implement strict IAM policy enforcement.
Frequently inspect the environment for unused IAM users/credentials.
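The last pattern, inspecting for unused IAM users and credentials, is easy to automate once last-used dates are available. The sketch below flags credentials that were never used or have been idle beyond a limit. The credential names, dates, and 90-day limit are illustrative assumptions; the last-used data would come from your provider's credential reports.

```python
from datetime import date
from typing import Dict, List, Optional


def unused_credentials(
    last_used: Dict[str, Optional[date]],
    today: date,
    max_idle_days: int = 90,
) -> List[str]:
    """Return credential names never used or idle longer than max_idle_days."""
    flagged = []
    for name, used in last_used.items():
        if used is None or (today - used).days > max_idle_days:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical report data; None means the credential was never used.
report = {
    "svc-backup": date(2024, 1, 2),
    "alice": date(2024, 5, 30),
    "old-contractor": None,
}
print(unused_credentials(report, today=date(2024, 6, 1)))
```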

The Foundation to Effectively Manage your Cloud:

Be aware of the ‘shared’ management responsibilities of your cloud. The cloud provider ensures the security OF the cloud, and you are responsible for security IN the cloud.
Focus on enabling maximum visibility in your cloud environments. While you build new cloud-native apps in the cloud, ask your app builders to embrace observability principles and implement them. This will save you from extended outages and unhappy business users and consumers.
Cloud Operations need a new approach. You will need modern tools, a skilled team, and re-factored processes to ensure success in the cloud.
Failure to implement tagging is a perfect recipe for management disaster when the environment grows larger.
You have plenty of options to do tooling in the cloud. Don’t throw away any investment already made unless it is not extensible to hybrid or multi-cloud environments. Look for integration opportunities for cloud-native tools and traditional IT management tools.
Explore and implement cloud governance tools to police the cloud effectively. Agility and velocity can take a back seat when cloud security and governance are at stake.

Works Cited

1. Inspired by Dr. Chris Harding, director for interoperability and SOA at The Open Group.

2. Moe Abdula, Ingo Averdunk, Roland Barcia, Kyle Brown, Ndu Emuchay: The Cloud Adoption Playbook: Proven Strategies for Transforming Your Organization with the Cloud. Wiley.

3. https://blogs.gartner.com/rene-buest/2018/12/14/public-cloud-infrastructure-operations-and-management-is-a-shared-responsibility-model/

4. https://www.networkcomputing.com/cloud-infrastructure/top-5-challenges-monitoring-multi-cloud-environments

5. https://medium.com/faun/blue-green-deployments-with-amazon-ecs-part-1-console-d2e77345735d

6. https://chatbotsmagazine.com/how-chatops-can-help-you-devops-better-5-minutes-read-507438c156bf

7. https://newrelic.com/resource/cloud-monitoring-platform-requirements

8. https://en.wikipedia.org/wiki/Observability

9. https://www.gartner.com/en/information-technology/glossary/cloud-management-platforms

10. https://github.com/aws-samples/aws-golden-ami-pipeline-sample/blob/master/Golden-AMI-Pipeline-Guide%20V1.0.pdf

11. https://www.infoworld.com/article/3246986/better-cloud-management-through-cloud-resource-tagging.html

12. https://resources.flexera.com/web/media/documents/rightscale-2019-state-of-the-cloud-report-from-flexera.pdf
