The words “digital” and “cloud computing” seem to be embedded throughout every ERP presentation today. You can’t get away from promises of reduced risk, lower cost and faster deployments to enable your “digital transformation”. But when you lift the lid – what does this really mean?
When a mission-critical SAP/ERP implementation undergoes a major technology-enabled change (a move to the public cloud, a migration to HANA or a system upgrade), there is a common denominator. The main event – the production downtime window to go live!
The public cloud offers a multitude of benefits. However, lurking under the covers is an array of risks and issues likely to sting you during the main event.
Don’t get me wrong, I totally support the cloud movement – though digital transformations can also be supported on-premise, as they have been for the last 20 years. My aim, however, is to educate and inform, to ensure the risk profile of production cutovers in the cloud is understood.
It’s all about visibility and control – these are the two key areas you lose.
The downtime window
Mission-critical ERP maintenance windows are often fixed and agreed with the business a year or more in advance. With scheduling that far ahead, there is likely to be limited insight into how the windows will be used, and sometimes they are insufficient for the change they are allocated to.
Even when you can influence the duration of the window, you will still be subject to business constraints and held to early, experience-based estimates.
Whatever environment you are operating in, there is a common challenge – execute a complex and often multi-dimensional change in a production downtime window that is never long enough!
So, we make the impossible possible by reducing the technical runtime, whilst at the same time reducing risk, eliminating variables and creating a repeatable recipe.
Over the years I have developed an approach to making that impossible possible. That itself deserves a separate blog, but in a nutshell, identify levers, variables, risks, benefits and then devise a strategy to rinse and repeat – prove it, break it and document it!
Cloud computing simplified and reduced the cost of the key on-premise prohibitor – compute and storage! The ability to stand up instances (at the right size/scale/config) and pay by the hour for the privilege became a real game changer. This allowed us to focus more on the creative levers to help solve the problem.
The team working on this will feel like the challenge is nothing short of launching a rocket into space... well, we like to think so!
The end product
Once we have mastered our recipe, our toolbox is equipped with a technical runbook and a detailed cutover plan. Somebody is assigned to ordering pizza, whilst some of us prepare for no sleep for the next few days (or to catch what we can on the office sofa or even the floor – I’ve done both).
Authority to proceed
You secure a GO decision from the exec and the team quickly move into executing the plan. People mobilised, technical processes running, governance checkpoints governing and we are now full steam ahead.
There are two types of event that are likely to occur when things go wrong: something breaks and everything stops or, most painful of all, it slows down! You are unable to achieve the benchmarks you recorded in your rehearsals, and now your plan and contingency are at serious risk.
It’s only then, in the dark of the night, that an exhausted team (with the eyes of the execs bearing down on them) realise how vulnerable they are, due to a lack of visibility and control.
A recent project involved a complex migration of a large SAP implementation to the cloud. Even though this migration involved a database and operating system change, we soon realised there was less risk in moving all of SAP Production in a single event. Data transfer was one of our biggest challenges; we addressed this via a complex set of daisy-chained events across several transfer links.
When it starts to go horribly wrong
and the adrenaline kicks in…
Once data hit the first staging area in the cloud, something started to smell wrong: everything was running much slower than planned. Incidentally, Hurricane Florence was battering the US East Coast at the same time. Even though our change was in Europe, there was news after the event that cloud providers were moving loads from North America to Europe to ensure availability. So replication of huge volumes of data and shifting of compute demand was likely to stretch hypervisors and push even the most highly provisioned storage solutions to their limits. There were no incidents, reports or status updates being declared by the cloud provider.
On another project, an application upgrade slowed to far below our benchmarks. There were no hurricanes or world disasters to blame on this occasion, but we never really identified the root cause. However, our analysis (once out of the heat of the battle) suggested this may have been unrelated to the cloud infrastructure.
Both examples bring us back to visibility and control. The decision by the cloud providers to move workloads was entirely their own; they have availability SLAs and other customers to look after too. When you are under pressure, the lack of visibility across the full infrastructure platform severely impacts your ability to troubleshoot effectively. It soon becomes a distraction to the working team and stakeholders.
What did we learn?
You soon learn how to magic contingency from a plan that doesn’t seem to have any left – that itself is an art.
Let’s not forget the cloud isn’t really this magical unified layer of compute that is always on, always performing in the sky somewhere. It’s a complex amalgamation of data centres, designed to provide an unprecedented degree of scale and availability. But when it comes to maintenance of mission-critical ERP, you have to understand that decisions and changes by the cloud provider are not made with your go-live in mind. That lack of control is a real risk and, if an incident occurs, getting the right level of visibility to support troubleshooting can be a real challenge.
There are three key takeaways here:
- Build in sufficient planned contingency, and know how and where you will find more if it becomes exhausted
- Set realistic expectations with the exec around the technical control/visibility risk of your go-live
- If you haven’t yet moved/deployed mission critical ERP in the cloud, reflect on the potential impact extended (if infrequent) planned downtimes may have on your business
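To make the contingency point concrete, here is a minimal Python sketch of how a team might project remaining contingency during a cutover by comparing actual step durations with rehearsal benchmarks. It is an illustration only, not the tooling used on the projects described here: the `Step` class, the `remaining_contingency` function, the step names and all figures are hypothetical, and it assumes the slowdown seen so far applies equally to the steps still to run.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    """One cutover step (hypothetical structure, for illustration only)."""
    name: str
    benchmark_min: float                # duration recorded in rehearsal
    actual_min: Optional[float] = None  # None until the step has completed


def remaining_contingency(steps: List[Step], window_min: float) -> Tuple[float, float]:
    """Return (projected contingency left in minutes, observed slowdown factor).

    Completed steps count at their actual duration. Pending steps are scaled
    by the slowdown observed so far (actual / benchmark over completed steps).
    This uniform scaling is a simplifying assumption.
    """
    done = [s for s in steps if s.actual_min is not None]
    bench_done = sum(s.benchmark_min for s in done)
    actual_done = sum(s.actual_min for s in done)
    slowdown = actual_done / bench_done if bench_done else 1.0
    pending = sum(s.benchmark_min for s in steps if s.actual_min is None)
    projected_total = actual_done + pending * slowdown
    return window_min - projected_total, slowdown


# Made-up figures: a 12-hour window with a rehearsed runtime of 10 hours,
# so 120 minutes of planned contingency.
plan = [
    Step("export", 120, actual_min=150),
    Step("transfer", 240, actual_min=300),
    Step("import", 180),
    Step("checks", 60),
]
left, factor = remaining_contingency(plan, window_min=720)
print(f"slowdown x{factor:.2f}, projected contingency {left:.0f} min")
```

With these made-up numbers, a 25% slowdown on the first two steps turns 120 minutes of planned contingency into a projected 30-minute overrun. That is the kind of early signal that tells you to start looking for extra contingency before it is exhausted.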
Alternatively, you could argue that the cloud provided a degree of resilience that enabled our change to be completed without a hardware failure, along with the ability to tap into an immense amount of pay-as-you-go compute. Even if these changes had been performed on-premise, there would always be a risk of hitting unexpected issues not witnessed during rehearsals (even a natural disaster). The argument can swing both ways, but again this is all about visibility and control.
Mission-critical ERP systems are the crown jewels that run your business. The cloud is an incredible enabler, but it does come with inherent risks and challenges that we should be tuned into.