Lifting the hood on our Internal AWS Development Platform (IDP)
Written by Jamie Banks,
Enterprise Product Owner of AWS & DevOps at EDF (UK)
OK, before we get into this, let’s start at the top, so we’re on the same page.
What is an IDP, and why should you care about them?
An Internal Development Platform (IDP) is a key tool for companies that want more consistent development practices. It makes it easier for engineers to get to work on their products and bring them to market faster. Today, almost every company needs a digital presence in order to compete. Alongside this demand, the software landscape grows by the day, whether through new languages, tooling, or cloud technologies. It's complex.
So, how do you get a group up and running quickly? And how do you avoid the early questions like "this tool vs. that one?" It’s simple - set up an IDP!
A good IDP makes it easy and painless for the engineers to adopt the company's development practices.
Usually, when I start explaining our IDPs, I use phrases like “a platform that aims to decrease cognitive load and speed up time to market.” But those benefits are only experienced if you do it well. A poorly designed and managed IDP can have the opposite effect.
Here at EDF, we treat our IDPs as products. We pay attention to what our customers (the in-house and partner engineers) say and continuously improve them based on their feedback, without making them so complicated that no one can understand them.
We’ve learnt a lot and found some fantastic resources that put structure around the concept of a platform. For example, Internal Developer Platform is a great starting point for anyone looking to learn more about IDPs. They outline five core components of a well-designed IDP:
- Infrastructure Orchestration: Integrate with your existing and future infrastructure
- Environment Management: Enable developers to create new environments whenever needed
- Deployment Management: Implement a Continuous Delivery or even Continuous Deployment (CD) approach
- Role-Based Access Control: Manage who can do what in a scalable way
- Application Configuration Management: Manage application configuration in a scalable and reliable way
Alongside these core components, we treat security as a first-class citizen by ensuring that our security and compliance standards are baked into our pipelines as a gate and having security represented in the team at all times. We find this works well for us as development teams get early feedback on their compliance status and there are no hidden surprises later down the line.
The platform I'm writing about now is essentially an AWS vending machine, with built-in controls that limit where things can be deployed and align common things like IAM with our central security posture. If an engineer needs a new account or set of accounts, they can raise a PR against a YAML file, and as long as all the correct information is there, it will be available for them to use within 30 minutes of the PR being approved. We have a central library of Infrastructure as Code modules that can be referenced and a suite of security and compliance tests that are run during the CI/CD process.
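To make the "as long as all the correct information is there" check concrete, here is a minimal sketch of the kind of validation a pipeline could run on an account request before vending. The field names, the allowed environments, and the function name are all assumptions made for illustration, not the real schema of our YAML file, and the request is shown as a dictionary as it might look once the YAML has been parsed.

```python
# Hypothetical schema: these field names and environments are assumptions
# for illustration, not the real layout of the account request file.
REQUIRED_FIELDS = ("account_name", "owner_email", "cost_centre", "environment")
ALLOWED_ENVIRONMENTS = {"dev", "test", "prod"}


def validate_account_request(request: dict) -> list[str]:
    """Return a list of problems; an empty list means the request can proceed."""
    problems = []
    # Every required field must be present and be a non-empty string.
    for field in REQUIRED_FIELDS:
        value = request.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # If an environment was supplied, it must be one we recognise.
    env = request.get("environment")
    if isinstance(env, str) and env.strip() and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unknown environment: {env}")
    return problems


# One request entry as it might look after the YAML has been parsed.
good_request = {
    "account_name": "billing-dev",
    "owner_email": "team@example.com",
    "cost_centre": "CC-1234",
    "environment": "dev",
}
bad_request = {
    "account_name": "billing-dev",
    "owner_email": "",
    "environment": "sandbox",
}

print(validate_account_request(good_request))
print(validate_account_request(bad_request))
```

Returning a list of problems rather than failing on the first one means the engineer sees everything that needs fixing in a single PR comment.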
We've tried a few times to build our AWS platform here at EDF, which is probably the same as a lot of other enterprises. I won't go through all of our history, but here is a quick round-up of what we’ve learned along the way:
- It's great when an IDP works well. Engineers can get to work faster, knowing that they have control over what they are deploying and that it is safe, secure, and in line with the company's policies.
- On the other hand, when an IDP doesn't work well, it can lead to a lot of frustration, slow down development, break trust between departments, and generally make everyone in the team unhappy.
- ClickOps makes applications fragile and a platform difficult to scale.
- Having multiple unrelated production workloads in the same account increases your risk and the size of your blast radius.
- As a cloud platform scales, so does the need for common, repeatable processes, team autonomy, and changes driven by code rather than ClickOps.
What is ClickOps? ClickOps refers to making changes by hand through the console. It is not the best way to manage production workloads because manual steps are hard to repeat at scale, which leads to misconfigurations and human mistakes, and those changes are more difficult to roll back.
What is blast radius? Blast radius refers to how much trouble you'd be in if a bad guy got into one of your accounts. AWS's way of limiting the blast radius is to encourage you to create multiple accounts. The idea is to have one for each domain, application, product, etc. You get the idea.
The platform we have built is almost completely stored as code, so we get all the benefits of git, and we can make it accessible to our engineering community. Taking this approach means we can employ a way of working referred to as “GitOps”.
What is GitOps?
GitOps is a set of rules for running a production environment, whether it's an internal development platform or a service that external customers use. The desired state of the environment is declared in Git, and changes are made through reviewed commits rather than by hand. These rules are based on the same methods that Google has been using internally since 2006.
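One way to picture GitOps is as a comparison between the state declared in Git and the state that actually exists, with automation closing the gap. The sketch below is only an illustration of that idea: the function name, the account names, and the shape of the data are assumptions, and real tooling does considerably more than a dictionary diff.

```python
def plan_changes(desired: dict, actual: dict) -> dict:
    """Compare desired state (from Git) with actual state and list the actions needed."""
    return {
        # Declared in Git but not yet present.
        "create": sorted(desired.keys() - actual.keys()),
        # Present but no longer declared in Git.
        "delete": sorted(actual.keys() - desired.keys()),
        # Present in both, but the configuration has drifted.
        "update": sorted(
            name
            for name in desired.keys() & actual.keys()
            if desired[name] != actual[name]
        ),
    }


# Hypothetical states: what the repository declares vs. what is deployed.
desired = {
    "team-a-dev": {"region": "eu-west-2"},
    "team-a-prod": {"region": "eu-west-2"},
}
actual = {
    "team-a-dev": {"region": "eu-west-1"},
    "old-sandbox": {"region": "eu-west-2"},
}

plan = plan_changes(desired, actual)
print(plan)
```

Because the plan is derived entirely from what is in Git, rolling back is a matter of reverting a commit and letting the same comparison run again.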
We then combine our GitOps approach with Service Level Objectives (SLOs) to ensure we’re meeting our core platform objectives.
What are SLOs?
SLOs, or Service Level Objectives, essentially define how often you can fail without annoying your users. They were popularised by Site Reliability Engineering: How Google Runs Production Systems, the book that brought the term SRE to a wide audience.
To bring this closer to home, the SLA (Service Level Agreement) is a close relative of the SLO: an SLO with a contract attached. SLAs are commonplace, whether you look at Amazon Web Services or your preferred cloud provider. Amazon Elastic Kubernetes Service, for example, offers an SLA of 99.95%, which means that if it is down for more than about 21 minutes and 54 seconds in an average-length month, AWS credits you, the customer, a percentage of the service fee.
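The downtime figure follows directly from the percentage. As a rough sketch, assuming an "average" month of 365.25 / 12 days (a 30-day month gives a slightly smaller budget), the allowed downtime can be worked out like this. The function name and the choice of `Decimal` are mine, used here to avoid floating-point rounding surprises.

```python
from decimal import Decimal

# Assumption: an average month is 365.25 / 12 days, i.e. 2,629,800 seconds.
SECONDS_PER_AVERAGE_MONTH = Decimal("365.25") * 24 * 3600 / 12


def downtime_budget(
    availability_percent: str,
    period_seconds: Decimal = SECONDS_PER_AVERAGE_MONTH,
) -> tuple[int, int]:
    """Return the allowed downtime as (minutes, seconds) for the given period."""
    allowed_fraction = (Decimal(100) - Decimal(availability_percent)) / Decimal(100)
    # Fractional seconds are dropped, so the budget is rounded down.
    allowed_seconds = int(allowed_fraction * period_seconds)
    return divmod(allowed_seconds, 60)


minutes, seconds = downtime_budget("99.95")
print(f"99.95% over an average month allows {minutes} min {seconds} s of downtime")

# The same SLA over a 30-day month.
print(downtime_budget("99.95", Decimal(30 * 24 * 3600)))
```

The same function works for your own SLOs: swap in your target percentage and measurement window to see how much failure you can afford.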
SLOs must be built into the fabric of your Landing Zone for one simple reason: If you don't clearly manage expectations via SLOs, your customers will subconsciously set even higher ones for you.
Here is a list of examples of SLOs that we use and the SLIs that go with them:
| Service | SLI | SLO target |
|---|---|---|
| Account Vending | The account is usable by the requesting person within one business day of the request being submitted. | 80% |
| User Provisioning | The user is provided with credentials within four business hours of the request being submitted. | 90% |
| User Access | The user gets access within four business hours of the request being submitted. | 90% |
| IAC Compliance Scanning | A full run of the compliance scan finishes in under 3 minutes. | 95% |
| SAST Coding | The SAST tooling successfully serves a request within 3 minutes. | 95% |
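Each row pairs a measurement (the SLI) with a target percentage (the SLO). As a hedged sketch of how such a row could be checked, the code below takes a list of measured durations, counts how many met the threshold, and compares the result to the target. The function names and the sample scan timings are made up for illustration and are not real platform data.

```python
def slo_attainment(durations: list[float], threshold: float) -> float:
    """Percentage of measurements that finished in under the threshold."""
    if not durations:
        raise ValueError("no measurements to evaluate")
    good = sum(1 for d in durations if d < threshold)
    return 100 * good / len(durations)


def meets_slo(durations: list[float], threshold: float, target_percent: float) -> bool:
    """True when the share of good measurements reaches the SLO target."""
    return slo_attainment(durations, threshold) >= target_percent


# Hypothetical compliance-scan durations in seconds (20 runs, one slow).
scan_durations = [
    95, 110, 120, 130, 140, 150, 160, 170, 175, 179,
    100, 105, 115, 125, 135, 145, 155, 165, 90, 240,
]

# SLI: a full scan finishes in under 3 minutes (180 s). SLO target: 95%.
attainment = slo_attainment(scan_durations, threshold=180)
print(f"attainment: {attainment}%")
print("SLO met:", meets_slo(scan_durations, threshold=180, target_percent=95))
```

With one slow run out of twenty the target is only just met, which is exactly the kind of signal that helps a platform team decide where to spend its time next.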
Our most recent platform is accessible to the community that uses it, and we, the platform team, manage it based on our performance against our SLOs. We prioritise our work like any other good product team with our SLO performance in mind, ensuring we don't overengineer.
- If you’re building an internal platform that others depend on, you need to build it to scale, which means you must operate everything as code.
- If you’re not treating your platform as a product, it's likely you're not listening to your customers, and you are at risk of overengineering (an antipattern in Lean).
- Whatever you do, don't let your platform or any of your production services be built by “ClickOps”!
- SLOs are a good tool that can help you operate your platform to the right level.