Scorecards are a powerful way to establish best practices, define a baseline to understand quality for existing services and resources, and encourage progress on initiatives, such as migrations or general organization-wide goals.
Cortex users commonly define Scorecards across a few categories:
- Development Maturity: Ensuring services and resources conform to basic development best practices, such as established code coverage, checking in lockfiles, READMEs, package versions, and ownership.
- Operational Readiness: Determining whether services and resources are ready to be deployed to production, checking for runbooks, dashboards, logs, on-call escalation policies, monitoring/alerting, and accountable owners.
- Operational Maturity: Monitoring whether services are meeting SLOs, on-call metrics look healthy, and post-mortem tickets are closed promptly, and gauging whether there are too many customer-facing incidents.
- Best Practices: Defining organization-wide best practices for areas such as infrastructure and platform, SRE, and security. For example, "are you on the right platform library version," "you must be on Kubernetes," and "you should have the right CI file checked in."
- Migrations: Tracking ad hoc projects like migrations between language versions, platforms, or deployment strategies, or performing security audits, such as PCI DSS or SOC 2 compliance.
Scorecards should be aspirational. For example, an SRE team may define a readiness Scorecard with 15 to 20+ criteria that they feel services or resources should meet in order to be considered "ready" for SRE support. The reality may be that the engineering team is not resourced to actually meet those goals, but setting objective targets helps drive org-wide cultural shifts and sets a baseline for conversations around tech debt, infra investment, and service quality.
Based on our experiences working with engineering teams across a wide spectrum of sizes and maturity levels, we’ve put together some example Scorecards. Some examples might be very specific, so tweak them for your use case.
Developers should be checking in lockfiles to ensure repeatable builds.
sonarqube.metric("coverage") > 80.0
Set a threshold that’s achievable, so there’s an incentive to actually try. This also serves as a secondary check that the service is hooked up to SonarQube and reporting frequently.
git.lastCommit.freshness < duration("P30D")
As counterintuitive as it may seem, services that receive commits too infrequently are actually at greater risk. People who are familiar with the service may leave the team, undocumented tribal knowledge accumulates, and from a technical standpoint, the service may be running outdated versions of your platform tooling.
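To make the freshness rule concrete, here is a minimal Python sketch of the comparison it expresses: the time since the last commit must be under 30 days (the ISO 8601 duration `P30D`). The function name and the sample timestamps are hypothetical, and this is only an illustration of the logic, not how Cortex evaluates the rule internally.

```python
from datetime import datetime, timedelta, timezone

# Threshold equivalent to the ISO 8601 duration "P30D" used in the rule.
MAX_AGE = timedelta(days=30)


def is_fresh(last_commit: datetime, now: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """Return True when the last commit is strictly more recent than max_age."""
    return (now - last_commit) < max_age


# Hypothetical sample data for illustration.
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
recent_commit = datetime(2024, 6, 15, tzinfo=timezone.utc)  # 15 days old
stale_commit = datetime(2024, 3, 1, tzinfo=timezone.utc)    # roughly 4 months old

print(is_fresh(recent_commit, now))  # True
print(is_fresh(stale_commit, now))   # False
```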
Use a wildcard search to make sure there are unit tests enabled.
git.numRequiredApprovals >= 1
Ensure that a rigorous PR process is in place for the repo and that PRs must be approved by at least one user before merging.
Enforce that a CI pipeline exists, and that there is a testing step defined in the pipeline.
owners.count > 2
Incident response requires crystal-clear accountability, so make sure there are owners defined for each service or resource.
oncall.escalations.count > 1
Check that there are at least 2 levels in the escalation policy, so that if the first on-call does not acknowledge, there is an established backup.
runbooks.count >= 1
Create a culture of preparation by requiring runbooks to be established for the services or resources.
When there is an incident, responders should be able to find the right logs easily. Usually, this means load balancer logs and application logs.
dashboards.count >= 1
Responders should have standard dashboards readily accessible for every service or resource in order to speed up triage.
custom("pre-prod-enabled") = true
Use an asynchronous process to check whether there is a live pre-production environment for the service or resource, and send a true/false flag to Cortex using the custom metadata API.
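As a rough sketch of the asynchronous process described above, the following Python snippet builds the HTTP request a scheduled job might send once it has determined whether a pre-production environment is live. The base URL, path, payload shape, and the `build_flag_request` helper are all assumptions made for illustration; consult the Cortex API reference for the actual custom metadata endpoint before relying on any of them.

```python
import json
import urllib.request

# ASSUMPTION: the base URL, path, and payload shape below are illustrative
# placeholders, not confirmed details of the Cortex custom metadata API.
CORTEX_API = "https://api.getcortexapp.com/api/v1"


def build_flag_request(entity_tag: str, enabled: bool, token: str) -> urllib.request.Request:
    """Build (but do not send) a request that stores the pre-prod flag."""
    body = json.dumps({"key": "pre-prod-enabled", "value": enabled}).encode("utf-8")
    return urllib.request.Request(
        url=f"{CORTEX_API}/catalog/{entity_tag}/custom-data",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical entity tag and token, for illustration only.
request = build_flag_request("payments-service", True, "example-token")
print(request.get_method(), request.full_url)
```

A job that has already probed the environment could pass this request to `urllib.request.urlopen` to send it. Keeping the request construction separate from the network call makes the payload easy to inspect and test.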
sonarqube.metric("vulnerabilities") < 3
Ensure that production services are not deployed with a high number of security vulnerabilities.
oncall.analysis.meanSecondsToResolve < 3600
Make sure that issues are resolved in a reasonable amount of time. If they’re not, you can dig into the root cause.
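For intuition about what this threshold measures, here is a small Python sketch that computes a mean time to resolve from (opened, resolved) timestamp pairs and compares it with the 3600-second limit. The data shape and function name are hypothetical; the on-call integration may compute its figure differently.

```python
from datetime import datetime
from statistics import mean

THRESHOLD_SECONDS = 3600  # one hour, matching the rule above


def mean_seconds_to_resolve(incidents: list[tuple[datetime, datetime]]) -> float:
    """Average the resolution time, in seconds, over (opened, resolved) pairs."""
    return mean((resolved - opened).total_seconds() for opened, resolved in incidents)


# Hypothetical incidents lasting 30, 90, and 45 minutes.
incidents = [
    (datetime(2024, 6, 3, 10, 0), datetime(2024, 6, 3, 10, 30)),
    (datetime(2024, 6, 5, 14, 0), datetime(2024, 6, 5, 15, 30)),
    (datetime(2024, 6, 9, 8, 15), datetime(2024, 6, 9, 9, 0)),
]

average = mean_seconds_to_resolve(incidents)
print(average)                      # 3300.0
print(average < THRESHOLD_SECONDS)  # True
```

Note that a mean can hide a single very long incident behind several short ones, so it is worth looking at the individual incidents when a service sits close to the threshold.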
oncall.analysis.offHourInterruptions < 3
If engineers are being paged off hours, it will lead to alert fatigue and low morale. By catching services and resources that are causing high numbers of off-hour interruptions, you can improve developer happiness.
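The sketch below illustrates one way an off-hour interruption count could be derived from page timestamps. The working-hours window (09:00 to 18:00, Monday to Friday) and both function names are assumptions for illustration; your on-call tool defines what actually counts as off hours.

```python
from datetime import datetime

# ASSUMPTION: "off hours" means before 09:00, from 18:00 onward,
# or any time on a Saturday or Sunday.
WORK_START_HOUR = 9
WORK_END_HOUR = 18


def is_off_hours(paged_at: datetime) -> bool:
    """Return True when a page arrives outside the assumed working window."""
    if paged_at.weekday() >= 5:  # Monday is 0, so 5 and 6 are the weekend
        return True
    return not (WORK_START_HOUR <= paged_at.hour < WORK_END_HOUR)


def count_off_hour_interruptions(pages: list[datetime]) -> int:
    """Count the pages that landed during off hours."""
    return sum(1 for paged_at in pages if is_off_hours(paged_at))


# Hypothetical pages for one service.
pages = [
    datetime(2024, 6, 3, 10, 30),  # Monday morning, working hours
    datetime(2024, 6, 3, 23, 15),  # Monday night
    datetime(2024, 6, 4, 3, 0),    # Tuesday, early morning
    datetime(2024, 6, 8, 14, 0),   # Saturday afternoon
]

print(count_off_hour_interruptions(pages))  # 3
```

With three off-hour pages, this hypothetical service would fail a rule that requires fewer than 3.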
Jira: post-mortem tickets opened in the last 6 months that are still open
Developers creating action items for services without actually closing them is an organizational risk. Either the team is not prioritizing incident-related issues, or the team is not equipped with the right resources.
jira.issues("labels=customer and created > startOfMonth(-3)") < 2
A reliable service or resource should not be a source of frequent customer-facing incidents.
Make sure there are no outstanding compliance or legal issues affecting the service or resource.
Migrations and best practices
custom("ci-platform-version") > semver("1.1.3")
Having every CI pipeline send a current version to Cortex on each master build lets you catch services or resources that rely on outdated versions of tooling, like CI or deploy scripts.
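Version rules like this depend on comparing versions numerically, component by component, rather than as plain strings. The following Python sketch shows the idea for simple `MAJOR.MINOR.PATCH` versions. The helper names are hypothetical, and pre-release or build metadata suffixes are deliberately not handled.

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into integers; missing parts default to 0."""
    parts = [int(part) for part in version.split(".")]
    while len(parts) < 3:
        parts.append(0)
    major, minor, patch = parts[:3]
    return (major, minor, patch)


def is_newer(reported: str, minimum: str) -> bool:
    """Return True when the reported version is strictly greater than minimum."""
    return parse_semver(reported) > parse_semver(minimum)


print(is_newer("1.2.0", "1.1.3"))   # True
print(is_newer("1.1.3", "1.1.3"))   # False
print(is_newer("1.10.0", "1.9.0"))  # True
print("1.10.0" > "1.9.0")           # False: string comparison gets this wrong
```

The last two lines show why versions are parsed first: compared as strings, "1.10.0" sorts before "1.9.0" even though it is the later release.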
package("apache.commons.lang") > semver("1.2")
Cortex automatically parses dependency management files, so you can easily enforce library versions for platform migrations, security audits, and more.
Once you have Scorecards set up, you can start using Initiatives to drive progress across the organization on these goals.