Scorecards are a powerful way to establish best practices, define a baseline to understand quality for existing services and resources, and encourage progress on initiatives, such as migrations or general organization-wide goals.
Cortex users commonly define Scorecards across a few categories:
- Development Maturity: Ensuring services and resources conform to basic development best practices, such as meeting code coverage targets, checking in lockfiles and READMEs, pinning package versions, and defining ownership.
- Operational Readiness: Determining whether services and resources are ready to be deployed to production, checking for runbooks, dashboards, logs, on-call escalation policies, monitoring/alerting, and accountable owners.
- Operational Maturity: Monitoring whether services are meeting SLOs, on-call metrics look healthy, and post-mortem tickets are closed promptly, and gauging whether there are too many customer-facing incidents.
- Security: Mitigating security vulnerabilities, achieving security compliance across services, and measuring code coverage.
- Migrations: Tracking ad hoc projects like migrations between language versions, platforms, or deployment strategies, or performing security audits, such as PCI DSS or SOC 2 compliance.
- Best Practices: Defining organization-wide best practices across areas like infrastructure and platform, SRE, and security, such as "are you on the right platform library version," "you must be on Kubernetes," and "you should have the right CI file checked in."
Scorecards should be aspirational. For example, an SRE team may define a readiness Scorecard with 15 to 20+ criteria that they feel services or resources should meet in order to be considered "ready" for SRE support. The reality may be that the engineering team is not resourced to actually meet those goals, but setting objective targets helps drive org-wide cultural shifts and sets a baseline for conversations around tech debt, infra investment, and service quality.
Based on our experiences working with engineering teams across a wide spectrum of sizes and maturity levels, we’ve put together some example Scorecards. Some examples might be very specific, so tweak them for your use case.
Developers should be checking in lockfiles to ensure repeatable builds.
sonarqube.metric("coverage") > 80.0
Set a threshold that’s achievable, so there’s an incentive to actually try. This also serves as a secondary check that the service is hooked up to SonarQube and reporting frequently.
git.lastCommit.freshness < duration("P30D")
As counterintuitive as it may seem, services that receive commits too infrequently are actually at more risk. This is because people who are familiar with the service may leave the team, tribal knowledge accumulates, and from a technical standpoint, the service may be running outdated versions of your platform tooling.
Use a wildcard file search to make sure unit tests are present.
git.numRequiredApprovals >= 1
Ensure that a rigorous PR process is in place for the repo, and PRs must be approved by at least one user before merging.
Enforce that a CI pipeline exists, and that there is a testing step defined in the pipeline.
owners.count > 2
Incident response requires crystal-clear accountability, so make sure there are owners defined for each service or resource.
oncall.escalations.count > 1
Check that there are at least 2 levels in the escalation policy, so that if the first on-call does not acknowledge, there is an established backup.
runbooks.count >= 1
Create a culture of preparation by requiring runbooks to be established for the services or resources.
When there is an incident, responders should be able to find the right logs easily. Usually, this means load balancer logs and application logs.
dashboards.count >= 1
Responders should have standard dashboards readily accessible for every service or resource in order to speed up triage.
custom("pre-prod-enabled") = true
Use an asynchronous process to check whether there is a live pre-production environment for the service or resource, and send a true/false flag to Cortex using the custom metadata API.
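Such a job could look like the sketch below, which builds the custom-metadata flag for one entity. The `check_preprod` probe and the payload shape are illustrative assumptions, not the exact Cortex API contract; consult the custom metadata API docs for the real request format.

```python
# Sketch of a scheduled job that reports pre-prod status to Cortex.
# The payload shape here is an assumption -- check the Cortex custom
# metadata API documentation for the exact contract before sending it.
import json


def check_preprod(entity_tag: str) -> bool:
    """Placeholder health check; replace with a real probe of your
    pre-production environment (e.g. an HTTP ping)."""
    return entity_tag in {"orders-service"}  # stubbed result for the sketch


def build_payload(entity_tag: str) -> dict:
    """Build the custom-metadata flag for one entity."""
    return {"key": "pre-prod-enabled", "value": check_preprod(entity_tag)}


if __name__ == "__main__":
    # In a real job you would POST this payload to the Cortex API.
    print(json.dumps(build_payload("orders-service")))
```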
sonarqube.metric("vulnerabilities") < 3
Ensure that production services are not deployed with a high number of security vulnerabilities.
oncall.analysis.meanSecondsToResolve < 3600
Make sure that issues are resolved in a reasonable amount of time. If they’re not, you can dig into the root cause.
oncall.analysis.offHourInterruptions < 3
If engineers are being paged off hours, it will lead to alert fatigue and low morale. By catching services and resources that are causing high numbers of off-hour interruptions, you can improve developer happiness.
JIRA: post-mortem tickets opened in the last 6 months that are still open
Developers creating action items for services without actually closing them is an organizational risk. Either the team is not prioritizing incident-related issues, or the team is not equipped with the right resources.
jira.issues("labels=customer and created > startOfMonth(-3)") < 2
A reliable service or resource should not be a source of frequent customer-facing incidents.
jira.issues("labels=compliance") < 3
Make sure there are no outstanding compliance or legal issues affecting the service or resource.
snyk != null
The first step in monitoring security is making sure each service has an associated Snyk project.
git.lastCommit.freshness < duration("P7D")
By confirming whether a service was updated within the last week, outdated code can be caught sooner. Plus, if there is a security issue, you can quickly determine which services have or have not been updated to patch the vulnerability.
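The freshness comparison the rule expresses can be sketched in plain Python: `duration("P7D")` is an ISO-8601 seven-day window, so the check reduces to comparing the last commit timestamp against that age.

```python
# Minimal sketch of the freshness check: the last commit must be newer
# than a 7-day window (the ISO-8601 duration "P7D" in the rule above).
from datetime import datetime, timedelta, timezone


def is_fresh(last_commit: datetime, max_age: timedelta = timedelta(days=7)) -> bool:
    """True when the last commit is newer than the allowed age."""
    return datetime.now(timezone.utc) - last_commit < max_age


# Example timestamps: one inside the window, one well outside it.
recent = datetime.now(timezone.utc) - timedelta(days=2)
stale = datetime.now(timezone.utc) - timedelta(days=30)
```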
ownership.count > 0
Making sure each entity has at least one owner helps ensure updates don't fall through the cracks.
git.numRequiredApprovals > 0
Changes should not be merged unless there is at least one approval.
sonarqube.metric("coverage") > 70
By monitoring code coverage, you can get a sense of how much of your code has been tested — entities with low scores are more likely to be vulnerable to attack.
git.branch_protection() != null
Make sure that your default branch is protected, as vulnerabilities here are critical.
sonarqube.freshness < duration("P7D")
Check that a SonarQube analysis has been uploaded within the last seven days, so teams are continuously monitoring compliance with coding rules.
snyk.issues() < 5
sonarqube.metric("security_hotspots") < 5
sonarqube.metric("vulnerabilities") < 5
Once an entity is meeting core requirements, developers can start focusing on quality by making sure entities have a low number of Snyk issues, security hotspots, and/or vulnerabilities.
custom("ci-platform-version") > semver("1.1.3")
Having every CI pipeline send a current version to Cortex on each master build lets you catch services or resources that rely on outdated versions of tooling, like CI or deploy scripts.
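The version comparison behind this rule can be sketched as below, assuming simple MAJOR.MINOR.PATCH strings with no pre-release tags (real semver handling has more edge cases).

```python
# Rough equivalent of the semver comparison in the rule above, assuming
# plain MAJOR.MINOR.PATCH strings (no pre-release or build metadata).
def parse_semver(version: str) -> tuple:
    """Split "1.2.3" into a comparable (1, 2, 3) tuple."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(reported: str, minimum: str = "1.1.3") -> bool:
    """True when the reported CI platform version exceeds the minimum."""
    return parse_semver(reported) > parse_semver(minimum)
```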
package("apache.commons.lang") > semver("1.2")
Cortex automatically parses dependency management files, so you can easily enforce library versions for platform migrations, security audits, and more.
Best practices are unique to every organization and every application, so make sure to work across teams to develop a Scorecard measuring your organization's standards.
git.fileExists("yarn.lock") or git.fileExists("package-lock.json")
Make sure a lockfile is checked in to provide consistency in package installs.
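For teams that want to run the same check locally, e.g. in a pre-merge script, a minimal sketch of the rule's logic:

```python
# Local mirror of the lockfile rule: either a yarn or an npm lockfile
# must exist at the repository root.
from pathlib import Path


def has_lockfile(repo_root: str) -> bool:
    """True when yarn.lock or package-lock.json exists in the repo root."""
    root = Path(repo_root)
    return (root / "yarn.lock").exists() or (root / "package-lock.json").exists()
```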
git.fileExists(".prettierrc.json") or git.fileExists(".eslintrc.js")
Projects should have a standard linter.
jq(git.fileContents("package.json"), ".engines.node") != null
Node engine version should be specified in the package.json file.
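What the jq expression checks can also be expressed in Python, which may be easier to read: parse package.json and confirm the `engines.node` field is set.

```python
# Python equivalent of the jq check above: package.json must declare
# an engines.node constraint.
import json


def node_engine_specified(package_json_text: str) -> bool:
    """True when package.json declares an engines.node constraint."""
    data = json.loads(package_json_text)
    return data.get("engines", {}).get("node") is not None
```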
jq(git.fileContents("package.json"), ".devDependencies | with_entries(select(.key == \"typescript\")) | length") = 0 or git.fileExists("tsconfig.json")
TypeScript projects should have a tsconfig.json checked in.
jq(git.fileContents("package.json"), ".engines.yarn") = null or jq(git.fileContents("package.json"), ".engines.npm") = "please-use-yarn"
If a project is using yarn, it should not allow npm.
jq(git.fileContents("package.json"), ".engines.yarn") = null or !(semver("1.2.0") ~= semver_range(jq(git.fileContents("package.json"), ".engines.yarn")))
Finally, ensure that the yarn version being used is not deprecated.
Creating a culture of accountability with Scorecards
All of these examples share a common goal: fostering a culture of accountability at your organization. Because Scorecards are fully customizable, there is no limit to how you can use this powerful tool. We have, however, provided some other starting places to help you make the most of Scorecards: you can check out our guide on building a DORA metrics Scorecard, or use any of the templates built into Cortex.
Once you have Scorecards set up, you can start using Initiatives to drive progress across the organization on these goals.