Speed vs Quality in Engineering
...is a false dichotomy. As the SEALs say: “slow is smooth, smooth is fast”.
“Move fast and break things” is widespread advice in the startup community. It’s bad advice and a hallmark of subpar organizational process design. This is not a business advice post; this is an organizational process and system design post that ideally helps someone improve engineering culture and align it with delivering value to customers.
Doing things right the first time IS the fastest & easiest way to do them. It does not require incredible engineering talent: it requires discipline and a solid foundation.
From here on out we have a collection of non-proofread short stories. Let’s start!
Around 6:31 p.m. on January 27, 1967, a spark caused by unprotected and/or chafed wires in a test capsule filled with pressurized 100% oxygen resulted in Virgil Grissom, Edward White, and Roger Chaffee being burned alive on the ground during a test. The last thing the engineers who built those components heard was a mortifying scream.
Amongst many drastic changes resulting from the tragedy, a document called SP-287 was produced. It outlines the safety and reliability design principles that landed Apollo 11 on the moon safely in 1969, guided by a 0.043 MHz CPU with 64KB of memory.
The saying "those who can't do, teach" is true in CS. SP-287 is exceptional because it is a real-world systems design guide, born out of very heavy lessons by those who did.
An introduction
This entire section is based solely on quotes from pages 9 and 10 of the document.
You should care about these because if you don’t live, breathe, and sh*t them, you won’t suddenly start doing them when you have to work on something important, where mistakes mean financial, medical, or physical infrastructure damage, loss of customer trust, injury, or loss of life: mistakes you don’t want to have to live with.
For the sake of brevity and simplicity I’ll refer to Microservices, Lambda Groups, API Gateways, Federators, etc. as “DMs” or Domain Modules: code usually found together in a GitHub repo, ideally representing a single business domain that one team is responsible for.
If the below sounds a lot like 12-Factor, SOLID, DRY, KISS, DDD, TDD, etc., that’s because SP-287 is where most of them likely came from. A large part of this design philosophy is echoed by Martin Fowler and others I personally hold in high respect.
Translation: always run 2+ instances of any DM or its service components (like a DB).
Translation: Simple API Contracts / SDKs.
Translation: DMs should have their own supporting services like DBs, Caches, etc. The team building a DM should also own, control, and test its supporting infrastructure.
Translation: Abstraction, upheld by DM boundaries. DMs that are too large become unmaintainable. Maintenance complexity and overhead do not scale linearly.
Translation: Fix what’s not intuitive - in code, this mostly means naming things well.
Translation: Observability & automatic alerting should be proportional to autonomy.
Imagine a movie where people are flying a plane, the engines or thrusters fail, nothing beeps, and they start figuring out what’s wrong 500ft from the ground.
It’d be a pretty short plot. Possibly even shorter than the time it takes to coalesce logs, add alert filters for them, and make said filters trigger an xMatters or PagerDuty event.
Translation: Unit, integration, E2E, and UAT tests all date back to 1969 (maybe earlier).
Key takeaways for software
Safety as a Paramount Priority
Security should always be a part of your initial design and architecture.
Reliability also means simplicity; a detailed explanation of why is below.
Redundancy and Fault Tolerance
Redundancy: never run fewer than 2 instances of anything except a CDN.
Faults: should immediately create an alert that triggers SMS, email, etc.
Materials Selection and Environmental Control
Safe Materials: use tools and libraries with few dependencies and large user bases.
Monitoring: use multiple SCA tools, consider trying AI PR feedback.
Human Factors Engineering
Design: behavior-test with your users and fix UX/UI they don’t find intuitive.
“Oh sh*t!”: Build these buttons ahead of time for your orgs (not just ENG).
Systems Integration and Interface Management
Modular: use registries to share code between repos & publish API SDKs.
API Management: API SDK versions point to corresponding API versions.
Quality Assurance and Rigorous Testing
Tests: should be easy to maintain and reflect actual use of the UI and APIs.
CI/CD: includes integration testing in your test and staging environments.
Configuration Management and Documentation
VCS: Use things like Changesets to enforce patch notes and docs with PRs.
Docs: Should always be auto-compiled / generated for every deployment.
Effective Communication and Coordination
Agile: just like SP-287, the point is effective interdisciplinary communication.
Meetings: ban recurring ones, anything without an agenda, and anything over 45 minutes long to force Slack, JIRA and Confluence to handle regular comms. It’s difficult to run effective meetings when your team is burned out on them.
Verification and Validation
Requirements: are from the requesting party (PM, other ENG team, etc)
Acceptance: is by the requesting party or downstream API consumer.
Emergency Preparedness
Emergency Procedures Development: in software, we have it easy. Have automated rollbacks, always deploy rolling with gradual LB canary swap.
Incident Response Plans: Emergencies should be covered as part of SOP.
Continuous Learning and Improvement
Learning from Incidents: can only be done effectively when processes have single responsible parties. Empower your team. One is responsible for patch notes. Another for docs. Another for upgrades / deprecation. Rotate all of it.
Improvement: when people f*ck up, the process they had or laid out for themselves allowed it. It’s a process problem, not a people problem.
The Right Changes
Your processes should make doing things wrong long and painful, and doing things right the easiest, simplest, and fastest way to do them. Your processes should also be kept to a minimum and designed to be the path of least resistance: I can’t stress this enough. Organizational change is difficult for those who do not evaluate it by the proper measure: you know you are communicating effectively when people on the receiving end actually do or change what you wanted them to do or change.
That’s it. If you can’t convince someone to adopt a process, it’s because you have not explained the why behind it appropriately. If you can’t do that, you should probably rethink the necessity and sanity of the change you are trying to push to begin with.
Now that you are a top notch master at change management, let’s talk about effective principles which are most likely to apply to the largest amount of people reading this:
A service’s observability should be proportional to its level of autonomy.
A service’s reliability should be proportional to the ecosystem’s complexity:
Example: ecosystem with two services, both 99% uptime SLA:
Ecosystem uptime = 0.99 × 0.99 = 0.9801
System failure rate = 1 − 0.9801 = 0.0199, or 1.99%
Example: ecosystem with ten services, all 99.5% uptime SLA:
System uptime = 0.995^10 ≈ 0.9511
System failure rate = 1 − 0.9511 = 0.0489, or 4.89%
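If you want to sanity-check the compounding yourself, here is a minimal TypeScript sketch; the numbers are just the two examples above, nothing more:

```typescript
// Composite availability of serially-dependent services: multiply their SLAs.
const ecosystemUptime = (slas: number[]): number =>
  slas.reduce((product, sla) => product * sla, 1);

const twoServices = ecosystemUptime([0.99, 0.99]);          // 0.9801
const tenServices = ecosystemUptime(Array(10).fill(0.995)); // ~0.9511

console.log(`2 services:  ${(100 * (1 - twoServices)).toFixed(2)}% failure rate`); // 1.99%
console.log(`10 services: ${(100 * (1 - tenServices)).toFixed(2)}% failure rate`); // 4.89%
```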
Not enough of either is dangerous, too much of either is wasteful: balance carefully.
Abstraction is the primary line of defense against complexity. Is your team providing a low-level caching service? A high level business function API? Etc. Think hard about whether the pragmatic approach for your specific situation is:
Going wide - many atomic functions with few high level composed functions.
Ex.: you have 2 layers of abstraction with 200 functions in each.
Going deep - many composed functions built on top of few atomic functions.
Ex.: you have 5 layers of abstraction: 85 functions in each of the lower four and 60 in the highest-level layer.
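A toy TypeScript sketch of what the two shapes look like to a consumer (every name here is made up purely for illustration):

```typescript
// "Wide": one flat layer of small atomic helpers; callers compose everything themselves.
export const subtotal = (items: { cents: number }[]): number =>
  items.reduce((sum, item) => sum + item.cents, 0);
export const applyDiscount = (cents: number, pct: number): number =>
  Math.round(cents * (1 - pct));
export const addTax = (cents: number, rate: number): number =>
  Math.round(cents * (1 + rate));

// "Deep": the same atomic helpers hidden behind a composed, higher-level function;
// most consumers only ever touch this top layer.
export const checkoutTotal = (
  items: { cents: number }[],
  discountPct: number,
  taxRate: number,
): number => addTax(applyDiscount(subtotal(items), discountPct), taxRate);
```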
I say “think hard” intentionally. ThoughtWorks, the inventors of Agile (eh), are known for delivering almost all of their projects on time and within budget. In software, that is RARE. Last I heard, they spend around 80% of their time on requirements, design, and architecture. Only 20% of most projects’ execution time and resources go to coding.
It’s impossible to overstate the criticality of proper abstraction and code contract design. It’s the difference between adding a feature taking an hour vs weeks. If you can handle code contracts the way Kafka handles topic changes (addition-only), you know you’ve done a good job designing them. It’s also the quickest AND safest way.
There are eight rules in Fight Club so let’s shoot from the hip for 8 measures that may give you clues for effectively modulating the abstraction dimensionality in a service:
Cyclomatic complexity. V(G)=E−N+2P, yo. Ideally under 10, hopefully under 20 (a tiny worked sketch follows after this list).
Code duplication - consider percentage AND per-module density. Less than 5%.
Depth of Inheritance (for you OOP people) - should track #1 in the exact inverse.
Afferent / Efferent Coupling - Low Afferent = flexible; Low Efferent = stable.
Halstead Volume, Effort & Difficulty - actually ok at estimating how buggy stuff is.
Response For a Class (OOP #2) - scaling is often lock-stepped with #3, lower = better.
Fan-In / Fan-Out - high Fan-In = many functions call yours = reusable; high Fan-Out = you call many = complex.
Weighted Methods per Class (OOP strike 3): high WMC = maintenance nightmare.
Admittedly, really stretching here by trying to have 8 generally applicable ones. These are useful for quickly sizing up what you are getting into, what feature velocity and defect rate you should expect, etc. Like most statistics, they require the interpreter to understand the nuances so they can be leveraged pragmatically instead of becoming toxic.
You can use these to evaluate existing code and greenfield implementation designs - but remember, unless you are a F500 exec, hitting metrics which only exist for the sake of metrics probably won’t help you or your team. Don’t talk about Fight Club.
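Since #1 is the one people quote most, here is a throwaway TypeScript sketch purely to make the formula concrete (the example control-flow graph is hypothetical):

```typescript
// V(G) = E - N + 2P: edges minus nodes plus twice the number of connected components.
const cyclomaticComplexity = (edges: number, nodes: number, components = 1): number =>
  edges - nodes + 2 * components;

// A function whose control-flow graph has 9 edges and 8 nodes (one connected component):
// V(G) = 9 - 8 + 2 = 3, comfortably under the "ideally under 10" line.
console.log(cyclomaticComplexity(9, 8)); // 3
```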
Consider your consumers and the overall system architecture. If your evaluation of how to implement a feature doesn’t start with “should this be an API call or a system event”, it probably should. Encourage your engineers to get very familiar and intimate with the high level system architecture, especially if you are in a bigger organization.
One way of doing this is making sure your architects regularly communicate with your engineers. Don’t call a group meeting and waste the whole team’s time - that’s where productivity goes to die. Encourage them to schedule 15min 1-on-1s with engineers, get their opinion, feedback, and suggestions for a design before the formal proposal.
Even if no design changes happen, there is inherent value in the form of buy-in. As a newly minted top notch master at change management, you now possess an intrinsic understanding of this. You always make sure everyone is on board, feels heard, and has an opportunity to contribute to (or contest) every design they will be implementing.
Running everything by committee is unviable but in this case, more heads are better.
Hallmarks of an effective ENG org
First off, this → https://martinfowler.com/articles/developer-effectiveness.html
Let’s get some concrete examples going. We’ll use NodeJS/TS since it’s very common.
Private NPM Registry
Reduce code duplication, increase environment parity and process uniformity.
We all know that code duplication is bad. Then most of us turn around and add things like husky, hooks, eslint, and CI/CD code into every repository along with a list of scripts in package.json long enough to be a novel. Promote a uniform way of doing these things by making it the quickest and easiest way to do them with a tool like Lerna.
This one’s for bigger orgs: the last thing you want to run into is having to push an emergency fix and finding out that a registry hosting image layers or dependency packages used in the build is having an outage. Yes, I’m 100% projecting on this one.
CLI
Automate braindead tasks teams / engineers do daily. Commander is popular. In essence: provide convenient defaults for everyday tasks with the option to override (a minimal Commander sketch closes out this section).
Pulls .env config for projects based on their name in package.json (ex.: you should have something like Pulumi ESC as your CaaS for local / test / staging / etc.)
“tHiS sErViCe DoEsNt StArT, wHaT’s YoUr EnV FiLe BrO?” gone forever.
Gives consumer SDK migration / upgrade notices based on package.json versions
This is important. It eliminates A LOT of unnecessary pain and mistakes. Your engineers probably don’t give a sh*t about the 100+ emails they get daily, but they will absolutely read what their terminal tells them when they open it.
Handles commands that would normally be loitering in every repo’s package.json
Changes to these should ALWAYS be on the CLI’s upgrade notice request.
Almost everything shared in your CICD pipeline should be inside this CLI.
Projects like Nektos Act are underrated. Your CICD setup should be exactly how new engineers onboard onto a project. No docs, always up to date.
Between the two of the above, your package.json scripts should usually be empty unless your project is so unique that it doesn’t conform to standard. Usually isn’t.
One-time append to .zsh/bash/etc rc to check for new versions. Update request when someone opens a terminal with a description of what that update does.
Manages and updates other CLIs. Things like AWS CLI, Prisma, Supabase, etc.
Creates project symlinks to centralized .vscode / .eslint / .prettier / tsconfig / etc only if a team chooses to not have their own as a hard file in that project.
Don’t make your default lint/whatever rules strict. Nobody will use them.
Helps open PRs via terminal by asking predefined questions that go into the PR template and then a basic patch note description via something like Changesets.
Changesets also helps people version SDKs / UIs / etc. Highly recommend.
Ideally, have a shared CLI and encourage bigger teams to build their own. This is handy and useful for both the teams themselves and their downstream consumers.
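For flavor, a minimal sketch of what such a CLI’s entry point might look like with Commander. The command, option, and config-service behavior below are hypothetical, not a prescription:

```typescript
#!/usr/bin/env node
import { Command } from "commander";
import { readFileSync } from "node:fs";

const program = new Command();
program.name("you").description("Shared engineering CLI").version("1.0.0");

// Pull per-environment .env config based on the project's name in package.json.
program
  .command("env")
  .description("Fetch config for the current project from the config service")
  .option("-e, --environment <env>", "target environment", "local")
  .action((opts: { environment: string }) => {
    const { name } = JSON.parse(readFileSync("package.json", "utf8"));
    // Stand-in for whatever CaaS client (Pulumi ESC, etc.) your org actually uses.
    console.log(`Would fetch ${opts.environment} config for ${name} here.`);
  });

program.parse();
```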
Shared Code
TLDR: see “Private Registry” above.
We also all know to check what’s already out there before inadvertently reinventing the wheel. Then most of us turn around and write code for things like authorization, authentication, logging, and SDK generation in every repository that needs them.
Example: your logging should be as simple as import { logger } from "@you/logger" (a package in your private registry). That package should manage singletons and automatically pull the config for endpoints / settings from the environment vars.
Wrap as much as you can. Provide sane defaults which can be easily overridden. You can thank me when your org is upgrading logging, switching vendors, adding another hybrid CAAS system, or really any wide-sweep change. Instead of wasting hundreds of engineering hours across dozens of repos, teams update your wrapper and if they can do so without changing the code contracts (aka it was designed well), boom done!
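As a minimal sketch of what that wrapper could look like internally, assuming a pino-style backend (the package name, env vars, and field names are illustrative, not gospel):

```typescript
// @you/logger - the one wrapper every repo imports instead of configuring logging itself.
// Node's module cache makes this a de facto singleton; swapping vendors, transports, or
// redaction rules only ever touches this package, never its consumers.
import pino from "pino";

export const logger = pino({
  level: process.env.LOG_LEVEL ?? "info",
  base: { service: process.env.SERVICE_NAME ?? "unknown-service" },
});
```

Consumers then write import { logger } from "@you/logger" and logger.info("...") and nothing else.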
Environment Parity
I’m going to repeat myself here because this is worth repeating. Any team publishing a service, repo, etc used by someone else owes the other orgs a list of simple things:
An SDK for downstream consumers that also contains per-environment fetches for CAAS, service mesh endpoints, etc. Your consumers shouldn’t write a single line of code that is not an invocation of your SDK to interface with your service.
Said SDK which your team surely publishes also needs to have either a start command for a local server that mocks your service or a port forwarder to dev.
This is what consumers develop with. This is what CICD E2E tests run on.
If the SDK version is being migrated/deprecated, make it push that to logs.
If the SDK version is incompatible with the service’s registered CAAS version in a target environment, bomb immediately (a sketch follows after this list). Yes, your CAAS should contain versions of the services in an environment as well as their feature flags.
Said SDK is the single, typed source of truth for your service’s API code contracts.
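A hedged sketch of the “bomb immediately” part, assuming your CAAS can tell the SDK which service version is registered in the target environment (every name below is hypothetical):

```typescript
// Inside the published SDK: refuse to start against an environment whose registered
// service version this SDK release was not built for.
const SDK_CONTRACT_VERSION = "3.2.0"; // baked in at publish time, e.g. via Changesets

export function assertCompatible(registeredVersion: string, targetEnv: string): void {
  if (registeredVersion !== SDK_CONTRACT_VERSION) {
    // Fail loudly at startup instead of at the first broken call.
    throw new Error(
      `@you/thing-sdk targets service v${SDK_CONTRACT_VERSION}, but ${targetEnv} ` +
        `is running v${registeredVersion}. Upgrade the SDK before deploying.`,
    );
  }
}
```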
Every aspect of what runs in prod should be runnable or bridgeable to your engineers’ locals. Let me further expand upon “should”. This isn’t an ideal, it’s how to move fast: by reducing bugs in development via parity and enabling your engineers to diagnose and fix production bugs faster by not having them spend most of their time trying to replicate the bug to begin with. Advanced concepts like how to implement cross-env event replay and canary execution are out of scope for this post. But while we’re here: yes, you should ABSOLUTELY have locally runnable images for log sinks and pubsub.
P.S. Event replay is lit even when bug-related data falls under PCI DSS, HIPAA, etc. Compliance is inherent in event replay systems (data anonymization/scrambling) because they rely on a specific style and schema of cross-service trace logs, which your platform team has surely ensured are auto-redacted in that logging package everyone surely uses.
Those SDKs the services publish have surely implemented the shared logger in such a way as to surely be able to reconstruct & replay Rq/Rs pairs from log sink via trace ID.
Closing Out
A QA engineer walks into a bar and starts ordering beers. They order 2 beers, 0 beers, -1 beers, a lizard, a NULLPTR, then try to leave without paying. QA gives the OK. The first real customer finishes their drink and asks where the bathroom is. The bar explodes. The odds of a greenfield architecture design surviving contact with reality do be like that.