Phil Calçado

Copilotcalypse

Fri, 08 Mar 2024 00:00:00 +0000

Let me start by saying that I am not a fan of chatbots as the primary user interface for enterprise systems. Although LLM-backed tools are significantly better at detecting intent and making data more useful, as a survivor of the fever dream that was Chatops in the mid-2010s, I don’t see how they solve any of the fundamental issues we faced with the previous iteration. Problems with permissions, concurrency, versioning, reuse, verbosity, and error handling are all still present, and we now have to deal with fresh new challenges brought by AI systems, such as hallucinations.

Even if these problems were to be solved, I believe that we tend to overestimate our desire to talk to machines. Conversational interfaces can be great when you are exploring something unfamiliar, don’t have a device handy, or when you want to ask questions. However, they are inefficient and annoying when you are on a laptop or phone and know exactly what you want, from turning off the lights to creating a pivot table.

That’s why Outropy doesn’t have a chatbot. Instead of asking users to talk to our system, Outropy offers a user experience inspired by tools such as VSCode and IntelliJ. These integrated environments make coders highly productive by providing timely and relevant insight and allowing users to “double-click” for contextual and actionable information about anything they see on the screen.

To our dismay, Microsoft invested millions of dollars in marketing to convince the world that copilot is just a synonym for chatbot. That’s why we stopped referring to Outropy as the Copilot for Engineering Leaders and started calling it The VSCode for everything else.

But, back to Microsoft, it’s no secret that the adoption of Copilot for both Windows and Office has faced issues. I suspect that they are facing the same issue as many AI startups: a lot of users sign up to try the product upon launch, investors get excited about massive revenue projections, but users start to churn as soon as the first or second bill arrives. It’s easy to get IT managers who need to provide some kind of “AI strategy” to their bosses to sign up for a trial, but once they’ve played around with it for a little while it’s hard for them to justify what the extra $30/user/month buys them.

Anyone paying close attention to the enterprise AI market could see this copilotcalypse coming, but I had hoped that it would be a forcing function pushing Microsoft and other deep-pocketed companies who are haphazardly plastering ✨ buttons all over their systems to invest in some UX research to move away from chatbots.

Instead, they are doubling down and adding new chatbots to help you extract value from their existing chatbots. Chatbots will keep being added until engagement improves.

This is disappointing, but not surprising. I have worked in tech for long enough to know that inside a large organization such as Microsoft, Google, or Meta there are so many competing goals that the safest way to keep your head on your neck is always to keep doing what the higher-ups like, even if it’s clearly not working.

And this is the exact kind of opening that startups exploit to take on incumbents.

Attention is All A Manager Needs

Fri, 21 Jul 2023 00:00:00 +0000

As I learn more about the current renaissance of Artificial Intelligence for something I am building (more on that at the end), I find myself reading and re-reading papers that talk about how to consume and make sense of information at scale.

All this talk about managing information at scale makes me think of challenges faced by engineering managers and directors as they have to deal with both information overload and scarcity simultaneously. This is a recurring major topic when coaching new managers or folks who made the transition from line manager to senior management. In this article, I am going to discuss the challenges and offer a few practical tools that have worked for me in my own journey.

The challenges with attention

Managing is a balancing act of where to direct one’s attention—knowing when to zoom in for details and when to zoom out for the bigger picture.

In modern organizations, this is made worse by the proliferation of multiple inboxes. Each tool used by teams on a daily basis (Jira, Slack, Workday, Expensify, etc.) comes with its own notifications and alerts system. As a manager, you find yourself having to check them all continuously, deciphering which issues demand immediate attention and which can wait.

Information overload is a huge pain point, consuming valuable time and draining your energy. On the other hand, the opposite challenge arises as well—relevant information often fails to reach you at the right time. Team members may not proactively share essential updates, and you may find yourself in the dark about the status of a critical task or project unless you explicitly request a status report.

The general challenges manifest themselves differently at different levels of management. Let’s take a look at some peculiarities of each one.

Engineering managers

When I am coaching an engineer or other individual contributor through the transition to management, usually the first challenge is shifting their attention from individual tasks to the team and project as a whole.

This change requires learning how to delegate and trust the team, but the most challenging step is understanding _what_ needs your attention and where to look for it. Widening our lens can be overwhelming, and the new manager is usually terrified they won’t be able to provide a competent answer when their director or some important stakeholder asks about something.

How people deal with this depends a lot on their personality. More extroverted people tend to rely on talking to their team members to gather this information. While seemingly benign, this style can quickly devolve into a stream of interruptions for status reports and a team that is overloaded with meetings. In remote work, this is made worse by the inability to have quick, spontaneous conversations.

On the other hand, more introverted managers usually dread asking for status updates. They spend hours every day reading through Slack messages, Google Docs comments, and GitHub PRs rather than bothering someone for an update. It has the benefit of not breaking the team’s flow, but the manager ends up with whatever imperfect information they are able to piece together from reverse engineering these artifacts. It also results in a misalignment between team members as, without synchronization checkpoints, it is natural that people drift apart and start behaving like many micro-teams instead of one single entity.

And even when they _have_ a good picture of what’s happening, another attention-related challenge for the first-time manager is avoiding reactivity.

As a manager, you are exposed to an endless stream of activity that might impact your projects and people. A new manager will often fall into the trap of handling these issues as they arrive, dealing with each one individually. Resolving problems may feel productive, but if you don’t apply some strategy, you’ll quickly become a troubleshooter rather than a leader. You will spend so much time dealing with these issues that virtually no time will be left to do your actual job, which is to put in place whatever is needed to prevent these issues to begin with! Even when you have time, you will often be brain dead from hours of constant context-switching.

Directors and above

After a lot of trial and error (and hopefully a good coach), most new managers find their way around managing information within their own team. They build a system that works well enough for their teams and start feeling less overwhelmed.

Unfortunately, this is usually the point in time they are given more scope to manage. Maybe they are assigned a second or third team or get promoted to senior manager or director. And that’s when they find out that knowing what’s happening in their team was the easy part. As their scope increases, the area their attention radar needs to cover also grows.

The role of a senior manager varies wildly with the kind and size of the company, much more than that of an engineering manager. What is always present, though, is an expectation that someone in this role will impact and support more than just their immediate teams and stakeholders. A lot of what a senior manager does is within the context of what Patrick Lencioni calls “first team”: the idea that the leaders in your group, i.e., you, your peers, and your leader, are a team in its own right, a team that will prioritize what’s in the interest of the overall organization and not just your individual groups.

The expectation exists regardless of the company’s maturity in implementing this idea of first team in practice. To fulfill it, senior managers need to be up to speed not just with their teams and projects, but also with what’s happening in peer teams and the whole organization.

Ultimately this means that not only do you need to be looking downward at your teams, you also need to look sideways across the organization.

With this widened aperture, trying to figure out what is happening by reading Slacks, PRs, Google Docs, and Jira becomes humanly impossible. Worse, now that you have a higher rank, people treat your asks for updates as a serious request from a superior. What could’ve been a one-line Slack response turns into a 30-minute walkthrough of an overly rehearsed 40-slide deck. You start being selective about what kind of updates you ask for and whom you ask, to avoid people dropping everything they’re doing because “the boss wants a status report pronto.”

What has worked for me

As with so many crafts, learning how to deal with the challenges of growing in scope is often done by experience. While each manager’s path is unique, some practical advice holds true across most cases.

Build recurring information checkpoints

You may loathe the eight-hours-per-iteration-spent-in-planning thing suggested by early versions of Scrum, and I won’t blame you for that. But regular planning sessions, for both short and medium terms, are a good thing. Regularly scheduled planning sessions allow your team to take a break from the daily fog of war and think about the next few weeks or months.

Similarly, a regular daily “stand-up” creates a forum in which everyone is encouraged to broadcast what they are working on and self-organize around blockers.

It doesn’t matter how often they happen or how long they take, but regularly planned rituals like these do wonders for actionable information sharing. Firstly, the team can plan around these ceremonies instead of being interrupted by a manager’s ad-hoc requests for status updates. Even more importantly, they set the expectation that people should share information regularly and give folks “permission” to ask others for updates—something that might sound silly but is very important for less extroverted folks who may need the encouragement of being in a safe space.

Create a UX for information sharing—even if you need to fake it

The only thing worse than not having the information and context you need available in Jira or a document or anywhere else you can read without bothering someone is to have that information be outdated. And both issues are prevalent in organizations of all sizes.

You can try and force your team to be more diligent about keeping information up to date by being explicit that it is actually a core expectation of their role that impacts their performance. Or you can encourage good behavior by reminding them every time you ask for an update that the best way to keep you from interrupting is to make sure the information you need is available in the system.

But, carrot or stick, these will do only so much in improving the quality of information available to you and other stakeholders. The biggest issue preventing people from keeping information fresh isn’t forgetfulness or lack of care - it’s the user experience. I don’t mean the UX of tools like Jira or similar. These usually suck, but people learn to live with them. I am talking about the experience of the process end-to-end.

Typically, a team uses not one but multiple management systems, and they need to be kept in sync. People often spend too much time manually copying, pasting, and linking items between Jira and Trello, GitHub PRs, and various spreadsheets.

I usually keep a single spreadsheet listing all projects in our portfolio, each with an individual accountable for it. Each line has status and planning information, which goal (usually an OKR) it relates to, and the project name links to its charter. The project charter can be in any tool—e.g., some people have a Google Doc, others an Epic in Jira—as long as its contents follow the standard template.

Once every two weeks, we get together with all project owners for our project portfolio meeting, which is a topic that deserves its own article. I expect the project owner to keep the data on their projects in this spreadsheet always up-to-date, as this is the source of truth.

In return, I take on the administrative burden of using the information in this spreadsheet to populate whatever other systems and processes need this information—e.g., OKR tracking systems, company roadmaps, etc. They don’t have to worry about these; they only need to ensure that the project spreadsheet is current.

As you might imagine, this can be a lot of work. I’ve tried several strategies to avoid spending all my waking hours copying and pasting from this spreadsheet to other systems, everything from writing scripts that keep data in sync to, when I am really lucky, hiring a Chief of Staff to run this process.

Irrespective of your approach, you need to keep in mind that the best thing you can do to improve information quality is to reduce friction in the process.

Have recurring 1:1s not only with reports but also peers

I joke that every book on engineering management out there is basically 200 pages teaching you how to do 1:1 meetings. Joke or not, the reality is that holding regular 1:1 conversations with your reports is already a well-established practice in software engineering. Something less common, though, is having regular 1:1s with your peers and your manager’s peers.

Just like with reports, the structure and cadence of these meetings vary a lot from person to person and evolve over time. In my experience, executives prefer having some agenda instead of free-form conversation, while peers from departments that you don’t interact with much like to make it more of an informal chat.

Regardless of structure, regularly holding these conversations is crucial to building relationships. You really don’t want to interact with a peer only when something is wrong. It is also a forum to discuss events that are not yet time-sensitive or early ideas that might or might not ever see the light of day. Finally, they also create an opportunity for serendipity, as someone might mention in passing something they didn’t think would impact your side of things but turns out to be relevant to you and your teams.

Calendaring software sucks, and unless you have an executive assistant, managing so many 1:1s at different cadences is a nightmare. You quickly lose track of who you should be meeting with and how often, forget to remove 1:1s that are not necessary anymore, etc.

I use a spreadsheet as a poor man’s CRM to tackle this. It’s pretty simple:

Besides the agenda shared with all participants, I keep a private document for every person I meet. Whenever something important but not urgent happens related to a person or their teams, I grab a screenshot or link to something that gives me context and paste it into this document, in reverse chronological order. This document is a great way to keep track of things I need to bring to their attention without dropping everything to act on them as they happen.

Avoid name-dropping

When someone is new to a role or a peer group, it is normal to feel awkward asking questions. One intuitive way to deal with the awkwardness is to shift blame, saying that you need this information or that thing done because “I need to report to <my manager> on it soon.” It is a super common defense mechanism, but it causes all sorts of issues.

The biggest issue is that it invites your direct reports or peers to question your role as a leader. Are you adding value, or are you there just to relay information up the totem pole? They are not wrong in asking this question, and if you don’t have a good answer, it might be a good time to consider where you are spending your time as a manager.

But even if they can see the value you add, student syndrome kicks in, and people will eventually send you the information just before you meet with the big boss. You won’t have enough time to digest the information and prepare for the meeting. More importantly, it jeopardizes your ability to do your actual job, which is to understand what is happening across your teams and make sure it’s aligned with the strategy for your division and company.

As tempting as it might be, never justify the need for information—or anything, really—by saying that your boss needs it. It is reasonable that people want to know why you need this information. Sharing context is a great way to have them help you strategize over the best way to achieve the desired outcome instead of just mindlessly giving you a status report. It is an excellent opportunity to improve their understanding of your role and the value you bring. And if you can’t clearly elaborate on that, it is a sign that you must first understand those yourself.

Plug: Where my attention is at

Leaders spend at least two hours every day just catching up with what’s going on first thing in the morning or when returning to their desks after hours of back-to-back meetings. We cannot talk about increasing efficiency in tech and ignore this waste.

My co-founder and I have debated the topic for years and recently decided it is an area worth exploring. We are leveraging the impressive power of the most recent developments in Artificial Intelligence and Large Language Models to build something that was basically impossible just months ago.

Our platform plugs into the tools your team uses and learns not only from the artifacts your team produces but also from how people interact with them and collaborate amongst themselves. We use all this information to build an Interest Graph that we then use to bring relevant and timely updates to you based on topics and projects we know you care about.

And, because we know so much about the tools you use and how you use them, we can offer in-context automation that eliminates a large amount of manual labor that a leader needs to perform daily.

It’s like GitHub Copilot, but for managers.

Our first release focuses on helping new and seasoned managers at different levels sift through the noise and focus on what is important to them.

We are launching the first public version this Fall, and are currently onboarding partners for our private beta. If you want to see what we’re up to, you can add your name to the waitlist at Outropy.ai. If you want to know more about the company and our vision, you can reach me at phil <at> outropy.ai.

Five takeaways from looking for a new senior role in tech

Mon, 20 Dec 2021 00:00:00 +0000

A few months ago, I left SeatGeek without much of a plan of what to do next. My green card was finally issued in 2021, which means that I didn’t have to scramble to find a new job in forty days. For the first time in the fifteen years I have lived abroad, I could finally take my time without fear of getting on the bad side of immigration authorities. As someone who has been on a work visa for the last fifteen years of my life, this was wild.

At first, I tried the whole funemployment thing, basically when you are not actively looking for a job. I posted a tweet about leaving but did nothing much around job seeking aside from answering a few messages here and there.

I have recently signed with a new place. Before I talk about the new challenges ahead, I want to share five things I learned during this process. While bits and pieces are applicable for any tech role, this article explicitly focuses on senior leadership roles, which were what I was looking for. I define these roles as executive roles for small companies (I would say fewer than 50 engineers) or Vice President of Engineering and above for mid-sized (say 50-500 engineers), or Director and above for larger organizations (500+).

1. It will likely take longer than you expect

More senior roles are usually not evergreen. In recruiting, we use the term evergreen role when talking about positions that are always open, featured on a company’s career page indefinitely. Every company has budget restrictions on how many people they can add to payroll, but the reality of a hot job market means that most of them can always add another back-end/front-end/mobile engineer to their team.

And even if they are not evergreen per se, you will also find a lot of first-level engineering manager roles open at any given time. This happens because companies will need a new manager for every few Individual Contributors (ICs) they hire. Given that companies are constantly hiring ICs, they also need to add new managers regularly.

However, this relationship doesn’t hold as you go higher in the seniority ladder. Senior roles usually open up when someone needs replacement, if a reorg creates some leadership vacuum, when the company has reached a new growth stage, or when it starts a new strategic initiative and needs a leader.

As you might imagine, companies only go through these events every so often in their lifetime. It might be that you are fortunate, and by the exact time you are looking for something, a great role comes up, but it is unlikely.

Worse, people might be looking for a leader way ahead of time, which can be very frustrating. For example, I talked to a mid-sized company CEO about a role under them. In our first call, they explained that their product is being disrupted by competition and needs to change drastically or become obsolete. They thought of me as the perfect fit to lead this new initiative, and I was very excited about it. After a few exploratory chats over Zoom, I wanted to talk about the interview process. Then I realized that there was no actual role—at least not yet. The executive laid out their plan to first fire this one person, then get this other person to fill in for them, then get this other person to change teams… and many more steps that would have created the perfect role for me. When I asked how long they thought it would take, their estimate was one month. Putting aside the Game of Thrones vibe, it’s been three months since, and they haven’t even fired the first person from the list.

In hindsight, a better strategy for me would have been to start having these conversations at least three months before I left my previous job. I already had a feeling my journey there was not going to be that much longer, and when this feeling first kicked in I should have started looking around, even if casually.

2. Independent headhunters and recruiters are a valuable resource

To add another variable to your job search equation, not only do companies only open senior roles when there is a specific need, but they also are usually shy to make them public, especially on job boards. In my experience, small or medium companies only put these openings up if they have been looking for a while or some compliance framework requires that.

Companies do that for various reasons. Sometimes, the imminent departure of a leader might not be public information yet (sometimes even to the person leaving!). Even for net-new roles, the company might not want the outside world to know of a new strategic initiative or pivot. One of the folks I talked to is moving their business from B2B to B2C, and they don’t want to telegraph the move by having a “Vice President of Engineering, Retail” role open.

So how do you know about open roles in the market? The first step is to reach out to people in your network and let them know that you are looking. This will usually yield a few interesting leads, but the most efficient way is to use headhunters.

When I started in this industry, headhunter meant something specific: a recruiter for senior and/or hard-to-find positions. These days we use the term to refer to any independent recruiter who gets paid handsomely when they fill a position. Even when I am not looking for a new job, I try to at least skim over every recruiter email I get. As you undoubtedly have experienced first-hand, the vast majority of unsolicited messages from recruiters are irrelevant, badly automated spam. Still, now and then, a recruiter seems to have invested five seconds trying to research you and really thinks the position would be a good fit. These you want to build a relationship with, even if you are not looking for a job yet. I always reply, thanking them for the message and saying that I am unavailable, but I will let them know if anything changes. I also apply a Gmail label to these conversations to quickly find these good eggs when the time comes.

You have probably already had some of those reach out to you before. Go on your email and search for “your impressive background,” “opportunity,” and “well-funded startup.” I am sure you have a few of those in your inbox from over the years. Your LinkedIn inbox might also be filled with these messages that you have likely completely ignored in the past.

Good headhunters can be an invaluable resource in your job hunt. Not only do they have access to the still-confidential openings we talked about, but they also work in networks. Recruiters share the jobs they are working on with their network and split the commission if someone helps them fill the position. This means that you will get a lot of the same roles from different recruiters, but also that even if that one headhunter you are talking to doesn’t have openings for you, they will likely know of other openings coming through their network.

When it’s time for a new job, I send a note to folks in that Gmail label saying that I am open to new opportunities. Usually, they will try to book an introductory call. Recruiters love phone calls and don’t like doing things over email or text. This means that it is very easy to get overwhelmed by the number of recruiters trying to call you, and we will explore time management a little further down the text.

Introductory calls are usually 30 minutes over the phone or video. Do not let them book you for longer; it is more than enough time. They usually will spend a few minutes telling you about who they are and the recruiting agency they work for, if any. Besides the fluff about how they are different from others and only take on the best openings (they all say that…), pay attention to the type of clients they work with. Are those the right size, industry, etc., you want to explore?

They then ask you for your story. I recommend that you think about this before talking to any recruiter. Create a text document with a description of your professional history, previous jobs, and more significant accomplishments—at this stage, what is much more important than how. Do not forget to add something about why you left each job, especially if you were there for fewer than four years. Then edit repeatedly until it only includes information relevant to the role you want and has a straightforward, linear narrative.

There are a few reasons why I do this. First, I like to force myself to tell my history concisely. It helps ensure that I don’t forget important details or find a rabbit hole that will eat up minutes on an introduction to no benefit.

Then there is the fact that you are playing a game of telephone between recruiters and people from the hiring company. Do not be surprised or frustrated if every new person you talk to about a role asks you to introduce yourself from scratch, even if the recruiter had arguably briefed them. A “canonical” written version that you use repeatedly can help keep your story consistent across various interviews and interviewers.

After the first introduction call, the recruiter will likely email you some positions they think would be a good fit for you. Usually, this is a mixed bag. Not only does the recruiter not yet know you that well, but they will also likely include both roles that you are not qualified for, to show off, and roles that are a terrible fit but that they have been trying to fill for ages and might as well spam to everyone.

And this is something to keep in mind working with recruiters: they work for the hiring company, not for you.

One recruiter I was working with guided me through the process with a small startup. Over four weeks, I had talked to most people at that company and was waiting for one last call with some engineering leader who, or so I was told, had been on vacation during that time. The invitation for the call never came, and all I got from the company was radio silence for a week. I reached out to the recruiter, and they told me that everything was ok. They were just going through a big launch that week and were a little busy. The following Monday, I got this message:

Hey Phil, just a quick heads up that we had a candidate accelerated through a process with The Company and has accepted an offer. The match for them was very strong and they decided to act quickly, so there was nothing they needed to compare against in their minds. I do appreciate your time on this one and hope we can work together again soon. Did you get a chance to check out That other company? www.that-other-company.com

After some LinkedIn stalking, I found out that the person hired had already worked with some of the executive team before. I completely understand the move but was very pissed about the wasted week.

This kind of thing happens, and you need to understand that this is a transactional relationship. Still, it is in the recruiter’s best interest to have great relationships with senior candidates, so they will avoid doing anything that will piss you off.

3. Use your project management skills to keep your sanity

Finding a job in a hot market is one of the most challenging projects you will ever manage. You don’t have control over most aspects of the process, and even the influence you have needs to be managed carefully to avoid coming across as a demanding asshole. But the most complicated part is how the scarcity of you looking for one single job amongst many different options creates a textbook Game Theory problem.

These days, I try to be very structured around this effort, which—you guessed it—means I have a spreadsheet for it.

Below is a screenshot of the spreadsheet I’ve used most recently:

I don’t want to make the file available because it matters how one uses it, not the template.

I add every opening sent by a headhunter to the spreadsheet, even those I don’t find interesting.

The most critical data to keep tabs on are:

How excited am I about this role? How much Priority do I want to give it?
How much do I feel the hiring company (not the headhunter) is excited about me?
When was the last update on this process, from either them or me?
Who is supposed to take the next step? Is the ball in my court or theirs?

Time allowing, surely I will act on any items blocked on me, but things aren’t that simple.

You need to make sure you have the headspace to prepare and research your tier 1 opportunities. You also need to pay attention to the various other things going on in your life, especially if you still have a full-time job. And, most important, you need to avoid burning out because this is a very stressful process.

Every time I interact with the headhunter or hiring organization, I update the spreadsheet. I use conditional formatting to make the “last update” cell green/yellow/red based on how long ago the last contact was.
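
As a rough illustration, here is that green/yellow/red rule as a minimal Python sketch instead of a spreadsheet formula. The three-day and seven-day cut-offs and the function name are assumptions made for this example; the article does not say where green turns yellow or yellow turns red.

```python
from datetime import date

# Assumed cut-offs, in days since the last contact (not from the article).
YELLOW_AFTER_DAYS = 3
RED_AFTER_DAYS = 7


def staleness_color(last_update: date, today: date) -> str:
    """Return the color for a "last update" cell, given today's date."""
    days_since = (today - last_update).days
    if days_since >= RED_AFTER_DAYS:
        return "red"
    if days_since >= YELLOW_AFTER_DAYS:
        return "yellow"
    return "green"
```

In a real spreadsheet the same idea would be three conditional-formatting rules on the “last update” column, one per color.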

I also use sorting and conditional formatting on the spreadsheet to help me quickly identify the status of the roles that both parties are excited about, which tend to be my high Priority.

The first thing I do every morning is to check the high-priority roles and make sure that I don’t drop the ball in getting back to them and do a check-in if they are taking too long to get back to me.

After taking whatever actions the high-priority ones need, I go over the other ones in priority order and reassess them. Should they go higher or lower in Priority? Did any new information come in that changed how I feel about them?

As a self-imposed SLA, I try never to take longer than 24 hours to reply to tier 1 opportunities, no longer than three days for tier 2, and a week for the rest. This spreadsheet’s value comes from being an easy, visual process to manage my SLAs.
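
For readers who prefer code to spreadsheet formulas, the SLA rule above can be sketched as a small Python function. This is only an illustration, not the formula from my spreadsheet: the tier numbers, the sla_status name, and the choice to turn yellow once half of the allowed time has passed are all assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical mapping of tier -> maximum time allowed before replying.
# The durations mirror the self-imposed SLA described above.
SLA_BY_TIER = {
    1: timedelta(hours=24),
    2: timedelta(days=3),
    3: timedelta(days=7),
}


def sla_status(tier: int, last_update: datetime, now: datetime) -> str:
    """Return "green", "yellow", or "red" for a role that is waiting on a reply."""
    # Unknown tiers fall back to the loosest SLA ("the rest").
    limit = SLA_BY_TIER.get(tier, SLA_BY_TIER[3])
    elapsed = now - last_update
    if elapsed > limit:
        return "red"
    # Assumption: start warning once half of the allowed time has passed.
    if elapsed > limit / 2:
        return "yellow"
    return "green"
```

A tier 1 role last touched 13 hours ago would come back as "yellow" under these assumptions, which is the nudge to reply before the 24 hours run out.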

Speaking of time management, something that has helped me immensely is to use Calendly. Calendly and similar tools allow you to send a link that will enable people to book meetings in your calendar, drastically reducing the back-and-forth of finding a good time for everyone. You will see that many headhunters use it, but you should have your own account and make sure that it is in sync with your personal and professional calendars.

4. Be strategic around your interviews and chats

I am very intentional with how I design recruiting processes for folks I hire, and I try to apply these same general principles when I am on the other side of the table.

My guiding philosophy in both scenarios is that it is impossible to know if a candidate is a good fit for a job. So, with this in mind, instead of trying to validate if it would be a good match, I start from the assumption that it would be and then try to falsify the hypothesis as early as possible.

When looking for a job, I first list what I am looking for and what I don’t want in my next position. Usually, this has the kind of role and titles, the organization’s size, profitable vs. pre-revenue vs. growth-oriented, how many rounds of funding or close to an exit they might be, etc. The current job market for tech is so hot that even if you cannot choose where you will work, you can definitely choose where you will not.

I usually do not share this list with headhunters or hiring companies. I don’t want them to take the list literally and end up missing out on an opportunity that could be actually pretty good, even if not perfect. Also, if they really want me to apply (maybe because the headhunter really needs to show their clients that they are sourcing good candidates!), they will find ways to present whatever role they are working on as a perfect match.

Following this process, when you decide to move ahead with a position someone sent over, you assume this would be a good fit. Your task now is to use every interaction to falsify this assumption, searching for evidence that the role does not fulfill what you have listed as your requirements. Take some time beforehand to think of questions that can help you in this discovery. Keep in mind that it is rarely a good idea to ask directly about subjective topics. People are in sell mode when talking to you. While it is OK to ask how many engineers a company has, or if they intend on getting new funding soon, questions like “what do you think of your engineering culture?” aren’t going to surface helpful information.

I strongly recommend that you keep your questions laser-focused on the list of requirements you wrote, but I do tend to have a few more general questions I ask every person I talk to. My favorite is “What is your current bottleneck? What is the one thing that prevents you from moving as fast as you think you should move”? Then, depending on the answer, I have a follow-up: “If this constraint would magically disappear tomorrow, what do you think would become the next one?” This line of questioning is from the Theory of Constraints and gives you a good idea of how folks work and think. For example, it is common for the answer to be “We don’t have enough engineers”. This is almost always an indicator that the leadership team isn’t as experienced as they might present themselves. Nobody ever wants to hire engineers; there is something they want, and they believe that hiring engineers is the only way to get there—and that is seldom the case.

Something else to falsify as early as possible is where the position lies in the organization. Titles can be very misleading: one company might have a director managing three people while others of similar size have a manager of thirty. Still, make sure that your new title won’t look like a demotion or stagnation on your résumé, as this might come back to bite you the next time you are looking for a job. In my experience, the best way to find good evidence of whether the position they have is close to what you want is to find out whom you would report to and who would report to you. Understandably, this might be a little fuzzy in small companies, but make sure that their seniority doesn’t feel misaligned with your expectations. Also, please make sure you spend a considerable amount of time with your boss-to-be during the process.

5. Do not waste your time, but part as friends

This should be a guiding principle when applying for any job, but it is even more important for senior leadership roles. They require massive time investment from busy people such as you and the hiring organization leaders, so being honest and upfront can save everyone enormous time, money, and energy.

Following the process from the previous section, once I realize that a position does not meet the requirements I had listed, I tend to email the headhunter and the hiring organization the next day. I wait until the following day so that I have some extra time to think about it and avoid a potential knee-jerk reaction to a single lousy interview or something like that, but once I make up my mind, I will email them within 24 hours, tops.

There is always the question of how much feedback you want to give the various people you might have talked to during this process. You absolutely should volunteer the primary reason driving your decision (e.g. “I am currently interested in more senior roles/smaller organizations/moving out of the finance industry”), but keep details and secondary reasons to yourself. And, unless the process was an absolute clusterfuck and you want the hiring company to know, I would only send feedback on the process to the recruiter, not people from the hiring company. Remember: you want to keep a good relationship with the headhunter, and getting between them and their client introduces massive risk for no benefit to you.

And also, keep in mind that just because the company doesn’t have a role for you now doesn’t mean that it won’t ever have it in the future. The organization will grow and expand its needs and possibilities. There will be reorgs and departures that will create all sorts of opportunities. So be kind with your words and make yourself available for a regular catch-up and networking.

In fact, in the recent past, I have developed advisor relationships with organizations that were not a good fit. These relationships deserve their own article, but it is something to consider bringing up as you part ways.

How I like to use OKRs

Fri, 31 Jan 2020 00:00:00 +0000

Recently I sent a memo to my teams at SeatGeek setting the scene around changes that I want to see in our OKR and planning processes. I’ve asked a few people from my professional network for feedback on this email, and it seems like this is something many other organizations struggle with. I am publishing a lightly-edited version of the memo below. Hopefully, it will be useful to some people facing similar challenges.

From: Phil Calçado
Date: Thu, Jan 9, 2020, 12:30 PM
Subject: OKRs in 2020

Hi team,

We are already a few weeks into 2020, but many teams are still working on their OKRs. This creates an opportunity for us to iterate on how we use this tool at SeatGeek. As a first step, I want to challenge the way we think about OKRs. For now, I don’t want to perform any drastic changes to the current process. What I want is to give all of you some context to help understand why I might nudge you one way or another now and in the future as we work on our goals and strategy.

There is a lot on OKRs in our wiki and all over the Internet. In this text, I want to focus on real-world usage and the challenges of this powerful tool. That is why I am skipping introductions and assuming that you have some familiarity with what OKRs are and have probably used them at SeatGeek or a previous employer.

One problem I have seen with our OKRs

After going through a few OKR cycles at SeatGeek, I am convinced that we tend to fall into a very common trap: we use OKRs the same way more traditional organizations use a Work Breakdown Structure (WBS). To illustrate what I mean by this, I will use an oversimplified example.

Let’s suppose that I am setting OKRs for my personal life. I decide that one of the most important Objectives I have is “To be healthy.” There are a few different ways to express this Objective in an OKR-based process. If we follow the typical SeatGeek style, we probably will build something like this:

Looking at the above, it might sound like a reasonable plan for someone to be healthy. The problem here is that even if we do all these things we set out to do, we can still be very unhealthy. For example, maybe you reduced your weekday alcohol consumption, but now you drink a lot more sugary drinks over the week, or perhaps you are cooking your own meals, but all you cook is mac and cheese.

Instead, the way I have seen OKRs work well is when you use the Key Results as the test of whether you have achieved the Objective. Applying this mindset, let’s think about some of the things that are generally accepted as indicators that someone is “healthy”:

This is obviously an oversimplification to illustrate my point—I am not a doctor, and you should not follow anything I say about health—but I’m sure you got the idea.

To me, the most significant benefit of this format is that it focuses on outcome instead of output. The number of projects, features, RFCs, or bugfixes we build and deploy is irrelevant. The only thing that matters is what material impact these had on the business and the experience of our users, partners, and employees.

It also helps us define what we mean when we use terms such as “healthy.” Objectives will almost always be annoyingly hand-wavy, and the fact that they are open to interpretation tends to create some friction between teams. In this model, we are trying to define what we mean by “healthy” precisely. Different parties will argue a lot about what should be in it, but once the definition is agreed upon, it becomes a clear contract we all live by.

Another significant advantage of this style is that it gives teams a lot of freedom in how they will achieve that. What you have agreed on doing, i.e. your OKR, is just the what. Whoever is accountable for the OKR should be empowered to explore options for how they will get there. At the beginning of a quarter, teams will begin new projects and initiatives focusing on achieving their OKRs. They will use small and continuous releases to push their work to the users early and often. Still, they might observe that all this work doesn’t really have a material impact on the Key Results the way they thought it would. In a healthy OKR culture, teams in this situation should immediately regroup and pivot, exploring what other projects they should try to achieve these desired results.

OKRs, Roadmaps, and Project Portfolios

One interesting challenge in applying this model is that it often requires familiarity with and access to essential metrics, often called KPIs, or Key Performance Indicators. It is perfectly fine, and even expected, for a Key Result to be that the team starts collecting data on some KPI we would like to use for future OKRs.

It is possible, though, for an Objective to have one or a few Key Results that are the delivery of some project or artifact, but this should be seen as a bad smell. It is an indication that we are probably missing some metric that can better reflect the desired outcomes.

Ideally, a team will look at their OKRs and start planning what efforts or projects they should start/keep/stop to achieve the desired results within the timeframe. This is generally called portfolio management, and it is something that I will be working on closely with you all on a regular basis.

If you want to learn more about this topic, there are a few good books on OKRs and similar goal-setting processes. My favorites are:

Measure What Matters, the classic book by John Doerr
High Output Management, Andy Grove’s seminal work on management and strategy
The Advantage, which I see as Patrick Lencioni compiling most of his work on management in a single book
OKRs, From Mission to Metrics: How Objectives and Key Results Can Help Your Company Achieve Great Things, an interesting collection of articles on real-world challenges in using OKRs by Francisco H. de Mello

As always, please reach out to me with any comments, feedback, or questions.

Cheers

Guiding Principles for Developer Tools

Tue, 30 Jul 2019 00:00:00 +0000

Just like almost anything else in software engineering, we don’t have a precise definition for the term microservice. This lack of formality doesn’t make the term worthless, however. There are a few useful characteristics we can infer whenever someone says they have an architecture that follows this paradigm. One such characteristic of a “microservices-based architecture” is that it has a lot of small, independent pieces of software, the so-called services.

With so many small services to build and manage, I find it useful to think about this as the economics of microservices. Basically, the organization needs to make it “cheaper” to build and operate products following the microservices way than adding “just one more feature” to the monolith.

Applying this mindset, something organizations quickly realize is that they need to invest in some areas usually neglected in more traditional, monolithic architectures. Building on prior art by Martin Fowler, I wrote a detailed article on this. Here is a handy list of areas that require some extra investment before adopting microservices:

Rapid provisioning of compute resources
Basic monitoring
Rapid deployment
Easy to provision storage
Easy access to the edge
Authentication/Authorization
Standardized RPC

These days, cloud providers and open source projects offer great tools to minimize the need for custom solutions for most of the items listed above. Nevertheless, it is still the case that an organization needs to build some tooling. Usually, we need some glue code to fill in gaps between off-the-shelf tools, to enforce conventions, or to offer a more productive workflow to engineers.

I have spent the last few years building such tools, both for internal users and as products with paying customers. Sometimes, this work was done by a product engineering team, sometimes by an infrastructure team. To simplify our vocabulary, I will call this type of work platform.

Over time, my experience in platform work has led me to compile a list of principles that I like to follow when building developer tooling, which I document in this article.

Know your audience(s)

It is very tempting for teams building dev tools to try and build the tools that they would love to have. Well, just like with any other type of product development, teams building dev tools need to take a step back and understand that they are not the user.

If someone is part of a platform team working on developer tooling, this person likely has interest, skill, and experience on how to work with infrastructure. You should not expect the same from people who are going to use the tools this team creates.

One way to understand whom you are building your tooling for is to run quarterly surveys, in which engineers self-assess their proficiency levels in various technologies used by the organization (e.g., AWS Lambda, microservices, Node.js, Go, MySQL, etc.). Getting people to respond to surveys like this is always challenging, but making a survey anonymous and a self-assessment tends to increase engagement levels.

The survey should be straightforward; here is a screenshot of the survey my team used at Meetup:

Data from a survey like this is subjective and should be combined with other quantitative and qualitative sources of feedback. Still, this is a great way to draw a map that helps you visualize the gaps in skill and experience you might have. The team should use the results to help build and prioritize their backlog and roadmap.

At Meetup, for example, results coming from the survey above showed that most engineers had some level of experience with Serverless technologies such as DynamoDB and AWS Lambda. Surprisingly, only a few people reported knowing about fundamental topics such as IAM, VPC, CloudFormation, etc. Based on this split, the platform team decided to first build features that make it easier to use the latter and postpone working on Serverless-specific topics.

Prioritizing engineers not well-versed in infrastructure doesn’t mean that we could ignore infrastructure-savvy folks. There are many different ways to make sure you don’t alienate them from the work, but the most important step the platform team needs to take is to make it clear to infrastructure experts that if you know enough about infrastructure to have strong opinions about its internals, you probably aren’t part of the main audience for the tool.

However, even if they aren’t the target audience, the different cohorts must coexist. It is on the platform team to make an effort to allow the experts to integrate their tools and workflow with the tooling they create. This goal is not always possible, but you’d be surprised with how much common ground can be reached with a little bit of goodwill from both sides.

At Meetup, before we added any major feature to our developer tool, we would create a write-up, often started as an RFC, describing it in detail. It would, for example, describe that a new feature that creates AWS Accounts for users would automatically add such and such roles, with such and such permissions, and follow a specific naming convention. This spec allowed folks like our data science team, who had already invested a lot on their own Terraform-based automation, to make sure that their infrastructure was compatible with the rest of the organization.

Collect and monitor usage metrics

After spending most of my career in product engineering, something that shocked me when I started working on infrastructure products was how little information about our users’ habits and usage of the platform the team had. We relied a lot on asking for user feedback, sometimes through user panels, other times by inviting a few users to a usability lab. We never observed what people were doing in their normal day-to-day lives; all we knew was what they told us or what we saw in a lab.

As an attempt to change that, whenever I am building a developer tool, I make sure that we send usage metrics to some analytics database. The platform team can then analyze this data and find interesting insights about how people use their tools in the real world, performing real tasks. By doing this, you may, for example, notice that your users still rely too much on the AWS web interface for something that your command-line tool already provides, that some commands are always run in sequence and could be collapsed in one, or that a given feature users needed to run many times a day is too slow and probably frustrating your users.

Every time a tool is used, it sends to the analytics database at least the full command line invoked by the user, everything that was written to STDOUT and STDERR, and how long the operation took. You might also want to send any relevant environment variables, who the current user is, and which host the tool is being executed from. Think of this as Google Analytics for your command-line tools.
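
To give an idea of how little code this takes, here is a minimal sketch in Python of a wrapper that collects the fields listed above. It is not the code we ran at Meetup: the function names and the field names in the event are assumptions, and the exit code is an extra field I added for illustration.

```python
import getpass
import socket
import subprocess
import time
from typing import Any, Dict, List


def current_user() -> str:
    """Best-effort lookup of the current user; never fails the tool."""
    try:
        return getpass.getuser()
    except Exception:
        return "unknown"


def run_and_record(argv: List[str]) -> Dict[str, Any]:
    """Run a command and build a usage event describing what happened."""
    started = time.monotonic()
    proc = subprocess.run(argv, capture_output=True, text=True)
    duration = time.monotonic() - started
    return {
        "command_line": argv,          # full command line as invoked
        "stdout": proc.stdout,         # everything written to STDOUT
        "stderr": proc.stderr,         # everything written to STDERR
        "exit_code": proc.returncode,  # extra field, not in the list above
        "duration_seconds": duration,  # how long the operation took
        "user": current_user(),
        "host": socket.gethostname(),
    }
```

The returned dictionary is what would be shipped to the analytics database; how it gets there (HTTP, a queue, a log shipper) is a separate decision.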

At Meetup, running our tools with the -v flag showed to users what information was sent to the server (it’s in the last line of the output):

One interesting challenge is that internal tools are unlikely to have enough usage metrics to have statistically significant data. They often fall into what is sometimes called “small data,” which roughly means that the dataset produced is small enough to be understood by humans but not large enough to apply those neat statistical methods that modern product management loves.

That is why, while it is often interesting to analyze the usage metrics of your tooling, it is probably more important to analyze the impact they have. At Meetup, we measured this by tagging every AWS resource touched by our tooling with some metadata that allowed us to see that this particular resource had been created or updated by our product. We could then quickly visualize how much of our infrastructure was managed by our tooling versus using alternative ways. This information was fundamental when defining our projects and priorities for the platform team.
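
The coverage number described above boils down to a simple ratio. Here is a minimal sketch in Python, assuming each resource is represented by its dictionary of tags; the "managed-by" tag key and its value are hypothetical, not the metadata we actually used.

```python
from typing import Dict, Iterable

# Hypothetical tag written by the tooling on every resource it touches.
MANAGED_TAG_KEY = "managed-by"
MANAGED_TAG_VALUE = "cloud-tools"


def managed_share(resources: Iterable[Dict[str, str]]) -> float:
    """Return the fraction of resources whose tags mark them as tool-managed."""
    all_tags = list(resources)
    if not all_tags:
        return 0.0
    managed = sum(
        1 for tags in all_tags if tags.get(MANAGED_TAG_KEY) == MANAGED_TAG_VALUE
    )
    return managed / len(all_tags)
```

Given an inventory of four resources where two carry the tag, managed_share returns 0.5.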

A few practical considerations when implementing metrics for your tooling:

Make sure you add a flag that allows users to bypass sending analytics data to the server. At Meetup this is achieved by the --incognito flag that every command honors
Your tools should never take passwords or other sensitive information as parameters or output them, but in case you absolutely have to do so please make sure that you do not collect this information in plain text in your logs
Failure in sending analytics shouldn’t prevent the tool from working. If it can’t send the data, you might want the tool to save the logs on local disk to be sent later. Whatever you do, though, do not throw an error at the user just because logs can’t be sent to the server
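
The three considerations above can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not our implementation: report_usage, the send callable, the spool file, and the pattern used to spot secrets are all hypothetical, and a real tool would need a much more careful redaction strategy.

```python
import json
import re
from pathlib import Path
from typing import Any, Callable, Dict

# Assumption: secrets look like "--password value" or "token=value".
SECRET_PATTERN = re.compile(r"(?i)((?:password|secret|token)[=\s]+)\S+")


def redact(text: str) -> str:
    """Replace anything that looks like a secret value with a placeholder."""
    return SECRET_PATTERN.sub(r"\1[REDACTED]", text)


def report_usage(
    event: Dict[str, Any],
    send: Callable[[Dict[str, Any]], None],
    spool_file: Path,
    incognito: bool = False,
) -> str:
    """Send a usage event without ever raising an error at the user."""
    if incognito:
        return "skipped"  # the user opted out, so nothing is collected
    safe_event = {
        key: redact(value) if isinstance(value, str) else value
        for key, value in event.items()
    }
    try:
        send(safe_event)
        return "sent"
    except Exception:
        # Could not reach the server: keep the event on local disk for later.
        try:
            with spool_file.open("a", encoding="utf-8") as handle:
                handle.write(json.dumps(safe_event) + "\n")
            return "spooled"
        except OSError:
            return "dropped"  # give up silently rather than break the tool
```

Returning a status string instead of raising keeps the analytics path out of the way of the command the user actually asked for.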

Avoid creating new abstractions, simplify existing ones

When I headed product engineering at DigitalOcean, we were always concerned about how we could offer sophisticated products to our users without requiring them to read a 200-page manual to find out if they needed their features at all.

One option to deal with this challenge was to wrap infrastructure-heavy concepts as higher-level abstractions. For example, instead of selling VMs and object storage as separate primitives, we could package them all together as a single product, something like what Google AppEngine did back in the day.

This idea had its appeal, but something that even Google suffered with back in the day (and AWS and others are experiencing as they evangelize Serverless computing), is that every time you do something like this you are not actually removing complexity, you are just replacing existing concepts with a whole new set of abstractions. Even if the older abstractions were complicated, there are probably thousands of StackOverflow questions, tutorials, books, etc. that document and explain them. Irrespective of how much simpler they might be, if you create new abstractions, it is on you to educate your userbase on how to use them.

Instead, we decided to work with existing concepts as much as possible, and simplify how users interacted with them. As an example, the first version of our load balancer product was nothing more than a few VMs running HAProxy and managed by Terraform—nothing that users couldn’t already do on their own. Instead of exposing the complexities of these tools, though, we tried to create a clean user interface that didn’t try to be smart, it just removed any details that weren’t important for the majority of our users:

At Meetup, we have standardized on CloudFormation as our configuration management tool. Unfortunately, CloudFormation’s out-of-the-box user experience is awful. As an example, let’s say that you have a CloudFormation template named standard_user_and_permissions.yaml in your local directory. Here is what we needed to do to run this template using the aws command-line tool:

The only parameter in this very long command line that is unique to the task is --template-body. Everything else is just metadata that one needs to add according to Meetup’s conventions and standards for AWS.

Considering how often engineers performed this task during their day-to-day work, our platform team decided that this was worth automating. We added a create-stack option to our cloud-tools utility, and the new command looked like this:

When building the feature above, our main goal was to avoid requiring engineers to remember and type each one of the arcane yet super important parameters. We figured out that we could infer everything we needed from things like the AWS configuration file, environment variables, and directory structure conventions, which simplifies the user experience drastically.
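
To illustrate the kind of inference described above, here is a minimal sketch in Python that expands a template path into a full aws cloudformation create-stack invocation. It is not the cloud-tools implementation, and the conventions are made up for the example: the stack name derived from the file name, the team tag taken from the parent directory, and the fallback region are all assumptions. The aws flags and the AWS_DEFAULT_REGION and AWS_PROFILE environment variables are the standard ones.

```python
from pathlib import Path
from typing import List, Mapping


def build_create_stack_command(template: Path, env: Mapping[str, str]) -> List[str]:
    """Expand a template path into a full create-stack command line."""
    # Hypothetical convention: the stack is named after the template file.
    # Stack names cannot contain underscores, so swap them for hyphens.
    stack_name = template.stem.replace("_", "-")
    # Hypothetical convention: the parent directory is the owning team.
    team = template.parent.name
    command = [
        "aws", "cloudformation", "create-stack",
        "--stack-name", stack_name,
        "--template-body", f"file://{template.as_posix()}",
        # Assumption: fall back to us-east-1 when no region is configured.
        "--region", env.get("AWS_DEFAULT_REGION", "us-east-1"),
        "--tags", f"Key=team,Value={team}",
    ]
    profile = env.get("AWS_PROFILE")
    if profile:
        command += ["--profile", profile]
    return command
```

In real use the env argument would be os.environ; passing it in explicitly keeps the function easy to test.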

We could take this a little further and use more conventions and metadata to completely eliminate the need for engineers to write the CloudFormation templates (let’s be honest, they are mostly copied and pasted around). Even if this could streamline the workflow even further, we decided to simplify, but not hide, CloudFormation.

One of the reasons for this decision was the educational argument discussed above: we had access to inexpensive or free educational resources and consulting on CloudFormation. Another big reason was that we realized that the more we hid away a fundamental tool like CloudFormation, the harder it would be for us to adopt new features from AWS. If we used our own abstractions for configuration management, every time AWS released a new feature our users would have to wait until the platform team added support for it to our tools.

If we do not shy away from CloudFormation, we would be able to use new features as soon as AWS makes them available—granted, AWS is notorious for not adding new features to CloudFormation until after launch, but it would still take longer to do it ourselves.

Build on top of the existing user experience, do not try to hide it away

SoundCloud started heavily investing in container technology around 2011. This was years before Docker was released, so we had to develop our own tooling. Like most people back then, we used cgroups, Linux namespaces, and SquashFS images to build our container infrastructure. Containers were used only in production; during development, folks would use their local machine’s environment, and upon deploy, using git push, the platform would package the code as a container and deploy it. It was designed to offer an experience almost identical to Heroku’s, as this slide from a presentation I gave in 2013 shows:

This system has served us for many years and during our most extreme hyper-growth stages. Eventually, though, it became clear that adopting the Docker toolset would be extremely beneficial to us, especially as it would allow engineers to run containers on their development machines smoothly.

As we planned for this change, we faced the familiar challenge of keeping our engineers as productive as possible while we transitioned our platform to the new technology. One way we found to achieve that was to invest in automation, creating tools that would make it super easy for people to perform some of our most everyday tasks, even if they had no idea what Docker was or how to use it.

As an example, here is the output of a tool we had that automatically created build pipelines in our Jenkins cluster:

The pipeline tool shown at work above read a manifest file containing some metadata about the project and the Makefile.pipeline, which contained instructions on how to run the build, very similar to the role of .travis.yml when using Travis CI.

Something interesting about the tool is that not only does it use the Docker command line as discussed in the previous section, but it also writes to STDOUT the full command line it invoked and the full output returned by the process it has spawned. At first, this was a debugging resource used by the platform team while developing these tools, but for some reason, it was never turned off before releasing the tool to our engineers.

One massive positive impact that this had was that it was a great way for engineers to get acquainted with Docker. I think it would be correct to say that everything I learned about how to use Docker back then was by observing what the tool was doing and how Docker would react.

This accidental feature is something that I have since adopted as a principle for all infrastructure tools.

Every tool we built at Meetup had a built-in “verbose” mode that showed users what AWS commands were being issued and what was returned. For example, if you wanted to see a list of which AWS OrganizationalUnits (basically a grouping of AWS accounts) belonged to which teams, you would typically run this command:

However, if you added the -v flag to the line above, it would output everything that has to do with the AWS commands:

Given how low-level the aws command-line commands tend to be, this tends to be very noisy. That is why it is not enabled by default.

As discussed, exposing users to the ins and outs of the underlying platform is an efficient and inexpensive way to teach by example. Another benefit of this approach is that it is much easier for users to work around problems and get help when things go wrong—especially during partial failures. The user can see exactly what the tool was doing when things went wrong, which helps both them and the platform team understand what steps you need to take to fix the problem.
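
A verbose mode like the one described above does not need much machinery. Here is a minimal sketch in Python, assuming the tool shells out to an underlying command; run_echoed and its parameters are hypothetical names, not the API of any of the tools mentioned in this article.

```python
import shlex
import subprocess
import sys
from typing import List, TextIO


def run_echoed(argv: List[str], verbose: bool = False, log: TextIO = sys.stderr) -> str:
    """Run the underlying command, optionally showing exactly what was run."""
    if verbose:
        # Print the command in a form users can copy and paste into a shell.
        print(f"$ {shlex.join(argv)}", file=log)
    proc = subprocess.run(argv, capture_output=True, text=True)
    if verbose:
        # Show the raw output of the underlying tool, unfiltered.
        print(proc.stdout, end="", file=log)
        print(proc.stderr, end="", file=log)
    # Raise CalledProcessError if the underlying command failed.
    proc.check_returncode()
    return proc.stdout
```

Because the echo happens before the command runs, the user still sees which step was in flight when a partial failure occurs.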

Rely as little as possible on what is installed on the host or remote servers

A question teams typically have when they start getting serious about building platform tools is what programming language or runtime they should use to build these tools.

I am not interested in lengthy debates about which programming language is the best one (well, not anymore). I believe that an engineer should become familiar with as many programming languages and paradigms as possible and make a decision about which one to use for a specific project based on the constraints under which they work. In my experience, the primary constraint is always on the people side: picking something that your team can be productive in very quickly and for which you can find good candidates when you need to grow your team.

When it comes to building platform tools, though, there is one other constraint that is always present: your tools should have as few moving parts as possible.

Back to the work we did at SoundCloud: we first built most of the Docker-based tools previously discussed here using Bash scripts. As always happens, at some point Bash became hard to scale and test, and we needed to pick a new platform. The team had experience in Python and Ruby, so we started building our tools in these two languages. At first, this worked well, as both are very productive and have a vast number of libraries, testing tools, and real-world examples we could leverage.

Soon enough, though, we started having some issues. Every engineer already had some version of both Python and Ruby installed on their laptops, but the same wasn’t necessarily true for our servers, build boxes, and the laptops from product managers, designers or any other non-engineering folks who might need to perform a small infrastructure task as part of their job.

However, even engineers were having issues. They would need to keep and manage many different versions of these runtimes. Our legacy Rails application required a specific version of Ruby, some new services required their own versions, and our tools would run on another version. Even if tools like rbenv and RVM make it possible to manage these things, way too often people would report problems when using our tools that were caused by the user mistakenly running the tool against the wrong version of runtime or library.

We tried solving this using package managers like APT and Homebrew, but it felt like adding more overhead and friction to our users. We then packaged all of our tools as Docker containers and made it such that every time a tool was invoked, it would just execute docker run on a container image that we had baked. This setup worked ok enough for a while, but it was a massive performance hit for a tool that was supposed to run and finish quickly, and it also required a very long set of configurations and conventions to map networking and filesystem between localhost and the Docker container.

When I was at DigitalOcean, we released a command-line tool to our customers distributed as a single binary. It was a natural choice for us back then, as DigitalOcean builds systems almost exclusively using Go, and this is how binaries are distributed in this programming language. This distribution style was very successful, as all that our users needed to do to use our cloud was download this one executable file, as opposed to the multi-step process that AWS requires when installing their Python-based command-line tool.

Go isn’t the only modern language that can produce reasonably small executable binaries, and other options like Rust are getting more and more traction amongst platform teams. Irrespective of what programming language you pick, make sure that the resulting executable is self-contained, that it doesn’t require users to install any runtime or virtual machine on their computers.

At Meetup, we followed this principle for all of our command-line tools, but we had a big challenge in that our tooling required the aws command-line tool to be installed by the user. We assumed that proper care and feeding of the aws tool was a reasonable expectation to have of our engineers, and added a lot of checks to make sure that our tool would detect and let the user know when there was a problem with their local AWS installation (see the health check section below).

It is very common for platform tools to interact with systems like CloudFormation, Terraform, and Kubernetes, which require their users to write configuration files in JSON, YAML, or another declarative language. These templates need to be stored somewhere. One approach is to keep them in a remote location, such as an S3 bucket or Maven-style repository. I have found this approach problematic for a few reasons.

Firstly, it adds another moving part to your toolset, which is undesirable. This architecture requires this remote location to be always accessible, which implies high-availability needs, on-call support, incident management, etc.

It also adds some overhead on versioning. Command-line tools make assumptions about the templates, such as which parameters they expect. If you change the template, you need to think about how this change could impact all possible versions of the command-line tool installed on laptops, build boxes, and elsewhere.

Both issues arise from the fact that the command-line tool and the templates are highly coupled. In general, it is advisable to keep two highly coupled components together, as part of the same artifact. When it comes to platform tools, my suggestion is that you embed the templates within the command-line binaries, using tools like Packr.
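To make the idea concrete, here is a minimal sketch in Go, assuming a made-up template and parameter names (serviceTemplate, ServiceParams, and RenderService are all hypothetical). It keeps the template as a constant compiled into the binary, which is the simplest form of embedding. For templates kept as separate files in the repository, Packr or the standard library's embed package (Go 1.16 and later) serve the same purpose.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// serviceTemplate is compiled into the binary, so the tool and the template
// it depends on are always released together as one artifact.
// The template content and its parameters are hypothetical.
const serviceTemplate = `service: {{.Name}}
replicas: {{.Replicas}}
`

// ServiceParams lists the parameters that this version of the tool knows
// the built-in template expects.
type ServiceParams struct {
	Name     string
	Replicas int
}

// RenderService fills the built-in template with the given parameters.
func RenderService(p ServiceParams) (string, error) {
	tmpl, err := template.New("service").Parse(serviceTemplate)
	if err != nil {
		return "", err
	}
	var out bytes.Buffer
	if err := tmpl.Execute(&out, p); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	rendered, err := RenderService(ServiceParams{Name: "payments", Replicas: 3})
	if err != nil {
		panic(err)
	}
	fmt.Print(rendered)
}
```

Because the template and the code that fills it ship in the same executable, a change to the template's parameters cannot reach a user without the matching change to the tool.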

Have a built-in self-check

When I was at Buoyant, our tiny engineering team split our time between working on Linkerd v2 and supporting the hundreds of users of the first version of our Service Mesh. We had an on-call rotation for support, and engineers would take turns helping our community on forums and Slack with issues they had found.

Something that one finds out when doing user support for open-source products is that you spend most of the time trying to figure out if the issue is caused by something in the user’s environment or in your product. As we were an open-source project, we couldn’t ask for access to the users’ systems and had to rely on asking them questions on a public forum. This slow-paced interaction made the process take forever.

That is why one of the first features we built for Linkerd v2, while it was still called Project Conduit, was a self-check that would try to make sure that some basic requirements were in place. Inspired by Homebrew’s doctor command, we tried to give as much information to the user as possible so that they could maybe fix the problems themselves before asking on the forum.

But even when people couldn’t fix their own issues, the first thing we did when people had issues was to ask them to paste the output of this command. This gave the support engineer a lot of useful information from the beginning, instead of having to ask lots of questions over a long period of time.

This check feature is something that I like to have in my internal tools. Similarly to people working on open-source software, a platform team invests a lot of time helping their users understand issues they might experience. This is a built-in way to help users help themselves or, at least, give the platform team some more context on why a problem might be occurring.
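A self-check like this can be sketched in a few lines of Go. This is an illustration under my own assumptions, not the Linkerd or Meetup implementation: the Check type, RunChecks, and binaryOnPath are hypothetical names, and the only real check shown is whether an executable can be found on the user's PATH.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"strings"
)

// Check is one named precondition the tool needs in order to work.
type Check struct {
	Name string
	Run  func() error
}

// RunChecks runs every check, even after a failure, and returns a report
// the user can read or paste into a support request, plus the number of
// failed checks.
func RunChecks(checks []Check) (string, int) {
	var report strings.Builder
	failures := 0
	for _, c := range checks {
		if err := c.Run(); err != nil {
			failures++
			fmt.Fprintf(&report, "[FAIL] %s: %v\n", c.Name, err)
			continue
		}
		fmt.Fprintf(&report, "[ok]   %s\n", c.Name)
	}
	return report.String(), failures
}

// binaryOnPath builds a check that passes when the named executable is
// found on the user's PATH.
func binaryOnPath(name string) Check {
	return Check{
		Name: name + " is installed",
		Run: func() error {
			if _, err := exec.LookPath(name); err != nil {
				return errors.New(name + " not found on PATH")
			}
			return nil
		},
	}
}

func main() {
	report, failures := RunChecks([]Check{
		binaryOnPath("aws"),
		binaryOnPath("docker"),
	})
	fmt.Print(report)
	fmt.Printf("%d check(s) failed\n", failures)
}
```

Running every check instead of stopping at the first failure matters here: the whole report is what the support engineer wants to see pasted into the forum.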

One similar but maybe more important verification is making sure that the user has an up-to-date version of the tools. This is part of the check feature discussed above, but I recommend that this important check should be part of every command.

As an example, here is a failed attempt at creating an AWS account using the tool we built at Meetup. You can see that as part of its normal operation, the tool also queries an S3 bucket that contains the latest version number so that it can compare against its own version:

Ideally, the S3 bucket above should also inform what the minimum acceptable version is. If the current installation of the command-line tool is older than this version, it should reject any commands and ask the user to update the tool. If the current version is older than the most recent release but still higher than the minimal acceptable version, it should display a warning but still execute the command.
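The decision described above can be sketched as a small Go function. This is an illustration only: it assumes plain MAJOR.MINOR.PATCH version strings and takes the minimum and latest versions as arguments, leaving out how they are fetched from the bucket. The names Verdict and CheckVersion are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Verdict is what the tool should do before executing a command.
type Verdict int

const (
	UpToDate     Verdict = iota // run the command
	WarnOutdated                // run the command, but print a warning
	RejectTooOld                // refuse to run and ask the user to update
)

// parseVersion turns a string such as "1.4.2" into its three numeric parts.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("invalid version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("invalid version %q", v)
		}
		out[i] = n
	}
	return out, nil
}

// older reports whether version a is older than version b.
func older(a, b [3]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// CheckVersion compares the installed version against the minimum
// acceptable version and the latest release.
func CheckVersion(current, minimum, latest string) (Verdict, error) {
	cur, err := parseVersion(current)
	if err != nil {
		return UpToDate, err
	}
	minV, err := parseVersion(minimum)
	if err != nil {
		return UpToDate, err
	}
	latestV, err := parseVersion(latest)
	if err != nil {
		return UpToDate, err
	}
	switch {
	case older(cur, minV):
		return RejectTooOld, nil
	case older(cur, latestV):
		return WarnOutdated, nil
	default:
		return UpToDate, nil
	}
}

func main() {
	verdict, err := CheckVersion("1.4.2", "1.3.0", "1.5.0")
	if err != nil {
		panic(err)
	}
	switch verdict {
	case RejectTooOld:
		fmt.Println("this version is no longer supported, please update")
	case WarnOutdated:
		fmt.Println("warning: a newer version is available")
	default:
		fmt.Println("up to date")
	}
}
```

Comparing the parts as numbers rather than as strings avoids treating 1.10.0 as older than 1.9.0.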

Some thoughts on GraphQL vs. BFF

Fri, 12 Jul 2019 00:00:00 +0000

The Back-end for Front-end (BFF) Pattern originated at SoundCloud. It takes its name from the internal framework we built to make application-specific APIs easier to write and maintain. Since then, it has taken on a life of its own, with various articles, books, and open source software that teach, discuss, or implement it.

More recently, another approach to API architecture and design comes in the form of GraphQL. Facebook first developed the technology, and it has quickly become so popular that many startups were created exclusively to build frameworks and tooling around it.

Over the past year or so, I have been asked many times about the relationship between these two. This article is a write-up of my thoughts on the matter.

What is a BFF, even?

I believe that a lot of the questions people have around this topic originate from some misunderstanding of what the actual goal of the BFF pattern is. There is a lot of detail on the background and specifics of the BFF pattern in the original article describing it, but let me try to summarize what I mean by this term.

Let’s take a look at the diagram below:

Option (a) is sometimes called a One-Size-Fits-All (OSFA) API, where we have one (or a few) APIs that serve many applications and use cases. Option (b) is generally called BFF, where each application or sometimes use-case has its own API.

In the OSFA model, we usually have many different applications (sometimes built by third-party developers and business partners) sharing the same endpoints. Every time one of these endpoints needs to be changed, the engineers from the API Team need to make sure that they won’t break any important use cases, integrations, etc. Sometimes people try to go around this challenge by strictly versioning the APIs, but this not only imposes overhead in terms of governance but also won’t prevent you from running multiple versions of the API at the same time, until every client application is able to update their usage.

Instead of trying to apply some strict and more formal governance process to deal with these challenges, with the BFF approach we try to eliminate the problem altogether by giving the team that owns the client applications full control over the API they use.

Putting it in terms from a dichotomy proposed by Martin Fowler, using a BFF means that even if your API might be a Public interface, it isn’t Published. Even if other applications can reach the API—because it is available on the Internet—they are not supposed to do so and this usage isn’t supported by the API owner. Each application then is free to build and evolve their API as it better suits them, with no need to worry about how this would impact other client applications as there will be none.

Something often overlooked when people talk about BFFs is that this new ownership model fundamentally changes the boundaries around your subsystems. In the OSFA approach, the API is a discrete subsystem meant to be used by multiple applications. In contrast, when you have an architecture based on BFFs, the API becomes part of the client application.

The defining characteristic of a BFF is that the API used by a client application is part of said application, owned by the same team that owns it, and it is not meant to be used by any other applications or clients.

Here is an illustration from the original article:

Where does GraphQL fit in all this?

Notice that there isn’t anything in the description above that says that the endpoints provided by a BFF must be optimized for the client application they now belong to. There is no fundamental reason for the API exposed by one BFF to look any different from your typical OSFA API. Nevertheless, when you make the API part of the application, some coupling with the client is not only expected but desired, as teams use the autonomy as leverage.

At SoundCloud, we saw teams using their newfound control over APIs to perform optimizations that made sense for their specific use cases. For example, the Android team experimented with ProtocolBuffers instead of JSON for their API payloads, the partnerships team was able to allow for much more generous rate limiting settings for the API used by the likes of Sonos and Apple, and various teams fine-tuned their caching and CDN usage to better serve their particular needs.

So far, nothing discussed here prevents you from using any flavor of RPC you might prefer. You can follow the recipe above for REST, gRPC, GraphQL, SOAP, or any other combination of wire protocol and architectural style you might favor. Better yet, you can have each application using whatever technology suits them better.

It follows then that it does not make much sense to compare BFFs and GraphQL. You can build your GraphQL APIs as many BFFs or as an OSFA API.

I believe that the reason why people struggle with the relationship between these two related but not mutually exclusive concepts is due to one of the most interesting possibilities that BFFs give to client teams: how to optimize their endpoints and payloads.

To recap, here is how the original article on BFFs explains the challenges teams faced with the OSFA approach we had at SoundCloud:

Below you can see how many requests we used to make in the monolithic days versus the number of those we make for the new web application:

To generate that single profile page, we would have to make many calls to different API endpoints, e.g.:

GET /tracks/1234.json (the author of the track)

GET /tracks/1234/related.json (the tracks to recommend as related)

GET /users/86762.json (information about the track’s author)

GET /users/me.json (information about the current user)

…

…which the web application would then merge to create the user profile page. While this problem exists on all platforms, it was even worse for our growing mobile user base that often used unreliable and slow wireless networks.

As we moved to BFFs and let client teams own their own APIs, they started working on ways to minimize the number of calls needed to do things like render the user profile page mentioned above. Our architecture was heavily RESTful, and GraphQL wasn’t even available yet, so the way we dealt with the issue was to model the endpoints in our API following a Design Pattern called Presentation Model.

When using this pattern, instead of assembling a page from many fine-grained calls to the API as described above, we would model user experience abstractions as their own REST resources. For example, we would have endpoints like /track/123/player.json that returns all data needed to render any of the multiple versions of our player.

A page still required more than one call to fetch all the data it needed to render the whole screen, but the number of requests was drastically reduced, from hundreds to a dozen, and the new endpoints were much easier to manage and reuse.
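As an illustration of the Presentation Model idea, here is a minimal Go sketch of how a BFF might assemble a player resource server-side from fine-grained calls. It is not SoundCloud's code: the Backend interface, the PlayerView fields, and the fake data are all assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Track and User mirror the fine-grained resources of the older API.
type Track struct {
	ID       int
	Title    string
	AuthorID int
}

type User struct {
	ID       int
	Username string
}

// Backend stands in for the internal services a BFF would call.
type Backend interface {
	Track(id int) (Track, error)
	User(id int) (User, error)
	Related(trackID int) ([]Track, error)
}

// PlayerView is the presentation model: one resource holding everything
// the player needs, so the client makes a single request.
type PlayerView struct {
	TrackID       int      `json:"track_id"`
	Title         string   `json:"title"`
	Author        string   `json:"author"`
	RelatedTitles []string `json:"related_titles"`
}

// BuildPlayerView makes the fine-grained calls on the server side and
// merges the results into one payload.
func BuildPlayerView(b Backend, trackID int) (PlayerView, error) {
	track, err := b.Track(trackID)
	if err != nil {
		return PlayerView{}, err
	}
	author, err := b.User(track.AuthorID)
	if err != nil {
		return PlayerView{}, err
	}
	related, err := b.Related(trackID)
	if err != nil {
		return PlayerView{}, err
	}
	titles := make([]string, 0, len(related))
	for _, r := range related {
		titles = append(titles, r.Title)
	}
	return PlayerView{
		TrackID:       track.ID,
		Title:         track.Title,
		Author:        author.Username,
		RelatedTitles: titles,
	}, nil
}

// fakeBackend returns canned data so the sketch runs on its own.
type fakeBackend struct{}

func (fakeBackend) Track(id int) (Track, error) {
	return Track{ID: id, Title: "Demo Track", AuthorID: 42}, nil
}

func (fakeBackend) User(id int) (User, error) {
	return User{ID: id, Username: "demo-artist"}, nil
}

func (fakeBackend) Related(trackID int) ([]Track, error) {
	return []Track{{ID: 7, Title: "Another Track", AuthorID: 42}}, nil
}

func main() {
	view, err := BuildPlayerView(fakeBackend{}, 123)
	if err != nil {
		panic(err)
	}
	body, err := json.Marshal(view)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

The client asks for one resource shaped like the thing it renders, and the merging that the web application used to do happens inside the BFF.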

Were GraphQL available back then and had we decided to use it, things would be quite different. In a RESTful API, the Presentation Model needs to be implemented on the server-side, so that we avoid making all those calls from the example above. When we use GraphQL, we don’t necessarily need a Presentation Model at all, and if we do use one, it can be implemented on the client application, as GraphQL makes it possible to get all data needed in a single request.

One challenge in moving this responsibility back to the client is that it increases the amount of logic that you perform at this layer. It is notoriously hard to make sure that several feature teams are well-staffed when it comes to needs such as mobile development. This leads some organizations to prefer a strategy where they perform as much work server-side as possible, keeping the mobile clients simple and mostly dedicated to display logic. You might also find it difficult to push an urgent change when the deployment process for your app requires going through some kind of approval by an app store.

Do we even need BFFs with GraphQL?

But one more fundamental question that pops up when considering using GraphQL in BFFs is: do we need BFFs at all? As discussed, BFFs are not about the shape of your endpoints, but about giving your client applications autonomy. Still, some GraphQL literature insists that this new technology gives so much freedom to the client by allowing them to perform ad-hoc queries that you can safely have a single OSFA API without the drawbacks from REST-based approaches.

I don’t have enough first-hand experience with GraphQL at scale to have a strong opinion here, but two things about this worry me.

The first friction point is that it is hard for me to believe that you can combine the needs of many different applications, owned by different teams, with different users and use cases, in a single schema. Marc-André Giroux, from Github, has a great article discussing the practical challenges of composing (“stitching”) together schemas coming from different domains. Apollo has published some advanced tooling that aims at easing some of these challenges, but just by looking at this slide from James Baxley’s excellent talk at GraphQL Conf 2019 you can see that there are some non-trivial concepts that need to be applied:

Even if someone comes up with a simple technical solution for how to compose schemas, I am not sure that having a single schema is a good idea to begin with. Trying to derive a single schema that holds a complete-ish model of your data and can be queried by wildly different applications reminds me too much of an Enterprise Data Model, which enterprise software development was very fond of just a few decades ago.

In this world, organizations would try to come up with one single database schema, often federated across many instances of Oracle and IBM relational databases, that would be the one source of truth for the whole company. Applications would be built around this enterprise schema, and there were documents that acted as data dictionaries, explaining to developers what each field and type meant. Fowler wrote a few paragraphs on why these Integration Databases can be problematic, and I believe these same issues might arise when you have a single GraphQL schema for your API:

An integration database needs a schema that takes all its client applications into account. The resulting schema is either more general, more complex or both - because it has to unify what should be separate BoundedContexts. The database usually is controlled by a separate organization to those that develop applications and database changes are more complex because they have to be negotiated between the database group and the various applications.

The benefit of this is that sharing data between applications does not require an extra layer of integration services on the applications. Any changes to data made in a single application are made available to all applications at the time of database commit - thus keeping the applications’ data use better synchronized.

On the whole, integration databases lead to serious problems because the database becomes a point of coupling between the applications that access it. This is usually a deep coupling that significantly increases the risk involved in changing those applications and making it harder to evolve them. As a result most software architects that I respect take the view that integration databases should be avoided.

I am looking forward to reading more experience reports on both BFF and OSFA APIs built using GraphQL. At the moment, based on my own experience and what I see from folks like Marc-André Giroux, I suggest that an organization currently invested in RESTful BFFs keep their separate APIs and migrate them to GraphQL, instead of trying to jump to an OSFA GraphQL API.

A Structured RFC Process

Mon, 19 Nov 2018 00:00:00 +0000

Maybe you are a new engineering leader at a red-hot startup. The founders hired you on account of your previous experience at a successful tech company, and they brought you in to take engineering to the next level. After a few weeks of onboarding, you now have a list of changes you want to implement. How do you find a way to propose that without making the old guard feel alienated from the process?

Or maybe you are part of the old guard yourself. You have shown interest in stepping up and leading the engineering team from a scrappy group of people working 7 days a week to a more mature organization. You were promoted to a position where you finally have the ability to tackle the root cause for the growing pains you all are experiencing. One question still remains, though: how can you make sure that your fellow engineers don’t feel that you are imposing your views like a tyrant?

Or it could be that those ideas aren’t even yours. You are a manager worried about the amount of technical debt and frequent production incidents caused by people rushing to implement their ideas without having them double-checked by a second pair of eyes. When you casually remind them about the benefits of collaboration, you hear about how they are afraid that a reviewer will waste everyone’s time pushing for the perfect solution, and we need the first iteration of this thing out as soon as possible.

In situations like these, you are usually asking yourself: how can you foster a culture that is more accepting and kind towards change?

In my experience, one of the most effective things one can do to achieve that is to establish a structured process for feedback on ideas, designs, and architectures. Honoring a long tradition in software engineering, I call this a Structured Request For Comment (RFC) Process.

Introducing an RFC process

Your organization already has various formal and informal ways to share ideas, from formal presentations to casual chatter over lunch or beers. Something I have observed in the various startups I have worked with is that these channels tend to break down when the organization reaches something like 70-100 engineers. At this size, people still reach out for feedback from those who they know (for example, people who have worked at the organization for a long time, or people who joined at the same time and bonded during onboarding), but these networks are more like cliques than peer groups.

This is when, as a leader in the engineering organization, I tend to establish the structured RFC process. RFC stands for Request For Comments. The term has a long history in engineering, but outside formal standard bodies it is normally used to refer to a document describing an idea, written by someone who expects feedback on it from their peers. This kind of interaction happens all the time amongst engineers, but I believe that a well-defined and structured process helps set the expectation that it is a normal part of the engineering workflow. It also makes it easier for people to take part in the process, as they don’t have to second-guess if their opinions are welcomed or when to bother more senior people for feedback. Teams tend to use this process to gather feedback on the design for a new system, a strategy for upgrading shared libraries, new coding conventions, changes to the code review process, etc.

After introducing such a process in various startups, I have compiled the lessons that my teams and I have learned into an RFC itself, so that the team experiences it first hand while discussing if they should adopt it or not. You can find it in full as a Google Document here. Please feel free to copy this format, make whatever changes make sense and use it in your organization.

In the process described above, the author writes a document describing the proposal, following a template that aims at making sure that some fundamental questions are answered before inviting people to give feedback. They will then ask other engineers for written feedback, usually by sending an email to a well-known mailing list. People reviewing the document provide the author with their opinion, anecdotes from previous experience, and facts related to the proposal. This feedback is considered informational, meaning that the authors of the RFC are free to incorporate it into their proposal or not.

There is no guarantee that the feedback will be ultimately incorporated into the proposal, but we don’t want reviewers thinking that they have wasted their time commenting on it. That is why the process described here requires the authors to acknowledge every piece of feedback given. The authors must also commit to revisiting their final decision at some point in the future, sharing the lessons they have learned.

We have recently introduced this process at Meetup, and in the first few weeks it was already clear that there was demand for something like it:

The document above should contain everything you need to start a structured RFC process on your own. The remainder of this article is an annotated version of it, adding some nuance and historical background that isn’t fully captured in the RFC. It adds some color and background on the key points of this process based on my experience implementing it at ThoughtWorks, SoundCloud, DigitalOcean, and now Meetup.

The annotated RFC

The Header

Authors: Phil Calçado

To be reviewed by: 10/5/2018

Revisit Date: 04/17/2019

State: Feedback Requested

A header like this might look antiquated, but I still find it incredibly useful. At a glance, it tells me who the authors are (important for accountability, which we will discuss later) and gives me a few important dates to keep in mind.

Need

Most of this document was taken verbatim from the one we used to introduce the RFC process at Meetup. It is likely that your organization has some of the needs stated here, but you might want to be more specific about your needs.

A healthy engineering organization demands a culture of asking for and welcoming feedback on our work. In smaller organizations, sharing plans, designs, and decisions is much easier. As we grow, it has become clear that this organic process won’t suffice.

This paragraph acknowledges a common pain point in hyper-growth companies. When your team was small, there was a straightforward way to share ideas between engineering, product, and even founders: just have a conversation! As you hire more people, suddenly engineers find themselves with a feeling that we can summarize as “I don’t know what’s going on anymore.” While RFCs won’t solve all of your problems, they establish a well-defined process to share and consume information about engineering decisions and ideas.

Currently, various teams already write down their plans and designs in documents that could usually be called RFCs (Requests for Comments). Without shared and clear guidance or process, these vary drastically in format, contents, and objectives. There is also a lot of variance in how these are advertised to other engineers who would be good candidates for feedback givers. At the moment, there is no easy way for an engineer to know what topics are being discussed at a given time, or how they could give input on such decisions.

Your team is already sharing information one way or another. Unfortunately, the lack of a standard for how and where to ask for and give feedback makes it such that these documents often don’t reach people who would be the most helpful or impacted by it in a timely fashion.

At Meetup, for example, our Web Architecture team was planning to build a GraphQL-based API to boost mobile productivity. They had a meeting with our mobile team to share the good news and talk about the project; the expectation was that the mobile team would adore the idea. Instead, the GraphQL proposal was received with confusion voiced as questions like “So.. does that mean we should stop our refactoring of the HTTP clients?” It turns out that the mobile engineers had decided to solve the productivity problem themselves by changing their HTTP client to make it super-productive to use our REST API. They had done some amazing thinking about how to improve the current state of things, and even shared the idea as an RFC. Unfortunately, this RFC was never shared with any other team, and people who own the API platform had no idea that this initiative was going on.

There is also no clarity on collaboration versus decision-making. An RFC process, by definition, is meant to collect feedback on a proposal. There will always be different opinions, and we must encourage people to expose their ideas and have them debated. Nevertheless, we operate in a very competitive landscape and we have no time to waste in analysis paralysis. We believe that speed of iteration beats quality of iteration, and to iterate quickly we need absolute clarity about who has the decision-making responsibility on a proposal.

One of the most important aspects of any change management process, especially when trying to increase transparency and engagement, is to avoid design by committee. Whatever feedback gathering process you end up adopting, you must make sure that there is an explicit acknowledgment of who is the decision maker, the one accountable for the outcome and with veto power over it. Feedback givers must always keep in mind that their opinions will be taken into consideration, but there is no guarantee that they will be incorporated into the proposal.

Moreover, existing RFCs and similar documents often get into too much detail about the “how” and not enough on the “what” of the proposal. There are many different ways to materialize an idea, and implementation details are better left to be decided by those who are actually doing the work.

You want to both gather feedback from a diverse audience and make sure that the reviewers aren’t missing the big picture and focusing on implementation details. To achieve that, you need to make sure that your document doesn’t spend too much time on distractions and focuses on the most important aspects of the proposal. When I am coaching managers and leaders, I tend to summarize this as don’t invite people to conversations you don’t want to have with them.

One example of this going bad was when, at SoundCloud, my Platform team published an RFC describing the changes that application developers would face as we moved from our own datacenters to the cloud. The document was full of important and potentially contentious information about how application developers would have to change their mindset about latency, availability, and even simple things like trusting that there was a durable file system in their servers. Nevertheless, the one paragraph that everyone in the company decided to comment on was one that casually mentioned that we would write some tooling in Python because that is the de-facto canonical AWS SDK. This was an implementation detail, completely irrelevant to anyone who wasn’t in that team. Still, they were introducing a new programming language and that sparked a heated debate that went on over the weekend. At SoundCloud, our teams had autonomy to decide whatever tools made sense to them, and the mistake this team made in the RFC was to invite an engineering department full of very opinionated people to give feedback on their programming language preferences.

In summary, we need a clear and simple process that allows people to share their ideas, receive feedback on them, and defines how the decision-making process works.

This sentence summarizes everything that this process tries to address. People need a safe space to get feedback on their ideas, and feedback givers must know how their input will be used.

Approach

The recommended approach to fulfill the needs presented in the previous section is a structured RFC process. In this document, a person or group of people will author a document describing a proposal and asking for feedback on it from the rest of the organization.

As mentioned before, in my own experience a well-managed RFC process can address the needs stated previously. The trick here is that the term Request for Comments means different things to different people. To make it clear what we mean by it, this section tries to be prescriptive and opinionated about how to build such a process.

Feedback vs. Approval

The RFC process is a tool that can be used during the decision-making process, and everyone is encouraged to share rough and early ideas and proposals as RFCs.

The more polished a document looks, the softer and less impactful reviews tend to be. When facing a well-written document, our brains enter into a sunk cost fallacy mindset, thinking “Ugh, I think this is a horrible idea, but this person has put so much effort into it…”. This leads us to focus on smaller, irrelevant details instead of addressing any elephants in the room. It is generally easier to give more candid and useful feedback on something in its earlier stages, maybe just a list of bullet points and a back-of-the-napkin drawing.

As a leader, it is probably common for people to share with you their plans and ideas over Slack, email, or meetings. When this happens to me, I spend some time listening and asking some preliminary questions about the proposal, but soon enough I say “This sounds interesting, do you mind putting it in a two-page document using the RFC format?”. I tend to work with them for a few iterations on it, anticipating questions that I believe will come from the wider audience, and then ask them to send it to the wider group for peer feedback. If the person is resistant to sharing it widely, I coach them into sharing it first with people they are more comfortable with, and widening the circle until the whole organization is engaged.

Even if the proposed change on the RFC is extremely well-received, it doesn’t mean that it is approved to be worked on or that it will be prioritized. Authors of the RFC must make sure that they have whatever approval or sponsorship they need from management, leadership, stakeholders, collaborators, and their own team before any actual work is done.

The biggest caveat of a peer review process like the one described here is that just because something has gotten good feedback and people may be super-excited to see the change implemented, it doesn’t mean that it is the right thing to do, or that it should be prioritized.

This is where the push for everyone to use RFCs and to publish early work can backfire. It is not uncommon for engineers to try and use the process as a way to sell an idea that hasn’t been approved by their stakeholders or managers. They try to gather support from the other members of staff, transforming something that should be a purely technical matter into a popularity contest.

That is why one needs to make it absolutely clear that RFCs are not a decision-making process. RFCs are merely for feedback on a proposal, and there is no commitment that a well-received RFC will be implemented or that a poorly received one won’t.

Just like with any other engineering problem, it is also helpful to be explicit about any constraints before asking people to find a solution. One way in which I have done this in the past is taking responsibility for writing the Need section of the RFC. You should use that as an opportunity to make sure that not only the technical and functional aspects of what is needed are expressed there, but also explicit acknowledgment of the other constraints one is under. For example, you should make it clear that the desired solution needs to be delivered within a given timeframe, or under some budget.

RFCs are expected for any change that extends beyond a team or department, as it gives the people who would be affected an opportunity to learn more about the change and give feedback.

Throughout the document, I try to reiterate over and over the idea that RFCs should be used whenever people can benefit from feedback on an idea. This section takes a more prescriptive stance, explicitly setting the expectation that RFCs will be used when a change impacts more than just a team or any other cohesive group of people. This draws a line on what autonomy means in practice, setting a safeguard that is triggered when a team’s decision might impact other individuals.

In my experience, enforcing this rule is seldom necessary. In fact, it is more common that the problem is the other way around: it is not that people need to be told when to write an RFC; they need coaching to identify when this is not the best course of action.

This might sound counter-intuitive. If we are so convinced that the RFC process brings value to the organization, why don’t we want to have RFCs for almost everything? In my experience, there are two main problems with this approach.

The first problem is that it can create an avalanche of RFCs that spam our inboxes. As discussed in the introduction, the RFC process here won’t scale very well if there are too many changes to be reviewed at a given point in time. People get overwhelmed and will quickly disengage from the process.

A second and more dangerous problem I have observed is something that can be mapped to the phenomenon called diffusion of responsibility. That is when engineers start using the RFC process as a means to protect themselves from any bad consequences. “Everybody reviewed it and gave their ‘ok’” feels like an efficient shield to use when asked hard questions. Autonomy doesn’t work without accountability, and if your engineers are using RFCs as an ass-covering tool you probably need to revisit how your culture deals with failure.

One way to tackle this problem is with coaching. I expect my technical leadership, e.g. tech leads, architects, Staff/Principal engineers, etc., to invest a lot of their time in reviewing and helping prepare RFCs. To me, engineering leaders do their job when they are helping others with their RFCs like this, not when they are writing RFCs themselves.

Authorship, Accountability, and Responsibility

The authors of an RFC can be an individual, team, or any other group of people. Being an author means that a person or team sponsors the initiative and is accountable for it.

I usually ask teams to assume collective ownership of the RFCs they produce. While it’s normal for one person or maybe a pair to take the lead on responding to feedback and managing the process, the ownership of an RFC should be treated the same way as they treat the ownership of code. Every now and then an RFC will be owned by a single person, but this shouldn’t be the norm.

While non-authors may be responsible for implementing the results of an RFC, its authors are accountable for it, as per the definitions below:

The main difference between responsibility and accountability is that responsibility can be shared while accountability cannot. Being accountable not only means being responsible for something but also ultimately being answerable for your actions. Also, accountability is something you hold a person to only after a task is done or not done. Responsibility can be before and/or after a task.

This section makes it explicit that, while the authors might not be the ones actually doing the work of implementing the change, they are accountable for making sure that the RFC process is executed well and, more importantly, for the change being proposed.

The concepts of accountability and responsibility are fundamental to a healthy organization and deserve their own article. If your organization hasn’t yet developed a good understanding of what these terms mean, you might want to expand this section and include some more details.

Collaboration

RFCs must be sent to a mailing list called rfcs@example.org. All engineers are automatically part of this list, and people from other groups are welcome to join and participate.

These days many organizations are trying to completely switch to real-time communication tools like Slack. I personally prefer an asynchronous tool, such as email, for the RFC process. I have also seen teams using Github issues and wiki pages for this.

Comments and feedback should focus on the technical content. As long as they don’t impact the content, collaborators should avoid commenting on formatting, writing style, and other aspects that may be relevant but are not critical. Such comments can be sent directly to the author to avoid polluting the comment thread and flooding people with notifications.

As a Brazilian citizen who has been working in English-speaking environments for more than ten years, I know first-hand the challenges of having English as a Second Language. While we always welcome feedback as a way to get better at expressing ourselves, an RFC isn’t the best forum for it. It is perfectly fine to ask authors and feedback givers to rephrase a sentence that is a little confusing, but please refrain from using this interaction as a way to find teaching moments.

Similarly, don’t be obsessed with formatting. It is great when RFCs look the same, as it makes it easy to quickly parse one and check if you should invest time in it, but it isn’t mandatory.

Authors must address all comments written by the deadline. This doesn’t mean every comment and suggestion must be accepted and incorporated, but they must be carefully read and responded to. Comments written after the deadline may be addressed by the author, but they should be considered as a lower priority.

This goes back to our desire to make sure that people who have invested their time in reading and commenting on the document don’t feel like they have wasted their time and that their opinions aren’t even going to be taken into consideration.

Something to be aware of is that, in my experience, platforms with in-line commenting such as Google Docs or Github Pull Requests can create a habit of commenting-as-you-go. This can be extremely annoying to RFC authors, as they keep receiving notifications and scrolling through comments that would have been answered by the document itself had the reviewer just read a few paragraphs more. There are a few technology options that can help with this, such as Github Reviews, but to me this is a behavior better addressed by feedback and coaching.

Every RFC has a lifecycle. The lifecycle has the following phases:

Draft: The authors are working on the RFC before asking for wider feedback

Feedback Requested: The RFC has been sent to the mailing list and is waiting for feedback from stakeholders

Active: The deadline for comments on this RFC has passed and the authors have decided to go ahead with it

Abandoned: The authors have decided not to move forward with the changes proposed in this RFC.

Retired: The changes proposed on this RFC aren’t in effect anymore, the document is kept for historical purposes

The lifecycle of an RFC is meant as a tool that people can use to enforce a window in which feedback is expected and create a discrete point when the authors can say “Thanks everyone” and move on, either implementing the changes or deciding that it wasn’t such a great idea.
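One way to make these phases concrete is to treat the lifecycle as a small state machine. The following is only a minimal sketch: the phase names come from the list above, but the exact set of allowed transitions (for example, that a Draft can be abandoned directly) and the names Phase, TRANSITIONS, and advance are assumptions made for illustration, not part of the process as written.

```python
from enum import Enum


class Phase(Enum):
    DRAFT = "Draft"
    FEEDBACK_REQUESTED = "Feedback Requested"
    ACTIVE = "Active"
    ABANDONED = "Abandoned"
    RETIRED = "Retired"


# Allowed moves between phases. The phase names come from the process text;
# the transitions themselves are an assumption made for this sketch.
TRANSITIONS = {
    Phase.DRAFT: {Phase.FEEDBACK_REQUESTED, Phase.ABANDONED},
    Phase.FEEDBACK_REQUESTED: {Phase.ACTIVE, Phase.ABANDONED},
    Phase.ACTIVE: {Phase.RETIRED},
    Phase.ABANDONED: set(),
    Phase.RETIRED: set(),
}


def advance(current: Phase, target: Phase) -> Phase:
    """Return the target phase if the move is allowed, otherwise raise."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


if __name__ == "__main__":
    phase = Phase.DRAFT
    for step in (Phase.FEEDBACK_REQUESTED, Phase.ACTIVE, Phase.RETIRED):
        phase = advance(phase, step)
        print(phase.value)
```

In this sketch, skipping the feedback window (Draft straight to Active) raises an error, which mirrors the idea that the lifecycle exists to enforce a discrete period in which feedback is expected.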

The draft stage is aimed at creating a safe space for people to gather early feedback on an idea. As mentioned before, engineers can be really resistant to sharing half-baked thoughts until they can defend their opinions and designs from criticism, and this might take a long time. People seem generally more comfortable with sharing something in its early stages if it is clearly marked as a draft, though, and this can lead to faster feedback cycles.

I would generally recommend that once an RFC moves away from Feedback Requested, it is considered a historical artifact, if not discarded completely. RFCs aren’t great as documentation; once the feedback period is over I usually ask the authors to document any relevant parts somewhere else, like a wiki or even a different Google Doc.

Each RFC has a revisit date, by when the authors will update the mailing list on what they have learned since the feedback phase. This is a natural point for an RFC to be retired and a new approach proposed.

The most important lesson that I have learned as a change agent in organizations is that people are much more welcoming to change if they know that the decisions and assumptions will be revisited at some point in the future.

I love the way that Linda Rising describes this as a Pattern in her great book Fearless Change: Patterns for Introducing New Ideas:

You’re getting worn out as you attempt to address the concerns people have about the new idea because it doesn’t look like the questions and objections are going to end anytime soon.

There are people in the organization who are expressing an endless supply of objections to the new idea. It would be a daunting, or even impossible, task to try to ease everyone’s worries before the new idea is adopted.

Fear is often what keeps us talking and questioning but stops us from doing anything. However, even though people may be fearful of change, they usually love to experiment. Change means risk. An experiment is something you can undo and walk away from when you are all the wiser.

Ideas that can be tested on an installment plan are generally adopted more rapidly than those that are not. If people are offered a trial period, they will have the opportunity to experiment with the innovation under their own conditions. This is likely to ease their uncertainties and give meaning to something that was previously seen as only an abstract idea.

It’s more effective to let people convince themselves through sight and touch than to try to convince them with words and logic. For “test purposes” is a convenient label for temporarily transferring “unacceptable” ideas into an “acceptable” category, until such time that the idea can gain the persuasive power to become part of the established way of doing things.

Therefore:

Suggest that the organization, or a segment of the organization, try the new idea for a limited period as an experiment.

Having an expiry date and a commitment from the authors to revisit the decision is one way to implement this Pattern in your organization.

Format

The RFC document itself is where comments and decisions are recorded. It should be a Google Doc, and everyone should have access rights to comment on it.

As mentioned previously, this process can be implemented using various publishing software, from Google Docs to Github Pull Requests. I personally like the idea of Google Docs because it makes it easier to apply the same RFC process outside engineering. Say, for example, that you want to propose a change on the job description for engineers in your organization. If you use a tool familiar to your HR folks you can keep the conversation in a single document, instead of having to translate back-and-forth between what engineers are giving feedback on and an endless email thread with your People Team.

A good RFC will describe the scope and the approach. It should not contain a list of specific tasks or project plan.

This is a soft requirement, trying once more to reiterate that the what is often more important than the how for an RFC. It is perfectly fine to ask for feedback on a project plan, though, but I would suggest that the authors try and divorce the feedback on the objectives from the discussion about the project plan, its schedule, staffing and resources. The latter should derive from the former once that is established.

To avoid overloading the document with implementation details, RFCs should follow the Stanford Research Institute’s NABC model, making sure that they cover four points:

An NABC comprises the four fundamentals that define a project’s value proposition:

Need: What are our client’s needs? A need should relate to an important and specific client or market opportunity, with market size and end customers clearly stated. With DARPA, for example, we are required to state a critical Department of Defense (DoD) need. The market should be large enough to merit the necessary investment and development time.

Approach: What is our compelling solution to the specific client need? Draw it, simulate it or make a mockup to help convey your vision. As the approach develops through iterations, it becomes a full proposal or business plan, which can include market positioning, cost, staffing, partnering, deliverables, a timetable and intellectual property (IP) protection. If we are developing a product, it must also include product specifications, manufacturing, distribution and sales. DARPA usually demands paradigm-shifting approaches that address a specific DoD need (e.g., a 10-times improvement).

Benefits: What are the client benefits of our approach? Each approach to a client’s need results in unique client benefits, such as low cost, high performance or quick response. At DARPA, the benefit might be an airplane that turns faster, goes higher, costs less or is safer. Success requires that the benefits be quantitative and substantially better - not just different. Why must we win?

Competition/alternatives: Why are our benefits significantly better than the competition? Everyone has alternatives. We must be able to tell our client or partner why our solution represents the best value. To do this, we must clearly understand our competition and our client’s alternatives. For a commercial customer, access to important IP is often a persuasive reason to work with us. At DARPA, our competition is usually other research laboratories and universities across the United States. But, whether to a commercial or government client, we must be able to clearly state why our approach is substantially better than that of the competition. Our answer should be short and memorable.

The NABC format was introduced to SoundCloud by Gavin Bell, who had learned about it during his time at research labs. I was very skeptical of it at first, but nowadays it is my go-to format for proposals.

One of my favorite features of this model is the requirement that authors give some thought to alternatives and how they compare to the proposal. Something I like to enforce myself is that every RFC must consider the alternative of doing nothing. Every change requires investment of time, energy, and resources, and before implementing anything new we should consider what happens if we don’t do anything at all.
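For teams that keep RFCs in a structured form, the NABC sections and the “do nothing” rule can be checked mechanically before a document is sent out. This is a rough sketch under stated assumptions: the RfcDocument and Alternative types and the missing_parts function are hypothetical names invented here, and the check only looks for an alternative literally named “Do Nothing”.

```python
from dataclasses import dataclass, field


@dataclass
class Alternative:
    name: str
    tradeoffs: str


@dataclass
class RfcDocument:
    title: str
    need: str
    approach: str
    benefits: str
    alternatives: list = field(default_factory=list)


def missing_parts(rfc: RfcDocument) -> list:
    """List the NABC parts that are empty or missing from the RFC."""
    problems = []
    for part in ("need", "approach", "benefits"):
        if not getattr(rfc, part).strip():
            problems.append(part)
    # The "do nothing" rule is a personal addition on top of NABC,
    # so it is checked separately from the three text sections.
    names = {alt.name.strip().lower() for alt in rfc.alternatives}
    if "do nothing" not in names:
        problems.append("do nothing alternative")
    return problems


if __name__ == "__main__":
    draft = RfcDocument(
        title="Adopt an RFC process",
        need="Decisions are made in silos",
        approach="Lightweight RFCs with a feedback deadline",
        benefits="",
    )
    print(missing_parts(draft))
```

A check like this is no substitute for a reviewer reading the document; it only catches sections that were skipped entirely.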

Benefits

The first significant benefit of the approach described above is making a clearer distinction between decision-making and feedback gathering. With a clearly appointed accountable team, we can create a disagree and commit culture. We will carefully hear all positions and reckons from everyone, but ultimately a decision will be made by a specific person or team. Once the decision is made, everybody, irrespective of any differences in opinion during the RFC process, will commit to implementing and championing the decision. If it turns out that the decision wasn’t a good one, the revisit date on the RFC is there to make sure another discussion will be held in the near future.

Another important benefit of the proposed RFC process is openness. We have fantastic engineers, and we need to use our collective knowledge as leverage. None of us is as smart as all of us. To make collaboration work, we need to make it easy for all engineers to see what RFCs are being proposed and we need to make it a safe environment to collaborate, where comments focus on factual benefits and tradeoffs.

The NABC format is an industry tool used for making structured ‘pitches’. Using this tool will likely lead us to discuss the what without losing ourselves in the ocean of technical detail.

These paragraphs summarize a lot of what I have discussed in this annotated version, but in a concise way aimed at the reviewer. I find myself referring back to them a lot when people start off-topic discussions on RFCs.

Competition (or Alternatives)

Do Nothing

We should consider the option of not making any change and keeping the ad-hoc model we currently have for RFCs.

The main issues with this option were described in the Need section of this document. Unless something changes, the problems there will remain.

The “Do Nothing” option for the RFC process is highly contextual, but something that I believe most organizations will face is that, in order to keep communication manageable, people will either communicate in silos or stop discussing their ideas altogether. Even if you don’t adopt this process in particular, you should consider implementing an alternative.

Adopt IETF RFC Model as-is

Although any collaborative development process will have feedback as a core component, the name RFC was made popular by the process used by the IETF to document fundamental standards for what eventually became the Internet. We could follow the IETF RFC model, and maybe even require authors to use terms like MUST, SHOULD, and MAY as formally specified by RFC2119 to avoid ambiguity.

The main reason to avoid this style is that IETF RFCs have evolved into “the Internet documents of record”, containing “very detailed technical information” about standards that browser vendors and network middleware need to implement. These documents will impact the whole industry and hence warrant a complex publishing workflow. The process we propose in this document, on the other hand, is about putting forward an idea as early as possible and receiving feedback on it by a wide audience. With this goal in mind, a less formal process like the one described here is preferred.

I’ve always been fascinated with RFC2119, so much so that more than ten years ago I used it as a model when writing one of the first unit testing frameworks available for Clojure. After a few attempts at using the “official” RFC framework, though, I have found that even if you simplify the workflow it is very hard for people to be productive when bound by it. Moreover, there is often no need for an RFC to be so precise: the authors are often the only people implementing the change, and reviewers will benefit more from fluid, conversational prose than from a strict use of keywords.

Use Architecture Decision Record

Michael Nygard published a model to document and manage change in software architecture called Architecture Decision Record (ADR). Its motivation, format, and lifecycle are very similar to what this document proposes.

Nygard’s model is specialized in software architecture work. This is reflected in its usage of engineering tools such as repositories and Markdown files, which only make sense in a software project. We want the RFC process to be a tool useful in areas other than software development, which makes it harder to implement some of the more specialized areas of the process. Nevertheless, ADRs can be used together with the RFC process described here when developing software systems.

I haven’t personally used ADR as proposed by Michael Nygard, and I am very interested in hearing experience reports from folks who have tried it. At the moment, I am not convinced that it is a good replacement for the RFC process described here. People often bring it up when reviewing the RFC process, though, so I wanted to address it from the beginning.

Acknowledgments

Etel Sverdlov, Vitor Pellegrino, José Muanis, Thompson Marzagão, Danilo Sato, Douglas Campos, and Vinícius Baggio Fuentes gave feedback on drafts of this article.

Revision History

11/20/2018 - First published

Layering Microservices

Mon, 24 Sep 2018 00:00:00 +0000

At Meetup, we are going through the oh-so-familiar path of splitting a monolithic system into microservices. The work on this started a few years ago, and the team has made sure that most of the microservices prerequisites were in place before taking any further steps. I joined the team this summer to help with planning and executing the architecture changes that are required to take us to the next level.

As we go through this process, one aspect of software architecture that is constant in our day-to-day is the use of Layers to organize our components. Layering is a technique that hasn’t been discussed as much when it comes to microservices. In this article, I want to review the application of the Layers pattern in a services architecture, and also discuss two layering strategies and how they have been fundamental to me when migrating from monolithic to microservices architectures.

Layers in Service-Oriented Architecture

I believe that Layers are one of the most useful tools in software architecture. They help group components and define how dependency and communication chains happen between them.

Frank Buschmann and his collaborators wrote the most comprehensive description of Layers in software (that I am aware of) in their seminal work Pattern-Oriented Software Architecture, Volume 1, published in 1996. But even before that, Meilir Page-Jones had previously used the concept to describe an Object-Oriented runtime, although he used the word domains to refer to each layer. I particularly like using Martin Fowler’s description of Layering from his book Patterns of Enterprise Application Architecture:

When thinking of a system in terms of layers, you imagine the principal subsystems in the software arranged in some form of layer cake, where each layer rests on a lower layer. In this scheme the higher layer uses various services defined by the lower layer, but the lower layer is unaware of the higher layer. Furthermore, each layer usually hides its lower layers from the layers above, so layer 4 uses the services of layer 3, which uses the services of layer 2, but layer 4 is unaware of layer 2. (Not all layering architectures are opaque like this, but most are—or rather most are mostly opaque.)

Layers then are groupings of components stacked on top of each other. The word component here is a placeholder for whatever the abstraction unit you are working with, e.g., classes, functions, services, etc.
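The dependency rules in the quote above can be sketched as a small check over a dependency graph. This is a hedged illustration only: the layer names, the LAYERS list, and the layer_violations function are hypothetical, and the choice to allow calls within the same layer is an assumption of this sketch that the quote itself does not address.

```python
# Layers listed from top to bottom. The names are hypothetical.
LAYERS = ["presentation", "domain", "data"]


def layer_violations(component_layer, dependencies, strict=True):
    """Return (user, used) pairs that break the layering rules.

    A component may use components in its own layer or in a lower layer,
    never in a higher one. With strict=True (the opaque style described in
    the quote) it may only reach the layer directly below, not skip layers.
    """
    rank = {name: index for index, name in enumerate(LAYERS)}
    violations = []
    for user, used in dependencies:
        distance = rank[component_layer[used]] - rank[component_layer[user]]
        allowed = distance in (0, 1) if strict else distance >= 0
        if not allowed:
            violations.append((user, used))
    return violations


if __name__ == "__main__":
    component_layer = {
        "ProfilePage": "presentation",
        "UserService": "domain",
        "UserTable": "data",
    }
    dependencies = [
        ("ProfilePage", "UserService"),  # fine: one layer down
        ("UserService", "UserTable"),    # fine: one layer down
        ("ProfilePage", "UserTable"),    # skips the domain layer
        ("UserTable", "UserService"),    # lower layer calling upward
    ]
    print(layer_violations(component_layer, dependencies))
    print(layer_violations(component_layer, dependencies, strict=False))
```

Passing strict=False corresponds to the parenthetical in the quote: not all layering architectures are opaque, so skipping a layer may be acceptable while calling upward is not.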

The most well-known implementation of the Layer pattern is probably the networking stack, including its most popular implementation, TCP/IP. This choice is often credited for the flexibility and consequent longevity of TCP/IP, making it possible to extend it in ways unforeseen when it was first designed.

If Layering is about grouping components and stacking them, there is still the question of what criteria to use when grouping components. In fact, in the same book quoted above Fowler says:

[…] the hardest part of a layered architecture is deciding what layers to have and what the responsibility of each layer should be.

Considering our focus on services, one could aggregate components in Layers based on the tech stack they use, their expected availability or many other criteria. Even within an engineering organization, different teams (e.g., infrastructure, appsec, application development, cost management, etc.) will likely have different approaches to these groupings depending on what traits are more interesting to them.

Thus, there are infinite combinations in which one can aggregate services into Layers. It is doubtful that a single layering model will be enough to understand every aspect of your architecture, as each one focuses on one particular viewpoint. You will be using a combination of layering models to manage a complex architecture.

After building a few microservices architectures, I’ve found two layering schemes that are invaluable in understanding and managing such highly-distributed architectures. They are so widely applicable that I will refer to them as Architecture Patterns.

Pattern: Clay-to-Rocks Layering Model

Even amongst services that have similar reliability or security characteristics, e.g. services that implement business logic, we find that they are not all the same in many other vital aspects.

Let’s consider a fictional example based on our work at Meetup. At our main consumer website, Meetup.com, there are lots of different user flows. Let’s focus on the features touched by our users when someone is looking at a profile. They might be looking at their own user profile or at someone they met at an event. They might also check out a group’s profile, to see if they provide the kind of experience that the user is interested in.

In a sophisticated microservices architecture, it is common for each one of the flows above to have its own use-case-focused microservice, which in turn invokes lower-level microservices that have data on groups, users, events, etc.

The consumer website isn’t the only way users interact with us, however. So that brands and businesses like Google or DigitalOcean can organize their multiple meetups across the globe, we offer a product called Meetup Pro. One common use case for a Pro user is to get an overview of what events are scheduled across their groups. This is also modeled as a microservice of its own, accessing a few lower-level services.

Following this scenario, we have some services that are use-case driven, offering data that corresponds almost one-to-one with what the user sees on screen, and some that are more raw, meaning that their data needs to be processed, filtered, and aggregated before it can be presented to users in a meaningful way.

When we look at services through this lens, we start to see a strong correlation between how use-case driven or raw a service is and how often engineers change it over the product’s lifecycle. How many times have you seen the user profile page on social networks like Facebook or Twitter get a facelift since you first joined these networks? They surely look very different now from just a year ago. But, if you think about it, how often has the actual data there changed in a significant way, like when Facebook implemented its “real name” policy, or when Twitter made some profiles “verified”?

In product development, the closer to the customer a piece of software is, the more often it changes. The services on the top of the stack are where product managers and marketers want to improve the experience, where designs need to be refreshed every few months, and where most of the experimentation happens. They naturally experience more churn than other services, and this gives us an opportunity to optimize components at this layer for fast-paced change.

Components at the bottom of this diagram, on the other hand, don’t change that often. Of course, at some point someone added an attribute to a group or to a user that wasn’t there before, but this was often a big deal, surrounded by careful change management and a migration strategy from the previous to the new state.

This dichotomy is big enough to justify its own layering model. I like to call this Clay-to-Rocks:

In this model, we group services based on how frequently we expect them to change. Clay is a nickname for software that is expected to change often, usually driven by the constant changes that a modern software product requires to stay relevant. Software at this layer isn’t meant to be brittle or unreliable, but the people building it will often prioritize iteration speed over performance or resiliency.

Rocks is what we call the underlying software that enables many different use cases, the software that is so close to the core business that it will probably only change if the business model changes. Many other services depend on services from this layer, which means that they should be built and maintained with resiliency and performance in mind.

Services are usually born as clay, as the team is experimenting with new products and features. If the experiment finds product/market fit, they are usually moved down as more and more newer products and features start building on them.
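To get a first, rough cut of which services sit closer to clay and which closer to rocks, one could combine the two signals discussed above: how often a service changes and how many other services build on it. The sketch below is only illustrative. The Service type, the classify function, and especially the thresholds are invented for this example; real cut-off values would have to come from your own deployment history.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    changes_per_quarter: int  # how often engineers changed it recently
    dependents: int           # how many other services build on it


def classify(service, max_rock_changes=3, min_rock_dependents=2):
    """Label a service as 'rock' or 'clay' using two rough signals.

    Both thresholds are arbitrary assumptions made for this sketch.
    """
    rarely_changes = service.changes_per_quarter <= max_rock_changes
    widely_used = service.dependents >= min_rock_dependents
    return "rock" if rarely_changes and widely_used else "clay"


if __name__ == "__main__":
    services = [
        Service("user-profile-page", changes_per_quarter=25, dependents=0),
        Service("pro-events-overview", changes_per_quarter=12, dependents=0),
        Service("groups", changes_per_quarter=1, dependents=6),
        Service("users", changes_per_quarter=2, dependents=9),
    ]
    for service in services:
        print(service.name, classify(service))
```

A service that starts as clay would drift toward the rock label in this sketch as its change rate drops and its number of dependents grows, which matches the movement described above.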

When migrating from monolithic to microservices

Acknowledging the differences between rocks and clay is a common trait of most successful migration projects I’ve been part of. When organizations get to a point in their journey where they consider splitting the monolith, they usually have a stable core product but find it hard to iterate on new features or experiments quickly. In most cases, this has to do with how both clay and rocks share a single system, the monolith.

In such a scenario, the development cycle happens at a slow pace because even the smallest change to a feature at the clay layer can inadvertently affect one of the rocks and take the whole thing down. Code reviews, manual testing, slow rollouts, and many other change management techniques need to be added to a process, which makes the feedback cycle longer and longer.

It is very common for organizations at this stage to organize their engineering teams around a big effort to “split the monolith,” extracting services from it. In principle, the plan looks simple:

Unfortunately, I have never seen such an effort go well. Things usually go well as long as we deal with extracting clay services. In fact, logic at this layer tends to be so thin and coupled to the user experience that these can often be rewritten with a nice UX refresh.

The real problem reveals itself when people attempt to extract the rocks. Not only do these have stricter non-functional requirements, but there are also so many other subsystems that depend on them that it becomes almost impossible to remove one of these things without rewriting half of the monolith.

One approach that I have had more success with, something that classic Monolith-to-Microservices cases such as Twitter or SoundCloud have done, is to focus on extracting your clay objects and not worry at first about your rocks. What you should do instead is to expose these objects internally, building something that is sometimes called a backdoor API.
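The shape of this arrangement can be sketched in a few lines. This is a deliberately simplified, in-process illustration: in practice the backdoor API would be a network call restricted to internal traffic, and every name here (Monolith, internal_get_group, GroupProfileService, the sample data) is hypothetical.

```python
class Monolith:
    """Stand-in for the legacy system that still owns the 'rock' data."""

    def __init__(self):
        self._groups = {1: {"id": 1, "name": "Python Meetup", "members": 1200}}

    def internal_get_group(self, group_id):
        """Backdoor API: read access to a rock object for internal callers."""
        group = self._groups.get(group_id)
        # Return a copy so callers cannot mutate the monolith's own state.
        return dict(group) if group is not None else None


class GroupProfileService:
    """Extracted 'clay' service that shapes rock data for one use case."""

    def __init__(self, backdoor):
        self._backdoor = backdoor

    def profile_card(self, group_id):
        group = self._backdoor.internal_get_group(group_id)
        if group is None:
            return None
        return f"{group['name']} ({group['members']} members)"


if __name__ == "__main__":
    service = GroupProfileService(Monolith())
    print(service.profile_card(1))
```

The point of the sketch is the direction of the dependency: the new clay service can be iterated on quickly, while the rock object stays where it is until there is a good reason to extract it.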

With an approach like this, one can extract the rocks over time, while still iterating on your product. It is very common that you never actually get rid of the monolith, but over time it becomes less and less part of the critical path, as the team either extracts objects or the business needs change and the new domain is implemented as microservices from the beginning.

Pattern: Edge-Internal-External Layering Model

One important perspective when visualizing distributed application architectures is to be able to place services based on where they live in the network.

Most architectures will have a variation of the model below:

In this model, a user interacting with an application through something like a web page, mobile app, or API generates inbound traffic to your services. Irrespective of how many services you might need to fulfill the task at hand, the user request usually hits a single service. This service is often called an API Gateway, and it is responsible for figuring out which of your many microservices to call in response to this specific request.

Something that might not be clear from the diagram above is that the API Gateway will often delegate some of its own responsibilities to other components. Unless you want to build a monolithic API, concerns like user authentication, geolocation, rate limiting, and A/B testing should be separate services.

These auxiliary services are different from your typical microservices in many interesting ways. Not only do they not implement application logic directly related to your core business, but they also often have stricter requirements related to availability and scalability. Another trait these components share is that they deal with data coming from the outside world and need to sanitize it before forwarding it to internal services. This means that they have to apply a fair amount of defensive programming.

Because these services are under somewhat strict requirements, changing them tends to require a more careful process. For example, you might want to run performance tests before deploying modifications to these critical-path systems, or a security audit before changes to authentication logic. You will probably need a more sophisticated approach to deploying changes at this layer, maybe with blue/green deployments, as any downtime here will take your whole product offline. All this makes the development cycle of components at this level slower than that of other services, as there is a higher risk of wide-reaching incidents.

Although the overhead is justifiable for these special-case components, we definitely do not want such a slow-moving pace for our regular microservices. One way to help an organization understand which components have the stricter requirements versus which ones are your usual fast-moving pieces is by applying a layering scheme that I like to call Edge-Internal-External:

In this model, we explicitly model the services described above as what I call the Edge layer. They are the entry point that receives requests from users and does everything required to translate them into requests within your architecture safely.

We then have services in the Internal layer. Those will be the vast majority of your microservices, and they can make a lot more assumptions about their clients and environments, including that those requests have been sanitized, have metadata for distributed tracing, etc.

There are also services in the External layer. These are services that our systems talk to but are not developed by us or deployed in a way that we can control, usually third-party services.
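
To make the contract between the Edge and Internal layers a bit more concrete, here is a minimal sketch in Python. Every name in it (edge_handle, InternalRequest, RejectedAtEdge, internal_profile_service) is hypothetical and exists only for illustration. It assumes the Edge validates outside input and attaches tracing metadata, so that an Internal service can skip those checks.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class InternalRequest:
    """What an Internal-layer service receives: already sanitized and traceable."""
    user_id: int
    path: str
    trace_id: str


class RejectedAtEdge(Exception):
    """Raised when the Edge layer refuses to forward a request."""


def edge_handle(raw: dict) -> InternalRequest:
    """Hypothetical Edge entry point: validate outside input, then translate it."""
    # Defensive programming: never trust types or values coming from outside.
    try:
        user_id = int(raw["user_id"])
    except (KeyError, TypeError, ValueError):
        raise RejectedAtEdge("missing or malformed user_id")
    path = str(raw.get("path", ""))
    if not path.startswith("/") or ".." in path:
        raise RejectedAtEdge("suspicious path")
    # Attach tracing metadata so Internal services do not have to create it.
    return InternalRequest(user_id=user_id, path=path, trace_id=uuid.uuid4().hex)


def internal_profile_service(request: InternalRequest) -> str:
    """Internal services rely on the Edge contract instead of re-validating."""
    return f"profile for user {request.user_id} (trace {request.trace_id})"
```

The point of the sketch is the asymmetry: all the suspicion lives in edge_handle, which is why changes to it deserve the slower, more careful process, while internal_profile_service stays small and cheap to change.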

The Edge Layer itself might be implemented in many different ways. Several vendors offer all-in-one options that allow people to outsource this completely, products like API Gateways or Service Meshes. Organizations working at a higher scale or with more complex requirements might want to build and own at least parts of their Edge architecture themselves. This is especially true if they are not happy with the monolithic nature of the products available in the market and want to apply a microservices architecture to this Layer.

Given the focus on availability and performance for components at this Layer, it is common that the team that owns it isn’t a Product Engineering team, but falls under platform or infrastructure.

When migrating from monolithic to microservices

Similarly to the Clay-to-Rocks Layering Model, the criteria used by the Edge-Internal-External model have some correlation with grouping components by how hard they are to change. Namely, the Edge layer often requires stricter change management, similar to the rocks in the other model. Nevertheless, these models aren’t the same, and there are some subtle yet fundamental differences between them.

One way in which such differences can arise is when migrating from monolithic to microservices architectures. The Clay-to-Rocks model suggests that you leave your rocks inside the monolith for as long as you have to. In an Edge-Internal-External layering scheme, though, we have the Edge as a high leverage point, meaning that a small effort applied here can cause a massive improvement throughout the whole system.

A very popular way of using the Edge to drive an organization away from monolithic systems is the Strangler Pattern, first cataloged by Martin Fowler and described in more detail in a seminal article (and diagram) by Paul Hammant:

The basic idea behind a strangler is that you put a piece of middleware between the user and the legacy system. At first, the middleware will redirect all requests it receives to the legacy system, and return its responses to the user. You can then incrementally write replacements for subsystems of the legacy and deploy them into production. The middleware is smart enough to redirect traffic destined to that subsystem to the new implementation (often by inspecting the URL requested) while still redirecting all other traffic to the legacy system. Eventually, more and more subsystems get written, and the new versions replace the whole legacy application.
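
The routing decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Strangler class, the handler functions, and the URL prefixes are all hypothetical, and plain function calls stand in for real HTTP forwarding.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def legacy_monolith(path: str) -> str:
    return f"legacy handled {path}"


def new_players_service(path: str) -> str:
    return f"players service handled {path}"


class Strangler:
    """Middleware that forwards to the legacy system unless a prefix was migrated."""

    def __init__(self, legacy: Handler) -> None:
        self._legacy = legacy
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, prefix: str, handler: Handler) -> None:
        # Called each time a replacement subsystem goes to production.
        self._migrated[prefix] = handler

    def handle(self, path: str) -> str:
        # Longest prefix wins, so a more specific route can be migrated on its own.
        # Plain startswith matching is a simplification of real URL routing.
        for prefix in sorted(self._migrated, key=len, reverse=True):
            if path.startswith(prefix):
                return self._migrated[prefix](path)
        return self._legacy(path)


edge = Strangler(legacy_monolith)
edge.handle("/players/7")                      # still served by the monolith
edge.migrate("/players", new_players_service)
edge.handle("/players/7")                      # now served by the new service
edge.handle("/teams/3")                        # untouched routes stay on the monolith
```

Clients keep calling the same address throughout; only the table inside the middleware changes as subsystems are replaced.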

In our Edge-Internal-External model, the Edge layer offers an intuitive place for such a strangling point. A widespread approach to microservices migration is to start by removing this layer from the monolith. From then on, you can slowly extract logic from the monolith into its own microservices without changing any of your client applications.

Another advantage of this strategy is that you can also make sure that any new features are already implemented as microservices and still have access to vital features such as authentication and caching. In my experience, the biggest challenge in a large refactoring effort such as microservices adoption is to make sure that while a team is extracting logic from the monolith you don’t have other teams adding to it. This pattern offers you a way to clean up your old systems without blocking people from working on new features.

The complexity of managing complexity

So much of software architecture is about keeping complexity under control. Layers can be a great way to contain entropy around your system, but sometimes it happens that teams fall in love with the pattern and start overdoing it. When using Layers, I recommend that you first start by applying a few simple models like the above. Any model with more than three or four layers is a bad smell to me—maybe you are trying to bundle together two different layering models?

Another aspect of using an architectural pattern is making sure that all engineers understand the why and how of it. It does not matter how many fancy diagrams you have buried in some Confluence page: if your team doesn’t appreciate your layers, they will either completely ignore them or spend a lot of time debating whether a given service should be on layer X or Y.

Like any other tool in enterprise architecture, layers are only useful when they are simple and widely understood.

Acknowledgments

Etel Sverdlov, Vitor Pellegrino, José Muanis, Thompson Marzagão, Brian Gruber, Danilo Sato, and Douglas Campos gave feedback on drafts of this article.

Revision History

09/24/2018 - First published

Buoyant

Wed, 09 Aug 2017 00:00:00 +0000

As I recently wrote about here, I have left DigitalOcean after almost two years building the Product Engineering organisation there. I didn’t have any immediate plans about what would come next, but luckily my roll-off period was calm enough, and in the past month I was able to spend a lot of time hanging out with my family, preparing for my first Brazilian Jiu-Jitsu competition, and exploring many possibilities of what to do next.

The break has also given me a lot of time for reading and writing around distributed application architectures (aka. microservices) and the many new pieces required to enable them. Earlier this year I started cataloguing patterns of microservices architecture. There are very similar initiatives and even books being written on this topic, but more than just creating a list of every pattern used across the industry, my intention with this series of articles is to document and make accessible the tools, processes, and techniques I have tried and seen myself.

The latest article from this series was on a bleeding-edge topic: Service Meshes. I was first introduced to the core ideas of what would be called Service Mesh in 2015 when I met William Morgan and Oliver Gould, the founders of Buoyant. We first met in person at FinagleCon 2015, where Oliver presented on Service Discovery.

At that time we at SoundCloud were going through some of the pain points Oliver mentioned. We were moving away from simple A Records to SRV records and facing the choice between building yet more code to deal with the joys of DNS on the JVM and biting the bullet and moving to Consul or Zookeeper. Buoyant was working on this problem, and I was excited to see what would come next.

Fast forward several years and Buoyant’s mission has crystallised as the service mesh, a network layer that can automatically add various resiliency and governance patterns to your services. They have a small team of engineers who have experienced the pain of distributed architectures first-hand and are working on the hard problem that is the gap between raw infrastructure and distributed applications.

This is an incredible opportunity to be part of the folks creating the next wave of infrastructure components, and I am happy to announce that I have joined the team, still based in New York City.

Pattern: Service Mesh

Thu, 03 Aug 2017 00:00:00 +0000

Since their first introduction many decades ago, we learnt that distributed systems enable use cases we couldn’t even think about before them, but they also introduce all sorts of new issues.

When these systems were rare and simple, engineers dealt with the added complexity by minimising the number of remote interactions. The safest way to handle distribution has been to avoid it as much as possible, even if that meant duplicated logic and data across various systems.

But our needs as an industry pushed us even further, from a few larger central computers to hundreds and thousands of small services. In this new world, we’ve had to start taking our head out of the sand and tackling the new challenges and open questions, first with ad-hoc solutions done in a case-by-case manner and subsequently with something more sophisticated. As we find out more about the problem domain and design better solutions, we start crystallising some of the most common needs into patterns, libraries, and eventually platforms.

What happened when we first started networking computers

Since people first thought about getting two or more computers to talk to each other, they envisioned something like this:

A service talks to another to accomplish some goal for an end-user. This is an obviously oversimplified view, as the many layers that translate between the bytes your code manipulates and the electric signals that are sent and received over a wire are missing. The abstraction is sufficient for our discussion, though. Let’s just add a bit more detail by showing the networking stack as a distinct component:

Variations of the model above have been in use since the 1950s. In the beginning, computers were rare and expensive, so each link between two nodes was carefully crafted and maintained. As computers became less expensive and more popular, the number of connections and the amount of data going through them increased drastically. With people relying more and more on networked systems, engineers needed to make sure that the software they built was up to the quality of service required by their users.

And there were many questions that needed to be answered to get to the desired quality levels. People needed to find ways for machines to find each other, to handle multiple simultaneous connections over the same wire, to allow machines to talk to each other when not connected directly, to route packets across networks, to encrypt traffic, etc.

Amongst those, there is something called flow control, which we will use as our example. Flow control is a mechanism that prevents one server from sending more packets than the downstream server can process. It is necessary because in a networked system you have at least two distinct, independent computers that don’t know much about each other. Computer A sends bytes at a given rate to Computer B, but there is no guarantee that B will process the received bytes at a consistent and fast-enough speed. For example, B might be busy running other tasks in parallel, or the packets may arrive out-of-order, and B is blocked waiting for packets that should have arrived first. This means that not only would A not get the performance it expects from B, but it could also be making things worse, as it might overload B, which now has to queue up all these incoming packets for processing.
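
As a rough illustration of the idea, the Python sketch below has Computer B advertise how much room it has left and Computer A send no more than that. It is loosely inspired by window-based flow control and is not how TCP is actually implemented; the Sender and Receiver classes and their methods are hypothetical names.

```python
from collections import deque


class Receiver:
    """Computer B: buffers packets and advertises how much room is left."""

    def __init__(self, buffer_size: int) -> None:
        self._buffer_size = buffer_size
        self._buffer = deque()

    def window(self) -> int:
        # Number of packets B can still accept without being overloaded.
        return self._buffer_size - len(self._buffer)

    def receive(self, packet: str) -> None:
        if self.window() == 0:
            raise OverflowError("receiver overloaded")
        self._buffer.append(packet)

    def process_one(self) -> str:
        # B drains its buffer at its own pace, which frees up window space.
        return self._buffer.popleft()


class Sender:
    """Computer A: only sends as much as the receiver's window allows."""

    def __init__(self, packets: list) -> None:
        self._pending = deque(packets)

    def send_to(self, receiver: Receiver) -> int:
        sent = 0
        # Stop as soon as B has no room, instead of flooding it.
        while self._pending and receiver.window() > 0:
            receiver.receive(self._pending.popleft())
            sent += 1
        return sent
```

With a Receiver whose buffer holds two packets and a Sender holding three, the first send_to call delivers two packets and the third one waits until B has processed something.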

For a while, it was expected that the people building networked services and applications would deal with the challenges presented above in the code they wrote. In our flow control example, it meant that the application itself had to contain logic to make sure we did not overload a service with packets. This networking-heavy logic sat side by side with your business logic. In our abstract diagram, it would be something like this:

Fortunately, technology quickly evolved and soon enough standards like TCP/IP incorporated solutions to flow control and many other problems into the network stack itself. This means that that piece of code still exists, but it has been extracted from your application to the underlying networking layer provided by your operating system:

This model has been wildly successful. There are very few organisations that can’t just use the TCP/IP stack that comes with a commodity operating system to drive their business, even when high-performance and reliability are required.

What happened when we first started with microservices

Over the years, computers became even cheaper and more omnipresent, and the networking stack described above has proven itself as the de-facto toolset to reliably connect systems. With more nodes and stable connections, the industry has played with various flavours of networked systems, from fine-grained distributed agents and objects to Service-Oriented Architectures composed of larger but still heavily distributed components.

This extreme distribution brought up a lot of interesting higher-level use cases and benefits, but it also surfaced several challenges. Some of these challenges are completely new, but others are just higher-level versions of the ones we discussed when talking about raw networks.

In the 90s, Peter Deutsch and his fellow engineers at Sun Microsystems compiled “The 8 Fallacies of Distributed Computing”, a list of assumptions people tend to make when working with distributed systems. Peter’s point is that these might have been true in more primitive networking architectures or in theoretical models, but they don’t hold true in the modern world:

The network is reliable
Latency is zero
Bandwidth is infinite
The network is secure
Topology doesn’t change
There is one administrator
Transport cost is zero
The network is homogeneous

Denouncing the list above as “fallacies” means that engineers cannot just ignore these issues; they have to explicitly deal with them.

To complicate matters further, moving to even more distributed systems—in what we often call a microservices architecture—has introduced new needs on the operability side. We discussed some of these in detail before, but here is a quick list of what one has to deal with:

Rapid provisioning of compute resources
Basic monitoring
Rapid deployment
Easy to provision storage
Easy access to the edge
Authentication/Authorisation
Standardised RPC

So while the TCP/IP stack and general networking model developed many decades ago is still a powerful tool in making computers talk to each other, the more sophisticated architectures introduced another layer of requirements that, once more, have to be fulfilled by engineers working in such architectures.

As an example, consider service discovery and circuit breakers, two techniques used to tackle several of the resiliency and distribution challenges listed above.

As history tends to repeat itself, the first organisations building systems based on microservices followed a strategy very similar to that of the first few generations of networked computers. This means that the responsibility of dealing with the requirements listed above was left to the engineer writing the services.

Service discovery is the process of automatically finding which instances of a service fulfil a given query, e.g. a service called Teams needs to find instances of a service called Players with the attribute environment set to production. You invoke some service discovery process, which returns a list of suitable servers. For more monolithic architectures, this is a simple task usually implemented using DNS, load balancers, and some convention over port numbers (e.g. all services bind their HTTP servers to port 8080). In more distributed environments, the task starts to get more complex, and services that previously could blindly trust their DNS lookups to find dependencies now have to deal with things like client-side load-balancing, multiple different environments (e.g. staging vs. production), geographically distributed servers, etc. If before all you needed was a single line of code to resolve hostnames, now your services need many lines of boilerplate to deal with various corner cases introduced by higher distribution.
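
The Teams and Players query above could look roughly like the following Python sketch. The Registry and Instance classes are hypothetical and purely in-memory; a real setup would sit on top of something like DNS, Consul, or Zookeeper, and the addresses are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Instance:
    service: str
    address: str
    attributes: Dict[str, str] = field(default_factory=dict)


class Registry:
    """Hypothetical in-memory registry standing in for a real discovery backend."""

    def __init__(self) -> None:
        self._instances: List[Instance] = []

    def register(self, instance: Instance) -> None:
        self._instances.append(instance)

    def lookup(self, service: str, **attributes: str) -> List[Instance]:
        # Return every instance of the service whose attributes match the query.
        return [
            i for i in self._instances
            if i.service == service
            and all(i.attributes.get(k) == v for k, v in attributes.items())
        ]


registry = Registry()
registry.register(Instance("players", "10.0.0.1:8080", {"environment": "production"}))
registry.register(Instance("players", "10.0.0.2:8080", {"environment": "staging"}))

# What the Teams service would ask for before calling Players.
candidates = registry.lookup("players", environment="production")
```

Even this toy version hints at the boilerplate to come: the caller still has to pick one of the candidates, cope with an empty list, and refresh the result as instances come and go.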

Circuit breakers are a pattern catalogued by Michael Nygard in his book Release It. I like Martin Fowler’s summary for the pattern:

The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all. Usually you’ll also want some kind of monitor alert if the circuit breaker trips.

These are great simple devices to add more reliability to interactions between your services. Nevertheless, just like everything else, they tend to get much more complicated as the level of distribution increases. The likelihood of something going wrong in a system rises exponentially with distribution, so even simple things like “some kind of monitor alert if the circuit breaker trips” aren’t necessarily straightforward anymore. One failure in one component can create a cascade of effects across many clients, and clients of clients, triggering thousands of circuits to trip at the same time. Once more, what used to be just a few lines of code now requires loads of boilerplate to handle situations that only exist in this new world.
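
For reference, the “few lines of code” version of a circuit breaker might look like the Python sketch below. It follows the summary quoted above, with one addition that is my assumption and not part of the quote: after a reset timeout the breaker lets a trial call through. The class and parameter names are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling the protected function while the circuit is open."""


class CircuitBreaker:
    def __init__(self, call, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self._call = call
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock  # injectable so the behaviour can be tested
        self._failures = 0
        self._opened_at = None

    def __call__(self, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                # Tripped: fail fast without touching the protected call.
                raise CircuitOpenError("circuit is open; call skipped")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = self._call(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # trip, or re-trip after a failed trial
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Nothing here deals with alerting, shared state across instances, or thousands of breakers tripping at once, which is exactly the part that grows into boilerplate.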

In fact, the two examples listed above can be so hard to implement correctly that large, sophisticated libraries like Twitter’s Finagle and Facebook’s Proxygen became very popular as means to avoid rewriting the same logic in every service.

The model depicted above was followed by the majority of the organisations that pioneered the microservices architecture, like Netflix, Twitter, and SoundCloud. As the number of services in their systems grew, they also stumbled upon various drawbacks of this approach.

Probably the most expensive challenge, even when using a library like Finagle, is that an organisation will still need to invest time from its engineering team in building the glue that links the libraries with the rest of their ecosystem. Based on my experiences at SoundCloud and DigitalOcean, I would estimate that an organisation of 100-250 engineers following this strategy would need to dedicate 1/10 of its staff to building tooling. Sometimes this cost is explicit as engineers are assigned to teams dedicated to building tooling, but more often the price tag is invisible as it manifests itself as time taken away from working on your products.

A second issue is that the setup above limits the tools, runtimes, and languages you can use for your microservices. Libraries for microservices are often written for a specific platform, be it a programming language or a runtime like the JVM. If an organisation uses platforms other than the one supported by the library, it often needs to port the code to the new platform itself. This steals scarce engineering time. Instead of working on their core business and products, engineers have to, once again, build tools and infrastructure. That is why some medium-sized organisations like SoundCloud and DigitalOcean decided to support only one platform for their internal services—Scala and Go respectively.

One last problem with this model worth discussing is governance. The library model might abstract the implementation of the features required to tackle the needs of the microservices architecture, but it is still in itself a component that needs to be maintained. Making sure that thousands of instances of services are using the same or at least compatible versions of your library isn’t trivial, and every update means integrating, testing, and re-deploying all services—even if the service itself didn’t suffer any change.

The next logical step

Similarly to what we saw in the networking stack, it would be highly desirable to extract the features required by massively distributed services into an underlying platform.

People write very sophisticated applications and services using higher level protocols like HTTP without even thinking about how TCP controls the packets on their network. This situation is what we need for microservices, where engineers working on services can focus on their business logic and avoid wasting time in writing their own services infrastructure code or managing libraries and frameworks across the whole fleet.

Incorporating this idea to our diagram, we could end up with something like the following:

Unfortunately, changing the networking stack to add this layer isn’t a feasible task. The solution found by many practitioners was to implement it as a set of proxies. The idea here is that a service won’t connect directly to its downstream dependencies, but instead all of the traffic will go through a small piece of software that transparently adds the desired features.
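
A minimal sketch of that idea in Python, assuming a hypothetical SidecarProxy that the service talks to instead of its dependencies. The transport function stands in for a real network call, and the retry and request-counting behaviour are just examples of features a proxy could add without the service knowing.

```python
from typing import Callable, Dict, List

Transport = Callable[[str, str], str]  # (address, request) -> response


class SidecarProxy:
    """Hypothetical local proxy: the service names a destination, the proxy does the rest."""

    def __init__(self, routes: Dict[str, List[str]], transport: Transport,
                 retries: int = 2) -> None:
        self._routes = routes
        self._transport = transport
        self._retries = retries
        self.request_count = 0  # a stand-in for metrics collection

    def send(self, service: str, request: str) -> str:
        addresses = self._routes[service]
        last_error = None
        for attempt in range(self._retries + 1):
            # Rotate through known instances so a retry lands on a different one.
            address = addresses[attempt % len(addresses)]
            self.request_count += 1
            try:
                return self._transport(address, request)
            except ConnectionError as error:
                last_error = error
        raise last_error
```

From the service's point of view there is a single call, proxy.send("players", ...); instance selection, retries, and counting all happen outside its code and can be changed without redeploying it.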

The first documented developments in this space used the concept of sidecars. A sidecar is an auxiliary process that runs alongside your application and provides it with extra features. In 2013, Airbnb wrote about Synapse and Nerve, their open-source implementation of a sidecar. One year later, Netflix introduced Prana, a sidecar dedicated to allowing non-JVM applications to benefit from their NetflixOSS ecosystem. At SoundCloud, we built sidecars that enabled our Ruby legacy to use the infrastructure we had built for JVM microservices.

While there are several of these open-source proxy implementations, they tend to be designed to work with specific infrastructure components. As an example, when it comes to service discovery Airbnb’s Nerve & Synapse assume that services are registered in Zookeeper, while for Prana one should use Netflix’s own Eureka service registry for that.

With the increasing popularity of microservices architecture, we have recently seen a new wave of proxies that are flexible enough to adapt to different infrastructure components and preferences. The first widely known system in this space was Linkerd, created by Buoyant based on their engineers’ prior work on Twitter’s microservices platform. Soon enough, the engineering team at Lyft announced Envoy, which follows a similar principle.

The Service Mesh

In such a model, each of your services will have a companion proxy sidecar. Given that services communicate with each other only through the sidecar proxy, we end up with a deployment similar to the diagram below:

Buoyant’s CEO William Morgan made the observation that the interconnection between proxies forms a mesh network. In early 2017, William wrote a definition for this platform, and called it a Service Mesh:

A service mesh is a dedicated infrastructure layer for handling service-to-service communication. It’s responsible for the reliable delivery of requests through the complex topology of services that comprise a modern, cloud native application. In practice, the service mesh is typically implemented as an array of lightweight network proxies that are deployed alongside application code, without the application needing to be aware.

Probably the most powerful aspect of his definition is that it moves away from thinking of proxies as isolated components and acknowledges the network they form as something valuable in itself.

As organisations move their microservices deployments to more sophisticated runtimes like Kubernetes and Mesos, people and organisations have started using the tools made available by those platforms to implement this idea of a mesh network properly. They are moving away from a set of independent proxies working in isolation to a proper, somewhat centralised, control plane.

Looking at our bird’s eye view diagram, we see that the actual service traffic still flows from proxy to proxy directly, but the control plane knows about each proxy instance. The control plane enables the proxies to implement things like access control and metrics collection, which require cooperation:

The recently announced Istio project is the most prominent example of such a system.
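
The split between traffic and control can be sketched as follows. This Python example is a toy under stated assumptions and does not reflect the API of Istio or any other product: ControlPlane and Proxy are hypothetical classes, and method calls stand in for network hops.

```python
from typing import Dict, Set, Tuple


class ControlPlane:
    """Knows every proxy and holds shared policy; it never carries service traffic."""

    def __init__(self) -> None:
        self.proxies: Dict[str, "Proxy"] = {}
        self._allowed: Set[Tuple[str, str]] = set()
        self.metrics: Dict[str, int] = {}

    def register(self, proxy: "Proxy") -> None:
        self.proxies[proxy.service] = proxy

    def allow(self, caller: str, callee: str) -> None:
        self._allowed.add((caller, callee))

    def is_allowed(self, caller: str, callee: str) -> bool:
        return (caller, callee) in self._allowed

    def report(self, callee: str) -> None:
        self.metrics[callee] = self.metrics.get(callee, 0) + 1


class Proxy:
    def __init__(self, service: str, control_plane: ControlPlane) -> None:
        self.service = service
        self._control_plane = control_plane
        control_plane.register(self)

    def call(self, target: "Proxy", request: str) -> str:
        # Traffic goes proxy to proxy; the control plane is not in the request path.
        return target.accept(self.service, request)

    def accept(self, caller: str, request: str) -> str:
        # Access control and metrics need the shared view only the control plane has.
        if not self._control_plane.is_allowed(caller, self.service):
            raise PermissionError(f"{caller} may not call {self.service}")
        self._control_plane.report(self.service)
        return f"{self.service} handled {request}"
```

The design choice worth noticing is that the request itself never passes through ControlPlane; the proxies only consult it for policy and report back to it.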

It is still too early to fully understand the impacts of a Service Mesh in larger scale systems. Two benefits of this approach are already evident to me. First, not having to write custom software to deal with what is ultimately commodity code for microservices architecture will allow many smaller organisations to enjoy features previously only available to large enterprises, creating all sorts of interesting use cases. The second one is that this architecture might allow us to finally realise the dream of using the best tool/language for the job without worrying about the availability of libraries and patterns for every single platform.

Acknowledgements

Monica Farrell, Rodrigo Kumpera, Etel Sverdlov, Dave Worth, Mauricio Linhares, Daniel Bryant, Fabio Kung, and Carlos Villela gave feedback on drafts of this article.

Revision History

03/08/2017 - First published
05/08/2017 - Incorporated feedback

Authors:	Phil Calçado
To be reviewed by:	10/5/2018
Revisit Date:	04/17/2019
State:	Feedback Requested