S.T.A.R. and Issues

Nov 11, 2017 10:24 · 1997 words · 10 minutes read meta

STAR is a technique used to interview potential candidates. It stands for Statement, Task, Action, Resolution. Interviewers using this method ask a candidate to describe a significant experience and to organize their response into that format. The interviewee gives a statement describing the overall situation or challenge, then walks through how it was resolved: what the task or target was, what specific actions or steps were taken to achieve that target, and whether or not the target was achieved.

Whether or not this method accurately identifies the right candidates to hire is debatable, but aspects of it can give us insight into how we can improve the reports we write for bugfixes and post mortems.

S.T.A.R. Applied

When resolving a bug, we’re essentially doing the same thing that’s asked of candidates answering questions in the STAR format. Some situation occurred that was of some significance (a bug in our logic was found, incorrect output was generated, there was a service disruption, etc). The task is to resolve the issue; we take a set of actions to do so, and then report on the result. This sounds like common sense, but following it instantly rules out giving reports like:

I applied a patch and the issue is resolved now, marking this ticket as closed

Comments like this on a ticket serve no purpose. The Statement and Task parts of the issue are already clear. What needs to be clarified are the Action and Resolution sections. A response like the one above has no list of actions. It’s impossible to verify that the issue was resolved in the right way. Additionally, if this issue causes a regression later, the person assigned to fix the regression won’t know what was done last time. This is reminiscent of digging through forum posts trying to find an answer to a problem only to find that somebody reports “the problem fixed itself” or a similarly vague answer.

With the goal in mind of answering questions related to the Action and Resolution sections, let’s look at an example of a bugfix and a response that a programmer might give:

Example

The bug report:

Twice now in the past couple of weeks we have confirmed that application clients are not updating the database records correctly. The problem seems to resolve itself after a while, but it’s taking much longer than expected. We have attempted to debug X and Y and can confirm that functionality Z is working but the database records are still not being updated.

Example instances from logs:

Instance #1 on April 3rd 8 AM

Instance #2 on April 3rd 10 AM

And the fix:

The root cause is apparent after looking at the update statements generated from before and after the 0.10.1 update.

Version 0.10.0 and before:

SAMPLE QUERY REDACTED

Version 0.10.1:

SAMPLE QUERY REDACTED

These are sample queries. In the unit tests we use epoch time (Jan 1st 1970) for these statements, which is why we have time_field <= from_unixtime(0) and statements like that. Those sections can be ignored. The difference is in the WHERE clause:

" AND (time_field <= from_unixtime('0') OR time_field IS NULL)" \

Version 0.10.1 did not correctly append time_field IS NULL in an OR clause and would only update rows where time_field was already set. Looking back, it appears the logic was correct when LINKED-JIRA-ISSUE-CAUSING-BUG-999 was completed, but it’s hard to trace exactly when it stopped working. Furthermore, this bug was definitely not intermittent as I originally thought. It prevented all functionality related to feature; however, secondary-product was compensating, which made the issue seem intermittent. This has been fixed by reverting the query to the version used in 0.10.0, which includes the OR clause.
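
To make the difference concrete, here is a minimal sketch of the kind of statement builder involved. The build_update_statement helper, the table name, and the SET/status columns are all hypothetical placeholders; only the OR time_field IS NULL fragment comes from the report above.

def build_update_statement(table: str, cutoff_epoch: int) -> str:
    """Builds an UPDATE that also matches rows where time_field was never set."""
    return (
        f"UPDATE {table} SET processed = 1"
        " WHERE status = 'pending'"
        # The regression dropped this OR clause, so rows with a NULL time_field
        # were silently skipped instead of being updated.
        f" AND (time_field <= from_unixtime({cutoff_epoch}) OR time_field IS NULL)"
    )

if __name__ == "__main__":
    # The unit tests use epoch time (a cutoff of 0), matching the redacted samples above.
    print(build_update_statement("records", 0))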

Although I can’t say exactly when this stopped working, the ultimate cause of the logic breaking was a lack of test cases around these statements, which was addressed as of LINKED-JIRA-ISSUE-FIXING-BUG-1000. Without test cases covering all statements, the logic was very brittle and every change was risky. We now verify that the correct update statements are being generated for each case (5 cases total). This will prevent improper updates to this logic in the future.
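
As a hedged sketch of what one of those test cases might look like, here is a small unittest suite. The build_update_statement helper is the same hypothetical one from the sketch above, repeated so this example runs on its own; the real project’s five cases would exercise its actual statement builder.

import unittest

def build_update_statement(table: str, cutoff_epoch: int) -> str:
    # Hypothetical helper repeated from the sketch above so this file runs standalone.
    return (
        f"UPDATE {table} SET processed = 1"
        " WHERE status = 'pending'"
        f" AND (time_field <= from_unixtime({cutoff_epoch}) OR time_field IS NULL)"
    )

class UpdateStatementTests(unittest.TestCase):
    # Guards the generated SQL so the OR clause cannot be dropped again unnoticed.

    def test_update_matches_rows_with_null_time_field(self):
        statement = build_update_statement("records", 0)
        # The regression was exactly this fragment going missing, so assert on it directly.
        self.assertIn("OR time_field IS NULL", statement)

    def test_update_uses_the_expected_time_cutoff(self):
        statement = build_update_statement("records", 0)
        self.assertIn("time_field <= from_unixtime(0)", statement)

if __name__ == "__main__":
    unittest.main()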

Actions and Resolutions

In the example above, the actions taken were:

  1. Roll back the query logic to a previously known good version
  2. Verify the fix by adding a suite of unit tests to cover all cases
  3. Deploy the patch to production

The final resolution was to roll back to a previously known good state instead of deploying a patch that would change the logic further. Future regressions are guarded against by the additional test cases that did not exist before (and had they existed, this bug would not have manifested).

This takes care of the Action and Resolution sections in a very generic way, but I want to talk briefly about writing for your audience and how that might affect what you report and at what level of detail.

Audience

When documenting bugfixes or doing post mortems we’re usually talking about a few different interested parties:

  • Programmers
  • Managers
  • Customers

Each one of these parties is going to want different things to be relayed in the communication:

Programmers are the easiest audience to write for. If you are a programmer, just think about what you would want to know about this issue if you had to solve it for the first time. What questions would you want to ask about the actions taken? Surely you would want a diff of the source code from before and after the issue was fixed. That gets you the “how” part of the question and also helps you see why there was an issue in the first place. If somebody used a poorly performing algorithm or messed up some logic, you can see it in the diff. The other thing you want to know is “why”. Why was the issue resolved in this specific way? Why not use some other solution?

First, we should always link the issue to version control so the diff of the changes can be pulled up. Some issue-tracking software will do this automatically, but if yours doesn’t, be sure to put at least the SHA-1 in a comment somewhere so someone can look up the changes. We should also provide a high-level overview of the fix to explain what factors influenced our decision at the time we fixed the bug. This shouldn’t be a line-by-line tracing of the source code or anything similar. It should read the way you would explain the issue and your solution to another engineer, or the way you would explain it to yourself in your head. It may even be helpful to grab one of your co-workers for 5 minutes, explain it to them, and then write that explanation in the report.

Managers are a little more difficult to write for, but there is a set of common things that almost every manager wants to know about an issue:

  • How severe is/was it (complete outage? data loss? slowdown for some customers?)
  • What was done to mitigate it (high level)
  • Did we fix the issue
  • What are the next steps
  • What was done to keep the issue from regressing

For the most part we answer the first question while the issue is still in progress. If the severity hasn’t already been documented in the bug report, add it there to give additional context to the rest of the report. The mitigation strategy should stay fairly high level; keep it short, something like “We deployed a new version of the code” (as in the example above) or “We rolled back to version 0.10.0.” The mitigation strategy is not necessarily a fix for the actual bug; it’s about getting the functionality of the application running again as quickly as possible.

In the example we didn’t know when the logic started working incorrectly (which is a problem in and of itself), so the only choice was to identify the incorrect queries and roll them back to a previous implementation.

The last two questions are arguably the most important and the ones I most often see missed. If there are any next steps for resolving the issue (such as recovering lost data), then they MUST be documented in the bugfix or post mortem. It’s extremely easy for someone to just write that something was fixed and close the ticket out then and there. Weeks later, when a customer comes asking about data that’s missing or corrupted because of a closed issue, nobody is going to be happy. If recovery operations or next steps will take a long time, or are blocked by something (maybe purchasing or approvals), then it’s OK to split these out into additional tickets or requests to the appropriate parties. Just make sure to link them back to the original issue report to provide more context.

Finally, the most frequently asked question is “how do we keep this from happening again?” This ties into how the issue occurred in the first place. If we didn’t have enough capacity to provide the expected throughput for queries, did we provision more capacity? If a bug surfaced in production, why was it not caught by our unit tests, integration tests, or exploratory tests? If the design of some subsystem was incorrect for a certain use case, how did we patch it? Can we redesign that subsystem to be better?

Managers will likely have to give reports up the chain depending on the severity of an issue, so write with that in mind. Your words, or the context you give about an issue, could be read or discussed by people much higher up than you or your team, so write as clearly and accurately as possible for this audience.

Depending on how big your company is, you may not have to write to customers at all. Large companies put out public statements about incidents that are written and reviewed by professionals, so you might not have to worry about it. If you do have to provide an incident report to a customer, I believe it’s helpful to answer the same types of questions that managers would want answered. Provide enough context that the customer can get a basic understanding of what the issue was and what deficiency in the system caused it, without drowning them in technical detail. Then state what actions are being taken to prevent the issue from happening again.

Another example (from Amazon)

The issue summary report from Amazon about the US-EAST-1 disruption is a good example of the way you might report an outage. Since it’s a public release, it’s written for a customer audience. What this report does really well:

  • Establishes a clear timeline, showing that the engineers understood the series of events leading up to the outage
  • Explains the actions taken to resolve the issue
  • Doesn’t dig into really technical details of what operational tools were used and how they work
  • Explains how Amazon will prevent this issue from reoccurring in the future (again without being too technical)

And it’s likely that any internal documents produced as a result of the postmortem for this outage have the same level of clarity while identifying several more technical recommendations.

Clarify Your Actions and Resolutions

When you resolve something, you shouldn’t write vaguely about what you did or simply mark a ticket as closed because the issue is resolved. Someone will need the narrative and context of these situations, whether that’s the original reporter, a manager, or a future programmer trying to figure out how something was resolved. If you close a ticket, take a few more minutes to write up, at a minimum:

  • A clear list of actions that you took
  • The resolution
  • What you are doing to prevent it in the future, or any next steps

And help everyone going forward (maybe even your future self).