Honestly, for code that controls millions of dollars worth of business and could...

lmkg · on May 17, 2012

"Changing one line of code took 6 days!" is a pretty common WTF story. It comes up every so often, and we all get a chuckle at the ponderous rate of change in large organizations.

However, "Changing one line of code broke everything for 6 days!" is also a common WTF story. As a rule of thumb, you can't avoid both of them, because avoiding one entails the other. I wouldn't fault a company that chooses the first story over the second.

The problem is that things that scale up don't always scale down. The processes implied aren't that bad, for handling large commits to a large process. They're obviously overkill for small changes. But just because the processes have a negative effect on this change in particular does not mean that the processes are a net negative overall. More flexible processes could also end up being more trouble than they're worth, due to allowing (encouraging) sloppy work.

_3u10 · on May 17, 2012

Or you could just hire responsible people and ask them to use their judgement. You don't have to choose one or the other; you can instead apply judgement as to whether this is going to screw shit up or not.

Honestly, though, the best way to handle this situation is with a phone call to the CEO/IT Director, getting him to write you an email allowing you to override all the checks to get this in ASAP.

Once you're known for clearing things with higher-ups, QA will start being reasonable. You have to keep in mind that most people in QA are type C personalities and will be sticklers for inane rules.

http://www.buzzle.com/articles/type-c-personality.html

phillmv · on May 17, 2012

Feh.

For starters, "just hire responsible people" can't be applied retroactively. Not to mention that… most people are responsible, or rather, the road to hell is paved with good intentions.

Secondly, it's an incredibly difficult thing to scale. That works great in startups, but by the time you have things like "HR Directors" you're no longer capable of having insight into the rest of the organization.

There is a lot to be said about how bureaucracies are the formalization of common sense, and a lot more to be said for being able to scale processes out such that you don't end up in situations where stakeholders have no visibility into the work performed.

I just feel that the answer is more complicated than "hire A players".

crazygringo · on May 17, 2012

> Or you could just hire responsible people and ask them to use their judgement

If you know a fail-safe way of only hiring responsible people who always have good judgment... then I hope you're being paid millions of dollars in consulting fees, because nobody has really figured that one out in a general way yet!! :)

Also, those people tend to be expensive... sometimes it's far more cost-efficient to hire less-perfect people, but put rules in place to minimize any catastrophic damage they can create, at the expense of efficiency.

_3u10 · on May 17, 2012

Following the rules to the letter also produces bad outcomes.

The issue is not that using judgement is perfect; it's that using judgement usually produces a better outcome than blindly following a rule.

Yes, rules should be put in place for the most extraordinary circumstances or places that good judgement is known to fail. The rule has to prevent the bad thing to be effective while also not creating something worse.

nknight · on May 18, 2012

In this case, there was someone with basically good judgement, but they were operating from high within their hierarchical framework, which inherently induces delays. In retrospect, Philip, David, or both, should have been copied on the ticket and directly overseen the policy violation that they deliberately set in motion.

Simply ordering policy violations from the throne doesn't work in practice. Either the person who made the decision or a trusted lieutenant who understands what's going on has to see it through to the end, otherwise the bureaucracy in the middle will simply continue on its preprogrammed course.

aangjie · on May 18, 2012

I think the reason this fails to happen (Philip/David overseeing the ticket till production) is this "Heads I win, tails you lose" principle: http://www.ribbonfarm.com/2011/10/14/the-gervais-principle-v...

njharman · on May 18, 2012

> Or you could just hire responsible people and ask them to use their judgement,

Or, just as easily, you could sprinkle magic fairy dust over your servers to solve problems.

Locke1689 · on May 18, 2012

Just because they're smart doesn't mean they don't make mistakes. If you allow one line changes to pass without review, eventually someone's going to make a breaking one-line change.

VikingCoder · on May 17, 2012

Did you miss this part:

> It [sic] we don't do this right away, we'll have to have a layoff.

As a company, it's great to have a process for incremental changes, but you also need to have a process for critical hot-fixes.

You're a phone company, and adding a new feature should take months. I get that.

You're a phone company, and because of an unforeseen Cinderella problem, 10% of your customers can't make phone calls. At all.

It is unacceptable to have a one-size-fits-all process, no matter what your business is.

saraid216 · on May 17, 2012

> As a company, it's great to have a process for incremental changes, but you also need to have a process for critical hot-fixes.

And as soon as you do, everyone thinks that their problem is a good candidate for the critical hot-fix path. At the very least, you now have to sink minutes per day into arguing them down from using it. Worth it? Possibly. But not always.

konstruktor · on May 18, 2012

If you have a single (actually) business critical incident that cannot be resolved, the company will fail. It's called risk management. You spend minutes per day arguing with idiots to not have the company fail in an emergency.

nknight · on May 18, 2012

You need a good operations team with the authority and willingness to say "no" with overrides coming only from senior management, the authority to say "yes" for the obvious stuff, and a direct line to senior management for grey areas they're not comfortable making a call on by themselves.

The operations team will spend a few months saying "no" a lot and justifying their decisions to management. Eventually it will slow to a trickle except for a few really stupid people who lack reading comprehension and any sense of pattern detection. Management will eventually get tired of it and forbid the stupid ones from making hot-fix requests.

VikingCoder · on May 18, 2012

Agree completely.

Step One is "Act Intelligently."

No process is going to help you skip Step One.

lilyball · on May 17, 2012

What's a Cinderella problem? I'm assuming it's a reference to the midnight deadline.

VikingCoder · on May 17, 2012

The code behaves differently after some event happens.

Midnight. Y2K. Daylight Savings Time change. Leap Day. Congress changes when Daylight Savings Time happens and one component in the system didn't know about it. Year 2038 problem.
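To make the last one concrete, here is a minimal sketch of the Year 2038 case, assuming a system that stores time as a signed 32-bit count of seconds since 1970-01-01 UTC. The helper name `to_int32` is made up for illustration, and the wraparound is simulated in Python rather than taken from any real system's code.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit counter can hold


def to_int32(n):
    """Wrap an integer the way a signed 32-bit counter would (hypothetical helper)."""
    return struct.unpack("<i", struct.pack("<I", n & 0xFFFFFFFF))[0]


# Last second that still fits in the counter.
last_good = EPOCH + timedelta(seconds=INT32_MAX)

# One second later the counter wraps to a large negative number.
wrapped = EPOCH + timedelta(seconds=to_int32(INT32_MAX + 1))

print("last representable second:", last_good.isoformat())
print("one second later reads as:", wrapped.isoformat())
```

Nothing in the code changes between the two lines; only the clock does, which is what makes this class of bug so hard to catch in normal testing.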

tripzilch · on May 17, 2012

2038 is going to be a fun one, isn't it? :)

VikingCoder · on May 17, 2012

Dr. Peter Venkman: This city is headed for a disaster of biblical proportions.

Mayor: What do you mean, "biblical"?

Dr Ray Stantz: What he means is Old Testament, Mr. Mayor, real wrath of God type stuff.

Dr. Peter Venkman: Exactly.

Dr Ray Stantz: Fire and brimstone coming down from the skies! Rivers and seas boiling!

Dr. Egon Spengler: Forty years of darkness! Earthquakes, volcanoes...

Winston Zeddemore: The dead rising from the grave!

Dr. Peter Venkman: Human sacrifice, dogs and cats living together... mass hysteria!

Mayor: All right, all right! I get the point!

pestaa · on May 18, 2012

Funny quote. I love that dogs and cats living together came after the dead rising from the grave. Not sure I'd put them in this order. :)

vorg · on May 17, 2012

ICL George let operators flick a switch to make the current day have 60 hours until the switch was flicked back.

benjaminwootton · on May 17, 2012

Reliability and quality are obviously important, but you have to question whether all these steps are actually helping you to achieve that.

All of these items in the story add overhead and reduce agility, but potentially do very little to improve quality:

  - Mandatory paper trail via fields on the issue tracker
  - The need for documented internal sign offs 
  - Standards and policies with no clear reason for their existence
  - 'Micro' Code reviews
  - Excessive permissions and security processes internally
  - No pragmatism with regards to testing the change

This kind of stuff is a constant and substantial overhead on getting anything done. You need to look at these processes very critically and be very sure of the benefits in terms of quality before introducing it.

HeyLaughingBoy · on May 17, 2012

> - 'Micro' Code reviews

I have to take issue with this one. Code (and Requirements, and Design) reviews, when done properly, probably have the greatest impact in improving product quality. No, I don't have time to look up the SEI research on the subject, but this has been consistently shown to be true.

Part of the problem I had with the article was that Shirley was rejecting code for non-specific reasons. You can't say "it's not written down anywhere." If it's important enough to reject the code, it's important enough to write down.

benjaminwootton · on May 17, 2012

I like code reviews for higher level concerns - new class hierarchies, big pieces of refactoring, new areas of functionality etc.

However, I don't see as much value in reviewing low level changes such as one liners or odd bug fixes.

Making code reviews mandatory is too much overhead and too inflexible for me.

kisielk · on May 18, 2012

We have mandatory code reviews for all changes and I think it works quite well. A person making a one or two line bugfix may think it's quite innocuous but it may have a bigger impact on other parts of the system than they realize. Many eyes help reduce the likelihood of that.

cpeterso · on May 17, 2012

Mozilla and Google require code reviews of all code changes.

Here are some notes about Firefox's code review process:

https://wiki.mozilla.org/Firefox/Code_Review

https://wiki.mozilla.org/User:GavinSharp/Code_Review

crazygringo · on May 17, 2012

I dunno... the odd bug fix has the potential to really screw things up, when the person fixing it isn't the person who wrote it originally, or it's been months since then.

bobbles · on May 17, 2012

A few notes on those points:

1) The fields are probably mandatory because so many times routine entries were not filled correctly, and even MORE overhead was required going back and forth between teams to find out all of the details. If there is a way to avoid this overhead, again, how do you stop people abusing it?

2 & 3) Yeah.. bad overhead.

4) A sanity check is ALWAYS a good idea, whether it's a fresh intern or the most seasoned programmer in the company making a change.

5) The security processes might seem excessive but once again are probably in place due to issues in the past that have required the review/signoff be there to stop major disruptions getting through.

6) As someone who worked in QA for a while, there is no way in hell someone can say "push this change through" without a strong questioning of why. QA is there (in most cases) to be that last line of defence, often letting people know just how many people are really going to be affected by this change.

Being on a testing team often gives you a broader oversight of how all of the components of the code work together, while people working on one project can sometimes get tunnel vision on getting their thing out the door.

Edit: Just like to point out, I'm agreeing with you

benjaminwootton · on May 17, 2012

I really like to see a pragmatic relationship between Dev and QA where you reach agreement on the best way to build confidence in a change.

QA should absolutely be able to push back on Dev with regards to quality and relevant test cases, but likewise Dev often need to be able to direct the testing and mutually agree on the scope that will get the change signed off adequately.

ajross · on May 17, 2012

Exactly. The point bengl3rt was making, I assume, is that cavalier avoidance of process is a bad thing because it allows mistakes to happen that would otherwise be caught. That much is true.

But the assumption that's wrong is that all processes avoid mistakes. Clearly they don't.

And some of the examples here are just plain cargo cult misapplications of good ideas. The point behind code review is to catch design flaws in new code. You never want to demand refactoring of existing code in order to patch features; that hurts, it doesn't help.

theoj · on May 17, 2012

This quote is appropriate here:

Skilled people without a process will always find a way to get things done. Skill begets process. But process doesn’t beget skill. Following a recipe won’t make you a great chef – it just means you can make a competent bolognese. Great chefs don’t need cookery books. They know their medium and their ingredients so well that they can find excellent combinations as they go. The recipe becomes a natural by-product of their work.

From http://the-pastry-box-project.net/cennydd-bowles/2012-march-...

vacri · on May 17, 2012

Er... great chefs do need cookery books. They may not refer to them as often, but you won't find many chefs out there without a collection of cookery books. They still use recipes for things they're not familiar with.

Besides, skill may beget product, but it doesn't necessarily beget process.

theoj · on May 18, 2012

>> Besides, skill may beget product, but it doesn't necessarily beget process.

Oh boy, talk about not seeing the forest for the trees. For skill to beget a successful product, you must follow a process to get from nothing to product. So in the course of creating a successful product you have automatically created a process -- the process to build the product. That process may be used one time, or multiple times -- but it's still a process.

vacri · on May 18, 2012

Not seeing the forest for the trees? Hell, if we're going to use that loose a definition, everything is a process, regardless of skill level. Even if you end up without a product, to get the steaming pile of crap you abandoned, you went through a process.

HeyLaughingBoy · on May 17, 2012

Demand, no, but often it is the best idea.

Having just (as in hours ago) been through the wringer of adding new features to a mass of spaghetti that had not been touched in 10 years, the only way I could maintain my sanity (and have some assurance, beyond regression testing, that I wasn't breaking something) was to refactor the existing code and then insert my changes.

ajross · on May 17, 2012

Sure. That's the point behind refactoring, it makes changes to existing code flow more smoothly. But the case here doesn't fit that at all: the change as described was a change in configuration (that just happened to be stored in a code variable), yet it was being reviewed as if it were a new feature being added through development. That's the "cargo cult" part -- refactoring in the course of development is good. Rules demanding refactoring are bad, because they hit false positives (in this case, a high priority configuration change).

aslakhellesoy · on May 17, 2012

And more importantly: a complete lack of trust within the organisation.

benjaminwootton · on May 17, 2012

Agreed. 19 times out of 20 people will do the right thing.

Therefore, process like this unnecessarily slows down the organisation 95% of the time.

Much better to trust people and implement very specific and very carefully designed safety nets such as fast rollbacks.

In most cases, the benefits of just trusting people will massively outweigh the costs of the occasional slip up.

viraptor · on May 18, 2012

If these are the proportions then I really disagree. And actually these are the proportions I see (or better). If you have 19 out of 20 changes done correctly, that means probably 1 every week will fail. Now the question is how will it fail - spelling mistake? single request rejected? single request failed? customer data lost?
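A quick back-of-the-envelope version of that arithmetic, as a sketch only. It assumes roughly 20 changes ship per week and that each one independently has a 19-in-20 chance of being fine; both the weekly volume and the independence are assumptions made for illustration, not figures from the story.

```python
# Assumption: 20 changes ship per week, each independently 95% likely to be fine.
CHANGES_PER_WEEK = 20
P_OK = 19 / 20

# Average number of bad changes per week under those assumptions.
expected_failures = CHANGES_PER_WEEK * (1 - P_OK)

# Probability that at least one of the week's changes is bad.
p_at_least_one = 1 - P_OK ** CHANGES_PER_WEEK

print(f"expected bad changes per week: {expected_failures:.2f}")
print(f"chance of at least one bad change: {p_at_least_one:.0%}")
```

Under those assumptions that is about one bad change a week on average, and roughly a 64% chance that any given week contains at least one.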

If you let too many simple issues through, you're likely to find yourself completely failing when a very simple thing breaks. First you'll find that your logging is not correct, so you'll need to fix that first and reproduce, then you'll find that the rollback doesn't quite work and you've got some bad data to fix manually, then you'll find that this is actually a simple error masking some really nasty bug, etc. etc. I've seen that once or twice and I really believe in the broken windows theory now. I'd be more glad with a reasonable slowdown, than people pushing ahead instead of stopping to think about long-term issues.

Yes, I'm the guy who rejects ~4 out of 5 changes during code review initially (on average). Then again, I'm the guy who gets woken up when things fail - not everyone does, but it really gives you the appreciation of why you want to test those exception branches. I hope that people will be as strict about stuff I submit too.

aslakhellesoy · on May 17, 2012

> Also, I can do a lot of damage in one line.

Especially if it takes a week to revert it. You reap what you sow.

webcowboy · on May 17, 2012

I'm glad this is being said.

I would add that crappy processes and tools are usually in place because the management/leadership in an organization has made them too brittle. While you don't want to wild-west every "critical" issue, bad spots in a process will naturally highlight themselves over time.

Let your tools and processes constantly evolve instead of blindly subscribing to a methodology, and you'll be in much better shape.

MartinCron · on May 17, 2012

> If you say any change longer than four lines is subject to this process, then people will work on trickling in new stuff in increments of four lines at a time

With proper test automation and continuous deployment, trickling new stuff in increments of four lines at a time is the right way to do it.

archangel_one · on May 17, 2012

I agree that it's worth being careful on mission-critical code, but this isn't being careful, it's just box-ticking. Ed said he'd tested the actual change, but most of the time is taken up by shifting it out to some parameters file, lengthening its name and waiting for permissions.

helmut_hed · on May 18, 2012

No one seems to have pointed this out yet, but the code that runs your manufacturing line is indeed "mission critical". Furthermore it's likely that your customers demand that you have strict processes in place for changes, and can actually audit you at their pleasure. So being so bureaucratic about it is not surprising.

The Web 2.0 world works very differently, of course.

archangel_one · on May 18, 2012

To clarify, I'm not suggesting that it's not mission critical, just that I don't believe that that kind of process is actually making things any safer.

j_baker · on May 17, 2012

Indeed. But that doesn't mean that you have to like taking 6 days to write one line of code.

That said, 6 days isn't so bad as long as we're talking about the time it takes to write the code and put it into production. It's another thing to take 6 days to put the code into source control.

tamersalama · on May 17, 2012

In many cases it's trivial to roll back. Why would you fight change if changing the change is just as easy?

In this case process overrode rationale; and was cheered as a result on its own.

akg · on May 17, 2012

Agreed. I think the main problem in this story, however, is the red tape between people and not so much the process itself.

Ask Homer, well Homer is not in so ask Marge, well I don't have access to Marge. Delays like this that come from artificial communication barriers can be easily avoided if they worked on improving communication throughout the organization. It would probably have saved them 1-2 days, if not more.

baha_man · on May 18, 2012

As mentioned above, Homer and Marge are servers rather than people.