We need to accept that unforeseen regressions and late changes have consequences. Slipping the full Fedora release schedule when we don’t meet our release criteria is a good way to show that and maintain a baseline of quality for our releases. Working backwards from important milestones and starting earlier is how we ship on time.
At the April 1, 2010 Board IRC Q&A session, John McDonough asked if we track the reasons for schedule slips and respond accordingly. I track the differences between our planned and actual major milestone dates. Exactly why a milestone slips is not specifically tracked, and there are usually differing opinions about the exact reasons. Generally we slip because we cannot resolve blocker bugs in time to adequately retest them and stage the content for the coming Tuesday release day, as Adam articulated recently. The QA team is doing a great job tracking our experiences for Fedora 13 in a living retrospective page.
This got me thinking more about the things detailed below.
The Release Schedule Affects Everyone
Each Fedora release has its themes and challenges. Fedora 13 was the first release to have detailed release criteria and set schedule methodology from the beginning. We also started with a contingency plan of slipping the release milestone and subsequent milestones by one week if we missed a milestone. For the Fedora 13 Alpha, we absorbed the slip, hoping for the best and justifying the change based on the newly implemented No Frozen Rawhide “not taking away developer time.”
This phrase doesn’t make sense to me. I wasn’t at the meeting so perhaps I missed the full context. The Fedora release schedule is bigger than whether or not we are “taking time away from developers.” It is also important to factor in the time period between test releases to clear blocker lists, the amount of PR and public testing soak time a test release gets, and how much an already compressed schedule is being compressed more.
Stop Iterating on the Scheduling
As we construct each new release schedule we try to factor in the lessons from the release before. This means no two release schedules are ever the same. They are often very similar, but the task durations change and the methodology gets tweaked. We’ve reached the point where we need to stop tweaking and run with a fixed schedule methodology for more than one release.
Given how long it takes us to get used to new processes, holding our schedule methodology constant for a couple releases and taking a break from experimenting might yield better insights into how to do our releases better and build the schedule accordingly.
I’m proposing a Fedora 14 schedule that follows the same methodology as Fedora 13.
Schedule Milestones are Not a Suggestion
If we really want our releases to be on time we must give interim milestones and tasks just as much value as the big ones. If we plan to compose a release candidate on Thursday so QA has six days to test before the Go/No-Go meeting, we should make a bigger deal when the release candidate isn’t ready until Monday or Tuesday. If history shows that we rarely if ever have a solid release candidate on the day it is scheduled we should start earlier than Thursday to create it. We already have a “Test Compose” milestone scheduled a week earlier to address this, but it suffers from the same approach.
It’s all about working backwards, just like we do for everyday life events that must start or happen by a certain time. We work backwards, we start earlier, we build in buffers and contingency plans, and we say “no” even if other people don’t understand.
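As a rough illustration of working backwards, here is a minimal Python sketch. The milestone names, the offsets, and the release date are all assumptions made up for the example, not the actual Fedora schedule; the offsets simply mirror the Thursday release candidate, six days of QA, and Tuesday release day described above, and the six-day gap between Go/No-Go and release is also assumed.

```python
from datetime import date, timedelta

# Hypothetical offsets, in days before release day. They follow the example
# in the text: a Tuesday release, a Go/No-Go meeting six days after a
# Thursday release candidate, and a test compose one week before that.
OFFSETS = {
    "test_compose": 19,
    "release_candidate": 12,
    "go_no_go": 6,
    "release": 0,
}


def work_backwards(release_day, offsets):
    """Derive each milestone date by subtracting its offset from release day."""
    return {name: release_day - timedelta(days=days)
            for name, days in offsets.items()}


def start_earlier(offsets, milestone, buffer_days):
    """Return a copy of the offsets with one milestone moved earlier."""
    adjusted = dict(offsets)
    adjusted[milestone] += buffer_days
    return adjusted


release_day = date(2010, 6, 1)  # an illustrative Tuesday, not a real target
plan = work_backwards(release_day, OFFSETS)

# If release candidates are habitually late, start composing two days sooner.
buffered = work_backwards(
    release_day, start_earlier(OFFSETS, "release_candidate", 2)
)

for name, day in plan.items():
    print(f"{name:18} {day:%a %Y-%m-%d}")
```

The point of the sketch is that the release day is the only fixed input. Every other date falls out of it, so adding a buffer means moving a start date earlier rather than squeezing the testing window that follows.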
Consistently Slipping the Slip
Our schedules are constructed very tightly to maximize development time at the front of the schedule. This results in tasks scheduled for the shortest time possible in the rest of the schedule. It doesn’t make sense to think that when something goes wrong we can suddenly make up the time. Rarely, if ever, has Fedora “made up time” on the schedule. It is part of scheduling mythology that “time on a software schedule can be made up.”
Historically, failing to “slip the slip” catches up with us in the end. As a distribution we underestimate the marketing and PR value the Alpha and Beta releases bring. With a short length of three weeks for each, it doesn’t make sense to shorten them so that the final release date can be on time.
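The difference between the two policies can be made concrete with a small Python sketch. The function names, the three-milestone schedule, and the dates are hypothetical and chosen only to show the arithmetic; the three-week spacing and one-week slip come from the discussion above.

```python
from datetime import date, timedelta


def slip_the_slip(schedule, missed, days=7):
    """Shift the missed milestone and every later milestone by `days`."""
    names = list(schedule)  # dicts preserve insertion (schedule) order
    start = names.index(missed)
    delay = timedelta(days=days)
    return {name: (day + delay if i >= start else day)
            for i, (name, day) in enumerate(schedule.items())}


def absorb_slip(schedule, missed, days=7):
    """Shift only the missed milestone and leave later dates fixed."""
    absorbed = dict(schedule)
    absorbed[missed] += timedelta(days=days)
    return absorbed


# Illustrative dates only: three milestones spaced three weeks apart.
schedule = {
    "alpha": date(2010, 6, 1),
    "beta": date(2010, 6, 22),
    "final": date(2010, 7, 13),
}

propagated = slip_the_slip(schedule, "alpha")
absorbed = absorb_slip(schedule, "alpha")

print("propagated alpha-to-beta:", (propagated["beta"] - propagated["alpha"]).days, "days")
print("absorbed alpha-to-beta:  ", (absorbed["beta"] - absorbed["alpha"]).days, "days")
```

Under these assumptions, propagating the slip keeps the Alpha-to-Beta window at three weeks and moves the final date out by a week. Absorbing it holds the final date but cuts that window to two weeks, which is the compression the text argues against.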
Reverting to Last Known Good
For as long as I can remember, Fedora has been a time-based release. Most successful time-based releases have stricter practices around “last known good” content and rolling back to it when regressions are introduced. We don’t revert very often because it is usually deemed more disruptive to roll back to the “last known good” than to keep the new package and fix the regression.
This is part of our process we should fix. It could go a long way towards reducing our need to slip. No Frozen Rawhide helps to make sure less broken stuff gets in, but it does not address how to fix broken stuff when it does get in.
April 16, 2010 at 9:52 pm
I have to disagree a lot with what you wrote there:
* The main reason we usually don’t make up for the time we slip is that we don’t even try! It’s BECAUSE you and a few other folks always argue that slips should be propagated all along that we don’t make up for lost time. Instead, what often happens is that we then get ANOTHER slip ON TOP of the previous one, whereas if we had absorbed the first slip, we would have ended up only with the second one. But then you’d argue we should have known all along that we needed the extra time, when it’s really a completely separate slip that happened, and when your way of working would have led us to accumulate BOTH slips.
* Somewhat relatedly, if the goal is to release on day X, the right way to get there is to schedule for X-N where N is some slack time, NOT for day X. Slips will invariably happen and move you to day X, give or take a few days. It doesn’t matter if the schedule is unrealistic, it’ll slip anyway. It just puts the pressure on people.
* Reverting stuff is in many cases the wrong solution to a problem. The issue should be fixed instead. Not only does the reversion process itself cause practical issues (such as Epoch inflation), but a component is not isolated; there are interdependencies in our distro. Often, other components have been updated to expect the updated component. In any case, they’ll only have been tested with the new version, not with the one you’re going to revert to. So reversion has a non-negligible risk of breakage as well.
April 14, 2010 at 2:49 pm
This all makes good sense. I especially like the idea of sticking to the same methodology for >1 release to learn more. We’ve been doing such great iteration on the schedule for so many releases that we have to be approaching something that really deserves another release as-is to shake out more kinks + really learn from it.
Interesting to come back to the discussion on last known good. I recently wrote about the six-month release cycle that I mistakenly thought was adopted by Fedora from other projects — it’s really a child of the time-based release and schedule that Red Hat Linux used to run. In this post by Havoc Pennington, he identifies that history while having a discussion that might seem familiar. In particular, being willing to fall back to last known good may be a key to doing a time-based release schedule.
Overall, the rigor around scheduling is helping us to actually process out the ways the FOSS methodology works against releasing on time.