Apple’s Handling of Non-Reproducible Bugs

The most frustrating thing about developing in Apple’s ecosystem today, for me at least, is bugs that are difficult to reproduce.

I have two separate issues right now where customers write to me because they’re having a problem, and I can’t reproduce that problem. In both of these scenarios, asking the customer to reboot their phone fixes the problem. I’ve seen other companies give their customers the same advice.

This should never happen. An app should never be able to get the system into a state where OS-level functionality (in my case, iCloud document sync and email) stops working in such a way that the system needs to be rebooted to fix it. It’s an OS bug.

I’ve attempted to get these bugs through Apple’s Radar process, but it always seems to stop dead at the fact that I can’t give them a reproducible scenario. Occasionally they’ll ask for logs, and then the bug gets closed as a duplicate.

A lot of the problems we’re seeing with iOS 8 are not easily reproducible, and I wonder if this isn’t a sign of a bigger problem with the bug reporting system and its handling of problems that are difficult to reproduce.

The iMessage problem that plagued so many people for years is a perfect example. I haven’t seen this happen in Yosemite yet, but for at least two major OS releases, there was a problem where some users found message delivery unreliable. It wasn’t just me; it’s not hard to find people talking about iMessage delivery reliability issues.

How did this bug survive for so long? I don’t know Apple’s internal processes, but it seems like these difficult-to-reproduce problems fall through the cracks, and persist for far longer than they should.

It’s often not clear who should own these bugs. If iCloud sync stops working, whose bug is it? There are probably half a dozen subsystems involved here, and coordinating the work of reproducing the bug and fixing it is no easy task. The problem is that it’s probably a task with no explicit owner. The bug belongs to whoever it’s assigned to at the time, but once they get around to investigating it and decide that the bug seems to be somewhere else, it gets reassigned and the process starts over.

I’ve been thinking about how I’d solve this problem, and my proposal is that once an issue reaches a certain level of notoriety, it should be assigned to a person whose job it is to own that bug: someone who is outside the various teams involved, and can follow the bug wherever it leads. This person would own maybe five bugs at a time, and that would be their full-time job: to contact customers who are having the problem, arrange for instrumented builds to capture information about when the problem happens, whatever it takes.

Apple is suffering a pretty severe reliability hit right now with iOS 8 and all the problems that are plaguing people. I’m sure the teams are busy enough just fixing the issues they can reproduce, but that’s what makes these other issues last so long. There’s always a bigger fire to put out than a bug that’s affecting a tiny percentage of users, but at Apple’s scale, that tiny percentage of users is still a lot of people.