Friday, April 3, 2009

Fixing Reliability Issues in IE8

Original: Fixing Reliability Issues in IE8

In a previous post, Andy wrote about some of the new features we introduced to improve reliability in Internet Explorer, such as Loosely Coupled IE and Automatic Crash Recovery. These features help minimize the impact of reliability issues (such as crashes and hangs) once our users encounter them, allowing them to return to their original browsing state as soon as possible.

From an engineering perspective, our goal is to minimize the occurrences of these issues in the first place. In today's post I'd like to walk you through the various approaches we use to identify, prioritize and address reliability issues in IE8. Each of these approaches covers different angles of the product and is equally useful in ensuring a reliable experience for different types of users.

End-user Feedback

Users are our greatest resource for feedback pertaining to reliability issues. We receive lots of information from our feedback channels about specific problems that users are experiencing. We also leverage Windows Error Reporting, which sends details of user-reported crashes and hangs back to Microsoft. The accumulation of these reports helps us understand overall browser reliability in customer environments and enables us to identify our top issues.

In last month's post, I described how we used a failure curve to identify and fix reliability issues for IE8 on the Windows 7 Beta. We follow a similar approach when we ship a new IE release. Below you can see the failure curve for the top crashes and hangs caused by IE in the IE8 Beta cycle. The data is based on snapshots taken 50 days after each IE8 pre-release. The green bars indicate failures that have been fixed in IE8.

[Figure: chart of IE failures. Each bar is one failure issue, and the bars are sorted by number of occurrences. The graph has the shape of the right side of a bell curve.]

A failure curve typically shows that a small number of failures impact a large number of users. These are represented by the leftmost bars on the curve and are most likely crashes in mainstream scenarios. The issues in the "long tail" are generally encountered in specific hardware and software configurations. While these bugs are important to fix, we focus our efforts on the most impactful issues first and systematically work our way down the failure curve. The top remaining issues that are not addressed in time will be fixed and shipped to customers in future security updates. To date, we have fixed 80% of all the reported crashes and hangs in IE8.
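
To make the idea of working down the failure curve concrete, here is a minimal sketch in Python. It is not the tooling used by the IE team; the bucket names and counts are invented for illustration, and the helper names `failure_curve` and `buckets_to_cover` are hypothetical. It only shows how sorting failure reports by frequency reveals that a few issues account for most reports.

```python
from collections import Counter


def failure_curve(reports):
    """Return (bucket, count, cumulative_share) rows, most frequent bucket first."""
    counts = Counter(reports)
    total = sum(counts.values())
    rows = []
    running = 0
    for bucket, count in counts.most_common():
        running += count
        rows.append((bucket, count, running / total))
    return rows


def buckets_to_cover(rows, target_share):
    """Smallest number of top buckets whose reports reach target_share of the total."""
    for index, (_, _, cumulative) in enumerate(rows, start=1):
        if cumulative >= target_share:
            return index
    return len(rows)


# Hypothetical sample: one entry per reported crash or hang, labeled by failure bucket.
sample_reports = (
    ["tab_close"] * 50
    + ["addon_load"] * 30
    + ["print_preview"] * 15
    + ["rare_driver"] * 5
)

curve = failure_curve(sample_reports)
for bucket, count, cumulative in curve:
    print(f"{bucket:15} {count:4} {cumulative:6.0%}")
```

With this invented data, `buckets_to_cover(curve, 0.75)` shows that the two leftmost bars already account for at least three quarters of all reports, which is why fixing from the left of the curve pays off first.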

Feature Testing

Reliability testing starts at the feature level and helps maximize product quality from the ground up. The owners of each feature run automation on their new code to identify and fix key stability issues. They are also the heaviest users of their own features, so they tend to hit issues that automation misses. As a result, the majority of stability issues for a given feature are identified and addressed during this phase.

In addition, teams use several tools that scan their source code for potential faults such as buffer overruns, memory leaks, and uninitialized memory. Many of these faults could have become top crashes and hangs for the public, but our teams fix them before they are even encountered internally.

Internal Product Usage

Across the IE team and other divisions throughout Microsoft, thousands of people use the latest versions of IE8 every day. One critical benefit of this internal network is that we can work directly with the employee who encountered a crash to investigate and fix it quickly, often debugging directly on their machine.

Internal users send us crash and hang data through Windows Error Reporting as well. We use their data to fix the most impactful issues seen in recent builds. The shape of the failure curve generated via internal data can be a good approximation of what we see from the public as well.

For example, many of our internal users upgraded to the Windows 7 Beta build several weeks before it was publicly available. The data generated from the extensive Beta-testing allowed us to create a preliminary failure curve. We began to investigate a handful of crashes that were clearly the most impactful issues. By fixing these crashes ahead of time we stayed ahead of our failure curve, and were able to swiftly address the remaining issues after the Beta was released.

Lab Reliability Testing

To discover crashes in more far-reaching usage scenarios, we perform a series of tests on lab machines to continuously monitor for new issues in recent builds. We employ two common techniques often referred to as Stress and Long Haul testing.

Stress Testing
Stress testing the browser is critical to identifying architectural bugs and other hidden issues. Our goal is to be able to stress-test IE8 for a defined period of time without interruptions. To measure this, we rapidly navigate to websites and perform user actions such as opening and closing tabs and windows. We hope to identify important issues that may not be seen from regular browsing behavior. Over the course of a day, it's possible for our stress tests to navigate the same browser instance to over 100,000 web pages!
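
As a rough illustration of the kind of action stream described above, the Python sketch below generates a reproducible sequence of navigate, open-tab, and close-tab steps. It does not drive a real browser and is not the IE team's harness; the function name `action_sequence`, the action names, and the example URLs are all assumptions made for this sketch. Seeding the random generator is one way to replay the exact sequence that led to a failure.

```python
import random

# Hypothetical action names; a real harness would map these to browser automation.
ACTIONS = ("navigate", "open_tab", "close_tab")


def action_sequence(seed, steps, urls):
    """Build a repeatable list of (action, url_or_None) steps for one stress run."""
    rng = random.Random(seed)  # same seed gives the same sequence
    open_tabs = 1
    plan = []
    for _ in range(steps):
        action = rng.choice(ACTIONS)
        if action == "close_tab" and open_tabs == 1:
            action = "open_tab"  # never close the last remaining tab
        if action == "open_tab":
            open_tabs += 1
            plan.append(("open_tab", None))
        elif action == "close_tab":
            open_tabs -= 1
            plan.append(("close_tab", None))
        else:
            plan.append(("navigate", rng.choice(urls)))
    return plan


# Placeholder URLs for illustration only.
example_urls = ["http://example.com/a", "http://example.com/b"]
plan = action_sequence(seed=42, steps=200, urls=example_urls)
```

Logging the seed alongside each run means that a crash seen after many thousands of steps can be reproduced by regenerating the same plan.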

Long Haul Testing
Part of our vision for IE8 reliability is to allow users to run IE for as long as they desire without interruptions. We devise different types of tests to simulate long term usage of the browser and measure how long the tests can run before experiencing a crash or running out of memory. We then engage with teams to get the top issues fixed as soon as possible. We also run tests to track memory consumption after long periods of usage.
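
One simple way to reason about memory consumption over a long run is to fit a trend line through periodic memory samples. The Python sketch below is only an illustration under stated assumptions: the post does not say how memory is measured, the function name `growth_per_hour` is hypothetical, and the sample values are invented. A slope that stays well above zero over many hours would suggest a leak worth investigating.

```python
def growth_per_hour(samples):
    """Least-squares slope of memory (MB) against elapsed time (hours).

    samples is a list of (hours_elapsed, memory_mb) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    numerator = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("samples must cover more than one point in time")
    return numerator / denominator


# Invented readings: memory in MB sampled once per hour during a long run.
readings = [(0, 100.0), (1, 110.0), (2, 120.0), (3, 130.0)]
print(f"{growth_per_hour(readings):.1f} MB per hour")
```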

Summary

Each of the above approaches has played an important role in driving reliability improvements in IE8. We are able to quantify these improvements using various tests and analyses and to compare our results with IE7. For example, we can now run stress tests continuously for over 12 hours on IE8, compared to less than 8 hours on IE7. We also found that IE8 uses 90% less memory than IE7 after running continuously for 24 hours under Long Haul testing.

We're convinced that our approaches to identifying reliability issues have generated significant stability improvements to IE8. We encourage you to try out Internet Explorer 8 today and send us your feedback. We rely on your collective feedback to determine the most important reliability issues to address in future releases. If you experience a crash or hang please be sure to let us know by submitting an error report.

Thank you for your help in ensuring that IE8 is such a great product!

Herman Ng
Program Manager
