Credit: public domain


Andrew J. Ko

The first application I ever wrote was a complete and utter failure.

I was an eager eighth grader, full of wonder and excitement about the infinite possibilities in code, with an insatiable desire to build, build, build. I'd made plenty of little games and widgets for myself, but now was my chance to create something for someone else: my friend and I were making a game and he needed a tool to create pixel art for it. We had no money for fancy Adobe licenses, and so I decided to make a tool.

In designing the app, I made every imaginable software engineering mistake. I didn't talk to him about requirements. I didn't test on his computer before sending the finished app. I certainly didn't conduct any usability tests, performance tests, or acceptance tests. The app I ended up shipping was a pure expression of what I wanted to build, not what he needed to be creative or productive. As a result, it was buggy, slow, confusing, and useless, and blinded by my joy of coding, I had no clue.

Now, ideally my "customer" would have reported any of these problems to me right away, and I would have learned some tough lessons about software engineering. But this customer was my best friend, and also a very nice guy. He wasn't about to trash all of my hard work. Instead, he suffered in silence. He struggled to install, struggled to use, and worst of all struggled to create. He produced some amazing art a few weeks after I gave him the app, but it was only after a few months of progress on our game that I learned he hadn't used my app for a single asset, preferring instead to suffer through Microsoft Paint. My app was too buggy, too slow, and too confusing to be useful. I was devastated.

Why didn't I know it was such a complete failure? Because I wasn't looking. I'd ignored the ultimate test suite: my customer. I learned that the only way to really know whether software requirements are right is to watch how the software executes in the world, through monitoring.

Discovering Failures

Of course, this is easier said than done. That's because the (ideally) massive number of people executing your software is not easily observable. Moreover, each software quality you might want to monitor (performance, functional correctness, usability) requires entirely different methods of observation and analysis. Let's talk about some of the most important qualities to monitor and how to monitor them.

Crashes, kernel panics, and hangs are some of the easiest failures to detect because they are overt and unambiguous. Microsoft was one of the first organizations to do this comprehensively, building what eventually became known as Windows Error Reporting (Glerum et al. 2009). It turns out that actually capturing these errors at scale and mining them for repeating, reproducible failures is quite complex, requiring classification, progressive data collection, and many statistical techniques to extract signal from noise. In fact, Microsoft has a dedicated team of data scientists and engineers whose sole job is to manage the error reporting infrastructure, monitor and triage incoming errors, and use trends in errors to make decisions about improvements to future releases and release processes. This is now standard practice in most companies and organizations, including other big software companies (Google, Apple, IBM, etc.), as well as open source projects (e.g., Mozilla). In fact, many application development platforms now include this as a standard operating system feature.
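To make the idea of classification concrete, here is a minimal sketch in Python of grouping crash reports into buckets so that repeating failures stand out. It is not how Windows Error Reporting works; the report format, the function names, and the choice to use the top three stack frames as a signature are all assumptions made for illustration.

```python
from collections import Counter

def crash_signature(stack_frames, depth=3):
    """Build a bucket key from the top `depth` frames of a stack trace.

    Assumption: crashes that share their top frames likely share a cause.
    """
    return " | ".join(stack_frames[:depth])

def bucket_crashes(reports, depth=3):
    """Count crash reports per signature, most frequent bucket first."""
    counts = Counter(crash_signature(r["stack"], depth) for r in reports)
    return counts.most_common()

# Hypothetical crash reports, each with a stack trace (innermost frame first).
reports = [
    {"user": "a", "stack": ["draw_pixel", "on_mouse_drag", "event_loop"]},
    {"user": "b", "stack": ["draw_pixel", "on_mouse_drag", "event_loop"]},
    {"user": "c", "stack": ["save_file", "on_menu_click", "event_loop"]},
]

for signature, count in bucket_crashes(reports):
    print(count, signature)
```

Even a crude signature like this turns a pile of individual reports into a ranked list of failures, which is the starting point for triage. Real systems need far more careful heuristics to keep unrelated crashes out of the same bucket.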

Performance, like crashes, kernel panics, and hangs, is easily observable in software, but a bit trickier to characterize as good or bad. How slow is too slow? How bad is it if something is slow occasionally? You'll have to define acceptable thresholds for different use cases to be able to identify problems automatically. Some experts in industry still view this as an art.
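As a rough sketch of what defining thresholds per use case might look like, the Python below compares the 95th percentile of observed timings against a limit for each use case. The use case names, the threshold values, and the choice of the 95th percentile are all hypothetical; picking them well is the part that remains an art.

```python
import math

# Hypothetical acceptable limits, in milliseconds, for two use cases.
THRESHOLDS_MS = {"open_file": 500, "draw_stroke": 50}

def percentile(samples, pct):
    """Return the nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def slow_use_cases(timings, pct=95):
    """Return use cases whose `pct` percentile timing exceeds its threshold."""
    flagged = {}
    for use_case, samples in timings.items():
        observed = percentile(samples, pct)
        if observed > THRESHOLDS_MS[use_case]:
            flagged[use_case] = observed
    return flagged

# Hypothetical timings collected from the field, in milliseconds.
timings = {
    "open_file": [120, 180, 150, 900, 200],
    "draw_stroke": [10, 12, 9, 11, 14],
}

print(slow_use_cases(timings))
```

Using a high percentile rather than the mean is one way to answer the question of how bad it is if something is slow occasionally: it surfaces the slow tail that an average would hide.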

It's also hard to monitor performance without actually harming performance. Many tools and services (e.g., New Relic) are getting better at reducing this overhead and offering real-time data about performance problems through sampling.
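The sampling idea can be sketched in a few lines of Python: time only a random fraction of calls, so most calls pay almost no monitoring cost. This is an illustration under assumed names (`SampledTimer`, `measure`), not the approach of any particular product.

```python
import random
import time

class SampledTimer:
    """Record timings for roughly `sample_rate` of the calls it wraps."""

    def __init__(self, sample_rate, rng=None):
        self.sample_rate = sample_rate      # fraction of calls to time, 0.0 to 1.0
        self.rng = rng or random.Random()   # injectable for repeatable tests
        self.samples = []                   # elapsed seconds for sampled calls

    def measure(self, func, *args, **kwargs):
        """Call `func`, timing it only if this call is selected for sampling."""
        if self.rng.random() >= self.sample_rate:
            return func(*args, **kwargs)    # unsampled: no timing work at all
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            self.samples.append(time.perf_counter() - start)

# Time about 10% of 1,000 calls to a stand-in workload.
timer = SampledTimer(sample_rate=0.1, rng=random.Random(0))
for _ in range(1000):
    timer.measure(sorted, [3, 1, 2])
print(len(timer.samples), "of 1000 calls were timed")
```

The tradeoff is precision for overhead: a lower sample rate costs less but takes longer to reveal rare slowdowns.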

Monitoring for data breaches, identity theft, and other security and privacy concerns is an incredibly important part of running a service, but also very challenging. This is partly because the tools for doing this monitoring are not yet well integrated, requiring each team to develop its own practices and monitoring infrastructure. But it's also because protecting data and identity is more than just detecting and blocking malicious payloads. It's also about recovering from ones that get through, developing reliable data streams about application network activity, monitoring for anomalies and trends in those streams, and developing practices for tracking and responding to warnings that your monitoring system might generate. Researchers are still actively inventing more scalable, usable, and deployable techniques for all of these activities.
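One simple way to monitor a stream for anomalies is to compare each new value against a trailing window of recent values. The Python sketch below flags values more than a few standard deviations from the recent mean. The window size, the threshold, and the requests-per-minute data are assumptions for illustration; real security monitoring relies on much richer signals than a single count.

```python
from collections import deque
from statistics import mean, pstdev

def find_anomalies(counts, window=5, threshold=3.0):
    """Return indices of values far from the mean of the preceding window."""
    recent = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(counts):
        if len(recent) == window:
            baseline = mean(recent)
            spread = pstdev(recent)
            # Skip perfectly flat windows, where a z-score is undefined.
            if spread > 0 and abs(value - baseline) / spread > threshold:
                anomalies.append(index)
        recent.append(value)
    return anomalies

# Hypothetical requests per minute, with one suspicious spike.
requests_per_minute = [100, 104, 98, 101, 97, 102, 480, 99]
print(find_anomalies(requests_per_minute))
```

A flagged index is only a warning. Someone still has to decide whether the spike is an attack, a marketing campaign, or a bug, which is why practices for tracking and responding to warnings matter as much as the detection itself.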

The biggest limitation of the monitoring above is that it only reveals what people are doing with your software, not why they are doing it, or why it has failed. Monitoring can help you know that a problem exists, but it can't tell you why a program failed or why a person failed to use your software successfully.

Discovering Missing Requirements

Usability problems and missing features are even harder to detect or observe than the preceding problems, because the only true indicator that something is hard to use is in a user's mind. That said, there are a couple of approaches to detecting the possibility of usability problems.

One is by monitoring application usage. Assuming your users will tolerate being watched, there are many techniques: 1) automatically instrumenting applications for user interaction events, 2) mining events for problematic patterns, and 3) browsing and analyzing patterns for more subjective issues (Ivory & Hearst 2001). Modern tools and services like Intercom make it easier to capture, store, and analyze this usage data, although they still require you to have some upfront intuition about what to monitor. More advanced, experimental techniques in research automatically analyze undo events as indicators of usability problems (Akers et al. 2009); this work observes that undo is often an indicator of a mistake in creative software, and mistakes are often indicators of usability problems.
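The undo heuristic is easy to sketch in Python. Given a log of interaction events, the code below counts which commands are most often followed immediately by an undo. The event names and the flat list format are hypothetical, and this is far simpler than the analysis in Akers et al. (2009), but it shows the kind of pattern mining that instrumentation makes possible.

```python
from collections import Counter

def commands_before_undo(events):
    """Count how often each command is immediately followed by an undo."""
    undone = Counter()
    previous = None
    for event in events:
        # Ignore repeated undos so each undo run is blamed on one command.
        if event == "undo" and previous is not None and previous != "undo":
            undone[previous] += 1
        previous = event
    return undone.most_common()

# Hypothetical interaction log from a pixel art tool.
events = ["brush", "undo", "fill", "brush", "undo", "erase", "fill", "undo"]

for command, count in commands_before_undo(events):
    print(command, count)
```

A command that tops this list is not necessarily broken, but it is a good candidate for a closer look, ideally by watching or asking the people who use it.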

All of the usage data above can tell you what your users are doing, but not why. For this, you'll need to get explicit feedback from support tickets, support forums, product reviews, and other critiques of user experience. Some of these types of reports go directly to engineering teams, becoming part of bug reporting systems, while others end up in customer service or marketing departments. While all of this data is valuable for monitoring user experience, most companies still do a bad job of using anything but bug reports to improve user experience, overlooking the rich insights in customer service interactions (Chilana et al. 2011).

Although bug reports are widely used, they have significant problems as a way to monitor: for developers to fix a problem, they need detailed steps to reproduce the problem, or stack traces or other state to help them track down the cause of a problem (Bettenburg et al. 2008); these are precisely the kinds of information that are hard for users to find and submit, given that most people aren't trained to produce reliable, precise information for failure reproduction. Additionally, once the information is recorded in a bug report, even interpreting the information requires social, organizational, and technical knowledge, meaning that if a problem is not addressed soon, an organization's ability to even interpret what the failure was and what caused it can decay over time (Aranda & Venolia 2009). All of these issues can lead to intractable debugging challenges.

Larger software organizations now employ data scientists to help mitigate these challenges of analyzing and maintaining monitoring data and bug reports. Most of them try to answer questions such as (Begel & Zimmermann 2014):

The most mature data science teams in software engineering organizations even have multiple distinct roles, including Insight Providers, who gather and analyze data to inform decisions; Modeling Specialists, who use their machine learning expertise to build predictive models; and Platform Builders, who create the infrastructure necessary for gathering data (Kim et al. 2016). Of course, smaller organizations may have individuals who take on all of these roles.

All of this effort to capture and maintain user feedback can be messy to analyze because it usually comes in the form of natural language text. Services like AnswerDash (a company I co-founded) structure this data by organizing requests around frequently asked questions. AnswerDash imposes a little widget on every page in a web application, making it easy for users to submit questions and find answers to previously asked questions. This generates data about the features and use cases that are leading to the most confusion, which types of users are having this confusion, and where in an application the confusion is happening most frequently. This product was based on several years of research in my lab (Chilana et al. 2013).

Further reading

David Akers, Matthew Simpson, Robin Jeffries, and Terry Winograd. 2009. Undo and erase events as indicators of usability problems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '09). ACM, New York, NY, USA, 659-668.

Jorge Aranda and Gina Venolia. 2009. The secret life of bugs: Going past the errors and omissions in software repositories. In Proceedings of the 31st International Conference on Software Engineering (ICSE '09). IEEE Computer Society, Washington, DC, USA, 298-308.

Andrew Begel and Thomas Zimmermann. 2014. Analyze this! 145 questions for data scientists in software engineering. In Proceedings of the 36th International Conference on Software Engineering (ICSE '14), 12-23.

Nicolas Bettenburg, Sascha Just, Adrian Schröter, Cathrin Weiss, Rahul Premraj, and Thomas Zimmermann. 2008. What makes a good bug report? In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering (SIGSOFT '08/FSE-16). ACM, New York, NY, USA, 308-318.

Parmit K. Chilana, Andrew J. Ko, Jacob O. Wobbrock, and Tovi Grossman. 2013. A multi-site field study of crowdsourced contextual help: usage and perspectives of end users and software teams. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '13), 217-226.

Parmit K. Chilana, Andrew J. Ko, Jacob O. Wobbrock, Tovi Grossman, and George Fitzmaurice. 2011. Post-deployment usability: a survey of current practices. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '11). ACM, New York, NY, USA, 2243-2246.

Kirk Glerum, Kinshuman Kinshumann, Steve Greenberg, Gabriel Aul, Vince Orgovan, Greg Nichols, David Grant, Gretchen Loihle, and Galen Hunt. 2009. Debugging in the (very) large: ten years of implementation and experience. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles (SOSP '09). ACM, New York, NY, USA, 103-116.

Melody Y. Ivory and Marti A. Hearst. 2001. The state of the art in automating usability evaluation of user interfaces. ACM Computing Surveys, 33(4).

Miryung Kim, Thomas Zimmermann, Robert DeLine, and Andrew Begel. 2016. The emerging role of data scientists on software development teams. In Proceedings of the 38th International Conference on Software Engineering (ICSE '16). ACM, New York, NY, USA, 96-107.


Software Engineering Daily, Performance Monitoring with Andi Grabner

Software Engineering Daily, The Art of Monitoring with James Turnbull

Software Engineering Daily, Debugging Stories with Haseeb Qureshi