"All interactions magazine articles are posted with the permission of ACM."
© 2002, Association
for Computing Machinery. Permission to make digital copies of part or all of
this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that
copies bear this notice and full citation on the first page.
Reprinted with permission from interactions magazine, 9[2], 11 - 16, March/April,
2002.
interactions magazine
Volume 9, No. 2 (March/April, 2002) Pages 11 - 16.
"Why do Version
1.0 and not release it?"
Conducting Field Trials of the Tablet PC
Susan Dray & David Siegel
Dray & Associates, Inc., Minneapolis, MN, USA
Evan Feldman & Maria Potenza
Tablet PC User Research, Microsoft Corporation, Redmond, WA, USA
The Challenge of New Technology Design
There have been many discussions at conferences and in journals on the merits of different user-centered design (UCD) methodologies for addressing different types of design problems, and on how they fit into the design life cycle. As a generalization, ethnographic work tends to be recommended at the start of the design process, and iterative usability testing in the lab as the design progresses. The assignment of ethnographic methods to the pre-design phase is almost a given in the field, while usability testing tends to be seen as the key source of design improvements as the design evolves. Arguments for these UCD methods are well known to this community, and there is no question that the adoption of these methods is a tremendous step forward over traditional software development approaches.
However, in this article, we look at a situation that highlights potential limitations of this approach, one in which there was both a clear research need and a strong business case for a hybrid approach that integrates ethnography and usability throughout the development process. We call this approach "ethnographic field trials." Based on our experience, the arguments for such an approach, and its cost-justification, are strongest when introducing innovative technology. We will illustrate this with the example of a series of field trials we carried out over 1½ years to study three different iterations of the Microsoft Tablet PC prototype. We hope to follow up in subsequent articles with more detailed discussion of methodological issues and examples of ways in which the studies influenced the evolving design.
A Case Study: The Microsoft Tablet PC
As introduced on the Microsoft Web site (www.microsoft.com/tabletpc/), the Tablet PC is envisioned as the next-generation mobile business PC. It is expected to be available from leading computer makers in the second half of 2002. The Tablet PC runs Microsoft® Windows® XP Tablet PC Edition and features the capabilities of current business laptops, including attached or detachable keyboards and the ability to run Windows-based applications. The biggest innovation of the Tablet PC is that it adds pen and speech capabilities to the power of a full-function laptop. Using a special stylus or pen, the user can create, manipulate, and manage handwritten electronic documents ("notes") and add handwritten electronic annotations to imported documents. Users can also interact with Windows-based applications via the pen or speech, even in the absence of a keyboard. The Tablet PC is intended to allow knowledge workers to use the PC in a variety of new ways and settings in which the use of a keyboard would be impractical. The hope of the design team is for the Tablet PC to be adopted widely and fully integrated into people's work. For this to happen, it must not only be highly usable, but also be perceived as extremely useful and more beneficial than current options.
Ongoing Traditional User-Centered Design Activities
In order to develop and refine the initial concept for the Tablet PC, the Microsoft team performed ethnographic studies and contextual inquiry to understand knowledge worker tasks, document management, and note-taking practices. They also studied knowledge worker characteristics in order to develop personas and other elaborated user profiles. Market researchers also used a variety of methods to characterize the target market. The team did many traditional usability studies in the lab, evaluating the usability of features during development, and made many significant design changes based on the results.
These research methods provided indispensable input into the development process, but they were insufficient to answer some of the most crucial questions, which were more global in nature. Would the Tablet PC be usable and useful in the workplace, and would users successfully adopt it? Would it be integrated into the workplace and work practices of users who explored it on their own? In what ways would users have to change their practices to exploit new capabilities, how motivated would they be to do so, and what obstacles would they encounter? We also wanted to understand usability at various points in the learning curve and at various stages in the adoption process. Would the Tablet PC be useful and usable enough initially to allow users to work with it immediately, while continuing to explore and utilize additional functionality over time? How much of this functionality would be used (or needed) once users were more experienced? For each of these questions, we wanted to assess utility and usability, and the complex relationship between them. In short, we needed to fully understand the user experience with the new device, in the natural setting, over time, and to evaluate separate features and design elements in the context of the overall user experience.
To answer these questions, we needed information based on studies of users interacting with Tablet PC prototypes. Ethnographic data gathered without interaction with the Tablet PC was of limited use. Such studies provide essential clues as to user needs and contextual factors that are likely to influence the acceptance, usefulness, and usability of a design, but these clues are somewhat indirect. This is especially true of new technology, which brings previously unimagined functionality, enables new tasks and scenarios, and changes the overall structure of existing work patterns. Nor could these questions be answered primarily through self-report methods, in which users describe perceived frustrations with their existing tools and practices and react to the product concept. Self-report is likely to be influenced by how "cool" various features sound, and often does not accurately predict actual user behavior.
Iterative usability testing alone also could not provide the answers. Common usability methods tend to provide fragmentary information, when what was needed was a comprehensive evaluation of the user experience over time. Usability tests are clearly indispensable, but they inevitably use tasks constructed to probe specific design features and functions. Even when task scenarios are very well designed and based on user data, they do not necessarily capture the spontaneous intentions and motivations of the user at the time. They also present tasks in isolation, rather than as part of the overall workflow and demands encountered by a user during the workday. Thus, the tasks must be chosen in a selective and prioritized manner, and are not likely to capture the full range of usage. In addition, usability testing gives a snapshot of ease of use at a particular point in the learning curve, typically the beginning. While we can often extrapolate from this, it may not be enough to predict users' experience and satisfaction over time, especially when tasks are perceived as novel and users expect to go through a learning and discovery process.
Many of these issues are inherent to the introduction of new technology. The risk of not addressing these questions directly is particularly high considering the huge investment involved in launching the new technology. This is especially true when actual usage of the capabilities is unknown. Furthermore, when the new technology has a broad vision or is a general-purpose tool, like the Tablet PC, people are likely to use it in a wide variety of individualistic ways, combining its capabilities differently to carry out real-life tasks. Thus, significant interaction effects exist in which the usability of one element influences the usability of others, and the exact combinations will be different for different users and tasks. These are hard to capture in the lab.
The Tablet PC's success depends not just on the ability of individual users to carry out tasks in isolation, but on the fit between the uses it lends itself to and the work context, including the social dynamics, work practice, and technical infrastructure. In light of these factors, understanding the user experience calls for a more holistic evaluation process. We needed a method that could look in an integrated way at discoverability, utility, usability, and fit with the context, and we could evaluate these things only in the natural setting if we wanted to capture the range of variation we expected.
How to proceed?
Microsoft faced a limited set of traditional choices, none of which was realistic given the need. One alternative was to release the product and watch market reactions while getting feedback from early adopters. This is probably the most common response, and the riskiest. If the product doesn't meet users' expectations or doesn't fit well into their existing or modified work process, it won't be used. Redesigning the product is expensive from a development perspective, especially if core functionality is not appropriate, and sometimes many iterations of the product in the market are necessary to get closer to meeting users' needs and expectations. Often, however, these iterations will not happen, because a product that performs poorly initially does not even get a subsequent release. While usability tests and traditional ethnography help to minimize this risk, many of the questions about adoption and use remain elusive and at the mercy of the market.
Another alternative was for the team to gather usability data from the beta release, even though a beta test is almost always a bug-fix release. However, beta testers are rarely typical users of a new technology, and the information probably arrives too late to affect design. Often this information becomes the starting point for design changes in subsequent releases.
The Tablet PC team was not satisfied with either of these traditional approaches. Instead, they wanted to get feedback and observe use over time before they released the product. They wanted to do this with "real" knowledge workers in "real" companies doing "real" work, to understand the technical and personal challenges that customers would face, and to fix them before launch. They knew that this called for a new approach and a significant investment in time and dollars. However, the risk of not addressing these questions directly was perceived as particularly high in light of previous attempts with this type of technology.
Study Requirements
It was obvious that only field research would be appropriate. The study would also have to be longitudinal, to allow for studying usage and usability over time. In addition, it had to be done in a way that allowed the expected wide range of variation in usage to evolve. The methodology had to be open enough to allow for shifting the focus to follow interesting trends or to address gaps in our knowledge as they became evident. It had to allow for pre-planned, structured evaluation activities in order to get equivalent data on key design features across users. It also had to allow for extensive non-structured interaction with the Tablet PC in order to provide a picture of naturally evolving usage patterns and usability in context. Because the study had to combine different methodologies with different degrees of structure, managing and interpreting the very heterogeneous qualitative data would be a major challenge.
Guided by all these considerations, we worked closely together over an almost two-year period to conduct a series of three field trials at a variety of companies to evaluate successively redesigned prototypes of the Tablet PC. The field trials helped identify issues in usage and usability that occurred during long-term use. For the first field trial, which served as a dry run for the later trials, we started with a rudimentary prototype that had only a small subset of the ultimate technology. In the two subsequent trials we used more sophisticated prototypes.
The Field Trial Process
We hope to discuss the study process and methodology in more detail in a later article, but here we can share some of the key considerations:
Participants. We learned that we could not rely on traditional screeners and self-report for recruiting participants. Instead, we needed to do an in-depth study of candidates' jobs and work practices. This allowed us to understand the baseline work practices, while balancing the sample in a variety of ways and ensuring there was a rich spectrum of work styles. It is also crucial to have a large enough sample of participants so that individual differences and interaction styles can be thoroughly examined. The first study (the pilot) had 19 participants, the second study had 21, and the third study, which was geared to examine changes made since the first study, had only 7.
Visits. In the first trial, we visited participants twice in one week. In the second and third trials, we began with an initial selection visit with all candidates. We then visited each person 5 times over 4 weeks in the second trial, and 7 times over 6 weeks in the third trial. Each visit lasted approximately 2 hours. The visit protocols varied, but included training, scripted usability tasks, observation of the participant using the Tablet PC in various work situations, artifact walkthroughs, and semi-structured interviews. In addition to the scheduled visits, we carried out brief impromptu visits with all users at various times.
Observers. Rotating teams of Microsoft Tablet PC team members observed all visits. We had to balance the need for exposure of the team to real users and the need for real-time input into the development process with the risk of premature closure based on incomplete results.
Training of users. In deciding how much support and guidance to give users, we had to balance our desire to simulate typical conditions (in which users do not necessarily read manuals or receive corporate training) with the need to make sure that users did not simply hit a dead end beyond which we would learn nothing more. We addressed this by providing users with basic initial training that alerted them to the potential capabilities of the Tablet PC and prepared them to begin exploration. Beyond that, we treated situations in which users needed support and guidance as opportunities for more detailed usability investigation, while also providing the needed assistance.
Debriefing. We followed each visit with an intensive debrief during which we collected observations from all team members who were present. At the end of each day, we conducted a daily debrief, during which we identified key observations, themes, and hypotheses. We synthesized observations on at least a weekly basis, to identify trends, to provide input to the development team, and to refine the study focus.
What we learned
In this process, we were able to discover a great deal about how people integrated the Tablet PC into their work life. We had an increased opportunity for what we called "opportunistic findings" - things which just happened to occur when we were present, either in our scheduled meetings or when we dropped in to see how people were doing. Over the course of the studies, we developed a personal rapport with each of the individuals and gained their trust, which made it easy for them to share their frustrations and triumphs with us. Some of the participants worked together and helped each other to learn their newly-discovered "tricks." In this way, we were able to trace the development of new work patterns, and to see the collaborative learning curve first hand. We identified not only design changes, but also infrastructure and training issues, allowing the team to proactively address these before fielding the final product.
We discovered, not surprisingly, that integration of a new technology evolves over time, confirming our decision to study users over weeks rather than hours or days. We were able to keep track of stages in the learning and adoption process, and to identify obstacles at different points. As people used the Tablet PC to do their real jobs and to create and save their own files, we watched them discover new ways of using the Tablet PC and saw the actual scenarios of usage as they evolved. We were then able to apply what we learned directly to the product design.
The Tablet PC team has likened the Field Trials to shipping a first version (V1) of the Tablet PC and learning about the problems from 47 users rather than from millions of initial users. Because the team got to observe the visits - either in person or via videotapes - they gained a much better sense of the users, their tasks, and their concerns. Most features were completely re-architected. Both functionality and specific design elements were modified to improve usability, utility, and fit with the users' work lives. We believe the end result is that the real product that will ship later in 2002 will be more usable and more useful to the actual users.
Boxed text:
Some Process Hints for Field Trials
o Pilot the process. If you are new to doing Field Trials, we suggest that you run a pilot test to nail down the methodology before embarking on a full-blown trial. We found that this really helped, because the integration of a variety of methods is tricky and because it took some time to pinpoint the best combination of tools.
o Recruiting is trickier than with "traditional" studies. Because we had to be absolutely sure we were recruiting people who really fit the target market, we devised a two-step recruiting process. First, Microsoft identified target companies, and worked with people in those companies to identify a pool of potential candidates. We then interviewed each of them twice, first by phone and then in person, to understand their job, their technical expertise level, and their interest in participating. We were able to use this information to select a subset to participate in each trial. Motivation is particularly critical in this type of study, because the participants are committing a significant amount of time to the process. Therefore, it is also critical that their bosses buy in to their participation.
o Decide up front how you will handle, sort, and analyze the data. You will probably have more data than you have ever had to sift through before. Plus, you will be incredibly busy collecting more data daily. Therefore, it is useful to consider carefully how you will handle the data. We used an Access database built on an exhaustive data structure that we developed before we started the research, with ample places to put data (and create new categories) that didn't fit any of the existing categories. This allowed us to sift relatively rapidly through thousands of factoids on the fly.
o Logistics, logistics, logistics. As critical as logistics are in any ethnographic study, they are even more critical when you are doing a field trial. Because each visit is an investment in a long-term relationship with the user, it is very important that visits be coordinated, that the team be on time, and that any visiting members be briefed ahead of time, so that the user's time is used most effectively. A central calendar and tight coordination were required in these studies, especially when we had multiple teams going to several locations on a given day.
o Debrief long and often. We discovered that it was crucial to have several types of debrief. Thorough debriefing immediately after each visit allowed us to be sure we had captured all of the relevant facts, from all observers. Often this debriefing would take nearly as long as the visit itself. An additional debriefing at the end of the day allowed us both to adjust our procedure and to generate hypotheses about trends that we would watch for in future visits. A comprehensive debrief at the end of each week allowed us to combine the data from multiple teams who had participated in visits that week, and to synthesize the main findings from that week's research agenda.
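
For readers who want to try the data-handling hint above, here is a minimal sketch of the idea of a predefined category structure that still leaves room for new categories. It is an illustration only, not the Access database used in the trials: it uses plain Python, every class, field, and category name is hypothetical, and the sample notes are invented for the example rather than taken from the study.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical starting categories; the article does not list the real ones.
PREDEFINED_CATEGORIES = {"usability", "utility", "training", "infrastructure"}


@dataclass
class Observation:
    participant: str  # anonymized participant label
    visit: int        # visit number within the trial
    category: str     # predefined or newly created category
    note: str         # one observed "factoid"


@dataclass
class ObservationLog:
    categories: set = field(default_factory=lambda: set(PREDEFINED_CATEGORIES))
    observations: list = field(default_factory=list)

    def add(self, participant, visit, category, note):
        """Record one observation, creating the category if it is new."""
        category = category.strip().lower()
        self.categories.add(category)
        obs = Observation(participant, visit, category, note)
        self.observations.append(obs)
        return obs

    def new_categories(self):
        """Categories that emerged during the study, sorted by name."""
        return sorted(self.categories - PREDEFINED_CATEGORIES)

    def by_category(self):
        """Group notes by category for quick sifting during debriefs."""
        grouped = defaultdict(list)
        for obs in self.observations:
            grouped[obs.category].append(obs.note)
        return dict(grouped)


# Invented example entries, for illustration only.
log = ObservationLog()
log.add("P01", 1, "usability", "Could not find the eraser on first try")
log.add("P02", 2, "Meeting etiquette", "Hesitated to write on the tablet in front of a client")
log.add("P01", 3, "usability", "Now erases without hesitation")
```

Calling `log.by_category()` groups the notes for a debrief, and `log.new_categories()` lists the categories that were not anticipated before the research started, which is the kind of flexibility the hint recommends planning for.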