Data analysis as a discourse

At the Libre Graphics Meeting 2008 in Wroclaw, just before Michael Terry presents his project ingimp to an audience of curious GIMP developers and users, we meet up to talk more about ‘instrumenting GIMP’ and about the way Terry thinks data analysis could be done as a form of discourse. Michael Terry is a computer scientist working at the Human Computer Interaction Lab of the University of Waterloo, Canada and his main research focus is on improving usability in Open Source software. We speak about ingimp, a clone of the popular image manipulation programme GIMP, but with an important difference: ingimp allows users to record data about their usage into a central database, and subsequently makes this data available to anyone. This conversation was also published in the Constant publication Tracks in electr(on)ic fields.

Maybe we could start this conversation with a description of the ingimp project you are developing and why you chose to work on usability for GIMP?

So the project is ‘ingimp’, which is an instrumented version of GIMP: it collects information about how the software is used in practice. The idea is you download it, you install it, and then with the exception of an additional start up screen, you use it just like regular GIMP. So, our goal is to be as unobtrusive as possible to make it really easy to get going with it, and then to just forget about it. We want to get it into the hands of as many people as possible, so that we can understand how the software is actually used in practice. There are plenty of forums where people can express their opinions about how GIMP should be designed, or what’s wrong with it, there are plenty of bug reports that have been filed, there are plenty of usability issues that have been identified, but what we really lack is some information about how people actually apply this tool on a day to day basis. What we want to do is elevate discussion above just anecdote and gut feelings, and to say, well, there is this group of people who appear to be using it in this way, these are the characteristics of their environment, these are the sets of tools they work with, these are the types of images they work with and so on, so that we have some real data to ground discussions about how the software is actually used by people. You asked me now why GIMP? I actually used GIMP extensively for my PhD work. I had these little cousins come down and hang out with me in my apartment after school, and I would set them up with GIMP, and quite often they would start off with one picture, they would create a sphere, a blue sphere, and then they played with filters until they got something really different. I would turn to them looking at what they had been doing for the past twenty minutes, and would be completely amazed at the results they were getting just by fooling around with it.
And so I thought, this application has lots and lots of power, I’d like to use that power to prototype new types of interface mechanisms. So I created JGimp, which is a Java based extension for the 1.0 GIMP series, that I can use as a back-end for prototyping novel user interfaces. I think that it is a great application, there is a lot of power to it, and I had already an investment in its code base so it made sense to use that as a platform for testing out ideas of open instrumentation.

What is special about ingimp, is the fact that the data you generate is made by the software you are studying itself. Could you describe how that works?

Every bit of data we collect, we make available: you can go to the website, you can download every log file that we have collected. The intent really is for us to build tools and infrastructure so that the community itself can sustain this analysis, can sustain this form of usability. We don’t want to create a situation where we are creating new dependencies on people, or where we are imposing new tasks on existing project members. We want to create tools that follow the same ethos as Open Source development, where anyone can look at the source code, where anyone can make contributions, from filing a bug to doing something as simple as writing a patch, where they don’t even have to have access to the source code repository, to make valuable contributions. So importantly, we want to have a really low barrier to participation. At the same time, we want to increase the signal-to-noise ratio. Yesterday I talked with Peter Sikking, an information architect working for GIMP, and he and I both had this experience where we work with user interfaces, and since everybody uses an interface, everybody feels they are an expert, so there can be a lot of noise. So, not only did we want to create an open environment for collecting this data, and analysing it, but we also want to increase the chance that we are making valuable contributions, and that the community itself can make valuable contributions. Like I said, there is enough opinion out there. What we really need to do is to better understand how the software is being used. So, we have made a point from the start to try to be as open as possible with everything, so that anyone can really contribute to the project.

I have started analysing the data, and I think one of the things that we realised early on is that it is a very rich data set; we have lots and lots of data. So, after a year we’ve had over 800 installations, and we’ve collected about 5000 log files, representing over half a million commands, representing thousands of hours of the application being used. And one of the things you have to realise is that when you have a data set of that size, there are so many different ways to look at it that my particular perspective might not be enough. Even if you sit someone down, and you have him or her use the software for twenty minutes, and you videotape it, then you can spend hours analysing just that twenty minutes of videotape. And so, I think that one of the things we realised is that we have to open up the process so that anyone could easily participate. We have the log files available, but there really wasn’t an infrastructure for analysing them. So, we created this new piece of software called 'StatsJam', an extension to MediaWiki, which allows anyone to go to the website and embed SQL queries against the ingimp data set and then visualise those results within the Wiki text. So, I’ll be announcing that today and demonstrating that, but I have been using that tool now for a week to complement the existing data analysis we have done. One of the first things that we realised is that we have over 800 installations, but then you have to ask, how many of those are really serious users? A lot of people probably just were curious, they downloaded it and installed it, found that it didn’t really do much for them and so maybe they don’t use it anymore. So, the first thing we had to do is figure out which data points we should really pay attention to. We decided that a person should have saved an image, and they should have used ingimp on two different occasions, preferably at least a day apart, where they’d saved an image on both of the instances.
We used that as an indication of what a serious user is. So with that filter in place, then the '800 installations' drops down to about 200 people. So we had about 200 people using ingimp, and looking at the data this represents about 800 hours of use, about 4000 log files, and again still about half a million commands. So, it’s still a very significant group of people. 200 people is still a lot, and that’s a lot of data, representing about 11000 images they have been working on, there’s just a lot.
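The 'serious user' filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the session tuples, the field names and the function serious_users are hypothetical and are not taken from the ingimp code or database, and the 'preferably at least a day apart' rule is treated here as a strict one-day minimum.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def serious_users(sessions, min_gap=timedelta(days=1)):
    """Return installation ids that saved an image in sessions at least min_gap apart.

    sessions is an iterable of (install_id, started_at, saved_image) tuples.
    This layout is a hypothetical stand-in, not the real ingimp schema.
    """
    saves = defaultdict(list)
    for install_id, started_at, saved_image in sessions:
        if saved_image:
            saves[install_id].append(started_at)
    # An installation qualifies when its earliest and latest saving sessions
    # are separated by at least min_gap.
    return {
        install_id
        for install_id, times in saves.items()
        if max(times) - min(times) >= min_gap
    }


sessions = [
    ("a", datetime(2008, 1, 1, 10), True),
    ("a", datetime(2008, 1, 3, 9), True),
    ("b", datetime(2008, 1, 1, 10), True),   # only one saving session
    ("c", datetime(2008, 1, 1, 10), True),
    ("c", datetime(2008, 1, 1, 11), False),  # second session saved nothing
]
print(serious_users(sessions))
```

In this made-up sample only installation "a" passes the filter, which mirrors how a raw installation count shrinks once the criterion is applied.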

From that group, what we found is that use of ingimp is really short and versatile. So, most sessions are about fifteen minutes or less, on average. There are outliers, there are some people who use it for longer periods of time, but really it boils down to them using it for about fifteen minutes, and they are applying fewer than a hundred operations when they are working on the image. I should probably be looking at my data analysis as I say this, but they are very quick, short, versatile sessions, and when they use it, they use fewer than 10 different tools, or they apply fewer than 10 different commands when they are using it. What else did we find? We found that the two most popular monitor resolutions are 1280 by 1024 and 1024 by 768. So, those represent collectively 60% of the resolutions, and really 1280 by 1024 represents pretty much the maximum for most people, although you have some higher resolutions. So one of the things that’s always contentious about GIMP, is its window management scheme and the fact that it has multiple windows, right? And some people say, well you know this works fine if you have two monitors, because you can throw out the tools on one monitor and then your images are on another monitor. Well, about 10% to 15% of ingimp users have two monitors, so that design decision is not working out for most of the people, if that is the best way to work. These are things I think that people have been aware of, it’s just now we have some actual concrete numbers you can turn to and say, now this is how people are using it. There is a wide range of tasks that people are performing with the tool, but they are really short, bursty tasks.

Every time you start up ingimp, a screen comes up asking you to describe what you are planning to do and I am interested in the kind of language users invent to describe this, even when they sometimes don’t know exactly what it is they are going to do. So inventing language for possible actions with the software, has in a way become a creative process that is now shared between interface designer, developer and user. If you look at the ‘activity tags’ you are collecting, do you find a new vocabulary developing?

I think there are 300 to 600 different activity tags that people register within that group of ‘significant users’. I didn’t have time to look at all of them, but it is interesting to see how people are using that as a medium for communicating to us. Some people will say, Just testing out, ignore this! Or, people are trying to do things like insert HTML code, to do like a cross-site scripting attack, because, you have all the data on the website, so they will try to play with that. Some people are very sparse and they say ‘image manipulation’ or ‘graphic design’ or something like that, but then some people are much more verbose, and they give more of a plan, This is what I expect to be doing. So, I think it has been interesting to see how people have adopted that and what’s nice about it, is that it adds a really nice human element to all this empirical data.

I wanted to ask you about the data, without getting too technical, could you explain how these data are structured, what do the log files look like?

So the log files are all in XML, and generally we compress them, because they can get rather large. And the reason that they are rather large is that we are very verbose in our logging. We want to be completely transparent with respect to everything, so that if you have some doubts or if you have some questions about what kind of data has been collected, you should be able to look at the log file, and figure out a lot about what that data is. That’s how we designed the XML log files, and it was really driven by privacy concerns and by the desire to be transparent and open. On the server side we take that log file and we parse it out, and then we throw it into a database, so that we can query the data set.
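As a rough illustration of the server-side step described here (parse an XML log, load it into a database, then query it), the following Python sketch uses only the standard library. The log layout, the element and attribute names and the table schema are all invented for the example; the interview does not describe the real ingimp log format.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical log layout, invented for this example.
SAMPLE_LOG = """
<log session="s1">
  <command name="crop" time="12.5"/>
  <command name="scale-image" time="20.1"/>
  <command name="crop" time="31.0"/>
</log>
"""


def load_log(conn, xml_text):
    """Parse one XML log and insert one row per command element."""
    root = ET.fromstring(xml_text.strip())
    session = root.get("session")
    rows = [
        (session, cmd.get("name"), float(cmd.get("time")))
        for cmd in root.iter("command")
    ]
    conn.executemany("INSERT INTO commands VALUES (?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commands (session TEXT, name TEXT, time REAL)")
inserted = load_log(conn, SAMPLE_LOG)

# Once the log is in a table, questions about usage become SQL queries.
counts = conn.execute(
    "SELECT name, COUNT(*) FROM commands GROUP BY name ORDER BY name"
).fetchall()
print(inserted, counts)
```

Keeping the raw log verbose and doing the aggregation afterwards in SQL matches the transparency goal in the answer: the file stays readable on its own, and the database is just a convenient index over it.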

Now we are talking about privacy… I was impressed by the work you have done on this; the project is unusually clear about why certain things are logged, and other things not; mainly to prevent the possibility of ‘playing back’ actions so that one could identify individual users from the data set. So, while I understand there are privacy issues at stake I was wondering… what if you could look at the collected data as a kind of scripting for use? Writing a choreography that might be replayed later?

Yes, we have been fairly conservative with the type of information that we collect, because this really is the first instance where anyone has captured such rich data about how people are using software on a day to day basis, and then made all that data publicly available. When a company does this, they will keep the data internally, so you don’t have this risk of someone outside figuring something out about a user that wasn’t intended to be discovered. We have to deal with that risk, because we are trying to go about this in a very open and transparent way, which means that people may be able to subject our data to analysis or data mining techniques that we haven’t thought of and extract information that we didn’t intend to be recording in our file, but which is still there. So there are fairly sophisticated techniques where you can do things like look at audio recordings of typing and the timings between keystrokes, and then work backwards with the sounds made to figure out the keys that people are likely pressing. So, just with keyboard audio and keystroke timings alone you can often get enough information to be able to reconstruct what people are actually typing. So we are always sort of wary about how much information is in there. While it might be nice to be able to do something like record people’s actions and then share that script, I don’t think that that is really a good use of ingimp. That said, I think it is interesting to ask, could we characterize people’s use enough, so that we can start clustering groups of people together and then providing a forum for these people to meet and learn from one another? That’s something we haven’t worked out. I think we have enough work cut out for us right now just to characterize how the community is using it.

It was not meant as a feature request, but as a way to imagine how usability research could flip around and also become productive work.

Yes, totally. I think one of the things that we found when bringing people in to assess the basic usability of the ingimp software and ingimp website, is that people like looking at things like what commands other people are using, what the most frequently used commands are, and part of the reason that they like that, is because of what it teaches them about the application. So they might see a command they were unaware of. So we have toyed with the idea of then providing not only the command name, but then a link from that command name to the documentation. I didn’t have time to implement it, but certainly there are possibilities like that, you can imagine.

Maybe another group can figure something out like that? That’s the beauty of opening up your software plus data set of course. Well, just a bit more on what is logged and what not… Maybe you could explain where and why you put the limit and what kind of use you might miss out on as a result?

I think it is important to keep in mind that whatever instrument you use to study people, you are going to have some kind of bias, you are going to get some information at the cost of other information. So if you do a videotaped observation of a user and you just set up a camera, then you are not going to find details about the monitor maybe, or maybe you are not really seeing what their hands are doing. No matter what instrument you use, you are always getting a particular slice. I think you have to work backwards and ask what kind of things do you want to learn. And so the data that we collect right now, was really driven by what people have done in the past in the area of instrumentation, but also by us bringing people into the lab, observing them as they are using the application, and noticing particular behaviours and saying, hey, that seems to be interesting, so what kind of data could we collect to help us identify those kinds of phenomena, or that kind of performance, or that kind of activity? So again, the data that we were collecting was driven by watching people, and figuring out what information will help us to identify these types of activities. As I’ve said, this is really the first project that is doing this, and we really need to make sure we don’t poison the well. So if it happens that we collect some bit of information, that then someone can later say, Oh my gosh, here is the person’s file system, here are the names they are using for the files or whatever, then it’s going to make the normal user population wary of downloading this type of instrumented application. This is the thing that concerns me most about Open Source developers jumping into this domain, is that they might not be thinking about how you could potentially impact privacy.

I don’t know, I don’t want to get paranoid. But if you are doing it, then there is a possibility someone else will do it in a less considerate way.

I think it is only a matter of time before people start doing this, because there are a lot of grumblings about, we should be doing instrumentation, someone just needs to sit down and do it. Now there is an extension out for Firefox that will collect this kind of data as well, so you know…

Maybe users could talk with each other, and if they are aware that this type of monitoring could happen, then that would add a different social dimension…

It could. I think it is a matter of awareness, really, so when we bring people into the lab and have them go to the ingimp website, download and install it and use it, and go check out the stats on the website, and then we ask questions like, what kind of data are we collecting? We have a lengthy consent agreement that details the type of information we are collecting and the ways your privacy could be impacted, but people don’t read it.

So concretely… what information are you recording, and what information are you not recording?

We record every command name that is applied to a document, to an image. Where your privacy is at risk with that, is that if you write a custom script, then that custom script’s name is going to be inserted into a log file. And so if you are working for example for Lucas or DreamWorks or something like that, or ILM, in some Hollywood movie studio and you are using ingimp and you are writing scripts, then you could have a script like 'fixing Shrek’s beard', and then that is getting put into the log file and then people are going to know that the studio uses ingimp. We collect command names, we collect things like what windows are on the screen, their positions, their sizes, we take hashes of layer names and file names. We take a string and then we create a hash code for it, and we also collect information about how long is this string, how many alphabetical characters, numbers, things like that, to get a sense of whether people are using the same files, the same layer names time and time again, and so on. But this is an instance where our first pass at this, actually left open the possibility of people taking those hashes and then reconstructing the original strings from that. Because we have the hash code, we have the length of the string, all you have to do is generate all possible strings of that length, take the hash codes and figure out which hashes match. And so we had to go back and create a new scheme for recording this type of information where we create a hash and we create a random number, we pair those up on the client machine but we only log the random number. So, from log to log then, we can track if people use the same image names, but we have no idea of what the original string was. 
There are these little 'gotchas', things to look out for, that I don’t think most people are aware of, and this is why I get really concerned about instrumentation efforts right now, because there isn’t this body of experience of what kind of data should we collect, and what shouldn’t we collect.
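The weakness of the first scheme and the idea behind the second one can be illustrated with a short Python sketch. Everything concrete in it is an assumption made for the example: the interview does not say which hash function ingimp uses, so SHA-256 stands in, and leaky_record, brute_force and TokenTable are hypothetical names, not ingimp code.

```python
import hashlib
import itertools
import secrets
import string


def leaky_record(name):
    """First-pass style record: a hash of the string plus its length."""
    return hashlib.sha256(name.encode()).hexdigest(), len(name)


def brute_force(digest, length, alphabet=string.ascii_lowercase):
    """Recover a short name by hashing every candidate of the known length."""
    for chars in itertools.product(alphabet, repeat=length):
        candidate = "".join(chars)
        if hashlib.sha256(candidate.encode()).hexdigest() == digest:
            return candidate
    return None


class TokenTable:
    """Pairs each hash with a random token on the client; only the token is logged."""

    def __init__(self):
        self._tokens = {}  # in this sketch the table never leaves the user's machine

    def token_for(self, name):
        digest = hashlib.sha256(name.encode()).hexdigest()
        if digest not in self._tokens:
            self._tokens[digest] = secrets.token_hex(8)
        return self._tokens[digest]


# Logging the hash together with the length lets an outsider enumerate candidates.
digest, length = leaky_record("cat")
recovered = brute_force(digest, length)
print(recovered)

# Logging only a random token still shows when the same name is reused,
# but the token carries no information about the original string.
table = TokenTable()
print(table.token_for("cat") == table.token_for("cat"))
print(table.token_for("cat") == table.token_for("dog"))
```

The point of the second scheme is that the logged value is not derived from the string at all, so there is nothing for an attacker to enumerate against, while repeated use of the same name on one machine remains visible from log to log.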

As we are talking about this, I am already more aware of what data I would allow to be collected. Do you think by opening up this data set and the transparent process of collecting and not collecting, this will help educate users about these kinds of risks?

It might, but honestly I think probably the thing that will educate people the most is if there were a really large privacy error that got a lot of news coverage, because then people would become more aware of it. Right now (and this is not to say that we want that to happen with ingimp), when we bring people in and we ask them about privacy, Are you concerned about privacy?, they say No, and we say Why? Well, they inherently trust us, but the fact is that Open Source also lends a certain amount of trust to it, because they expect that since it is Open Source, the community will in some sense police it and identify potential flaws with it.

Is that happening? Are you in dialogue with the Open Source community about this?

No, I think probably five to ten people have looked at the ingimp code – realistically speaking I don’t think a lot of people looked at it. Some of the GIMP developers took a gander at it to see how could we put this upstream, but I don’t want it upstream, because I want it to always be an opt-in, so that it can’t be turned on by mistake.

You mean you have to download ingimp and use it as a separate program? It functions in the same way as GIMP, but it makes the fact that it is a different tool very clear.

Right. You are more aware, because you are making that choice to download that, compared to the regular version. There is this awareness about that. We have this lengthy text based consent agreement that talks about the data we collect, but less than two percent of the population reads license agreements. And, most of our users are actually non-native English speakers, so there are all these things that are working against us. So, for the past year we have really been focussing on privacy, not only in terms of how we collect the data, but how we make people aware of what the software does. We have been developing wordless diagrams to illustrate how the software functions, so that we don’t have to worry about localisation errors as much. And so we have these illustrations that show someone downloading ingimp, starting it up, a graph appears, there is a little icon of a mouse and a keyboard on the graph, and they type and you see the keyboard bar go up, and then at the end when they close the application, you see the data being sent to a web server. And then we show snapshots of them doing different things in the software, and then show a corresponding graph change. So, we developed these by bringing in both native and non-native speakers, having them look at the diagrams and then tell us what they meant. We had to go through about fifteen people and continual redesign until most people could understand and tell us what they meant, without giving them any help or prompts. So, this is an ongoing research effort, to come up with techniques that not only work for ingimp but also for other instrumentation efforts, so that people can become more aware of the implications.

Can you say something about how this type of research relates to classic usability research and in particular to the usability work that is happening in GIMP?

Instrumentation is not new, commercial software companies and researchers have been doing instrumentation for at least ten years, probably ten to twenty years. So, the idea is not new but what is new, in terms of the research aspects of this, is how do we do this in a way where we can make all the data open? The fact that you make the data open, really impacts your decision about the type of data you collect and how you are representing it. And you need to really inform people about what the software does. But I think your question is… how does it impact the GIMP’s usability process? Not at all, right now. But that is because we have intentionally been laying off to the side, until we got to the point where we had an infrastructure, where the entire community could really participate with the data analysis. We really want to have this to be a self-sustaining infrastructure, we don’t want to create a system where you have to rely on just one other person for this to work.

Collecting data is not hard. The challenge is to understand the data, and I don’t want to create a situation where the community is relying on only one person to do that kind of analysis, because this is dangerous for a number of reasons. First of all, you are creating a dependency on an external party, and that party might have other obligations and commitments, and might have to leave at some point. If that is the case, then you need to be able to pass the baton to someone else, even if that could take a considerable amount of time and so on. You also don’t want to have this external dependency, because of the richness in the data, you really need to have multiple people looking at it, and trying to understand and analyse it. So how are we addressing this? It is through this StatsJam extension to MediaWiki that I will introduce today. Our hope is that this type of tool will lower the barrier for the entire community to participate in the data analysis process, whether they are simply commenting on the analysis we made or taking the existing analysis, tweaking it to their own needs, or doing something brand new.

In talking with members of the GIMP project here at the Libre Graphics Meeting, they started asking questions like, So how many people are doing this, how many people are doing this and how many this? They’ll ask me while we are sitting in a café, and I will be able to pop the database open and say, A certain number of people have done this, or, no one has actually used this tool at all. The danger is that this data is very rich and nuanced, and you can’t really reduce these kinds of questions to an answer of N people do this, you have to understand the larger context. You have to understand why they are doing it, why they are not doing it. So, the data helps to answer some questions, but it generates new questions. It gives you some understanding of how the people are using it, but then it generates new questions of, Why is this the case? Is this because these are just the people using ingimp, or is this some more widespread phenomenon? They asked me yesterday how many people are using this colour picker tool (I can’t remember the exact name), so I looked and there was no record of it being used at all in my data set. So I asked them when did this come out, and they said, Well it has been there at least since 2.4. And then you look at my data set, and you notice that most of my users are in the 2.2 series, so that could be part of the reason. Another reason could be, that they just don’t know that it is there, they don’t know how to use it and so on. So, I can answer the question, but then you have to sort of dig a bit deeper.

You mean you can’t say that because it is not used, it doesn’t deserve any attention?

Yes, you just can’t jump to conclusions like that, which is again why we want to have this community website, which shows the reasoning behind the analysis. Here are the steps we had to go through to get this result, so you can understand what that means, what the context means, because if you don’t have that context, then it’s sort of meaningless. It’s like asking, what are the most frequently used commands? This is something that people like to ask about. Well really, how do you interpret that? Is it the number of times it has been used across all log files? Is it the number of people that have used it? Is it the number of log files where it has been used at least once? There are lots and lots of ways in which you can interpret this question. So, you really need to approach this data analysis as a discourse, where you are saying, here are my assumptions, here is how I am getting to this conclusion, and this is what it means for this particular group of people. So again, I think it is dangerous if one person does that and you come to rely on that one person. We really want to have lots of people looking at it, and considering it, and thinking about the implications.
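To make the three readings of 'most frequently used command' concrete, here is a small Python sketch over made-up data. The users, commands and log layout are invented for the illustration and do not come from the ingimp data set.

```python
from collections import Counter

# Each entry is one log file: (user id, commands applied in that session).
logs = [
    ("u1", ["crop", "crop", "crop", "scale"]),
    ("u1", ["crop", "blur"]),
    ("u2", ["scale", "blur"]),
    ("u3", ["scale"]),
]

total_uses = Counter()        # reading 1: times used across all log files
logs_per_command = Counter()  # reading 3: log files where it appears at least once
users_per_command = {}        # reading 2: distinct people who used it

for user, commands in logs:
    total_uses.update(commands)
    for name in set(commands):
        logs_per_command[name] += 1
        users_per_command.setdefault(name, set()).add(user)

distinct_users = Counter(
    {name: len(users) for name, users in users_per_command.items()}
)

print("by total uses:    ", total_uses.most_common())
print("by distinct users:", distinct_users.most_common())
print("by log files:     ", logs_per_command.most_common())
```

In this toy data 'crop' leads on total uses because a single person applied it repeatedly, while 'scale' leads on both distinct users and log files. The same question gets different answers depending on which assumption is stated, which is the reason for showing the reasoning alongside the result.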

Do you expect that this will impact the kind of interfaces that can be done for GIMP?

I don’t necessarily think it is going to impact interface design, I see it really as a sort of reality check: this is how communities are using the software and now you can take that information and ask, do we want to better support these people or do we… For example in my data set, most people are working on relatively small images for short periods of time, the images typically have one or two layers, so they are not really complex images. So regarding your question, one of the things you can ask is, should we be creating a simple tool to meet these people’s needs? All these people are doing is just cropping and resizing, fairly common operations, so should we create a tool that strips away the rest of the stuff? Or, should we figure out why people are not using any other functionality, and then try to improve the usability of that? There are so many ways to use data I don’t really know how it is going to be used, but I know it doesn’t drive design. Design happens from a really good understanding of the users, the types of tasks they perform, the range of possible interface designs that are out there, lots of prototyping, evaluating those prototypes and so on. Our data set really is a small potential part of that process. You can say, well according to this data set, it doesn’t look like many people are using this feature, let’s not focus too much on that, let’s focus on these other features or conversely, let’s figure out why they are not using them… Or you might even look at things like how big their monitor resolutions are, and say well, given the size of the monitor resolution, maybe this particular design idea is not feasible. But I think it is going to complement the existing practices, in the best case.

And do you see a difference in how interface design is done in free software projects, and in proprietary software?

Well, I have been mostly involved in the research community, so I don’t have a lot of exposure to design projects. I mean, in my community we are always trying to look at generating new knowledge, and not necessarily at how to get a product out the door. So, the goals or objectives are certainly different. I think one of the dangers in your question is that you sort of lump a lot of different projects and project styles into one category of 'Open Source'. 'Open Source' ranges from volunteer driven projects to corporate projects, where they are actually trying to make money out of it. There is a huge diversity of projects that are out there; there is a wide diversity of styles, there is as much diversity in the Open Source world as there is in the proprietary world. One thing you can probably say, is that some projects that are completely volunteer driven, like GIMP, are resource strapped. There is more work than they can possibly tackle with the number of resources they have. That makes it very challenging to do interface design, I mean, when you look at interface code, it accounts for 50% or 75% of a code base. That is not insignificant, it is very difficult to hack and you need to have lots of time and manpower to be able to do significant things. And that’s probably one of the biggest differences you see for the volunteer driven projects, it is really a labour of love for these people and so very often the new things interest them, whereas with a commercial software company developers are going to have to do things sometimes they don’t like, because that is what is going to sell the product.