Session Transcripts

A live transcription team captured the SRCCON sessions that were most conducive to a written record—about half the sessions, in all.

How Can We Best Monetize Data—And Do So Responsibly, Ethically and Without Sacrificing Journalistic Standards?

Session Facilitator(s): AmyJo Brown, Ryann Grochowski Jones

Day & Time: Friday, 3-4pm

Room: Ski-U-Mah

Hi everyone, welcome to the session, How Can We Best Monetize Data Responsibly, Ethically, And Without Sacrificing Journalistic Standards? My name is Ryann Grochowski Jones, I’m a data reporter at ProPublica in New York.

And my name is AmyJo Brown and I work at War Streets Media. We are building a news publication in Pittsburgh from scratch.

So AmyJo and I both have an interest in monetizing data. At ProPublica last February, February 2014, we opened a data store, which had been in the works for a long time at our organization. It’s a place where we put up for free download any datasets that we received through FOIA or right-to-know requests, and we also sell versions of those datasets that we have cleaned, combined, verified, and all of that good stuff.

And we have been cleaning and working a lot with campaign finance data locally, and some other voter registration records and records like that, so we are actually just starting to work through a lot of these questions: what do we want to allow people to download in bulk for free, what do we want to license, possibly through, you know, an API, and when should we be charging for the data, considering that it is public data. So, where those lines are. What we were hoping to do during this session is cover and discuss a lot of the topics that we have found ourselves working through, and we want to have everybody brainstorm how they might solve some of these problems. The first thing that we wanted to do: on all of your tables are post-it notes and sharpies, and for the next five minutes, we want you to brainstorm with your group about what public datasets you would put in a data store. We’ve got city data, county data, and state data, so we want you to think about it in that context. The two tables over on the far left here can consider city data (what public data at the city level do you think you might want to put in a data store?), the middle section can think about the county level (what county data might be available that you think should go into a data store?), and you guys can think at the state level. So do that for a few minutes, and as you gather your post-its, feel free to put them up there on the appropriate poster.

[group activity]

One more minute. [group activity]

OK, so we’re going to have each group come up and put their post-its on the right poster and we want you to come and just share with the group the list that you came up with. So who wants to go first?

Gina, I’m going to give you the mic and –

Thank you. OK, well we just started brainstorming, we found that county data is a little bit squishy, because there’s often city limits, but you want to be inclusive of population for land areas for coverage for public services like fire or police or what have you, so that’s kind of what this list is based on. County election results, county employee salaries, emergency response times for like fire, law enforcement, you know, arrest records if the Sheriff’s Department is what oversees that area. Water usage, school data, enrollment, performance, dropout rates, liquor and business licenses, property values, and county records like vaccination. So that’s what we came up with.

Thank you. You guys need us?

It’s actually interesting: there’s a lot of overlap, and that’s probably going to be true throughout. There’s a lot of fuzziness between all these different things. So, same things: licenses, crime incidents. We tried to think about things like weather, not necessarily on a climatological level but in terms of its effects, like in New York City, the flood maps and things like that. Taxi data, municipal calendar data. We just totally brainstormed: public geodata, municipal government voter data. Neighborhood shapes (nobody knows what a neighborhood is, especially the city) and of course the demographics of those neighborhoods, so those shapes and demographics. Geodata on parks, and then things like restaurant inspections, things like that.

Thank you, and this group over here?

Again, a lot of overlap. Let’s see: school test scores, good to look at that at a state level. Inspections of many different kinds. We talked about infrastructure inspections and day care inspections, so lots of different kinds of inspections. It would be nice if we could get health stats at a state level, right, like CDC for the whole state instead of the whole country. Election results, the state budget; stuff on weather would be interesting. Highway accidents, workplace safety inspections and accidents, personnel information about various state agencies, legislative voting records, bill tracking at the legislative level. Crime stats, which I think were mentioned in both other categories, but, you know, crime stats at a state level are good, because then you can do some compare and contrast. Pension funds and other financials there, pension payments. Illegal drug data. Welfare data: who does the state pay to pay out social-worker payments?

We barely scratched the surface.

Yeah, give us like another day or so.

I was going to say five minutes wasn’t enough?

[laughter]

So the next exercise that we are going to do is to, on your tables you all should see some dot stickers, green and yellow, and you should each get three of each, so if anyone needs an extra, raise their hand, we can come around. But what we’re going to do with those, so now you have to prioritize what you’re going to put in your data store, because you have limited resources and limited people to get these datasets and clean them for you. So we have two ways that you’re going to prioritize, the first way is with the green sticker. Which data would you consider to be most valuable to have on hand for the newsroom? So you’re going to clean a dataset, so maybe what would a newsroom really want to have on hand during a breaking news event or during election coverage, so examples, use of force was one that has come up where, you know, a reporter may want to be able to check those records during a breaking news event, so it would be great to have those on hand, so there’s a good example. And then the yellow is, what do you think is going to be the most marketable to an outside party? So this could be research, businesses, anybody like where would you place the highest value? So you get three stickers there, as well.

And the stickers are not mutually exclusive. For the same dataset, if you believe that it is both very valuable to a newsroom and marketable to a third party, you can use both a green and a yellow sticker. So we’ll give you another five minutes or so; feel free to get up, go over to the lists, and sticker the post-it notes.

I have a question, is it in any category or just your category?

Not necessarily one per category? Yeah, you just have three stickers.

Yes, you can use your stickers however you’d like. [group activity]

As we’re finishing up putting our stickers on our datasets, does anybody want to share with us which datasets they chose for which categories and why?

Raise your hand and we’ll come to you.

I—well, my yellow stickers, I feel like I have more experience with, and I chose the contracts and procurement at city and state levels. I work with a lot of industries that want contracts, and learning what they should spend their resources on going after is very valuable to them. So I did that, and what was the second one I did? At the county level. No, wait, whatever. Those are the most important things, because I think for the other one I did, everyone agreed with me.

Anyone else want to share why they chose what they did?

One of the ones I put a yellow on is restaurant inspections because I think that’s one of the things that is not out there well enough and yet everybody would love to see it. Of course there’s the one flaw that this data is not always super—it’s got a lot of hiccups to it, but still the concept, I think people would buy it, so I think they’d spend money on it.

Yeah, the green ones: I think mine matched a lot of other folks’, like on the county level the school enrollment data, because I think that has value to realtors and businesses, you know, knowing where kids are.

I thought that climate and weather data at the state level would be useful for companies that wanted to do sort of like broader forecasting for agriculture, things like that.

Did anybody put a green and a yellow sticker on the same dataset?

Nobody? Really? That is interesting to me.

I think that was more out of like a vague sense of like trying to spread things out.

True. Yeah, when you only have six total stickers, you can’t be blowing them all on one dataset, I suppose.

I think inspections deserves both.

Well, Amy Jo, why don’t you say what the consensus was.

So I grabbed the ones that had the most stickers, and the one I found was property values; it’s mostly yellow, with one green. I know we just kind of talked a little bit, but in terms of the property values, because there’s so much yellow there: if you put one of your yellow stickers here, can you tell us a bit about who would be interested in that data? Who would be the audience that you think would buy it?

You can just shout it out.

Zillow.

Zillow? They’re already getting it from somebody else.

Well, the question would be what would be most valuable. What is most valuable. So it’s like already proven.

There’s all kinds of people that want to aggregate and show you like—I mean even more house hunting for like an apartment or something, seeing the comps around the area for like how much an apartment would be or something, I don’t know.

So individuals might be interested in that data, as well?

Yeah.

Do you think that an individual might pay to download that?

Yeah, maybe as part of a service or something.

Depends on how much it costs. As a home buyer, you know, it’s a moment where you’re a little bit sensitive and you’d be willing to do it. I think there’s maybe—I don’t want to be too cynical, but if you have address data, you can correlate it to identity, and then correlate that back to property value. I think there are a lot of interests there that would pay a lot of money for that.

Throw out some examples? I was thinking maybe insurance? Might be interested in knowing how much your house is worth? Credit companies? Who else?

I mean marketers in general could use that as a proxy for your overall economic status.

I mean they already kind of do at a neighborhood level, but –

I mean this data is already available for free. I mean I own a house, and the amount of stuff that I get in my mailbox that is only there because my name got scraped off of property rolls is kind of staggering, and very varied in what people are actually trying to sell me. I think to make it something that has market value, it’s less about making the data available than about the format in which it is available. If companies that want to use it are struggling to get data from the official source and transform it into an API that’s actually useful to them, then being the service that is that API could be valuable.

Expanding on that, if we layered on census data and some other things to add some value that’s not already in there, I think that’s the only way; it’s the value-add. So I’m the one who put the green up there, and I’ve been thinking a lot about this data, because here in Minnesota, this data was under essentially lock and key for a long, long time. They charged a penny a parcel, which means you’re talking a thousand dollars for one county, so we never got it. This past year somebody convinced all the counties to unlock it and now it’s free; you can download the whole Twin Cities in one swoop. So we’re starting to think, you know, what can we do with this, and I think it’s got a lot of newsroom value, probably more than commercial value right now, because the people who would use it commercially or pay for it are already getting it directly.

What about state contracts? A lot of yellow dots on that one, as well. Who would be the market or who do you think would buy state contracts if you offer that in a database or be interested in it.

Companies that want to get state contracts.

Or subcontracts with somebody who already has a state contract?

[laughter]

Yes. People who consult to help companies get subcontracts.

Even businesses who want to contract out something, they want to see how much state pays.

Market analysis?

Yeah.

It was specifically mentioned that welfare contracts might be of interest. Who would be interested in the welfare contracts, perhaps commercially?

I think that was a green sticker thought that didn’t win a green sticker race in the end.

Would there be academics that might be interested in that dataset?

I’m thinking like watchdog groups, there’s a lot of—you know recently looking at how the Red Cross behaved in Haiti, and kind of, I think the last decade or so there’s been a lot of consciousness about how groups are carrying out missions with money.

The next two are all green.

I know they are. So tell us about the crime incident data; this is all green. Anyone want to talk a little bit about why, in a newsroom, that might be of high value?

It’s fascinating. It’s a page-view generator; it’s about me. I can compare myself to my neighbors and be afraid, or be less afraid. So it’s both useful and fascinating, and I think a newsroom likes things like that because so much of what we write about is crime. The stories that we write are about crime, so having that data gives us a chance to put all of that in context and give readers the broader context, rather than, you know, somebody just killed their neighbor yesterday, right?

Several major stories in New York have emerged from analysis of arrest data of course in the last couple of years. I think it’s become a pretty reliable source of actual investigative news.

It’s also a great way to measure how well you feel your local government is doing at one of the most basic things it’s supposed to do.

We have election data, which I think came from the state poster, but I’m pretty sure there’s interest in city, county, and state. We decided not to do national, but obviously national too. That one is all green.

So what similar to what we just did with crime data, what do you think are the major reasons why it would be more valuable for a newsroom to have access to this dataset, versus perhaps some of the others?

Because we have to pay for it now and we shouldn’t.

So true.

But kind of like with crime, so much journalistic coverage is about political activity and the horse race, and having access to that data allows you to do analysis, make conjectures, you know, look for where things are changing, or put that story in context.

Aren’t things like that a requirement at this point? Of the places that people and data come together, it’s like, where is your up-to-the-minute election coverage? More than being the best thing to have, it’s kind of like you are forced to have it.

What about city budget and contracts? We’ve got a good mix there, so where are the commercial audiences, do you feel, for that, and what would be the primary newsroom interest?

Well, the newsroom interest is just, you know, watching the power and making sure that nothing untoward is happening, that people aren’t making deals with their brother-in-law, or that kind of thing: kickbacks or other undue relationships.

Wait, you live in Chicago?

Yeah.

[laughter]

What do you know about kickbacks, being from Chicago?

Whose brother-in-law are you?

What about the commercial aspects? Who do you think outside of the newsroom might be interested in that data?

I would say, I think in general with budgets, you know, in most cases those are public things, but in both commercial and news research scenarios, understanding what they mean is kind of a key thing. I do think from a business standpoint there’s anticipating, you know, new opportunities to contract with cities. If a budget line item significantly increased year after year, you might want to start putting some resources into, say, waste management, because you know that your city is going to spend more money on waste management.

OK, so now that we have done our exercise about what datasets are potentially out there, for both newsrooms and a general audience, and how we might prioritize or assign value to them, we have come up with three different topics. When AmyJo and I met to talk about this session, these were things that she and I both have either struggled with or discussed or just honestly don’t have answers for, and we wanted to open up a greater discussion for some of these. So we have three broad topics with some questions here. Liability is one. What liability concerns might come up? What if somebody downloads your data for free, or you sell them your clean data and you screwed up the cleaning and they reported on it or used it for something? Should you, as the news organization and the seller of the data, be liable? Privacy and redaction concerns: AmyJo, you want to talk about that?

Yeah, so this is the idea of you know, considering are you going to put all of your raw data up for download or are you going to hold some back?

Under what situations might you do that? Home addresses for example?

A lot of people would be commercially interested in getting a dataset, say campaign finance or something like that, that has home addresses in it. Are you as a news organization comfortable releasing that information, and if you are, what would be the guidelines for that? So the idea here is, when we break up into groups, we want you to think through where some of these problems come up and how you, if you were a news organization, might put some boundaries or guidelines around those decisions.

And then our last question is to charge or not to charge. Obviously our session is called monetizing data, so clearly people see some value in it, but I know when my editors came to me at ProPublica saying that they wanted to start a data store, I was at first very wary of that. I understood the value of what we were doing, and that we’d spent a lot of time and a lot of man- and woman-hours cleaning these datasets, but I was also like, open data is good. We eventually came to kind of a compromise that I think worked out, but we’ll talk about that when we go back to our discussions. I’d love to hear other thoughts: you know, if you do have data that you want to sell, how do you decide to put a price tag on it?

So do we want to –

Yeah, and I was just going to say, part of this exercise was the list of datasets and how we prioritized them. Think of those as you’re talking at your table, so it will ground you in some of the questions that you might have to ask. You can look at the data that you’ve already sort of said, oh, we’re going to consider this, so if there are fields or things in the property databases that you think shouldn’t be released, you can use those in how you think they inform your overall strategy. But we aren’t really thinking about it per dataset; you have to come up with some overall guidelines for how you’re going to run your data store. Does that make sense?

All right, so why don’t we have this side of the room take the liability issue. The middle section take the privacy and redaction issue, and you guys over there take the to charge or not to charge, and we’ll present thoughts here in the end. We’ll give you about ten minutes or so to kind of discuss these. [group activity]

OK, we’re going to come back together as a larger group. I’m going to pass the mic to each group. If the people in each smaller group want to pick a speaker and just tell us what you discussed and all the solutions to the problems that we’ve put up here, so that we can, you know, all take advantage of your smart and intelligent minds. So this group talked about liability issues. I’m going to give the mic to Joe.

So I don’t think there’s any order to this, but we talked about things like computational and editorial errors: if you’re aggregating data and you mishandle missing values, or if you’re doing cleanup that maybe leads to normalization errors, or if you’re just manually doing stuff with the data to make it easier for people to use and you make errors that are problematic. We talked about the right to release data; maybe some of the data you have, you’re allowed to release it in some places but not others, the European Union being different than America, for example. Of course, personally identifying information. Libel: I know, for example, ProPublica has had challenges with Dollars for Docs, where someone claims they have the wrong data, and whether or not they are right. Emergent errors—remind me, what was that? Is that like the New York taxi thing? It’s the idea that you might expose something by releasing your data: New York City taxi data was released without the identities of the drivers, but it was possible to sort of reverse engineer who was driving the cabs by linking cab drivers to Instagram photos and things like that. I think I got all the things, right? Yeah.
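As an aside, the taxi release that came up here is also a textbook case of weak anonymization at the dataset level. As was widely reported at the time, the medallion numbers were replaced with unsalted MD5 hashes, and because the space of valid medallions is tiny, every hash could be inverted by brute force. A minimal sketch of that failure mode; the four-character pattern below (one digit, one letter, two digits) is one of the real medallion formats, and everything else is illustrative:

```python
import hashlib
import string
from itertools import product

# Sketch: if IDs are "anonymized" with an unsalted hash and the ID space is
# small, a complete reverse-lookup table can be built in seconds.
def build_lookup():
    table = {}
    for d1, letter, d2, d3 in product(string.digits, string.ascii_uppercase,
                                      string.digits, string.digits):
        medallion = f"{d1}{letter}{d2}{d3}"
        table[hashlib.md5(medallion.encode()).hexdigest()] = medallion
    return table

lookup = build_lookup()                        # 26,000 entries, built instantly
anonymized = hashlib.md5(b"5X55").hexdigest()  # what the "redacted" file holds
print(lookup[anonymized])                      # recovers "5X55"
```

The same inversion works for every hash in the release, which is why hashing a low-entropy identifier is generally treated as no protection at all.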

Thanks. That actually dovetails great into privacy and redactions team.

Our topic did dovetail a little bit with some of the liability issues, although I think what we discussed is that there are two categories: there are legal concerns, and then there’s ethical and editorial decision-making. There are things that you can do that are perfectly legal but that you might not want to do from an ethical or editorial standpoint. We discussed at the beginning that it depends on where you get the information from: if you’re getting it from a public records request or a right-to-information request, there may be redactions already, or there should be; in some cases things should have been redacted and are not, so we discussed how you would handle that from an editorial standpoint. We discussed data that would otherwise be individualized or pertain to specific people, perhaps redacting addresses and things like that. One of the important points that was raised is to just know your data, and to know how large the set is. If there are only two or three people in a specific area, then that data can be used to identify those people, even though on its face it doesn’t. Is that problematic? It’s about knowing whether or not that’s an issue, and being consistent about how you decide what private, personal, individual information you release.
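The "only two or three people" concern can be checked mechanically before anything is published. This is a hedged sketch, not a standard recipe: the field names, the sample records, and the threshold k=5 are all assumptions for illustration. It counts how many records share each combination of quasi-identifying fields and redacts the combinations that fall below the threshold:

```python
from collections import Counter

# Sketch: suppress any quasi-identifier combination shared by fewer than k
# records, since a group of two or three people is re-identifiable even
# without names. Field names here are hypothetical.
def suppress_small_groups(records, quasi_ids=("zip", "age_band"), k=5):
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    released = []
    for r in records:
        key = tuple(r[q] for q in quasi_ids)
        if counts[key] >= k:
            released.append(r)
        else:
            redacted = dict(r)
            for q in quasi_ids:
                redacted[q] = "REDACTED"
            released.append(redacted)
    return released

sample = (
    [{"zip": "15213", "age_band": "30-39", "benefit": 1200}] * 2
    + [{"zip": "15217", "age_band": "40-49", "benefit": 900}] * 5
)
released = suppress_small_groups(sample, k=5)
print(released[0]["zip"])   # "REDACTED": only 2 records share this combination
print(released[-1]["zip"])  # "15217": 5 records share it, so it is released
```

Choosing which fields count as quasi-identifiers, and what k should be, is exactly the editorial judgment call being described; the code only makes the check repeatable.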

Thanks, and our cash money team? Who would like to –

Jump in if I miss something. I guess—I’m not sure we came to any major conclusions, but we talked about a couple of things. We all seemed to agree that if you have data that’s not easily available somewhere else, and/or if you’ve added some value to it, that really raises the ante in terms of whether you could charge for it or not. We also talked about the idea of having different prices for journalists versus non-journalists, you know, trying to make money off the commercial people so you could give it away free to the others. We also got into, if you did want to go down this route, one of the problems you might encounter: OK, so let’s say you have some data and somebody wants it, and you say, we’re going to charge that company for it, a one-off kind of deal. But then what if you want to do this on a regular basis and you want to market it? Then you need billing and marketing, and do you have that kind of infrastructure set up? Most of us probably do not. And the last thing we raised was, you know, we wondered if we could add a data consulting piece on top of the selling: offering consultation to people who are buying it as a way to maybe add a little more to the price. Did I miss anything?

Does anyone from any of the tables have something they want to add to another group’s topic or a question that they have about running a data store or using data that they would like the group to kind of discuss?

With both liability and privacy and redactions, it seems like a lot of the anecdotes I heard in the summaries came from processed and interpreted uses of data, and less from just: I got this, I cleaned it, if someone else wants it, here. And I’m just wondering if people thought the liability, or kind of, I don’t know, in-your-face consequences, might be less in the plain pass-through of unprocessed or cleaned data, rather than, you know, making a piece and putting it out there that has unintended consequences.

So do you mean like how the data is actually used by a third party?

Whether it applies the same. Like, say, for the taxis: if you made the pretty taxi thing and people were like, I know where he lives, versus just saying, I cleaned up this taxi data, if anyone wants it, you know, 200 bucks. Do you think the liability or consequences are as bad in the second case?

Anyone want to?

Thank you. About the monetization, whether to charge or not: there are different bases for providing the data, right? You can give the data away, if it’s raw data, or provide it as a service. If you serve it, that adds some value. I don’t know, say you have an API with different endpoints for different pieces; maybe that adds value to the service, so you can charge for those.

So in that case, what would be the questions or issues that you might have about how to structure that API or that licensing agreement?

I mean, it depends on how it’s used, like what kind of application you want to use it in, and also on how the data is structured, right? So if you have housing data, it’s millions of records, and you want to provide an easy way to get to a specific geographic boundary or something like that. You structure the data accordingly, like state, county, city, and then you query with those parameters and it gives you a sub-dataset to consume. That way you don’t have to work with the huge dataset. Your client, whoever uses the data, may not have all the resources that you have to handle this huge dataset, so you’re giving them something in short little bits to consume.
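The parameters-to-sub-dataset idea can be sketched in a few lines. This is a hedged illustration: the field names and sample parcels are hypothetical, and a real data store would put this filter behind an HTTP endpoint with paging and rate limits. It only shows the core subsetting logic:

```python
# Sketch: narrow a large record set by whichever geographic parameters the
# caller supplies, so clients never have to download the whole dataset.
def subset(records, state=None, county=None, city=None):
    filters = {"state": state, "county": county, "city": city}
    active = {k: v for k, v in filters.items() if v is not None}
    return [r for r in records
            if all(r.get(k) == v for k, v in active.items())]

# Hypothetical sample parcels.
parcels = [
    {"state": "MN", "county": "Hennepin", "city": "Minneapolis", "value": 310000},
    {"state": "MN", "county": "Ramsey", "city": "St. Paul", "value": 275000},
]
print(subset(parcels, county="Hennepin"))  # just the one Hennepin County record
```

Each query parameter maps to one filter, so omitting all of them returns the full set and combining them narrows it, which is the "short little bits" behavior described above.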

And to go with what Mary Jo had said about different prices for different consumers: that was one of the conclusions/compromises we came to with our data store. The idea of it really came out of a lot of commercial entities coming to ProPublica and asking for access to more of our proprietary databases, like Dollars for Docs, which only exists because we made it. So if you go to our data store, actually, there are no prices for commercial use. Instead, if you want to buy our data commercially, you end up getting in contact with our president, who works out a pricing structure, because one of my main concerns was that I don’t want to make journalists pay for this data. You know, we are all journalists here, and also, like I said before, open data: I want to make sure that everybody has access to this. So if we get FOIA data and we end up cleaning it, we will sell the cleaned data at a smaller price for journalists, a higher price for researchers, and then, depending on the commercial usage, we work out a price there. But any of that data that isn’t proprietary, that we didn’t make ourselves, where we have an actual raw dataset, we have available for free download, and that I think has actually been our greatest success. I don’t really have updated numbers, but I think as of last year we had upwards of 80,000 downloads of free data. We ask for information, because we’re kind of curious who’s using our data, but we don’t require it, and it’s a lot of journalists, but also a lot of other, you know, random researchers, commercial entities. So my worries about tamping down open data really were assuaged by seeing how great it was that we had so many free downloads.

So on the point of charging: one of our first cases was a nursing home resource, and one of the things I learned pretty quickly was that my newsroom was not prepared to be a data manager. Editors don’t care. I mean, you guys have clearly put a lot of energy into developing a practice and a process around it, but a lot of newsrooms just are not constituted right now to see the value and put in the kind of work it would take to be a legitimate value-added reseller of data, because it’s usually not sexy. I think it can be done, but right now there’s a lot to learn to be good at it.

And it is time-consuming. As kind of the point person for the data store at ProPublica—I mean, it varies from week to week, but I do have to spend at least an hour or two a week answering questions from people who have downloaded our data, if they have questions on our data. You know, it’s great to start a data store like this, but you also have to keep in mind that somebody is going to have to do the work, and it is going to take time out of, you know, doing journalism, certainly.

So I work at The Chronicle of Higher Education. We also sell some of our data; specifically, the data that I’ve been working on most recently was sold, so I was very interested to hear about ProPublica’s selling of the data. The way that it works for us is: I work on the data, I make a data dictionary and a record layout, and I specifically ask that any journalist who wants to look at the data come and talk to me, and I will walk them through the data source. And then our advertising and marketing folks, there’s a person who handles it, and she has a written email with how much it’s going to cost and an explanation to also reach out to me if they have specific questions, and we handle the pricing that way.

Right, yeah, the writing of the readme files and data dictionaries takes a lot of time.

Yeah, and I have one internally and also externally, so the formulas that I used, I do make sure that’s open to everyone. But I also wonder, like, your thoughts on how much transparency you put in those documents? Like, I give the exact column names that I used, the exact columns that I summed, and the exact queries that I ran to get that exact column. I mean, in some cases the documentation is really long, and I really don’t know if journalists are actually reading this documentation, and I wonder about that.

Well, are people calling you with questions?

Sometimes, but not a lot.

Then your documentation is great.
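Part of the documentation burden in the exchange above can be kept down mechanically. This is a sketch under assumptions (the helper name and sample CSV are invented): generate the skeleton of the data dictionary from the file itself, so the column list and fill rates never drift from the data, while the prose descriptions are still written by hand:

```python
import csv
import io

# Sketch: derive a data-dictionary skeleton (column names, fill counts, an
# example value) straight from the CSV; descriptions stay as TODOs.
def dictionary_skeleton(csv_text):
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    lines = []
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        filled = sum(1 for v in values if v.strip())
        lines.append(f"{col}: TODO description "
                     f"(non-empty {filled}/{len(values)}, e.g. {values[0]!r})")
    return "\n".join(lines)

# Hypothetical sample file with one missing value.
sample_csv = "school,enrollment\nCentral High,1200\nWestside,\n"
print(dictionary_skeleton(sample_csv))
```

Regenerating the skeleton on each data update also surfaces new or dropped columns before a buyer does.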

Yeah, that’s a great question. Unfortunately we are out of time. That hour went by very quickly. But I loved our conversation. It was great. If you can –

Yeah, thank you all very much. We’re going to have photos of these, so we’ll have the lists and the priorities posted. We’ll work with SRCCON to make sure that if anybody wants to look through that list and how people prioritized, those will be available in addition to the transcript. And if you’d like to continue the conversation with either of us, we have our contact information on the etherpad that SRCCON has put out. We love talking about data and data stores, so we’re happy to continue the conversation. Thank you!

[applause]

[session ended]