Machine Learning—How Useful Is it for Journalism?

Session Facilitator(s): Steven Rich

Day & Time: Friday, 11am-noon

Room: Thomas Swain

We’re just going to give it another minute or two, let people get breakfast, sorry.

All right. We’ll go ahead and get started. Hi, everybody. My name is Steven Rich. I am the Database Editor for Investigations for the Washington Post. If you were expecting a heavily technical session on machine learning today, I’m sorry to disappoint you. The focus of this session is something that I’ve been talking about a lot with people in this community and in the NICAR community: I feel that people see machine learning as this great hope, but they have no idea what it actually is or what it’s useful for, and sooner or later we’re going to see a story that just straight-up puts in a machine learning analysis that shouldn’t be there.

And so, my hope is to share with you a few things that I’ve done, share a few things that have been done, and also share a few things that I’ve seen that are awful and should never be repeated, namely Twitter sentiment analysis, but I’ll get to that later. So I will start this off by saying that I am by no means an expert on machine learning. I have made myself an expert in the few types of machine learning that I’ve done for stories, but I always consult with experts, which I think is going to be your best bet in any case to figure out which algorithms are going to be the best, what’s the best approach, and all these things. And there are people out there that can help you.

So just briefly, if you don’t really know what machine learning is, at a really basic level, it’s teaching a computer to be a human in some ways. I mean, you’re not going to teach it to emote, but you can teach it to do extraordinarily tedious tasks that generally require the eyes of a human. So say you need to fix names in a database where there are a hundred, a thousand names, and you want to match people, so, like, “Mike” is “Michael.” You know that, but a computer doesn’t know that. Dan is Daniel, but not Danielle. You know that, but a computer doesn’t know that. So the idea of machine learning is to train a computer to understand that, to recognize that, and to be able to do it. My favorite analogy for what machine learning is comes from Chase Davidson, from the New York Times: machine learning is like teaching a drunk 5-year-old to do a task for you repeatedly. Better put, I think it’s more of a team of drunk 5-year-olds. But I’ll leave that debate for another time.

So I’ll just jump in and talk about the big one that I worked on last year. It’s not one that I think is particularly replicable, and I will get into why. So last year we were approached by whistleblowers at the Agency for International Development. Not in the agency itself, but in the Inspector General’s office of the agency. And they were telling us that the IG was watering down their reports. How do we prove that?

That was the thing that we wrestled with for a long time. What we ended up getting in the first place was 12 audits, and we had all of the drafts of those audits. I should note that the drafts of the audits are not public. We had to get them through whistleblowers, and we ended up compiling somewhere in the area of 60 drafts total. So there were 12 audits and 60 drafts. What we wanted to determine at the end of the day was: were they taking out negative references to either the agency or its missions abroad?

And so we originally tackled that in that, like, oh, we’ll just look for a few phrases that were really negative in the audits and then just, you know, put a few lines in there that says, well, one audit it said this, and then they took this out.

But if we did it anecdotally, we knew what their argument was going to be. Their argument was going to be, well, of course, things get changed now and again, and you only highlighted three or four things. Even if we had a pile of them on the cutting room floor. So eventually we settled on sentiment analysis. The reason why this is not replicable is because language is a funny thing. In most forms of writing, you might be able to read something and understand that maybe it’s sarcastic, or maybe it doesn’t mean exactly what it’s saying, but a computer can’t. So doing sentiment analysis on a very large scale like we were trying to do, just by saying here are positive words and here are negative words, was proving to be impossible. And so what I ended up doing is writing a script that, essentially, would comb through and tell us whether there were negative references, whether there were positive references, total them up, and let us know what it was.

But when I got that back, I started to realize other problems. The word “poor” can mean a number of things. You could be talking about work being poor. You could talk about a poor country. And just because it says poor country, it doesn’t mean that it is a negative reference to that country. And so, essentially what we had to do, and this is something we have to do a lot in machine learning, is create a training set, which is basically a set of words that it singles out and will know, and it can learn that these words are good, these words are bad, these words are neutral. And it will bring them back and score them probabilistically: it is likely that this is a negative word, this one is unlikely. But what I had to do for this one was write the script in such a way that it would understand context, which is a very difficult thing. I mean, it’s difficult enough for humans to understand context sometimes, but here you’re teaching a computer how a word is being used. Basically I wrote the script in such a way that it would understand when “poor” was an adjective. And some words can be nouns and verbs, and that would definitely change what the score would be. So the verb would be negative and the noun neutral, et cetera, or vice versa. So the biggest thing that you’ll find, no matter how often you do machine learning, is that it’s a constant tweaking process. Things change, you don’t anticipate things. And I think you’re going to fail 500 times before you get it right once, and that’s just the nature of things.
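
A minimal sketch of the kind of script described above, in Python. The word lists, the `NEUTRAL_AFTER_POOR` rule, and the function names are all hypothetical, and a simple look-at-the-next-word rule stands in for the part-of-speech handling the speaker mentions. The actual script is not shown in this session.

```python
import re

# Hypothetical toy lexicon. A real training set would be built by hand
# from the specific documents being analyzed.
NEGATIVE = {"poor", "failed", "weak", "mismanaged"}
POSITIVE = {"effective", "strong", "successful"}

# Hypothetical context rule: "poor" directly before one of these nouns
# describes wealth, not performance, so it is treated as neutral.
NEUTRAL_AFTER_POOR = {"country", "countries", "nation", "nations", "region"}


def score_sentence(sentence):
    """Return (positive_count, negative_count) for one sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    pos = neg = 0
    for i, word in enumerate(words):
        next_word = words[i + 1] if i + 1 < len(words) else ""
        if word == "poor" and next_word in NEUTRAL_AFTER_POOR:
            continue  # "poor country" is not a criticism
        if word in NEGATIVE:
            neg += 1
        elif word in POSITIVE:
            pos += 1
    return pos, neg


def count_negative(text):
    """Total negative references across all sentences in a document."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return sum(score_sentence(s)[1] for s in sentences)


draft = "The mission showed poor oversight of funds. It operates in a poor country."
final = "The mission operates in a poor country."
print(count_negative(draft), count_negative(final))  # prints "1 0"
```

Comparing the totals for a draft and its final version is one way to count what was removed, though every rule here would need the constant tweaking described above.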

Another case that we used this on was the Sony email hack. I will note that the most common use of machine learning that you see on a day-to-day basis is your spam filter. Essentially what people did was create a program that said, “This is very likely to be spam,” and sent it to the spam folder. What I wanted to do with the Sony emails was two-fold. After I downloaded them, I wanted to filter out the spam and things like straight-up press releases or articles shared via email, because none of those were going to get at anything that we might want to see in the emails. But on the flip side of the coin, I also wanted to find phishing attacks, which are generally spam. If I created a general spam filter, it was going to scoop them up with everything else. And so I had to write a program to essentially separate out the spammiest of the spam, which were phishing attacks, and I will tell you the one that rose to the top of the list was very clearly a phishing attack. And it was a very successful one, and we wrote about it in the story. We didn’t know what it did. We didn’t know if it was “the” attack. It took almost no time to find. I mean, we found the email. It was basically one of those things where the headline was, like, “important document.” And then in it was, like, “Hi, please open this secured document and enter your password.” And then my favorite part of it was the final line: “PS, this is not spam.”
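
To make the "spammiest of the spam" idea concrete, here is a minimal sketch of a naive Bayes scorer in Python that ranks emails by how spam-like they look. The training examples, labels, and function names are invented for illustration. The session does not say which algorithm was actually used or how phishing was separated from ordinary spam.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def train(labeled):
    """labeled: list of (text, label) pairs, label is 'spam' or 'ham'."""
    counts = {"spam": Counter(), "ham": Counter()}
    docs = Counter()
    for text, label in labeled:
        counts[label].update(tokenize(text))
        docs[label] += 1
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, docs, vocab


def spam_log_odds(text, model):
    """Higher means more spam-like; uses add-one (Laplace) smoothing."""
    counts, docs, vocab = model
    totals = {label: sum(c.values()) for label, c in counts.items()}
    score = math.log(docs["spam"] / docs["ham"])
    for word in tokenize(text):
        p_spam = (counts["spam"][word] + 1) / (totals["spam"] + len(vocab))
        p_ham = (counts["ham"][word] + 1) / (totals["ham"] + len(vocab))
        score += math.log(p_spam / p_ham)
    return score


# Hypothetical hand-labeled training set.
model = train([
    ("important document please open and enter your password", "spam"),
    ("verify your account password now", "spam"),
    ("meeting moved to friday please review the budget", "ham"),
    ("draft script attached for your review", "ham"),
])

inbox = [
    "budget review meeting on friday",
    "please open this secured document and enter your password",
]
ranked = sorted(inbox, key=lambda t: spam_log_odds(t, model), reverse=True)
print(ranked[0])  # the "secured document" email rises to the top
```

Sorting by score rather than applying a hard cutoff keeps a human in the loop: a reporter reads from the top of the list down instead of trusting a yes/no label.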

I think you can basically filter on that: if you look at any email that says, “This is not spam,” it probably is. So that’s how we used it in that case. But I think, at the end of the day, the ultimate use for machine learning is probably the least sexy use for it, which is data cleaning. Machine learning is the perfect thing for data cleaning. I’m going to use the New York Times as an example here. They are working very hard to match people in the FEC database with themselves.

I mean, there are so many variations of people’s names in there, and there are so many variations of addresses and occupations. And so what they are doing is that, like, Mike is Michael, and, you know, if the address is pretty similar, maybe match that, and they’re scoring them on a something basis. I can tell you that they are putting a lot of effort into it because it is not very easy.
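
A rough sketch, in Python, of scoring whether two donor records are the same person. The nickname table, the field names, and the equal weighting are all assumptions made for illustration. This is not the New York Times code, which the speaker notes has not been released.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"mike": "michael", "dan": "daniel", "bill": "william"}


def canonical_first(name):
    """Map a nickname to its full form, e.g. 'Mike' -> 'michael'."""
    name = name.strip().lower()
    return NICKNAMES.get(name, name)


def similarity(a, b):
    """Fuzzy string similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a, rec_b):
    """Return a 0..1 score; higher means more likely the same person."""
    if rec_a["last"].lower() != rec_b["last"].lower():
        return 0.0
    same_first = canonical_first(rec_a["first"]) == canonical_first(rec_b["first"])
    address = similarity(rec_a["address"], rec_b["address"])
    # Hypothetical equal weights; in practice these would be learned
    # from a hand-matched subset of the data.
    return 0.5 * (1.0 if same_first else 0.0) + 0.5 * address


mike = {"first": "Mike", "last": "Smith", "address": "12 Oak St, Springfield"}
michael = {"first": "Michael", "last": "Smith", "address": "12 Oak Street, Springfield"}
dan = {"first": "Dan", "last": "Smith", "address": "12 Oak St, Springfield"}
danielle = {"first": "Danielle", "last": "Smith", "address": "12 Oak St, Springfield"}

print(match_score(mike, michael) > match_score(dan, danielle))  # prints "True"
```

The score is a likelihood, not a verdict. Two records with the same name and address can still belong to different people, so a high score is a lead to check rather than a confirmed match.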

But I mean, at the end of the day, it’s kind of the perfect task for machine learning because essentially what you’re doing is just going record by record and saying, is this person this person? Okay. And how they’re creating their training set is they are doing exactly that. They are taking a subset of the data and they’re manually doing that. And they’re saying, “Okay, if you notice this thing, you’re going to want to match these people.” And they’re doing that and making life easy. I know one of my colleagues came to me about standardizing the database ourselves, and I want to use machine learning for that, just not yet. I’m hoping at some point that the New York Times is just going to open source their code. But there’s really no other way to do it. I mean, data cleaning of small and medium-sized datasets is simple. I mean, it’s really not that bad. But when you have literally a hundred thousand different names and you’re trying to figure out if one is another, it’s very hard. I mean, there are a lot of Mike Smiths. And how do you know this Mike Smith is this Mike Smith? Or maybe this Mike Smith moved. Unless the FEC goes back and creates ID numbers for people, we’re screwed. We’ll never know if one Mike Smith is another. We can say probabilistically there’s a very high chance, but we can never say they are.

Which brings me to my next point, which is when not to use it. For the love of God, Twitter sentiment analysis is awful. If you’re sitting at your computer right now and you’re on Twitter, I want you to type in the search bar: I love… and search for it. I want you to read the first one and I want you to tell me whether or not it is a sarcastic tweet. Or whether you’re sure that it’s not a sarcastic tweet. There was an organization a couple years ago, in the last election cycle, that tried to do Twitter sentiment analysis for Obama and Romney. But just try to imagine the tweets that people send out about these people. I mean, somebody’s going to say, “I love that Obama sucks.” And it’s going to break it because it’s going to read that as “I love Obama,” basically. So it may classify that as a positive tweet. So because humans often write as they speak but can’t necessarily get across that they are being sarcastic, it can be a major, major issue with Twitter sentiment analysis.

It’s the reason why my use case sort of ended up working. I mean, think about audits. Audits are literally just straightforward language that conveys the point that it’s trying to make. And we created our training sets specifically for those audits. The training set for my program will not and cannot be used on basically any other set of data that you want to do sentiment analysis on. That’s the important thing about training sets. They have to be very specific to what you are working on. It’s probably our biggest issue in machine learning: one program is probably not going to be replicable on something else. If anybody puts out a program and says that it’s essentially a one-size-fits-all program, they’re selling snake oil. If they say it’s one-size-fits-most, there are some of those, and there are some really great ones out there that you can use for data cleaning, or to identify entities within a set of data or documents.

But they don’t work on everything. There are always unknown scenarios. And so it’s very important when you’re using these programs to know that there might be shortcomings, and maybe try and talk to an expert and say, like, if I put this in here, is it going to work? If I use this, is it going to help me? And that should help get you on the right track. So I’m kind of hoping that this session can be way more talkative than me just standing up here and spewing crap. And so, has anybody in the room used machine learning, tried to use machine learning, done it well, done it poorly? Do you want to talk a little bit about that?

Uh, sure. I work for Mac&Company, and we do data integration for data providers, and we recently had a partnership with GDELT, which is the global database of events, language, and tone. They source from global media, like the BBC, and then subclassify it so that you can search by, like, who’s posting about ISIS right now. We wanted to put this all on a map, geolocated by where the articles were published, and we had a few journalists who wanted to do themed maps. One was doing a story on wildlife crime. And so she was, like, look up poaching. See how poaching is featured in these articles. But then you get all these things like people poaching someone for a hire, or poached eggs also come up in cooking articles. What we ended up doing was just making the maps with the poaching information and kind of whittling it down as best we could, but then explaining those disclaimers in the posts and saying you’re going to find some poached eggs, because there were a lot of edge cases that we didn’t control for. So I don’t know what the best way is to handle those errors, or whether to just not use it at all.

I mean, from experience, I think transparency is going to be your best case scenario. You say, look, this is hard. I mean, this is not—I mean, there’s 99.999% of your readers are going to look at that and not say, “I could do that better.” Or they might say it, and they can’t.

And so, I think if you’re going to use machine learning in a way like that, it’s very, very good to say, like, here’s what we attempted to do. Here are some of the downfalls of that that are very difficult to weed out, but you know the majority of it is going to be what you want. So I think that being as straightforward as possible about where it fails is great. I think trying to tweak it as you go is also good. Some things are just not tweakable.

But yeah, there are very few use cases where you do forward-facing machine learning. I think a lot of it ends up being behind the scenes. Because it is probabilistic in most ways, it’s helpful for pointing you in the right direction in reporting, I think. But that is a very good use case for, “Hey, look at what we’re doing.” And I think it’s also very fun. Anybody else?

Sorry. Another problem that I don’t really know how to solve is that it has articles in 65 different languages, right? So what you write is something that looks up the translation of what you’re looking for, because I don’t speak 65 languages, but there’s no way to control for the errors that come from my not understanding how poaching is interpreted contextually in all those other languages. So that’s fun.

It should be noted that translation software, at its heart, is machine learning. It’s not the greatest because you can’t program it to get every phrase as it would be said in that language. You pretty much have to tell it that if you see this word in this language, it is this other word in this other language, and it won’t necessarily get it contextually accurate. So there’s that. I mean, there’s really no good way to create a translator that’s going to work accurately in that setting. It’s unfortunate, but I cannot imagine a scenario where you’re going through multiple languages and you’re able to accurately pull context out. I mean, pulling context out of our own language is tough. Translating something with context into our language so that it has that same context is virtually impossible unless the sentence is very, very straightforward. The translators get things wrong all the time, and that’s because the machine tells you, like, this is sort of right. This is more right than nothing. And so, I think the major thing that you need to know about machine learning is that you’re never going to do anything with 100 percent accuracy. There are always probabilities. You’re always going to be tweaking it to try and be as accurate as possible, but it’s not something that’s 100 percent accurate. You can get it very close, but unless you have a data set that’s, like, 100 items and you literally created a training set for every one of those items, it’s not going to be 100 percent accurate. And if you’re doing that for one that has 100 items, you shouldn’t be doing that. It’s much easier to just clean it and change things and do it like that.

If you don’t have examples of machine learning, I’m happy to field questions on issues you’re having if you want—if we want—if you want to talk about a potential use case where you think it might be useful but you’re still not sure. I’m happy to answer those. I want this to be a conversation if anybody’s got anything.

So how easily have you found it to convince other people—people who will say they were either skeptics of anything machine related, how—can you talk about your process of getting them onboard with your statistical assumptions and whatnot?

Yeah, I mean, it’s very tough. I think the biggest problem doing any amount of statistics at any level in journalism is that people aren’t going to trust me because I’m a journalist. They say, like, “Oh, that guy’s a journalist, he can’t do math.” Or he can’t do stats, he can’t do any of those things. So the assumption going in, at least if I’m public facing, is that I can’t do this. Internally it’s not that bad, but there are still people out there that are like, I’m skeptical of statistics. Are you sure this is good? And so it’s tough. One of the big ways that I do it is by bringing in experts and saying, look, I’m consulting with some of the best people in the field on this, and letting them explain it in their own words to editors, to other reporters, and letting them say, like, this is the best method for doing this. And have there be an understanding that there’s sort of this backstop for me, because I’m not an expert in machine learning and I’m not an expert in some of the various fields of statistics. But other people are, and there’s literally no reason for me not to ask them. And I’ve found, more often than not, that people are very open to helping you. I mean, if you go to a statistical expert and you’re like, “Hi, I work at a news publication and I need your help,” they’re just like, “Awesome, how can I help?” So you can’t be afraid to ask experts. But I would make it very clear from the get-go with my editors that this is what I want to do. I think it’s the most appropriate. A lot of people trust me now that I’ve gotten it right once or twice. But in the initial stages, it’s sort of like, you’re just going to have to trust me on this. If it goes wrong, it goes wrong and we won’t use it, but I always make the promise that I will consider not using it if I don’t think it’s good. If an expert says at the end of the day, “Oh, well, I thought this would work but we failed,” I always make sure that people understand that I am not married to the idea of using it. But I still think it’s the best option if we can use it.

When you’re consulting, who are the types of experts that you’re talking to? Are they people in industry, are they people in academia? Does it depend on the story specifically in terms of how to model your machine learning?

So I have a few friends and colleagues in academia that are doing some of this work, and usually my first step is to sit down with a few of them and talk them through my problem. And they are so interconnected with a lot of other people that they are the ones saying, this is the person that’s doing something similar to what you’re doing, they’re going to be the one that you want to talk to. A lot of these people are in the computer science world. Some of them are in the linguistics world. I mean, a lot of these experts are just people who understand language, which is really, really necessary if you’re going to do machine learning. So I like to consult technical and non-technical experts. And honestly, the best way is to find one or two people out there that you can sort of make friends with, and they’re the ones that can get you connected with the right people. They might be the right people. But they are the ones that are going to tell you that you can find somebody. They’re also the people that are going to tell you: this is not a perfect use case for this; don’t do it. Which I’ve had in certain instances where I thought, like, “Oh, this might be an interesting use case for this.” And they would say, “That’s interesting in that it’s not good.” So I think it’s always good to consult experts or people who know this sphere before you really dive into something like this. ’Cause it can get complicated quickly. And a lot’s gonna depend on the situation. I mean, there are too many variables in every one of these situations to be able to say, like, yes, you can do it there, or no, you can’t do it there. There are some where we know yes, you can do it there, or no, you can’t do it there, but that’s because we’ve tried. But I think most use cases we haven’t tried yet, so we don’t know.

So, I was actually talking with a colleague of mine this morning about some statistical modeling that he’s doing with some police report—police abuse reports. And he’s kind of like a very intelligent person who knows a lot about math. And as he’s kind of walking me through this stuff, it became very clear to both of us that—to attempt to explain exactly how he came up with this conclusion, it’s very opaque and it just becomes kind of—it’s kind of like a blackbox, you put the numbers on one side and this thing comes out on the other side. So I’m wondering, so question one: I’m wondering to what extent in journalism, generally speaking, do you feel like, or does anybody feel like I should like be fully transparent about what’s happening inside the blackbox because it’s very complicated and it’s hard to explain? And then the wow, I lost question two. Damn it. Anyways. I might ask it later. On you yeah, I mean the fact that these machine learning and doing statistical analysis and some things.

I’ve always argued that the biggest story killer of all time is to insert the words “logistic regression” into a story, and people are like, I don’t care anymore. It doesn’t matter if the subject matter is riveting or not. So my goal with this and statistics in general, and let’s be honest, machine learning is kind of statistics on steroids in a lot of ways, is to use it, if you can, as a reporting tool to point you in the right direction or something. That ends up being easier because, if you’re looking for something in particular, you never have to say how you got there. It’s only when you come to putting the analysis that comes out of it into the story that you need to be entirely transparent. At the Post, we do what we call “did-boxes,” which are basically methodology boxes. Which is, if you really, really want to know how I did this one, here’s how I did it. It might not be the most obvious thing to most people, but it’s a starting point. It’s a “here’s how I did it.” If people want to talk to me about further details, I always email them back, I always call them. I want to be as open and transparent about this as possible because, at the end of the day, you can’t just make this pie-in-the-sky claim and not tell people how you got it. And so especially with machine learning, you have to be open and honest about it. And how you want to do that is a different story.

If you say, like, I used machine learning to get this analysis in a story, no one who reads it is going to understand that. I mean, we’re not writing for a technical community but if you put it in a methodology box, those people who really want to know are going to figure it out.

So I guess I remember the second question: To what extent do we feel like some kind of a complicated analysis should be reproducible. You mentioned this thing that you did as not reproducible.

I mean, it’s reproducible on the same dataset. It’s just not reproducible on other datasets because it’s not one size fits all. You can use the algorithm that I did and everything around it, but my training set is very, very specific to the audits that we used. So if you had another group of audits, it might work. But even then, you would probably still have to do your tweaking. I could give you everything including my training set with this giant note that says, “You’re going to need to change this.” Because there are words in audits that are, like, “recommendation” in an audit is a very bad thing. But “recommendation” in real life can be good, or it can be neutral. And so I want things to be reproducible and I want to share my code when people want my code, but I always want to put a note on there: if you’re going to use this for a different purpose, or for a different dataset specifically, you need to change it. It’s not a one size fits all.

I’m curious if you’ve seen any examples of combining crowdsourcing to clean up or otherwise inform machine learning output, particularly around journalism. I work at Popup Archive and we index a lot of sound for journalists, producers, media, radio, universities, archives. And so we work with a combination of tools, a lot of them third-party, some of them ones I’ve done experimentation and research with. Some of them include speech-to-text software that’s trained on certain news media and history. And semantic extraction. We won’t go as far as sentiment analysis for reasons that you’ve just argued and described. And then we’re looking more now into speech and audio analysis. There’s one called “Hipstas” which started with bird calls and poetry, just analyzing different qualities of waveforms related to the sound. But anyway, there’s a project called Metadata Games that’s come out of academia, out of Dartmouth, to clean up messy OCR transcripts or other, like, more machine-learning-type academic projects, and we’re always thinking about ways of augmenting.

So I actually had this thought very recently when Hillary’s Benghazi emails came out. The Wall Street Journal did something great: they would ask readers to describe a random email. They had tags that you could put on it, or you could make up your own tag. But what if you showed readers 500 emails and you had 50 thousand? You could use the tags they gave them as your training set and then try to extrapolate. Things with similar language would get tagged in a similar way, and it might be a really great way of finding out, without having to go through it yourself, what’s in the larger set. So people see the small set, and then it extrapolates to the large set, and then you might find the really interesting emails in the large set by understanding what people think about the small set. As long as you think the small set is representative. I mean, Hillary’s Benghazi emails are not going to be representative of her emails as a whole because they were very specific to one subject. But that’s how you create a training set: you take a small group from a large group and you go through it by hand and you learn what the intricacies are going to be, and then you learn from that, and then you go through another group and learn if it’s working or breaking.
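
One way the extrapolation idea above could be sketched in Python: treat the reader-tagged emails as a training set and give each untagged email the majority tag of its most similar tagged email. The tag names, the sample emails, and the word-overlap similarity are all assumptions for illustration, not how any newsroom tool is known to work.

```python
import re
from collections import Counter


def tokens(text):
    """Set of lowercase words in an email."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Share of words two emails have in common, from 0 to 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def majority_tag(tags):
    return Counter(tags).most_common(1)[0][0]


def predict_tag(email, tagged_sample):
    """tagged_sample: list of (text, [reader tags]) from the small set."""
    words = tokens(email)
    _, best_tags = max(
        tagged_sample, key=lambda item: jaccard(words, tokens(item[0]))
    )
    return majority_tag(best_tags)


# Hypothetical crowdsourced sample: each email with the tags readers chose.
sample = [
    ("Schedule for tomorrow: call at 9, lunch at noon",
     ["logistics", "logistics", "boring"]),
    ("Latest press report on the attack in Benghazi",
     ["news", "interesting", "news"]),
]

print(predict_tag("Forwarding a press report about the attack", sample))  # news
```

As the speaker cautions, this only works if the tagged sample is representative of the larger set, so a second hand-checked batch is needed to see whether the predictions hold up.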

Do you know what the Wall Street Journal’s end goal for that was?

Let’s say that you have 10,000 people read that in a span of 30 minutes and everybody tags one. Then you probably have somewhere in the area of 12 tags for each one. I probably didn’t do that math correctly. Now, in 30 minutes, for 500 emails, without having to read all of them, they know which ones the interesting emails are, they know which ones the controversial emails are. They know where the spam emails are in there. So for their purposes I think it was a great reporting tool that sort of skipped the step of having a reporter read and comb through the emails. They were able to figure out things quickly. In a breaking-news scenario like that, machine learning’s just not going to do it because you’re not going to have time to train something if you’re hoping to get it out that day. But I love the idea of crowdsourcing your training set because I see things through one lens and a lot of other people see things other ways, and I think it’s really great to bring in people who are a little more detached than you to help.

I’ll check it out, too, do you know if they had controlled jobs for the people to use?

I can’t remember.

Okay. Cool. Thanks.

Along those lines, has anyone ever done anything with Mechanical Turk or some other kind of piecemeal approach, paying people small amounts to qualify things? Are there any kind of journalistic issues that can arise from going about it that way?

That’s a good question. I mean, I like to do as much of what I can in-house, or with experts that I’m not paying, because if we pay an expert, then it’s very easy for someone to come back and say, “I think you paid him to say your analysis was good.” Even though that was completely not the case. So in that way, it ends up being about what we talk about for investigative subjects all the time. It doesn’t matter whether you’re actually doing something wrong or unethical. If it appears to be wrong, or it appears to be unethical, it’s just as bad. So I’m okay with paying people to create my training sets. I mean, I will tweak them. But I don’t necessarily know of somebody who has done that yet. I mean, machine learning is still very, very new in journalism. There are still very few people doing it. So the examples are obviously not everywhere. But it’s starting to be in more places, and I just don’t know. I know that BuzzFeed is partnering with, I cannot remember who it is, a group that’s doing machine learning, and I’ll just give you a little bit of background, because I think this is one of the coolest uses of machine learning that I’ve ever seen: taking previously redacted documents that have now been released publicly, things that were released ages ago with redactions and are now being released without them. And it’s a predictive tool for seeing what is under the redactions in a document. Obviously you can’t say with any certainty that, oh, yeah, now I know this name, or I know exactly the words under this. But it might point us in the right direction. And it actually helps us to understand what the government does and does not want us to see in the specific context of national security documents. And so BuzzFeed is not doing any of the analysis themselves; it’s happening at this third party. I’m relatively certain it’s a group of academics. There’s no issue with that because they’re not saying it’s theirs. They just have some rights-based claim on what they’re doing. And I think that’s fine if you’re straightforward about the fact that you’re not the one that’s doing it. It’s coming from another source and you have no way of knowing that it’s accurate. That’s the scariest thing about taking somebody else’s algorithm at its word: you are just assuming that they know what they’re doing.

And I think in some cases that’s good, but in a lot of cases, it ends up being the same problem that some of us have in journalism when we’re doing some of this stuff, is that there’s not anyone in my newsroom who can check my algorithm. And so I have to go elsewhere and basically if I want anyone to trust me on this, I have to get experts to say that it’s good, or that it’s bad, or whatever. And I think it is very, very, very important that there—that we are—that we hold algorithms accountable. I just don’t know how we do that, and I don’t know how we do that in a way that is very out in the open.

And so, I’ve tried to be as open as I can about some of the things that I’m doing, but I would imagine that there are still a lot of skeptics of what I do who think the things that I’m doing are not above board, even if I’m not paying anybody or anything like that. Anybody got an idea about what we might use it on?

For those of us who have not gotten to do machine learning at all, no experience with it, is there any suggestion for how to get in the ground and start digging into it a little more?

I like—I’m mostly self-taught on this. One of my goals, I set it for myself to meet by this conference but I didn’t meet it because I just got swamped with stuff over the past couple of months. I’m working on creating interactive courses on both R and Python on the basics of machine learning. At some point I will release those. And I’m hoping that they will be geared—or no, they will be geared more towards journalists’ uses of machine learning. They’ll teach you how to do certain things. They’ll teach you the very, very, very, very basics. I mean, there’s a lot of stuff that you need to know before you can really get off the ground. But yeah, I mean there’s a lot of great resources out there. I mean Stanford does some fantastic online programs that you don’t have to pay for. So, I mean, I highly recommend going and finding them and they are very good at just hitting the ground running but, like, you don’t have to have a CS degree to be able to understand the basics of that course.

We have a lot of tags that are added to content by people. Sometimes it’s like DEA, and Drug Enforcement Agency. Sometimes there’s a hierarchy that isn’t expressed. Do you have any suggestion for resolving those entities and trying to understand structure within them if they’re not in that already?

So I mean, that’s sort of a great use case for this, is depending how many you have, I mean, if you—if we’re only talking about a hundred tags then –

It’s like 2,000…

So you could theoretically go through that on your own and do it. It’s not going to be fun. I mean, essentially what you would do is pick out some random ones and you would essentially want to define what you wanna do. So between DEA, and Drug Enforcement Agency, you need to decide which one you want, and you want to keep that consistent and so if you’re going to pick DEA, you also need to have FBI, or vice versa, Federal Bureau of Investigation. You have Drug Enforcement Administration. So basically, the first thing that you just wanna do is set out how you want your tags to work and how you want them structured. And so what you say is: I know that if you take the first letter of these—of each word, look for that. So like, “Drug Enforcement Administration,” it would take D, E, A, and it would look for all tags with DEA, and just DEA, and you can say, “If it looks like that, just take that and change it.” So this is essentially data cleaning. You’re trying to get everything uniform. You’re trying to do that. For 2,000 you’re just going to have to eyeball a good portion of it. You’re just going to have to understand what some of your major issues are before you tackle them. It’s the same thing with anything that you’re going to do in machine learning: if you don’t understand the data that you’re working with, you cannot write a program for it. And so, it is very, very important that you just sit there and take, like, 30 minutes to an hour and understand the issues and then you can write the program that either changes it, or tells you what it most likely is, which is probably your best bet because you don’t want it changing a tag to something else when it was a screw-up and it should have never changed that.
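The first-letter approach described above can be sketched roughly as follows. This is a minimal illustration, not code from the session: the function names, the stopword list, and the sample tags are all assumptions, and it only suggests matches for a person to review rather than changing any tag.

```python
from collections import defaultdict

# Small words skipped when forming an acronym (an assumption for this sketch).
STOPWORDS = {"of", "the", "and", "for"}


def acronym(tag):
    """Build initials from a multi-word tag, skipping small words."""
    words = [w for w in tag.split() if w.lower() not in STOPWORDS]
    return "".join(w[0].upper() for w in words)


def suggest_merges(tags):
    """Map each one-word tag to the longer tags whose initials match it.

    Nothing is rewritten; the result is a list of suggestions to eyeball.
    """
    by_acronym = defaultdict(list)
    for tag in tags:
        if len(tag.split()) > 1:
            by_acronym[acronym(tag)].append(tag)

    suggestions = {}
    for tag in tags:
        if len(tag.split()) == 1 and tag.upper() in by_acronym:
            suggestions[tag] = sorted(by_acronym[tag.upper()])
    return suggestions


# Hypothetical sample tags, echoing the examples from the discussion.
tags = [
    "DEA",
    "Drug Enforcement Agency",
    "Drug Enforcement Administration",
    "FBI",
    "Federal Bureau of Investigation",
    "Education",
]

for short, candidates in suggest_merges(tags).items():
    print(short, "may match:", candidates)
```

Returning suggestions instead of rewriting tags follows the advice above: the program tells you what a tag most likely is, and a person decides whether to change it.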

So is it worth looking at an external source like Wikipedia or something to try to resolve similar entities versus doing actual parsing of?

Looking for what?

I mean, it’s possible for some of those Wikipedia pages to resolve some entity. Would you use some third-party service rather than just training on the text?

I think you can do whatever you want. At the end of the day it’s your decision: whatever you think is best for you is going to be best for you. So you can use a third-party one. You just want to understand what they’re doing.

Sure.

And so, yeah, I mean my best piece of advice there is just know thy data.

Thanks.

I’m just going to talk real quickly about the one that I’m working on. I have no idea whether it will ever work but I’m hoping to God that it will, and it is super timely, and I’m sort of mad at myself that I haven’t gotten it to work yet, but I am working on a program to download the SCOTUS opinions and tell me exactly what happened in basically five seconds after they’re posted. I want it to tell me what the margin was. I want it to tell me who’s dissenting. Or I want it to tell me the concurring opinions of the case. I want it to tell me the basics of what they’ve decided, but it’s very difficult because it’s very difficult to write programs that summarize: you can find keywords and say oh, yeah, that’s going to be important, but it doesn’t know how to structure a sentence around that.

And so, you can find key sentences and say look, just take these key sentences but you might have five key sentences in a row and it’s this disjointed thing. So that’s one of the issues that I’m trying to tackle is understanding how to do summarization, how to do extracting. The positive thing about these is they’re generally pretty similar. You’re going to have very similar structure from one opinion to the next. You’re going to have some consistency there. The biggest, biggest issue is having a computer understand the full text of something that might use colloquialisms. I mean, read any dissent by Scalia, and my summary might say that California’s not a western state because it thought that that was important. So it’s a very difficult thing. I cannot imagine that I’m the only person working in this sphere to do something similar to that. But it is one of the ultimate goals of machine learning, I think is teaching a computer to do everything that I can do, just way faster. And, you know, some of that stuff is impossible. I hate to say it. I mean, there are some things in journalism that we will have to do over and over and over and over and over and over again. And it will be tedious and it will suck, and there’s nothing that we can do about it. But I mean, machine learning is this great unknown. It can help us with a lot of things but it can fuck us on a lot of things, too.
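The key-sentence idea above can be illustrated with a very small extractive sketch. This is not the program described in the session; it is a generic word-frequency scorer, and the function names, stopword list, and sample text are assumptions. It also shows the limitation mentioned above: the selected sentences are simply lifted from the text and can read as disjointed.

```python
import re
from collections import Counter

# A tiny stopword list for illustration; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "that", "was", "is", "in"}


def split_sentences(text):
    """Naive sentence splitter; legal abbreviations such as 'v.' would break it."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lowercased words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def top_sentences(text, n=2):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average word frequency, so long sentences are not favored by length alone.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]


# Hypothetical sample text, not taken from any real opinion.
sample = (
    "The court reversed the judgment. "
    "The weather was nice. "
    "The court held that the judgment was wrong."
)
print(top_sentences(sample, n=2))
```

A scorer like this has no notion of margins, dissents, or holdings, which is why the structural consistency of opinions mentioned above matters more than the keyword counts.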

And I think it’s really important to note, more than anything, that machine learning is not a magic bullet. It’s not something that you’re going to be able to pull out of your toolbelt and say, “I can apply it to this situation.” I like to think of us, especially data journalists, as Swiss army knives. We have varied skill sets that we can employ at a given time, but you don’t want to use the saw when the knife is going to work.

And so it is important that we understand what tools we need to be using and machine learning is just one of those tools in your belt. It is probably going to be one of the most unused tools in your belt because it does not have practical applications across the spectrum. But it has some. I mean, it’s not a useless tool. It’s just—you need to, if you really want to use it, you have to understand it. And I think the best way to understand it is just to talk to people who are using it and understand the various use cases for it and then try. I mean, there are ones that I’m not going to get into now that I’ve tried and miserably failed. And some of it was stuff that I ended up—I mean, the audit one. I just gave you an abridged version, but that was—I think I gave you like two or three failures before I ultimately succeeded but it’s probably closer to two or 300 failures before I ultimately succeeded and that’s just life. And I feel like the best learning that I’ve ever done was saying, “I think this tool is cool. I think it could be applicable in this case, and I think I’m going to learn how to do it in this context.” And I think that’s how you learn to do a lot of that stuff. I think that’s one of the best tools that we have. I see a problem. I can see how the tool can fix it, I don’t know how to use the tool but I’m trying. And you’re going to fail here. You’re going to fail. Even if you’re an expert at this you’re not going to succeed on your first try unless you’ve literally already done it on a dataset that’s almost identical. You’re not just going to be able to say, “Oh, yeah, I’ve got this old code in my closet and I’m going to pull it out and it’s going to work perfectly the first time.” That’s just not how it’s going to work. So this, more than any other discipline of journalism, requires you to have patience and to understand that you’re going to fail and you may never succeed on something because it’s not possible. 
But that should be taken as a learning experience and that we should—it will at the very least help you understand how things work.

Five minute warning, that’s all.

Any final questions while we’re staying in here?

Oh. Can you tell me what tools are you using in projects? The Supreme Court thing?

So I write most of my code in R. I’ve been teaching myself how to do this in Python, mostly taking my R code and modifying some things. So I’m rewriting all of my R code for the Artis project in Python because that’s how I feel I can learn to do it best. And so there are some great packages and modules in both worlds, in R and Python. I’m sure there’s Ruby stuff, I just am not familiar with that world. I’m sure there are modules or packages in every language that you could possibly think of.

I don’t know what they are off the top of my head. If you want to contact me after, I’m happy to provide you with the ones that I’m working with, the ones that are really good. Yeah, so I’m pretty sure just—yeah, you don’t have to create your own packages and modules; people have done that for you. They’re always looking to get better because this is still a very new area. And well, maybe not super new, but in the open source world, there’s still not a massive amount of code because it’s not very reusable. But it’s just basically an extension of writing in Python, or R, or whatever it is that you want to do. I mean, I like R because it’s a statistical program first and that’s sort of what this is. And I always tell people—I mean, how many people in this room know R or have used it? Okay. So you will understand this. So I think the best thing about R is that it was written by statisticians who understand statistics intimately, and they can—and so basically any use case that you have is covered and the worst thing about R is that it was written by statisticians because it has the steepest learning curve of any language that I’ve ever worked in. But once you know it, it’s incredibly useful. It’s incredibly powerful. It’s incredibly easy. But it’s not easy to go from zero to 100 on that in a couple days. It is something that you’re going to have to learn. You’re going to have to dedicate time to. But I think—I know that there are people out there that would argue with me that it is not the best language for this but I think it is. I think R is the best language for machine learning. I think you can do it in all languages. But I just think that using a language that is grounded in statistics for something that is very much statistical and very much probabilistic is your best bet. 
I mean, I’m not going out there and saying, “Learn R tomorrow!” But I am saying that I think it has some of the best packages and if you want to talk about that with me, I mean, I’m—not right now because I’m sort of new to machine learning now, but I’m happy to talk after the conference and talk about some of the packages I’m working with. Some of the good things that I’ve found, some of the bad things that I’ve found and really, trying to help you get off the ground. And again, I think the best way to get off the ground with this is with an idea in mind. It doesn’t have to be an idea that you ever want to publish. But I think you should go into it with a dataset that you want to use, too. And you think that machine learning, that’s really the best way.

I just had a quick comment. I don’t know how common it is for—you mentioned one thing that you often do is talk to somebody who has, like, grounding in, you know, whatever field it is that you’re trying to learn more about using machine learning. And, you know, having some friends and colleagues that kind of work in academia, it seems a lot of the problems that are at the heart of building the little bits of functionality that you need to do to do any kind of, like, machine learning at scale. A lot of those problems have already been solved and they’re just buried in these, like, ridiculous academic papers with big formulas. And so, like what technique, like, one of the projects I work on latent within it are a lot of these kind of mathematical formulas that work very deep within these kind of academic papers and what it took was calling these people up and saying, “Hey, what did you learn?” I mean, one guy was in Russia, so we couldn’t really call him up. But we got in touch with him online and he has, like, code that he used to write his paper. And it was really terrible because it was academic code. It was written in Java, so none of us knew what to do with it. But anyways, so patience and asking a lot of questions. You can yield work that’s already been done. I know it’s hard to know what to search for, but I know some people can speak academese, and normal.

You have a good point. There’s probably people who have already made it. You know, there are so many algorithms out there, and none of them are one size fits all, but all you need is one size fits one to have that one. So again if you talk to the experts, someone is going to say, someone’s done something extraordinarily similar to what you did, and you’re basically going to use a lot of that with tweaks. That’s machine learning now: there’s a much larger userbase. All right. Thank you.

[ Applause ]

Thank you for bringing questions. I wanted this to be a conversation. And I think that we got pretty close to that so…

Session Transcripts

A live transcription team captured the SRCCON sessions that were most conducive to a written record—about half the sessions, in all.
