The last session of the conference. Boy, you’re full of Etherpad tricks that I had no idea about.
The Rastafarian button does that.
Oh, cool. That one? Cool. So it looks like it’s about 4:30. And we have about… All of our chairs filled. So I’ll kind of go ahead and start it up. We have a pile of chairs here, if more people are coming in, and need a place to sit. We have a couple more chairs over in a pile here. If you want them. Yeah.
Cool. So welcome to the past of the future today. If you want to be talking about archives, you’ve come to the right place. If you don’t, I’m sorry, but we’re talking about archives. So I’m Daniel McLaughlin, I work at the Boston Globe. I would say when it comes to archives, I’m an enthusiastic amateur. I see many people in here that I know probably know more than me about archives, so I’m looking forward to sharing knowledge. I kind of want to start out by showing some links that I found that are interesting. Maybe that are interesting, kind of orient the discussion towards my preferred hobby horses, but then we can kind of open it up, talk about ideas you had for ways of dealing with archives, archives of content that your organizations have, ways of orienting current work flows to support the archives of the future, challenges you’ve encountered in either of those, and cool things we can do.
So when I think of media archives, my kind of main point of reference are microfilm and newspapers. Scanned, these days. But sort of facsimile. Facsimile reproductions of newspapers, put out over a long period of time. And what’s interesting to me about this kind of… This type of archive… Is that it captures a very complete picture of… You can imagine what it was like to read that newspaper as a full thing. You’re not just looking at a particular article or a particular item. But you’re seeing the full range of juxtapositions between different types of stories, the stories and the advertisements, and the whole thing. So, for instance, these are all linked from the Etherpad, which is Etherpad.Mozilla.org/SRCCON2015-archives. I don’t know what the best way to keep that on the screen, but I’ll come back to that. For instance, this is an article that is using the archives of the New York Times to sort of do this deep dive into advertising design in the ’40s and the ’50s, and it’s not necessarily… You know, in this case, he’s in fact, like, entirely blanked out the news. So it’s sort of just looking at only the kind of auxiliary matter in the newspaper. But it’s still—you know, this kind of primary resource.
Here’s an example from the Boston Globe. I like to call this snowfall circa 1960. This sort of just playing with editorial design here. That… This front page that’s this sequence of things, and I don’t think there’s even… I don’t think there’s even a story attached to this on the front page. It jumps into the sports section. But just like… I don’t know how you kind of capture this without looking at the whole page. But at the same time, you have this sense of seeing the whole thing, but you also see… You can also see how that abstraction starts to leak. How it starts to be… You get hints of the stuff that isn’t being captured here.
This whole kind of ecosystem of different editions, during the course of the day and the night, that is not archived. You only normally see, like, one final edition. This is the front page of the Boston Globe that came out on election day, 1918, and it has these notices on the front page. Notices that the Globe will not issue election extras, due to paper rationing. That they won’t give you the returns by telephone, because between the war and the influenza epidemic, they don’t have people to answer the phones. The past was a terrifying place. But these sorts of things that are… That were kind of so part of what—of how the newspapers interacted with their readers, how they provided the news, that only show up when they’re temporarily disrupted. There’s no archives of people calling for election results, which presumably they did around there. I don’t even know how you would archive that. But it’s sort of—we think like—oh, how do we archive our Twitter bots? Well, I mean… News organizations have always been providing stuff in these auxiliary special ways that have been challenging.
And that sort of—this, I think, is a kind of good place to pivot, to think about… How can we archive what we’re making now? We’re all experimenting with different ways of telling stories. Different ways of using new technologies. And often what that means is kind of poking around the edges of how we define news and how we define media. And so that sort of presents a challenge, if your archival schema relies on a fairly stable notion of what you’re archiving.
This is not necessarily news or media, but this is a catalog record from the Getty Institution for a book that they have in their collection. The title of this book is: what is wrong with this book? You may notice the physical description here—one plastic sphere in paste board box. Note. Text stamped on white ball. Initials, copyright date, and edition size inscribed in pen and ink. Said to have been created by artists to expand catalogers’ understanding of the potential range of artists’ books. And I think that many of the things that we’re producing are sort of like… They’re sort of like an article in the same way that this is a book. And so the question is: How do we think about archiving these things? Do we think in… What is it that we’re trying to preserve? Are we trying to preserve the experience of reading or interacting with this thing in the moment? Are we trying to preserve it like the information in a sort of presentation-agnostic format?
This is the digital package that was produced for the Boston Globe’s investigation of the Catholic church in 2002-2003, and this is—I’m actually not really sure how this is still as functional as it is. Because it’s… Boston.com has undergone a lot of changes in that time, but somehow this still all pretty much works. Like, all the Flash works. There’s a little bit of rust around the edges. But it’s still… It’s pretty much functional. But at the same time, it is very much a time capsule of a particular moment in web design, where we make it exactly this wide. We make everything in Flash. But… You know, is there a better way? Should we be trying for more than this? Is this about as good as we can get? And maybe it is.
This is another publication in Boston, the Boston Phoenix, which was a long-running alternative weekly. When they ceased publication, it became… It’s sort of a kind of day by day thing. Is the Phoenix’s website up? Can people who wrote for it over the years get their stuff? In their robots.txt file, they’ve basically stopped everybody from scraping it, which means it doesn’t show up in the Wayback Machine. It doesn’t show up in any of these places, because they’ve set it up this way. And so these things that are… It’s not even at this point really a technical problem. It’s a kind of organizational and… How do we… There’s this whole entire kind of cultural legacy that’s just sort of disappearing here. This is, again, talking about—what does it mean to preserve something in context?
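As a rough illustration of the robots.txt point above: a crawler that honors the file asks, before each fetch, whether its user agent is allowed. The sketch below uses Python's standard `urllib.robotparser`. The two robots.txt bodies, the `ia_archiver` user-agent string, and the example URL are all assumptions made for illustration, not the Phoenix's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that shuts out every crawler from every path.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# Hypothetical robots.txt that allows everything (empty Disallow).
ALLOW_ALL = "User-agent: *\nDisallow:\n"


def crawler_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://example.com/2004/story.html"  # placeholder URL
    print(crawler_may_fetch(BLOCK_ALL, "ia_archiver", url))
    print(crawler_may_fetch(ALLOW_ALL, "ia_archiver", url))
```

With a blanket `Disallow: /`, any archive crawler that respects the file skips the whole site, which is why the fix here is organizational rather than technical.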
This is a project that a couple of artists did to produce screenshots of archived Geocities pages, using virtual machines that loaded a web browser contemporary with the pages. So you’re looking at these sites that presumably look quite a bit different in a modern web browser, but here they’ve been screenshotted, and so you see the combination of both the chrome of the browser, and the actual rendering engine, kind of… It’s less interactive, but it’s maybe perhaps more faithful to how these things existed in the first place.
And finally, one example that I can’t really show, because it involves listening to 24 hours of audio… This is probably one of my favorite artifacts on the internet. This is the complete broadcast day of CBS News from D Day, 1944. And it’s essentially the sort of newspaper scan facsimiles, but an audio version. You get to see this whole event unfolding in realtime, and definitely if you have 24 hours, listen to it. So yeah. I don’t know. I hope some of this discussion is getting you thinking about things that you’ve thought about before, or projects that you’ve worked on or wish you could do. So let’s open it up. Does anybody want to talk archives?
We’ve got half a decade of Realmedia files.
Who are you and where are you from?
WGBR Boston, NPR station. So we have half a decade or so of Realmedia files that we can never do anything with again. And there are interviews and stuff in there, that we have looked for, but I don’t think we have a way of… Because we can’t play Realmedia anymore.
Are those the canonical version that you have?
Everything else is on tape, apparently. Do you have any idea how to start with that? Reel-to-reel. That’s one thing about the archiving process. Audio becomes seemingly harder to deal with. Especially proprietary formats. Dead formats.
I guess this is the first set of questions about how to kind of bring archives into the present. I work at The Atlantic, and a lot of the archives are sitting in the middle of the office in bound books. And I would say everyone thinks they’re cool and everyone on their first day is like… Whoa, the archives! They’re right there! I’m going to spend so much time with them. And you never do. And one way we kind of endeavored to just make time for them is to have an archive party. And we just all hung out, and the staff all got pizza and beer, and went through the physical archives we had. Which was easier than setting up a microfiche machine. But that would be within the realm of plannable, if everyone has made time for it some night. And that wound up leading to a lot of interesting actual story ideas and ways to adopt the archives. Because I think like a lot of magazines, we can kind of reuse archival material if we find a way to frame it under a fair use context. And so that was a good way for us to kind of… That was a good social event for us to bring them into our kind of current day practice.
Yeah. The question of magazines, where a lot of… Especially The Atlantic. Where most of that stuff is longer pieces. It’s harder to kind of dip in and be like—oh, here’s a cool snippet. Like… Yeah.
It is, and it’s also—I think it’s also—the ads are awesome. Old ads—the Madison Project. Old ads are super cool, and in some ways reveal the decade’s ideology way better than some long, ruminative piece does. So if you look at a book and you’re like—oh, that’s how they marketed minivans in the 1990s? Which is not as far back as it goes—that could be the shock of the old that could bring that all by itself.
Yeah. I work at Harvard Business Review, and we’ve done the same archive party type of thing with our… We’ve got, like, a library with a hundred years’ worth of bound things, magazines, that is. But one thing we found a lot of success in, that I just had to Google to remember exactly how it worked out, but we find stuff in the archives that has kind of boomeranged back around in the cultural landscape. Like there was an episode of Mad Men that had some then-high-tech IBM computer, and it turned out an IBM engineer had written an article for HBR in 1969 about that computer, and I guess Mad Men is a very historically accurate show, such that… It dovetailed so perfectly with this story from this 1969…
How did you find that?
Fortunately, HBR is the kind of place that a lot of people have been around for a long time and just carry that kind of institutional knowledge. Which, I mean… There’s really no real accounting for that. Like, the archive party thing surfaces a lot. Especially with graphics. Not so much ads for us, because we’re more of an academic thing, but a lot of very old school data viz type of things. We can kind of find—like, those of us who are much newer—but some of the real old school editors, who have been there since the ’70s, just have that archival knowledge that none of the rest of us can really fake.
When we had a hack day at the Globe and we did a project that basically took—we have, like, OCR of the scans for all of these old things, which is… It’s not clean enough to read, but it’s kind of good enough to search, and so I loaded it into Solr, which is a platform for, like, full text search and stuff, and one thing that Solr does is you can send it the text of a document, and it’ll send back the ten most similar documents to that. So I made, like, a thing that would sit in the right rail of our article template and take the current—like, published today Boston Globe article you’ve read, find the ten most similar stories from our archives, and it’s kind of amazing, how daily newspapers write the same story over and over again. Like, this was this January, where we’ve been getting a whole lot of snow. It turns out that Boston has had miserable winters before. And you can sort of quickly find like—here are all… Here’s this same story, written in 1960, and that’s what it looks like then. Here it is in 1930, and it’s a kind of interesting entry point to, like, suddenly… You’re reading a similar article, but in a kind of entirely different editorial context.
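To make the "ten most similar stories" idea concrete, here is a minimal sketch in plain Python. It is not the Globe's code and not how the search platform does it internally. It swaps in a simple TF-IDF weighting with cosine similarity, and the sample headlines, document ids, and function names are all invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs; crude, but tolerant of noisy OCR."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """docs maps id -> text. Returns (tf-idf vectors per doc, idf table)."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())
    n = len(docs)
    # Smoothed inverse document frequency: rarer terms weigh more.
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {
        doc_id: {t: tf * idf[t] for t, tf in c.items()}
        for doc_id, c in counts.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query_text, vectors, idf, k=10):
    """Return up to k archive ids ranked by similarity to today's article."""
    q_counts = Counter(tokenize(query_text))
    query = {t: tf * idf[t] for t, tf in q_counts.items() if t in idf}
    scored = sorted(
        ((cosine(query, vec), doc_id) for doc_id, vec in vectors.items()),
        reverse=True,
    )
    return [doc_id for score, doc_id in scored[:k] if score > 0]


if __name__ == "__main__":
    archive = {
        "1960-snow": "Blizzard buries Boston under heavy snow, commuters stranded",
        "1930-snow": "Heavy snow storm halts trains across the city",
        "1918-election": "Election returns delayed by paper rationing",
    }
    vectors, idf = build_index(archive)
    print(most_similar("Another heavy snow storm buries Boston commuters", vectors, idf))
```

A production search index would add stemming, stop words, and far better scaling. The point of the sketch is only that "similar" here means "shares distinctive words", which is why it still works on OCR that is too messy to read.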
Yeah. I work at the Wall Street Journal, and we added a hack to our graphics server about a year ago, and in response to that, IT wiped our entire server and we lost years of projects. So we’re trying to figure out how to resurrect those. And our strategy is to respond to requests. If a reader sees it, we’ll put it back online, and if someone is not the creator of that and they’re doing a portfolio piece, we’ll privatize that. Because of the security concerns, it’s such a time-consuming process, and our official directive has been to not resurrect them, because no one is going to see them and people don’t go to them, but that doesn’t seem like a great strategy. We don’t have a good solution for doing that, aside from spending a lot of weekends trying to do it. I don’t know if anybody has any suggestions or ideas.
Is it like… I guess we probably don’t want to dive too deep into exactly how the server configuration… So they still exist, but they’re not online? But you need to sort of, like, convert them? Like, change all of the URLs and stuff?
Yeah, we have a new security system, and when we submit the project, it’ll flag a bunch of problems that we have to solve. A lot of the people left and we’re not familiar with their code and it’s all old.
Can you not just sandbox that stuff? So it’s totally isolated? Different server, different domain? Put them up somewhere so you don’t… It’s sitting basically in an emulator, so you don’t have to do that?
Yeah, maybe. We did talk about that. I think there were concerns that it could be vulnerable again, because it’s connected to databases.
If it’s totally isolated, on a different network, different host, different domain name, whatever… Because there was a talk I saw a few years ago where they were talking about—how do you preserve software on floppy discs? You can get it off the floppy disc, but the machine doesn’t exist. What if you’re 20 years or 200 years in the future and you can’t get 110 volts anymore to turn the machine on? Does that work? You’re basically emulating things, and you might be nesting emulators. So you don’t have that extreme a problem. That might be one way to look at it. It’s just like… On a read-only… Have the thing boot off a CD-ROM so it couldn’t be infected if it wanted. I have a question as well. So our archives aren’t even that old. They’re only 7 years old. And we imported all the ESPN stuff, and if you look at some of the articles you’ll see broken images, and the reason for that is because images used to be hosted on this other domain called 538image.com, which is long gone, off the internet. The Wayback Machine has got most of this stuff, but there’s some random site where you can pay them $15 and they’ll do it for you. So someone ran it and archived it for me. And got a bunch of stuff. But getting those back into our current CMS in WordPress is… Slow. Because you have to figure out what the file name is, find it in my directory, upload it into WordPress, and edit that HTML and then pop it back in. So it’s like… Write a really clever script or find some Mechanical Turk-like service. Right now it’s on request. Where readers are like—hey, this page is missing. We’ll dig it up and stick it back in for them. But I would love to get it on so it’s as quick as possible. We just don’t have a good way of doing that, without hours of people time, which is not realistic.
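The "really clever script" for the broken images might start out something like the sketch below. It assumes the recovered files have already been uploaded somewhere under a single URL prefix and that only the file name matters. The upload prefix, the sample HTML, and the function name are hypothetical, and the regex is a shortcut that will miss unusual markup.

```python
import posixpath
import re
from urllib.parse import urlparse

DEAD_HOST = "538image.com"  # the defunct image host named in the discussion

# Matches the src attribute of an <img> tag; good enough for a first pass.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=["\'])([^"\']+)(["\'])', re.IGNORECASE)


def rewrite_images(html, recovered_names, new_base):
    """Point images on the dead host at re-uploaded copies.

    recovered_names: set of file names that were recovered and uploaded.
    new_base: URL prefix of the uploaded copies (hypothetical).
    Returns (rewritten_html, urls_still_missing).
    """
    missing = []

    def fix(match):
        url = match.group(2)
        host = urlparse(url).hostname or ""
        if host != DEAD_HOST and not host.endswith("." + DEAD_HOST):
            return match.group(0)  # not on the dead host; leave untouched
        name = posixpath.basename(urlparse(url).path)
        if name not in recovered_names:
            missing.append(url)  # report it so a person can follow up
            return match.group(0)
        return match.group(1) + new_base.rstrip("/") + "/" + name + match.group(3)

    return IMG_SRC.sub(fix, html), missing


if __name__ == "__main__":
    page = '<p><img src="http://www.538image.com/a/chart.png"></p>'
    print(rewrite_images(page, {"chart.png"}, "https://example.com/uploads/"))
```

Returning the list of still-missing URLs keeps the on-request workflow intact for whatever the script cannot match.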
Speaking of people time, has anyone in the room ever done a crowd sourced project, where you get your readers to help you tag and categorize large sections of your archive? No? I’m the only one who’s ever thought of this? That can’t possibly be the case.
Readers. I think the first major one was the Guardian. Probably six or seven years ago, now. Doing regular documents. But you could easily do that for archives, in creating the kind of metadata structure.
The New York Times did that with their Madison—
So Madison was their version.
And I know the New York Public Library has done a couple of really cool things with their building inspector, and their menus. Yes. The menus. Which is this sort of fascinating—I think people get excited because it’s so weird. To sort of read these old menus. And it’s also like this kind of highly structured data that suddenly becomes really a lot more interesting once you have it all in. Yeah.
Sorry for coming in late. So document labeling is, like, a huge thing. And topic analysis and things like that. And from a machine learning perspective, it just makes me tingle to think about what’s at your fingertips. Just the monetizing data session. And all we talked about was getting government data from somewhere else and messing with it and then selling it to someone else. And I thought we were going to talk about what you’re all sitting on, and how to, like, do good, find out amazing things with it. One problem that comes to mind is kind of like labeling. How we talk about something. So if I’m going to talk about a topic—so I’ve been talking about racism today. The words they use are going to be completely different than what they would have been 30 years ago. But what we have in the archives of various newspapers is a rolling history of the words that make up the topics, that are just like… Insanely valuable… From an academic standpoint, to just like in the world of machine learning, how we find out what we’re talking about, we need labels, and those are valuable. So if anyone is like—I love like—oh, look, this is cool. The vintage ad, and seriously a lot—not to belittle it at all. But there’s this whole other aspect to it, and if anyone thinks that’s really interesting, like I do, let’s talk.
Yeah. The New York Times has a kind of Google Ngram sort of tool… I saw this in the Globe archive.
Space. Okay. Never mind. I have to… Sorry. No, I don’t want Yuppie, beatnik, and all of these.
Can you just edit the URL?
All right. Cool. Okay. Yeah. So you see kind of transitions like this. Where some of this is driven by… I think in the case of newspapers, like a lot of usage is more constrained by their specific style books. But you do see these sorts of shifts happen. Yeah. I should probably put this in the Etherpad. Because it’s awesome.
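The usage-over-time chart being described boils down to counting how often a term appears per year, normalized by how much text that year contains. A minimal sketch, assuming the archive is available as (year, text) pairs; the sample sentences and the function name are made up for illustration.

```python
import re
from collections import Counter, defaultdict


def term_rate_by_year(articles, terms):
    """Return {term: {year: occurrences per 1,000 words}}.

    articles: iterable of (year, text) pairs.
    terms: single words to track, matched case-insensitively.
    """
    wanted = {t.lower() for t in terms}
    total_words = Counter()          # year -> number of words seen
    hits = defaultdict(Counter)      # term -> year -> occurrences
    for year, text in articles:
        words = re.findall(r"[a-z']+", text.lower())
        total_words[year] += len(words)
        for word in words:
            if word in wanted:
                hits[word][year] += 1
    return {
        term: {
            year: 1000.0 * hits[term][year] / total_words[year]
            for year in sorted(total_words)
            if total_words[year]
        }
        for term in wanted
    }


if __name__ == "__main__":
    sample = [
        (1958, "the beatnik poets met the beatnik crowd"),
        (1985, "a yuppie bought a condo"),
    ]
    print(term_rate_by_year(sample, ["beatnik", "yuppie"]))
```

Normalizing by total words matters because the volume of text varies from year to year, so raw counts would mostly track how much was published.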
That only covers things you can think of too, though.
Yeah, that’s sort of the thing. There are a whole bunch of ways you could cover that data that would let you find… Would let you find interesting things, rather than just trying things until you find something interesting.
Although I suspect that the access to the actual electronic text, for most major news organizations, is not the challenge, but rather the presentation that sort of puts it into context, maybe associates—especially images, like, are much less well archived. ProQuest and things have been doing full text index of news for a long time, but I’m interested in things that achieve what you were showing, with PDF renditions of pages in context with ads and other stories. I’m sad that that’s getting lost. Even just down to the level of, like, what were the other stories that were news when this was news? When everybody is running their pages on sites that have the dynamic read this, and all the other links are today’s stories, not the stories that were relevant when that story was news.
I was just going to say—this isn’t a solution with the digital archive, but I do a lot of family research and historical research, and I have been finding newspapers.com, the UI, and some of the features that they’re building in there, is really slick. I’m able to search for obits. They’ve got a fairly robust search. It finds it on the page, and then it lets you do a clip, so I have a feeling that the more and more people that are using it—we’re actually tagging and doing that for them in the background, but we’re clipping it, and then it saves it—you can save it directly to Ancestry.com and link it to the person that you’re researching, or you can clip it and save it as a PDF or download it, and it retains the context and the sourcing for it, as if you had cut it out and written the whole thing. So I would say the last two years, especially, they really seem to be doing a lot with how that works.
And their scans are much, much higher quality than what Google was doing even a year or two ago.
Interesting. That’s an interesting point too. Why can’t we kind of crowdsource this tagging and stuff? But the question is where do we find a group of people who are really into this, sort of enough to dig in and commit to finding things?
Yeah. And it turns out that there are these communities out there, that are interested. They’re interested in doing this.
How newspapers.com gets their content, by the way—
They usually contract directly with the organization.
Yeah, a lot of the older ones, they seem to be getting—the Post has a separate login. So now you put in their archives, and you get a separate pay for it. You can pay one fee and access a lot of the older papers under that fee, and then there’s a premium for some publications.
In terms of things not to do, that story up on— Digital First Media (inaudible) gave their papers to—
They contracted this company to scan all their old photos and stuff, and they sent them material and then he went bankrupt and didn’t send them back.
Actually, sort of on that note—a newspaper I used to work at, the Toronto Star, donated all their archival photos to the library. Which could be interesting if, like, there was a way to maybe do that with some of our technology. Like, donate all of our Flash interactives.
For someone else to maintain.
Well, yeah. It’s like… Can we split the work here, of figuring out how… I mean, there’s tools now, or whatever, to convert… There’s like Swiffy or whatever, which converts it to HTML, and stuff, but, like, these are, like, interactives that for a long time were how we told stories, that have just disappeared. And I don’t know if there are organizations or whatever that would be interested in taking that content and converting it, but trying to split that load.
That is an interesting thing, though. So we just did the same thing. We had photos and every paper we ever produced was microfilm, and we just donated it all to the library. But it wasn’t altruistic. It was—get this shit out of our building.
You take it!
Which is too bad.
But it doesn’t have to be a bad thing. Some organizations are set up to archive and to maintain. And newspapers aren’t necessarily that.
Depends on how the rights are negotiated too. If it is just a—get this the hell out of our building and use it for whatever you want and license it back to us, that creates other problems down the line too. Where you don’t then have the opportunity to monetize that, even if you still have access to putting your own stuff up.
Do you think that the newspaper powers that be that are selling off those archives realize the potential for monetization or research or whatever in giving them away?
More now than two years ago.
When I worked at the (inaudible), which is a small daily deep in the southern Delta of Arkansas… And they had filing cabinets in the back full of microfilm of that paper, going back into the 1850s. It was in pristine condition. Because no one had ever gone in and actually opened up the microfilm. And so I was going through and looking, and the history that they have there, through a lot of the Civil Rights era, and a lot of… Just the amount of stuff there, that an academic or a school would just love to have… We talked about it at the time. I tried to push them to monetize, and looked at a few options, and the cost at that time was just so, so high to convert all of that, that even though they might have seen the value, it was not… That wasn’t going to work.
If it’s a project that you’re going to be able to monetize, it’s going to be years and years and years before you monetize it.
Yeah, that’s a large up-front investment.
I think the economics of how you monetize the archive, of like—along the lines of format and what it would take to digitize certain types of formats—certainly born digital stuff, and maybe this encompasses other stuff too, but the journalism Institute at the University of Missouri has a guy who’s heading up the news archives there. His name is Edward McCain, and one of his areas of inquiry, he’s done some federal grants to work on this stuff—is how small newsrooms can be better equipped with a basic framework for archiving, as they produce—hence the born-digital focus—but he’s also looking into monetization models, particularly if he were to, for example, try to collectively gather a bunch of newsrooms together to have more sort of—even bartering power, leverage, with some of the aggregators, so they could then sell their archives against, collectively. I don’t know. It seems like something that people in this room might be interested in keeping tabs on. Because I’m not sure where it’ll go. It’s pretty young.
Yeah, it seems like part of the issue with these things is that if you have microfilm conversion or any of the expenses associated with this, the expenses are sort of very front loaded, and then the monetization is a kind of long process. And that’s not something that a lot of news organizations, like, can accommodate with their capital structure. And by capital structure, I mean… You know, like… It’s hard to come by a giant pile of money. Especially if you’re a small newspaper. And so it is… Yeah. It’s like… The question is like… How do you do that without giving away the farm?
Yeah. And I was going to ask—we’re talking a lot about how to get stuff unlocked. But I mean—I don’t know how other people—we have 25 years of just text digital archives. Like, not OCR. Just actual text. Tagged, organized by… We’re not really sure what to do with it. We’ve got it all. It’s all on our CMS. It’s all there by day. But other than throwing up surveys around it, we’re not sure how to actually monetize it.
I kind of wonder, for that stuff that’s… That does exist as text, if there’s some way to try to replicate the experience of the page facsimile, to sort of go from one article to—here’s everything else that was published on that day, here are the similar things, and the advertising isn’t being done in the same way, and the layout isn’t preserved, necessarily, but you do want to figure out a way to maintain that serendipity there. Because otherwise it’s just like one article.
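One cheap way to approximate that "everything else that was published on that day" view from a text-only archive is a date index. The sketch below assumes each record carries an id, a date, a headline, and optionally a page label. Those field names and the sample records are assumptions, since CMS exports vary.

```python
from collections import defaultdict


def build_day_index(articles):
    """Group article records by publication date."""
    by_day = defaultdict(list)
    for article in articles:
        by_day[article["date"]].append(article)
    return by_day


def published_alongside(article, by_day):
    """Everything else from the same day, ordered by page label, then headline."""
    same_day = by_day.get(article["date"], [])
    others = [a for a in same_day if a["id"] != article["id"]]
    return sorted(others, key=lambda a: (a.get("page", ""), a["headline"]))


if __name__ == "__main__":
    records = [
        {"id": 1, "date": "1960-03-04", "page": "A1", "headline": "Snow buries city"},
        {"id": 2, "date": "1960-03-04", "page": "B3", "headline": "Bruins win"},
        {"id": 3, "date": "1960-03-05", "page": "A1", "headline": "Cleanup begins"},
    ]
    index = build_day_index(records)
    print(published_alongside(records[0], index))
```

This does not bring back layout or advertising, but it does put a single article back among its neighbors.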
And some text archives—I don’t know if yours is one of them—but some of the ones I’ve seen in a few different newsrooms do include, like, page number. Not necessarily position on page, but you can look at everything that was on A1 or everything that was in the local section or everything that was… If you know sort of the indexing structure, you can sort of turn it into tags, by searching responsively for it, I guess.
And we’ve done all that. It’s just… It’s there. Some people are reading it. We’re making a little bit of money off of it. But we’re not ProQuest. We’re not distributors.com. We’re not a destination for anything archive-related, really. And we would need to do something really special in order to make that case. Oh, if I want to know what happened in the last 25 years, we might have to do research and dig into it. I’m not really sure what that is.
Are there partnerships with local historical societies or things like that? At least just to increase knowledge in a community that’s already intensely interested in that stuff, that you have all of this? And people aren’t using it enough. So how can we do it… How can you use it? What are you going to think of, that we haven’t at all, and likely won’t, because we can’t prioritize it?
I guess, like, moving away from, like, how do we preserve stuff, and more to, like, what we do with all this old stuff that we have… The guy… Ryan? I’m curious—how did you do the crowdsourcing your archives?
I haven’t done any. I was trying to ask. Because it’s such a huge undertaking. You need a really strong community. You need, like, a lot of different things to fall into place in that way.
So my company, DataMake, we actually have been working on an Open Source tool. It’s not ready and Open Source yet. But it will be soon. But it’s… Right now what it is—we’re working with the National Democratic Institute to transcribe election results. But we made this tool that’s kind of, like, really easily adaptable, to the point where you could make it serve up any images, and create any tasks. And I know—like ProPublica has done stuff like that. Like, crowdsourcing ad spending. But I guess how our thing is a little different from that is—it’s really flexible in creating tasks. Anyone can create a task on the fly. And I think, like, that could be of interest to journalists who are interested in, like, tagging, labeling, like, those kinds of tasks, that do need to be crowdsourced. Like, you probably can’t pay for someone to just sit there and do that all day long. But I mean, if you’re interested in that, you should come talk to me.
Now that we’re sort of talking about it, I’m kind of thinking about how things like Flash interactives can be similar to scanned newspapers, and that maybe kind of transcribing and tagging is a layer that can be built on top of these things that are sort of digital, but not as open as web stuff.
Has anyone really made an effort to archive the Flash interactives? What did you do?
Some of the simpler stuff. There’s a lot of, like, facial grid and that kind of stuff—we built a quick HTML template, powered by JSON or something. I think it was RSS. (inaudible) some of the harder stuff, like happens… They just kind of died. They’re hosted there. (inaudible) they’re going to be gone eventually. But simpler stuff that we wrote and everything else… If it’s worth it, we’ll redo it.
I wonder if for some of that stuff—if the best you can do is republish the data that underlies it, so that… Because ultimately, it’s like a render. And what’s important is the analysis, which probably exists in the copy that accompanies it, and the actual data—someone can reconstruct it. So maybe… You know, just being intentional about retaining, keeping metadata on your datasets, and then publishing those whenever possible—is the path forward. Instead of being… You know, trying to rerender that data.
Or even if you’re kind of not comfortable with publishing it directly, kind of time capsulizing the source material for something in a form that at some point in the future, if you want to reconstitute this, you kind of know where to go. I don’t know.
Carrying that forward, I think, for us, stripping out all formatting is kind of what we did before. Keeping the data at its most pure level. So if nothing else, we can just render and stylize a text file. And I think that’s what we’re just trying to do, for our future archiving, as we redesign, redesign, and redesign. Hopefully it’s plain enough that we can carry through on those redesigns. As we plan to move forward. I think we’ve definitely decided and a lot of people have—that you can no longer cut to old archive sites anymore. Those are just another mess to maintain. So if we can carry the data forward into every redesign, I think that’s the goal. We finally may have the technology and the structures in place to finally do that.
Do you run into problems there, where there’s some stuff that’s maybe trying to do something ambitious, interactive, just doesn’t make sense? When it’s stripped to text?
It depends. It all depends on whether people would like to see it again. And resources.
Yeah. Anybody else? Yeah?
What happens for database maps? Is there any reasonable way of integrating all the possibilities? Something that’s attached to something that you can’t really enumerate? We had this conversation last night about—what if you froze not just the page, not just all the assets, but the actual machine and the browser that it’s loaded on. But at some point you can’t spider it, because it’s not… You can’t enumerate all those paths.
I think some people have taken the approach of—if you can figure out how to serve it statically, with just piles and piles of JSON files for every sort of possible—every possible configuration, or kind of break it up into small enough chunks that you can figure out which file the data you want is, and do the processing in the browser—if you’re serving that statically anyway, which has the advantage… It gets you all the performance and what-not advantages… Then part of that work is done for you. But yeah. I think if it’s… If what you’re doing is necessarily dynamic, then that’s much harder to do.
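The "piles and piles of JSON files" idea can be sketched roughly as follows, assuming the data is a list of flat records and the interactive filters on a small, enumerable set of fields. The function name `bake_static_json` and the file naming scheme are assumptions made for illustration, and field values are assumed to be safe to use in file names.

```python
import itertools
import json
from pathlib import Path


def bake_static_json(rows, dimensions, out_dir):
    """Write one JSON file per combination of filter values.

    rows: list of dicts; dimensions: field names the interface filters on.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Every distinct value seen for each filterable field.
    values = [sorted({row[d] for row in rows}) for d in dimensions]
    written = []
    for combo in itertools.product(*values):
        matching = [
            row for row in rows
            if all(row[d] == v for d, v in zip(dimensions, combo))
        ]
        path = out_dir / ("_".join(str(v) for v in combo) + ".json")
        path.write_text(json.dumps(matching, sort_keys=True), encoding="utf-8")
        written.append(path)
    return written
```

The browser then only needs to work out which file name corresponds to the current filter selection. This only works when the combinations can be enumerated, which is exactly the limit raised in the question above.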
But in the spirit of memory and archives, we should remember that Open Source convened a hack day on digital news apps after NICAR in Baltimore, and I think there’s been a couple... Ben Welsh has been to a couple of different summits on this topic. So people are thinking about this. I don’t know what the best practices are. But there’s work on this happening.
Yeah. I worked on the news archiving thing after NICAR in Baltimore, and there was a blog post that (inaudible) wrote, around a conceptual framework for archiving news apps specifically. But it would totally apply to what people are talking about here today. And that’s totally: how do we keep this thing going? The stuff coming from, like, academia that’s federally funded too, I think can only help. Because any one news organization is so strapped when it comes to solving this. Someone from the Internet Archive was talking about running virtual boxes to preserve some of this stuff.
And I think at some point part of the strategy has to involve thinking more in a documentation mode than a kind of continuous operation mode. That, like, you are—there are always going to be aspects of—if you’re doing something that involves user interaction and hitting APIs and stuff, you’re not… Eventually the universe is going to become so large and so contingent that there’s no way to stuff it all in a box and treat it as a kind of archival object. And so you need to start thinking of—how can we document this as a performance? I mean, in ways that will provide some sense of what was going on. I mean, in the same way that you would document any other sort of performance, where you acknowledge that we could never put this back together, but we can capture it on film or capture it with photo or audio. And so those… Maybe not those direct same media, but that same kind of strategy can address some of that.
Some of it too sounds like… Sort of to go back to your 1918 example, we can’t reproduce phone calls that did or didn’t happen. We can’t reproduce some of those conditions, but we still sort of see the form of it, and can understand that. So maybe the thing isn’t for interactions in particular, preserving a way to interact with them. It’s a way to step through the experience of being able to have done that at the time. It doesn’t need to go in a database. We just need to see a representation of how it worked.
Sometimes that representation can be closer to what the desired operation was than the actual final…
And you can make GIFs. That’s pretty awesome.
GIFs, the cockroach of archival format.
We take a screenshot of our homepage every day as A1.
The Internet Archive is doing that for you.
Ben has… It’s called PastPages. It takes pictures every hour.
I run a service called Freezit. It does on-demand, point-in-time captures of web pages. It has an HTTP API that isn’t publicized, but if you come find me, you can use it. We do that on our web page every hour, actually. It doesn’t do assets. It just does the HTML. I developed it for (inaudible). Somebody published something and wanted to be able to refer back to it. They knew they were going to change it.
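As a rough illustration of an HTML-only, point-in-time capture, here is a minimal sketch. It is not how Freezit works, since that service's API is not described here. The function names, the file naming scheme, and the injectable `fetch` parameter are all assumptions.

```python
import hashlib
import urllib.request
from datetime import datetime, timezone
from pathlib import Path


def fetch_html(url):
    """Download the raw HTML bytes for a page (no images, CSS, or scripts)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def capture(url, out_dir, fetch=fetch_html, now=None):
    """Save a timestamped copy of a page's HTML and return the file path."""
    now = now or datetime.now(timezone.utc)
    body = fetch(url)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    # A short hash of the URL keeps captures of the same page grouped together.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{key}-{stamp}.html"
    path.write_bytes(body)
    return path
```

Running `capture` from an hourly scheduled job would approximate the "every hour" behavior mentioned above. Because assets are not saved, a capture like this records what the page said, not necessarily what it looked like.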
Yeah, that’s an interesting aspect too. When stories are published, and then edited, and kind of maintained as a thing, there’s always the kind of tension between a particular version of a story as being an object, and that story as a kind of continually updated thing. And I know there are tools like NewsDiffs that will let you see the evolution of a story over time. After years have gone by and it’s harder to track those things, how do you figure out how this story came to be in the state that you see it? Like, was this kind of a quick story, that then was superseded by another story that exists as a separate object? Or was this a story that was started, and then edited and edited as new facts came to light? And it can be hard to do that without kind of either scraping constantly or having special insight into the CMS. Any other stuff? Any other projects you’ve seen or things that we should put in this document?
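To make the "evolution of a story over time" idea concrete, here is a minimal sketch that compares two saved versions of a story's text using Python's standard `difflib`. It is not the implementation of any particular tool. The function name `story_diff` and the labels are hypothetical.

```python
import difflib


def story_diff(old_text, new_text, old_label="earlier", new_label="later"):
    """Return unified-diff lines between two versions of a story's text."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    ))
```

An empty result means the two captures are identical. A diff like this only exists if both versions were saved at the time, which is why the question of constant scraping or CMS access comes up.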
I just have a question. I’m curious how you guys are doing—you talked about enterprise-level, publication-level archiving. How are you guys doing personal archiving? If you get a byline somewhere? That’s the only problem I have not solved for myself.
That’s a really good question, and one that’s really kind of… This discussion of the Boston Phoenix. That’s the constituency that is kind of in some ways most directly noticing this, is—I wrote all of this stuff. Like, how do you maintain your personal archives? Does anybody…
This isn’t… I can’t point to a project and I can’t point to a successful personal archive, but I wonder… As a reporter, in a room with tech people, it would be cool if—you know, we’re at the point now where it’s kind of expected you’ll be able to get all your data from a service before you leave. And actually, I wonder if there would be a way for users in a CMS to be able to pull all their bylines. I realize that might also be done best externally. Just through a scraping.
So our stuff is in WordPress. And WordPress has an RSS feed for everything, including by author. So I don’t know about “scrape”, but you sort of get it. Yeah, I think pulling bylines in general seems like a really useful service. Someone should make a byline version of whatever you do. Fetch all of your shit. So when the Phoenix goes under, you’re just like: I only lost the last two articles, which I’ll grab off the rack.
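A minimal sketch of pulling bylines out of an author feed, assuming the feed is standard RSS 2.0 with `item`, `title`, `link`, and `pubDate` elements. On a WordPress site with pretty permalinks the author feed usually lives at a path like `/author/<name>/feed/`, but that path and the function name `bylines_from_rss` are assumptions here, and fetching the feed is left out.

```python
import xml.etree.ElementTree as ET


def bylines_from_rss(rss_xml):
    """Extract title, link, and publication date for each item in an RSS feed."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items
```

Feeds typically list only the most recent posts, so something like this would need to run regularly and append to a personal list, rather than being run once when the publication is about to go under.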
A really simple practical solution is Evernote with a Chrome plugin to scrape the page. It does a good job of getting the page. You don’t have to deal with links and stuff. As long as you trust them as a resource, it’s a great thing to just grab it every time it runs.
And it has a sort of import/export functionality which gives you a file, and you can assert that by the author. Not that I’ve tried this, but I think the likelihood, as a freelancer, if you ask the publication to export all your things for you, their willingness to do that and their knowledge of it… Whatever ownership issues might be… I think… I haven’t tried it, but I imagine there would be a lot of hangups, getting the publication to do that.
Right. It seems like the sort of thing that… That would be the kind of thing that would benefit from some sort of standardization. Like, it’s a thing that’s built into a CMS, that this publication can say—yes, we’re freelancer friendly in this particular way. We may not get you your checks on time, but we can give you a complete XML dump of everything—
It seems like somebody needs to do a Yelp for freelancers for publications. You can rate them like the Department of Health. A, B, C, D.
It would be really interesting. Everybody here—I’m sure all their publications send their papers to the Library of Congress. But they don’t really have a similar thing for websites. And even if they did, it’s still sort of locked into the Library of Congress. It would be really interesting to do basically a community—we’re all just going to syndicate all of our shit. A save action when you get published in CMS is going to send to this thing. And you can limit by rights and that sort of stuff, but make it easy.
The Flash Museum?
Yeah, pretty much.
That would be a great place to do it.
Data exchange. This is my question, actually. This is my day-to-day obsession and/or nightmare. How many folks here are, like, using any sort of standardized… For your digital representation of your articles—like, down into the even going to the article metadata level, but go further—paragraphs, images, that sort of stuff. Is this outside of most folks’ work area? I’m sort of dealing with this sort of thing. Because let’s say you do have a digital archive, going back. And let’s say you do want to send it to the Library of Congress or you want to share it with someone else.
Are you looking for SGML again?
Maybe. But you’ve got XML, you’ve got JSON, there’s no guarantee that your headline is the same as someone else’s title.
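The headline-versus-title mismatch can be illustrated with a small sketch that maps differing field names onto one shared set. The alias table below is invented for illustration and does not come from any published standard. `normalize_article` is a hypothetical name.

```python
# Hypothetical aliases: each shared field lists names other systems might use.
FIELD_ALIASES = {
    "headline": ("headline", "title", "hed"),
    "byline": ("byline", "author", "creator"),
    "published": ("published", "pubDate", "date"),
    "body": ("body", "content", "text"),
}


def normalize_article(record, aliases=FIELD_ALIASES):
    """Map one organization's article record onto the shared field names."""
    normalized = {}
    for field, candidates in aliases.items():
        normalized[field] = None
        for name in candidates:
            # Take the first alias that is present and non-empty.
            if record.get(name) not in (None, ""):
                normalized[field] = record[name]
                break
    return normalized
```

For example, `normalize_article({"title": "A", "author": "B"})` fills `headline` and `byline` and leaves the other fields as `None`. A table like this papers over naming differences only. It does nothing about fields that mean different things in different systems, which is the harder part of the exchange problem.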
There’s an industry news format, which is an XML format… I don’t know if it’s something that anyone has implemented.
This is where I’m going.
When I was a university student, I wrote for my college paper, and there was a Yahoo group mailing list where people talked about it.
That’s my curiosity. Is anybody using any of this stuff?
I’ve been really fascinated by this project that came out of the New York Times interactive news team. Maybe four or five months ago, called RTML, structured metadata. The idea is that you can inject it right into the article. I haven’t solved this problem myself, but I have these delusions that I might be able to RTML all my text documents if they have metadata inquiries.
There are structured formats for people and places.
The experience I’ve had is that getting people to adopt structured data is very, very difficult, until the moment that Google or Facebook needs that data or… Google, Facebook, Twitter. If they need that data to show up for your things to draw traffic from their platforms, then boy howdy. That data gets built and tested, and it flows like water.
To the personal archiving, I think that’s a really interesting question, on a bunch of levels. And it’s like… What would have happened before the digital age? You would have written a story, bought the paper it came out of, used scissors to cut it out and paste it in a book that you had for the purpose. I think we’re spoiled. I have a Xanga from my late adolescence. And there’s a couple of things in there that are really angsty and embarrassing, but I really like them, and so I have not brought myself to either shut it down, which I should really, really, really do, or they want $5 for the whole thing with the comments. You know? But I know that… It’s on me to… If I’m… And the easiest way to do it is as you’re producing. The simplest way is as you’re making stuff, think about it as you go forward. Which I should do. So they keep reminding me again. It’s my responsibility. It’s nobody’s responsibility to dump me the stuff I made and gave to them.
I’m actually really great about it. I have versions of alternate edits and all that, and I save all that. But there’s a point at which, as a freelancer, you lose control, and you send it in, and it’s in the editor’s hands, and then it goes to copy editing and production. That last little step—what I’m really bad about is going back and archiving the version that is actually published. Because at some point, it’s out of your hands.
The clipping analogy.
That was a five-minute warning.
Yeah. The Xanga example that you mentioned—there are some things that are kind of embarrassing.
I didn’t say that.
I hesitate to bring this up right after we get our five-minute warning, but there’s also the question—maybe the ethical question—of: What if our archives are too good? If most papers run police blotters that record arrests, we don’t necessarily then print… It doesn’t make news every time those arrests don’t actually lead to any criminal prosecution, but to the extent that those are published as news and then put on the web and exist forever, that may be the number one Google result for somebody’s name, pretty much, sort of indefinitely.
Preserving versions, and you’re preserving errors.
Right, right. And how do you contextualize all this data?
Google and the EU are having fun with this right now.
This sort of thing is like… It’s very difficult to sort of add that context at scale. Because you’re always going to miss stuff.
Yeah. I was just going to say… I had two sort of perspectives on this. A good friend of mine, who’s a prosecutor, and one of the things she said is—you shouldn’t take that stuff down. That’s part of getting arrested, is your fucking name goes in the newspaper. But that’s one of the things that you get stuck with. Which I thought was sort of interesting.
But at the point that people are falsely arrested—
Yeah. So the New York Times’s policy, which I think is a pretty good one, I emailed them about this, because we went through this a while back—if it’s a misdemeanor, forget about it. You got caught with a dime bag of weed. Nobody gives a shit. Any reasonable circumstance—who cares. Felonies, basically it’s on the person to provide an actual official documentation that that charge was dropped, and if so, they’ll note it.
Which is where expungements get a little bit weird too. There are legal processes for entire classes of crime where you can go back five, ten years later, and say: I’m a better person. I fulfilled all this stuff. And I want to get my record expunged. This never happened, basically. And from the legal system’s perspective, it literally never happened. It can never be brought up in future charges, future crimes, anything like that. It literally gets wiped away. But there it is, on the internet. You got arrested, convicted, whatever, and it got covered. Some newspapers are okay about it: if you can document that, we will take it down. We will sort of respect that process and say it never happened. But not everybody does that. Some of them will add a note, like you said, saying... I mean, it sort of never happened, but it kind of did.
I mean, in some ways, the difficulty of archive access… Allowed us to not have to confront these issues, because you end up with a situation where, like, it still does exist. We’re not removing things from the archive. But there’s a difference between existing on pristine, never-reviewed microfilm, and being sort of instantly accessible. And as we consider moving things from that microfilm to the instant accessibility, like… You know. Things change.
So the one that we started writing to lately, which is even harder, is paid notices. Things like marriage announcements, birth announcements, obituaries, that sort of stuff. We weren’t really clear—it wasn’t actually a problem if you put an obituary in the news ten years ago. But now some of them are coming out and saying—I got divorced or I got whatever—and there wasn’t a clear policy at that time. It’s easy now. We can put up a TOS that this is it. Put it in. It’s archived. But… It wasn’t back then. These people paid us hundreds of dollars. We’re going to turn around and give them the middle finger?
Cool. Well, I think we’re just about done here. Thanks, everybody, for coming and sharing, and I know you could have been taking a nap in your hotel room, and I appreciate you coming here and being part of this.