Stealing Ideas

open access — kpw @ July 19, 2011

Reading about Aaron Swartz’s most recent run-in with the law dredged up all kinds of feelings. I’m a long-time admirer of his work and was obviously saddened to hear of his troubles. At the same time, reading the indictment I was surprised by the seriousness of the charges and evidence against him.

I was also reminded of my own attempts at similar work, collecting and analyzing journal articles, patents, and various forms of metadata. I’ve lost count of how many hours I’ve spent sitting in basements of academic buildings, breaking federal laws in the pursuit of answers. And I was reminded of my colleagues who still spend their days painstakingly scraping data off the web–sometimes legally, sometimes not–in the name of academic inquiry.

None of us want to break the law. It’s simply that we don’t have a choice.

The mechanisms for sharing academic discourse are broken. They barely even function as systems for connecting interested parties within existing disciplines. Ask just about anyone who spends their time writing or consuming scholarly work and you will hear a litany of complaints about how poorly suited the academic publishing industry is to modern day collaboration.

I’ve spent most of my professional career just outside of the academy but have seen the failures of these systems first hand. I formed my opinion on the matter as an undergraduate assistant in a major neuroscience laboratory–building publishing tools to help the lab’s director break copyright law.

His work regularly appeared in and on the cover of major journals. Yet he was in a field that was moving faster than the journals could help facilitate. He took matters into his own hands by publishing the articles on the laboratory’s site, almost always violating the licensing terms of his own work (rights now held by Elsevier or AAAS, not the author). I asked about the legality of what we were doing and was told not to worry. If the journals didn’t like him bending or breaking the law he’d publish elsewhere and it would be their loss.

As far as I know the publishers understood the bargain and never complained. Unfortunately this sort of non-aggression pact is available only to a select few. Your average untenured neuroscience professor doesn’t have the luxury of pissing off Science or Nature.

But for those of us interested in meta-analysis–these questions about questions that people like Aaron and myself are forced to pursue from basement wiring cabinets, scraping large swaths of text from the web–the hobbled and clunky tools for downloading PDFs through research library proxy servers, one poorly OCR’ed page at a time, simply do not work.

If you want to understand the collaborative nature of a specific field or follow the trajectory of an idea across disciplines a reference librarian can’t help you. Instead, you have to become a felon.

What’s missing from the news articles about Aaron’s arrest is a realization that the methods of collection and analysis he’s used are exactly what makes companies like Google valuable to its shareholders and its users. The difference is that Google can throw the weight of its name behind its scrapers, just as my former boss used his name to set the terms with those publishing his work.

Aaron and the other “hackers and thieves” like him don’t have that option. But their work is no less important–they are collecting and organizing information in order to ask deep questions about the nature of academic discourse. Unfortunately for most, the structure of the publishing industry and the laws that surround creative works prevent these questions from being asked, at least without taking sometimes substantial risks.

It shouldn’t and doesn’t have to be this way but there are at least two main issues holding back progress:

First, as a society we’ve forgotten the Jeffersonian ideal that intellectual property laws should enable and encourage the spread of ideas and creative pursuits rather than lock them away. Many have fought for a return to this vision; however, the prospects for such change seem dim. If there’s anywhere this idea should still have a fighting chance, it’s within the walls of universities.

However, it is this most basic failure, our inability to create a rational set of intellectual property laws, that necessitates the creation of things like JSTOR. We shouldn’t need it in the first place. Nor should anyone curious enough to ask questions as big as Aaron’s ever need to break JSTOR or the law to find answers.

We should offer people with big questions more than a trip to jail–we should celebrate their willingness to explore our collective intellectual heritage. Universities should take the lead in building the platforms needed to support such inquiry. It is an embarrassment that JSTOR is the best the academy has to offer.

But this leads to the second and perhaps more fundamental problem: journals are only partly about communicating. They’re also about controlling academic discourse. The editorial power held by journals and those that run them (quite different from those that own them) shapes most academic careers and the very structure of disciplines. It’s almost certain that pursuing new forms of collaboration and communication will reshape these power structures–sometimes subtly, sometimes not. That’s the nature of change.

Change, however, doesn’t come easily within academic communities. It should be no surprise that universities have done far more to free the content of their courses than they have the content of their publications. The former has economic value; the latter, however, holds the keys to the academy itself.

This conservatism is at least in part responsible for why, despite the new possibilities offered by the web, most scholarly work is still published as though it were 1580. It’s also responsible for allowing a handful of powerful corporations to gate access to this knowledge and make authors pay for the privilege of signing away rights to their own work.

Sir Tim Berners-Lee invented the web to solve this very problem. Twenty years later it allows us to do almost everything imaginable–except get unfettered access to scholarly communication.

It is not technology that holds us back.

Aaron’s arrest should be a wake up call to universities–evidence of how fundamentally broken this core piece of their architecture remains despite decades of progress in advancing communication and collaboration.

The MIT staff who called the police* would have been served better by calling the chancellor to ask, “How have we created a system that forces 24 year-olds to sneak around in the basement, hiding hard-drives in closets in order to ask basic and important questions about our work? Can’t we do better?”


Update: I’m not OK with scraping JSTOR or any other copyrighted data source for the purpose of re-distribution. Some, including federal prosecutors, have made the claim that’s what Aaron planned to do with the data. Others have pointed to his past research analyzing influence in academic writing. I have no insight into his real intentions; however, I do believe the latter goal is important and likely not possible without breaking the kinds of laws discussed above.

Also, it’s true that JSTOR does offer a bulk interface for research users. That interface didn’t exist when I was doing my work. But it’s not clear it would have made any difference. There are many, many research applications, including mine, that are still not possible with approved means of accessing data. Giving researchers a straw is not a useful response to requests for open and complete access. We shouldn’t settle for less.


* For those interested in the blow by blow: since writing this post I’ve learned that no one at MIT called the FBI–in fact it’s not clear the FBI was ever involved. As I now understand it, the local police were called to investigate a break-in. Because this involved network equipment the Secret Service were called by the Cambridge police. After that the investigation took on a life of its own outside the MIT campus.


A version of this essay appeared on Reuters MediaFile under the title “The difference between Google and Aaron Swartz.”


  1. This is really good. Thanks for writing it. You indicted the system that prevented Aaron from doing his work. And you are right on.

    One thing, though: Aaron is not being charged with IP infringements. So changing IP would not have helped him. (It’s a good idea, but not because of this case). The legal question is much more about the appropriateness of such draconian penalties for rather pedestrian work-arounds.

    Conservatism does not keep academic publishing as it is (and it is changing rather quickly). Contracts and monopoly market power do.

    Comment by Siva Vaidhyanathan — July 20, 2011 @ 3:54 pm
  2. Siva, Thanks for the kind comment.

    You’re right that the charges facing Aaron aren’t about copyright infringement. However, the point I hoped to make above was that he (and others like him) wouldn’t need to go circumventing JSTOR’s security features and sneaking into wiring cabinets if it weren’t for the unfortunate copyright landscape around academic writing.

    There are two problems. First, copyright itself, which it seems won’t be fixed anytime soon. Second, academia’s willingness to participate in the current copyright regime.

    The latter is fixable today. All it takes are a couple of letters from heads of major universities to their faculty saying, “From this day forward all of your work will be available in the public domain. If any publishers prevent you from claiming these rights, have them call me.”

    Comment by kpw — July 20, 2011 @ 4:29 pm
  3. Best post I’ve read about this subject. Thanks a lot, Kevin

    Comment by Mark Meldola — July 20, 2011 @ 5:03 pm
  4. Very thoughtful article! You’ve made very valid comments and I wish universities would follow your suggestions.

    Comment by sws — July 21, 2011 @ 12:35 am
  5. Great stuff, and so similar to a G+ post I made that you might be interested:

    “Umpteen years ago (ok, 2000), I spent some time developing a system of harvesting key content, processing it through my tools, extracting the key significant phrases from that material, and doing interesting things with the results — because it wasn’t being done appropriately, to my mind.

    Was that criminal? I didn’t think so at the time. I thought I was experimenting, and exploring.

    In fact, I thought I was developing something that would benefit the publisher, the author, and the likely reader. I thought it was a win all the way around.”…

    and I agree, it’s about power and control:

    “I suspect this is a response to Anonymous, Lulzsec, and similar extremists — and an attempt to demonstrate that the powers-that-be are in charge, after all.”

    Really glad you wrote what you did (and so well!) — and especially glad that it’s in Reuters, being read.

    Comment by Michael Jensen — July 21, 2011 @ 8:20 am
  6. Thanks for this important perspective. There are several layers of economic players – the universities, the publishers, and of course JSTOR itself.

    Together they created the usual extortion racket where something more “Jeffersonian” was called for. The moral imagination of US academia is for shit.

    Comment by Tom Matrullo — July 21, 2011 @ 9:17 am
  7. [...] #2: Via Yglesias, an excellent piece by Kevin Webb exploring similar themes. This entry was posted in Commons, Copyleft, Law and tagged academia, [...]

  8. Excellent commentary. It is depressing to me to see how diminished is the ideal of the “free flow of ideas” in the academic world today. As you point out, this problem is especially frustrating, because technology offers the means to significantly enhance the flow of ideas.

    Comment by Jane Hadley — July 21, 2011 @ 5:43 pm
  9. Assuming the world has not gone entirely bonkers, one can expect that charges against Aaron Swartz will be dropped by JSTOR once it becomes clear that he was interested in data-mining what he downloaded, not redistributing it.

    Breaking into a locked room and computer at MIT is not ethical except if something far more important is at stake — but Swartz will be pardoned for that peccadillo too.

    But access to retroactively scanned journal article databases is definitely not the same sort of “primal right” as access to current, born-digital articles, with access provided by their authors.

    Nor is author give-away the same thing as user rip-off:

    I hope the JSTOR downloading caper will not be conflated or even associated with the worldwide efforts by researchers to give and get open access to one another’s refereed research.

    Stevan Harnad

    Comment by Stevan Harnad — July 21, 2011 @ 8:46 pm
  10. I have several issues with the post. Initially, I have been a user of JSTOR in the past, as well as of journals published by for-profit (e.g., Elsevier) and non-profit (e.g., AMS) organizations.

    a) “people like Aaron and myself are *forced* to pursue from basement wiring cabinets, scraping large swaths of text from the web–the hobbled and clunky tools for downloading PDFs”

    “Nor should anyone curious enough to ask questions as big as Aaron’s ever *need* to break JSTOR or the law to find answers.”

    “How have we created a system that *forces* 24 year-olds to sneak around in the basement”

    The simple fact that you are interested in something — even if something objectively interesting — does not provide you with the right or duty to pursue it. You’re interested in meta-analysis of academic papers, or NLP or scientific language? That’s great; however this doesn’t give you the right to infringe on JSTOR’s terms and conditions, which are spelled out for you every time you download a pdf from their site. They are quite simple and clear; and if you disagree with them, feel free to sue them or to petition our lawmakers. Or, feel free to scan the journals that JSTOR spent literally thousands of hours scanning. Your interest is not sufficient reason to violate the terms of agreement between you and JSTOR. I could argue that is not a sufficient reason even from a utilitarian-consequentialist point of view, but the procedural one should be enough.

    b) “the methods of collection and analysis he’s used are exactly what makes companies like Google valuable to its shareholders and its users”

    Not at all; what makes Google so valuable is the analysis and classification of information that is freely available. If you are referring to Google Books (which makes a very small part of Google’s equity), you are not making a like comparison: google scanned directly the sources it’s classifying, so it is closer to JSTOR than to a final user like you or Swartz.

    c) “None of us want to break the law. It’s simply that we don’t have a choice.”

    “*We* should offer people with big questions more than a trip to jail–we should celebrate their willingness to explore our collective intellectual heritage. Universities should take the lead in building the platforms needed to support such inquiry. It is an embarrassment that JSTOR is the best the academy has to offer.”

    When you say “we don’t have a choice”, who do you mean by “we”? I assume not JSTOR employees nor the publishing companies. Not even the very large majority of users that is able to find information in JSTOR quite easily through its basic and advanced search features. I probably have a few hundred JSTOR-downloaded papers — more than I will ever be able to study carefully — all for personal use and well within JSTOR’s T&Cs. I think by “we” you mean “I and my interest group”, without recognizing that this may be different from the public interest and the interest of those who published and digitized the papers in the first place. So, please, lose the “we”.

    d) “But this leads to the second and perhaps more fundamental problem: journals are only partly about communicating. They’re also about controlling academic discourse. The editorial power held by journals and those that run them (quite different from those that own them) shapes most academic careers and the very structure of disciplines.”

    I’ll leave aside the pretentious and repetitive use of “academic discourse”. You are conflating two different issues here. One is whether publishers have property rights on their product. Another is whether the current arrangement is kept in place by groups interested in preserving certain “power structures”. Swartz’s arrest has to do very prosaically with theft, independently from what his final goal was. And these laws and agreements predate the current academic-publishing “power structure”. Regarding the latter, it has changed a lot in the past 10 years. Several journals and organizations have agreed to leave copyright with the authors, and I don’t know of a single case of a paper whose preprint has been published on arXiv, SSRN, or on the author’s web page, and has been removed afterwards on request of the publisher. Neither do I know of a single case of an untenured professor who has been denied tenure for making his preprints publicly available.

    Comment by gappy3000 — July 21, 2011 @ 9:57 pm
  11. Gappy3000,

    Thanks for the comment. I think we disagree on some fundamentals but you raise some interesting questions that deserve clarification:

    a) I make no argument about the existence or validity of current law or license terms. I’m simply pointing out their brokenness and lack of consideration to the reality of how this information is or could be used. It’s important to understand this is really about market failure–current licensing frameworks don’t serve all user needs.

    What’s so sad about this particular failure is that it doesn’t need to be like this, nor do many (I assume most, but don’t have any hard numbers) of the folks involved in creating the content want things to work like this. It results in two huge problems:

    First, it results in a substantial tax on content creators and consumers. I just started digging into the economics of existing commercial publishing channels vs. open access models. I’ll post something discussing this in more detail but it looks like we (you included) are currently paying about $4 billion more per year in the U.S. than we’d need to under other publishing models. If you include international the number is well into the tens of billions. That’s just on publishing and subscription fees and is not even counting the money that’s being spent on information retrieval tools.

    Second, the retrieval tools out today suck and it’s almost impossible for new players to enter the space and build new approaches–the costs are prohibitive for any startup. But this is a sector that’s worth additional tens of billions, and ineffective tools lower productivity and collaboration and inhibit progress, which is also a loss that *we* all pay (you included, whether you like to admit it or not).

    What we’re talking about is a pretty sizable dead-weight loss, somewhere well into the tens of billions annually. Things like that piss me off and it’s time for it to be fixed. I don’t care if it hurts some incumbent business model.

    I was going to give a long explanation on why it’s broken but you made the point for me in c). The truth is that this whole thing works just well enough for most users, apparently you included, that no one does anything serious about fixing it. Also, that $4 billion subscription tax is paid by your university’s library or your employer should you be lucky enough to work somewhere with journal access. You never see the price and don’t care how much it costs. Your librarians do but they don’t get a say in how publishing works. People like me who aren’t even allowed to be customers don’t get a say at all.

    So we end up with a giant collective action cluster fuck. And a broken publishing system that never gets fixed.

    Maybe you’re OK with that–I’m not.

    b) Actually, I was referring to Google Scholar, which scrapes JSTOR among many other non-freely available sources. Google, like the very small number of other journal search providers (Thomson and Elsevier being the two largest), has privileged access to the data needed to offer these services. You or I can’t get it. Also, I don’t think they pay for it. I had the chance to talk with the Google Scholar team a few years back and learned that, as with the Google News folks, publishers and aggregators like JSTOR give them access because they drive traffic.

    A fair business deal for sure, but a far cry from your claim that Google builds its services on freely available data.

    And don’t even get me started on Elsevier, which is both a publisher and one of the largest aggregator/search providers. They get to take a big cut on both ends.

    c) I’m happy that you’ve downloaded your hundred or so papers and are done. That doesn’t solve my problem. Nor does it solve the problems of academics who have to keep on top of their fields and don’t get to download a few PDFs and declare victory. They need continued access throughout their careers. They also need good tools to help find those articles. Hopefully they’re lucky enough to work somewhere that has libraries with enough funds to pay the subscription fees for the journals and the search tools–many aren’t.

    d) My old employer wasn’t posting pre-prints. Pre-prints are great and I love the arXiv but it’s a partial, second-tier solution. Also, many academics aren’t comfortable sharing pre-review articles publicly. Also, pre-print articles aren’t citable in the same way, which matters to folks trying to grease the wheels for their peers to cite their work–which matters because of the role of impact metrics in job performance (power structures!).

    Comment by kpw — July 23, 2011 @ 5:51 pm
  12. a) I am glad that the tone has shifted from “*we* need/should infringe the property rights of JSTOR to do our Very Important Research” to “the current arrangement is inefficient and we must change it”. I don’t disagree on the inefficiency or on the fact that Elsevier is evil. In fact I know quite a few professors who won’t publish on Elsevier journals for this very reason. Quite simply it has nothing to do with the analysis of large corpora of research articles.

    b) I never saw the JSTOR header page on a google book search. In google scholar, I think they do index every possible journal and source, but as you said, they have an agreement in place there, and it is mutually beneficial to google, the publishers and the user. It’s not the same thing as Swartz’ actions.

    c) I’d consider this slightly offensive, had it come from a respected academic. To make things concrete, I have about 7 GB of papers downloaded from JSTOR and other publishing sources, as well as the authors’ web pages, over the course of nearly 20 years of research. That’s a few thousand papers, all downloaded legally and used within the Ts & Cs of the publishers. I like to think that I stay on top of the research of my field; a comprehensive monograph in my field *very* rarely cites more than 1000 references. And I would be very skeptical of anyone claiming she has read carefully more than 2000-3000 papers in her research career. I would say my experience is quite representative of someone engaged in scientific research (say, engineering/math). My colleagues seem to think more or less the same. I believe that you and Aaron Swartz are smart guys, but he at least doesn’t strike me as a scholar, and his reported usage pattern is vastly different from that of any researcher I know. So please, spare me the condescension of “downloading a few PDFs and declare victory”.

    d) you’re totally off on this one, at least in engineering, math and most economics. First, you can get the preprint and quote the journal paper. Second, it’s getting harder to find researchers who *don’t* publish reprints. The only consistent case I am encountering is that of some NBER reports, but those aren’t expensive to buy in electronic form. Third, increasingly these preprints are *better* than the published version, not only typographically but also because of added proofs, code and data to reproduce the paper results. In this respect, I am optimistic, because I think it will become more and more common to use a publisher only to get the “credentialing” part, but not the information diffusion part. If this happens, the current publishing business model could become almost irrelevant, as soon as someone manages to tweak the stackexchange model to generate academic credentials.

    But I am digressing. The main thrust of my initial comment was that, while Swartz’s motives may be understandable, his actions are indefensible. And that the mention of power structures in this context really sounds like second-hand Foucault.

    Comment by Gappy3000 — July 24, 2011 @ 1:40 am
  13. Regarding c), apologies for the condescension, however I’m not sure your perspective here is universally true.

    For example, in life sciences (in truth, the only field where I have first hand experience in a university setting) the volume of papers consumed is quite high. These papers serve as a mechanism for sharing experimental methods, and researchers draw from a wide range of sources throughout the daily course of work.

    In this case it’s not always about carefully reading something and integrating the theoretical implications into one’s own understanding. It might be about reading something from a completely different field and extracting useful techniques or experimental parameters.

    Granted this is quite different from math or economics where I imagine your take is more correct.

    Either way, I find it unconvincing to say that just because you and many others are able to get access to the papers you need we should ignore the structural problems with how this work is created and disseminated.

    d) I love pre-prints and agree they’re becoming increasingly prevalent. However, I think you might have a skewed perspective on just how widely available they are. Their use differs substantially by field. For example, see here:

    I find the theorist/experimentalist debate in the comment thread particularly interesting. Also, plenty more secondhand Foucault in there. These power structures do matter, particularly in bigger and more competitive fields.

    Further, pre-print publications are still grounds for disqualification in some journals (though, thankfully, decreasingly so).

    This is by no means a settled issue nor is it as simple as you make it sound.

    That said, I’d love to see the kind of solution you propose with decoupling communication and credentialing. But at least for a lot of fields it seems we’re pretty far from that ideal.

    Disagreements aside, thanks for the spirited debate.

    Comment by kpw — July 24, 2011 @ 11:09 am
  14. OK, I won’t speak from first-hand experience, but from close-second one: my wife (and several of our friends) is a professor in a top 20 US medical school, in a data-heavy field (computational chemistry). I asked her, and the inadequacy of the publishing system (access, cost etc.) was not a concern. The only thing that really upset her once was that a publisher wanted to charge her $500 for the pdf of a review she had written for free (she did not).

    The use case you are presenting (“It might be about reading something from a completely different field and extracting useful techniques or experimental parameters”) seems decidedly *not* the case of the vast majority of researchers. To reiterate: you are conflating two very distinct issues: i) the access costs to sectoral published material for researchers; ii) the access costs to all published material for information extraction from large corpora of texts, which is your research project. ii) doesn’t strike me as a representative concern of the overwhelming majority of researchers. So, when you mention “structural problems”, I have to ask, with respect to what? e-journals are expensive, and their costs may be reduced with some policy measure. But whatever the cost reduction may be for a university library, they will always be too expensive for anyone interested in retrieving most of them. That doesn’t seem like a structural problem, it seems more like your problem. And let me add a bit of historical perspective here. I don’t know how long you have been around, but I was doing research in the pre-web days, when you would be charged per hour for access to journal DBs, and the only way to get a journal article was to physically copy it. And this was in theoretical Physics, which was an early adopter of technologies (Tim Berners-Lee, anyone?). We are not living in utopia, but surely we have come a long way. I’ll certainly be interested in reading your analysis that shows “tens of Billions of dead-weight loss”.

    And by the way, there’s not much BS about power structures or academic discourse in the comments to the Science Blogs post you linked to. Just very practical concerns that are essentially procedural in nature (or technology-path-related, if I want to sound smart and pretend I have read Rosenberg’s work).

    Comment by gappy3000 — July 24, 2011 @ 12:13 pm
  15. [...] Stealing Ideas — Structural [...]

    Pingback by 7 Eyeball Worthy Links of the Week #10 | Lendio — August 10, 2011 @ 4:31 pm


This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.
(c) 2015 Structural Knowledge | powered by WordPress with Barecity