Stealing Ideas

Uncategorized — kpw @ July 19, 2011

Reading about Aaron Swartz’s most recent run-in with the law dredged up all kinds of feelings. I’m a long-time admirer of his work and was obviously saddened to hear of his troubles. At the same time, reading the indictment I was surprised by the seriousness of the charges and evidence against him.

I was also reminded of my own attempts at similar work, collecting and analyzing journal articles, patents, and various forms of metadata. I’ve lost count of how many hours I’ve spent sitting in basements of academic buildings, breaking federal laws in the pursuit of answers. And I was reminded of my colleagues who still spend their days painstakingly scraping data off the web–sometimes legally, sometimes not–in the name of academic inquiry.

None of us want to break the law. It’s simply that we don’t have a choice.

The mechanisms for sharing academic discourse are broken. They barely even function as systems for connecting interested parties within existing disciplines. Ask just about anyone who spends their time writing or consuming scholarly work and you will hear a litany of complaints about how poorly suited the academic publishing industry is to modern day collaboration.

I’ve spent most of my professional career just outside of the academy but have seen the failures of these systems first hand. I formed my opinion on the matter as an undergraduate assistant in a major neuroscience laboratory–building publishing tools to help the lab’s director break copyright law.

His work regularly appeared in and on the cover of major journals. Yet he was in a field that was moving faster than the journals could help facilitate. He took matters into his own hands by publishing the articles on the laboratory’s site, almost always violating the licensing terms of his own work (rights now held by Elsevier or AAAS, not the author). I asked about the legality of what we were doing and was told not to worry. If the journals didn’t like him bending or breaking the law he’d publish elsewhere and it would be their loss.

As far as I know the publishers understood the bargain and never complained. Unfortunately this sort of non-aggression pact is available only to a select few. Your average untenured neuroscience professor doesn’t have the luxury of pissing off Science or Nature.

But for those of us interested in meta-analysis–these questions about questions that people like Aaron and myself are forced to pursue from basement wiring cabinets, scraping large swaths of text from the web–the hobbled and clunky tools for downloading PDFs through research library proxy servers, one poorly OCR’ed page at a time, simply do not work.

If you want to understand the collaborative nature of a specific field or follow the trajectory of an idea across disciplines a reference librarian can’t help you. Instead, you have to become a felon.

What’s missing from the news articles about Aaron’s arrest is a realization that the methods of collection and analysis he’s used are exactly what makes companies like Google valuable to their shareholders and their users. The difference is that Google can throw the weight of its name behind its scrapers, just as my former boss used his name to set the terms with those publishing his work.

Aaron and the other “hackers and thieves” like him don’t have that option. But their work is no less important–they are collecting and organizing information in order to ask deep questions about the nature of academic discourse. Unfortunately for most, the structure of the publishing industry and the laws that surround creative works prevent these questions from being asked, at least without taking sometimes substantial risks.

It shouldn’t and doesn’t have to be this way but there are at least two main issues holding back progress:

First, as a society we’ve forgotten the Jeffersonian ideal that intellectual property laws should enable and encourage the spread of ideas and creative pursuits rather than lock them away. Many have fought for a return to this vision; however, the prospects for such change seem dim. If there’s anywhere this idea should still have a fighting chance, it’s within the walls of universities.

However, it is this most basic failure, our inability to create a rational set of intellectual property laws, that necessitates the creation of things like JSTOR. We shouldn’t need it in the first place. Nor should anyone curious enough to ask questions as big as Aaron’s ever need to break JSTOR or the law to find answers.

We should offer people with big questions more than a trip to jail–we should celebrate their willingness to explore our collective intellectual heritage. Universities should take the lead in building the platforms needed to support such inquiry. It is an embarrassment that JSTOR is the best the academy has to offer.

But this leads to the second and perhaps more fundamental problem: journals are only partly about communicating. They’re also about controlling academic discourse. The editorial power held by journals and those that run them (quite different from those that own them) shapes most academic careers and the very structure of disciplines. It’s almost certain that pursuing new forms of collaboration and communication will reshape these power structures–sometimes subtly, sometimes not. That’s the nature of change.

Change, however, doesn’t come easily within academic communities. It should be no surprise that universities have done far more to free the content of their courses than they have the content of their publications. The former has economic value; the latter, however, holds the keys to the academy itself.

This conservatism is at least in part responsible for why, despite the new possibilities offered by the web, most scholarly work is still published as though it were 1580. It’s also responsible for allowing a handful of powerful corporations to gate access to this knowledge and make authors pay for the privilege of signing away rights to their own work.

Sir Tim Berners-Lee invented the web to solve this very problem. Twenty years later it allows us to do almost everything imaginable–except get unfettered access to scholarly communication.

It is not technology that holds us back.

Aaron’s arrest should be a wake up call to universities–evidence of how fundamentally broken this core piece of their architecture remains despite decades of progress in advancing communication and collaboration.

The MIT staff who called the police* would have been better served by calling the chancellor to ask, “How have we created a system that forces 24 year-olds to sneak around in the basement, hiding hard-drives in closets in order to ask basic and important questions about our work? Can’t we do better?”

 

Update: I’m not OK with scraping JSTOR or any other copyrighted data source for the purpose of re-distribution. Some, including federal prosecutors, have made the claim that’s what Aaron planned to do with the data. Others have pointed to his past research analyzing influence in academic writing. I have no insight into his real intentions; however, I do believe the latter goal is important and likely not possible without breaking the kinds of laws discussed above.

Also, it’s true that JSTOR does offer a bulk interface for research users. That interface didn’t exist when I was doing my work. But it’s not clear it would have made any difference. There are many, many research applications, including mine, that are still not possible with approved means of accessing data. Giving researchers a straw is not a useful response to requests for open and complete access. We shouldn’t settle for less.

 

* For those interested in the blow by blow: since writing this post I’ve learned that no one at MIT called the FBI–in fact it’s not clear the FBI was ever involved. As I now understand it, the local police were called to investigate a break-in. Because this involved network equipment the Secret Service were called by the Cambridge police. After that the investigation took on a life of its own outside the MIT campus.

 

A version of this essay appeared on Reuters MediaFile under the title “The difference between Google and Aaron Swartz.”

Measuring Centrality in Tacit Social Networks

tacit social networks — kpw @ January 21, 2009

There’s an interesting new paper (Maslov, arXiv:0901.2640v1) up on the arXiv this month about using centrality metrics (in their case a modified PageRank) to analyze citation graphs in academic publishing. I’ll refrain from summarizing the paper as a related post on the arXiv physics blog has already done a great job. But the upshot is that there’s a lot of value in applying these kinds of metrics to citation networks.
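For readers who haven’t worked with these metrics, here is a minimal sketch of plain PageRank run by power iteration over a citation graph. To be clear, this is the textbook algorithm, not the modified variant used in the paper, and the four-paper graph, the function name and the damping value of 0.85 are all assumptions made up for illustration.

```python
def pagerank(citations, damping=0.85, iterations=100, tol=1e-10):
    """Plain PageRank by power iteration (not the paper's modified variant).

    citations maps each paper to the list of papers it cites.
    """
    nodes = set(citations)
    for cited in citations.values():
        nodes.update(cited)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {paper: 1.0 / n for paper in nodes}

    for _ in range(iterations):
        # Papers that cite nothing spread their rank evenly over all papers.
        dangling = sum(rank[p] for p in nodes if not citations.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in nodes}
        # Every other paper splits its rank equally among the papers it cites.
        for paper, cited in citations.items():
            if cited:
                share = damping * rank[paper] / len(cited)
                for target in cited:
                    new_rank[target] += share
        converged = sum(abs(new_rank[p] - rank[p]) for p in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical toy data: four papers, where "A" is cited by all the others.
toy_citations = {
    "A": [],
    "B": ["A"],
    "C": ["A", "B"],
    "D": ["A", "C"],
}
scores = pagerank(toy_citations)
top_paper = max(scores, key=scores.get)
```

The scores sum to one, and in this toy graph the most-cited paper comes out on top. On real data the interesting cases are the ones where the ranking disagrees with a raw citation count.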

This paper fits closely with work I did in the past looking at citation graphs in patent data (the complete set back to the 1970s). In my case I was trying to assess the importance of inventors within a given field of innovation using a betweenness centrality metric (though PageRank/eigenvector centrality would also have been an appropriate choice). As the Maslov paper illustrated, this approach had a very high degree of success in finding key individuals in given fields. As an example, I ran a test on patents issued for technologies related to video games, and the betweenness centrality metric showed Shigeru Miyamoto, the lead designer at Nintendo, as the top innovator in an inventor-to-inventor citation graph. This result appears to be supported by his biography, which includes such honors as being named the “Walt Disney of electronic gaming” by TIME Magazine.
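As a rough illustration of the metric, betweenness centrality on an unweighted, directed inventor-to-inventor graph can be computed with Brandes’ algorithm. This is only a sketch and not the code I used at the time; the inventor names and edges below are hypothetical, and the scores are left unnormalized.

```python
from collections import deque


def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted directed graph.

    graph maps each node to a set of the distinct nodes it points to.
    Returns unnormalized betweenness scores.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    scores = {v: 0.0 for v in nodes}

    for source in nodes:
        # Breadth-first search that counts shortest paths from source.
        order = []
        predecessors = {v: [] for v in nodes}
        path_count = {v: 0 for v in nodes}
        distance = {v: -1 for v in nodes}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, ()):
                if distance[w] < 0:  # first time w is reached
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v lies on a shortest path to w
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)

        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += path_count[v] / path_count[w] * (1.0 + dependency[w])
            if w != source:
                scores[w] += dependency[w]
    return scores


# Hypothetical inventor-to-inventor graph: an edge X -> Y means X cites Y.
inventor_graph = {
    "inventor_a": {"hub"},
    "inventor_b": {"hub"},
    "hub": {"inventor_c", "inventor_d"},
}
centrality = betweenness_centrality(inventor_graph)
top_inventor = max(centrality, key=centrality.get)
```

Here every shortest path between the citing and the cited inventors runs through "hub", so it gets a score of 4 (two sources times two targets) and everyone else gets zero.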

One problem not addressed in the Maslov paper, however, is the translation from papers to people. The Maslov approach only ranks papers, though it makes inferences about the rank of the people who wrote them. I considered this in my work with patents and found it problematic. Rather than looking at centrality across a paper-to-paper citation graph I decided to first derive a person-to-person graph that summed citation edges between inventors across the complete body of each inventor’s work. I was fortunate enough that the data I was working with had already attempted to disambiguate the inventors (no small feat!) so it was possible to translate between a paper citation graph and a people citation graph with relative ease. In a sense the person-to-person citation graph forms a sort of tacit social network extracted from the patent data.
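The translation from a document graph to a people graph can be sketched in a few lines. Treat this as one plausible reading rather than the exact procedure: counting one edge per citing/cited inventor pair and dropping self-citations are assumptions, and the patent IDs and names are made up.

```python
from collections import Counter


def person_citation_graph(paper_citations, paper_authors):
    """Sum document-to-document citations into person-to-person edge weights.

    paper_citations maps a document ID to the IDs it cites.
    paper_authors maps a document ID to its (already disambiguated) authors.
    Returns a Counter keyed by (citing_person, cited_person).
    """
    weights = Counter()
    for citing, cited_docs in paper_citations.items():
        for cited in cited_docs:
            for citer in paper_authors.get(citing, ()):
                for citee in paper_authors.get(cited, ()):
                    if citer != citee:  # assumption: self-citations are ignored
                        weights[(citer, citee)] += 1
    return weights


# Hypothetical patents and inventors.
patent_citations = {
    "P2": ["P1"],
    "P3": ["P1", "P2"],
}
patent_inventors = {
    "P1": ["alice"],
    "P2": ["bob"],
    "P3": ["bob", "carol"],
}
edges = person_citation_graph(patent_citations, patent_inventors)
```

In this toy data bob cites alice twice (once from P2 and once from P3), so that edge carries a weight of 2, while bob’s citation of his own earlier patent is dropped. The resulting weighted edges are what a centrality metric would then be run over.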

I’m very much aware of the challenges in doing this with journals so I don’t fault them for not addressing this in their paper. However, it would be a great follow-on study to explore the difference in rankings using these approaches. I believe that there’s a need to continue thinking about the value of tacit social networks derived from sources like journals and patents, particularly in cases where the data is used to generate sociometric values like impact and importance.

Getting started…

house keeping — kpw @ January 7, 2009

I’ll spare you any prognostication about what’s to come and simply list a few of the things I’m thinking about these days:

See you soon.

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.
(c) 2012 Structural Knowledge