On September 10 there was a sort of guerilla-action digital reference event called Slam the Boards. This was “a day-long answer fest” in which reference librarians would answer as many questions as they could in one day, using authoritative resources. It was quite well publicized, at least in the digital reference community, and planning had been underway for more than a month. It’s not clear how many librarians actually participated, since there was no official signup, but several dozen librarians and a few entire services committed to participate. And since there are quite a few answer boards, it’s also not known where participating librarians did their answering.

It’s a fair bet, though, that most were on Yahoo Answers. It’s the largest of the answer boards, plus Yahoo did some promotion and recruiting for the event.

Bill Pardue, the de facto organizer of Slam the Boards, made it clear that there would be no evaluation of the event. So I thought, well, since I do digital reference evaluation, I’ll do one. I wasn’t sure what the evaluation questions would be, but figured I’d see if I could even get the data first, & then worry about that. I figured that evaluation questions could focus on participation: What percentage of the total answers provided on Sept 10 were from librarians? What percentage of libraries in the US were represented? Or evaluation questions could focus on the answers themselves: Were the answers provided by librarians really qualitatively better than those provided by non-librarians? In those services that have rating schemes, did askers rate librarians’ answers higher? Send thank-yous more often?

And here we get into a series of frustrations.

First, collecting the data turned out to be complicated, bordering on impossible. The Answer Board Librarians wiki (from which the Slam the Boards campaign was orchestrated) has a list of answer boards; there are 16 boards on this list. Let’s assume that I only wanted to collect answered questions from these 16 boards. This would involve either (1) manually downloading however many answered questions were produced in a day, probably several hundred, or (2) building a web crawler to do it automatically. Option (1) was too time-consuming, and option (2) is beyond my technical skills. Plus, since every answer board is structured differently, it would really need to be 16 separate crawlers.

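Just to give a flavor of why option (2) is more work than it sounds: each of the 16 boards would need its own scraper, tuned to that board’s URL scheme and page markup. Here’s a minimal sketch of what one such scraper might look like; the board’s domain, URL pattern, and “answer” CSS class are all invented for illustration.

```python
# Sketch of ONE board-specific scraper. Everything about the target site
# (domain, URL scheme, the "answer" CSS class) is hypothetical; a real
# effort would need one of these, hand-tuned, for each of the 16 boards.
import urllib.request
from html.parser import HTMLParser

PAGE_URL = "http://example-answerboard.com/answered?date=2007-09-10&page={}"

class AnswerExtractor(HTMLParser):
    """Collect the text of every <div class="answer"> element."""
    def __init__(self):
        super().__init__()
        self.depth = 0          # nesting depth inside an answer div
        self.answers = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
        elif tag == "div" and ("class", "answer") in attrs:
            self.depth = 1
            self.answers.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.answers[-1] += data

def scrape_day(max_pages=50):
    """Walk the day's result pages until one comes back empty."""
    answers = []
    for page in range(1, max_pages + 1):
        html = urllib.request.urlopen(PAGE_URL.format(page)).read().decode("utf-8")
        parser = AnswerExtractor()
        parser.feed(html)
        if not parser.answers:   # ran past the last page of the day
            break
        answers.extend(parser.answers)
    return answers
```
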
Another stumbling block: I had no way to identify which answers had been provided by librarians. The Answer Board Librarians wiki also has a list of librarians who signed up to participate, but I had no idea what these librarians’ usernames were on the different boards, nor how many librarians participated without signing up. There were two ways around this:

  1. search for sig blocks containing the text “library” — though this would certainly miss many librarian-provided answers (a rough sketch of this filter follows the list), and
  2. email all librarians on the wiki list & ask them to send me their usernames — this would allow better filtering of answers, but what would my response rate from those librarians be?

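The first workaround at least is easy to automate once the answers are in hand. Here’s a rough sketch of the kind of filter I had in mind, using an invented record format. Note how crude the match is: it catches “librarian” in a sig block, but it would also flag any signature that merely mentions a library.

```python
import re

# Hypothetical record format: each answer carries its body text plus
# whatever trailing sig block the board exposes.
LIBRARIAN_HINT = re.compile(r"librar", re.IGNORECASE)  # matches library/librarian/...

def looks_like_librarian(answer):
    """Crude filter: does the sig block mention a library at all?"""
    return bool(LIBRARIAN_HINT.search(answer.get("signature", "")))

answers = [
    {"body": "Try interlibrary loan.", "signature": "Jane, reference librarian"},
    {"body": "Just google it.", "signature": ""},
]
librarian_answers = [a for a in answers if looks_like_librarian(a)]
print(librarian_answers)  # keeps only Jane's answer
```
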
So there you are. Collecting the data was prohibitive. But a solution presented itself: Bill Pardue had been in communication with the good people in Yahoo’s Social Search group! I emailed Bill to ask him to put me in touch with these folks, which he graciously did. I thought, maybe they could cut me a slice of their Yahoo Answers data, all the questions answered on September 10, and then I could pick out the librarian-answered questions (somehow… see above).

So I had a long phone conversation with someone from Yahoo’s Social Search group. He explained Yahoo’s interest in human-intermediated question answering, something that had always puzzled me. He explained their Knowledge Partners program. I explained my interest in evaluating Slam the Boards, and offered Yahoo any data analysis they wanted from me in payment. It was a great conversation. And I’ve heard bupkes since. I’m not holding my breath.

So… if I can’t get Yahoo to give me their data, maybe I can just take it. Turns out, Yahoo Answers has an API, so I considered writing a tool to collect answered questions that way. I could have done it myself, since it’s within my technical skills, but I just didn’t want to spend the time. So I asked Chirag, one of our doctoral students, who is a far better programmer than I am and enthusiastic about a challenge, to look into what it would take to write an app against the API.

Turns out, Yahoo Answers’ API is pretty limited: one can’t simply ask for all questions from a certain date; one has to specify a search query. Chirag cleverly tried some generic queries (‘what’ and ‘how’) to retrieve the maximum number of questions possible. But since I wanted all questions answered on a given date, across all topics, this method really wouldn’t work.

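For the record, here’s roughly what that approach looks like. The V1 questionSearch endpoint and parameter names below are reconstructed from Yahoo’s old documentation as best I can, so treat them as illustrative rather than gospel; the point is simply that a query term is mandatory.

```python
import json
import urllib.parse
import urllib.request

# Endpoint and parameters reconstructed from the old V1 docs; treat as
# illustrative, not authoritative.
ENDPOINT = "http://answers.yahooapis.com/AnswersService/V1/questionSearch"
APP_ID = "YOUR_APP_ID"   # placeholder; Yahoo issued these per application

def question_search(query, start=0, results=50):
    # A query string is required: there is no call that says "give me
    # every question answered on date X," which is exactly the
    # limitation that sank this approach.
    params = urllib.parse.urlencode({
        "appid": APP_ID,
        "query": query,       # e.g. 'what' or 'how', to cast the widest net
        "results": results,   # per-request cap (50, if memory serves)
        "start": start,
        "output": "json",
    })
    with urllib.request.urlopen(ENDPOINT + "?" + params) as resp:
        return json.load(resp)

# Paging through generic queries pulls in lots of questions, but only
# those that happen to contain the query term; it can never enumerate
# all questions answered on September 10.
pages = [question_search(term) for term in ("what", "how")]
```
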
So there you are, dear reader. More of my kvetching than you ever really wanted to read. But let this be a lesson to you: a good idea for a research project can die because you can’t get access to the data you need.