Group: CiteULike-discussion - Forum Thread

Topic: Feature requests

Web service planned?

Do you plan to add a web-service? It would be great to be able to embed literature in external sites, tap into the tag searches, etc.

Posted by tharris on 2008-09-16 20:40:35.

14 replies.

Yes. We intend to do something like this. You can get the HTML for list items by prefixing the path with /embeddable, so to get my library:

This isn't ideal -- you'll have to write a style sheet (or take items from ours), and need to manipulate the HTML in your browser if you don't like our layout (which is also subject to change...)

Posted by chris on 2008-09-17 19:55:30.

Most excellent!

Posted by tharris on 2008-09-18 13:57:03.

Are you considering some sort of API (RESTful?) to allow programmable happiness? E.g. the ability to post articles. Something like Connotea's maybe? (http://www.connotea.org/wiki/WebAPI)


In the meantime I'm trying to do this without an API by using an HTTP POST on post_unknown.do, but I keep getting HTTP 500 errors. Is this allowed in the first place? If not, I'll stop. I'm sending the data values 'url' and 'title' as a starting point. Is there anything else I need to send?


Thanks!

Posted by mrsmond on 2008-09-19 18:24:49.

You can partly do this using tools like Exhibit ( http://code.google.com/p/simile-widgets/ ) from the Simile group at MIT. This will take BibTeX output (e.g. from a search output page) and generate a nice faceted view that allows you to then drill down by author, year etc. The one thing it doesn't cope with nicely is realising that the 'tag' field from each entry should be parsed into separate items (i.e. the tag list foo, bar is composed of two separate items, not one that contains a comma), but hey, nobody's perfect!

Posted by CameronNeylon on 2008-09-18 14:46:36.
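
The tag-splitting step described above can be sketched in a few lines of Python. This is only an illustration, assuming the 'tag' field arrives as a single comma-separated string as in the "foo, bar" example; the record layout and the helper name split_tags are hypothetical and not part of Exhibit or CiteULike.

```python
from typing import List


def split_tags(tag_field: str) -> List[str]:
    """Split a comma-separated tag field into individual, trimmed tags."""
    # Empty pieces (e.g. from a trailing comma) are dropped.
    return [tag.strip() for tag in tag_field.split(",") if tag.strip()]


# Hypothetical record, shaped like one parsed BibTeX entry.
entry = {"title": "Example paper", "tags": "foo, bar"}
entry["tags"] = split_tags(entry["tags"])
print(entry["tags"])  # ['foo', 'bar']
```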

Hi,

I was wondering if you still plan to add some sort of web service, and if you are, do you have any kind of estimated time frame for when it might be complete?

I'm curious because I'm working on a bibliography project that would need to be on its own domain name. We love the service offered by citeulike, though, and would like to use it if at all possible. We'd certainly be willing to put up references to citeulike or Springer, as long as we can get a fully functioning bibliography on our own domain name.

Alternatively, has anyone had any luck using the embeddable feature to do something like this?

Thanks.

Posted by aproman on 2009-01-29 05:23:32.

I thought this may interest some of you. Many wikis/CMS allow you to parse and embed feeds from external sources. This is a filtered view from my CiteULike library embedded in my personal website (running on WikkaWiki).

Embedding an RSS feed is as simple as putting the following in the page:

{{rss url="http://www.citeulike.org/rss/user/dartar/tag/online_communities"}}

Posted by dartar on 2009-01-31 21:45:32.

Yeah, I have 1600 tags and lots of mess in there. I'd really like a write-access API to clean them up programmatically. Currently it's very tricky - you can use a browser emulator, but you have to ignore robots.txt, which explicitly disallows your client. As such I suspect it violates the terms of service, so I'm not going to go into details about how to do it all. Then you have to glean forms from some frankly nasty markup. THEN you can do stuff. Brr. Is this going to change any time soon?

If it is in the next few months, I'd like to know, since as my priorities have evolved it's become an important area for me.

Posted by livingthingdan on 2010-08-17 09:39:03.

It should be really easy to script this. Download the JSON (for example) and change the tags by emulating the form post.

Posted by fergus on 2010-08-17 12:27:14.

Hi fergus.

Yes, it is not hard to do that, but it's far from optimal. For a start, the site owners have explicitly disallowed it using robots.txt, so I'm not sure that my automated browser client won't get me banned from CiteULike. For yet another, you can't use just the JSON output; you have to parse the HTML to get some of the necessary post URLs. This is tedious as the page design changes over time, so my code sporadically breaks, and it requires multiple page downloads per code update. Finally, it's something I hate doing - I left the web development industry for academia so I could, among other things, not spend my days messing with web design. If I have to write code to do things on CUL that are trivial with desktop clients, then the attractiveness of the web service diminishes a lot. What have I gained over running my own local desktop citation client and pumping the outputs to a static web page periodically?

Posted by livingthingdan on 2010-08-25 14:37:08.

for a start, the site owners have explicitly disallowed it using robots.txt, so I'm not sure that my automated browser client won't get me banned from CiteULike

robots.txt is about crawling. Anyway, which robots.txt entry are you thinking of? You would only get banned with abuse ("excessive over-use"). We don't have strict rules, but if you are making fewer than 1000 requests per hour, and not too "bursty", you probably won't get noticed. Notwithstanding, you can always email us to check, once you've an idea of your requirements.

For yet another, you can't use just the JSON output

Not true. The JSON contains all the data you need, and specifically the article_id.

This is tedious as the page design changes over time, so my code sporadically breaks

The form is (to first order) independent of the HTML, and is very unlikely to change. You might only need a simple regex-like test of the returned page to check for success.

Posted by fergus on 2010-08-25 15:01:51.

Thanks for the quick answers, @fergus -

robots.txt is about crawling.

AFAIK robots.txt is about any automated browser, and I don't believe the standard makes the distinction. Moreover, if I automatically update a thousand bookmarks, I'm fuzzy about how that is not crawling. Still, if CiteULike is OK with that, I'm happy to ignore the robots.txt file. Note that it is tricky (i.e. undocumented) to ignore robots.txt in, for example, wget, so this could still be frustrating for some.

Anyway, which robots.txt entry are you thinking of?

     /bibtex/
     /endnote/
     /rss/

And for that matter, while checking out the file I noticed that Yahoo Pipes is disallowed, which explains why that attempt to ease my workflow failed. Have there been problems with Yahoo Pipes? (Certainly I've noticed their Slurp bot being rude from time to time, but Pipes has never been a problem for my sites. Anyway...)

Further, I notice that /login.do is excluded. If I wish to access my own private notes for an article and thus log my agent in, what is the recommended procedure?

The JSON contains all the data you need, and specifically the article_id.

Sorry, my bad. At first inspection it didn't seem to have what I want. In part this may be due to the fact that I don't know about the differences between article id and user article id etc. Is this documented anywhere? How about how to construct a form-submit URL from the JSON?

The form is (to first order) independent of the HTML

Right, but rather than relying on CiteULike code handling forms submitted with empty fields, I've been parsing the form content and resubmitting it. If I delete a field from my request entirely will CUL leave said field unchanged in the record to which it pertains? Can I rely on that behaviour in perpetuity?

A couple of years ago I relied on the behaviour of CUL serving up a redirect to a user's article record if you browsed to an article ID but omitted the username. Admittedly that was undocumented behaviour. But then, so is intercepting Ajax requests and spoofing them from WWW::Mechanize. I think a bit of documentation about what is supported and what is not could go a long way toward making this seem less risky. And which "regex-like test of the returned page" is supported would be very helpful to know too.

Anyway, don't want to bang on about APIs at you, I'm sure we're all used to the Web 2.0 rhetoric around here. But hopefully this is some useful user use-case info for you.

For my part, you are so close to a REST read-only API - I'd love to see a REST write API. Then I could integrate it into my various parsing scripts, do my bulk cleaning-up and so on, maybe let my desktop citation manager update the site for those fiddly updates etc. I reckon people like the SyncUThink folks et al would be interested too, perhaps.

Posted by livingthingdan on 2010-08-30 10:19:21.

AFAIK robots.txt is about any automated browser, and I don't believe the standard makes the distinction.

From http://www.robotstxt.org/

Web Robots (also known as Web Wanderers, Crawlers, or Spiders), are programs that traverse the Web automatically. Search engines such as Google use them to index the web content, spammers use them to scan for email addresses, and they have many other uses.

As far as you are concerned, ignore robots.txt: just don't hit us too hard. Put "sleep 5" in your code and have a cup of tea.

Posted by fergus on 2010-08-30 10:23:34.
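
The "sleep 5" advice above can be sketched as a small throttled loop. This is a minimal illustration, not CiteULike's documented policy: the function name run_politely and its parameters are hypothetical, and the action passed in stands for whatever request your script makes. A fixed 5-second pause caps the loop at 720 calls per hour, which stays under the "fewer than 1000 requests per hour" figure mentioned earlier in the thread.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_politely(
    items: Iterable[T],
    action: Callable[[T], R],
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Apply `action` to each item, pausing between consecutive calls."""
    results: List[R] = []
    for index, item in enumerate(items):
        if index > 0:
            # Pause before every call except the first one.
            sleep(delay_seconds)
        results.append(action(item))
    return results


# Demo with a no-op sleep so it finishes instantly; a real script would
# keep the default time.sleep and pass a function that performs one request.
print(run_politely(["a", "b"], str.upper, sleep=lambda seconds: None))
```

The sleep function is injectable only so the pacing can be checked without waiting; spreading requests evenly like this also avoids the "bursty" pattern mentioned above.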

If I delete a field from my request entirely will CUL leave said field unchanged in the record to which it pertains? Can I rely on that behaviour in perpetuity?

I thought you wanted to edit tags? That form is very simple.


For the main article form, if you omit a field, that entry will go back to the default value - any custom edits you have made will be lost.

Posted by fergus on 2010-08-30 12:42:33.
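
Since omitted fields on the main article form fall back to their defaults, one cautious approach is to start from the record's current values and overlay only the changes, so that every field is always sent. The sketch below assumes exactly that; the field names and the helper build_full_form are hypothetical, as the thread does not list the real form fields.

```python
from typing import Dict


def build_full_form(current: Dict[str, str], changes: Dict[str, str]) -> Dict[str, str]:
    """Return a complete form payload: current values with `changes` applied."""
    unknown = set(changes) - set(current)
    if unknown:
        # Refuse to invent fields that the existing record does not have.
        raise KeyError("unknown form fields: " + ", ".join(sorted(unknown)))
    payload = dict(current)  # copy, so the original record is left untouched
    payload.update(changes)
    return payload


# Hypothetical field names, for illustration only.
current = {"title": "My custom title", "journal": "Some Journal", "year": "2009"}
payload = build_full_form(current, {"year": "2010"})
print(payload["title"], payload["year"])  # My custom title 2010
```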

In part this may be due to the fact that I don't know about the differences between article id and user article id etc.

The "article" is an abstract "base" copy - user articles are references to this + custom additions. All users with the same article then share the article. (username, article_id) is a unique pair, logically equivalent to the user_article_id, which is just a database "primary key" for efficiency.

Posted by fergus on 2010-08-30 12:47:04.
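
The relationship described above can be pictured with two small record types. This is a sketch of the idea only, not CiteULike's actual schema: article_id, user_article_id and username come from the post, while the class names and the remaining fields (title, tags, notes) are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Article:
    """The shared "base" copy of a paper."""
    article_id: int
    title: str


@dataclass
class UserArticle:
    """One user's reference to an Article, plus that user's own additions."""
    user_article_id: int  # surrogate primary key
    username: str
    article_id: int       # points at the shared Article
    tags: List[str] = field(default_factory=list)
    notes: str = ""

    def natural_key(self) -> Tuple[str, int]:
        # (username, article_id) identifies the same row as user_article_id.
        return (self.username, self.article_id)


base = Article(article_id=42, title="Example paper")
alice_copy = UserArticle(1001, "alice", base.article_id, tags=["foo"])
bob_copy = UserArticle(1002, "bob", base.article_id, tags=["bar"])

# Both users share one article but keep separate tags and keys.
print(alice_copy.natural_key(), bob_copy.natural_key())
```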
