Create a dictionary of author attributes and values for a journal article using the Scopus API and Python

As an exercise to brush up my Python skills, I decided to tinker with the Scopus API. Scopus is an online database maintained by Elsevier that records and provides access to information about peer-reviewed publications. Not only does Scopus let users search for journal articles by keywords and various other criteria, but its web services also let users explore these articles as networks of articles, authors, institutions, and so forth. If you’re interested in the factors that lead to scholarly publications, publication citations, or impact factors, this is a good place to start.

The following code builds a dictionary of author information by requesting content through the abstract retrieval API. The request is made with the Python package requests, and the response is parsed with BeautifulSoup. Enjoy!

#### Import python packages
import requests
from bs4 import BeautifulSoup


#### Set API key
my_api_key = 'YoUr_ApI_kEy'


#### Abstract retrieval API
# API documentation at http://api.elsevier.com/documentation/AbstractRetrievalAPI.wadl
# Get article info using unique article ID
eid = '2-s2.0-84899659621'
url = 'http://api.elsevier.com/content/abstract/eid/' + eid

header = {'Accept' : 'application/xml',
          'X-ELS-APIKey' : my_api_key}

resp = requests.get(url, headers=header)

print('API response code:', resp.status_code) # anything other than 200 indicates an API error

# Write response to file
#with open(eid, 'w', encoding='utf-8') as f:
#    f.write(resp.text)

soup = BeautifulSoup(resp.content.decode('utf-8','ignore'), 'lxml')

soup_author_groups = soup.find_all('author-group')

print('Number of author groups:', len(soup_author_groups))

author_dict = {}

# Traverse author groups
for group in soup_author_groups:

    # Traverse authors within each author group
    for author in group.find_all('author'):

        # Store the author's attribute dictionary, keyed by Scopus author ID
        author_dict[author.attrs['auid']] = dict(author.attrs)

        # Traverse the author's direct child elements (e.g., given name, surname)
        for elem in author.find_all(recursive=False):

            author_dict[author.attrs['auid']][elem.name] = elem.get_text()

print(author_dict)
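The same traversal can be exercised without an API key by parsing a mock response with the standard library's ElementTree. The XML fragment below is a made-up, simplified stand-in for a Scopus author group, not real API output:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment shaped like a Scopus author-group
mock_xml = """
<author-group>
  <author auid="1234567">
    <given-name>Ada</given-name>
    <surname>Lovelace</surname>
  </author>
  <author auid="7654321">
    <given-name>Alan</given-name>
    <surname>Turing</surname>
  </author>
</author-group>
"""

root = ET.fromstring(mock_xml)
author_dict = {}

# Same idea as the BeautifulSoup loop: key each author by its auid,
# store its attributes, then add each child element's tag and text
for author in root.findall('author'):
    auid = author.attrib['auid']
    author_dict[auid] = dict(author.attrib)
    for elem in author:
        author_dict[auid][elem.tag] = elem.text

print(author_dict['1234567']['surname'])  # Lovelace
```

This makes it easy to check the dictionary-building logic before pointing the code at the live API.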

Return all Column Names that End with a Specified Character using regular expressions in R

With the R functions grep() and names(), you can identify the columns of a data frame that meet some specified criterion.

Say we have the following data frame:

x <- data.frame(v1=c(1,2,3,4), v2=c(11,22,33,44), w1=c(1,2,3,4), w2=c(11,22,33,44))


To return only those columns whose names end with a given character (e.g., the number 1), anchor the pattern to the end of the string with $ and submit the R command grep(pattern="1$", x=names(x), value=TRUE) into the console.

[1] "v1" "w1"
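For comparison, the same end-anchored match can be written in Python with the standard library's re module (the column names here are copied from the example above):

```python
import re

cols = ["v1", "v2", "w1", "w2"]

# re.search with a '$'-anchored pattern keeps only names ending in "1"
ending_in_1 = [c for c in cols if re.search(r"1$", c)]
print(ending_in_1)  # ['v1', 'w1']
```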

Accessing All the Curl Options under R

The curl package for R, RCurl, provides an integrated set of tools for interacting with remote servers, to say the least. While it provides a number of useful functions, it still lacks a few sorely missed options (e.g., retry). Of course, it’s possible to write some of these missing functions in R and expand the functionality of the RCurl package that way, but, on the other hand, it might just be easier to use the better-maintained and fully functional curl program that comes with your computer. Under Mac OS X, the native curl program can be accessed from R using the command system().

For instance, we can serially download and save webpages (and retry the process if it fails) by using the following R syntax.

for(i in 1:n){
    system(paste("curl http://www.google.com --retry 999 --output sitePage",as.character(i),".html",sep=""),wait=TRUE)
}

Similarly, we can use some simple R syntax to asynchronously download a number of webpages. For instance,

for(i in 1:n){
    system(paste("curl http://www.yahoo.com --retry 999 --output sitePage",as.character(i),".html",sep=""),wait=FALSE)
}
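The same serial/asynchronous split can be sketched in Python with the standard library's subprocess module. The helper and file names below are placeholders of my own, and the actual downloads are left commented out so the command can be inspected without a network connection:

```python
import subprocess

def build_curl_cmd(url, i):
    """Build the curl command as an argument list (hypothetical file names)."""
    return ["curl", url, "--retry", "999", "--output", "sitePage%d.html" % i]

print(build_curl_cmd("http://www.google.com", 1))

# Serial: subprocess.run() blocks until curl finishes (like wait=TRUE)
# for i in range(1, n + 1):
#     subprocess.run(build_curl_cmd("http://www.google.com", i))

# Asynchronous: subprocess.Popen() returns immediately (like wait=FALSE)
# procs = [subprocess.Popen(build_curl_cmd("http://www.yahoo.com", i))
#          for i in range(1, n + 1)]
# for p in procs:
#     p.wait()
```

The run/Popen distinction mirrors R's wait=TRUE/wait=FALSE: run() waits for each download before starting the next, while Popen() launches all of them at once.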

If you’re just downloading webpages, it’s easy enough to use the curl program that comes with your computer: call it with the R command system(). In this way, you can download some pages with curl and then parse the information from them at a later time.