Looking at a mathematician's tree?

A fun project that I've been aware of for the past few years is the Mathematics Genealogy Project. To quote from their mission statement:

The intent of this project is to compile information about ALL the mathematicians of the world. We earnestly solicit information from all schools who participate in the development of research level mathematics and from all individuals who may know desired information.

The data they store for each record includes:

• The complete name of the degree recipient
• The name of the university which awarded the degree
• The year in which the degree was awarded
• The complete title of the dissertation
• The complete name(s) of the advisor(s)

I've been interested in scraping the data from my own record in the database, but it's been one of those tasks that I've kept meaning to get back to. I initially thought I would write a Python script to do the scraping myself, but after a little Google searching I found someone else had already presented the world with code to get started!

So here I'm going to go through some of the minor changes I had to make to the code presented by Francisco Blanco-Silva in order to get things running on my machine using Python 3.4.x, and offer a little more explanation in places for others who may not be used to playing around with the ugly truth behind web scraping.

The original scraper that I tried to get running, found here,

import urllib
def augment_genealogy(subject_id,subject_tree):
    # This function assumes that subject_id in not in subject_tree
    f=urllib.urlopen("http://genealogy.math.ndsu.nodak.edu/id.php?id="+subject_id)
    f.close()
    # How many advisors did subject have?
        # For each advisor, retrieve their information, and attach
        # them to subject as parents

        return subject_tree
    else:
        subject_tree[subject_id]=[]
        return subject_tree


threw me a few errors, some of which are due to my running Python 3.4.x and some of which look to have been a copy-paste issue!

First, the libraries:

import urllib.request # a bare "import urllib" does not reliably expose urllib.request in Python 3
import pickle # need this to dump the dict file


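And here is the augment_genealogy3 function, written with the minor adjustments for Python 3 (urllib.urlopen becomes urllib.request.urlopen, and subject_id is wrapped in str() so that an integer ID can be passed in):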

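def augment_genealogy3(subject_id, subject_tree):
    # This function assumes that subject_id is not in subject_tree
    f = urllib.request.urlopen("http://genealogy.math.ndsu.nodak.edu/id.php?id=" + str(subject_id))
    s = f.read().decode("utf-8", errors="replace")  # bytes -> str (assuming UTF-8 pages)
    f.close()
    # How many advisors did subject have?
    # ASSUMPTION: the parsing lines were lost from this listing. This stand-in expects
    # each advisor to appear as the word "Advisor" followed by an id.php?id=<number> link.
    advisor_ids = []
    for chunk in s.split("Advisor")[1:]:
        label, link, rest = chunk.partition('<a href="id.php?id=')
        if link and "<" not in label:  # skip entries with no link right after the label
            advisor_ids.append(rest.partition('"')[0])
    if advisor_ids:
        # For each advisor, retrieve their information, and attach
        # them to subject as parents
        subject_tree[subject_id] = advisor_ids
        for advisor_id in advisor_ids:
            if advisor_id not in subject_tree:
                augment_genealogy3(advisor_id, subject_tree)
        return subject_tree
    else:
        subject_tree[subject_id] = []
        return subject_tree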


So how about running it and scraping your own info? We need the subject_id of interest; this is just the number that the folks at the MGP have assigned to your mathematician of interest. When you search for me, Sepanda Pouryahya, for example, you should end up with a URL that looks like http://genealogy.math.ndsu.nodak.edu/id.php?id=177161, so my ID is 177161. Now to do the scraping:
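If you would rather not read the ID off the address bar by eye, the query string can be pulled apart with the standard library. This is only a small sketch; the helper name mgp_id_from_url is one I've made up for illustration and is not part of the original code.

```python
from urllib.parse import urlparse, parse_qs

def mgp_id_from_url(url):
    """Return the value of the 'id' query parameter as a string (hypothetical helper)."""
    query = parse_qs(urlparse(url).query)
    values = query.get("id")
    if not values:
        raise ValueError("no 'id' parameter in " + url)
    return values[0]

my_id = mgp_id_from_url("http://genealogy.math.ndsu.nodak.edu/id.php?id=177161")
print(my_id)
```

The result is a string, which matches the form of the IDs scraped from the pages themselves.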

subject_id = 177161 # my ID
subject_tree={}     # just an empty dictionary
subject_tree = augment_genealogy3(subject_id, subject_tree) # scrape the whole tree

# Save the data for later use
with open("math_subject_tree.p", "wb") as fh:
    pickle.dump(subject_tree, fh)


We now have a dictionary holding the subject's whole tree. Here's a little snippet of what the data looks like:

{'126667': ['125938', '126659'], '125938': [], '127461': ['127275', '127252'], 177161: ['89543'], '125232': [], '134780': ['146365'], '74384': [], '125142': ['127245', '125303'], ...

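The dictionary keys are the ID numbers from the website, and the value associated with each key is a list containing the ID(s) of that person's advisor(s); an empty list means no advisor was found. Note that my own key, 177161, is an integer because that is how I passed it in, while every scraped ID is a string. Although these records are all mathematicians and the numbers themselves are a pleasure, grabbing names seems like a good idea.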
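Because every key points at its parents, walking the dictionary level by level gives the number of generations between the starting person and each ancestor. Here is a rough sketch of that idea on a tiny made-up tree (the IDs below are toy values, not real MGP records, and the function name generations is my own). It converts every ID to a string first, since the starting key may be an integer while the scraped IDs are strings.

```python
from collections import deque

def generations(tree, root):
    """Map each ancestor ID (as a string) to its distance in generations from root."""
    # Normalise all IDs to strings so int and str keys can be mixed
    normalised = {str(k): [str(a) for a in v] for k, v in tree.items()}
    depth = {str(root): 0}
    queue = deque([str(root)])
    while queue:
        current = queue.popleft()
        for advisor in normalised.get(current, []):
            if advisor not in depth:  # an ancestor can be reachable by several paths
                depth[advisor] = depth[current] + 1
                queue.append(advisor)
    return depth

# Toy data: person 1 has advisor '2', who has advisors '3' and '4'
toy_tree = {1: ['2'], '2': ['3', '4'], '3': [], '4': ['3']}
print(generations(toy_tree, 1))
```

A breadth-first walk records the shortest distance to each ancestor, which is one reasonable choice when someone appears in the tree along more than one line.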

Associating names with the advisor IDs

Just as with the previous scraper, Francisco Blanco-Silva's blog post has what I needed, after a few minor adjustments.

Just as a reminder, the subject_tree data can be loaded up with a call to

subject_tree = pickle.load( open( "math_subject_tree.p", "rb" ) )


To make the dictionary containing the number-to-name correspondence:

#Create a dictionary that maps to each id, its name
names=dict()
for person_id in subject_tree:
    print(person_id)
    f=urllib.request.urlopen("http://genealogy.math.ndsu.nodak.edu/id.php?id="+str(person_id))
    s=f.read().decode("utf-8", errors="replace") # bytes -> str (assuming UTF-8 pages)
    f.close()
    # The name is whatever follows the site prefix inside the page <title>
    name=s.partition("The Mathematics Genealogy Project - ")[2]
    print(name[0:name.index("</title>")])
    names[person_id] = name[0:name.index("</title>")]

pickle.dump( names, open( "math_subject_tree_names.p", "wb" ) )
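One fragile spot in the loop is the title parsing: a call to index("</title>") raises a ValueError if a page does not look the way we expect. A small sketch of the same parsing step pulled out into a function, so it can be tried without touching the network, might look like this. The function name name_from_page and the sample HTML are my own inventions; the sample is modelled on the title format the loop relies on.

```python
def name_from_page(html):
    """Pull the name out of the page <title>, or return None (hypothetical helper)."""
    marker = "The Mathematics Genealogy Project - "
    before, found, after = html.partition(marker)
    if not found or "</title>" not in after:
        return None  # the page did not look the way we expected
    return after[:after.index("</title>")].strip()

# Made-up page fragment for illustration only
sample = "<html><head><title>The Mathematics Genealogy Project - Sepanda Pouryahya</title></head></html>"
print(name_from_page(sample))
```

Returning None rather than raising lets the scraping loop skip an odd page and carry on with the rest of the tree.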


Now I have all the advisors in my tree - my academic ancestors if you like!
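With both dictionaries in hand, the two can be combined into a readable lineage. The following is just a sketch on toy data (the IDs and names are invented, and lineage_lines is a name I've made up); it assumes the names dictionary uses the same keys as the tree, which holds when both are built from the same set of IDs.

```python
def lineage_lines(tree, names, root, indent=0, seen=None):
    """Return indented lines, one per ancestor, starting from root."""
    seen = set() if seen is None else seen
    label = names.get(root, str(root))  # fall back to the ID if no name is known
    lines = ["  " * indent + label]
    if root in seen:  # already expanded elsewhere in the tree, so do not repeat it
        return lines
    seen.add(root)
    for advisor in tree.get(root, []):
        lines.extend(lineage_lines(tree, names, advisor, indent + 1, seen))
    return lines

# Toy data, not real MGP records
toy_tree = {'1': ['2'], '2': ['3', '4'], '3': [], '4': ['3']}
toy_names = {'1': 'Student', '2': 'Advisor', '3': 'Grand-advisor A', '4': 'Grand-advisor B'}
print("\n".join(lineage_lines(toy_tree, toy_names, '1')))
```

Each level of indentation is one generation further back, and a person who appears along more than one line is listed again but not expanded a second time.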