Open Data in China: Some Assembly Required

As part of a recent project, my team has been looking to get a sense of the spatial patterns of population growth across China over the last ten years. Given the massive urban migration underway across China, understanding these patterns at a reasonable level of disaggregation is critical to developing sensible plans for future growth.

Luckily for us, like most other countries, China periodically carries out a complete national census (most recently in 2010), which collects detailed population data down to the county level (more than 2,800 across China) and ‘township’ level (more than 47,000 across China and roughly analogous to a US census tract). We were also lucky that the data had been published only a couple of days before we started looking for it (in mid-March 2013).

We happily set off to purchase the data, which came in two volumes – one for county level statistics and one on township level data. Unfortunately, the data was available in only one format – pictured below. For those who spend a lot of time glued to laptops ensconced in machine-readable data, this data format is colloquially known as a Book. I.e., it’s a book. A thick, massive book.

[Photo: 20130401_163530]

The photo below shows one of the hundreds of pages containing the township data I described above. The township name is on the left, and population by age and gender runs across the columns.

[Photo: 20130401_163600]

So, the data is ‘open’ – i.e. you can have a look at it. And, certainly, if you were interested in knowing exactly how many females aged 50 to 64 lived in your township, this might not be a completely inconvenient way to find out (although you’d have to drop about $400 for the book itself). But clearly, for almost any other purpose you can imagine, this is a far from ideal format.

The first step to a more user-friendly future is having the data in some type of digital database – no small task in itself. Even with that done, though, carrying out any type of spatial analysis would require a spatial dataset to join the data with. Neither of the datasets above is released with a spatial boundary file.

In some cases, open geospatial datasets are available for use in mapping the data above. The excellent Global Administrative Areas (GADM) dataset, for example, provides county level administrative boundaries in China. Having worked to match this source with Chinese data in the past, however, I can tell you that it’s not always as easy as it sounds, given the slight changes in borders and naming conventions between data sets and years. But clearly GADM is light years ahead of starting from scratch.
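To give a feel for why the name matching is fiddly, here is a minimal sketch of the kind of join involved, using only Python's standard library. Everything in it is an assumption for illustration: the suffix list, the sample names, and the `normalize` and `match_names` helpers are hypothetical, and real census and GADM records would need far more care (Chinese characters versus pinyin, renamed or merged counties, duplicate names across provinces).

```python
import difflib

# Hypothetical administrative suffixes to strip before comparing names.
SUFFIXES = (" county", " district", " city", " xian", " qu", " shi")


def normalize(name):
    """Lowercase, collapse whitespace, and drop one trailing suffix."""
    cleaned = " ".join(name.lower().split())
    for suffix in SUFFIXES:
        if cleaned.endswith(suffix):
            return cleaned[: -len(suffix)]
    return cleaned


def match_names(census_names, boundary_names, cutoff=0.85):
    """Map each census name to a boundary name, or None if nothing is close."""
    lookup = {normalize(b): b for b in boundary_names}
    matches = {}
    for name in census_names:
        key = normalize(name)
        if key in lookup:
            # Exact match once suffixes and casing are normalized.
            matches[name] = lookup[key]
            continue
        # Otherwise fall back to the closest spelling above the cutoff.
        close = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
        matches[name] = lookup[close[0]] if close else None
    return matches


# Illustrative sample names only, not taken from the census volumes.
census = ["Changping District", "Miyun County", "Yanqing Xian", "Pinggu District"]
boundaries = ["Changping", "Miyun", "Yanqing", "Huairou"]

result = match_names(census, boundaries)
unmatched = [name for name, hit in result.items() if hit is None]
```

The `unmatched` list is the important output: those are the records that still need a person to check them by hand, which is where most of the effort tends to go.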

In the case of the township level data (equivalent to the census tract level in the US) there is simply (to the best of my knowledge) no available geospatial file to link the data to. In other words: 47,000+ data points, no locations. This drove one prominent geospatial company to manually enter the data into a spreadsheet and manually geocode each of the 47,000+ township locations from hardcopy sources assembled from hundreds of local governments. Anecdotally, I understand this took three full-time people almost a year to complete.

They did this for the 2000 census and were able, from that data, to derive the richest and most detailed data set available on Chinese urbanization. Despite the amazing work done by the company, their data set is clearly now a key part of their added value as a consulting firm, and given their substantial investment of time and money in creating it, they are (quite understandably) not about to give it away.

Separately, the University of Michigan China Data Center does similar work matching published Chinese census data to spatial boundaries for Chinese counties. For 2000, the last year for which this data appears to be available, they offer detailed data for all 2,800+ counties in the country. The price tag? $38,000. $38,000 for census data – the bread and butter and free starting point for every demographic analysis done here in the US. This is not a knock on the China Data Center – I’m sure their time and cost investment to produce the data set is substantial. But it strikes me as odd that a data set created by the government of China is sold by a US institute for $38,000, putting it out of reach for all but the wealthiest research institutes in China or anywhere else. Even if it could be purchased, restrictive licensing prevents the type of data mashups that are becoming so powerful elsewhere.

It is ultimately reasonable that countries like China have an interest in protecting some sensitive data from the public where absolutely required. And geospatial data remains a sensitive issue in China, as evidenced by resistance to OSM and geospatial data collection more broadly. But in the case of census data, the data is already being made public, just not in a form that can actually support useful and interesting analysis. Because of this decision, private firms and institutes outside China are able to step in and provide a costly service, ironically excluding many Chinese institutes, students, and citizens from using the data themselves to enhance their understanding of China’s demographic evolution.

The relevance of this discussion is heightened by the fact that China is a country still in the midst of the largest wave of urbanization in human history. Massive investments are being made to accommodate growing cities and regions. Understanding population dynamics at a detailed level is critical to properly planning for this growth. No one doubts the ability of China’s researchers, but as I’ve heard at many conferences and in the hallways of many meetings – it’s no good having the techniques and know-how if the data is not available. Just a few samples of analyses done with free US data – for example, on commute times or public transport access – show how valuable open census data can be in enabling broad-based dialogue on urban development.

It’s great that China is at least releasing its data, but let’s move to make it digital, open, mappable, and shareable – and the basis for an understanding of Chinese urban growth that helps lay the groundwork for sound urban investment decision-making in the future.
