Milos Popovic

Compute and map railway density using R

May 24, 2021 | Milos Popovic

With a total of 67,956 kilometers of railways in 2020 India ranked 4th just behind the United States, China and Russia. While Indian Railways, a statutory body under the Indian Ministry of Railways, manages one of the world’s largest rail networks, adjusting for the size of the country reveals that a much smaller portion of the territory is covered in railroads.

In this post, we’ll dig deeper into this topic by exploring India’s railway density on a more granular level. Using the 2019 official subdistrict boundary data generously provided by superb GIS specialist Justin Elliot Meyers on his rich GitHub page as well as Geofabrik OpenStreetMap data for India, we’ll learn how to effortlessly compute the length of railways and land area size for every Indian subdistrict polygon to arrive at railway density, measured as 1 kilometer of railway per 100 square kilometers of land area. We’ll then plot all this on a neat choropleth map. And we’ll do it programatically in R using <150 lines of code. No waaaay, dude! 😮

Ah yeah, but that’s not all folks! Once you go through the code, you’ll be able to apply it to other spatial lines such as roads or rivers. This code could ultimately inspire you to launch your own projects on, for example, motorway density or river length per population.

We’ll start off by importing the usual suspects such as dplyr and plyr for data processing; sf and tidyverse for spatial analysis and ggplot2 for plotting. Also, package classInt will come in handy for constructing a discrete density rate for our map legend.

#load libraries
library(plyr, quietly=T)
library(tidyverse, quietly=T) 
library(sf, quietly=T)
library(ggplot2, quietly=T) 
library(dplyr, quietly=T)
library(classInt, quietly=T)

We’ll first download the 2019 India Official Boundaries shapefiles from GitHub. When we unzip the master folder, the India_Official_Boundaries_2019-master folder is returned. This folder includes 37 zipped subdistrict files. To unzip those files, we’ll first add the India_Official_Boundaries_2019-master folder to our path argument in the list.files function and unzip only subdistrict shapefiles by specifying ^India_Subdistrict.* followed by .zip extension in our pattern argument. Finally, we’ll use ldply from dplyr to split the list of files, apply the unzip function, and return the unzipped subdistrict shapefiles in our working directory.

u <- "https://github.com/justinelliotmeyers/India_Official_Boundaries_2019/archive/refs/heads/master.zip"
download.file(u, basename(u), mode="wb")
unzip("master.zip") #unzip master file
shpzip <- list.files(path = paste0(getwd(), "/India_Official_Boundaries_2019-master"), 
  pattern="^India_Subdistrict.*\\.zip$", 
  full.names = T)
outDir <- getwd()
ldply(.data = shpzip, .fun = unzip, exdir = outDir) # unzip all subdistrict shapefiles

Now we can proceed with loading the shapefiles into R. The easiest way is to declare a list of shapefiles that we want to batch import. We then make a function ind_shp that loads every subdistrict shapefile from the the list of files (shps) as sf objects with standard EPSG:4326 (WGS84) geodetic coordinate system for world and merges them into a single data.frame using bind_rows()

#create a list of Indian subdistrict shapefiles for import
shps <- list.files(pattern="^India_Subdistrict.*\\.shp$")
#Use lapply to import all shapefiles in the list
india <- lapply(shps, function(ind_shp) {
  ind <- st_read(ind_shp) %>%  #next, read all shp files as "sf" object and assign WSG84 projection
         st_transform(crs = 4326)
  return(ind)
  }) %>% 
bind_rows() #finally, merge subdistricts into single data.frame

In our newly minted india subdistrict shapefile, we have 5966 subdistricts denoted with a unique identifier sdtcode11.

head(india)
Simple feature collection with 6 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 73.88812 ymin: 33.70196 xmax: 74.81442 ymax: 34.78638
Geodetic CRS:  WGS 84
  stcode11 dtcode11 sdtcode11          stname  dtname   sdtname Subdt_LGD
1       01      001     00001 JAMMU & KASHMIR Kupwara   Kupwara         1
2       01      001     00002 JAMMU & KASHMIR Kupwara  Handwara         2
3       01      002     00004 JAMMU & KASHMIR  Badgam      Khag         4
4       01      002     00005 JAMMU & KASHMIR  Badgam   Beerwah         5
5       01      002     00006 JAMMU & KASHMIR  Badgam Khansahib         6
6       01      002     00007 JAMMU & KASHMIR  Badgam    Budgam         7
  Dist_LGD State_LGD JSID                       geometry
1        8         1    2 MULTIPOLYGON (((74.33744 34...
2        8         1    3 MULTIPOLYGON (((74.24232 34...
3        2         1    4 MULTIPOLYGON (((74.4989 34....
4        2         1    5 MULTIPOLYGON (((74.69221 34...
5        2         1    6 MULTIPOLYGON (((74.69483 34...
6        2         1    7 MULTIPOLYGON (((74.70503 34...

Now that we have the shapefile in place, we’ll fetch the railway data next. We’ll use the Open Street Maps Geofabrik for India to download the most recent shapefiles. Watch out because the following code will download the entire India dataset with a size of nearly 2 GBs. To my knowledge, there’s no other convenient way of downloading such a large OSM dataset into R. For example, package osmextract allows you to query only highways, waterways and aerialways from OSM Geofabrik while library geofabrik downloads the complete OSM data for any given region/country from Geofabrik. So, we’ll stick with base R so God helps us.

In the lines below, we first programmatically download and unzip the OSM data for India. Then we load the railway shapefile as sf object, declare the EPSG:4326 coordinate system and select only the rail class. OSM defines this class as full sized passenger or freight trains in the standard gauge for the country or state. Bear in mind that there are other railway classes in the OSM nomenclature such as, for instance, light_rail, subway or narrow_gauge that could be considered for analysis.

# get OSM data for India (size 1992.9 MB)
url <- "https://download.geofabrik.de/asia/india-latest-free.shp.zip"
download.file(url, basename(url), mode="wb")
unzip("india-latest-free.shp.zip")
 
#load railways and subset rail type
rail <- st_read("gis_osm_railways_free_1.shp", stringsAsFactors = FALSE) %>% 
        st_transform(4326) %>% 
        st_as_sf() %>%
        filter(fclass=="rail")

Let’s create the copies of our india and rail objects before venturing forth.

#create copies
ind <- india
r <- rail

We’re ready for our geospatial analysis! If railway density is the length of rails of a subdistrict divided by the area size of the subdistrict, then we first need to determine the total length of railway that falls under every subdistrict polygon. We use st_intersection function to determine the portions of rails that overlap in every Indian subdistrict. Once the function returns railway extents for every subdistrict we follow up with sf::st_length(geometry) to compute the length of each rail extent in meters (len_m) and group them by a subdistrict code using dplyr::group_by(sdtcode11)

# intersect rails by subdistrict polys and compute length
ints <- st_intersection(r, ind) %>% 
        dplyr::mutate(len_m = sf::st_length(geometry)) %>% # returns length in meters
        dplyr::group_by(sdtcode11)

The previous chunk returns a spatial object where every row is an intersected rail extent, but we need rail length for every subdistrict. So, we first turn ints into a data frame int. Then we make sure that len_m is of integer class and that sdtcode11 is a string so that we can merge the data frame with our ind spatial object. Finally, we aggregate the rail length in meters by subdistrict code to create ii data frame.

int <- as.data.frame(as.matrix(ints)) # place it into data.frame
int$len_m <- as.numeric(as.character(int$len_m)) #length must be numeric
int$sdtcode11 <- as.character(int$sdtcode11) #subdistrict code must be string
ii <- ddply(int, "sdtcode11", summarise, lr_m = sum(len_m, na.rm = TRUE)) #aggregate length by subdistrict

We are finally ready to merge our subdistrict spatial object and the rail length data frame by subdistricts. As we are interested in the total sum of rails divided by land area size, we next compute area size in square kilometers using st_area. This function returns square meters but we want square kilometers. Since 1 kilometer is equal to 1000 meters, then 1 square kilometer will be 1000 meters times 1000 meters. So, to transform area size from square meters to square kilometers we divide every rail length by 1 million. Finally, we have both the rail length and area size and that helps us create rail density. We first transform rail length to kilometers and then divide it by the area size and multiply by 100 to return 1 km of rail per 100 square km of land area.

# join indian subdistricts and rail length by subdistrict code
df <- merge(ind, ii, by='sdtcode11', all.x=T)
 
# let's calculate area size in square kilometers
df$area_sqkm <- st_area(df) / 1000000 #st_area returns square meters so we get square km by dividing the result by 1000 squared
 
# rail density: km or rail / 100 square km
df$rail_dens <- (df$lr_m/1000) / df$area_sqkm * 100

We will use a discrete scale to depict rail density in our legend to distinguish among the colors. We’ll first calculate the natural interval of our measure using quantile breaks to get our cutpoints. We’ll use 6 cutpoints to keep the colors discernible.

ni = classIntervals(df$rail_dens, 
           n = 6, 
           style = 'quantile')$brks

Then we’ll create labels based on the above cutpoints and round them to get rid of the decimals.

# this function uses above intervals to create categories
labels <- c()
for(i in 1:length(ni)){
    labels <- c(labels, paste0(round(ni[i], 0),
                             "–",
                             round(ni[i + 1], 0)))
}
labels <- labels[1:length(labels)-1]

Finally, we’ll make the categorical variable (cat) based on the cutpoints.

df$cat <- cut(df$rail_dens, 
              breaks = ni, 
              labels = labels, 
              include.lowest = T)

Since some subdistricts have no rails we’ll create an additional category labeled No rail.

lvl <- levels(df$cat)
lvl[length(lvl) + 1] <- "No rail"
df$cat <- factor(df$cat, levels = lvl)
df$cat[is.na(df$cat)] <- "No rail"
levels(df$cat)

And, finally, we can plot our map of railway density in India. The geom_sf argument will fill the polygons with discrete values while the subdistrict lines will be omitted for the sake of simplicity. Also, in the coord_sf line we’ll set the coordinate system to EPSG:4326. We’ll paint our choropleth map using a manual color palette from Adobe Color and make it colorblind friendly with the help of Chroma.js Color Palette Helper. We’ll also modify the last numeric label to indicate that it encompasses all the values higher than 18.

p <- ggplot() +
geom_sf(data=df, aes(fill = cat), color=NA, size=0) +
    coord_sf(crs = 4326, datum = NA) +
    theme_minimal() +
  theme(
    text = element_text(family = "georg", color = "#22211d"),
    axis.line = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank(),
    axis.title.x = element_text(size=9, color="grey60", hjust=.55, vjust=25),
    axis.title.y = element_blank(),
    legend.position = c(.7, .85),
    legend.text = element_text(size=10, color="grey20"),
    legend.title = element_text(size=11, color="grey20"),
    panel.grid.major = element_line(color = "white", size = 0.2),
    panel.grid.minor = element_blank(),
    plot.margin     =   unit(c(t=0, r=0, b=0, l=0),"lines"), #added these narrower margins to enlarge maps
    plot.title = element_text(face="bold", size=20, color="#2a1559", hjust=.5, vjust=-2),
    plot.subtitle = element_text(size=16, color="#b72286", hjust=.5, vjust=-2),
    plot.background = element_rect(fill = "white", color = NA), 
    panel.background = element_rect(fill = "white", color = NA), 
    legend.background = element_rect(fill = "white", color = NA),
    panel.border = element_blank()) +
  labs(x = "©2021 Milos Popovic https://milospopovic.net\n Data: OSM Geofabrik India",
    title = "Railway density in India", 
    subtitle = "at subdistrict level", 
    caption = "") +
  scale_fill_manual(name= expression(paste("1 km of rail per 100", km^{2}, "of land area")), 
    values = rev(c("grey80", '#2a1559', '#701d6f', '#b12884', '#de529a', '#ee8eb0', '#f8c5c6')),
    labels = c("~0–3",     "3–5",    "5–7",   "7–11",   "11–18", ">18",  "No rail"),
    drop = F)+
  guides(color=F, fill = guide_legend(
            direction = "horizontal",
            keyheight = unit(1.15, units = "mm"),
            keywidth = unit(12, units = "mm"),
            title.position = 'top',
            title.hjust = 0.5,
            label.hjust = .5,
            nrow = 1,
            byrow = T,
            # also the guide needs to be reversed
            reverse = F,
            label.position = "bottom"
          )
    )

photo1

In this tutorial, I showed you a few data and geospatial tricks that would hopefully help you kick off your own dataviz projects. First, I showed you how to easily download files from a GitHub repo and OSM Geofabrik, extract them into your working directory and batch import them into R. Second, you’ve learned to intersect line by polygon and return the sum of lengths using sf library. Finally, you found out how to write fewer lines of code to create a beautiful map with geom_sf.

Feel free to check the full code here, clone the repo and reproduce, reuse and modify the code as you see fit. With a slight modification of this code, you’ll be able to produce a myriad of maps using OSM Geofabrik shapefiles and all that programmatically in R instead of spending your precious time on manually downloading the data.

I’d be happy to hear your view on how this map could be improved or extended to other geographic realms. To do so, please follow me on Twitter, Instagram or Facebook!