On a commuter train, idly browsing international news outlets left me with a fun question: which letter of the Latin alphabet is used most often in two-letter country codes? I live in the country represented by ‘J’ and ‘P’, and I wondered how likely this pair would be to arise by chance.
R is so handy for quickly satisfying this kind of curiosity that roughly 30 minutes of coding gave me some visuals. The countrycode package did most of the heavy lifting.
The first visual is a 2D tile plot in which the pairs in use are filled with blue. The y axis is arranged in descending order of how often each letter appears as the first character, and the x axis in descending order of how often it appears as the second. The letters ‘M’, ‘S’, ‘G’ and ‘T’ seem popular in both positions. The ordered pair from ‘J’ to ‘P’ sits close to the bottom-right corner, which implies this combination is fairly unexpected solely in terms of frequency.
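To put a rough number on "unexpected in terms of frequency", one option is to multiply the share of codes that start with a letter by the share that end with another. This is only a sketch under an assumption of mine, namely that the two positions are independent. The vector `codes` is a small illustrative sample of real codes, not the full list; in practice it could be replaced by the non-missing values of `codelist$iso2c`.

```r
# Illustrative sample only; swap in the full list of codes for a real estimate
codes <- c("JP", "US", "GB", "DE", "FR", "MX", "MY", "SG", "GT", "TM")

# Split each code into its first and second letter, keeping all 26 levels
fst <- factor(substr(codes, 1, 1), levels = LETTERS)
scd <- factor(substr(codes, 2, 2), levels = LETTERS)

# Relative frequency of each letter in each position
p_fst <- setNames(tabulate(fst, nbins = 26) / length(codes), LETTERS)
p_scd <- setNames(tabulate(scd, nbins = 26) / length(codes), LETTERS)

# Under the independence assumption, the chance of an ordered pair is the
# product of the two marginal frequencies
p_pair <- outer(p_fst, p_scd)
p_jp <- p_pair["J", "P"]

# Baseline for comparison: every ordered pair equally likely
p_uniform <- 1 / 26^2

print(c(independence = p_jp, uniform = p_uniform))
```

Comparing `p_jp` with `p_uniform` shows whether the pair is rarer or more common than a uniformly random draw of two letters would suggest.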
The human mind cannot help seeing the tile plot as an adjacency matrix between the letters, which defines a directed network among them. The second visual is therefore a plot of the directed network between the 26 letters, in which an edge is drawn from the first to the second character of each actual code. The size of a vertex is roughly proportional to the total number of connections it holds. Core, medium, and peripheral letters loosely form a complex web.
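The adjacency-matrix reading can be made explicit in a few lines of base R. The sketch below again uses a small illustrative sample in `codes` rather than the full list, and the names `adj` and `total_degree` are mine. A row is the first letter, a column is the second, and the total degree of a letter is its row sum plus its column sum.

```r
# Illustrative sample only; the real matrix would use every assigned code
codes <- c("JP", "US", "GB", "DE", "FR", "MX", "MY", "SG", "GT", "TM")

fst <- factor(substr(codes, 1, 1), levels = LETTERS)
scd <- factor(substr(codes, 2, 2), levels = LETTERS)

# 26 x 26 adjacency matrix: adj[i, j] is 1 when the code "ij" is in use
adj <- unclass(table(fst, scd))

# Out-degree (uses as first letter) plus in-degree (uses as second letter)
total_degree <- rowSums(adj) + colSums(adj)

# Letters with the most connections in this sample
print(head(sort(total_degree, decreasing = TRUE)))
```

A code with a repeated letter, such as "SS", lands on the diagonal and adds two to that letter's total, which matches how a self-loop is usually counted in a directed graph.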
Below is the script to reproduce these figures:
library(countrycode)
library(dplyr)
library(ggplot2)
library(igraph)

# Extract the relevant data: one row per country that has a two-letter code
df_country_name_iso2c <- codelist %>%
  select(country.name.en, iso2c) %>%
  filter(!is.na(iso2c)) %>%
  mutate(fst = substr(iso2c, 1, 1),
         scd = substr(iso2c, 2, 2),
         val = 1)

# Count how many times each letter appears as the first one
# (ascending, so the most frequent letter ends up at the top of the y axis)
fst_rank <- df_country_name_iso2c %>%
  group_by(fst) %>%
  summarise(count = n()) %>%
  arrange(count)

# Letters that never appear as the first one (such as "X") have to be added
# by hand; setdiff() finds them without hard-coding the letter
missing_fst <- setdiff(LETTERS, fst_rank$fst)
fst_rank <- data.frame(fst = missing_fst,
                       count = rep(0L, length(missing_fst)),
                       stringsAsFactors = FALSE) %>%
  bind_rows(fst_rank)

# Count how many times each letter appears as the second one
scd_rank <- df_country_name_iso2c %>%
  group_by(scd) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

# Draw a tile plot
ggplot(data = df_country_name_iso2c,
       aes(x = scd, y = fst, fill = val)) +
  geom_tile() +
  scale_y_discrete(limits = fst_rank$fst) +
  scale_x_discrete(limits = scd_rank$scd) +
  guides(fill = "none") +  # replaces the deprecated guides(fill = FALSE)
  coord_fixed() +
  ggtitle("ISO alpha-2 country code") +
  theme_minimal(base_size = 15)

# Convert the data into a directed graph (first letter -> second letter)
g <- graph_from_data_frame(d = df_country_name_iso2c %>% select(fst, scd))

# Draw a plot of the graph; vertex size grows with the total degree
plot(g,
     vertex.size = 0.6 * (2 + degree(graph = g, mode = "total")),
     vertex.color = "white",
     vertex.label.family = "sans",
     edge.arrow.size = 0.2,
     edge.curved = TRUE,
     layout = layout_with_kk(g))