Here, we will look at the network of Euro-Roads, which is an international system of numbering important roads throughout Europe. The marking of roads as Euro-Roads usually co-exists with national numberings and the visibility of the E-numbers varies by country.
The dataset was downloaded from KONECT. There is no licence information included so I won’t share the original data.
I had to apply one manual fix to the data: there are two major cities in Europe called Brest, one in Belarus and one in France. Unfortunately, they were merged together in the raw data, so I split them up into Brest_BY and Brest_FR.
set.seed(78987)
library("ggraph")
library("igraph")
library("tidygraph")
library("tidyverse")
There are two files containing the data. One for the nodes and one giving the edges. The edges are unweighed and undirected.
nodes <- read_table("ent.subelj_euroroad_euroroad.city.name",
col_names = "name")
nodes <- nodes %>% mutate(id = 1:nrow(nodes))
Have a quick glance at the data to verify that it looks as expected.
glimpse(nodes)
## Rows: 1,175
## Columns: 2
## $ name <chr> "Greenock", "Glasgow", "Preston", "Birmingham", "Southampton", "…
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 1…
Now proceed to the edges. The edge dataset starts with two rows that look like
comments (lines starting with a %
), so those should be skipped.
edges <- read_delim("out.subelj_euroroad_euroroad",
skip = 2, col_names = c("from", "to"), delim = " ")
And also have a peek at the data.
glimpse(edges)
## Rows: 1,417
## Columns: 2
## $ from <dbl> 1, 2, 2, 3, 4, 4, 6, 6, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9…
## $ to <dbl> 2, 3, 17, 4, 5, 855, 7, 880, 8, 22, 23, 411, 453, 454, 889, 9, 4…
Finally, we can put bot parts together to from a graph.
eroads <- tbl_graph(nodes, edges, directed = FALSE)
Before characterising the graph with more depth we start by calculating some basic properties of it, so we get a a better understanding of the data we are dealing with.
First, order and size:
gsize(eroads)
## [1] 1417
gorder(eroads)
## [1] 1175
With the size not being that much larger than the order we can already conclude that the network must be somewhat sparse. This makes sense as we are dealing with a road network only involving major roads across the entirety of Europe and parts of Asia.
comp <- components(eroads)
comp$no
## [1] 26
The amount of components is relatively large. There are both islands as well as potential e-roads not being directly connected to other e-roads, as even major roads are not present in the dataset if they are not part of the e-road network.
comp$csize
## [1] 39 1040 3 4 8 5 2 15 3 4 5 4 4 2 2
## [16] 2 2 7 2 2 2 10 2 2 2 2
Most nodes belong to one large component, with all other components being rather small in comparison.
Compared to what I have covered in a first tutorial there are more measures we can recruit when we want to characterise as a whole.
The first of these measures would be the diameter. The diameter is the amount of edges on the longest geodesic in the graph. Geodesic is essentially different way of saying shortest path. So, of all shortest paths between all the nodes in the network, we have a look at which of these paths is the longest.
diameter(eroads, directed = FALSE)
## [1] 62
We will have a closer look at this value later on.
Another interesting property is the edge density, sometimes also referred to as cost. We got a first hint at the sparsity of the network just from looking at the amount of nodes and edges present in the graph, but calculating the edge density gives us an exact value. This value describes the proportion of present edges compared to all possible edges in the graph (assuming single edges between nodes). A network with many edges is good at quickly transferring information between nodes (the exact interpretation of this of course depends on the kind of data your network represents), but at the same time it is costly (assuming that there is some sort of cost associated with establishing an edge). The opposite is of course true for networks with fewer edges: the network is cheap, but transfer of information becomes slower as one piece of information likely has to travel through more nodes and edges to reach its destination.
edge_density(eroads)
## [1] 0.002054442
We see a very low value here, indicating that only about 0.2% of possible edges are present, thus the network is definitely on the sparser side of things.
Finally, another whole-graph measure that is of interest is the average shortest path length. While the diameter above gave us the longest shortest path and the edge density gives us an idea of the sparsity of the network, we are likely interested in how long it takes on average to get from A to B in our graph. This is exactly what the average shortest path lengths tells us.
mean_distance(eroads, directed = FALSE, unconnected = TRUE)
## [1] 19.37821
Given some knowledge about the network we are dealing with here the results make a lot of sense. The diameter is large compared to the average shortest paths, as our network includes for example connections from Portugal to Kazakhstan. While there is no distance information in our network, it appears straightforward that such a connection would travel through many nodes and edges. However, in contrast especially for nodes in the central part of Europe it seems likely that there are a) many nodes and b) many connections between the nodes, making (graph) distances short. Finally, the relative sparsity of the network is easy to explain, as it seems unlikely that there are direct connections between nodes that are thousands of kilometres apart.
Obviously these are not the only measures you can calculate to characterise
a graph, so I suggest exploring what igraph
and
tidygraph
can do.
There are different ways we can characterise centrality in a graph. Centrality in relation to graphs is essentially a characterisation of importance. The most straightforward way to characterise centrality (or importance) is the degree, which is the amount of connections a node has. The idea here being that nodes that have more connections to other nodes are more important than nodes with only few connections.
eroads <- eroads %>%
activate("nodes") %>%
mutate(degree = centrality_degree())
As we are dealing with a lot of data here it is probably useful to generate some descriptives. As we will apply this to a lot of measures we create a function.
basic_desc <- function(data, col) {
data %>%
activate("nodes") %>%
as_tibble() %>%
summarise(what = rlang::as_name(rlang::enquo(col)),
mean = mean({{ col }}),
median = median({{ col }}),
sd = sd({{ col }}),
se = sd({{ col }}) / sqrt(n()),
min = min({{ col }}),
max = max({{ col }}))
}
And we can apply it to get a quick overview.
basic_desc(eroads, degree)
## # A tibble: 1 x 7
## what mean median sd se min max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 degree 2.41 2 1.19 0.0347 1 10
It seems like most nodes have a rather low degree. As most cities and towns in the data are just part of a single e-road passing through them, leading to a degree of two, this is in line with the data.
Plotting the distribution of the values is probably useful as well.
ggplot(eroads %>% activate("nodes") %>% as_tibble()) +
geom_histogram(aes(x = degree), bins = 10,
fill = "turquoise4", colour = "white") +
scale_x_continuous(breaks = seq(-1, 10, 1)) +
scale_y_continuous(breaks = seq(0, 700, 100)) +
labs(x = "Degree", y = "Count", title = "Nodal Degree") +
hrbrthemes::theme_ipsum_ps()
The plot essentially confirms what we see in the table above. The majority of nodes has a degree of 3 and smaller, with values of 6 and higher being really rare.
And we can get an overview of the nodes with the highest degree.
eroads %>%
activate("nodes") %>%
as_tibble() %>%
select(name, degree) %>%
slice_max(degree, n = 10)
## # A tibble: 11 x 2
## name degree
## <chr> <dbl>
## 1 Moscow 10
## 2 Paris 8
## 3 Liège 8
## 4 Munich 8
## 5 Berlin 8
## 6 Budapest 8
## 7 Metz 7
## 8 Prague 7
## 9 Bratislava 7
## 10 Warsaw 7
## 11 Vienna 7
Most of the high-degree nodes are capital cities, except Liège, Metz and Munich. It is not too surprising that capitals seem to be better connected than less important cities. Moscow also happens to be the largest (most populated) city in Europe.
Betweenness centrality indicates how many shortest paths travel through a node. The higher this value is the more important the node is for the network.
For some measures, like betweenness, there is an option to output a normalised value. The obvious advantage of the normalised value is that it is easier to interpret in a broader sense and compare between networks, while the obvious disadvantage is that you can not directly interpret the value in terms of data.
eroads <- eroads %>%
activate("nodes") %>%
mutate(betweenness = centrality_betweenness(
normalized = TRUE, directed = FALSE))
basic_desc(eroads, betweenness)
## # A tibble: 1 x 7
## what mean median sd se min max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 betweenness 0.0123 0.00254 0.0275 0.000803 0 0.261
While the data is normalised, the mean and median are pointing towards most nodes having rather low betweenness values.
ggplot(eroads %>% activate("nodes") %>% as_tibble()) +
geom_histogram(aes(x = betweenness), bins = 25,
fill = "turquoise4", colour = "white") +
scale_x_continuous(breaks = seq(0, 0.25, 0.05),
minor_breaks = seq(0, 0.25, 0.01)) +
scale_y_continuous(breaks = seq(0, 700, 100)) +
labs(x = "Betweenness", y = "Count",
title = "Betweenness Centrality") +
hrbrthemes::theme_ipsum_ps()
As thought, most nodes have relatively low values with only a few nodes exhibiting large values here.
And again, a look at the nodes with the largest values can generate some insight.
eroads %>%
activate("nodes") %>%
as_tibble() %>%
select(name, betweenness) %>%
slice_max(betweenness, n = 10)
## # A tibble: 10 x 2
## name betweenness
## <chr> <dbl>
## 1 Warsaw 0.261
## 2 Saint Petersburg 0.195
## 3 Berlin 0.191
## 4 Poznań 0.181
## 5 Świebodzin 0.176
## 6 Hamburg 0.169
## 7 Vyborg 0.154
## 8 Vaalimaa 0.153
## 9 Pskov 0.152
## 10 Kotka 0.152
We can see the values quickly decreasing. We also see that the cities here seem to be in a corridor including north-east Germany, Poland, southern Finland and north-western Russia. Given that higher betweenness means that more shortest paths travel through a node, this could suggest that this region acts as sort of an hub in between more remote parts of the network.
Local transitivity, also known as local clustering, gives the amount edges between the neighbours of a node in relation to to maximum possible amount of edges between these notes. So, it is essentially the cost, but for a single node’s neighbourhood. The larger this value is for a node the better its neighbourhood is connected.
eroads <- eroads %>%
activate("nodes") %>%
mutate(transitivity = local_transitivity())
basic_desc(eroads, transitivity)
## # A tibble: 1 x 7
## what mean median sd se min max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 transitivity 0.0167 0 0.0836 0.00244 0 1
The descriptives show that for most nodes the local transitivity is rather low. Again, we are dealing with a network including only the most major of roads in Europe, so it is perhaps unlikely to have dense networks of these roads in most places.
ggplot(eroads %>% activate("nodes") %>% as_tibble()) +
geom_histogram(aes(x = transitivity), bins = 20,
fill = "turquoise4", colour = "white") +
scale_x_continuous(breaks = seq(0, 1, 0.1),
minor_breaks = seq(0, 1, 0.05)) +
scale_y_log10() +
labs(x = "Transitivity", y = "Count",
title = "Local Transitivity") +
hrbrthemes::theme_ipsum_ps()
The plot confirms the thought that most nodes appear to have relatively small local transitivity values. The small peak at 1 are nodes where all neighbours are connected between each other. Note that this does not include any information about the amount of neighbours, only whether they are connected among each other. However reaching a value of 1 for local transitivity if there are numerous neighbours requires a lot more edges than if there are only two or three neighbours for a node.
Looking at the nodes with the highest values might be interesting as well.
eroads %>%
activate("nodes") %>%
as_tibble() %>%
select(name, transitivity) %>%
slice_max(transitivity, n = 10)
## # A tibble: 27 x 2
## name transitivity
## <chr> <dbl>
## 1 Brindisi 1
## 2 Domokos 1
## 3 Vyšné Nemecké 1
## 4 Postojna 1
## 5 Spielfeld 1
## 6 Lausanne 0.333
## 7 Naples 0.333
## 8 Salerno 0.333
## 9 Bari 0.333
## 10 Jihlava 0.333
## # … with 17 more rows
However, I can not really see a pattern here. The places in the list are not exactly major cities, it might just be that they are in an area where e-roads cross. It might be a good idea to visualise that.
Obviously you are not just limited to calculating three centrality measures on
your graph. Looking at the
help page for centrality of
the tidygraph
measure shows that there are potentially a lot more measures
that might be of interest.
Let’s create visualisations of the network.
For plotting graphs the ggraph
library is a handy tool which is rather
straightforward to use if you have experience with ggplot2
.
The simplest call can do to visualise the networks is relatively short.
ggraph(eroads, layout = "auto") +
geom_edge_link() +
geom_node_point() +
theme_graph()
## Using `stress` as default layout
That’s a bit ugly, but at least we now have a visualisation of our network. But we can do better.
Given that we are dealing with geodata we can build a helper function to interact with a geo API to fetch coordinates. Fast-forward of searching a straightforward geo API and cleaning the data quite a bit we can visualise the network in a more natural way.
coordinates <- read_delim("coordinates.tsv", delim = "\t")
coordinates
## # A tibble: 1,175 x 4
## place lon lat id
## <chr> <dbl> <dbl> <dbl>
## 1 Greenock -4.76 55.9 1
## 2 Glasgow -4.25 55.9 2
## 3 Preston -2.70 53.8 3
## 4 Birmingham -1.90 52.5 4
## 5 Southampton -1.40 50.9 5
## 6 Le Havre 0.1 49.5 6
## 7 Paris 2.35 48.9 7
## 8 Orléans 1.91 47.9 8
## 9 Bordeaux -0.580 44.8 9
## 10 San Sebastián -1.98 43.3 10
## # … with 1,165 more rows
We have names, node IDs and coordinates in this data set. So, this can now be joined with the node data.
eroads <- eroads %>% left_join(coordinates, by = "id")
Plotting with custom x and y coordinates is fortunately very straightforward
with ggraph
.
ggraph(eroads, x = lon, y = lat) +
geom_edge_link() +
geom_node_point() +
theme_graph() +
coord_map(projection = "albers", lat0 = 40, lat1 = 65) +
theme(plot.margin = margin(0, 0, 0, 0, "cm"))
Which pretty closely resembles Europe and parts of central Asia. Plotting this is kind of hard, especially with limited space, as we cover a really large area here.
While I did clean the data and tried to ensure everything is correct, there are most likely some coordinates in the data that are off to some extent skewing the map a bit.
At the beginning we found the diameter of the graph to be
62. Now, we are interested in finding and visualising the
longest shortest path in the graph. As far as I can see there is no function
in tidygraph
or igraph
to do this, but this is not too complex.
There is shortest_path
in igraph
, which gives the shortest paths from a
specified node to all other nodes of the graph. This means we can apply this
to all nodes and find the longest shortest path for each node and then
subsequently the longest of the longest shortest paths.
With 1175 nodes this does take a minute or two, though.
node_num <- seq_along(nodes$name)
shortest_path_lengths <- sapply(node_num[-length(node_num)], function(x) {
# uncomment the next line to get an indication of progress
# cat(paste(x, "/", length(node_num), "\r"))
max(sapply(shortest_paths(
eroads, from = x, to = node_num[-1:-x])$vpath, length))
})
We now have a vector where each value is the length of the longest shortest path for a node. The order of the values corresponds to the id of the nodes in the graph.
head(shortest_path_lengths)
## [1] 15 14 15 15 16 50
Thus it is straightforward to find which nodes have the maximum path length.
dm <- diameter(eroads) + 1
nodes$name[shortest_path_lengths == dm]
## [1] "Rennesøy"
We need to add 1 to the diameter here, because the length we generated is based on the amount of nodes in the path.
Now we can extract the path (in fact there are two paths of length 62).
longest_shortest_paths <- shortest_paths(
eroads, from = nodes$name[shortest_path_lengths == dm])
longest_paths <- longest_shortest_paths$vpath[
lengths(longest_shortest_paths$vpath) == dm]
longest_paths
## [[1]]
## + 63/1175 vertices, named, from f4e7667:
## [1] Rennesøy Randaberg Stavanger Sandnes
## [5] Helleland Flekkefjord Lyngdal Mandal
## [9] Kristiansand Arendal Porsgrunn Larvik
## [13] Sandefjord Horten Drammen Oslo
## [17] Gothenburg Säffle Mora Östersund
## [21] Sundsvall Örnsköldsvik Umeå Luleå
## [25] Haparanda Tornio Kemi Oulu
## [29] Jyväskylä Heinola Lahti Helsinki
## [33] Kotka Vaalimaa Vyborg Saint Petersburg
## [37] Pskov Riga Kaunas Warsaw
## + ... omitted several vertices
##
## [[2]]
## + 63/1175 vertices, named, from f4e7667:
## [1] Rennesøy Randaberg Stavanger
## [4] Sandnes Helleland Flekkefjord
## [7] Lyngdal Mandal Kristiansand
## [10] Arendal Porsgrunn Larvik
## [13] Sandefjord Horten Drammen
## [16] Oslo Gothenburg Säffle
## [19] Mora Östersund Sundsvall
## [22] Örnsköldsvik Umeå Luleå
## [25] Haparanda Tornio Kemi
## [28] Oulu Jyväskylä Heinola
## + ... omitted several vertices
And now we have everything we need to generate visualisations of the two longest geodesics in the graph.
visualise_path <- function(data, from, to) {
data <- data %>%
morph(to_shortest_path, from = from, to = to) %>%
activate("edges") %>%
mutate(shortest = "1") %>%
unmorph()
data %>%
ggraph(x = lon, y = lat) +
geom_edge_link(aes(edge_colour = shortest, edge_width = shortest)) +
scale_edge_width_discrete(na.value = 0.5) +
geom_node_point(alpha = .3) +
theme_graph() +
guides(edge_colour = FALSE, edge_width = FALSE) +
coord_map(projection = "albers", lat0 = 40, lat1 = 65) +
theme(plot.margin = margin(0, 0, 0, 0, "cm"))
}
visualise_path(eroads, longest_paths[[1]][1], longest_paths[[1]][dm])
visualise_path(eroads, longest_paths[[2]][1], longest_paths[[2]][dm])
This does work with any other path in the graph as well, providing us with some rudimentary navigation. The only downside with our current implementation is that we need to give node ids to the function, not names.
get_id <- function(data, place) {
data %>%
activate("nodes") %>%
filter(name == {{ place }}) %>%
select(id) %>%
as_tibble() %>%
deframe()
}
id_helsinki <- get_id(eroads, "Helsinki")
id_ljubljana <- get_id(eroads, "Ljubljana")
visualise_path(eroads, id_helsinki, id_ljubljana)
Of course we can also easily visualise the centrality measures we calculated above. As this is a rather repetitive endeavour I will just do this for local transitivity.
ggraph(eroads, x = lon, y = lat) +
geom_edge_link() +
geom_node_point(aes(colour = transitivity, alpha = transitivity)) +
scale_colour_gradient(low = "grey30", high = "orangered") +
scale_alpha_continuous(range = c(0.2, 1)) +
guides(alpha = FALSE) +
theme_graph() +
coord_map(projection = "albers", lat0 = 40, lat1 = 65) +
theme(plot.margin = margin(0, 0, 0, 0, "cm"))
Compared to the tabular data we did have further above, we are able to see some extent of local focus for nodes with high values for local transitivity. By changing both alpha and colour dependent on the nodes’ transitivity values the plot allows us to clearly see the location of the nodes with high values. If all nodes had the same alpha the visibility of the orange coloured nodes would be quite bad.
When dealing with graphs you might be interested in doing community detection
(also called clustering). As the name suggests this means finding nodes that
on some measure are closer to each other than to other nodes in the graph.
There is a large number of algorithms available both in the literature and in
the R packages we work with. One of the igraph
main developers did post a
summary of algorithms available in the package some time ago on
stackoverflow. This answer is definitely
worth reading.
Different methods for community detection can lead to very different results in clustering.
Let’s visualise these differences. We start by applying the infomap algorithm to our data and colour nodes by group membership.
eroads <- eroads %>%
activate("nodes") %>%
mutate(infomap = group_infomap())
ggraph(eroads, x = lon, y = lat) +
geom_edge_link() +
geom_node_point(aes(colour = as.factor(infomap)), alpha = .9) +
theme_graph() +
guides(colour = FALSE) +
coord_map(projection = "albers", lat0 = 40, lat1 = 65) +
theme(plot.margin = margin(0, 0, 0, 0, "cm"))
The plot gives a first impression, but there are too many groups to see anything.
One approach could be to just generate lists of cluster membership.
infomap_groups <- eroads %>%
activate("nodes") %>%
as_tibble() %>%
group_by(infomap) %>%
group_split()
Note, that we need to call as_tibble
, to turn the nodes data into a tibble
as group_split
is not available in tidygraph
and dplyr
can not deal with
tbl_graph
objects.
infomap_groups[[1]] %>%
select(name, lon, lat)
## # A tibble: 16 x 3
## name lon lat
## <chr> <dbl> <dbl>
## 1 Klaipėda 21.1 55.7
## 2 Kaunas 23.9 54.9
## 3 Vilnius 25.3 54.7
## 4 Riga 24.1 56.9
## 5 Šiauliai 23.3 55.9
## 6 Tolpaki 21.3 54.6
## 7 Ventspils 21.6 57.4
## 8 Rēzekne 27.3 56.5
## 9 Velikiye Luki 30.5 56.3
## 10 Nesterov 22.6 54.6
## 11 Marijampolė 23.4 54.6
## 12 Ukmergė 24.8 55.2
## 13 Daugavpils 26.5 55.9
## 14 Ostrov 28.4 57.3
## 15 Palanga 21.1 55.9
## 16 Panevėžys 24.4 55.7
It looks like the cluster is somewhat regionally constrained. Even if you do not know anything about the places in the list the coordinates provide a nice clue.
We might want to visualise this.
eroads %>%
activate("nodes") %>%
mutate(cluster1 = ifelse(infomap == 1, infomap, NA)) %>%
ggraph(x = lon, y = lat) +
geom_edge_link(colour = "grey50") +
geom_node_point(aes(colour = as.factor(cluster1)), alpha = .9) +
theme_graph() +
scale_colour_discrete(na.value = "grey40") +
guides(colour = FALSE) +
coord_map(projection = "albers", lat0 = 40, lat1 = 65) +
theme(plot.margin = margin(0, 0, 0, 0, "cm"))
This kind of confirms the idea about the cluster’s location.
Obviously we can extend this solution.
eroads %>%
activate("nodes") %>%
mutate(cluster1 = ifelse(infomap %in% 1:10, infomap, NA)) %>%
ggraph(x = lon, y = lat) +
geom_edge_link(colour = "grey50") +
geom_node_point(aes(colour = as.factor(cluster1)), alpha = .9) +
theme_graph() +
scale_colour_discrete(na.value = "grey40") +
guides(colour = FALSE) +
coord_map(projection = "albers", lat0 = 40, lat1 = 65) +
theme(plot.margin = margin(0, 0, 0, 0, "cm"))
We can see that the clusters - at least those we plotted here - are somewhat regionally constrained.
Finally, we are probably interested in how many members each cluster has.
infomap_groups %>% map_int(nrow)
## [1] 16 16 15 15 14 14 14 14 13 13 12 12 12 12 12 12 12 12 11 11 11 11 11 11 11
## [26] 11 11 11 10 10 10 10 10 10 10 10 10 10 10 10 10 9 9 9 9 9 9 9 9 9
## [51] 9 9 9 9 9 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 7 7 7 7 7
## [76] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 6 6 6 6 6 6 6 6 6
## [101] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 5 5 5 5 5 5 5 5 5 5 4
## [126] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 2
## [151] 2 2 2 2 2 2 2 2 2 2 2
This looks like most of the clusters are kind of small and mostly similar in size. The idea that regionally nodes are better connected among each other compared to nodes father away, does not sound too far fetched, so there is nothing too surprising here.
We can apply a different grouping algorithm as well to see whether this leads to a different result.
eroads <- eroads %>%
activate("nodes") %>%
mutate(greedy = group_fast_greedy())
ggraph(eroads, x = lon, y = lat) +
geom_edge_link() +
geom_node_point(aes(colour = as.factor(greedy)), alpha = .9) +
theme_graph() +
guides(colour = FALSE) +
coord_map(projection = "albers", lat0 = 40, lat1 = 65) +
theme(plot.margin = margin(0, 0, 0, 0, "cm"))
The result with the fast and greedy algorithm is very different than the result we got with the infomap algorithm. This is not unexpected, as with most kinds of data using different clustering algorithms will leads to different results, so in addition to thinking carefully about what kind of result you want there is probably also some trial and error involved.
Let’s have a closer look.
greedy_groups <- eroads %>%
activate("nodes") %>%
as_tibble() %>%
group_by(greedy) %>%
group_split()
We can get a quick overview of the cluster sizes.
greedy_groups %>% map_int(nrow)
## [1] 127 78 73 71 70 61 57 52 52 50 47 41 39 39 37 32 32 32 28
## [20] 22 20 15 11 10 8 8 7 5 5 4 4 4 4 3 3 2 2 2
## [39] 2 2 2 2 2 2 2 2 2
The groups are much larger here compared to what we got with infomap.
eroads %>%
activate("nodes") %>%
mutate(cluster1 = ifelse(greedy == 1, greedy, NA)) %>%
ggraph(x = lon, y = lat) +
geom_edge_link(colour = "grey50") +
geom_node_point(aes(colour = as.factor(cluster1)), alpha = .9) +
theme_graph() +
scale_colour_discrete(na.value = "grey40") +
guides(colour = FALSE) +
coord_map(projection = "albers", lat0 = 40, lat1 = 65) +
theme(plot.margin = margin(0, 0, 0, 0, "cm"))
Apparently the first cluster here is one focusing on France and adjacent regions. Thus, it looks like while clusters appear to be regionally constrained as well, they tend to be larger using the fast and greedy algorithm.
Here, we have peeked further into network analysis compared to the first and very basic tutorial I wrote. Depending on what kind of data you are dealing with you might want to take a deep dive into this topic as I have still only scratched the surface.