Lecture on Graph Theory:

Introduction and Applications of Graph Theory

What is Graph?

In mathematics, graph theory exists to study graphs, but what exactly is a graph?

Several deffinitions exist:

Graphs are mathematical structures used to study pairwise relationships between objects and entities.

A graph is a structure with a set of objects, these objects can be related to other objects in pairs. The objects correspond to mathematical concepts called vertices, and the connected or related pairs of vertices are called edges.

A graph can also be thought of as: A binary relation (=edges) between elements of a set (=vertices).

Vertices are also often called nodes or points.

Edges can be called links or lines.

Graphs use the terms vertex and edge commonly, while networks usually use node and link.

Examples:

Nodes = vertices (ex: vertices = {A, B, C, D, E})

Edges = links (ex: edges = {AB}, {AC}, {B,C}, {C,E})



Why Graphs?

Suppose you booked an Uber cab. One of the most important things that is critical to Uber’s functioning is its ability to match drivers with riders in an efficient way. Consider there are 6 possible rides that you can be matched with. So, how does Uber allocate a ride to you? We can make use of graphs to visualize how the process of allotting a ride might be.

Which Ride (driver) for the rider makes the most sense?



History of graphs and graph theory

Leonhard Euler and the Seven Bridges of Kongsberg Problem

History:

Can you take a walk through the town, visting each part of the town and crossing each bridge only once?

Rules: 1. No doubling back. 2. You can only cross a bridge one time. 3. You end at the same place you started.

Helpful terms:

Simple path vs Euler path

A path in a graph represents a way to get from an origin to a destination node by traversing the edges of the graph.

Simple path: a route around a graph that visits every vertex one is called a simple path.

Euler path: a route around a graph that visits every edge once is called an Euler path.

You can tell which graphs have a Euler path by counting how many vertices have an odd degree. The number of vertices of odd degree must be either zero or two, if not it is not a Euler path.

Two vertices are adjacent if they share a common edge (an edge joins the vertices).

The degree of a vertex is the total number of vertices that are adjacent to the vertex.

Before computers graph theory could become quite complex depending on the application.

Recently the use of graphs and graph theory have become increasingly popular due to advancements in computing power and applications in diverse fields.



Why study graph theory?

  1. Graphs provide a better way of dealing with abstract concepts like relationships and interactions. They also offer an intuitively visual way of thinking about these concepts. Graphs also form a natural basis for analyzing relationships in a Social context

  2. Graph Databases have become common computational tools and alternatives to SQL (structured query language) and NoSQL databases

  3. Graphs are used to model analytics workflows in the form of DAGs (Directed acyclic graphs)

  4. Some Neural Network Frameworks also use DAGs to model the various operations in different layers

  5. Graph Theory concepts are used to study and model Social Networks, Fraud patterns, Power consumption patterns, Virality and Influence in Social Media. Social Network Analysis (SNA) is probably the best known application of Graph Theory for Data Science

  6. It is used in Clustering algorithms – Specifically K-Means

  7. System Dynamics also uses some Graph Theory concepts – Specifically loops

  8. Path Optimization is a subset of the Optimization problem that also uses Graph concepts

  9. From a Computer Science perspective – Graphs offer computational efficiency. The Big O complexity for some algorithms is better for data arranged in the form of Graphs (compared to tabular data)

Source: https://www.analyticsvidhya.com/blog/2018/09/introduction-graph-theory-applications-python/


Key terms and concepts

Edges = lines or links, edges can have directions, signs, weights, functional equations, or locations.

Vertices = nodes or points, can have labels, weights, or locations.

Directed Graph = a graph that is made up of a set of vertices connected by edges, where the edges have a direction associated with them

Undirected Graph = a graph with a set of vertices or nodes that are connected, where all the edges are bidirectional, sometimes called an undirected network

Average path length = average number of steps along the shortest paths for all possible pairs of network nodes, measure of efficiency of information or mass transport on a network

Breadth first search (BFS) = algorithm that traverses a graph data structure, starts on a node and explores all neighbor nodes before moving to nodes at the next depth level

Depth first search (DFS) = similar to BFS, starts at a node on a graph data structure, searches as far as possible along each branch before backtracking

BFS and DFS are algorithms to search for nodes in a graph

Centrality = identify the most important vertices (nodes) within a graph, importance needs to be defined prior

Degree centrality = simple centrality measure that counts how many neighbors a node has

Closeness centrality = measure of centrality in a network, calculated as the sum of the length of the shortest paths between the node and all other nodes in the graph

Betweenness centrality = measure of centrality based on shortest paths

Isolation = an isolated vertex in a graph with, if a graph has an isolated vertex then the graph is disconnected

Adjacency/Incident = vertices are called adjacent if they connected by an edge, two edges are called incident if they share a vertex

Network density = a measure of how many edges a graph has

Isomorphic graph/s = two graphs which contain the same number of graph vertices connected in the same way are said to be isomorphic

Bipartite graph = special kind of graph with the following properties:

The embbedded graph is an example of a bipartite graph because:


Graph theory in R!

Useful packages for graph theory in R Studio

  1. IGraph - a collection of network analysis tools with the emphasis on efficiency, portability and ease of use

  2. Rgraphviz(requires graphiz) - provides plotting capabilities for R graph objects

  3. ggplot2 - one of the leading packages for creating graphs in R

  4. raster - Geographic Data Analysis and Modeling Reading, writing, manipulating, analyzing and modeling of gridded spatial data

  5. reshape2 - easily transform data between wide and long formats

  6. ggraph (extension of ggplot) - extension of ggplot, but geared for use in graphs and networks

  7. tidygraph - can help manipulate data used in graph and network applications, and provide tidy interfaces for common graph algorithms

  8. networkD3 (interactive network graphs) - Creates D3 JavaScript network, tree, dendrogram, and Sankey graphs from R

  9. visNetwork (interactive network graphs) - used for network visualization

  10. sp - classes and methods for spatial data

  11. dismo - species distribution modeling

  12. NetIndices - Estimating network indices, including trophic structure of foodwebs in R


First we need to download a package to be able to make and visualize graphs

igraph is on CRAN and the igraph package can be found at: http://igraph.org/r/


Useful expressions in igraph

  1. V(dataset) = returns all vertices

  2. V(dataset)[#, #:#] = shows vertices in the specified positions

  3. V(dataset)[degree(dataset) < #] = finds vertices satisfying the specified condition

  4. V(dataset)[nei(‘vertex name’)] = shows vertex neighbors

  5. V(dataset)[‘vertex name’, ‘vertex name’] = selects given vertices

  6. E(dataset) = returns all edges

  7. E(dataset)[vertexset %–% vertexset] = shows all edges between two vertex sets


#install.packages c ("igraph", "reshape2", "ggplot2", "raster")

library(igraph) #to start we will just explore the igraph package
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union

igraph in R demonstration


First lets create a basic network and visualize it to understand the vertices and edges

For the igraph demonstration we will first create our own data


graph1 <- graph (edges = c(1,2, 2,3, 3,4, 4,5, 1,5), n = 5) #Assign edges and number of vertices

plot(graph1) #Plot the network


Now lets take our first graph and add more edges


graph1 <- graph(edges = c(1,2, 2,3, 3,4, 4,5, 1,5, 2,4, 3,5, 5,2, 1,4, 2,4, 3,1), n =5)

plot(graph1)


Is this network directed or undirected? Why?


Now lets make a simple network that is undirected


graph2 <- graph (edges=c(1,2, 2,3, 3,1), n = 3, directed = F)

plot(graph2) 


The code below generates an undirected graph with three edges. The numbers are interpreted as vertex IDs, so the edges are 1–>2, 2–>3, 3–>1.

By default if we do not specify the network as directed or undirected igraph assumes directed


What if we are working with a complex network and wanted to know what the edges were?

We can call on the assigned network and it will report the edges


class(graph2)
## [1] "igraph"
graph2  #We can call on the assigned graph/network to see what the edges are
## IGRAPH 95b6fc8 U--- 3 3 -- 
## + edges from 95b6fc8:
## [1] 1--2 2--3 1--3
E(graph2)
## + 3/3 edges from 95b6fc8:
## [1] 1--2 2--3 1--3

Now lets create and visualize a directed network and make it have 15 vertices, but assign the edges between vertices as directed and not connect all the vertices


graph3 <- graph(edges = c(1,2, 2,3, 3,1, 5,6, 6,9, 9,13, 13,2, 5,1), n = 15, directed = T)

plot(graph3)

graph3
## IGRAPH 95c6901 D--- 15 8 -- 
## + edges from 95c6901:
## [1]  1-> 2  2-> 3  3-> 1  5-> 6  6-> 9  9->13 13-> 2  5-> 1

Like the above graph we created, we can specify what vertices we want to be isolated by providing a list of the isolated vetices


graph4 <- graph( c("1", "2", "2", "3", "2", "3", "1", "1"), 

             isolates=c("4", "5", "6", "7") )  

# In named graphs we can specify isolates by providing a list of their names.



plot(graph4, edge.arrow.size=.5, vertex.color="gold", vertex.size=15,  #Plotting asthetics

     vertex.frame.color="gray", vertex.label.color="black", 

     vertex.label.cex=0.8, vertex.label.dist=2, edge.curved=0.2) 


We can also assign names to vertices instead of just numbers, this could be useful for unique identifiers in a data set


Bridgegraph <- graph( c("Bridge1","Bridge2", "Bridge3","Bridge4", "Bridge2","Bridge3", "Bridge1", "Bridge4","Bridge5", "Bridge3","Bridge5", "Bridge5","Bridge7")) # named vertices
## Warning in matrix(edges, ncol = 2, byrow = TRUE): data length [13] is not a
## sub-multiple or multiple of the number of rows [7]
# When the edge list has vertex names, the number of edges/nodes is not needed

plot(Bridgegraph)

Bridgegraph
## IGRAPH 95de9a4 DN-- 6 7 -- 
## + attr: name (v/c)
## + edges from 95de9a4 (vertex names):
## [1] Bridge1->Bridge2 Bridge3->Bridge4 Bridge2->Bridge3 Bridge1->Bridge4
## [5] Bridge5->Bridge3 Bridge5->Bridge5 Bridge7->Bridge1
#Directly accessing edges and vertices
E(Bridgegraph) #The edges of the object
## + 7/7 edges from 95de9a4 (vertex names):
## [1] Bridge1->Bridge2 Bridge3->Bridge4 Bridge2->Bridge3 Bridge1->Bridge4
## [5] Bridge5->Bridge3 Bridge5->Bridge5 Bridge7->Bridge1
V(Bridgegraph) #The vertices of the object 
## + 6/6 vertices, named, from 95de9a4:
## [1] Bridge1 Bridge2 Bridge3 Bridge4 Bridge5 Bridge7

We can also create graphs using a dataframe or an adjacency matrix, lets start with a dataframe


#we're going to randomly generate a graph - first specify the names of the nodes and the number of edges we want
node_names = c("A", "B", "C", "D", "E", "F")
n_edges = 8

#to make graph, we're going to create 'n_edges' edges between the nodes
all_possible_edges_df = t(combn(node_names, m=2)) #find all possible combinations of the nodes

#create a graph using a data frame
#-------
set.seed(746)
edges_df = data.frame(all_possible_edges_df[sample(1:nrow(all_possible_edges_df), n_edges),], stringsAsFactors = FALSE) #randomly select 'n_edges' of these rows
edges_df
##   X1 X2
## 1  C  F
## 2  A  C
## 3  D  E
## 4  E  F
## 5  C  D
## 6  A  B
## 7  B  D
## 8  D  F
dfgraph = graph_from_data_frame(edges_df, directed = FALSE, vertices = node_names)

plot(dfgraph)


Now lets use an adjacency matrix


#create a graph using an adjacency matrix

#transform the data frame to be in the form of an adjacency matrix
edges_mat = matrix(nrow = length(node_names), ncol = length(node_names), data=0, dimnames = list(node_names, node_names))
edges_mat[as.matrix(edges_df)] = 1 #put 1s in the matrix to represent which edges are connected (0 is disconnected, 1 is connected)
edges_mat
##   A B C D E F
## A 0 1 1 0 0 0
## B 0 0 0 1 0 0
## C 0 0 0 1 0 1
## D 0 0 0 0 1 1
## E 0 0 0 0 0 1
## F 0 0 0 0 0 0
Agraph = graph_from_adjacency_matrix(edges_mat, mode=c("undirected"))
plot(Agraph)


Now lets compare the two graphs and see if they are isomorphic

Isomorphic graphs are: Two graphs which contain the same number of graph vertices connected in the same way are said to be isomorphic.


#compare the two graphs

isomorphic(dfgraph, Agraph)
## [1] TRUE

Reading in network files and working with network data

In the next section, we will work primarily with two small example data sets. Both contain data about media organizations. One involves a network of hyperlinks and mentions among news sources. The second is a network of links between media venues and consumers. While the example data used here is small, many of the ideas behind the analyses and visualizations we will generate apply to medium and large-scale networks.

Source: “https://kateto.net/networks-r-igraph


#Set your working directory

#setwd("C:\Users\aig02\Desktop\Graph Theory") #Set your working directory to the location of your data

# DATASET 1: edgelist 

nodes <- read.csv("Dataset1-Media-Example-NODES.csv", header=T, as.is=T) #nodes/vertices
links <- read.csv("Dataset1-Media-Example-EDGES.csv", header=T, as.is=T) #edges

# Examine the data:
head(nodes)
##    id               media media.type type.label audience.size
## 1 s01            NY Times          1  Newspaper            20
## 2 s02     Washington Post          1  Newspaper            25
## 3 s03 Wall Street Journal          1  Newspaper            30
## 4 s04           USA Today          1  Newspaper            32
## 5 s05            LA Times          1  Newspaper            20
## 6 s06       New York Post          1  Newspaper            50
head(links)
##   from  to weight      type
## 1  s01 s02     10 hyperlink
## 2  s01 s02     12 hyperlink
## 3  s01 s03     22 hyperlink
## 4  s01 s04     21 hyperlink
## 5  s04 s11     22   mention
## 6  s05 s15     21   mention
nrow(nodes); length(unique(nodes$id))
## [1] 17
## [1] 17
nrow(links); nrow(unique(links[,c("from", "to")]))
## [1] 52
## [1] 49

Notice that there are more links than unique from-to combinations. That means we have cases in the data where there are multiple links between the same two nodes. We will collapse all links of the same type between the same two nodes by summing their weights, using aggregate() by “from”, “to”, & “type”. We don’t use simplify() here so as not to collapse different link types.


# Collapse multiple links of the same type between the same two nodes
# by summing their weights, using aggregate() by "from", "to", & "type":
# (we don't use "simplify()" here so as not to collapse different link types)
links <- aggregate(links[,3], links[,-3], sum)
links <- links[order(links$from, links$to),]
colnames(links)[4] <- "weight"
rownames(links) <- NULL

DATASET 2: matrix

Two-mode or bipartite graphs have two different types of actors and links that go across, but not within each type. Our second media example is a network of that kind, examining links between news sources and their consumers.


nodes2 <- read.csv("Dataset2-Media-User-Example-NODES.csv", header=T, as.is=T)
links2 <- read.csv("Dataset2-Media-User-Example-EDGES.csv", header=T, row.names=1)

# Examine the data:
head(nodes2)
##    id   media media.type media.name audience.size
## 1 s01     NYT          1  Newspaper            20
## 2 s02    WaPo          1  Newspaper            25
## 3 s03     WSJ          1  Newspaper            30
## 4 s04    USAT          1  Newspaper            32
## 5 s05 LATimes          1  Newspaper            20
## 6 s06     CNN          2         TV            56
head(links2)
##     U01 U02 U03 U04 U05 U06 U07 U08 U09 U10 U11 U12 U13 U14 U15 U16 U17
## s01   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## s02   0   0   0   1   1   0   0   0   0   0   0   0   0   0   0   0   0
## s03   0   0   0   0   0   1   1   1   1   0   0   0   0   0   0   0   0
## s04   0   0   0   0   0   0   0   0   1   1   1   0   0   0   0   0   0
## s05   0   0   0   0   0   0   0   0   0   0   1   1   1   0   0   0   0
## s06   0   0   0   0   0   0   0   0   0   0   0   0   1   1   0   0   1
##     U18 U19 U20
## s01   0   0   0
## s02   0   0   1
## s03   0   0   0
## s04   0   0   0
## s05   0   0   0
## s06   0   0   0
#We can see that links2 is an adjacency matrix for a two-mode network:

# links2 is an adjacency matrix for a two-mode network:
links2 <- as.matrix(links2)
dim(links2)
## [1] 10 20
dim(nodes2)
## [1] 30  5

Turn the network data into igraph objects


#  ------->> DATASET 1 -------- 

# Converting the data to an igraph object:
# The graph.data.frame function, which takes two data frames: 'd' and 'vertices'.
# 'd' describes the edges of the network - it should start with two columns 
# containing the source and target node IDs for each network tie.
# 'vertices' should start with a column of node IDs.
# Any additional columns in either data frame are interpreted as attributes.

net <- graph_from_data_frame(d=links, vertices=nodes, directed=T) 

# Examine the resulting object:
class(net)
## [1] "igraph"
net 
## IGRAPH 961b5ee DNW- 17 49 -- 
## + attr: name (v/c), media (v/c), media.type (v/n), type.label
## | (v/c), audience.size (v/n), type (e/c), weight (e/n)
## + edges from 961b5ee (vertex names):
##  [1] s01->s02 s01->s03 s01->s04 s01->s15 s02->s01 s02->s03 s02->s09
##  [8] s02->s10 s03->s01 s03->s04 s03->s05 s03->s08 s03->s10 s03->s11
## [15] s03->s12 s04->s03 s04->s06 s04->s11 s04->s12 s04->s17 s05->s01
## [22] s05->s02 s05->s09 s05->s15 s06->s06 s06->s16 s06->s17 s07->s03
## [29] s07->s08 s07->s10 s07->s14 s08->s03 s08->s07 s08->s09 s09->s10
## [36] s10->s03 s12->s06 s12->s13 s12->s14 s13->s12 s13->s17 s14->s11
## [43] s14->s13 s15->s01 s15->s04 s15->s06 s16->s06 s16->s17 s17->s04
# We can look at the nodes, edges, and their attributes:
E(net)  # Edges of the "net" object
## + 49/49 edges from 961b5ee (vertex names):
##  [1] s01->s02 s01->s03 s01->s04 s01->s15 s02->s01 s02->s03 s02->s09
##  [8] s02->s10 s03->s01 s03->s04 s03->s05 s03->s08 s03->s10 s03->s11
## [15] s03->s12 s04->s03 s04->s06 s04->s11 s04->s12 s04->s17 s05->s01
## [22] s05->s02 s05->s09 s05->s15 s06->s06 s06->s16 s06->s17 s07->s03
## [29] s07->s08 s07->s10 s07->s14 s08->s03 s08->s07 s08->s09 s09->s10
## [36] s10->s03 s12->s06 s12->s13 s12->s14 s13->s12 s13->s17 s14->s11
## [43] s14->s13 s15->s01 s15->s04 s15->s06 s16->s06 s16->s17 s17->s04
V(net)  # vertices of the "net" object
## + 17/17 vertices, named, from 961b5ee:
##  [1] s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17
E(net)$type  # Edge attribute "type"
##  [1] "hyperlink" "hyperlink" "hyperlink" "mention"   "hyperlink"
##  [6] "hyperlink" "hyperlink" "hyperlink" "hyperlink" "hyperlink"
## [11] "hyperlink" "hyperlink" "mention"   "hyperlink" "hyperlink"
## [16] "hyperlink" "mention"   "mention"   "hyperlink" "mention"  
## [21] "mention"   "hyperlink" "hyperlink" "mention"   "hyperlink"
## [26] "hyperlink" "mention"   "mention"   "mention"   "hyperlink"
## [31] "mention"   "hyperlink" "mention"   "mention"   "mention"  
## [36] "hyperlink" "mention"   "hyperlink" "mention"   "hyperlink"
## [41] "mention"   "mention"   "mention"   "hyperlink" "hyperlink"
## [46] "hyperlink" "hyperlink" "mention"   "hyperlink"
V(net)$media # Vertex attribute "media"
##  [1] "NY Times"            "Washington Post"     "Wall Street Journal"
##  [4] "USA Today"           "LA Times"            "New York Post"      
##  [7] "CNN"                 "MSNBC"               "FOX News"           
## [10] "ABC"                 "BBC"                 "Yahoo News"         
## [13] "Google News"         "Reuters.com"         "NYTimes.com"        
## [16] "WashingtonPost.com"  "AOL.com"
plot(net, edge.arrow.size=.4,vertex.label=NA)

# Removing loops from the graph:
net <- simplify(net, remove.multiple = F, remove.loops = T) 

# If you need them, you can extract an edge list or a matrix from igraph networks.
as_edgelist(net, names=T)
##       [,1]  [,2] 
##  [1,] "s01" "s02"
##  [2,] "s01" "s03"
##  [3,] "s01" "s04"
##  [4,] "s01" "s15"
##  [5,] "s02" "s01"
##  [6,] "s02" "s03"
##  [7,] "s02" "s09"
##  [8,] "s02" "s10"
##  [9,] "s03" "s01"
## [10,] "s03" "s04"
## [11,] "s03" "s05"
## [12,] "s03" "s08"
## [13,] "s03" "s10"
## [14,] "s03" "s11"
## [15,] "s03" "s12"
## [16,] "s04" "s03"
## [17,] "s04" "s06"
## [18,] "s04" "s11"
## [19,] "s04" "s12"
## [20,] "s04" "s17"
## [21,] "s05" "s01"
## [22,] "s05" "s02"
## [23,] "s05" "s09"
## [24,] "s05" "s15"
## [25,] "s06" "s16"
## [26,] "s06" "s17"
## [27,] "s07" "s03"
## [28,] "s07" "s08"
## [29,] "s07" "s10"
## [30,] "s07" "s14"
## [31,] "s08" "s03"
## [32,] "s08" "s07"
## [33,] "s08" "s09"
## [34,] "s09" "s10"
## [35,] "s10" "s03"
## [36,] "s12" "s06"
## [37,] "s12" "s13"
## [38,] "s12" "s14"
## [39,] "s13" "s12"
## [40,] "s13" "s17"
## [41,] "s14" "s11"
## [42,] "s14" "s13"
## [43,] "s15" "s01"
## [44,] "s15" "s04"
## [45,] "s15" "s06"
## [46,] "s16" "s06"
## [47,] "s16" "s17"
## [48,] "s17" "s04"
as_adjacency_matrix(net, attr="weight")

## 17 x 17 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 17 column names 's01', 's02', 's03' ... ]]
##                                                      
## s01  . 22 22 21 .  .  .  .  .  .  .  .  .  . 20  .  .
## s02 23  . 21  . .  .  .  .  1  5  .  .  .  .  .  .  .
## s03 21  .  . 22 1  .  .  4  .  2  1  1  .  .  .  .  .
## s04  .  . 23  . .  1  .  .  .  . 22  3  .  .  .  .  2
## s05  1 21  .  . .  .  .  .  2  .  .  .  .  . 21  .  .
## s06  .  .  .  . .  .  .  .  .  .  .  .  .  .  . 21 21
## s07  .  .  1  . .  .  . 22  . 21  .  .  .  4  .  .  .
## s08  .  .  2  . .  . 21  . 23  .  .  .  .  .  .  .  .
## s09  .  .  .  . .  .  .  .  . 21  .  .  .  .  .  .  .
## s10  .  .  2  . .  .  .  .  .  .  .  .  .  .  .  .  .
## s11  .  .  .  . .  .  .  .  .  .  .  .  .  .  .  .  .
## s12  .  .  .  . .  2  .  .  .  .  .  . 22 22  .  .  .
## s13  .  .  .  . .  .  .  .  .  .  . 21  .  .  .  .  1
## s14  .  .  .  . .  .  .  .  .  .  1  . 21  .  .  .  .
## s15 22  .  .  1 .  4  .  .  .  .  .  .  .  .  .  .  .
## s16  .  .  .  . . 23  .  .  .  .  .  .  .  .  .  . 21
## s17  .  .  .  4 .  .  .  .  .  .  .  .  .  .  .  .  .
# data frames describing nodes and edges:
as_data_frame(net, what="edges")
##    from  to      type weight
## 1   s01 s02 hyperlink     22
## 2   s01 s03 hyperlink     22
## 3   s01 s04 hyperlink     21
## 4   s01 s15   mention     20
## 5   s02 s01 hyperlink     23
## 6   s02 s03 hyperlink     21
## 7   s02 s09 hyperlink      1
## 8   s02 s10 hyperlink      5
## 9   s03 s01 hyperlink     21
## 10  s03 s04 hyperlink     22
## 11  s03 s05 hyperlink      1
## 12  s03 s08 hyperlink      4
## 13  s03 s10   mention      2
## 14  s03 s11 hyperlink      1
## 15  s03 s12 hyperlink      1
## 16  s04 s03 hyperlink     23
## 17  s04 s06   mention      1
## 18  s04 s11   mention     22
## 19  s04 s12 hyperlink      3
## 20  s04 s17   mention      2
## 21  s05 s01   mention      1
## 22  s05 s02 hyperlink     21
## 23  s05 s09 hyperlink      2
## 24  s05 s15   mention     21
## 25  s06 s16 hyperlink     21
## 26  s06 s17   mention     21
## 27  s07 s03   mention      1
## 28  s07 s08   mention     22
## 29  s07 s10 hyperlink     21
## 30  s07 s14   mention      4
## 31  s08 s03 hyperlink      2
## 32  s08 s07   mention     21
## 33  s08 s09   mention     23
## 34  s09 s10   mention     21
## 35  s10 s03 hyperlink      2
## 36  s12 s06   mention      2
## 37  s12 s13 hyperlink     22
## 38  s12 s14   mention     22
## 39  s13 s12 hyperlink     21
## 40  s13 s17   mention      1
## 41  s14 s11   mention      1
## 42  s14 s13   mention     21
## 43  s15 s01 hyperlink     22
## 44  s15 s04 hyperlink      1
## 45  s15 s06 hyperlink      4
## 46  s16 s06 hyperlink     23
## 47  s16 s17   mention     21
## 48  s17 s04 hyperlink      4
as_data_frame(net, what="vertices")
##     name               media media.type type.label audience.size
## s01  s01            NY Times          1  Newspaper            20
## s02  s02     Washington Post          1  Newspaper            25
## s03  s03 Wall Street Journal          1  Newspaper            30
## s04  s04           USA Today          1  Newspaper            32
## s05  s05            LA Times          1  Newspaper            20
## s06  s06       New York Post          1  Newspaper            50
## s07  s07                 CNN          2         TV            56
## s08  s08               MSNBC          2         TV            34
## s09  s09            FOX News          2         TV            60
## s10  s10                 ABC          2         TV            23
## s11  s11                 BBC          2         TV            34
## s12  s12          Yahoo News          3     Online            33
## s13  s13         Google News          3     Online            23
## s14  s14         Reuters.com          3     Online            12
## s15  s15         NYTimes.com          3     Online            24
## s16  s16  WashingtonPost.com          3     Online            28
## s17  s17             AOL.com          3     Online            33

Check out ?igraph.plotting for more information on igraph plotting parameters


Improving network plots

Network and node descriptives

Density = the proportion of present edges from all possible edges in the network


# Density

edge_density(net, loops=F)
## [1] 0.1764706
ecount(net)/(vcount(net)*(vcount(net)-1)) #for a directed network
## [1] 0.1764706

Reciprocity = the proportion of reciprocated ties (for a directed network)


# Reciprocity

reciprocity(net)
## [1] 0.4166667
dyad_census(net) # Mutual, asymmetric, and null node pairs
## $mut
## [1] 10
## 
## $asym
## [1] 28
## 
## $null
## [1] 98
2*dyad_census(net)$mut/ecount(net) # Calculating reciprocity
## [1] 0.4166667

Transitivity = - global - ratio of triangles (direction disregarded) to connected triples - local - ratio of triangles to connected triples each vertex is part of


# Transitivity

transitivity(net, type="global")  # net is treated as an undirected network
## [1] 0.372549
transitivity(as.undirected(net, mode="collapse")) # same as above
## [1] 0.372549
transitivity(net, type="local")
##  [1] 0.2142857 0.4000000 0.1153846 0.1944444 0.5000000 0.2666667 0.2000000
##  [8] 0.1000000 0.3333333 0.3000000 0.3333333 0.2000000 0.1666667 0.1666667
## [15] 0.3000000 0.3333333 0.2000000
triad_census(net) # for directed networks
##  [1] 244 241  80  13  11  27  15  22   4   1   8   4   4   3   3   0
# Triad types (per Davis & Leinhardt):
# 
# 003  A, B, C, empty triad.
# 012  A->B, C 
# 102  A<->B, C  
# 021D A<-B->C 
# 021U A->B<-C 
# 021C A->B->C
# 111D A<->B<-C
# 111U A<->B->C
# 030T A->B<-C, A->C
# 030C A<-B<-C, A->C.
# 201  A<->B<->C.
# 120D A<-B->C, A<->C.
# 120U A->B<-C, A<->C.
# 120C A->B->C, A<->C.
# 210  A->B<->C, A<->C.
# 300  A<->B<->C, A<->C, completely connected.

Diameter = A network diameter is the longest geodesic distance (length of the shortest path between two nodes) in the network. In igraph, diameter() returns the distance, while get_diameter() returns the nodes along the first found path of that distance.


# Diameter (longest geodesic distance)
# Note that edge weights are used by default, unless set to NA.
diameter(net, directed=F, weights=NA)
## [1] 4
diameter(net, directed=F)
## [1] 28
diam <- get_diameter(net, directed=T)
diam
## + 7/17 vertices, named, from 961df52:
## [1] s12 s06 s17 s04 s03 s08 s07
# Note: vertex sequences asked to behave as a vector produce numeric index of nodes
class(diam)
## [1] "igraph.vs"
as.vector(diam)
## [1] 12  6 17  4  3  8  7
# Color nodes along the diameter:
vcol <- rep("gray40", vcount(net))
vcol[diam] <- "gold"
ecol <- rep("gray80", ecount(net))
ecol[E(net, path=diam)] <- "orange" 
# E(net, path=diam) finds edges along a path, here 'diam'
plot(net, vertex.color=vcol, edge.color=ecol, edge.arrow.mode=0)


Node degrees = The function degree() has a mode of in for in-degree, out for out-degree, and all or total for total degree.


# Node degrees
# 'degree' has a mode of 'in' for in-degree, 'out' for out-degree,
# and 'all' or 'total' for total degree. 
deg <- degree(net, mode="all")
plot(net, vertex.size=deg*3)

hist(deg, breaks=1:vcount(net)-1, main="Histogram of node degree")


# Degree distribution
deg.dist <- degree_distribution(net, cumulative=T, mode="all")
plot( x=0:max(deg), y=1-deg.dist, pch=19, cex=1.2, col="orange", 
      xlab="Degree", ylab="Cumulative Frequency")


Centrality functions (vertex level) and centralization functions (graph level). The centralization functions return res - vertex centrality, centralization, and theoretical_max - maximum centralization score for a graph of that size. The centrality function can run on a subset of nodes (set with the vids parameter). This is helpful for large graphs where calculating all centralities may be a resource-intensive and time-consuming task.


# Centrality & centralization

# Centrality functions (vertex level) and centralization functions (graph level).
# The centralization functions return "res" - vertex centrality, "centralization", 
# and "theoretical_max" - maximum centralization score for a graph of that size.
# The centrality functions can run on a subset of nodes (set with the "vids" parameter)

# Degree (number of ties)
degree(net, mode="in")
## s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17 
##   4   2   6   4   1   4   1   2   3   4   3   3   2   2   2   1   4
centr_degree(net, mode="in", normalized=T)
## $res
##  [1] 4 2 6 4 1 4 1 2 3 4 3 3 2 2 2 1 4
## 
## $centralization
## [1] 0.1985294
## 
## $theoretical_max
## [1] 272
# Closeness (centrality based on distance to others in the graph)
# Inverse of the node's average geodesic distance to others in the network
closeness(net, mode="all", weights=NA) 
##        s01        s02        s03        s04        s05        s06 
## 0.03333333 0.03030303 0.04166667 0.03846154 0.03225806 0.03125000 
##        s07        s08        s09        s10        s11        s12 
## 0.03030303 0.02857143 0.02564103 0.02941176 0.03225806 0.03571429 
##        s13        s14        s15        s16        s17 
## 0.02702703 0.02941176 0.03030303 0.02222222 0.02857143
centr_clo(net, mode="all", normalized=T) 
## $res
##  [1] 0.5333333 0.4848485 0.6666667 0.6153846 0.5161290 0.5000000 0.4848485
##  [8] 0.4571429 0.4102564 0.4705882 0.5161290 0.5714286 0.4324324 0.4705882
## [15] 0.4848485 0.3555556 0.4571429
## 
## $centralization
## [1] 0.3753596
## 
## $theoretical_max
## [1] 7.741935
# Eigenvector (centrality proportional to the sum of connection centralities)
# Values of the first eigenvector of the graph adjacency matrix
eigen_centrality(net, directed=T, weights=NA)
## $vector
##       s01       s02       s03       s04       s05       s06       s07 
## 0.6638179 0.3314674 1.0000000 0.9133129 0.3326443 0.7468249 0.1244195 
##       s08       s09       s10       s11       s12       s13       s14 
## 0.3740317 0.3453324 0.5991652 0.7334202 0.7519086 0.3470857 0.2915055 
##       s15       s16       s17 
## 0.3314674 0.2484270 0.7503292 
## 
## $value
## [1] 3.006215
## 
## $options
## $options$bmat
## [1] "I"
## 
## $options$n
## [1] 17
## 
## $options$which
## [1] "LR"
## 
## $options$nev
## [1] 1
## 
## $options$tol
## [1] 0
## 
## $options$ncv
## [1] 0
## 
## $options$ldv
## [1] 0
## 
## $options$ishift
## [1] 1
## 
## $options$maxiter
## [1] 1000
## 
## $options$nb
## [1] 1
## 
## $options$mode
## [1] 1
## 
## $options$start
## [1] 1
## 
## $options$sigma
## [1] 0
## 
## $options$sigmai
## [1] 0
## 
## $options$info
## [1] 0
## 
## $options$iter
## [1] 7
## 
## $options$nconv
## [1] 1
## 
## $options$numop
## [1] 31
## 
## $options$numopb
## [1] 0
## 
## $options$numreo
## [1] 18
centr_eigen(net, directed=T, normalized=T) 
## $vector
##  [1] 0.6638179 0.3314674 1.0000000 0.9133129 0.3326443 0.7468249 0.1244195
##  [8] 0.3740317 0.3453324 0.5991652 0.7334202 0.7519086 0.3470857 0.2915055
## [15] 0.3314674 0.2484270 0.7503292
## 
## $value
## [1] 3.006215
## 
## $options
## $options$bmat
## [1] "I"
## 
## $options$n
## [1] 17
## 
## $options$which
## [1] "LR"
## 
## $options$nev
## [1] 1
## 
## $options$tol
## [1] 0
## 
## $options$ncv
## [1] 0
## 
## $options$ldv
## [1] 0
## 
## $options$ishift
## [1] 1
## 
## $options$maxiter
## [1] 1000
## 
## $options$nb
## [1] 1
## 
## $options$mode
## [1] 1
## 
## $options$start
## [1] 1
## 
## $options$sigma
## [1] 0
## 
## $options$sigmai
## [1] 0
## 
## $options$info
## [1] 0
## 
## $options$iter
## [1] 7
## 
## $options$nconv
## [1] 1
## 
## $options$numop
## [1] 31
## 
## $options$numopb
## [1] 0
## 
## $options$numreo
## [1] 18
## 
## 
## $centralization
## [1] 0.5071775
## 
## $theoretical_max
## [1] 16
# Betweenness (centrality based on a broker position connecting others)
# (Number of geodesics that pass through the node or the edge)
betweenness(net, directed=T, weights=NA)
##         s01         s02         s03         s04         s05         s06 
##  24.0000000   5.8333333 127.0000000  93.5000000  16.5000000  20.3333333 
##         s07         s08         s09         s10         s11         s12 
##   1.8333333  19.5000000   0.8333333  15.0000000   0.0000000  33.5000000 
##         s13         s14         s15         s16         s17 
##  20.0000000   4.0000000   5.6666667   0.0000000  58.5000000
edge_betweenness(net, directed=T, weights=NA)
##  [1] 10.833333 11.333333  8.333333  9.500000  4.000000 12.500000  3.000000
##  [8]  2.333333 24.000000 16.000000 31.500000 32.500000  9.500000  6.500000
## [15] 23.000000 65.333333 11.000000  6.500000 18.000000  8.666667  5.333333
## [22] 10.000000  6.000000 11.166667 15.000000 21.333333 10.000000  2.000000
## [29]  1.333333  4.500000 11.833333 16.833333  6.833333 16.833333 31.000000
## [36] 17.000000 18.000000 14.500000  7.500000 28.500000  3.000000 17.000000
## [43]  5.666667  9.666667  6.333333  1.000000 15.000000 74.500000
centr_betw(net, directed=T, normalized=T)
## $res
##  [1]  24.0000000   5.8333333 127.0000000  93.5000000  16.5000000
##  [6]  20.3333333   1.8333333  19.5000000   0.8333333  15.0000000
## [11]   0.0000000  33.5000000  20.0000000   4.0000000   5.6666667
## [16]   0.0000000  58.5000000
## 
## $centralization
## [1] 0.4460938
## 
## $theoretical_max
## [1] 3840

Graph theory in R continued


First load/install the required packages if neccessary:

#install.packages c ("igraph", "reshape2", "ggplot2", "raster")

Example - tortoise social relationships

Load the required packages for this example:

Now we’ll look at an example where we use movement data to create a network of social relationships. In this example, we’ll be working through the process of manipulating and processing data in order to put it in a format amenable to graphs.

The data we’re using comes from GPS tracked tortoises. We’ll be looking at the data for tortoises at one site (“SouthPah”) for 2018. The GPS trackers record a tortoise’s location once every hour. What we want to do in this example is make a network based on social interactions - we want a graph where the nodes are tortoises and the edges indicate that two tortoises had at least one interaction in 2018 (4/25 - 10/11). First, let’s load the data and take a look at it.

#load the data - this loads variables called "torts" and "tort_info"
load("torts.RData")
load("tort_info.RData")
#'torts' contains the tortoise tracking data for the SouthPah site in 2018. There were 23 tortoises
#that were GPS-tracked at this site. A location was recorded once every hour.

#'tort_info' contains basic information about each tortoise

#check out the data (I'm only displaying the columns that are relevant for this example)
head(as.data.frame(torts)[,c("tort", "datetime", "x", "y")])
##        tort            datetime        x       y
## 78402 BS572 2018-04-25 07:25:41 644799.6 3930122
## 78403 BS572 2018-04-25 08:27:50 644761.4 3930093
## 78404 BS572 2018-04-25 09:32:13 644675.3 3930066
## 78405 BS572 2018-04-25 10:32:51 644626.2 3930102
## 78406 BS572 2018-04-25 11:33:32 644629.7 3930103
## 78408 BS572 2018-04-25 12:34:23 644629.8 3930100
head(tort_info)
##    Tortoise NumEntries TotalTimeDays  FirstDate   LastDate NumBurrows
## 12    BS572       3613      169.1793 2018-04-25 2018-10-11         12
## 13    BS573       3036      141.0630 2018-04-25 2018-09-13         12
## 14    BS582       3007      169.0548 2018-04-25 2018-10-11         10
## 15    BS586       2901      141.0705 2018-05-23 2018-10-11         15
## 16    BS590       3066      140.9097 2018-05-23 2018-10-11         12
## 17    BS602       2527      113.9924 2018-06-19 2018-10-11          9
##        Site LikelySEX LastMCL Site_tort_info
## 12 SouthPah         F     234       Southpah
## 13 SouthPah         M     254       Southpah
## 14 SouthPah         F     229       Southpah
## 15 SouthPah         M     280       Southpah
## 16 SouthPah         M     280       Southpah
## 17 SouthPah         M     218       Southpah
rownames(tort_info) = tort_info$Tortoise

#plot the data, making each tortoise a different color
unique_torts = unique(torts$tort) #get the unique tortoise names
tort_cols = rainbow(length(unique_torts)) #get a different color for each tortoise
names(tort_cols) = unique_torts #assign the names of the tortoises to our vector of colors
plot(torts, col=tort_cols[torts$tort], cex = .5, pch=16) #plot the data

Step 1: Standardize the times

To make a graph of the social interactions, we need to be able to detect when two tortoises are close to each other, so we’ll calculate the distances between tortoises at all time steps. However, there’s a problem - the times don’t line up. For example, one tortoise may have a fix at 10:17 and another may have a fix at 10:43. To avoid this problem, we’ll standardize the times - in this case, both times will be set to 10:30. This will introduce some error, but we should still be able to use this to make inferences about the tortoises’ social interactions.

This makes use of some functions that I’ve already written - we won’t go over these, as they’re not relevant to this example.

#get a vector of times that represents the times we want to standardize our data to - 
#use the 'adj_path_get_datetimes()' function that I wrote previously
new_times = adj_path_get_datetimes(torts$datetime) #this function creates a vector times - these are the times we'll standardize our data to

#now we'll use the 'adj_path_times()' function to standardize the times for each tortoise
tort_adj_list = lapply(1:length(unique_torts), function(i){ #loop over the tortoise names, and standardize the times of the fixes
   this_tort = torts[torts$tort == unique_torts[i],] #get the data for only this tortoise
   tort_adj = adj_path_times(pts = this_tort@coords, #standardize the times using 'new_times' as the reference set of times
                             pts_times = this_tort$datetime, 
                             new_times = new_times, 
                             time_gap_limit = 72)
   return(tort_adj)
})
#tort_adj_list is a list of data frames. Give each data frame the name of the corresponding tortoise
names(tort_adj_list) = as.character(unique_torts)
head(tort_adj_list[[1]])
##   actual_id id            datetime type  x  y
## 2        -1  1 2018-04-25 00:30:00    0 NA NA
## 3        -1  2 2018-04-25 01:30:00    0 NA NA
## 4        -1  3 2018-04-25 02:30:00    0 NA NA
## 5        -1  4 2018-04-25 03:30:00    0 NA NA
## 6        -1  5 2018-04-25 04:30:00    0 NA NA
## 7        -1  6 2018-04-25 05:30:00    0 NA NA

Step 2: Calculate the distances at each time step.

Now that we have the times standardized, we can calculate the distance between tortoises at each time step. We’ll do this by calculating a distance matrix at each time step - this will tell us the distances between all pairs of tortoises

#get the distance matrix for all tortoises at each time
all_dists = lapply(1:length(new_times), function(i){ #loop over the vector of reference times (the times we standardized our data to)
   coords_list = lapply(1:length(tort_adj_list), function(j){
      return(tort_adj_list[[j]][i,c("x", "y")]) #for each tortoise, grab the 'i'th location
   })
   coords = do.call(rbind, coords_list) #rbind the locations together to get a data frame of the locations of each tortoise
   rownames(coords) = names(tort_adj_list) #assign the tortoise names as row names
   dist_mat = as.matrix(dist(coords)) #get the distance matrix between all tortoises at this moment in time
   diag(dist_mat) = NA #get rid of the diagonal (this will always be 0 - the diagonal gives the distance between a tortoise and itself.)
   dist_mat[upper.tri(dist_mat)] = NA #get rid of the upper triangle (this is the same as the lower triangle)
   return(dist_mat)
})
all_dists[[1000]][14:17,1:5]
##           BS572     BS573     BS582     BS586     BS590
## CN212  452.2037  360.0633 1199.6542 1238.5734 1459.1895
## CN214 1187.6784 1081.7976 1413.9938  687.1740  164.2659
## CN215  740.6569  585.9494 1290.5514  887.2567  804.4556
## CN216 1034.2068 1024.5905  869.9438  130.0217  759.4366

Step 3: Determine which pairs of tortoises interacted with each other

We now have the distance matrices for each time step. Now we’ll use this to find pairs of tortoises that were within 15m at some point in time. To do this, we’ll look at the distances between each pair of tortoises and keep track of the number of timestamps where they were within 15m - if this value is 1 or more, we’ll say that they interacted at some point. And the larger the number, the longer the tortoises spent together, so this gives us an idea of the strength of the relationship.

The type of calculation we want to do can be easily performed on rasters using the ‘raster’ package, so we’ll convert the list of distance matrices into a RasterBrick (essentially a bunch of rasters “stacked on top of each other). Then we can easily”drill down" through all of the distance matrices, looking at the distances between a given pair of tortoises, and counting how many times there’s a value less than 15.

rast_list = lapply(all_dists, raster) #convert each distance matrix into a raster
dists_rast = raster::brick(rast_list) #combine them to form a RasterBrick

#now for each cell we'll count the number of times the value is 15 or less.
n_close_rast = calc(dists_rast, function(vals_i){
   if(all(is.na(vals_i))){
      return(NA)
   } else { 
      return(sum(vals_i < 15, na.rm=TRUE)) #return the number of TRUEs that result from 'vals_i < 15'
   }
})
n_close = t(as.matrix(n_close_rast)) #convert the raster back to a matrix

rownames(n_close) = rownames(all_dists[[1]]) #the row and column names don't transfer to the raster, so reassign them
colnames(n_close) = colnames(all_dists[[1]])
n_close[5:9,9:13]
##       BS625 BS628 CN202 CN203 CN205
## BS590     0     0    22     0     0
## BS602     1     0     0     0     0
## BS604    92     0     0     0     0
## BS609    57     0     0     0     0
## BS625    NA     0     0     0     0

Step 4: Use what we’ve done so far to create a graph

Now we know which tortoises have ‘interacted’ (at least by our definition), so we’ll use that to make a graph - we’ll make the tortoises the vertices and draw edges between tortoises that interacted with each other.

#use the 'melt' function to reformat our matrix to a data frame
n_close_df = reshape2::melt(n_close)

#We only want to make a graph with the pairs that were close to each other at some point in time (1 or more moments in time where the pair was less than 15m apart)
n_close_df_nz = n_close_df[n_close_df$value > 0 & !is.na(n_close_df$value),]
head(n_close_df_nz)
##      Var1  Var2 value
## 24  BS572 BS573    95
## 144 BS602 BS604     1
## 165 BS586 BS609    85
## 190 BS602 BS625     1
## 191 BS604 BS625    92
## 192 BS609 BS625    57
#let's visualize the adjacency matrix
ggplot(n_close_df, aes(x=Var1, y=Var2, fill = as.integer(value > 0))) + 
   geom_tile(color="black")

#make a graph based on the "edges" we've identified
n_close_g1 = igraph::graph_from_data_frame(n_close_df_nz, directed = FALSE)
plot(n_close_g1, vertex.size = 11, vertex.label.cex = .65)

This tells us which tortoises interacted, but if we were to add edge and node attributes we could make a more informative graph

We’ll make another graph - for this one we’ll pass in a data frame that gives attributes for the vertices. I want to include the centroids of each tortoise’s points - this will let us see how the relationships are arranged in space.

#find the centroids and then merge them with the tortoise info that we already have
tort_centroids = aggregate(cbind(x,y) ~ tort, torts, mean)
tort_info_xy = merge(tort_info, tort_centroids, by.x = "Tortoise", by.y = "tort")

#create another graph, this time providing attribute data for the vertices
n_close_g2 = igraph::graph_from_data_frame(n_close_df_nz, directed = FALSE, vertices = tort_info_xy)

#one way to adjust plot parameters is to assign attributes with certain names to the edges or vertices - these will be used when plotting
#we'll make the edge width vary based on how many contacts the tortoises had (thicker means more)
E(n_close_g2)$width = adj_vals(log(E(n_close_g2)$value), new_min = 1, new_max = 15) #take the log of the values because the data is skewed to the right
#we'll color the vertices according to sex
tort_cols = c("M" = "cornflowerblue", "F" = "lightcoral")
V(n_close_g2)$color = tort_cols[V(n_close_g2)$LikelySEX]

#because we have vertex attributes of x and y, it will automatically put the vertices at those coordinates
plot(n_close_g2, 
     vertex.size = 11, 
     vertex.label.cex = .65)

If we don’t want it to plot it spatially, we can set the layout argument to be the function layout_with_fr(graph_name).

plot(n_close_g2, 
     vertex.size = 11, 
     vertex.label.cex = .65,
     layout = layout_with_fr(n_close_g2))