Introduction
Before we get started, if you’d like to follow along using our RScript you’ll need to install a few packages listed at the top. They may take some time, so the earlier the better!
Download BayesianNetworkScript.RWhy use Bayesian Networks?
These networks can show cause and effect as well as directional relationships by linking dependent covariates and separating independent covariates. The networks aim to be descriptive by showing large patterns and multivariate connections. They are also diagnostic, using reason, and predictive, describing latent variables and time series. Lastly, the networks are prescriptive via decision making, particularly under uncertainty.
How does this relate to Bayes theorem?
Recall that Bayes Theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event, or \(P\)(\(event\) | \(prior\) \(knowledge\) ).
In other words, Bayes Theory describes the probability of some assignment of a set of variables given the assignment of other evidence. A classic example is \(P\)(\(Sprinkler\), \(Wet Grass\) | \(Cloudy\)) - What is the probability that a sprinkler went off and the grass is wet, given that it is cloudy?
Directed acyclic graph (DAG)
These figures help to emphasize the conditioning inherent to Bayesian models and are especially helpful for graphically displaying models that contain multiple data sources.
- Some key terms:
- Nodes are unique random variables.
- Note: If a certain node is true (happens), the parent nodes are then dependent.
- Arcs, otherwise known as edges, show the conditional probability link between nodes.
- Factors are joint probability distributions (i.e. \(P\) (\(B\) | \(A\)) ), represented by edges.
- Conditional probability tables describe the probabilty of an event occuring given other events.
- Nodes are unique random variables.
Lizard Perching Example
Suppose you are a researcher interested in how perch size varies between two closely-related lizard species (Anolis sangrei and Anolis distichus). An important component of that question is whether or not perch size is conditionally dependent upon species.
Start by looking at your data:
## Species Diameter Height
## Sagrei :164 narrow:252 high:264
## Distichus:245 wide :157 low :145
Here’s the expected network structure:
Now we can fit the network to find the best parameter values using two different methods:
Maximum likelihood estimate
##
## Bayesian network parameters
##
## Parameters of node Diameter (multinomial distribution)
##
## Conditional probability table:
##
## Species
## Diameter Sagrei Distichus
## narrow 0.7195122 0.5469388
## wide 0.2804878 0.4530612
##
## Parameters of node Height (multinomial distribution)
##
## Conditional probability table:
##
## Species
## Height Sagrei Distichus
## high 0.7378049 0.5836735
## low 0.2621951 0.4163265
##
## Parameters of node Species (multinomial distribution)
##
## Conditional probability table:
## Sagrei Distichus
## 0.400978 0.599022
Bayesian parameter estimation
##
## Bayesian network parameters
##
## Parameters of node Diameter (multinomial distribution)
##
## Conditional probability table:
##
## Species
## Diameter Sagrei Distichus
## narrow 0.7188450 0.5468432
## wide 0.2811550 0.4531568
##
## Parameters of node Height (multinomial distribution)
##
## Conditional probability table:
##
## Species
## Height Sagrei Distichus
## high 0.7370821 0.5835031
## low 0.2629179 0.4164969
##
## Parameters of node Species (multinomial distribution)
##
## Conditional probability table:
## Sagrei Distichus
## 0.4012195 0.5987805
You should notice that these results are a little different. Why? Because they are two different ways for estimation!
Bayesian Networks must satisfy the local Markov property
Nodes are conditionally independent of its non-descendents given its parents. For example, the probability of sprinklers going off given it is cloudy and raining, or \(P\)(\(Sprinkler\) | \(Cloudy\), \(Rain\)), is equal to the probability of sprinklers going off given it is cloudy, or \(P\)( \(Sprinkler\) | \(Cloudy\)). From this, we can conclude that rain and sprinkler are not related. This simplifies the joint distributions of the network. It also reduces computation time.
If two variables aren’t conditionally independent, it can’t become a Bayesian Network. There are many different ways to test for this. Today we’ll look at mutual information and mention d-separation.
Let’s test our lizard dataset for mutual information, which quantifies how much information you can get from one variable from observation of another variable. We can use the ci.test() function to test whether the variables height, diameter, and species are conditionally independent. In other words, given the species, are height and diameter then independent?
##
## Mutual Information (disc.)
##
## data: Height ~ Diameter | Species
## mi = 2.0256, df = 2, p-value = 0.3632
## alternative hypothesis: true value is greater than 0
This test indicates that height, diameter, and species are not conditionally independent. However, we can test for d-separation.
## [1] TRUE
We get a different result using the de-separation method. However, it’s very likely that diameter and height aren’t independent variables. It’s important to know your dataset before creating a Bayesian network. The more information you have, the better equiped you are to accurately model your data.
Model learning
Two common types of model learning are constraint-based and score-based. Constraint based, or inductive causation, is used to find conditionally dependent nodes and link them. It is more accurate than score-based for small sample sizes. Score-based is used to assign a score with some kind of heuristic and optimize the network. It is often faster than constraint based.
Here’s an example using a damselfish dataset:
Suppose you are a researcher interested in three-spot damselfish (Dascyllus trimcaculatus) settlement densities as they relate to available habitat (i.e. empty anemones) as well as survival within different densities. There are many parameters involved (site, pulse (survival), observation, and density), and you may want to now how they depend on each other.
## site pulse obs density
## Cdina : 60 Min. :1.0 Min. : 1.0 Min. : 0.00
## Hin : 60 1st Qu.:2.0 1st Qu.: 3.0 1st Qu.: 3.50
## Hout : 60 Median :3.5 Median : 5.5 Median : 10.70
## Lin : 60 Mean :3.5 Mean : 5.5 Mean : 24.19
## Maa : 60 3rd Qu.:5.0 3rd Qu.: 8.0 3rd Qu.: 25.30
## Maha : 60 Max. :6.0 Max. :10.0 Max. :463.40
## (Other):240
Now we need to make the data easier to work with! If you inspect your dataset, you’ll notice that some of the data are characters. All values must be numeric or integer for learning algorithms to learn.
We can find the structure using a score-based, or hill-climbing algorithm:
We can also find the structure using a constraint-based algorithm, or incremental association Markov blanket. Recall that these are better for smaller, discrete datasets.
## Warning in indep.test(nodes, x, sx = mb, test = test, data = data, B =
## B, : fixed correlation coefficient greater than 1, probably due to floating
## point errors.
## Warning in indep.test(nodes, x, sx = mb, test = test, data = data, B =
## B, : fixed correlation coefficient greater than 1, probably due to floating
## point errors.
## Warning in indep.test(nodes, x, sx = mb, test = test, data = data, B =
## B, : fixed correlation coefficient greater than 1, probably due to floating
## point errors.
## Warning in indep.test(nodes, x, sx = mb, test = test, data = data, B =
## B, : fixed correlation coefficient greater than 1, probably due to floating
## point errors.
(These warnings refer to rounding-type issues)
You get the same result! We can fit the Maximum Likelihood Estimate method to the Bayesian Network. Note: The Bayes method requires a discrete network (for now!)
##
## Bayesian network parameters
##
## Parameters of node site (multinomial distribution)
##
## Conditional probability table:
## Cdina Hin Hout Lin Maa Maha OctC Socc Temae Vai
## 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
##
## Parameters of node pulse (Gaussian distribution)
##
## Conditional density: pulse | density
## Coefficients:
## (Intercept) density
## 3.384717723 0.004765286
## Standard deviation of the residuals: 1.696244
##
## Parameters of node obs (Gaussian distribution)
##
## Conditional density: obs
## Coefficients:
## (Intercept)
## 5.5
## Standard deviation of the residuals: 2.874678
##
## Parameters of node density (conditional Gaussian distribution)
##
## Conditional density: density | site + obs
## Coefficients:
## 0 1 2 3 4
## (Intercept) -4.487778 -4.905556 -4.900000 -17.285556 -3.095556
## obs 2.297475 2.351313 3.555152 7.818889 2.689495
## 5 6 7 8 9
## (Intercept) 2.586667 3.906667 -10.051222 -13.429889 -1.262222
## obs 9.097576 4.869394 6.915980 12.988040 1.024949
## Standard deviation of the residuals:
## 0 1 2 3 4 5
## 8.344115 8.183630 14.721292 33.100873 16.009208 81.989866
## 6 7 8 9
## 32.061196 26.176080 72.886232 6.981829
## Discrete parents' configurations:
## site
## 0 Cdina
## 1 Hin
## 2 Hout
## 3 Lin
## 4 Maa
## 5 Maha
## 6 OctC
## 7 Socc
## 8 Temae
## 9 Vai
Let’s take a look at the myxomatosis dataset:
This dataset contains information about a viral titer in blood samples from European rabbits (Oryctolagus cuniculus). As a researcher, you may be interested in the conditional relationships between the variables grade, day, and titer.
## grade day titer
## Min. :1.000 Min. : 2.000 Min. :1.958
## 1st Qu.:3.000 1st Qu.: 4.000 1st Qu.:5.400
## Median :4.000 Median : 8.000 Median :6.612
## Mean :3.604 Mean : 9.564 Mean :6.331
## 3rd Qu.:5.000 3rd Qu.:13.000 3rd Qu.:7.489
## Max. :5.000 Max. :28.000 Max. :9.021
We can look at the structure using our two model learning types! First, let’s use the score-based method:
Does this make sense? What if we use the contraint-based method?
This seems to be a more reasonable representation of the relationships. Let’s visualize the differences:
Right now we are just using our own inference to decide on the best model, but we can use the function bnlearn::score() to score the two methods to try and determine the better model.
Score-based model:
## [1] -1028.866
Constraint-based model:
## [1] -1026.812
Since the constraint-based method has a better score, we can fit that relationship (that grade is a function of day and titer) to the Bayesian network.
##
## Bayesian network parameters
##
## Parameters of node grade (Gaussian distribution)
##
## Conditional density: grade | day + titer
## Coefficients:
## (Intercept) day titer
## 5.6132230 0.0516698 -0.3954108
## Standard deviation of the residuals: 1.226514
##
## Parameters of node day (Gaussian distribution)
##
## Conditional density: day
## Coefficients:
## (Intercept)
## 9.563758
## Standard deviation of the residuals: 6.661022
##
## Parameters of node titer (Gaussian distribution)
##
## Conditional density: titer
## Coefficients:
## (Intercept)
## 6.33102
## Standard deviation of the residuals: 1.516702
The Bayesian framework can now be used to predict future events. For example, what is we want to know the probabilty that the viral titer has a concentration of over 5, given it’s been less than three days of infection?
## [1] 0.8174204
This can be helpful for predicting events!
What about the other way? What’s the probability that the individual has been infected for less than two days, given the grade is more than 3?
## [1] 0.0987979
Wow! This children nodes can be used to predict the parent nodes!
Here’s one last example, using a seed dataset:
This dataset comes from a study looking at the disappearance of two seed species (Polyscias fulva and Pseudospondias microcarpa) in Kibale National Park in Uganda. 9 variables may not seem large, but it can be difficult to map all the conditional relationships between the variables on our own. Let’s use our two model learning algorithms instead!
Let’s start with the score-based method:
Wow, lot’s of edges! Do you notice any nodes that are mutually independent? Remember that if two variables aren’t conditionally independent, it can’t become a Bayesian Network.
Now let’s use the constraint-based method:
Wow, these are wildly different! By why? Because hill climbing is less sensitive to individual error! Let’s visualize what they did differently:
Which one do you think is better? Why? Here are the scores for the two model learning outputs:
Score-based model:
## [1] -113709.2
Constraint-based model:
## Error in bnlearn::score(prednet2, data = pred, type = "bic-cg"): the graph is only partially directed.
Whoops! The constraint-based method can’t be scored because there isn’t enough information about the relationships between the nodes. Partially directed networks can’t be scored.
Here is another way to evaluate the model learning techniques using K-fold cross-validation:
Score-based model:
##
## k-fold cross-validation for Bayesian networks
##
## target network structure:
## [station][dist][species|station][seeds|dist:species]
## [available|dist:seeds][taken|species:available]
## number of folds: 10
## loss function: Log-Likelihood Loss (cond. Gauss.)
## expected loss: 9.408804
Constraint-based model:
##
## k-fold cross-validation for Bayesian networks
##
## target network structure:
## [dist][species][seeds][taken][available][station|species]
## number of folds: 10
## loss function: Log-Likelihood Loss (cond. Gauss.)
## expected loss: 10.88629
This technique looks at the expected loss of each model, rather than score it using the Bayesian Information Criterion (BIC).
Now we can fit the Bayesian network to our score-based model.
What is the probability that more than one seed was taken given a certain tree species.
## [1] 0.07206208
What if we have prior info about a dataset we want to include?
Let’s say we know species influences the number of seeds present….
It updates the network!
Real World Example
This is an example based on a real U.S. Fish and Wildlife Service Bayesian network. It was used to estimate polar bear presence in relation to disease prevalence and habitat quality. This network was entirely based on expert opinion.
First, the researchers set probabilties for disease risk between two categories: low and high.
## LOW HIGH
## [1,] 0.4 0.6
Then, the researchers set probabilities for habitat quality beween two categories: good and bad.
## GOOD BAD
## [1,] 0.8 0.2
Using these probabilities, the researchers were able to create a table of conditional probabilities relating to presence or absence of polar bears (Ursus maritimus) based on disease risk and habitat quality. In other words, given the abitat quality and disease risk, what is the probability that an animal is present at that site?
## , , Habitat_Quality = GOOD
##
## Disease
## Presence LOW HIGH
## TRUE 0.5 0.4
## FALSE 0.5 0.6
##
## , , Habitat_Quality = BAD
##
## Disease
## Presence LOW HIGH
## TRUE 0.3 0.2
## FALSE 0.7 0.8
Using this table of conditional probabilities, we can fit this to a Bayesian network:
##
## Bayesian network parameters
##
## Parameters of node Disease (multinomial distribution)
##
## Conditional probability table:
##
## LOW HIGH
## 0.4 0.6
##
## Parameters of node Habitat_Quality (multinomial distribution)
##
## Conditional probability table:
##
## GOOD BAD
## 0.8 0.2
##
## Parameters of node Presence (multinomial distribution)
##
## Conditional probability table:
##
## , , Habitat_Quality = GOOD
##
## Disease
## Presence LOW HIGH
## TRUE 0.5 0.4
## FALSE 0.5 0.6
##
## , , Habitat_Quality = BAD
##
## Disease
## Presence LOW HIGH
## TRUE 0.3 0.2
## FALSE 0.7 0.8
And finally, we can visualize the Bayesian network used to describe the conditional probability between nodes.
Here is the table that was actually used to inform management decisions regarding the IUCN Red List status for polar bears.
Using the expert-driven information, we can now ask ourselves: how likely is it that polar bears are present, given it’s a poor quality habitat?
## [1] 0.2548828
In Conclusion
Bayesian networks are closely tied to Bayes Theorem of conditional probability. They describe the conditional probability, or edges, between variables, or nodes, which can be useful when diagnosing or predicting events.
It’s important to note that without enough information, Bayesian networks often rely on expert opinion and can be misguided. Recall the lizard example from earlier. We hypothesize that both height and diameter would be conditionally independent on species. With further data exploration, however, we found that they aren’t independent variables. Bayesian networks can be a great tool for visualizing and describing the relationships between variables, but it’s use must be cautioned when dealing with small or incomplete dataset.
Resources
Aguilera, P. A., et al. “Bayesian networks in environmental modelling.” Environmental Modelling & Software 26.12 (2011): 1376-1388.
Amstrup, Steven C., Bruce G. Marcot, and David C. Douglas. “A Bayesian network modeling approach to forecasting the 21st century worldwide status of polar bears.” Arctic sea ice decline: observations, projections, mechanisms, and implications 180 (2008): 213-68.
Charniak, Eugene. “Bayesian networks without tears.” AI magazine 12.4 (1991): 50-50.
McCann, Robert K., Bruce G. Marcot, and Rick Ellis. “Bayesian belief networks: applications in ecology and natural resource management.” Canadian Journal of Forest Research 36.12 (2006): 3053-3062.
Milns, Isobel, Colin M. Beale, and V. Anne Smith. “Revealing ecological networks using Bayesian network inference algorithms.” Ecology 91.7 (2010): 1892-1899.
https://towardsdatascience.com/introduction-to-bayesian-networks-81031eeed94e
Challenge Questions
See line 104 in the attached R script. What does the closeness of the expected loss imply?
For the damsel fish example, why are the networks created by the hill-climbing and incremental association Markov blanket algorithms the same? (hint: compare hc() and iamb() using the dsep() and ci.test() functions!)
Why is dsep(damfit, ‘site’, ‘obs’, ‘density’) false? (hint: check out ‘Bayesian Networks without Tears’!)
For the seed predation, we had to remove variables associated with date to create the network. What are some way you could still include date? (i.e. is there a way to show dates outside the ‘Date’ class in R?)