Predicting Habitability of New England for Invasive Japanese Barberry

This guy... is an invasive species! The Japanese Barberry (Berberis thunbergii) is a flowering plant native to Japan and parts of eastern Asia that is cultivated as an ornamental. The plant has found its way into the New England region of the USA and poses a threat to the environment by outcompeting indigenous plants, altering the pH and nitrogen levels of the soil, and providing a habitat in which parasitic ticks thrive.

This mini-project was an assignment in a machine learning course I took at the University of New Hampshire, taught by Dr. Marek Petrik. The problem, which is still ongoing research, explores methods of predictive modeling to anticipate the spread of the Japanese Barberry throughout the New England region. What makes the assignment more fun is the type of data we were provided: presence-only data. This means that each point represents a location where the plant has been confirmed to grow, but the absence of points does not necessarily mean the plant is not there. That is, the data is assumed to have a sampling bias; recordings are based on where people tend to hike.

Of course, Marek did not give us all of the real data set. Instead, we were given two different bootstrapped data sets generated using a Poisson distribution: one sampled uniformly and one weighted by population. Also provided was a *very* small subset of the original data set containing the reported population in those locations, including "absences". This data was provided as a way of calibrating our modeling technique. Lastly, there is the landscape data set, which contains the cells that make up the whole New England region, each with the recorded temperature and precipitation values that were used for modeling.

Here's an example of the calibration data set. Each cell is given a unique ID based on its latitude and longitude, and also contains the temperature, precipitation and the number of berries found in that area. I would like to point out the "absence" data in the form of a zero population count; it would be very difficult to say definitively that there are no berries in one of these cells. This data set was provided to calibrate the modeling technique, but real-world data would be unlikely to have this.

First we can plot the New England region, colored by the normalized temperature and precipitation values and include the presence data we have available:

Here are two plots of the New England region displayed by their two ecological factors, precipitation and temperature (normalized). Overlaid are the two sets of presence data based on population (left) and uniform (right) sampling.

It can be seen that the data does have some patterns of aggregation, with points concentrated in Massachusetts but some observations extending towards northern Maine. Below we look at a correlation table of the calibration data to see the relationship between the features we have and the reported populations:

Here we see that the reported population is related to temperature, precipitation (derp) and also latitude. There is a similar correlation value for the id feature, which comes from how each cell is assigned its ID (e.g. north to south).
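For anyone who wants to build a table like this, here is a minimal sketch using NumPy. The rows are numbers I made up purely for illustration (they are not the course data), and the column names are just my guess at a reasonable layout.

```python
import numpy as np

# Hypothetical calibration rows: (latitude, temp, precip, population).
# These values are invented for illustration; they are not the real data.
calibration = np.array([
    [42.1, 0.80, 0.55, 12],
    [42.9, 0.71, 0.60, 7],
    [43.8, 0.62, 0.52, 3],
    [44.6, 0.50, 0.48, 1],
    [45.9, 0.35, 0.40, 0],
    [47.0, 0.20, 0.45, 0],
])
names = ["lat", "temp", "precip", "population"]

# rowvar=False treats each column as a variable, giving a 4x4 Pearson matrix.
corr = np.corrcoef(calibration, rowvar=False)

for name, row in zip(names, corr):
    print(f"{name:>10}", np.round(row, 2))
```

In this toy table latitude and population move in opposite directions, so their coefficient comes out negative, while temperature and population move together.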

So, now we can look at the presence data that we have. Below is a summary of the ecological features. The top table shows the distribution of the `temp` and `precip` features from the presence data, and the bottom table shows the distribution of those same features in the landscape data. We can see that the range of values for the presence data is narrower than that of the landscape data. A better way of visualizing this is to plot them all together on a scatter plot.

Below we have all of the `uniform` and `population` data in red and blue, respectively, along with the `landscape` data in black. The axes represent temperature and precipitation, with lines representing the range of values for presence data:

The plot shows that the presence data tends to aggregate in a certain area of the feature space, which we can think of as the 'Goldilocks zone'. Much of the landscape data lies within this zone, but it also extends outwards, past the range of values in which presence data has been observed. This does not imply that the species *cannot* grow in cells with these ecological conditions; rather, it has just not been observed there, which could be for multiple reasons, most notably sampling bias. We make the assumption that we have presence data for particular areas because that is where the sampling occurred!
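As a rough sketch of the range comparison, the snippet below takes the min/max of each feature in the presence data as the edges of the zone and counts how many landscape cells fall inside that box. The arrays are small made-up examples, with columns assumed to be (`temp`, `precip`).

```python
import numpy as np

# Hypothetical (temp, precip) values; invented for illustration only.
presence = np.array([[0.55, 0.45], [0.60, 0.50], [0.70, 0.58], [0.65, 0.40]])
landscape = np.array([
    [0.10, 0.30], [0.58, 0.47], [0.66, 0.55],
    [0.90, 0.80], [0.62, 0.42], [0.30, 0.95],
])

# The edges of the "Goldilocks zone" box are the per-feature min and max.
low = presence.min(axis=0)
high = presence.max(axis=0)

# A landscape cell is inside the box only if both features are within range.
inside = np.all((landscape >= low) & (landscape <= high), axis=1)

print("zone:", low, high)
print("landscape cells inside the zone:", int(inside.sum()), "of", len(landscape))
```

A box like this is only a crude summary; it says nothing about how densely the presence points sit within it.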

With this in mind, we can look at creating a probability density function (PDF) from all of the presence data, based on the two ecological features. This will allow us to determine the probability that any of the landscape cells has ecological values that belong to the distribution defined by the Goldilocks zone.
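One common way to get such a PDF is a Gaussian kernel density estimate. The sketch below is a hand-rolled version in NumPy, not necessarily the estimator used for the figures here; the function name `gaussian_kde_2d`, the bandwidth, and the synthetic presence points are all assumptions for illustration.

```python
import numpy as np

def gaussian_kde_2d(samples, bandwidth):
    """Build a density function from (n, 2) samples using an isotropic Gaussian kernel."""
    samples = np.asarray(samples, dtype=float)
    n = samples.shape[0]
    # Normalising constant of a 2-D Gaussian with covariance bandwidth^2 * I,
    # divided by n so the estimate integrates to one.
    norm = 1.0 / (n * 2.0 * np.pi * bandwidth ** 2)

    def density(points):
        points = np.atleast_2d(np.asarray(points, dtype=float))
        # Squared distance between every query point and every sample.
        diff = points[:, None, :] - samples[None, :, :]
        sq_dist = np.sum(diff ** 2, axis=-1)
        return norm * np.exp(-0.5 * sq_dist / bandwidth ** 2).sum(axis=1)

    return density

# Synthetic presence points clustered around (temp, precip) = (0.6, 0.5).
rng = np.random.default_rng(0)
presence = rng.normal(loc=[0.6, 0.5], scale=0.05, size=(200, 2))

density = gaussian_kde_2d(presence, bandwidth=0.05)
inside = density([[0.6, 0.5]])[0]    # middle of the cluster
outside = density([[0.1, 0.9]])[0]   # far from any presence point
print(inside, outside)
```

Note that the values returned are densities rather than probabilities, so they can exceed one; what matters for the next step is how they compare to a threshold.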

The first figure shows the PDF of the presence data in 2 dimensions, with the limits of the plot being the range of temperature and precipitation values in the `landscape` data. Notice that, of all of the available feature space, the presence data tends to aggregate in the Goldilocks zone. The second plot displays the PDF in 3 dimensions, where the z-axis represents the density of points within that area of feature space.

Now that we have our (awesome looking) PDF, we need to decide which threshold probability value to use when labeling any given landscape cell as presence or absence. To do this, we can use the calibration data. By computing the classification accuracy at each candidate threshold (in increments of `.001`), we can pick the one that gives the highest accuracy.

The plot shows that the highest classification accuracy results when the threshold value is set to about `.021`, which is between the 1st quartile and the median of the probability values. With this, we have a training classification accuracy of `88%`.
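The threshold search can be sketched in a few lines. This is a minimal version under my own assumptions: `scores` stands for the PDF value of each calibration cell, `present` marks cells with a population greater than zero, and both arrays below are invented for illustration.

```python
import numpy as np

def best_threshold(scores, present, step=0.001):
    """Sweep thresholds from 0 up to the largest score and return the most accurate one."""
    scores = np.asarray(scores, dtype=float)
    present = np.asarray(present, dtype=bool)
    thresholds = np.arange(0.0, scores.max() + step, step)
    # A cell is predicted "present" when its score exceeds the threshold.
    accuracies = np.array([np.mean((scores > t) == present) for t in thresholds])
    best = int(np.argmax(accuracies))  # first threshold reaching the top accuracy
    return thresholds[best], accuracies[best]

# Hypothetical calibration scores and labels; not the course data.
scores = np.array([0.005, 0.010, 0.018, 0.025, 0.030, 0.040, 0.060, 0.015])
present = np.array([False, False, False, True, True, True, True, True])

threshold, accuracy = best_threshold(scores, present)
print(f"threshold={threshold:.3f}, accuracy={accuracy:.3f}")
```

When several thresholds tie for the best accuracy, `argmax` keeps the lowest one; picking the middle of the tied range would be an equally reasonable choice.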

With the calibrated PDF, we can make predictions for all of the landscape cells we have. For each landscape cell, we determine the probability that it belongs to our PDF (i.e. the Goldilocks zone). If that probability is higher than our optimum threshold value, we predict that the cell is likely to harbor the plant.

With those predictions, we can display the New England region (grey background), along with the population (blue), uniform (red), calibration (yellow) and the predictions of where we would expect to see the species of plant (black). Note that only the calibration data that has a population of greater than zero is shown.

Looking at the plot we can see there are some known points predicted by the model along the northern border of Maine, and overall it looks okay.

If I were to come back to this project and make corrections, there are two things I would do. First, I would look at using cross-validation to find the threshold value that reduces the overall variance, because currently it is chosen on the training data, which likely makes the model overfit. Second, I would look at the ecological factors where this plant grows natively in Japan to help generate a more robust PDF.
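Here is a rough sketch of what the cross-validation idea could look like: pick the threshold on the training folds only, then score it on the held-out fold. This is not something I ran for the project; the helper names, the fold count, and the synthetic scores are all assumptions for illustration.

```python
import numpy as np

def pick_threshold(scores, present, grid):
    """Return the threshold from grid with the highest accuracy on the given cells."""
    accuracies = [np.mean((scores > t) == present) for t in grid]
    return grid[int(np.argmax(accuracies))]

def cross_validated_accuracy(scores, present, k=5, step=0.001, seed=0):
    """Average held-out accuracy when the threshold is chosen on the other k-1 folds."""
    scores = np.asarray(scores, dtype=float)
    present = np.asarray(present, dtype=bool)
    grid = np.arange(0.0, scores.max() + step, step)
    # Shuffle the cell indices once, then split them into k folds.
    order = np.random.default_rng(seed).permutation(len(scores))
    folds = np.array_split(order, k)
    held_out = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        t = pick_threshold(scores[train_idx], present[train_idx], grid)
        held_out.append(np.mean((scores[test_idx] > t) == present[test_idx]))
    return float(np.mean(held_out))

# Synthetic calibration set: occupied cells tend to have higher PDF values.
rng = np.random.default_rng(1)
present = rng.random(100) < 0.5
scores = np.where(present,
                  rng.normal(0.04, 0.01, 100),
                  rng.normal(0.01, 0.01, 100)).clip(min=0)

cv_accuracy = cross_validated_accuracy(scores, present)
print(f"cross-validated accuracy: {cv_accuracy:.3f}")
```

The held-out number is usually a bit lower than the training accuracy, and that gap is a more honest picture of how the threshold would do on new cells.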

There was some talk of generating an inverse PDF, which would allow us to bootstrap absence data, which is a pretty interesting idea. However, I have an issue with this, as it makes the rather large assumption that absence data exists within the range of ecological values of the landscape data. We cannot know that for certain because it would require knowing whether or not a cell contained no berries, which is just not practical or perhaps even possible.

Until next time!