
Underwater Currency Counter: Kick-Ass AI to Count Sand Dollars in Benthic Habitat Imagery 

In the Fall of 2019 I was invited to participate in OceanHackWeek in Seattle, Washington. HackWeeks are similar to hackathons but last an entire week (surprise!) and include tutorials each morning that introduce tools researchers may find useful for their own projects. This HackWeek's central theme was oceanography and all of the sub-fields it encompasses!

For my group's project, we elected to tackle an object recognition challenge using images of benthic habitats obtained with a benthic sled. Our goal was to implement a workflow that used deep learning object detection to predict the number of sand dollars in each image. To keep the objective achievable, we used a subset of the images (n=300) and only looked at the class that appeared most often (i.e., sand dollars). We were very fortunate to have data that was already annotated in PASCAL VOC format, allowing us to quickly jump into creating a workflow. At first we considered TensorFlow as our framework, but quickly realized that with the time we had, we needed to work quick and dirty. We found a few tutorials online and ultimately decided to use Matterport's Mask R-CNN implementation, which is built on Keras, and started from there.
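For anyone curious what that looks like in practice, here is a minimal sketch of a single-class Matterport Mask R-CNN training setup, assuming their mrcnn package; the config values, file paths, and dataset objects are illustrative placeholders rather than our exact code:

```python
# Rough sketch of a Matterport Mask R-CNN training run for a single class.
# Paths, config values, and the dataset objects are placeholders.
from mrcnn.config import Config
from mrcnn import model as modellib


class SandDollarConfig(Config):
    NAME = "sand_dollar"
    NUM_CLASSES = 1 + 1        # background + sand dollar
    IMAGES_PER_GPU = 1
    STEPS_PER_EPOCH = 100


config = SandDollarConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs/")

# Start from COCO weights, skipping the layers whose shapes depend on NUM_CLASSES.
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# dataset_train / dataset_val would be mrcnn.utils.Dataset subclasses that
# read the PASCAL VOC annotations (their definitions are not shown here).
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=5, layers="heads")
```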

Our short analysis looked at how image size, the number of training images, and the number of epochs affected the accuracy of the model. Overall, our group was extremely satisfied with our results and with what we gained from working together during the week. I would highly encourage anyone who already has some proficiency in a programming language to participate in a hackathon or a HackWeek.

 

Here is the GitHub repo we created that stores the IPython notebook with instructions to replicate our results (remember to use a GPU 😅), and below are the final .ppt with our results and the actual video presentation (always fun hearing your recorded voice).

Update - Spring 2020

After working on detecting anomalies in back-scatter mosaics and seeing the impressive results from a U-Net architecture, I thought I would come back to this project and add a little bit to it. 

At OceanHackWeek my group had very little computational power and had to rely on a single CPU to train our deep learning model 😅. Because of that, we could only use a small subset of what was available. Now that I have more resources and time (thanks, Corona), I wanted to see how well detecting sand dollars in the HabCam imagery would work if approached as a semantic segmentation problem.

With access to over ten thousand images and their corresponding PASCAL VOC annotation files, I went through each one, found the annotations belonging to the sand dollar class, and used their bounding box coordinates to fill in a mask for the image. So anywhere there was a bounding box in the image now has a filled region in the corresponding mask, while the rest of the mask is left blank, denoting a catch-all 'background' class.
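Here is a minimal sketch of that box-to-mask conversion, assuming standard PASCAL VOC XML files; the directory names and the 'sand dollar' label string are placeholders for whatever the actual dataset uses:

```python
# Build a binary mask per image by filling every sand dollar bounding box.
# Directory names and the class label are placeholders.
import os
import xml.etree.ElementTree as ET

import numpy as np
from PIL import Image


def voc_boxes_to_mask(xml_path, target_class="sand dollar"):
    """Return a uint8 mask with the target class's bounding boxes filled in."""
    root = ET.parse(xml_path).getroot()
    size = root.find("size")
    height = int(size.find("height").text)
    width = int(size.find("width").text)

    mask = np.zeros((height, width), dtype=np.uint8)
    for obj in root.iter("object"):
        if obj.find("name").text != target_class:
            continue
        box = obj.find("bndbox")
        xmin = int(float(box.find("xmin").text))
        ymin = int(float(box.find("ymin").text))
        xmax = int(float(box.find("xmax").text))
        ymax = int(float(box.find("ymax").text))
        mask[ymin:ymax, xmin:xmax] = 255   # filled box = sand dollar

    return mask  # everything left at 0 acts as the 'background' class


# Write a mask PNG next to each annotation file.
for fname in os.listdir("annotations"):
    if fname.endswith(".xml"):
        mask = voc_boxes_to_mask(os.path.join("annotations", fname))
        Image.fromarray(mask).save(os.path.join("masks", fname.replace(".xml", ".png")))
```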

Using a U-Net implementation, I trained a model with an EfficientNet-b0 as the encoder and frozen, pre-trained ImageNet weights (a rough sketch of this setup appears further below). Below are some figures showing the images and their bounding boxes on the left, and on the right the predicted masks, where white denotes where a sand dollar is thought to be; I added the bounding boxes to the masks to help show whether the predictions were right or not:

As you can see, the model does a pretty good job of, one, identifying where there are sand dollars and, two, actually creating masks for them, despite the fact that it was trained on bounding boxes. It's not perfect, of course; below are some graphs displaying the loss and some metrics throughout the training process:

*Long and emphasized whistle*

Look at that over-fitting! Across 15 epochs the model only improved up until the third, which is denoted by the blue 'x'. The recall and precision metrics show something similar, but despite the obvious over-fitting, note where the values are. Overall, not bad. I think with a little more time tinkering with different encoders and some other hyper-parameters, we could reduce that gap.
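For anyone who wants to do that tinkering themselves, here is a rough sketch of the kind of U-Net setup described above, assuming the segmentation_models (Keras) library; the loss, batch size, and the training arrays are illustrative placeholders, not the exact values from my runs:

```python
# Rough sketch of a U-Net with a frozen, ImageNet-pre-trained EfficientNet-b0 encoder.
# Hyper-parameters and the training arrays are placeholders.
import segmentation_models as sm

BACKBONE = "efficientnetb0"   # swap in e.g. "resnet34" to try a different encoder

model = sm.Unet(
    BACKBONE,
    encoder_weights="imagenet",   # pre-trained ImageNet weights...
    encoder_freeze=True,          # ...kept frozen during training
    classes=1,                    # sand dollar vs. background
    activation="sigmoid",
)

model.compile(
    optimizer="adam",
    loss=sm.losses.bce_jaccard_loss,
    metrics=[sm.metrics.iou_score, sm.metrics.precision, sm.metrics.recall],
)

# X_train / y_train would be the HabCam images and the box-derived masks from
# the step above (arrays shaped [N, H, W, 3] and [N, H, W, 1]).
model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=8,
    epochs=15,
)
```

Swapping the BACKBONE string is all it takes to experiment with a different encoder, which is exactly the kind of tinkering I had in mind.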

Until next time!
