Detecting "Anomalies" in Backscatter Mosaics

Preface

For a final project in a computer vision course I took during my last full semester at UNH, I looked at using different deep learning algorithms to detect what some would call "anomalies" in backscatter mosaics. Previously, I worked with a group at Seattle's OceanHackWeek and used object detection algorithms to detect sand dollars in benthic sled imagery, so I figured this couldn't be too difficult (and would probably be more useful in real-world applications too). The project idea was initially suggested by a start-up called TerraDepth and very much encouraged by my advisors at the Center for Coastal and Ocean Mapping.

This project is just meant to demonstrate the usefulness of deep learning and how it could be incorporated into a data processing pipeline, or maybe even onboard a remote system to assist in making decisions in real time. It could easily be expanded on and scaled up to work with more data (in fact, it would definitely benefit from it), but in its current form, it serves as just a simple demonstration.

TL;DR

This project looks at using different computer vision algorithms to classify anomalies found in backscatter mosaic imagery. I attempt to solve this problem by looking at it as three different tasks: image classification, object detection and semantic segmentation. From the results, it appears the CNN trained as an image classifier does by far the best at differentiating between the two classes, but in its current state, it lacks the ability to localize the anomaly within the image. The Mask-RCNN, which adds a region proposal system for localization, performs very poorly at identifying all of the anomalies within the image. Of the predictions it does make, it appears to be correct some of the time, but the only metric considered indicates otherwise. When approached as a semantic segmentation task, the deep learning model performs much better than the Mask-RCNN at identifying regions within each image that are likely to contain anomalies. However, without real ground-truth segmentation maps for comparison, no quantitative results could be compiled.

Because the CNN was capable of performing with such high classification accuracy, this project is successfully able to answer the question of whether or not an anomaly can be distinguished within backscatter mosaics. The next step would be to identify the best method for localization, whether it be through a region proposal system, an encoder-decoder system, or perhaps something else entirely.

The project was merely meant to serve as an exercise in how computer vision tasks, specifically those that rely on deep learning, could be used to tackle a problem that exists within the field of ocean mapping. As any deep learning practitioner would attest, the number of images, the quality of annotations and the time spent fine-tuning hyper-parameters are fundamental components of any project. The dataset provided is quite small (even as a toy dataset), annotations were done manually in a passive style and very little time was devoted to hyper-parameter tuning. However, overall this project does present techniques that some researchers might find useful, whether they work within this domain or another. The ocean is a massive and mostly unexplored part of our planet; without adopting the proper tools, it is likely to stay that way.

Introduction:

This project looks at using various computer vision algorithms to detect anomalies in seafloor backscatter data represented as imagery data.

During deep ocean exploration projects, the seafloor is scanned using a wide spectrum of acoustic waves to measure the depth and to try to understand the composition of the seafloor at any given point. Acoustic waves propagate from a transducer that is typically located at the surface, downwards through the water column until they reach the seafloor or any other surface. These waves are reflected, absorbed, or transmitted depending on the density of the object(s) they collide with; relatively dense materials and objects made of metal produce a more intense reflected signal than, say, sand, mud or clay.

After the signal is reflected, it travels back to the surface where it is recorded by a receiver and converted into a digital signal. With proper calibration, return signals or ‘backscatter’ can be represented as a grayscale image, where the pixel intensity signifies the reflectance of the materials at that location on the seafloor (see figure 2).

Besides mapping, seafloor surveys are often interested in finding man-made objects or geological features, but due to the sheer size of the ocean, this is akin to trying to find a very small needle in a very large haystack. Therefore, there is a specific need for an efficient way to tag locations within a survey site that are considered regions of interest and deserve additional attention. This project investigates the potential of detecting anomalies in a seafloor backscatter mosaic as three different computer vision tasks:

Image Classification – provides a single label for each image
Object Detection – provides localized regions containing anomalies for each image
Semantic Segmentation – provides a pixel-level classification for each anomaly in an image

The first section of this report describes the dataset I used for this project and the process of organizing it into the format required for each of the computer vision tasks. The second section briefly describes each of the computer vision tasks, the algorithms used to perform them, and the experimental setup. The third section shows the results for each computer vision task, accompanied by a short discussion of each method including its pros and cons.

All work was done using Python 3.5, with Keras (TensorFlow backend) as the deep learning framework. The deep learning model APIs came from Keras, Matterport's implementation of Mask-RCNN, and Qubvel's awesome repo for segmentation models. I also used Scikit-Image, Scikit-Learn, OpenCV, NumPy and Pandas to assist with feature extraction, machine learning, image processing and data manipulation.

1 – Data Wrangling

Members of the National Oceanic and Atmospheric Administration (NOAA) provided the dataset used in this project, which consists of an image mosaic representing the backscatter intensity obtained while performing a survey of site H12642. The dataset also contains high-resolution image patches that represent the anomalies in detail, along with their locations stored in individual text files.

Figure 2 – The backscatter mosaic of site H12642 with an inset showing an ‘anomaly’, which appears to be some geomorphological feature. Note that regions in black were not surveyed and are not included as part of the dataset. The entire mosaic itself is a 330-megapixel image that represents an area of approximately 45 million square meters.

Unfortunately, to perform anomaly detection there must be samples representing both positive and negative classes. Because there are only high-resolution positive samples but no high-resolution negative samples, the image patches were not used in this project. I did consider sampling the mosaic to obtain negative samples, but their resolution would be significantly lower than that of the positive samples, potentially creating a bias.

Instead I chose to create annotations for each anomaly found in the mosaic itself, so that I could have positive and negative samples with the same resolution. Using the location information stored in the text files I was able to highlight the regions in the mosaic that contained anomalies, but because the resolution was significantly lower, some anomalies were not identifiable because they were composed of too few pixels (see figure 3). This limited the number of positive samples drastically and led me to reconsider the definition of an anomaly for this project.

Figure 3 – A side-by-side comparison of the high-resolution patch representing a known anomaly within the site and the same anomaly as seen within the mosaic. The difference in resolution exists because the mosaic covers a large area and was downsampled to reduce the amount of memory required; without the original data source, a higher resolution mosaic cannot be reproduced. Note that not all anomalies are this large; this one is only visible in the mosaic because of its size.

An anomaly is defined based on the context of the situation: man-made objects are almost always considered anomalies, but not all geological features are, so the definition depends on the site being surveyed. For this site, I re-defined an anomaly to be any man-made object or geological feature that is discernible by an annotator given the resolution of the mosaic.

The mosaic was broken into 200 equal-sized tiles with dimensions of 512 x 512 pixels with no overlap between tiles, and any tile that consisted entirely of zeros (e.g. outside of the survey area) was deleted. To label the location of anomalies in each tile I used LabelImg, a popular open-source annotation tool with a graphical user interface. Using LabelImg, I created bounding box annotations around every anomaly present within an image, saving all of their locations in a single corresponding .xml file in PASCAL VOC format. Note that it was important to annotate every anomaly because these annotations served as the basis for how I would obtain the appropriately formatted data for each of the computer vision tasks.
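The tiling step can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the function name `tile_mosaic` is hypothetical, the mosaic is assumed to be a 2D NumPy array, and partial tiles at the right and bottom edges are simply dropped (an assumption, since the handling of edges is not described above).

```python
import numpy as np

def tile_mosaic(mosaic, tile_size=512):
    """Split a 2D mosaic into non-overlapping square tiles, skipping empty ones."""
    tiles = []
    rows, cols = mosaic.shape
    # Step through the mosaic in strides of tile_size (no overlap between tiles);
    # assumption: partial tiles at the right/bottom edges are ignored.
    for top in range(0, rows - tile_size + 1, tile_size):
        for left in range(0, cols - tile_size + 1, tile_size):
            tile = mosaic[top:top + tile_size, left:left + tile_size]
            if tile.any():  # keep only tiles with at least one non-zero pixel
                tiles.append(((top, left), tile))
    return tiles

# Toy stand-in for the mosaic: mostly zeros (not surveyed) with two surveyed areas.
demo = np.zeros((1024, 1536), dtype=np.uint8)
demo[:512, :512] = 128
demo[600:700, 1100:1200] = 200

tiles = tile_mosaic(demo)
origins = [origin for origin, _ in tiles]
print(origins)
```

Keeping the (top, left) origin of each tile makes it possible to map an annotation back to its position in the full mosaic later on.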

Figure 4 – An image of LabelImg in use. This popular and open-source annotation tool allows users to provide bounding box annotations for each image. For PASCAL VOC format, every anomaly’s location is stored in a single .xml with the same filename as its corresponding image file. This format of annotation is for object detection but can be leveraged to provide labels for other computer vision tasks to save time. Time spent annotating this dataset was approximately 5 hours, or roughly 14 episodes of Parks and Rec.

For image classification each sample is an image patch and its corresponding label, in this case either ‘anomaly’ or ‘normal’. By looking at the dimensions of the bounding box annotations across all tiles I found that the vast majority of them were less than 60 x 60 pixels (see figure 5). To create positive samples, I extracted a 60 x 60 pixel patch centered on the center pixel of each bounding box annotation, essentially cropping out all of the anomalies from each tile. From each of these cropped-out tiles, I then randomly sampled 60 x 60 pixel patches to serve as the negative samples. To ensure quality negative samples, any patch that had more than ~5% zeros (which represented the non-surveyed regions of the mosaic or places where a positive sample had already been cropped) was discarded. Doing this allowed me to collect 3,800 positive and negative samples for image classification.
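A rough sketch of this patch extraction is shown below. The function names, the clamping of patches at tile borders, the retry limit and the random seed are all assumptions made for illustration; only the 60 x 60 patch size and the ~5% zero threshold come from the description above.

```python
import numpy as np

PATCH = 60  # patch size in pixels, taken from the text

def crop_positive(tile, box, size=PATCH):
    """Crop a size x size patch centered on a (xmin, ymin, xmax, ymax) box."""
    xmin, ymin, xmax, ymax = box
    cx, cy = (xmin + xmax) // 2, (ymin + ymax) // 2
    half = size // 2
    # Assumption: patches near a border are shifted to stay inside the tile.
    left = min(max(cx - half, 0), tile.shape[1] - size)
    top = min(max(cy - half, 0), tile.shape[0] - size)
    return tile[top:top + size, left:left + size].copy(), (top, left)

def sample_negatives(tile, boxes, n, size=PATCH, max_zero_frac=0.05,
                     seed=0, max_tries=10000):
    """Randomly sample patches that contain at most max_zero_frac zeros."""
    rng = np.random.default_rng(seed)
    work = tile.copy()
    for box in boxes:
        _, (top, left) = crop_positive(tile, box, size)
        work[top:top + size, left:left + size] = 0  # "crop out" each positive
    negatives, tries = [], 0
    while len(negatives) < n and tries < max_tries:
        tries += 1
        top = rng.integers(0, work.shape[0] - size + 1)
        left = rng.integers(0, work.shape[1] - size + 1)
        patch = work[top:top + size, left:left + size]
        if np.mean(patch == 0) <= max_zero_frac:
            negatives.append(patch.copy())
    return negatives

# Toy tile with no zero pixels and a single made-up annotation.
tile = np.random.default_rng(1).integers(1, 255, size=(512, 512), dtype=np.uint8)
boxes = [(100, 100, 140, 130)]
positive, origin = crop_positive(tile, boxes[0])
negatives = sample_negatives(tile, boxes, n=5)
```

Zeroing out the positive regions before sampling means the same zero-fraction check rejects both unsurveyed areas and patches that overlap an anomaly.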

Figure 5 – A plot showing the distribution of bounding box annotation dimensions. More than 98% of annotations are less than 60 x 60 pixels, making it an ideal size for patches to be extracted from the mosaic.

For object detection, the annotations in the PASCAL VOC format did not need to be altered in any way. During the training process, positive samples are provided through the annotations, and negative samples are created on-the-fly by sampling regions outside of the bounding box annotations. Thus it is important that all anomalies are labelled; otherwise the negative samples created might accidentally contain unlabeled positive samples.
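For readers who have not seen the PASCAL VOC format, here is a small sketch of reading bounding boxes from one annotation file using only the standard library. The XML below is a made-up, trimmed-down example (the filename and coordinates are hypothetical, and real LabelImg files carry a few more fields), and `read_voc_boxes` is an illustrative helper rather than part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified annotation for one 512 x 512 tile.
VOC_EXAMPLE = """<annotation>
  <filename>tile_042.png</filename>
  <size><width>512</width><height>512</height><depth>1</depth></size>
  <object>
    <name>anomaly</name>
    <bndbox><xmin>100</xmin><ymin>120</ymin><xmax>140</xmax><ymax>150</ymax></bndbox>
  </object>
  <object>
    <name>anomaly</name>
    <bndbox><xmin>300</xmin><ymin>310</ymin><xmax>330</xmax><ymax>350</ymax></bndbox>
  </object>
</annotation>"""

def read_voc_boxes(xml_text):
    """Return a list of (label, (xmin, ymin, xmax, ymax)) from VOC-style XML."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        coords = tuple(int(bndbox.find(key).text)
                       for key in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((obj.find("name").text, coords))
    return boxes

boxes = read_voc_boxes(VOC_EXAMPLE)
print(boxes)
```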

For semantic segmentation, the bounding box annotations were used to make masks, or ‘pixel-level’ labels. Typically, in semantic segmentation, pixel-level labels are required for the entire image, where the value of each pixel in the mask represents the class category present at the corresponding pixel in the image. A lot of time is spent providing these pixel-level labels for datasets, but a number of studies have shown that even with less than perfect annotations, deep learning algorithms can still learn the mappings between pixels and class labels. This saves effort on the annotator’s part while still providing a similar outcome. For this project I chose to use the bounding box annotations as masks for the positive samples, with all other pixels serving as the mask for negative samples (see figure 6).
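Turning bounding boxes into a coarse mask takes only a few lines. The sketch below assumes boxes are given as (xmin, ymin, xmax, ymax) tuples in pixel coordinates and, as a simplification, treats xmax and ymax as exclusive; `boxes_to_mask` is a hypothetical helper name and the boxes are made up.

```python
import numpy as np

def boxes_to_mask(shape, boxes):
    """Build a binary mask: 1 inside any bounding box, 0 everywhere else."""
    mask = np.zeros(shape, dtype=np.uint8)
    for xmin, ymin, xmax, ymax in boxes:
        # Rows correspond to y and columns to x.
        mask[ymin:ymax, xmin:xmax] = 1
    return mask

mask = boxes_to_mask((512, 512), [(100, 120, 140, 150), (300, 310, 330, 350)])
print(mask.sum(), "of", mask.size, "pixels are labelled as anomaly")
```

The resulting mask over-labels the background pixels inside each box, which is exactly the kind of imperfect annotation the paragraph above refers to.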

Figure 6 – Plots showing the creation of annotated data for different computer vision tasks. Using the original tiles and their corresponding PASCAL VOC annotations, masks for positive and negative samples can be created in a straightforward manner. For image classification, patches centered on the center of each bounding box can be used to create positive samples, and once cropped out, negative samples can be generated by performing random sampling on the remaining tile. Any patch that has more than 5% zeros is removed as a potential negative sample.

2 – Computer Vision Tasks

In this section, I describe each of the computer vision tasks, the algorithms used, and the experimental setup used to train the deep learning models.

Computer vision is a sub-field of computer science that attempts to give computers the ability to see the world in a way that is similar to how humans do. This of course requires humans to understand how humans see the world, which, considering the complexity of our brains is not a trivial task. However, after some success in understanding how the mammalian visual cortex system works during an experiment in the late 1950’s, the field was finally provided with a rough blueprint for what needed to be done. The same over-confidence that plagued the field of artificial intelligence found its way to computer vision as well, making many believe that getting a computer to ‘see’ the world would be a trivial problem that could be solved over a single summer. Fifty years later and it was still extraordinarily difficult to have a computer reliably differentiate between pictures of cats and dogs, let alone having fully autonomous self-driving cars.

That was until 2012, when a team of researchers from the University of Toronto won the ImageNet challenge, a large-scale image classification competition with millions of images for thousands of different class categories. Not only did they win, they crushed the competition using a forgotten algorithm known as the convolutional neural network (CNN). The algorithm is a variation of an artificial neural network (ANN) that was first developed in the late 80’s, but gained relatively little attention due to the amount of data and computational resources needed to train it. Of course, it made a comeback after the gaming industry started pumping out powerful general-purpose GPUs affordable for the masses, along with large amounts of data collected from various domains.

Figure 7 – a diagram illustrating a generic CNN. Here you can see that the network is composed of two separate parts: a convolutional-base that is used as a feature detector and extractor, and a fully connected layer that takes those features and attempts to classify what the image is. Although there are many different types of CNNs, they all share these two basic qualities.

The CNN is a form of ANN that uses a hierarchy of cascading linear filters to extract increasingly complex features that are then passed to the attached ANN to perform classifications. There are an infinite number of architectures that can be constructed, but all CNNs share these same components: a convolutional-base, followed by a fully connected layer. CNNs themselves only perform image classification; however, when modified or combined with other algorithms, they can be used to perform tasks like object detection or semantic segmentation. For object detection, most algorithms have some mechanism for suggesting where the regions of interest are within an image, which are then passed to a CNN for classification. Because they are trained as a single end-to-end system, both components become better at their individual tasks simultaneously.

Figure 8 – A diagram illustrating the components that make up the Mask-RCNN architecture. The network uses some CNN as its ‘backbone’, which is used to provide feature maps and classifications for parts within the entire image. These feature maps contain highlighted regions that were detected by the linear filters in the backbone and are used with the region proposal network to provide precise coordinates for those objects highlighted in the image. These are then downsampled using a pooling method and passed to a fully connected network, which provides classifications and suggestions for the exact location of the bounding box for each object.

For semantic segmentation, a convolutional-base is used as an encoder that extracts increasingly complex features as usual, but instead of passing those features off to a fully connected layer for classification, they are passed to a decoder that resembles the convolutional-base but in reverse. At each layer of the decoder, input is provided as features from the previous layer, along with features from the corresponding layer in the encoder. The decoder is responsible for learning how to take these combined features and upsample them in such a way that leads to a pixel-wise mapping for the original image. Together, the encoder and decoder form an encoder-decoder network (structurally similar to an auto-encoder), a single end-to-end system that can be trained to perform semantic segmentation.

Figure 9 – a diagram illustrating a U-Net architecture for semantic segmentation tasks. The U-Net is just one style of architecture that performs pixel-wise classification using deep learning. It is composed of two branches: the encoder and the decoder. The encoder works identically to a CNN, but without a fully connected layer. Instead, it detects, extracts, and downsamples features as usual, but passes them on to the decoding branch, which as the name implies attempts to learn the correct way of upsampling these features into a 2D pixel-wise mapping. It does this with the aid of feature maps from the corresponding layers in the encoder.

2.1 – Image Classification

As mentioned before, a CNN is the core algorithm for many other computer vision tasks. Thus if a CNN can be trained to differentiate between positive and negative samples within this project, it can be assumed that one, there exists a set of features that uniquely represent the class categories and two, they are learnable by a deep learning model. The task of localization in object detection or pixel-level mapping in semantic segmentation falls on the region proposal component or the decoding branch, not the convolutional-base. If the results from object detection or semantic segmentation are less than satisfactory, that is entirely on those additional mechanisms, not the convolutional-base.

For image classification, I first trained a CNN on the aforementioned dataset and recorded the classification accuracy on the test portion of the dataset that was withheld from the training process. Next I used some visualization tools to examine how individual filters throughout the network are activated, providing insight into which features were learned. Lastly, I compared the test results from the CNN with the results obtained from the previous state-of-the-art method for image classification, which relies on handcrafted features.

For each of these computer vision tasks, I chose to use the ResNet-50 architecture to serve as the CNN. ResNet, or “Residual Neural Network”, is a family of architectures introduced in 2015 that helped deal with the “vanishing gradient” problem, an issue that arose when networks were built deeper in an effort to make them perform better. During back-propagation (the method used to adjust each of the parameters in the network), the gradient would become vanishingly small as it was passed back through more and more layers, so layers near the beginning of a deep network received almost no signal with which to update their weights. ResNet addressed this by adding skip connections to the network, which let the gradient bypass blocks of layers so that parameter values in all layers can be adjusted regardless of their location. There have been many network architectures that improved upon this design, but it still serves as a good network for a variety of tasks.
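A toy calculation makes the effect of skip connections concrete. Assume, purely for illustration, a chain of 50 scalar "layers" that each contribute a local gradient of 0.01. In a plain chain the gradient reaching the first layer is the product of the local gradients; with a skip connection each layer computes x + F(x), so its local gradient becomes 1 + 0.01 instead. The numbers and the function below are hypothetical and are not a model of ResNet-50 itself.

```python
def end_to_end_gradient(layer_gradient, depth, residual):
    """Gradient reaching the first layer of a toy chain of scalar layers."""
    # Plain layer: d(out)/d(in) = F'(x).  Residual layer: d(x + F(x))/dx = 1 + F'(x).
    per_layer = 1.0 + layer_gradient if residual else layer_gradient
    return per_layer ** depth

plain = end_to_end_gradient(0.01, depth=50, residual=False)
skip = end_to_end_gradient(0.01, depth=50, residual=True)
print(f"plain chain: {plain:.3e}, with skip connections: {skip:.3f}")
```

In the plain chain the gradient is effectively zero by the time it reaches the first layer, while the identity path keeps it at a usable magnitude.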

The CNN was fine-tuned on this dataset with pre-trained weights learned from ImageNet; this process is referred to as ‘transfer learning’ and has been shown to allow networks to converge more quickly than when trained from scratch. However, the pre-trained weights are for the ImageNet challenge and expect the network to have 1,000 nodes in its output layer; for this project we only need two (anomaly and normal), so we can remove the ‘top’ layer, and replace it with a layer with only two nodes. Lastly, I inserted a dropout layer between the convolutional base and the sigmoid activation function to reduce the potential for overfitting, which is likely to occur considering the size of the dataset and the size of the network.

The loss function used is binary cross-entropy, with accuracy, recall and precision as metrics. The optimizer used was Adam, with an initial learning rate of 10^-4 and a callback that reduces the learning rate by some factor during the training process if the validation loss does not decrease after some number of epochs.
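The learning-rate callback behaves roughly like the plain-Python sketch below, which is similar in spirit to Keras's `ReduceLROnPlateau`. The reduction factor, patience and minimum learning rate are placeholder values (they are not stated above), the validation losses are invented, and the real callback has additional options that this sketch ignores.

```python
def reduce_on_plateau(val_losses, lr=1e-4, factor=0.5, patience=3, min_lr=1e-7):
    """Return the learning rate used at each epoch for a validation-loss history."""
    best = float("inf")
    wait = 0
    schedule = []
    for loss in val_losses:
        schedule.append(lr)  # learning rate in effect during this epoch
        if loss < best:
            best, wait = loss, 0  # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:  # no improvement for `patience` epochs
                lr = max(lr * factor, min_lr)
                wait = 0
    return schedule

# Invented validation losses that stall for three epochs after epoch 2.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.59, 0.58]
schedule = reduce_on_plateau(losses)
print(schedule)
```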

To supply data to the model during training I used Keras’ built-in image data generator, which takes in a dataframe containing the location of images and their labels, and performs all the necessary pre-processing before combining them into batches to be fed directly to the model. In this context, pre-processing consists of normalizing, resizing, and augmenting images (to reduce overfitting). Augmenting was done with ImgAug, an open-source augmentation pipeline tool that allows users to choose various augmentations depending on their dataset, and plugs directly into Keras’ image data generator.

Training, validation, and testing data were split in such a way as to avoid the possibility of bias through data contamination. This was done by first splitting the tiles into training, validation and testing sets, and then obtaining positive and negative samples from each. Thus, no patch within the training set came from the same tile as a patch within the validation or testing set. When using deep learning models, it is important to have both a validation and a test set because when adjusting hyper-parameters, there is a possibility that the user might accidentally cause the model to overfit to the validation set. Having another hold-out set therefore provides insight into whether the model is able to generalize, or is just overfitting to the validation set. The training, validation and testing sets are equal to 70%, 10%, and 20% (respectively) of the original dataset, where each set contained an equal number of positive and negative samples.
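A minimal sketch of a tile-level split is given below. The helper name, the seed and the use of integer tile IDs are assumptions for illustration; the point is only that the shuffle and split happen on tiles, before any patches are extracted.

```python
import random

def split_tiles(tile_ids, fractions=(0.7, 0.1, 0.2), seed=0):
    """Shuffle tile IDs and split them into train / validation / test lists."""
    ids = sorted(tile_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_tiles(range(200))
print(len(train), len(val), len(test))

# Patches are extracted only after this split, so no tile feeds two sets.
```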

The model was trained for 100 epochs with a batch size of 16; with 3,800 samples, each epoch consisted of 238 steps. With each decrease of the validation loss, the weights of the model from that epoch were saved as an .h5 file with the use of a callback. In figure 10 you can see the history of the training process.

Figure 10 – The results logged from training the image classification model. Note that the validation accuracy is always higher than the training accuracy, indicating that the model is underfitting, meaning it could do better with less regularization (augmentation, dropout) or with more training. Overall, the training looks successful, but results on the testing data will be a better indication.

Once the model was fully trained I used visualization techniques that provide an inside look into what the linear filters within a fully-trained model respond to. The first technique takes in an image and outputs the activations (feature maps) for each layer requested, providing a look at how each filter within the selected layers responds to different parts of the same image. It also allows users to see how, throughout the network, different layers specialize in detecting different types of features. Below are some of the resulting feature maps from a few layers, all of which are responding to the same image. The diverging color scheme shows highly activated regions in red and non-activated regions in blue.

As you can see, in the beginning of the network many filters are activated, showing that they all play a role in detecting some features within the image. However, further into the network filters become more specialized and detect very few, specific features, or none at all.

Next I looked at what kind of image most strongly activates different filters within different parts of the network. To do this, I chose a filter within a layer and created an image through an iterative process that updated its pixel values using a method called ‘gradient ascent’, which maximizes the activation of the chosen filter (the quantity treated as the ‘loss’ in this procedure). By iteratively updating the image, we can get a view of what each filter will respond to the most, thus providing insight into the types of features it has learned to detect.
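The idea behind gradient ascent can be shown with a toy stand-in for the network: a single hand-picked 3 x 3 filter followed by a ReLU, with the gradient worked out by hand. This is only a sketch under those assumptions; in the actual experiment the gradient would come from the deep learning framework's automatic differentiation through the trained CNN, and the filter, image size, step size and number of steps here are all hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def response_and_gradient(image, kernel):
    """Mean ReLU response of one filter and its gradient w.r.t. every pixel."""
    kh, kw = kernel.shape
    windows = sliding_window_view(image, (kh, kw))   # all kh x kw neighborhoods
    conv = np.sum(windows * kernel, axis=(-2, -1))   # filter response per position
    active = (conv > 0).astype(image.dtype)          # where the ReLU passes signal
    response = float(np.mean(np.maximum(conv, 0.0)))
    grad = np.zeros_like(image)
    out_h, out_w = conv.shape
    for i in range(kh):
        for j in range(kw):
            # Pixel (r + i, c + j) contributes kernel[i, j] to position (r, c).
            grad[i:i + out_h, j:j + out_w] += kernel[i, j] * active
    return response, grad / conv.size

def gradient_ascent(kernel, size=32, steps=30, step_size=1.0, seed=0):
    """Start from noise and repeatedly nudge pixels to increase the response."""
    rng = np.random.default_rng(seed)
    image = rng.normal(scale=0.1, size=(size, size))
    history = []
    for _ in range(steps):
        response, grad = response_and_gradient(image, kernel)
        history.append(response)
        grad /= np.sqrt(np.mean(grad ** 2)) + 1e-8   # normalize the step
        image += step_size * grad                    # ascend, not descend
    return image, history

# A vertical-edge filter stands in for one learned filter.
vertical_edge = np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]])
image, history = gradient_ascent(vertical_edge)
print(f"response went from {history[0]:.4f} to {history[-1]:.4f}")
```

The final image is whatever pattern this filter "likes" best, which is the same reading applied to the panels produced from the trained network.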

Note that in the early layers of the network, the filters respond to low-level features such as lines with different orientations and angles, and even corners. As you move further through the network, you can see patterns and textures are detected, some of which clearly resemble the wave patterns found in backscatter mosaics. However, as you go even further through the network, the features detected are harder to interpret and seem to be very specific.

Figure 12 – multiple panels showing individual filters from specific layers in the CNN. Each panel represents a different layer in the network, and holds 25 different images. Each of these images is a visual representation of the maximum response from each of those 25 randomly sampled filters. Thus each image shows what the corresponding filter responds to the most within an image. It is assumed that the images containing only noise represent filters that did not learn any useful features during that training process.

What is interesting is that because the dataset is so small and lacks diversity (as compared to something like ImageNet, for example), some of the same features appear to be detected in different layers; this can be seen in the last two panels, which correspond to filters found in layers 7 and 10. This could indicate that these specific features are frequently found in images containing anomalies, or, because there are slight variations between the filters, that those found in later layers are building on top of those found in earlier layers.

Next, I wanted to compare the classification accuracy of the trained CNN with the results obtained from the previous state-of-the-art method. For this, I used Haralick features with the Visual Bag of Words method to create a codebook that was then used to obtain a feature descriptor for each image.

Haralick features are a popular feature descriptor that characterizes the texture of an image. In the context of computer vision, texture refers to the pattern that makes up an object, not the physical feeling of it. Haralick features are created for an image by first creating a grayscale copy of the image and then creating a co-occurrence matrix, which encodes the number of times a pixel with the value of i is adjacent to a pixel with a value of j. Adjacency is defined as top, bottom, left and right, so four co-occurrence matrices are computed, one for each of those relationships. Each matrix is then divided by its total number of occurrences, resulting in the probabilities for those adjacent relationships. From these matrices Haralick features are computed, which consist of 14 statistics that act as a feature descriptor for each image.
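To make the co-occurrence step concrete, here is a small NumPy sketch that builds the four normalized matrices for the adjacency relationships described above and computes three of the classic statistics. The tiny 4 x 4 image with four gray levels is a made-up example, the function names are hypothetical, and the remaining statistics are omitted for brevity.

```python
import numpy as np

# (row, col) offsets for the four adjacency relationships described in the text.
OFFSETS = {"right": (0, 1), "left": (0, -1), "down": (1, 0), "up": (-1, 0)}

def cooccurrence(image, offset, levels):
    """Normalized co-occurrence matrix P[i, j] for one (row, col) offset."""
    dr, dc = offset
    rows, cols = image.shape
    # Reference pixels whose neighbor at (r + dr, c + dc) is inside the image.
    r0, r1 = max(0, -dr), rows - max(0, dr)
    c0, c1 = max(0, -dc), cols - max(0, dc)
    ref = image[r0:r1, c0:c1]
    nbr = image[r0 + dr:r1 + dr, c0 + dc:c1 + dc]
    counts = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(counts, (ref.ravel(), nbr.ravel()), 1)  # count each (i, j) pair
    return counts / counts.sum()                      # counts -> probabilities

def texture_stats(p):
    """Three of the Haralick statistics computed from one normalized matrix."""
    i, j = np.indices(p.shape)
    return {
        "angular_second_moment": float(np.sum(p ** 2)),
        "contrast": float(np.sum(p * (i - j) ** 2)),
        "inverse_difference_moment": float(np.sum(p / (1.0 + (i - j) ** 2))),
    }

# Toy image with gray levels 0..3.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 2, 2, 2],
                  [2, 2, 3, 3]])

glcms = {name: cooccurrence(image, off, levels=4) for name, off in OFFSETS.items()}
stats = {name: texture_stats(p) for name, p in glcms.items()}
print(stats["right"])
```

For real 8-bit imagery the gray levels are often quantized to fewer bins first so the matrices stay small and well populated.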

Figure 13 – a diagram illustrating the Haralick features algorithm. An image is first converted to grayscale, and then each pixel-value relationship is recorded within a co-occurrence matrix; for each adjacent relationship (e.g. top, bottom, left and right), a matrix is created in an attempt to account for rotational variance. These matrices are then normalized, giving the probability of each pixel adjacency relationship. From these, Haralick features are created, which consist of 14 different statistics to represent the entire image. Some variants include 21 statistics.

These features are computed for each image within the training and testing sets; however, only the features from the training set are used to compute the codebook for the Visual Bag of Words algorithm. This algorithm is based on Bag of Words from Natural Language Processing; it creates a representation (feature descriptor) for each image based on the number of specific features it contains. These features (in this case, Haralick features) are found, extracted, and described from each image within the training dataset and used collectively to create a dictionary from which novel images of the same class category can be described via another feature descriptor. This is done with the use of the K-Means clustering algorithm: using some value of K, clusters are formed around the features themselves and their locations in feature space are recorded. With the locations of these cluster centers, we can create another feature descriptor for each image we provide.

Thus, the feature descriptor for each image in the training data set is created using the original Haralick features and the newly created dictionary. For each image, a new feature descriptor in the form of a histogram representing the occurrences of features in each image is created. The histogram is normalized and saved to another dataframe. The same thing is done for the test set: features are located, extracted and described, and then used with the same dictionary to create a histogram for each image. It should be noted that there is no validation set within this experiment, thus the training and testing sets are equal to 80% and 20% (respectively) of the original dataset.
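The codebook and histogram steps can be sketched as follows. Several things here are assumptions rather than details from the project: each image is assumed to yield a set of local descriptors of length 14 (for example, Haralick statistics computed on sub-windows), the descriptors are random stand-ins, K is set to 8, and a tiny NumPy K-Means replaces the Scikit-Learn implementation so the example has no extra dependencies.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Very small K-Means (Lloyd's algorithm); returns the k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = points[labels == c]
            if len(members):  # leave a center unchanged if its cluster is empty
                centers[c] = members.mean(axis=0)
    return centers

def bovw_histogram(descriptors, centers):
    """Normalized histogram of nearest 'visual words' for one image."""
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Hypothetical data: 10 training images and 1 test image, 30 descriptors each.
rng = np.random.default_rng(1)
train_descriptors = [rng.normal(size=(30, 14)) for _ in range(10)]
test_descriptors = rng.normal(size=(30, 14))

# The codebook is built from training descriptors only, then reused for test data.
codebook = kmeans(np.vstack(train_descriptors), k=8)
train_hists = np.array([bovw_histogram(d, codebook) for d in train_descriptors])
test_hist = bovw_histogram(test_descriptors, codebook)
```

The histograms, not the raw descriptors, are what would be handed to a downstream classifier.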

These final feature descriptors and their corresponding labels were provided to four different machine learning algorithms: K-Nearest Neighbor (KNN), Random Forest (RF), Support Vector Machine (SVM), and an ANN, all of which used default parameter values.

2.2 – Object Detection

In computer vision, object detection refers to first detecting the presence of objects of interest within an image, and then classifying them. When using deep learning algorithms, there are typically two components: a region proposal algorithm followed by a convolutional neural network. Thus, much of the improvement in deep learning object detection architectures has come from finding more efficient ways to train the region proposal system in tandem with the attached CNN.

For this computer vision task, I used the well-known Matterport implementation of Mask-RCNN that was developed with Keras. Mask-RCNN (the “RCNN” stands for “Region-based CNN”) is a deep learning algorithm capable of performing object detection and instance segmentation, building off its predecessors RCNN, Fast-RCNN, and Faster-RCNN. Unlike RCNN, Faster-RCNN and Mask-RCNN do not just search the image for regions of interest, as it was found that this method is orders of magnitude slower than competing methods; instead they use the locations and dimensions of the ground truth bounding boxes during training to learn which regions to propose.

As mentioned before, the images and their annotations are already in the correct format (PASCAL VOC), so there was no need for any additional pre-processing. Images and their labels were split into training, validation and testing sets in the same way as described in the image classification section (70%, 10%, and 20%, respectively). However here I used the whole tiles, not patches from the tiles, meaning that the entire dataset consists of 200 images, an embarrassingly small number of images. Luckily, there were plenty of positive samples in most of the images, leading to "adequate" results as documented in the next section.

Images were pre-processed based on the specifications required when using pre-trained weights, which involved normalizing each image by subtracting from it the global mean pixel value (per channel). Because the images are grayscale, each one is replicated across three channels to emulate an RGB image. Images were then augmented using ImgAug, where only affine transformations (flips, flops, rotations, etc.) were applied.
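As a rough sketch of that pre-processing step, the snippet below replicates a grayscale tile across three channels and subtracts a per-channel mean. The function name and the mean values are placeholders chosen for illustration; the actual values depend on the pre-trained weights being used.

```python
import numpy as np

def preprocess_grayscale(image, mean_pixel):
    """Stack a grayscale image into three channels and subtract a per-channel mean."""
    # (H, W) -> (H, W, 3): the same grayscale values in every channel.
    rgb_like = np.stack([image, image, image], axis=-1).astype(np.float32)

    # Broadcasting subtracts one mean value from each channel.
    return rgb_like - np.asarray(mean_pixel, dtype=np.float32)

# Hypothetical 2x2 tile and hypothetical per-channel means.
tile = np.full((2, 2), 120, dtype=np.uint8)
processed = preprocess_grayscale(tile, mean_pixel=(100.0, 110.0, 120.0))
```

Casting to float before subtracting matters here, since subtracting from unsigned 8-bit values would otherwise wrap around.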

The encoder or ‘backbone’ used was the ResNet-50 CNN, with pre-trained weights from the MS-COCO dataset used to initialize the model. These weights allow for transfer learning across the entire network, not just the CNN, so training on this dataset starts with fine-tuning both the encoder and the region proposal system.

The loss functions used included the RPN class and RPN bounding box losses, which are both specific to the region proposal component of the algorithm, and the class, bounding box and mask losses that serve the entire network. The metric used is mean Average Precision (mAP). The optimizer used was stochastic gradient descent (SGD), with a learning rate of 10^-2, momentum of 0.9, and a weight decay of 10^-5 (a form of regularization).

The model was trained for 100 epochs, using a batch size of one due to the size of each image and the amount of GPU memory available. Below are the results from the training recorded through TensorBoard.

Figure 14 – The results from training the Mask-RCNN, collected through TensorBoard. The top diagram shows the training loss for each epoch, and the bottom shows the validation loss for each epoch. Note the curve of the validation loss compared to the training loss, indicating some overfitting had occurred.

2.3 – Semantic Segmentation

In computer vision, semantic segmentation refers to providing a class category label for each individual pixel that comprises the image. Before deep learning methods, semantic segmentation was often done with the use of segmentation or ‘over-segmentation’ algorithms, which involved first segmenting the image into pieces by grouping pixels together based on similarity. Each of those segments would then be described using some feature descriptor, which was used to classify which class category that piece belonged to. Another method involved scanning the entire image through a roving-window operation and using all of the pixels within a local neighborhood as a patch to classify just the center-most pixel.

The first deep learning method used Fully-Convolutional Neural Networks (FCNNs), which keep the convolutional base but replace the fully-connected layer with a 2D output layer that matches the dimensions of the original image. This method works, but is less precise due to the jump from high-level features encoded in a single 1-dimensional vector to a 2-dimensional image. Over time, other methods made that jump more manageable by having the network gradually learn the upsampling required to convert the final feature vector into a pixel-wise mapping. These methods are assisted by combining the corresponding feature maps from the convolutional base with the upsampled features obtained by performing a reversal of the corresponding linear filter (i.e. transpose convolution). Although no longer state-of-the-art, the U-Net architecture has made a lasting impression and is often used due to its simplicity and accuracy.

For semantic segmentation, the code I used for constructing the model came from qubvel’s GitHub repository, which includes a phenomenal API for developing different semantic segmentation models in Keras. I used a ResNet-50 as the encoder branch and supplied it with ImageNet weights; however, there were no existing weights for the decoder, so it was trained from scratch.

As described in the previous sections, for a deep learning model to learn, it requires labelled data to learn from. For image classification, this requires a single label for each image, and for object detection, bounding box locations describing where objects are within each image. For semantic segmentation, every pixel in the image needs to be provided with a class category label, which can be a very expensive and time-consuming task for the annotator. Fortunately, there are already bounding box annotations for each anomaly from the object detection task, so one could rather easily repurpose these to generate pixel-wise mappings for each anomaly. For each image, we simply create an empty image composed of zeros marking the negative samples, and wherever a bounding box exists, we provide those pixels with a value of one. Typically, masks should be created as precisely as possible, but as mentioned previously, there are a number of papers that demonstrate a model can still perform well, even with less than perfect annotations to learn from.
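Converting bounding boxes into a mask can be sketched as follows. This is a minimal illustration rather than the project's code: it assumes boxes are given as (xmin, ymin, xmax, ymax) in pixel coordinates with the maximum treated as exclusive, and the function name `boxes_to_mask` is hypothetical.

```python
import numpy as np

def boxes_to_mask(height, width, boxes):
    """Build a binary mask: 0 for background, 1 inside any bounding box."""
    # Start with an empty image of zeros (all negative samples).
    mask = np.zeros((height, width), dtype=np.uint8)

    for xmin, ymin, xmax, ymax in boxes:
        # Clip each box to the image bounds before filling it in.
        x0, y0 = max(0, xmin), max(0, ymin)
        x1, y1 = min(width, xmax), min(height, ymax)
        mask[y0:y1, x0:x1] = 1

    return mask

# Toy example: a 4x5 image with one box that fits and one that spills over the edge.
mask = boxes_to_mask(4, 5, [(1, 1, 3, 3), (4, 2, 9, 9)])
```

Overlapping boxes simply write the same value twice, so no special handling is needed for anomalies that sit close together.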

As with image classification, images and their corresponding labels are fed directly to the model during training using a data generator. This takes in batches of image and label files, opens, pre-processes and augments them before handing them off to the model. Pre-processing in this context is the same process used when training the encoder on the ImageNet dataset since our model will be initialized with those same weights. Augmentation is done with ImgAug, which only applies affine transformations.

The loss is calculated using a weighted binary cross-entropy function, which attempts to lessen the negative effects caused by having an imbalance between positive and negative samples. Metrics include pixel accuracy, precision, recall and intersection over union (i.e. Jaccard Index). The optimizer used is Adam, with an initial learning rate of 10^-3 that reduces by some factor after some number of epochs in which the validation loss does not decrease.
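To make the loss and the Jaccard Index concrete, here is a small NumPy sketch of both. It assumes the weighting is applied as a single multiplier on the positive-class term, which is one common choice but not necessarily the one used in this project; the function names, the `pos_weight` parameter and the toy values are all hypothetical.

```python
import numpy as np

def weighted_bce(y_true, y_pred, pos_weight=1.0, eps=1e-7):
    """Binary cross-entropy with an extra weight on the positive class."""
    y_true = np.asarray(y_true, dtype=np.float64)
    # Clip predictions so that log(0) never occurs.
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1.0 - eps)

    positive_term = pos_weight * y_true * np.log(y_pred)
    negative_term = (1.0 - y_true) * np.log(1.0 - y_pred)
    return float(-(positive_term + negative_term).mean())

def iou(y_true, y_pred, threshold=0.5):
    """Intersection over union between a binary mask and thresholded predictions."""
    truth = np.asarray(y_true) > 0.5
    predicted = np.asarray(y_pred) >= threshold

    intersection = np.logical_and(truth, predicted).sum()
    union = np.logical_or(truth, predicted).sum()
    # Two empty masks are treated as a perfect match (a convention, not a rule).
    return float(intersection / union) if union > 0 else 1.0

# Toy example: one positive pixel out of four, with uninformative predictions.
labels = np.array([1, 0, 0, 0])
uniform = np.array([0.5, 0.5, 0.5, 0.5])
plain_loss = weighted_bce(labels, uniform, pos_weight=1.0)
heavier_loss = weighted_bce(labels, uniform, pos_weight=3.0)

score = iou(np.array([1, 1, 0, 0]), np.array([0.9, 0.2, 0.8, 0.1]))
```

With a weight above one, mistakes on the rare anomaly pixels cost more than mistakes on the background, which is the imbalance correction described above.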

The model was run for 20 epochs, using a batch size of one, again due to the size of the images and the amount of GPU memory available. Below are some figures highlighting the progress of training.

Figure 15 – The results from training the U-Net architecture. The model was trained for 100 epochs and as you can see, it converged around the 25th. The gap between the training and validation loss implies that the model overfit, but not drastically, as you can see the distance between the two at each epoch stays consistent.

3 – Results

For the image classification task, both the deep learning and handcrafted methods were tested on 3,800 images, composed of positive and negative samples in near equal amounts. The deep learning method used a ResNet-50 architecture fine-tuned with pre-trained weights from ImageNet, and achieved a classification accuracy of 97%; below is the confusion matrix displaying the predictions compared to the ground-truth.

The handcrafted method used Haralick features with the Visual Bag of Words algorithm to create a feature descriptor for each image. Using multiple machine learning algorithms (KNN, RF, SVM and ANN), we can compare each of their results against one another, as well as with the CNN. Below are the confusion matrices for each machine learning algorithm.

You can see that the RF and ANN did the best in terms of classification accuracy using handcrafted features, followed by KNN. The SVM did the worst, predicting every sample as an anomaly, which is likely due to the kernel used.

The Mask-RCNN used the original tiles, not patches; thus the amount of data available is quite small (200 images total) and did not include a test set. Instead, for evaluation I chose to show the mean Average Precision for both the training and validation sets. Below are some side-by-side comparisons between the original tile and its annotations (left) against what the model predicted (right). Note that all images come from the validation set.

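Figure 16 – Three side-by-side comparisons of validation tiles with their annotations (left) and the Mask-RCNN predictions (right).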

The mAP for both sets is rather abysmal, which is likely due to a number of factors including the number of images in the dataset, the quality of the annotations, less than ideal values for different hyper-parameters, and even the network itself. Based on the side-by-side comparison between the images and predictions, one can see it does get some of the anomalies correct, but misses many.

Next is the semantic segmentation algorithm, which is a U-Net architecture using a ResNet-50 pre-trained on ImageNet as the encoder. Unfortunately, because the masks that were used as training labels for each image are not pixel perfect, trying to evaluate the model’s performance with them would lead to misleading results. Instead, I provide a few images at random to be evaluated by the user qualitatively. The image on the left in each panel is the original image with bounding box annotations in green superimposed, in the middle are the model’s predictions, and on the right are all the predictions that exceed a specific confidence level, highlighted equally. By having a confidence threshold, users can adjust how sensitive they want the predictions to be depending on the situation.
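The confidence threshold amounts to a single comparison against the model's per-pixel output. The sketch below assumes that output is a map of probabilities between 0 and 1; the function name and the example values are made up for illustration.

```python
import numpy as np

def threshold_predictions(probabilities, confidence=0.5):
    """Mark every pixel whose predicted probability meets the confidence level."""
    return (np.asarray(probabilities) >= confidence).astype(np.uint8)

# Hypothetical 2x2 probability map, thresholded at two sensitivity levels.
probabilities = np.array([[0.1, 0.6],
                          [0.8, 0.4]])
sensitive = threshold_predictions(probabilities, confidence=0.5)
strict = threshold_predictions(probabilities, confidence=0.7)
```

Lowering the threshold flags more pixels as possible anomalies at the cost of more false alarms, while raising it keeps only the regions the model is most sure about.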

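Figure 17 – Three example panels showing the original image with bounding box annotations in green (left), the U-Net predictions (middle), and the predictions exceeding the confidence threshold (right).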

4 – Conclusion

This project looks at using different computer vision algorithms to classify anomalies found in backscatter mosaic imagery. I attempt to solve the problem by looking at it as three different tasks: image classification, object detection and semantic segmentation. From the results, it appears the CNN trained as an image classifier does by far the best at differentiating between the two classes, but in its current state, it lacks the ability to localize the anomaly within the image. However, when equipped with a region proposal system, the Mask-RCNN performs very poorly at identifying all of the anomalies within the image. Of the predictions it does make, it appears to be correct some of the time, but the only metric considered indicates otherwise. When approached as a semantic segmentation task, the deep learning model performs much better than the Mask-RCNN at identifying regions within each image that are likely to contain anomalies. However, without real ground-truth segmentation maps for comparison, no quantitative results could be compiled.

The project was merely meant to serve as an exercise in how computer vision tasks, specifically those that rely on deep learning, could be used to tackle a problem that exists within the field of ocean mapping. As any deep learning practitioner would attest, the number of images, the quality of annotations and the time spent fine-tuning hyper-parameters are fundamental components of any project. The dataset provided is quite small (even as a toy dataset), annotations were done manually in a passive style and very little time was devoted to hyper-parameter tuning. However, overall this project does present techniques that some researchers might find useful, whether they work within this domain or another. The ocean is a massive and mostly unexplored part of our planet; without adopting the proper tools, it is likely to stay that way.